Prospects often ask us about code snippet scanning. When they do, we try to dig into why they think they need snippet matching, because we believe much of the demand is misinformed. To open up a wider conversation on the topic, I'm sharing my perspective here so you have a complete picture of both the risk and the cost of code snippet scanning.
Prospect Question: Is there an inexpensive option for code snippet scans of source code that we could use in conjunction with Component Lifecycle Management?
I believe people think they need snippet matching for two reasons. First, copying snippets really was common in C/C++, and people assume it happens just as often in modern languages (it doesn't). Second, vendors have been successful in raising awareness of the problem, because that's what they are good at. It's like going to a surgeon and asking him what to do. Of course he's going to say you need the operation. That's what he does.
While it is true that developers could copy code around, in a component-based language like Java (and every language since) the reality is they don't. I can't recall any well-known, high-profile lawsuits involving snippets. The ones I know of involved wholesale reuse of components, frameworks, or operating systems. As an example, in 2013 Fantec was taken to court because the firmware of its media player included the iptables software, which is licensed under the GPLv2. This wasn't source code cut and paste: they included the entire iptables application in the Linux-based firmware.
In Mark Radcliffe's list of the "Top Ten FOSS Legal Developments" of 2012, Item 2 states, "A separate but related case also involved the Android operating system. Oracle sued Google for the alleged infringement of Oracle’s copyrights in the Java software (which it had acquired from Sun Microsystems, Inc.)" ... "However, at the end of May, Judge Alsup issued a decision finding that the Java APIs were not protectable under copyright law."
Continuing with the case analysis, Radcliffe states in Item 3, "The case involved the copying of the scripts and certain functions of the SAS analytical software." ... "The court found that such functions and programming language were not protected under the EU Directive on Protection of Computer Programs."
These examples frame the risk-versus-reward question: do I need snippet matching if people aren't really copying snippets and it hasn't yet been proven to be a real-world risk?
In addition to the real-world risk assessment, you need to consider the cost of actually performing detailed, line-by-line analysis of source. This level of analysis is expensive in both time and compute resources, and it generally produces indeterminate results that require human review. The end result is that it can't be done fast enough to be fully integrated into the development lifecycle, and it's not precise enough to automate actions against.
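To make the cost argument concrete, here is a minimal sketch of how line-window snippet matching might work. It is my own illustration, not how any particular scanning product is implemented: the function names, the three-line window, and the idea of a prebuilt `known_index` of fingerprints are all assumptions.

```python
import hashlib


def normalize(line):
    # Collapse whitespace so reformatting alone does not hide a match.
    return " ".join(line.split())


def fingerprints(source, window=3):
    """Hash every run of `window` consecutive non-blank lines."""
    lines = [normalize(l) for l in source.splitlines() if l.strip()]
    prints = set()
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        prints.add(hashlib.sha1(chunk.encode("utf-8")).hexdigest())
    return prints


def candidate_matches(source, known_index, window=3):
    """Return fingerprints shared with a (hypothetical) index of known code."""
    return fingerprints(source, window) & known_index
```

Even in this toy form, every window of every file has to be hashed and looked up, and the index would need to cover every open source project you care about. A hit only says that a few lines look alike. A plain loop over an array will match in thousands of unrelated projects, which is why someone still has to review the results by hand.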
To be clear, I'm not condoning copyright infringement. Stealing someone else's work is wrong, plain and simple. Nor am I saying that scanning source doesn't have its place. OpenLogic and Palamida have built a solid following for a reason. Dave McLoughlin from OpenLogic lays out a good case for scanning in his presentation "Understanding the Value of Scanning for Open Source Software".
If you absolutely, positively must ensure the provenance of every single line of code and can put the resources behind it, go for it, but do it with an awareness of the real-world risk and costs. However, if you need a place to get started and want something that covers the cases more likely to happen in the real world, snippet scanning probably isn't the highest priority.
Ultimately it's like buying insurance. You need to assess how likely a given risk is and how much you have to lose. Just don't expect the agent to provide realistic answers for you. Only you or your organization can make that ROI assessment. I just ask that you do it with a complete picture of both the risk and the cost.
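If it helps, the insurance comparison can be written down as simple expected-loss arithmetic. This is only a sketch under assumptions I am making up for illustration: the function names, the `risk_reduction` factor, and every number in the usage note are hypothetical, and you would substitute your own estimates.

```python
def expected_annual_loss(probability, impact):
    # Chance of an incident in a year times what it would cost you.
    return probability * impact


def scanning_pays_off(probability, impact, risk_reduction, annual_cost):
    # risk_reduction: assumed fraction of the expected loss that scanning
    # would prevent (0.0 to 1.0). annual_cost: tooling plus review time.
    avoided = expected_annual_loss(probability, impact) * risk_reduction
    return avoided > annual_cost
```

With made-up inputs of a 1% yearly chance of a snippet-related claim, a 500,000 loss, an 80% reduction from scanning, and a 50,000 yearly cost, `scanning_pays_off(0.01, 500_000, 0.8, 50_000)` returns `False`. Raise the probability to 50% and it returns `True`. The point is not the numbers but that the answer depends entirely on the likelihood and impact you plug in.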