Code Snippet Scanning: Is it Really Needed Anymore?


April 4, 2014 By Brian Fox

Prospects often ask us about code snippet scanning. We typically try to dig into why the prospect thinks they need snippet matching, and we think the demand is often misinformed. To start a broader conversation on this topic, I’m sharing my perspective so you have a complete picture of the risk and cost of code snippet scanning.

Prospect Question: Is there an inexpensive option for code snippet scans of source code that we could use in conjunction with Component Lifecycle Management?

I believe people think they need snippet matching because copying snippets was common in C/C++, and they assume it happens just as frequently in modern languages (untrue). Vendors have also been successful in raising awareness of this problem, because that’s what they are good at. It’s like going to a surgeon and asking him what to do. Of course he’s going to say you need the operation. That’s what he does.

While it is true that developers could copy code around, in a component-based language like Java (and every language since) the reality is they don’t. I can’t recall any well-known, high-profile lawsuits involving snippets. The ones I know of involved wholesale reuse of components, frameworks, or operating systems. As an example, in 2013 Fantec was taken to court because the firmware of its media player included the iptables software, which is licensed under the GPLv2. This wasn’t source code cut and paste: the entire iptables application was included in the Linux-based firmware.

In Mark Radcliffe’s list of the “Top Ten FOSS Legal Developments” of 2012, Item 2 states, “A separate but related case also involved the Android operating system. Oracle sued Google for the alleged infringement of Oracle’s copyrights in the Java software (which it had acquired from Sun Microsystems, Inc.)” … “However, at the end of May, Judge Alsup issued a decision finding that the Java APIs were not protectable under copyright law.”

Continuing with case analysis, Radcliffe states in Item 3, “The case involved the copying of the scripts and certain functions of the SAS analytical software.” … “The court found that such functions and programming language were not protected under the EU Directive on Protection of Computer Programs.”

These examples frame the risk vs. reward calculation: do I need snippet matching if people aren’t really doing it and/or it’s not yet proven to be a real-world risk?

In addition to the real-world risk assessment, you need to consider the cost of actually performing detailed, line-by-line analysis of source. This level of analysis is not only expensive in terms of time and compute resources, it also generally leads to indeterminate results that require human analysis. The end result is that it can’t be done fast enough to be fully integrated into the development lifecycle, and it’s not precise enough to drive automated actions.
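To make the cost and precision problem concrete, here is a minimal sketch of one naive way snippet matching could work: hash every window of k consecutive lines and compare the hashes against a reference. This is an illustration only, not how any particular scanner works. The function names, the whitespace-only normalization, and the window size of 3 are all assumptions made for the example.

```python
import hashlib
import re


def normalize(line):
    """Drop all whitespace and lowercase, so trivial reformatting does not hide a match."""
    return re.sub(r"\s+", "", line).lower()


def shingles(source, k=3):
    """Return hashes of every window of k consecutive non-blank normalized lines."""
    lines = [normalize(raw) for raw in source.splitlines()]
    lines = [line for line in lines if line]
    return {
        hashlib.sha1("\n".join(lines[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(lines) - k + 1)
    }


def overlap(candidate, reference, k=3):
    """Fraction of the candidate's k-line windows that also occur in the reference."""
    cand = shingles(candidate, k)
    if not cand:
        return 0.0
    return len(cand & shingles(reference, k)) / len(cand)


# Hypothetical five-line "open source" snippet used as the reference.
reference = "int a = 1;\nint b = 2;\nint c = a + b;\nprint(c);\nreturn c;"

# Same code with different spacing and an extra blank line: still a full match.
reformatted = "int a=1;\n\n  int b = 2;\nint c = a+b;\nprint( c );\nreturn c;"

# Same code with one variable renamed: every window changes, so nothing matches.
renamed = reference.replace("c", "total")

print(overlap(reformatted, reference))
print(overlap(renamed, reference))
```

Even in this toy version, every file produces roughly one hash per line, each of which has to be looked up against everything you want to compare with, and a single renamed variable is enough to turn a full match into no match at all. Anything in between is a partial score that a person still has to interpret.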

To be clear, I’m not condoning copyright infringement. Stealing someone else’s work is wrong, plain and simple. Nor am I saying that scanning source doesn’t have its place. OpenLogic and Palamida have built a solid following for a reason. Dave McLoughlin from OpenLogic lays out a good case for scanning in his presentation “Understanding the Value of Scanning for Open Source Software”.

If you absolutely, positively must ensure the provenance of every single line of code and can put the resources behind it, go for it, but do it with an awareness of the real-world risk and costs. However, if you need a place to get started and want something that covers the cases that are more likely to happen in the real world, perhaps snippet scanning isn’t the highest priority.

Ultimately it’s like buying insurance. You need to assess how likely a given risk is and how much you have to lose. Just don’t expect the agent to provide realistic answers for you. Only you or your organization can make that ROI assessment. I just ask that you do it with a complete picture of both the risk and the cost.

  • David Grierson

    The reason that companies ask about snippet scanning is that they are worried about the potential for having to open their code base to end users. All it takes is for a snippet of code to be copy-and-pasted from a GPL’ed piece of code and suddenly you’re looking at a risk of significant cost and/or reputational loss.

    If you need to guarantee that this is absolutely not the case (for example in a consumer device such as a set-top box) then you’re going to look for more analysis than just a BoM-based scan.

    Both of the examples which you cite (Google/Oracle & SAS) are concerned with “copying” APIs and so don’t necessarily fall foul of copyright issues. Performing a direct copy-and-paste of some code from (for example) the Linux kernel (which is GPL’ed) does become an issue because the use of the copied code means that the pasted target becomes a derivative work. The copyleft action of the GPL then kicks in and you are required to release the source code of the derivative work to its end users.

    Okay – the likelihood of “being caught” in this case may be small but companies don’t necessarily want to have to justify their release/non-release in a courtroom and would often prefer to have that guarantee up-front.

    As you say, it’s like insurance … it’s also kind of like having confirmation as to every component used in your product. Some companies (such as Red Hat) indemnify the users of their product (RHEL) against any patent or license claims upon their products. Code snippet checking is validating that your developers are doing their jobs right and not cutting legal corners.

    Obviously this is all IMHO, IANAL and TINLA.