Protecting software developers from malware with AI/ML insights

April 20, 2023 By Mandeep Singh

In my last post I talked about solutions to address malware and the increase in attacks. Today I’ll dig into what’s necessary to find and avoid malware.


Organizations are shifting their focus from how a data breach or cyberattack could hurt customers to how it can directly hurt the company being breached, for example through ransomware. The 2023 Harvard Business Review trends suggest that the burden of cyberattacks is moving up from customers to the organizations creating the software. In other words, software developers are under attack.

In light of this danger, some organizations have taken extreme measures, even barring developers from using the latest component versions. The assumption is that the longer malware is posted publicly, the more likely it will be discovered and taken down.

Unfortunately, that's a fallacy. While you can pin your direct dependencies to older versions, you can't really control “transitive” dependencies. This means malicious components can still enter your environment through the dependencies of your direct dependencies.
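To make the transitive dependency point concrete, here is a minimal sketch in Python. The dependency graph and every package name in it are hypothetical and chosen only for illustration; real package managers resolve versions in far more detail. The sketch walks the graph from an application and reports the packages that arrive only through other dependencies, which are the ones a pin on direct dependencies does not cover.

```python
from collections import deque

# Hypothetical dependency graph: package -> packages it depends on.
DEPENDENCY_GRAPH = {
    "my-app": ["web-framework", "json-tools"],
    "web-framework": ["http-client", "template-engine"],
    "json-tools": [],
    "http-client": ["url-parser"],
    "template-engine": [],
    "url-parser": [],
}


def transitive_dependencies(root, graph):
    """Return packages reachable from root that are not direct dependencies of root."""
    direct = set(graph.get(root, []))
    seen = set()
    queue = deque(direct)
    while queue:
        package = queue.popleft()
        if package in seen:
            continue
        seen.add(package)
        # Follow this package's own dependencies as well.
        queue.extend(graph.get(package, []))
    seen.discard(root)  # guard against a cycle back to the application itself
    return seen - direct


indirect = transitive_dependencies("my-app", DEPENDENCY_GRAPH)
print(sorted(indirect))
```

In this toy graph the application declares two dependencies but ends up with three more that it never named, and any one of them could carry a malicious release.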

Development teams need protection

Not analyzing software components as they enter your development environment leaves you exposed, and it only takes one compromised component to create a breach. Research puts the average cost of recovering from a serious breach at $4.4 million.

Unfortunately, many security analysis tools use a mostly manual process, as they do not yet have a machine learning-based toolset.

To stay safe and block malware, organizations need immediate access to the best information. They need deep analysis and comprehensive malware data to train malware prediction tools.

Seeking malware patterns

To find problems, models need to dig deeper and analyze the “bytecode” of what's in a package. At this level, it’s possible to examine the details of any given dependency and look for patterns associated with malware. Some patterns are tied to specific malicious attacks and even to specific bad actors.

Packages with a key pattern are identified as malware.

Some bad actors have very specific or similar signatures. Identifying these signatures and other patterns is a key signal when trying to understand whether a given component is malicious. Unfortunately, this is a game of cat-and-mouse, because a bad actor will sometimes add these signatures to otherwise benign packages to confuse anyone trying to track them.

It's a difficult process that’s often convoluted and misleading. So not every attack signature means dangerous software and not every malware developer can be identified.
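As a rough illustration of pattern matching at the bytecode level, the sketch below uses Python's standard `dis` module to list which names a piece of code loads, without ever running it. The list of suspicious names is a hypothetical stand-in for real attack signatures, and this is not a description of how Sonatype's analysis works. As noted above, a hit is only a signal, not proof of malice.

```python
import dis
import types

# Hypothetical names treated as weak malware signals in this sketch.
SUSPICIOUS_NAMES = {"exec", "eval", "b64decode"}

# Bytecode operations that load a name or attribute (varies by Python version).
NAME_OPS = {"LOAD_NAME", "LOAD_GLOBAL", "LOAD_ATTR", "LOAD_METHOD"}


def suspicious_names_in(code):
    """Collect suspicious names referenced by a code object and its nested code."""
    found = set()
    for instruction in dis.get_instructions(code):
        if instruction.opname in NAME_OPS and instruction.argval in SUSPICIOUS_NAMES:
            found.add(instruction.argval)
    # Functions and classes are stored as nested code objects in the constants.
    for constant in code.co_consts:
        if isinstance(constant, types.CodeType):
            found |= suspicious_names_in(constant)
    return found


def scan_source(source):
    """Compile source to bytecode (without executing it) and scan the result."""
    return suspicious_names_in(compile(source, "<package>", "exec"))


sample = "import base64\nexec(base64.b64decode(payload))"
print(sorted(scan_source(sample)))
```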

Another possible indicator is a similarity in package naming and composition. If a relatively unpopular package appears to mimic a popular package's name or code, its author may be trying to borrow that package's legitimacy. When users download it, the malicious code often comes along like the classic Trojan horse: a gift with dangerous contents.
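A simple way to picture the naming check is a string-similarity comparison against a list of well-known packages. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The popular-package list and the 0.8 threshold are assumptions made for this example, and a production system would weigh many more factors than the name alone.

```python
from difflib import SequenceMatcher

# Hypothetical list of popular package names to protect.
POPULAR_PACKAGES = ["requests", "numpy", "lodash"]


def lookalike_of(candidate, popular=POPULAR_PACKAGES, threshold=0.8):
    """Return the popular name that candidate closely resembles, or None."""
    name = candidate.lower()
    if name in popular:
        return None  # an exact match is the real package, not a look-alike
    for known in popular:
        similarity = SequenceMatcher(None, name, known).ratio()
        if similarity >= threshold:
            return known
    return None


for package in ["reqeusts", "requests", "flask"]:
    print(package, "->", lookalike_of(package))
```

A name that swaps two letters of a popular package is flagged, while the genuine package and unrelated names pass through.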

Artificial intelligence / Machine learning (AI/ML) malware insights

Natural language processing (NLP) models used for this deeper analysis can identify a variety of crucial patterns. Insights are gained by looking across multiple models that cover a range of indicators. Sometimes it’s a few reliable and strong patterns, while other times it’s the presence of many weaker flags.

These indicators could be standard attack signatures, patches with known-malicious packages, or look-alike packages (a wolf disguised as a sheep).
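The idea that one strong pattern or several weaker flags can each tip the balance can be sketched as a simple point system. This is a rule-based stand-in for illustration only, not the trained models described in this post; the signal names, weights, and threshold are all hypothetical.

```python
# Hypothetical signals and weights (in points); one strong signal is enough on its own.
SIGNAL_WEIGHTS = {
    "known_attack_signature": 10,
    "lookalike_name": 4,
    "install_script_network_call": 4,
    "obfuscated_code": 3,
    "new_maintainer": 2,
}

# Hypothetical cut-off: a package at or above this score is flagged.
FLAG_THRESHOLD = 10


def suspicion_score(signals):
    """Sum the weights of the distinct signals observed for a package."""
    return sum(SIGNAL_WEIGHTS.get(signal, 0) for signal in set(signals))


def is_suspicious(signals):
    """Flag a package when its combined signal weight reaches the threshold."""
    return suspicion_score(signals) >= FLAG_THRESHOLD


print(is_suspicious(["known_attack_signature"]))
print(is_suspicious(["lookalike_name"]))
print(is_suspicious(["lookalike_name", "install_script_network_call", "obfuscated_code"]))
```

A single weak flag stays below the threshold, but three weak flags together score 11 points and are treated the same as one strong signature.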

These insights are difficult to discern and only come from a long history of malware analysis, which surfaces examples of patterns that repeat over time. Unfortunately, many thousands of known malicious packages must be analyzed to get this insight.

Our data sources include:

  1. Historical insights – Including the last 10+ years of Sonatype history.

  2. Expertise in data science and cybersecurity – Finding malware and suspicious software components.

  3. Access to Maven Central – As stewards of this Java-based public repository, we are able to better record, categorize, and track malware issues submitted by bad actors or within the transitive dependencies of legitimate projects.

  4. Dedicated security research team – An in-house team of 65+ researchers looking into the latest attacks.

It’s also important to keep adding data and retraining analysis tools, as malware developers constantly evolve.

Ongoing improvement model

Since 2019, we’ve discovered 108,232 packages confirmed as malicious.

This creates a positive feedback loop: the more malware we find, the more data we have to feed our analysis. This in turn better trains our models.


Malware analysis feedback

A cyclical relationship where finding malware continuously improves the analysis

When malicious packages are found, they are often taken down from public repositories. This means there’s no permanent resource for malware analysis. So quality security tools have to be active and engaged in ongoing capture and analysis.

With malware analysis, time is of the essence

Unfortunately, quality data on malware dangers isn’t enough. Although Sonatype tries to quickly disclose malware that’s in the public interest, not everything gets taken down instantly from npm, PyPI, and other public repositories.

Prediction speed is also important, meaning malware analysis must scale to the volume of available new releases.

Additionally, software supply chain attacks on popular resources can have tremendous reach, meaning that, even when malware sees quick removal, it still has enough of a window to cause great harm. As a result, it only takes one malware component to compromise your development environment.

You need to avoid problems in real time.

Automated malware prevention

Improve your software supply chain security by automatically detecting known and unknown malware and blocking it from entering development cycles. Sonatype Repository Firewall’s malware blocking is the most intelligent and secure way to prevent a host of issues as soon as they are released. These include malware like ransomware, crypto-jacking, next-gen supply-chain attacks, brandjacking, typosquatting, and dependency confusion-type attacks.

When a concern is discovered, Sonatype Repository Firewall will automatically replace a suspicious dependency with an alternative version without breaking any builds. If the update is deemed safe after analysis, the suspicious version is automatically released back into your pipeline.
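The hold-and-release behavior described above can be pictured with a small sketch. The class and method names here are hypothetical and do not represent the Repository Firewall API; the sketch only shows the general idea of serving the last known-good version while a suspicious release is under analysis.

```python
class QuarantineProxy:
    """Toy model of a proxy that holds suspicious versions until they are cleared."""

    def __init__(self, known_good):
        # package name -> last version that passed analysis (hypothetical data)
        self.known_good = dict(known_good)
        self.quarantined = set()

    def hold(self, package, version):
        """Quarantine a version that raised a concern."""
        self.quarantined.add((package, version))

    def release(self, package, version):
        """Clear a version after analysis and make it the new known-good version."""
        self.quarantined.discard((package, version))
        self.known_good[package] = version

    def resolve(self, package, requested_version):
        """Return the version a build should receive for this request."""
        if (package, requested_version) in self.quarantined:
            # Serve the last known-good version so the build keeps working.
            return self.known_good[package]
        return requested_version


proxy = QuarantineProxy({"example-lib": "1.2.0"})
proxy.hold("example-lib", "1.3.0")
print(proxy.resolve("example-lib", "1.3.0"))  # served while under analysis
proxy.release("example-lib", "1.3.0")
print(proxy.resolve("example-lib", "1.3.0"))  # served once deemed safe
```

The build asking for the new release receives the earlier version during analysis and the requested version once it has been cleared.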


Repository Firewall flowchart

Repository Firewall combines more than 60 signals to identify potentially malicious activity and block risks before download. These signals feed into a first-of-its-kind AI/ML-powered automated malware detection and protection system. As a result, our prediction moves fast enough to block malware as soon as it’s published.

Sonatype’s platform reduces risk, helps avoid software supply chain attacks, and transparently protects developers from known and unknown risk.

 

Written by Mandeep Singh

Mandeep Singh is a Product Manager for Sonatype Repository Firewall. He has a background in business management and a degree in Computer Science.