In software, the term "hash" has several meanings, but what we discuss here is loosely focused on what Wikipedia calls a “cryptographic hash function”.
In short, hashes are strings of letters and numbers meant to identify a set of information by a smaller, unique [1] code. You may have seen articles here on Sonatype’s blog or elsewhere referring to hashing. If you’ve seen a random-looking text string like the one below, it may have been a “hash.”
The various hash formats come with a long list of odd-sounding names like:
… but they all do similar things. Hashing is something everyone can use, from average users to cybersecurity experts.
Hashing might seem strange and complex at first, but it’s actually very simple. Hashes are a bit like image thumbnails in that they are tiny compared to the files they identify.
The file can be any size, from 1 kilobyte to 100 terabytes, and the hash will always be the same size. The hash value for a given file is also always the same, no matter how large the file or what computer is used to compute it.
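To make the fixed-size, always-the-same idea concrete, here is a minimal sketch using Python's standard hashlib module. The helper name sha256_hex and the sample inputs are assumptions made for illustration; they are not part of any tool mentioned in this article.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hash of some bytes as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


small = b"a"                # a 1-byte input
large = b"a" * 10_000_000   # roughly 10 MB of input

# Both hashes are 64 hexadecimal characters long, whatever the input size.
print(len(sha256_hex(small)), len(sha256_hex(large)))

# Hashing the same input again gives the same value.
print(sha256_hex(small) == sha256_hex(small))

# Different inputs give different values.
print(sha256_hex(small) == sha256_hex(large))
```

The same input should produce the same string on any computer with a correct SHA-256 implementation, which is what makes the value useful as an identifier.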
The task of hashing focuses on one thing: assigning a unique value.
I started long ago with hashes while trying to make sure my company report had no issues. I was working at a bank and using Microsoft Excel to find old data, and that started by looking for duplicate entries.
Fortunately, Excel has an easy option for highlighting duplicate values:
But finding individual cells was not useful. There were a lot of similar numbers throughout.
Instead, I needed to find duplicate rows.
There are many tricks to enable this, but at the time, I was in a rush to catch these embarrassing extras. I decided to just multiply an entire row together (as below) and check the results column for a duplicate result.
Multiplying all cells together to get a unique value.
I assumed the result would always be unique, so I could easily flag duplicate rows.
Unfortunately, the results weren’t always unique. I came across an issue where two very obviously different rows happened to get the same multiplied result, or a “false positive.”
I needed to find a way to show an absolutely unique value for every unique row in the spreadsheet.
Unfortunately, I ended up doing a lot of extra work manually checking each duplicate row. It was better than submitting a bad report, but I knew there was a better method.
Not long afterwards I learned about a trick that could deliver a unique number for each row: hashing. And it’s a technique in use throughout computing.
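The difference between the multiplication trick and hashing can be sketched in a few lines of Python. The rows and function names here are made up for illustration, and joining cells with a "|" separator is just one assumed way to turn a row into text before hashing it.

```python
import hashlib
from math import prod


def row_product(row):
    """The spreadsheet shortcut: multiply every cell in the row together."""
    return prod(row)


def row_hash(row):
    """Hash the row's contents instead of multiplying them."""
    # A separator keeps (1, 23) and (12, 3) from serializing to the same text.
    text = "|".join(str(cell) for cell in row)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


row_a = (2, 6, 5)
row_b = (3, 4, 5)

# Different rows, same product (60): the "false positive" described above.
print(row_product(row_a) == row_product(row_b))  # True

# The hashes of the two rows do not match.
print(row_hash(row_a) == row_hash(row_b))  # False

# A genuine duplicate row still produces a matching hash.
print(row_hash(row_a) == row_hash((2, 6, 5)))  # True
```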
First, no matter how large the file or what computer is used to compute it, the hash value is always the same.
And these unique values carry valuable information that lets you:
Although hashing has been around since the early days of computers, more recently hashes have been used as a way to quickly fingerprint files on the internet.
Security software and professionals primarily use file hashes to determine the status of a file, whether good or bad. For example, files whose hashes show up in a virus database should get blocked from your computer. Files whose hashes are considered safe and well-known (such as the Firefox and Chrome browsers) can be installed without issue.
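As a rough sketch of that good-or-bad check, the Python below looks up a file's hash in two small sets. Everything here is hypothetical: the KNOWN_BAD and KNOWN_GOOD sets, the stand-in file contents, and the reputation function are assumptions for illustration, and real security tools consult much larger, maintained databases.

```python
import hashlib

# Hypothetical reputation data built from stand-in file contents.
KNOWN_BAD = {hashlib.sha256(b"pretend malware").hexdigest()}
KNOWN_GOOD = {hashlib.sha256(b"pretend browser installer").hexdigest()}


def reputation(data: bytes) -> str:
    """Classify file contents by looking up their SHA-256 hash."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in KNOWN_BAD:
        return "block"
    if digest in KNOWN_GOOD:
        return "allow"
    return "unknown"


print(reputation(b"pretend malware"))            # block
print(reputation(b"pretend browser installer"))  # allow
print(reputation(b"something never seen"))       # unknown
```

Because only the hash is compared, the lookup takes the same effort whether the file is tiny or enormous.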
Most of these tools for checking reputation are built right into the software, meaning programs check hashes as a normal part of their operations.
Firefox uses hashes in the background to know if a file is malicious
One important job that Sonatype Repository Firewall performs for our customers is keeping bad, outdated, or malicious software out of the development process.
When a new program is analyzed, it’s checked against our database for problems. If it’s a known-good file, it’s passed along as normal. If a file is unknown or has a bad reputation (the objects in red and yellow below), it’s blocked. After they’re fully analyzed, any files with that same hash will always get treated the same way.
Whether that’s given the green light as great software or blocked from ever being used, hashing helps make sure it’s cataloged and managed according to your policy.
You can also manually check the hash values within the software:
Although hashing tools are often built in, it’s possible to manually check the results.
One way to use a hash is to check the downloads from an untrusted website. Some security researchers will check hash values on files even from trusted locations, especially when saved to a critical workstation or server.
While there are dozens of tools that can do this, I use the open source PeaZip archive manager for Windows, Mac, and Linux.
To view hashes, right-click on a file, choose File Manager – File Tools – Checksum/hash file(s) and select the “Clipboard” tab.
From there, you can double-click on the SHA256 value and copy (CTRL+C or Apple+C). This value is the standard for security analysis.
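If you would rather compute the SHA256 value without a graphical tool, a short Python function can do it. This is a minimal sketch using the standard hashlib module, not a description of how PeaZip works internally; the function name sha256_of_file and the 1 MB chunk size are assumptions chosen for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hash of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a very large file never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling sha256_of_file on the path of a downloaded file should return the same 64-character value a checksum tool reports for that file.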
Now that you have this long string of text, you can view its reputation in services like VirusTotal.com. This will show whether the file is considered good, bad, or unknown. Just click the Search tab and paste in the value.
A good reputation is a score of 0, meaning “zero threats.” Choosing to use a file with a score above 0 comes with some caveats. While a score of 1 or 2 may reflect “false positives” from over-cautious anti-virus tools, a score of 3 or higher calls for additional steps. These could include researching the author, running the file within a secure sandbox, or other precautions.
The file may not have been evaluated if there’s no reputation (“No matches found”) as pictured below.
At this point, you can set aside the file and either check again later or assume it’s unsafe and delete it.
Although all of these tools use hashing as part of their operations, they are separate topics. In short:
—
Hashes are a simple tool with many uses, including duplicate detection, security, and reputation. The capabilities are built into many software programs and tools, but you can also use them yourself to solve problems in computing today.
Software development teams interested in learning about how Sonatype uses AI analysis to build file reputations can schedule a demo today.
---
[1] One of this article's readers reached out to me and let me know that it's more accurate to call hash values "distinct" rather than unique. Just like there are only so many possible PINs a person could choose for their bank account, there is a limit (in the millions or more) to the number of possible hash values. As such, it's possible for the same hash value to be assigned to different files. This is known as a "hash collision."
The best way to approach values that are totally unique is to use high-quality hashing tools based on SHA-256, a format with 256^32 (that is, 2^256) possible combinations. Here, a collision is extremely unlikely.
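For a sense of scale, the arithmetic behind that figure can be checked in Python. This is only a sketch of the count of possible values, not a full collision-probability analysis; the variable names are assumptions for illustration.

```python
import hashlib

# A SHA-256 hash is 32 bytes long, and each byte can take 256 values.
digest_bytes = hashlib.sha256().digest_size
possible_values = 256 ** digest_bytes

print(digest_bytes)                 # 32
print(possible_values == 2 ** 256)  # True
print(len(str(possible_values)))    # 78, the number of decimal digits
```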