What is hashing? A look at unique identifiers in software

In software, the term "hash" has several meanings, but what we discuss here is loosely focused on what Wikipedia calls a “cryptographic hash function”.

What is hashing?

In short, hashes are strings of letters and numbers meant to identify a set of information by a smaller, unique[1] code. You may have seen articles here on Sonatype’s blog or elsewhere referring to hashing. If you’ve seen a random-looking text string like the one below, it may have been a “hash.”

The various hash formats come with a long list of odd-sounding names like:

MD5
SHA1
Whirlpool
CRC32

… but they all do similar things. Hashing is something everyone can use, from average users to cybersecurity experts.
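To make that concrete, here is a minimal sketch using Python's standard library, with a made-up input string. It runs the same input through three of the formats named above, plus SHA256, which comes up later in this article. Whirlpool is left out because it is not guaranteed to be available in Python's hashlib module.

```python
import hashlib
import zlib

# Made-up input; any bytes would do.
message = b"hello world"

digests = {
    "MD5": hashlib.md5(message).hexdigest(),
    "SHA1": hashlib.sha1(message).hexdigest(),
    "SHA256": hashlib.sha256(message).hexdigest(),
    # zlib.crc32 returns an integer, so format it as 8 hex digits.
    "CRC32": format(zlib.crc32(message), "08x"),
}

# Print each format's name, the length of its output, and the value itself.
for name, value in digests.items():
    print(f"{name:7} {len(value):2} hex chars  {value}")
```

Each format produces a different-looking string of a different length, but the idea is the same: a short code that stands in for the original data.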

Hashing is a surprisingly simple technology

Hashing might seem strange and complex at first, but it’s actually very simple. Hashes are a bit like image thumbnails in that they are tiny compared to the files they identify.

A photo and its thumbnail next to an image of lots of data next to a small text digest

The file can be any size, from 1 kilobyte to 100 terabytes, and the hash will always be the same size. The hash value for a given file is also always the same, no matter how large the file is or what computer is used to compute it.
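A quick way to see both properties is a small Python sketch. It assumes SHA-256 as the hash format (my choice for illustration) and uses made-up inputs instead of real files.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

small = b"hi"               # 2 bytes
large = b"x" * 10_000_000   # roughly 10 megabytes

# Both digests have the same length, whatever the input size.
print(len(sha256_hex(small)), len(sha256_hex(large)))

# Hashing the same bytes again gives the same digest.
print(sha256_hex(small) == sha256_hex(b"hi"))

# Changing a single character gives a different digest.
print(sha256_hex(b"hi") == sha256_hex(b"ho"))
```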

The task of hashing focuses on one thing: assigning a unique value.

Why are unique values so important in hashing?

I started long ago with hashes while trying to make sure my company report had no issues. I was working at a bank and using Microsoft Excel to find old data, and that started by looking for duplicate entries.

Fortunately, Excel has an easy option for highlighting duplicate values:

A screenshot of the navigation menu in Excel under the "Home" drop down. The photo shows a user navigating from "Home" to "Highlight Cells Rules" to "Duplicate Values..."

But finding individual cells was not useful. There were a lot of similar numbers throughout.

Instead, I needed to find duplicate rows.

There are many tricks to enable this, but at the time, I was in a rush to catch these embarrassing extras. I decided to just multiply an entire row together (as below) and check the results column for a duplicate result.

A screenshot of a row in an Excel spreadsheet. It demonstrates what it looks like when you multiply all the cells together to get a unique value.

Multiplying all cells together to get a unique value.

Because the result seemed to be unique for each row, I could easily flag duplicate rows.

A screenshot of four rows in an Excel sheet. Row 3 is highlighted in red and is marked as a "duplicate". A row with the same inputs and the same outputs (in red).

Unfortunately, the results weren’t always unique. I came across an issue where two very obviously different rows happened to get the same multiplied result: a “false positive.”

A screenshot of 4 rows in Excel. Rows with the same inputs and the same outputs are highlighted in red. Row 3 has a note marking it as a "duplicate". Row 4 has a note marking it as a "false positive". An additional row with different inputs but the same output as the other two (in red).

I needed to find a way to show an absolutely unique value for every unique row in the spreadsheet.

Unfortunately, I ended up doing a lot of extra work manually checking each duplicate row. It was better than submitting a bad report, but I knew there was a better method.

Not long afterwards I learned about a trick that could deliver a unique number for each row: hashing. And it’s a technique in use throughout computing.
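Here is a rough sketch of the difference in Python. The rows are invented for illustration (they are not my bank data), and SHA-256 stands in for whichever hash you prefer. The helper names row_product, row_hash, and duplicate_indexes are made up for this example.

```python
import hashlib
from math import prod

# Invented sample rows.
rows = [
    (2, 6, 5),
    (7, 1, 9),
    (2, 6, 5),   # true duplicate of the first row
    (3, 4, 5),   # different row, but the same product as the first (60)
]

def row_product(row):
    """The multiplication trick: one number per row."""
    return prod(row)

def row_hash(row):
    """Hash the row's cells instead of multiplying them."""
    # Join cells with a separator so (1, 23) and (12, 3) stay distinct.
    text = "|".join(str(cell) for cell in row)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def duplicate_indexes(rows, key):
    """Return indexes of rows whose key already appeared on an earlier row."""
    seen = set()
    flagged = []
    for index, row in enumerate(rows):
        value = key(row)
        if value in seen:
            flagged.append(index)
        seen.add(value)
    return flagged

print(duplicate_indexes(rows, row_product))  # [2, 3]: includes a false positive
print(duplicate_indexes(rows, row_hash))     # [2]: only the true duplicate
```

Multiplication throws away information (many different rows share a product), while the hash is computed from every cell in order.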

Why would I use a hash file?

First, no matter how large the file or what computer is used to compute it, the hash value is always the same.

And these unique values carry valuable information that lets you:

Find duplicate files such as finding and deleting duplicate photos. Any files with the same hash are duplicates – you don’t need to open and compare them.
Identify a file - You and a coworker are updating the same file, and each of you uploads it to a server. If the server doesn’t show who posted what, how do you determine which one was yours without going line-by-line for changes? Just compare the hash of the file on your machine with the hash of the remote file.
Ensure the file you’ve downloaded is the right one. For example, if you get a software program from a website, how do you know the website or the upload wasn’t hijacked or corrupted? Hashes can help detect problems.
Assign a reputation to a file. If an older version of a program worked better than the latest, knowing the hash lets you identify which one to use.
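As an example of the first use, here is a minimal Python sketch that groups files by their hash to find duplicates. It assumes SHA-256. The function names file_sha256 and find_duplicates are made up, and the demo uses throwaway files with invented names in a temporary folder.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(folder):
    """Group files under folder by hash; return groups with 2 or more files."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            groups[file_sha256(path)].append(path.name)
    return [names for names in groups.values() if len(names) > 1]

# Demo with throwaway files; the names and contents are invented.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "beach.jpg").write_bytes(b"same pixels")
    Path(folder, "beach_copy.jpg").write_bytes(b"same pixels")
    Path(folder, "city.jpg").write_bytes(b"other pixels")
    duplicates = find_duplicates(folder)

print(duplicates)
```

The file names never enter the comparison. Only the contents are hashed, so a renamed copy is still caught.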

Although hashing has been around since the early days of computers, more recently hashes have been used as a way to quickly fingerprint files on the internet.

How are hashes used in security?

The primary use of hashes by security software and professionals is to determine the status of a file, whether good or bad. For example, files whose hashes show up in a virus database should get blocked from your computer. Files whose hashes are considered safe and well-known (such as those of the Firefox and Chrome browsers) can be installed without issue.

Most of these tools for checking reputation are built right into the software, meaning programs check hashes as a normal part of their operations.
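The general idea can be sketched in a few lines of Python. This is only an illustration of a hash lookup, not how any particular product works. The KNOWN_BAD and KNOWN_GOOD lists, the reputation function, and the file contents are all hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical reputation lists, built here from made-up file contents.
# A real tool would load hashes from a maintained database instead.
KNOWN_BAD = {sha256_hex(b"pretend malware bytes")}
KNOWN_GOOD = {sha256_hex(b"pretend browser installer bytes")}

def reputation(data: bytes) -> str:
    """Classify file contents by looking up their hash."""
    digest = sha256_hex(data)
    if digest in KNOWN_BAD:
        return "block"
    if digest in KNOWN_GOOD:
        return "allow"
    return "unknown"

print(reputation(b"pretend malware bytes"))
print(reputation(b"pretend browser installer bytes"))
print(reputation(b"something never seen before"))
```

The lookup only ever compares short hash strings, which is why this kind of check can run quietly in the background.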

A screenshot of Firefox blocking a virus or malware from being downloaded.

Firefox uses hashes in the background to know if a file is malicious

How Sonatype uses hashes

One important job that Sonatype Repository Firewall performs for our customers is keeping bad, outdated, or malicious software out of the development process.

When a new program is analyzed, it’s checked against our database for problems. If it’s a known-good file, it’s passed along as normal. If a file is unknown or has a bad reputation (the objects in red and yellow below), it’s blocked. After a file is fully analyzed, any file with that same hash will always get treated the same way.

article - repo firewall flowchart

Whether a file is given the green light as great software or blocked from ever being used, hashing helps make sure it’s cataloged and managed according to your policy.

You can also manually check the hash values within the software:

A screenshot of what a hash looks like listed inside of a Sonatype image. An example of a hash listed inside a Sonatype image.

How you can use hashes today

Although hashing tools are often built-in, it’s possible to manually check the results.

One way to use a hash is to check the downloads from an untrusted website. Some security researchers will check hash values on files even from trusted locations, especially when saved to a critical workstation or server.

While there are dozens of tools that can do this, I use the open source PeaZip archive manager for Windows, Mac, and Linux.

To view hashes, right-click on a file, choose File Manager – File Tools – Checksum/hash file(s) and select the “Clipboard” tab.

A screenshot of how to view hashes in files within the PeaZip archive manager.

From there, you can double-click on the SHA256 value and copy it (CTRL+C, or Command+C on a Mac). This value is the standard for security analysis.
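If you would rather script this step, the sketch below computes a file's SHA256 value in Python and compares it with a value published by the software's author. The function names and file contents are made up, and the demo writes a throwaway file in place of a real download.

```python
import hashlib
import tempfile
from pathlib import Path

def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, published_hex):
    """Compare hashes, ignoring letter case and surrounding whitespace."""
    return file_sha256(path) == published_hex.strip().lower()

# Demo: a throwaway file stands in for the download.
with tempfile.TemporaryDirectory() as folder:
    download = Path(folder, "installer.bin")
    download.write_bytes(b"example installer contents")

    # Pretend this value was copied from the author's download page.
    published = hashlib.sha256(b"example installer contents").hexdigest().upper()
    ok = matches_published_hash(download, "  " + published + "\n")

    # Simulate a corrupted or tampered download.
    download.write_bytes(b"example installer contents, tampered")
    tampered_ok = matches_published_hash(download, published)

print(ok, tampered_ok)
```

A match only tells you the file is the one the author published. It says nothing about whether that file is safe, which is where a reputation service comes in.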

Using VirusTotal

Now that you have this long string of text, you can view its reputation in services like VirusTotal.com. This will show whether the file is considered good, bad, or unknown. Just click the Search tab and paste in the value.

A screenshot of how to paste your hash value into VirusTotal to find out if it is good, bad, or unknown.

Interpreting your score

A good reputation is a score of 0, meaning “zero threats.” Choosing to use a file with a score above 0 comes with some caveats. While a score of 1 or 2 may reflect “false positives” from over-cautious anti-virus tools, a file with a score higher than 3 calls for additional steps. These could include researching the author, interacting with the file only within a secure sandbox, or other precautions.

The file may not have been evaluated if there’s no reputation (“No matches found”) as pictured below.

A screenshot of VirusTotal on the "no matches found" screen.

At this point, you can set the file aside and either check again later or assume it’s unsafe and delete it.

Are hashes related to digital signatures, cryptography, or cryptocurrency?

Although all of these tools use hashing as part of their operations, they are separate topics. In short:

Cryptography and digital signatures use hashes to ensure the encrypted files are not changed between sender and receiver.
Cryptocurrency uses a complex form of digital signatures for transactions.

—
Hashes are a simple tool with many uses, including duplicate detection, security, and reputation. The capabilities are built into many software programs and tools, and you can also use them yourself to solve problems in computing today.

Software development teams interested in learning about how Sonatype uses AI analysis to build file reputations can schedule a demo today.

---

[1] One of this article's readers reached out to let me know that it's more accurate to call hash values "distinct" rather than "unique." Just as there are only so many possible PINs a person could choose for their bank account, there is a limit (in the millions or more) to the number of possible hash values. As such, it's possible for the same hash value to be assigned to different files. This is known as a "hash collision."

The best way to get values that are effectively unique is to use high-quality hashing tools that use SHA-256, a format with 256^32 (that is, 2^256) possible combinations. Here, a collision is extremely unlikely.
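To get a feel for those numbers, here is a small Python sketch. It checks the arithmetic, then builds a deliberately tiny toy hash with only 256 possible values to show how quickly collisions appear when the space of hash values is small. The toy hash and the names tiny_hash and first_collision are made up for illustration.

```python
import hashlib

# The number of possible SHA-256 values, written two equivalent ways.
sha256_space = 2 ** 256
print(sha256_space == 256 ** 32)   # True
print(len(str(sha256_space)))      # how many decimal digits that number has

def tiny_hash(data: bytes) -> str:
    """Toy hash: keep only the first 2 hex characters (256 possible values)."""
    return hashlib.sha256(data).hexdigest()[:2]

def first_collision(limit=10_000):
    """Hash the inputs b"0", b"1", ... until two share a toy hash value."""
    seen = {}
    for number in range(limit):
        data = str(number).encode("utf-8")
        value = tiny_hash(data)
        if value in seen:
            return seen[value], data
        seen[value] = data
    return None

first, second = first_collision()
print(first, second, tiny_hash(first) == tiny_hash(second))

# The full SHA-256 values of the same two inputs are still different.
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

With only 256 possible values, a collision is guaranteed by the time 257 different inputs have been hashed. With 2^256 possible values, the same search would not finish in any practical amount of time.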