Wednesday, December 19, 2018

MD5 should not be used in forensics (or anywhere else)

A few days ago, I drafted (but had not yet published) a post about using MD5 for validating or authenticating evidence in digital forensics.  MD5 has had known security problems for twenty years, but it is still used in forensics, although the trend has been toward SHA-1 (which has some problems of its own) and SHA-2.

After drafting the post, I discovered that the Scientific Working Group on Digital Evidence has released a draft endorsing the use of MD5 and SHA-1.  I wrote in to share my concerns, but I also reached out to some cryptographers via Twitter.  Dr. Marc Stevens, a cryptographer known for his expertise in attacking MD5 and other hash functions, released a series of tweets that was even more critical of MD5 than I anticipated and that was incredibly damning for any forensic expert who continues to rely on MD5.

First, I'll share my original thoughts in abbreviated form.  Then I'll share some highlights from Dr. Stevens' tweets.  If you're interested in Dr. Stevens' views, consider reading all of what he had to say on Twitter and in his scientific work.  If I have misrepresented or misunderstood his views in any way, I apologize.

When we image and process digital evidence, we use a hash function to fingerprint that data so that we can compare it to other known files and so that, later on, we can verify that the evidence hasn't changed.  SHA-1 is probably the most common hash function used in forensics and there is some support for SHA-256, which is what we should be moving toward.
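A minimal sketch of that fingerprint-and-verify workflow in Python, assuming the evidence is an ordinary file on disk. `hash_file` and `verify_file` are hypothetical helper names for illustration, not part of any forensic tool:

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path`, reading it in chunks
    so that large images do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, recorded_digest, algorithm="sha256"):
    """Return True if the file still hashes to the digest recorded earlier
    (for example, at acquisition time)."""
    return hash_file(path, algorithm) == recorded_digest.lower()
```

In use, you would record `hash_file("image.dd")` when the image is acquired (the file name here is made up) and call `verify_file("image.dd", recorded)` before relying on it later.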

In order to be considered secure, a hash function should be strong against two attacks: collisions and preimages.  A collision occurs when we find two "messages" (files, strings, whatever) that have the same hash value.  To be secure, it should be hard to find two files that have the same hash.  Note that in this scenario we are allowed to pick both messages.  If we can find any two that match, we have a collision.  A preimage is a little different because one of the messages has already been picked.  To find a preimage, we have to find a second message that has the same hash value.  The distinction is like the difference between trying to find two people in a room with the same birthday (anybody can match anybody) versus trying to find somebody in a room with your birthday.

Note: I'm glossing over the difference between preimages and second preimages because I don't think it's important for this discussion.
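To make the birthday analogy concrete, here is a toy sketch in Python. It assumes a deliberately weakened hash (SHA-256 truncated to 16 bits) so that both searches finish quickly. The function names are hypothetical, and this is not how real attacks on MD5 work; those exploit structural weaknesses rather than brute force.

```python
import hashlib
from itertools import count


def toy_hash(message: bytes, bits: int = 16) -> int:
    """Keep only the top `bits` bits of SHA-256 (toy example only)."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int = 16):
    """Birthday search: any two distinct messages with the same toy hash.
    Returns (message_a, message_b, attempts)."""
    seen = {}
    for i in count():
        msg = b"msg-%d" % i
        h = toy_hash(msg, bits)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg


def find_second_preimage(target: bytes, bits: int = 16):
    """Search for a different message that matches one fixed target.
    Returns (message, attempts)."""
    wanted = toy_hash(target, bits)
    for i in count():
        msg = b"try-%d" % i
        if msg != target and toy_hash(msg, bits) == wanted:
            return msg, i + 1


if __name__ == "__main__":
    a, b, n = find_collision()
    print("collision after", n, "attempts:", a, b)
    m, k = find_second_preimage(b"evidence")
    print("second preimage after", k, "attempts:", m)
```

For an n-bit hash, brute force needs on the order of 2^(n/2) attempts for a collision but on the order of 2^n for a preimage, so the first search should typically finish far sooner than the second.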

MD5 is considered a weak hash function because there are practical attacks for finding collisions.  There aren't any practical attacks for finding preimages for MD5.

If we need to verify that a file hasn't changed, MD5 is plenty good enough to detect accidental modification.  If the file was corrupted or inadvertently modified by a careless examiner, there is an infinitesimally small chance that the hash will come out the same.  If we're worried that someone has intentionally altered the data, they would have to be able to execute an attack (find a preimage) that is beyond what anyone is currently able to do using publicly-known attacks.  Hell, even if the file wasn't hashed, a court would probably not allow someone to assert that the evidence had been altered without some evidence suggesting it had.

So, we can use MD5, right?

I think you do so at your own peril.

The problem is that cryptographers, the people who are experts in making hashes and ciphers, have been saying not to use MD5 for 20 years and the attacks against MD5 have gotten much, much better since then.  When a forensic examiner goes into court, he or she serves the court as an "expert".  I feel like I could offer a reasonable defense/explanation for using MD5.  I've read books on cryptography and taken a grad-level class in it.  I'm knowledgeable (enough to be dangerous).  I think I understand it well enough to say that despite the warnings it's okay to use it in certain circumstances.  But I'm not an expert in cryptography so why would I try to weigh in as one?  [Note: Dr. Stevens' tweets indicate that he disagrees with my contention that MD5 would be acceptable in some circumstances.  But, that's my point.  Any situation where I think it might be okay to use MD5 is based on my amateur understanding of cryptography, not the expert-level understanding that he or his colleagues would have.]

There's an added complication.  Even if MD5 is okay to use in these scenarios, trying to justify it without a good understanding of why could lead you into some murky waters.   Simply not being careful about how you answer questions could get you trapped by a well-prepared attorney.

Imagine this: You go into court and explain how you verified the images in your case using MD5.  The defense attorney asks you some very innocent questions about it:  "What's MD5?" and "Can two files have the same hash?"

You give the best explanation that you remember from your training: "the odds of two files having the same hash are like 1 in 80 bajillion."

"So", he says, "I couldn't just change the file and tweak it so the hash would be the same?"

"No way", you say.  "It's like winning the lottery five times in a row."

The defense attorney smiles back at you and grabs a stack of papers off of his table.  He has an article about how some researchers forged digital certificates that used MD5.  He'd like you to read the highlighted portion.  He has another about how the Flame malware hijacked Windows Update because of MD5.  Would you please read the paragraph he highlighted there as well?  He picks up a USB drive and tells you he has pictures of Jack Black, James Brown, and Barry White and they all have the same hash.  He has a picture of a ship and a plane and those two have the same hash.  He'd like you to hash these files to demonstrate.

"So", he says again.  "What you told us a few minutes ago about the hashes.  It wasn't true, was it?"
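The demonstration the attorney asks for takes only a few lines. Here is a rough Python sketch, assuming the two files are small enough to read into memory; `file_digests` and `compare_files` are hypothetical names. It hashes both files under several algorithms and reports where the digests agree.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256")


def file_digests(path, algorithms=ALGORITHMS):
    """Return {algorithm: hex digest} for the file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}


def compare_files(path_a, path_b, algorithms=ALGORITHMS):
    """Return {algorithm: True/False}, True where both files share a digest."""
    da = file_digests(path_a, algorithms)
    db = file_digests(path_b, algorithms)
    return {name: da[name] == db[name] for name in algorithms}
```

For a genuine MD5 collision pair, I would expect `compare_files` to report `True` for `md5` and `False` for `sha256`, which is exactly the result you do not want to be explaining on the stand.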

That's about all I had in my original draft.  Here's what Dr. Stevens had to say; the tweets are not necessarily in order:

I think these tweets are key because they argue (from his expert perspective) that we should not use MD5 and also point out that this is the prevailing opinion among cryptographers. That matters because the methods we use in a legal case are supposed to meet a standard, namely the Daubert standard, which considers five factors:

1. Whether a theory or technique can be and has been tested
2. Whether the theory or technique has been subject to both peer review and publication
3. The known or potential error rate of the method
4. The existence and maintenance of standards controlling its operations; and
5. Whether it has attracted widespread acceptance within the relevant scientific community

Looking at #5, I don't know whether a court would consider forensic experts or cryptographers to be the relevant scientific community, but cryptographers (who are responsible for almost every publication on the analysis of MD5) have widely rejected it.  They have tested it (#1), subjected it to peer review (#2), found errors (#3) and they have declared in academic papers and in public that it should not be used.  Many forensic examiners, however, find it acceptable.

Responding to an argument that MD5 is still acceptable for use in forensics, Dr. Stevens countered that the defense was based on the other circumstances (e.g. chain of custody) that make the evidence trustworthy, not on the assurance provided by MD5.

While Dr. Stevens might not have been thinking about Daubert, that certainly sounds like an argument that MD5 would not meet the Daubert standard.

I included this tweet because it relates to what I said previously about the fact that we (forensic examiners) are not experts in cryptography.  Understanding which specific use cases might be okay for MD5 requires a good understanding of the attacks against MD5 and how they can be used.  We don't have that expertise so we should trust in cryptographers and not use MD5.

Most of the time, the authenticity of an image or other files is assured by having good procedures and a proper chain of custody.  Any cryptographic hash or CRC function can detect accidental modification.  We don't use cryptographic hashes in case something is accidentally modified.  We use them either to prevent intentional modification or to provide a scientific air of respectability.  It's pretty clear that Stevens does not think that MD5, or even SHA-1, should be used to provide any sort of higher guarantee about the authenticity of digital evidence.

At this point, I think anyone trying to rely on MD5 in court is committing a grave error.  And any forensic examiner trying to defend MD5 is out over his skis.


  1. I would be interested to see Dr Stevens manipulate a forensic image to put artefacts of interest to prove or disprove an incident. I think that's what Boucher was arguing as to why it's still valid in forensic cases.
    I also don't think that the argument of the cluey defense lawyer really stacks up for this reason.

    That being said, many forensic utilities will hash to md5 and sha1 upon creating a forensic image, so it's a moot point for the time being.
    Using it to highlight that a file in a hashset has been found in a seized dataset is a very different use case from manipulating a certificate on a live server - which is where the cryptographers and security folks are coming from, and for their purposes md5 and now sha1 are broken

    1. My argument about the defense lawyer isn't about whether some examiners can defend MD5 successfully; it's that it leaves an opening. My belief is that many, possibly most, examiners aren't very knowledgeable about cryptography and can't do more than parrot what they learned in training class X. Using MD5 and having to explain why the various security problems aren't an issue in your case is not worth the risk when SHA-2 exists and is free. I don't want to ever have to explain to a jury why preimages are different from collisions.

      The other issue is that Dr Stevens and other cryptographers, who are the experts in their area, say that we should not use MD5 at all. That's a big problem when you consider that we're supposed to use valid/accepted scientific methods. The scientists in this area say that MD5 is not valid and forensic examiners, with no special training in cryptography, want to argue that it is.

      Identifying known-bad hashes is different than validating acquired evidence. I recall that Dr Stevens was okay with this since it's only used to triage files and the positive results will be verified visually.

