After drafting the post, I discovered that the Scientific Working Group on Digital Evidence has released a draft endorsing the use of MD5 and SHA-1. I wrote in to share my concerns, but I also reached out to some cryptographers via Twitter. Dr. Marc Stevens, a cryptographer known for his expertise in attacking MD5 and other hash functions, released a series of tweets that was even more critical of MD5 than I anticipated and that was incredibly damning for any forensic expert who continues to rely on MD5.
First, I'll share my original thoughts in abbreviated form. Then I'll share some highlights from Dr. Stevens' tweets. If you're interested in Dr. Stevens' views, consider reading all of what he had to say on Twitter and in his scientific work. If I have misrepresented or misunderstood his views in any way, I apologize.
When we image and process digital evidence, we use a hash function to fingerprint that data so that we can compare it to other known files and so that, later on, we can verify that the evidence hasn't changed. MD5 is probably the most common hash function used in forensics and there is some support for SHA-256, which is what we should be moving toward.
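To make the fingerprinting step concrete, here is a minimal sketch of hashing an evidence file with SHA-256 and checking it again later. The function names and the chunk size are my own illustrative choices, not part of any forensic tool, and a real workflow would also record who computed the hash and when.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so that large images do not have to fit
    in memory. `chunk_size` is an arbitrary illustrative value.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, recorded_digest):
    """Return True if the file still hashes to the digest recorded earlier."""
    return sha256_of_file(path) == recorded_digest.lower()
```

The digest returned by `sha256_of_file` at acquisition time is what you would write into your notes; `verify_file` is the later check that the evidence still matches it.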
In order to be considered secure, a hash function should be strong against two attacks: collisions and preimages. A collision occurs when we find two "messages" (files, strings, whatever) that have the same hash value. To be secure, it should be hard to find two files that have the same hash. Note that in this scenario we are allowed to pick both messages. If we can find any two that match, we have a collision. A preimage is a little different because one of the messages has already been picked. To find a preimage, we have to find a second message that has the same hash value. The distinction is like the difference between trying to find two people in a room with the same birthday (anybody can match anybody) versus trying to find somebody in a room with your birthday.
Note: I'm glossing over the difference between preimages and second preimages because I don't think it's important for this discussion.
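The birthday analogy can be put into numbers. The sketch below is only an illustration of the analogy, assuming 365 equally likely birthdays; the function names are mine. The same gap exists for an ideal n-bit hash, where a brute-force collision search takes on the order of 2^(n/2) tries but a brute-force preimage search takes on the order of 2^n.

```python
def p_any_pair_matches(people, days=365):
    """Probability that at least two of `people` share a birthday.

    This is the "collision" side of the analogy: anybody can match anybody.
    """
    p_all_different = 1.0
    for i in range(people):
        p_all_different *= (days - i) / days
    return 1.0 - p_all_different


def p_someone_matches_you(others, days=365):
    """Probability that at least one of `others` has YOUR birthday.

    This is the "preimage" side: the target is fixed in advance.
    """
    return 1.0 - ((days - 1) / days) ** others


if __name__ == "__main__":
    # Compare the two probabilities for the same room size.
    for n in (10, 23, 50):
        print(n, round(p_any_pair_matches(n), 3), round(p_someone_matches_you(n), 3))
```

With 23 people, the chance that some pair matches is a little over one half, while the chance that any of 23 other people matches your specific birthday is only about 6 percent.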
MD5 is considered a weak hash function because there are practical attacks for finding collisions. There aren't any practical attacks for finding preimages for MD5.
If we need to verify that a file hasn't changed, MD5 is plenty good enough to detect accidental modification. If the file was corrupted or inadvertently modified by a careless examiner, there is an infinitesimally small chance that the hash will come out the same. If we're worried that someone has intentionally altered the data, they would have to be able to execute an attack (find a preimage) that is beyond what anyone is currently able to do using publicly-known attacks. Hell, even if the file wasn't hashed, a court would probably not allow someone to assert that the evidence had been altered without some evidence suggesting it had.
So, we can use MD5, right?
I think you do so at your own peril.
The problem is that cryptographers, the people who are experts in making hashes and ciphers, have been saying not to use MD5 for 20 years and the attacks against MD5 have gotten much, much better since then. When a forensic examiner goes into court, he or she serves the court as an "expert". I feel like I could offer a reasonable defense/explanation for using MD5. I've read books on cryptography and took a grad-level class in it. I'm knowledgeable (enough to be dangerous). I think I understand it well enough to say that despite the warnings it's okay to use it in certain circumstances. But I'm not an expert in cryptography so why would I try to weigh in as one? [Note: Dr. Stevens' tweets indicate that he disagrees with my contention that MD5 would be acceptable in some circumstances. But, that's my point. Any situation where I think it might be okay to use MD5 is based on my amateur understanding of cryptography, not the expert-level understanding that he or his colleagues would have.]
There's an added complication. Even if MD5 is okay to use in these scenarios, trying to justify it without a good understanding of why could lead you into some murky waters. Simply not being careful about how you answer questions could get you trapped by a well-prepared attorney.
Imagine this: You go into court and explain how you verified the images in your case using MD5. The defense attorney asks you some very innocent questions about it: "What's MD5?", "can two files have the same hash?".
You give the best explanation that you remember from your training: "the odds of two files having the same hash are like 1 in 80 bajillion."
"So", he says "I couldn't just change the file and tweak it so the hash would be the same?"
"No way", you say. "It's like winning the lottery five times in a row."
The defense attorney smiles back at you and grabs a stack of papers off of his table. He has an article about how some researchers forged digital certificates that used MD5. He'd like you to read the highlighted portion. He has another about how the Flame malware hijacked Windows Update because of MD5. Would you please read the paragraph he highlighted there as well? He picks up a USB drive and tells you he has pictures of Jack Black, James Brown, and Barry White and they all have the same hash. He has a picture of a ship and a plane and those two have the same hash. He'd like you to hash these files to demonstrate.
"So", he says again. "What you told us a few minutes ago about the hashes. It wasn't true, was it?"
That's about all I had in my original draft. Here's what Dr. Stevens had to say; the tweets are not necessarily in order:
I disagree: cryptography is notoriously hard to get right. You should rely on expert cryptographic advice. And the prevailing expert opinion is: do not use MD5 for security.— Marc Stevens (@realhashbreaker) December 16, 2018
And nowhere MD5 actually helps you in court, and can only hurt, since any cryptographic expert would say it should not be used for that. While SHA2 would help you in court. So what would be the best advice?— Marc Stevens (@realhashbreaker) December 16, 2018
I think these tweets are key because they argue (from his expert perspective) that we should not use MD5 but also point out that this is the prevailing opinion among cryptographers. That matters because the methods that we use in a legal case are supposed to meet a standard, namely the Daubert standard, which considers five factors:
1. Whether a theory or technique can be and has been tested
2. Whether the theory or technique has been subject to both peer review and publication
3. The known or potential error rate of the method
4. The existence and maintenance of standards controlling its operations; and
5. Whether it has attracted widespread acceptance within the relevant scientific community
Looking at #5, I don't know whether a court would consider forensic experts or cryptographers to be the relevant scientific community, but cryptographers (who are responsible for almost every publication on the analysis of MD5) have widely rejected it. They have tested it (#1), subjected it to peer review (#2), found errors (#3) and they have declared in academic papers and in public that it should not be used. Many forensic examiners, however, find it acceptable.
The problem of course is that your defence for MD5 is based on the entire situation and not on any properties of MD5 itself. SWDGE's document claims that MD5 provides integrity guarantees when it actually can't (for data from untrusted sources).— Marc Stevens (@realhashbreaker) December 16, 2018
Responding to an argument that MD5 is still acceptable for use in forensics, Dr. Stevens countered that the defense was based on the other circumstances (e.g. chain of custody) that make the evidence trustworthy, not on the assurance provided by MD5.
(1) The document clearly relies on MD5 against tampering. (2) There is ample scientific evidence that MD5 is insecure to use against tampering. (3) The document says MD5 is still 'acceptable' for forensic use, does not show any scientific support.— Marc Stevens (@realhashbreaker) December 16, 2018
While Dr. Stevens might not have been thinking about Daubert, that certainly sounds like an argument that MD5 would not meet the Daubert standard.
I scientifically reason where MD5 is still secure, you work the other way round: MD5 is still usable. Oops no, lets exclude that case. Oops, no lets also exclude that case. etc. etc. Its the wrong approach in security.— Marc Stevens (@realhashbreaker) December 17, 2018
I included this tweet because it relates to what I said previously about the fact that we (forensic examiners) are not experts in cryptography. Understanding which specific use cases might be okay for MD5 requires a good understanding of the attacks against MD5 and how they can be used. We don't have that expertise so we should trust in cryptographers and not use MD5.
And all I'm saying is that MD5 does not protect against active evidence tampering.— Marc Stevens (@realhashbreaker) December 16, 2018
I agreed MD5 is ok for file discovery of bad files if you still actually check content, I disagree for whitelisting. I also disagree that you can make any claims that images/files have been changed by any middleman just based on their MD5 hash.— Marc Stevens (@realhashbreaker) December 16, 2018
In light of SHA2 and SHA3, the only value of MD5 for forensics is due to a legacy of old hashsets that cannot be recreated. It has only downsides & no merits by itself. Saying MD5 is suitable that is like saying salt is still suitable as money.— Marc Stevens (@realhashbreaker) December 17, 2018
Doesn't matter at all: using the MD5 hash you simply have no scientific cryptographic basis to claim that any file in transit has not been changed by any middleman.— Marc Stevens (@realhashbreaker) December 16, 2018
Most of the time, the authenticity of an image or other files is assured by having good procedures and a proper chain of custody. Any cryptographic hash or CRC function can detect accidental modification. We don't use cryptographic hashes in case something is accidentally modified. We use them either to prevent intentional modification or to provide a scientific air of respectability. It's pretty clear that Stevens does not think that MD5, or even SHA-1, should be used to provide any sort of higher guarantee about the authenticity of digital evidence.
MD5 provides no additional security over CRC32. So you can claim it was not accidentally changed, but you cannot claim no middleman changed it.— Marc Stevens (@realhashbreaker) December 16, 2018
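The accidental-modification half of that claim is easy to see for yourself. The sketch below flips a single bit in some made-up data and shows that both CRC32 and MD5 notice; the data and the helper name are purely illustrative. What it does not show, and cannot show, is any protection against someone who changes the data on purpose.

```python
import hashlib
import zlib


def checksums(data):
    """Return (CRC32 value, MD5 hex digest) for a bytes object."""
    return zlib.crc32(data), hashlib.md5(data).hexdigest()


# Made-up stand-in for evidence data.
original = b"forensic image contents " * 1000

# Simulate accidental corruption: flip one bit in one byte.
damaged = bytearray(original)
damaged[100] ^= 0x01
damaged = bytes(damaged)

crc_before, md5_before = checksums(original)
crc_after, md5_after = checksums(damaged)

print("CRC32 changed:", crc_before != crc_after)
print("MD5 changed:  ", md5_before != md5_after)
```

Both checks catch the bit flip, which is the sense in which MD5 is "good enough" for accidental changes and no better than a plain checksum for that purpose.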
At this point, I think anyone trying to rely on MD5 in court is committing a grave error. And any forensic examiner trying to defend MD5 is out over his skis.
I would be interested to see Dr Stevens manipulate a forensic image to put artefacts of interest to prove or disprove an incident. I think that's what Boucher was arguing as to why it's still valid in forensic cases.
I also don't think that the argument of the cluey defense lawyer really stacks up for this reason.
That being said, many forensic utilities will hash to md5 and sha1 upon creating a forensic image, so it's a moot point for the time being.
Using it to highlight that a file in a hashset has been found in a seized dataset is a very different use case from manipulating a certificate on a live server - which is where the cryptographers and security folks are coming from, and for their purposes md5 and now sha1 are broken.
My argument about the defense lawyer isn't about whether some examiners can defend MD5 successfully, it's that it leaves an opening. My belief is that many, possibly most, examiners aren't very knowledgeable about cryptography and can't do more than parrot what they learned in training class X. Using MD5 and having to explain why the various security problems aren't an issue in your case is not worth the risk when SHA-2 exists and is free. I don't want to ever have to explain to a jury why preimages are different from collisions.
The other issue is that Dr Stevens and other cryptographers, who are the experts in their area, say that we should not use MD5 at all. That's a big problem when you consider that we're supposed to use valid/accepted scientific methods. The scientists in this area say that MD5 is not valid and forensic examiners, with no special training in cryptography, want to argue that it is.
Identifying known-bad hashes is different than validating acquired evidence. I recall that Dr Stevens was okay with this since it's only used to triage files and the positive results will be verified visually.