This post is a brief introduction to passwords. All of my other posts assume some prerequisite knowledge that may make them inaccessible. If you already know about password cracking, hash functions, salting, and stretching, you can probably skip this post or perhaps just skim it. If those concepts are new to you, you're in the right place.
Please post any questions you have in the comments. I will try to answer questions and/or update this post as required.
Most operating systems and web applications authenticate people using passwords. In order for this to work, the server (or application) has to store some information that will allow it to validate the password. One way to accomplish this would be to just store the passwords in plain text, but this would be a big problem if the password file or database was ever stolen. The solution most systems use is to hash the passwords.
Hashes
A hash is a one-way function that produces a fixed-size output. Hashes are often used in programming to sort data, but those hashes aren't useful for cryptography because it's easy to invert them or to find two inputs with the same hash value. A cryptographic hash, however, is designed so that given a hash value y, it's computationally difficult to find the message x so that hash(x) = y. There are some additional properties that are important for other uses in cryptography, but for passwords this is the one we care about.
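To make the "fixed-size output" part concrete, here is a minimal sketch using Python's standard hashlib module. The helper name sha1_hex is just an illustration, and SHA-1 is used only because it is the hash that appears in the examples later in this post.

```python
import hashlib


def sha1_hex(message: str) -> str:
    """Return the SHA-1 digest of a UTF-8 string as 40 hex characters."""
    return hashlib.sha1(message.encode("utf-8")).hexdigest()


short_digest = sha1_hex("a")
long_digest = sha1_hex("a" * 10_000)

# Fixed-size output: both digests are 160 bits (40 hex characters),
# no matter how long the input was.
print(len(short_digest), len(long_digest))

# Deterministic: hashing the same input again gives the same value.
print(sha1_hex("a") == short_digest)

# A different input gives a different digest.
print(sha1_hex("b") == short_digest)
```

Going the other way (from a digest back to the input) is the part that is supposed to be hard, which is why there is no "unhash" function to show.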
By storing hashes instead of passwords, we can minimize the impact of an attacker stealing the password database since he (usually) won't be able to log in directly using the hashes. Unfortunately, there are some problems with using an ordinary cryptographic hash for passwords. The first problem is that we can easily tell if two users have the same password. Suppose we have two users, Alice and Bob, who have these entries in the password file:
alice:425b7f90de888a65a6fb223397a470eaefb387ad
bob:425b7f90de888a65a6fb223397a470eaefb387ad
Even without knowing what their password is, we know they picked the same one. If an attacker cracks (guesses) Alice's password, he will also crack Bob's and vice-versa. This indicates that Alice and Bob probably didn't pick good passwords since if they picked them randomly there is almost no chance that they would pick the same password. An attacker can also try to match this hash against hashes taken from another site or against a list of hashes he has previously cracked. This is not good, but it's not the biggest problem.
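Spotting shared passwords in a stolen file takes only a few lines. The sketch below assumes a user:hash format like the two entries above; the function name users_sharing_a_hash is made up for this example.

```python
from collections import defaultdict

# The two entries from the example above.
PASSWORD_FILE = """\
alice:425b7f90de888a65a6fb223397a470eaefb387ad
bob:425b7f90de888a65a6fb223397a470eaefb387ad
"""


def users_sharing_a_hash(password_file: str) -> dict:
    """Group usernames by hash and keep only hashes used more than once."""
    by_hash = defaultdict(list)
    for line in password_file.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user, _, digest = line.partition(":")
        by_hash[digest].append(user)
    return {digest: users for digest, users in by_hash.items() if len(users) > 1}


print(users_sharing_a_hash(PASSWORD_FILE))
```

Nothing here required knowing the password. The duplicate is visible from the hashes alone.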
Password Cracking
Before we get to the other problems, we need to understand password cracking. Password cracking is the term we use for trying to guess other people's passwords. An attacker can try to do this online or offline. An online attack is easier but less effective: the attacker just tries to log in as a user over and over until he gets locked out or guesses the right password. This is very slow (tens of guesses per second), but it can work if the passwords are badly chosen. It's also noisy because it can generate log entries or affect the target system so someone might notice. Another downside, for the attacker and the defender, is that the attacker may just end up locking a bunch of accounts.
Instead of locking accounts after 3-5 failed login attempts, sites would do much better to just delay or rate limit login attempts so that a user or attacker has to wait a few seconds after each failed attempt. This prevents accounts from getting locked out and still makes online guessing hard for an attacker. Using CAPTCHA forms can be useful for preventing automated attacks, but they are difficult for users too. If you use CAPTCHA, only require it after a few failed login attempts. This will keep it away from legitimate users most of the time.
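Here is one way the delay idea could look in code. This is a sketch under assumptions, not a drop-in component: the class name LoginThrottle, the doubling delay, the 30 second cap, and the CAPTCHA threshold of 3 are all choices made for this example, and a real site would keep this state somewhere shared rather than in a Python dict.

```python
import time


class LoginThrottle:
    """Delay repeated failed logins per account instead of locking the account."""

    def __init__(self, base_delay=1.0, max_delay=30.0, captcha_after=3,
                 clock=time.monotonic):
        self.base_delay = base_delay        # delay after the first failure (seconds)
        self.max_delay = max_delay          # the delay never grows past this
        self.captcha_after = captcha_after  # failures before a CAPTCHA is shown
        self.clock = clock                  # injectable so it can be tested
        self._failures = {}                 # username -> consecutive failures
        self._next_allowed = {}             # username -> earliest next attempt

    def seconds_until_allowed(self, username):
        """How long this account must wait before the next attempt."""
        return max(0.0, self._next_allowed.get(username, 0.0) - self.clock())

    def captcha_required(self, username):
        """Only bother users with a CAPTCHA after a few failures."""
        return self._failures.get(username, 0) >= self.captcha_after

    def record_failure(self, username):
        """Register a failed attempt and return the delay that now applies."""
        count = self._failures.get(username, 0) + 1
        self._failures[username] = count
        delay = min(self.max_delay, self.base_delay * 2 ** (count - 1))
        self._next_allowed[username] = self.clock() + delay
        return delay

    def record_success(self, username):
        """A successful login clears the account's history."""
        self._failures.pop(username, None)
        self._next_allowed.pop(username, None)
```

With these numbers an attacker is quickly limited to one guess every 30 seconds per account, while a user who mistypes a password once waits a single second and never gets locked out.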
If the attacker is able to get access to the password hashes, however, he can try an offline attack. Offline attacks are very, very fast (millions or billions of guesses per second).
Once the attacker has the hashes, he can launch his offline attack using programs like John the Ripper, Cain and Abel, Cryptohaze or oclHashcat to crack the hashes. The programs work by making password guesses, hashing each guess, and comparing the hash to the actual password hashes. Early on, most of the cracking programs were pretty simple. They'd either guess by brute force ("aaaa", "aaab", "aaac", etc.) or iterate through a large word list. Current programs are much more sophisticated and can apply complex rules and manipulations to word lists in order to guess a wider range of passwords. Even brute-force guessing can use statistical rules and patterns to decide which password to guess next.
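The two simple guessing strategies can be sketched in a few lines of Python. The mangling rules below are invented for illustration and are far cruder than the rule engines in the real tools named above.

```python
import itertools
import string


def brute_force(length, alphabet=string.ascii_lowercase):
    """Yield every string of the given length: "aaaa", "aaab", "aaac", ..."""
    for combo in itertools.product(alphabet, repeat=length):
        yield "".join(combo)


def mangle(word):
    """Yield a few variations of one dictionary word (hypothetical rules)."""
    yield word
    yield word.capitalize()
    yield word + "1"
    yield word + "123"
    yield word.replace("a", "@").replace("o", "0")


print(list(itertools.islice(brute_force(4), 3)))  # ['aaaa', 'aaab', 'aaac']
print(list(mangle("touchdown")))
```

Both functions are generators, so a cracker can feed guesses straight into the hashing loop without ever holding the whole list in memory.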
Getting access to the password hashes is more difficult, but it's definitely possible. There were several high-profile disclosures earlier this year where someone was able to steal a large number of password hashes from a website (e.g. LinkedIn). Many of these are due to an attack called SQL injection. Many websites depend on databases to store content and account information. To access these back end databases, they use SQL queries. Attackers can sometimes use SQL injection attacks to execute their own custom queries which allows them to access data they would not ordinarily have access to (e.g. password hashes). Websites aside, it is also possible to steal passwords from Unix/Linux or Windows, but this generally requires root or administrator access. On Linux and Unix, passwords are typically stored in /etc/shadow or /etc/passwd. On Windows networks, password hashes are stored in Active Directory or the local registry rather than a human-readable file, but there are tools available to steal them (they require Administrator rights).
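For readers who haven't seen SQL injection before, here is a small self-contained sketch using Python's built-in sqlite3 module and an in-memory database. The table layout, the function names, and the payload are all made up for this example. Real attacks depend on the specific site and database.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory database with a hypothetical users table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (name TEXT, email TEXT, pw_hash TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?, ?)", [
        ("alice", "alice@example.com", "425b7f90de888a65a6fb223397a470eaefb387ad"),
        ("bob", "bob@example.com", "425b7f90de888a65a6fb223397a470eaefb387ad"),
    ])
    return db


def lookup_email_unsafe(db, name):
    # Vulnerable: user input is pasted directly into the SQL text.
    query = "SELECT email FROM users WHERE name = '" + name + "'"
    return [row[0] for row in db.execute(query)]


def lookup_email_safe(db, name):
    # Parameterized: the driver passes name as data, never as SQL.
    query = "SELECT email FROM users WHERE name = ?"
    return [row[0] for row in db.execute(query, (name,))]


db = make_db()
payload = "x' UNION SELECT pw_hash FROM users --"

print(lookup_email_unsafe(db, "alice"))  # normal use: Alice's email
print(lookup_email_unsafe(db, payload))  # injected query returns a password hash
print(lookup_email_safe(db, payload))    # treated as an odd username: no rows
```

The unsafe version was only ever meant to return email addresses, but the injected UNION makes it return password hashes instead. The parameterized version closes that particular hole.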
Password Cracking Example
Let's pretend we've just used SQL injection to steal the passwords from a fantasy football site. Among the hashes, we have our friend Alice:
alice:425b7f90de888a65a6fb223397a470eaefb387ad
The hash is 160 bits long (40 hex characters) which means it's probably SHA-1. Since it's a fantasy football site, let's start with a list of football-related words and passwords:
touchdown
manning
nosaints
go49ers
superbowl
I wrote a small script to hash each of these guesses and compare it to Alice's password. Here's the output:
Cracking: alice:425b7f90de888a65a6fb223397a470eaefb387ad
No match: touchdown
No match: manning
No match: nosaints
Match: go49ers:425b7f90de888a65a6fb223397a470eaefb387ad
We cracked Alice's password and she happens to be a 49ers fan. We also cracked Bob's password since he had the same hash. In reality, a password cracker would not be this verbose, but my example illustrates the basic concept.
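A guess-hash-compare loop like the one described can be sketched as follows. This is not the script that produced the output above, just a minimal reconstruction of the idea. It assumes plain, unsalted SHA-1 over the UTF-8 bytes of each guess, and it computes its own stand-in target hash so the example checks itself. Whether that reproduces the exact hash shown above depends on details the post doesn't give.

```python
import hashlib

# The word list from the example above.
WORDLIST = ["touchdown", "manning", "nosaints", "go49ers", "superbowl"]


def sha1_hex(text):
    """Unsalted SHA-1 of the UTF-8 bytes, as a hex string (an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crack(target_hash, guesses):
    """Hash each guess in turn and stop at the first one that matches."""
    for guess in guesses:
        if sha1_hex(guess) == target_hash:
            print(f"Match: {guess}:{target_hash}")
            return guess
        print(f"No match: {guess}")
    return None


# Stand-in for a stolen hash, computed here rather than copied from the post.
target = sha1_hex("go49ers")
found = crack(target, WORDLIST)
```

The loop stops at the first match, which is why "superbowl" never gets tried.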
More Problems
I've pointed out that two users with the same password will have the same hash, but this is actually a minor part of a larger problem. Since hashing with an ordinary cryptographic hash function is deterministic (the output is always the same for a given input), an attacker can attack every user at once. Let's say a website has a million users. It would take a very long time to try to crack a million accounts individually. Even if an attacker only spent one minute on each account before giving up, it would take about two years to get through the whole list. But an attacker doesn't have to do that. He can sort the hashes and for each guess he just needs to do a quick search to see if his guess is anywhere in the list. For a small list, the overhead is minuscule. For millions of users, it might cut his speed in half, but that's a pretty small penalty compared to the time it would take to check each user individually.
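Here is a rough sketch of attacking every user at once. It swaps the sorted list and search described above for a Python dict (a hash table), which serves the same purpose: one hash computation per guess, then one quick lookup that covers every account. The function names are made up, and unsalted SHA-1 is assumed.

```python
import hashlib


def sha1_hex(text):
    """Unsalted SHA-1 of the UTF-8 bytes, as a hex string (an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crack_all(stolen, guesses):
    """stolen maps username -> unsalted hash; returns username -> password."""
    # Index the stolen hashes once so each guess needs only one lookup.
    users_by_hash = {}
    for user, digest in stolen.items():
        users_by_hash.setdefault(digest, []).append(user)

    cracked = {}
    for guess in guesses:
        digest = sha1_hex(guess)                    # one hash per guess...
        for user in users_by_hash.get(digest, ()):  # ...checked against everyone
            cracked[user] = guess
    return cracked


# The "one minute per account" arithmetic from the paragraph above.
MINUTES_PER_YEAR = 60 * 24 * 365
years_for_a_million_users = 1_000_000 / MINUTES_PER_YEAR
print(round(years_for_a_million_users, 1))  # 1.9
```

The cost of a guess barely changes whether the dict holds two hashes or two million, which is exactly the problem.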
An attacker can also precompute hashes to save time later. This is the idea behind the now famous rainbow tables. With ordinary precomputation, an attacker would have to store every hash value, which quickly becomes problematic. With rainbow tables, he can generate "hash chains" and store only the first and last value in each chain, which reduces the storage by a factor of a thousand or more. Looking up a hash in a rainbow table takes a minute or two, whereas a brute-force attack could take days or longer (depending on the strength of the password). The details are beyond the scope of this post, but you can read a basic description in my ;login: article (from 2004) or the details in the original paper by Philippe Oechslin.
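To give a feel for hash chains, here is a toy sketch over a deliberately tiny password space (4-digit PINs). Everything about it is an assumption made for illustration: the reduction function, the chain length of 50, and the names. Real rainbow tables use far larger spaces and tuned parameters, so treat this as a picture of the idea rather than a faithful implementation.

```python
import hashlib

PIN_DIGITS = 4   # toy password space: "0000" through "9999"
CHAIN_LEN = 50   # passwords covered per stored (start, end) pair


def sha1(pin):
    return hashlib.sha1(pin.encode("ascii")).digest()


def reduce_to_pin(digest, step):
    """Map a digest back into the PIN space; step varies the mapping per column."""
    n = (int.from_bytes(digest, "big") + step) % (10 ** PIN_DIGITS)
    return f"{n:0{PIN_DIGITS}d}"


def chain_end(start):
    """Walk hash -> reduce CHAIN_LEN times and return only the final PIN."""
    pin = start
    for step in range(CHAIN_LEN):
        pin = reduce_to_pin(sha1(pin), step)
    return pin


def build_table(starts):
    """Store just the end -> starts mapping; the middle of each chain is dropped."""
    table = {}
    for start in starts:
        table.setdefault(chain_end(start), []).append(start)
    return table


def lookup(table, target):
    """Return a PIN whose hash is target if some chain covers it, else None."""
    for pos in range(CHAIN_LEN - 1, -1, -1):
        # Assume target sits at column pos and finish the chain from there.
        pin = reduce_to_pin(target, pos)
        for step in range(pos + 1, CHAIN_LEN):
            pin = reduce_to_pin(sha1(pin), step)
        # If we landed on a stored end, replay that chain from its start.
        for start in table.get(pin, ()):
            candidate = start
            for step in range(CHAIN_LEN):
                digest = sha1(candidate)
                if digest == target:
                    return candidate
                candidate = reduce_to_pin(digest, step)
    return None


table = build_table([f"{n:04d}" for n in range(0, 10_000, 500)])
```

The table holds 20 start values but can recover PINs from the middle of any of those chains. The price is extra computation at lookup time, which is the time-memory trade-off in miniature.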
The other big problem is that cryptographic hashes are designed to be fast. They need to be implemented in smart cards or used on networks without introducing any noticeable latency. For other purposes, that's fine but for passwords it's the opposite of what we want. With algorithms like MD5 and SHA-1, password crackers can make tens of millions of guesses per second using a fast CPU. More recently, GPU-based password cracking has come into fashion. GPU password crackers like oclHashcat and Cryptohaze can try billions of passwords per second for these algorithms. To withstand an attack like that, users may have to pick passwords that are 14-15 characters long.
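Some back-of-the-envelope arithmetic shows why speed matters so much. The sketch below assumes passwords drawn uniformly at random from the 95 printable ASCII characters and a cracker making one billion guesses per second. Both numbers are assumptions picked for illustration, and the function name is made up.

```python
SECONDS_PER_DAY = 86_400


def days_to_exhaust(alphabet_size, length, guesses_per_second):
    """Days needed to try every password of one length at a given speed."""
    keyspace = alphabet_size ** length
    return keyspace / guesses_per_second / SECONDS_PER_DAY


# Each extra character multiplies the work by the alphabet size.
for length in (6, 8, 10):
    print(length, days_to_exhaust(95, length, 1e9))
```

Under these assumptions, every 8-character password falls in roughly 77 days on a single machine, and 6 characters takes minutes. Passwords chosen by people are far more predictable than this uniform model, so in practice they fall much faster than the arithmetic suggests.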
Solutions (Salting and Stretching)
The solutions to the problems I've pointed out are salting and stretching. Salting means adding a random value to every password before it is hashed. The salt is not a secret. The salt value is randomly generated each time a new password is set and is stored with the password hash. This means that attackers can't precompute hashes, can't use rainbow tables, and can't attack every user at once. With a large (e.g. 128-bit) salt, the attacker will have to try to crack each user's hash individually. Even if an attacker only spent one minute per user before giving up, it would take about two years to try to crack the hashes for a million users.
Here's what Alice and Bob's passwords look like with a salt:
alice:190db5882ce03b6d16414a3fbbf63d22.a8dd428638791fc0c5ac9f128b1e8a5cab8c3d5b
bob:a9c9c692e5c0cab7656e103bd64885de.245cc7b0b8b93a69244fe92df792c89075bd198832
The new hash is much longer. The 32 characters before the period are the random 128-bit salts. The 40 characters after are the 160-bit SHA-1 hashes. You should notice that the salts and hashes are different for Alice and Bob. Now, cracking Alice's password does not give out any information about Bob.
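Here is a minimal sketch of producing entries in that salt.hash layout. The post doesn't say how the salt and password were combined, so prepending the salt bytes to the password is an assumption, as are the function names. Plain salted SHA-1 is shown only to mirror the example; it is still fast to compute, which is the problem stretching addresses.

```python
import hashlib
import hmac
import secrets


def make_entry(user, password):
    """Build a user:salt.hash line with a fresh random 128-bit salt."""
    salt = secrets.token_bytes(16)  # 16 bytes = 128 bits = 32 hex characters
    digest = hashlib.sha1(salt + password.encode("utf-8")).hexdigest()
    return f"{user}:{salt.hex()}.{digest}"


def check_entry(entry, password):
    """Re-hash the candidate password with the stored salt and compare."""
    _user, _, rest = entry.partition(":")
    salt_hex, _, stored_digest = rest.partition(".")
    salt = bytes.fromhex(salt_hex)
    candidate = hashlib.sha1(salt + password.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


alice = make_entry("alice", "go49ers")
bob = make_entry("bob", "go49ers")
print(alice)
print(bob)  # same password, different salt, different hash
```

The salt is stored in the clear next to the hash because the server needs it to check a login. Its job is to make every user's hash a separate problem, not to be a secret.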
Stretching means slowing down the hashing algorithm so that password cracking becomes very slow even for a single hash. Some password hashing algorithms, such as md5crypt and PBKDF2, take a cryptographic hash and iterate it thousands of times. The Unix crypt algorithm only used 25 iterations of a modified DES algorithm, but that was a lot on a PDP-11. Other password hashing algorithms such as bcrypt and scrypt aren't as straightforward. I talk about some of these algorithms in the ;login: article I mentioned previously.
Disclaimer: I no longer stand by the advice I gave at the end of the ;login: article. Password expiration, for instance, is worthless. Read the article for the technical bits and skip the ending.
If you need to pick a password hashing algorithm, I recommend scrypt, bcrypt, or PBKDF2. I prefer them in the order I listed them, but you can safely use whatever is supported in your environment/library.
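As an illustration of salting and stretching together, here is a sketch using PBKDF2, chosen only because it ships in Python's standard library as hashlib.pbkdf2_hmac. The iteration count, the use of SHA-256 inside PBKDF2, and the storage format with "$" separators are all assumptions for this example, not a standard. Tune the iteration count to your own hardware.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor; higher is slower for everyone


def hash_password(password, iterations=ITERATIONS):
    """Return a storable string holding the scheme, cost, salt, and derived key."""
    salt = secrets.token_bytes(16)  # fresh 128-bit salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${key.hex()}"


def verify_password(stored, password):
    """Recompute the key with the stored salt and cost, then compare."""
    _scheme, iterations, salt_hex, key_hex = stored.split("$")
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(key.hex(), key_hex)


stored = hash_password("go49ers")
print(stored)
print(verify_password(stored, "go49ers"))    # True
print(verify_password(stored, "touchdown"))  # False
```

Storing the iteration count with each hash means the cost can be raised later for new passwords without breaking the old entries.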
If you want to learn more about passwords, please check out the rest of my blog. These are good posts to start with:
Passwords: Attacks and Threats
How long should passwords be?
I also recommend this recent ArsTechnica article:
Why passwords have never been weaker--and crackers have never been stronger
Edit: I updated this post to emphasize online versus offline guessing. Thanks for the recommendation @thorsheim.
remember in 2010 when RockYou's SQL database got hacked exposing 32 million passwords? i was amazed at the top 10 passwords that people used.
1. 123456
2. 12345
3. 123456789
4. Password
5. iloveyou
6. princess
7. rockyou
8. 1234567
9. 12345678
10. abc123
[ source: http://bit.ly/M1pLer ]
In a group that large, there will always be a lot of people who are completely clueless and, unfortunately, strong password hashing won't save them. Using scrypt, bcrypt, or PBKDF2 slows things down a lot and provides safety in numbers, but anyone using "abc123" is still going to get cracked.
On the other hand, a weak, but not top-10-weak password like "nathan03" or "85bears" may hold up as long as the attacker is targeting a large (1M+) user list and not focusing on one account.