The NT Dialect/NTLM hash in Windows uses the MD4 algorithm. Password input is treated as Unicode when it is hashed so ASCII characters are converted to 16-bits ('A' = 0x41 = 0x0041).
The MD4 algorithm consists of 48 steps which turn a 512-bit input into a 128-bit output.
MD4 pads all inputs to a multiple of 512 bits, though it can iterate over several
512-bit blocks if necessary.When passwords are 13 characters long or less, the inputs to steps 46-48 are null. An attacker can use this information to reverse the last three steps of any password hash he or she is trying to crack. Subsequently, the attacker
only needs to compute the first 45 steps for each password tried. This results in a
speedup of about 6.25%.
Further improvements to this optimization are possible. The 128-bit output of MD4 is
really four separate 32-bit values. Only one of these values changes in each step. The fourth of these 32-bit values changes last in steps 42 and 46. Since step 46 is reversed
by the previous optimization, an attacker can compare the fourth 32-bit word of the hash
value after step 42 to the calculated value and stop computing if the values do not
match. An attacker will only have to compute beyond step 42 once in every 4 billion
tries. This results in a speedup by about 12.5% over a naive implementation. To simplify things, we could choose to always compute through step 43 which gives us a speedup of 10.4% but always gives us 64-bits for comparison, effectively requiring us to compute further only to verify an actual match; false positives would be very rare.
Even for passwords greater than 13 characters, we can speed things up a bit. For any password of 30 characters or less, the message input to step 48 will be null. This allows us to reverse step 48 which gives us the value of the second word which was calculated in step 44. Similar to the previous optimization, we only need to compute past step 44 once in every 4 billion tries. This gives us an improvement of about 8.3%. Simplifying again, we could compute 45 steps so that we always have 64-bits for comparison and a speedup of 6.25%.
It looks like John the Ripper reverses the last step, but I don't know if it uses the additional improvements I mentioned.
Update: @Hashcat on Twitter claims that
Update 2: @Bitweasil on twitter pointed out that Hashcat-lite is for attacking a single hash only. If we peel off additional rounds by holding some bytes constant, we have to repeat the process for each of our choices and for each user. This would not scale well to a large number of users. The more bytes we hold constant, the worse it scales. The number of iterations we perform for each choice of fixed values needs to be larger than the number of users by some multiple.