Comparison of Automated Password Guessing Strategies

Tobias Lundberg

Master of Science Thesis in Electrical Engineering
Department of Electrical Engineering, Linköping University, 2019


LiTH-ISY-EX--19/5213--SE

Supervisor: Niklas Johansson, ISY, Linköpings universitet

Examiner: Jan-Åke Larsson, ISY, Linköpings universitet

Information Coding
Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2019 Tobias Lundberg


Abstract

This thesis examines some of the currently available programs for password guessing, in terms of designs and strengths. The programs Hashcat, OMEN, PassGAN, PCFG and PRINCE were tested for effectiveness in a series of experiments similar to real-world attack scenarios. The design of those programs, as well as that of the program TarGuess, was also examined in terms of the extent to which they use different important parameters. It was determined that most of the programs use different models to deal with password lists, in order to learn how new, similar passwords should be generated. Hashcat, PCFG and PRINCE were found to be the most effective programs in the experiments, in terms of the number of correct passwords guessed each second. Finally, a program for automated password guessing based on the results was built and implemented in the cyber range at the Swedish Defence Research Agency.


Acknowledgments

Thanks to Niklas Johansson and Hannes Holm for supervising the work and Jan-Åke Larsson for examining it.

Linköping, June 2019
Tobias Lundberg


Contents

1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research questions
  1.4 Delimitations
2 Background
3 Theory and related work
  3.1 Terminology
    3.1.1 Hash function
    3.1.2 Password cracker
    3.1.3 Password guesser
    3.1.4 Sister Password
    3.1.5 Online/offline guessing
    3.1.6 User and attacker
  3.2 Password Content
    3.2.1 Personal information
    3.2.2 Password Policies
    3.2.3 Password Re-use
    3.2.4 Dictionary words
    3.2.5 Native Language
  3.3 Password guessing programs
    3.3.1 Hashcat
    3.3.2 OMEN and OMEN+
    3.3.3 PassGAN
    3.3.4 PCFG
    3.3.5 PRINCE
    3.3.6 TarGuess
  3.4 Other programs and techniques
    3.4.1 Rainbow Tables
    3.4.2 Other password guessers
4 Method
  4.1 How password guessing is performed
    4.1.1 Interviews
  4.2 Overall design of password guessing programs
    4.2.1 Input parameters
  4.3 Overall effectiveness of password guessing programs
    4.3.1 Metrics for measuring effectiveness
    4.3.2 Used sets of data
    4.3.3 Studied scenarios and tests
  4.4 The implementation of the system
    4.4.1 Overview
    4.4.2 The requirements
    4.4.3 Expected number of guesses
    4.4.4 Expected number of correct guesses
    4.4.5 Tests to determine the parameters for the curves
    4.4.6 Putting together the material into a program
5 Result
  5.1 How password guessing is performed
    5.1.1 Interviews
  5.2 Overall design of password guessing programs
    5.2.1 Summary
    5.2.2 Hashcat
    5.2.3 OMEN and OMEN+
    5.2.4 PassGAN
    5.2.5 PCFG
    5.2.6 PRINCE
    5.2.7 TarGuess
  5.3 Overall effectiveness of password guessing programs
    5.3.1 Test to measure guessing speed
    5.3.2 Test with password list
    5.3.3 Test with dictionary words and password list
    5.3.4 Test using dictionaries of different languages
    5.3.5 Test against passwords with a minimum length policy
  5.4 The implementation of the system
    5.4.1 Expected number of correct guesses
6 Discussion
  6.1 Result
    6.1.1 Interviews
    6.1.2 Design
    6.1.3 Test with password list
    6.1.4 Test with dictionary words and password list
    6.1.5 Test using dictionaries of different languages
    6.1.6 Test against passwords with a minimum length policy
  6.2 Method
    6.2.1 Criticism of sources
    6.2.2 Interviews
    6.2.3 The design of the programs
    6.2.4 The effectiveness of the programs
    6.2.5 The development and implementation of the system
  6.3 The work in a wider context
7 Conclusions
  7.0.1 Answers to the research questions
  7.0.2 Future work
Bibliography

1 Introduction

1.1 Motivation

Passwords are currently one of the most widely used authentication methods. A compromised password can enable an attacker to obtain sensitive information or abuse privileges. This makes passwords an interesting topic to study. In recent years, multiple websites have been breached and millions of passwords have been leaked [15]. Notable examples include RockYou (32 million passwords, 2009), LinkedIn (100 million passwords, 2012), and Google Plus (0.5 million, 2018). Although many websites take security measures such as hashing and salting the user passwords, a weak password can still be guessed by an attacker with access to the password hash [16]. This is known as an offline password guessing attack. When humans pick passwords they tend to come up with something that can be remembered and entered again in the future [10]. This implies that passwords selected by humans typically have some sort of structure instead of being entirely random, since random passwords would be much harder to remember [35]. If such structures can be discovered and modelled, passwords become easier to guess. That is a good reason for users to avoid those structures, and for those structures to be uncovered so that users can be told about them. Another reason to model how users select passwords is so that a potential attacker, who targets the passwords of users, can be modelled and simulated. This would enable a better understanding of how secure passwords truly are against real attacks, and help when designing protection mechanisms. Knowledge of how an attacker might target passwords would also help with defending against password guessing attacks.


1.2 Aim

The goal of this project is to examine how password guessing is typically performed by people with professional experience, and how users typically pick passwords. This knowledge will be used to examine and compare state-of-the-art automated password guessing programs in terms of effectiveness and design. This is done so that an attacker that guesses passwords can be modelled and implemented, which will allow for cheaper and less time-consuming attacks to be defended against in a training environment. Such an automated system will be implemented in the Cyber Range at the Swedish Defence Research Agency (FOI). The goal of the implemented system is to take the role of an attacker, which is to be defended against in the security training that FOI carries out.

1.3 Research questions

The following research questions are to be answered in this report.

1. How are password guessing attacks performed by people with professional experience?

2. What are the overall designs of current state-of-the-art automated password guessing programs?

3. How do the password guessing programs compare in terms of effectiveness?

4. How can an automated password guessing system be implemented in the Cyber Range at FOI?

1.4 Delimitations

This report will not look into implementation-specific details of various hashing algorithms. Although cryptanalysis can be performed in order to analyse and find weaknesses of such algorithms, this paper makes the assumption that the hash algorithms are not possible to reverse and focuses only on the syntax and structure of passwords. This way, the results and conclusions reached can be applied to any system in which passwords are entered by users, regardless of which specific algorithm is used.


2 Background

The work was carried out at the Swedish Defence Research Agency (FOI). The main purpose was to implement an automated password cracker in the cyber range CRATE (https://www.foi.se/en/foi/resources/crate---cyber-range-and-training-environment.html). A cyber range is a simulated network environment, used for training and research. CRATE has a system which emulates attacks against networks (an automated red team), and this system is to be expanded with password cracking capabilities.


3 Theory and related work

This chapter introduces some of the theory behind passwords, hashes, password guessers and password cracking programs.

3.1 Terminology

This section lists some of the terms used in the report.

3.1.1 Hash function

A function which calculates a hash of a fixed length from an input of any length. Specifically, cryptographic hash functions are considered here, which are designed to be infeasible to invert. They are often used to store passwords in systems that require authentication, since they allow easy checking of passwords without revealing what the password is. An attack that tries to find an input with a specific hash value is known as a preimage attack. In this paper, a password being ‘cracked’ refers to a preimage attack which has successfully recovered the password for a specific hash value.

3.1.2 Password cracker

A password cracker is a program which attempts to find the password which was used to generate a given hash. The input to a password cracker is the hash, the hash algorithm used, as well as some mode of operation. Depending on the mode of operation, other inputs (such as a dictionary) might be required as well.

3.1.3 Password guesser

A password guesser is a program which comes up with guesses for passwords. Unlike a password cracker, it does not handle the generation of hashes or the comparison with the target hashes. Instead, a password guesser just generates password candidates, typically in order of decreasing probability. A password guesser can be used together with a password cracker to find the passwords behind hashes. The relationship between password guessers, password crackers and hash functions can be seen in figure 1. It is worth noting that password guessers are often included in password crackers (as is the case with John the Ripper and Hashcat).

Figure 1: The relationship between various parts used for password cracking
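The division of labour between guesser, cracker and hash function can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names `guesser` and `cracker`, the unsalted SHA-256 hash, and the tiny word list are all hypothetical choices made for this example and do not come from any of the programs discussed in this report.

```python
import hashlib
from typing import Iterable, Iterator, Optional


def guesser(words: Iterable[str]) -> Iterator[str]:
    """Toy password guesser: only produces candidate strings.

    Assumption: the input words are already ordered from most to least likely.
    """
    for word in words:
        yield word          # the word itself
        yield word + "1"    # a simple variation of the word


def cracker(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Toy password cracker: hashes every candidate and compares with the target."""
    for candidate in candidates:
        digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        if digest == target_hash:
            return candidate
    return None  # candidate stream exhausted without a match


# Assumed scenario: the attacker holds an unsalted SHA-256 hash of the password.
target = hashlib.sha256(b"dragon1").hexdigest()
found = cracker(target, guesser(["password", "dragon", "monkey"]))
print(found)
```

The guesser never sees the hash and the cracker never decides which candidates to try, which is the separation the figure describes.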

3.1.4 Sister Password

Two passwords are sister passwords if they belong to the same user and are used as authentication for two different systems. For example, a password on website A by user U is a sister password to a password on website B by user U.

3.1.5 Online/offline guessing

When password guesses are made against the live authentication system, it is known as an online attack. If the hash digest of the password has been leaked and guesses are made without the live system, it is known as an offline attack.


3.1.6 User and attacker

In this report, ‘user’ refers to the person who set the password which is being targeted by an attacker. ‘Attacker’ refers to the person who runs a password cracker or password guesser in order to find the password of the user.

3.2 Password Content

Some previous research has been done on what content users have in their passwords, and on what influences how users pick their passwords. This section examines research related to password content, which is important to know when examining which factors a password guessing program should take into account. This section shows that personal information, password policies, previous passwords of the user, and dictionary words in the native language of the user should be used when performing targeted password guessing.

3.2.1 Personal information

Su et al. showed the extent of personal information usage in Chinese passwords [29]. When they analysed more than 200 million leaked passwords together with 20 million personal information records in China, they came to the conclusion that at least 37% of the passwords contain personal information. The personal information they looked for was birth dates (9.44% of the passwords contained some sub-string of the birth date of the user), cell-phone numbers (8.97%), name pinyin (12.56%), name acronyms (12.72%), and email addresses (0.035%).

Wang et al. conducted similar research, using five datasets of leaked passwords from English websites and five datasets of leaked passwords from Chinese websites [32]. Their results show that between 0.75% and 1.87% of users use their full name as their password, depending on which password leak is examined. Between 1.00% and 5.15% of Chinese users use their birth dates. Between 0.54% and 2.34% of passwords contain the username, and between 0.77% and 5.07% contain some part of the user's email address. Furthermore, the paper by Wang et al. also shows the difference between English and Chinese users when it comes to picking passwords.

In 2017, Li et al. analysed the personal information content of 130,000 passwords leaked from the Chinese website 12306.cn, along with the corresponding personal information of the users [19]. Their research shows that 24.10% of the users use a subset of their birth date in their password, 23.60% use some part of their account name, 22.35% use their real name, 12.66% use part of their email address, 2.996% use their ID number, and 2.726% use their cell phone number. Furthermore, they conducted similar research on the leaked passwords from the English website rockyou.com. Since that dataset only contained passwords and no personal information of the users, the paper looked for any real-world name in all of the passwords. Their conclusion is that more than 24.7% of the passwords contain a name with 4 or more characters, which can reasonably be assumed to be a name with some relation to the user.

Castelluccia et al. [7] looked at leaked passwords from facebook.com, and concluded that about 35% of the passwords are ‘somewhat correlated’ with one of the user's attributes. The attributes they looked at were first name, last name, usernames, friends' names, education and work, relatives and birth dates. They conclude that first name, user name, and birth dates are the most commonly used personal attributes in passwords.

The mentioned papers used different metrics to measure the amount of personal information used in passwords, so their results cannot be compared by just looking at the numbers. However, one important conclusion that can be drawn is that personal information of users is often included in passwords.

3.2.2 Password Policies

In an attempt to get users to pick stronger passwords for services, requirements (known as policies) are put on the passwords by system administrators. A paper by Komanduri et al. studies the effect of various password policies on the choices of passwords users typically make [17]. They found that requirements on length increase the security of the picked passwords, as measured by Shannon entropy, more than requirements on special characters or numbers.

3.2.3 Password Re-use

Users of multiple systems tend to re-use their passwords across those systems. Wash et al. studied to which extent this happens [33], and concluded that users tend to re-use strong passwords, presumably since they are harder to remember. The median number of websites for which a password is used was found to be 3, while the most used password of each participant was found to be used on an average of 9 different websites. Even when a user does not re-use a password exactly as it is, a password used on one service is often modified before it is used on a different service [31].

3.2.4 Dictionary words

In 2006, Cazier et al. showed that roughly 90% of passwords on an e-commerce site could be guessed by using dictionary words [8]. Hunt showed in 2011 that roughly 25% of the passwords used on gawker.com were present in a general English dictionary [14].


3.2.5 Native Language

Bonneau showed that using a dictionary in the language of the targeted user could improve password guessing by up to 25% [6], as opposed to using a dictionary of a different language. Maoneke et al. performed a survey in which 107 Namibian and South African students were asked to generate a password, which was then checked for content [22]. The results show that 47% of the passwords were generated using English words, and 30% of them were generated using words from the participants' native languages. The studies by Su et al. [29] and Wang et al. [32] also demonstrate the extent to which the culture and native language of a user influence password choice.

3.3 Password guessing programs

This section examines various specific programs which are related to password guessing and password cracking. Section 3.3.1 presents Hashcat, which is a general password cracking program. The following sections present PCFG, OMEN and PassGAN, which are programs that learn the structure of passwords in different ways and use this information to generate likely password candidates. PRINCE, introduced in section 3.3.5, is the only program intended to be entirely automated. TarGuess, introduced in section 3.3.6, also learns password structures, and in addition augments the guess generation with data related to the person behind the target hash. All the programs except for Hashcat are password guessers, and do not handle the hashing and comparison of the guesses.

3.3.1 Hashcat

Hashcat [2] is a popular password cracking program. It comes with the following default ‘attack modes’, which are different ways of generating password guesses. The attack modes can further be combined in various ways for additional flexibility.

1. Straight, which tests all the words in a dictionary. This mode also supports word mangling rules, which are ways to manipulate the dictionary words. Examples of mangling rules include appending a digit, reversing the word, or capitalising the first letter (for a complete list, see https://hashcat.net/wiki/doku.php?id=rule_based_attack).

2. Combination, which combines words from multiple dictionaries. Like the straight mode, word mangling rules can be applied in this mode as well.

3. Brute-Force, which tests all of the key-space for a given password structure (known as a ‘mask’). The Brute-Force mode uses Markov chains, similar to those used by OMEN, which is described in section 3.3.2. The Markov model comes pre-trained with passwords leaked from the website RockYou, but can be trained by an attacker-provided list of words.

4. Hybrid Dictionary + Mask, which takes a dictionary and a password structure, and puts every possible character combination for the structure after each word in the dictionary. This can be seen as a combination of the straight mode and brute-force mode.

5. Hybrid Mask + Dictionary, which is the same as above, except that it puts every possible combination before each word in the dictionary.
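To make the attack modes more concrete, the following Python sketch imitates the straight mode with a few mangling rules and the Hybrid Dictionary + Mask mode with an all-digit mask. It is not Hashcat's implementation or rule syntax; the names `RULES`, `straight` and `hybrid_dict_mask`, and the restriction of the mask to digits, are assumptions made for illustration only.

```python
from itertools import product
from string import digits
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-ins for the mangling rules named in the text.
RULES: List[Callable[[str], str]] = [
    lambda w: w,               # leave the word unchanged
    lambda w: w + "1",         # append a digit
    lambda w: w[::-1],         # reverse the word
    lambda w: w.capitalize(),  # capitalise the first letter (lower-cases the rest)
]


def straight(dictionary: Iterable[str]) -> Iterator[str]:
    """Straight mode sketch: every dictionary word under every rule."""
    for word in dictionary:
        for rule in RULES:
            yield rule(word)


def hybrid_dict_mask(dictionary: Iterable[str], mask_len: int) -> Iterator[str]:
    """Hybrid sketch: append every digit string of length mask_len to each word."""
    for word in dictionary:
        for suffix in product(digits, repeat=mask_len):
            yield word + "".join(suffix)


straight_guesses = list(straight(["pass"]))
hybrid_guesses = list(hybrid_dict_mask(["pass"], 2))
print(straight_guesses)
print(len(hybrid_guesses), hybrid_guesses[0], hybrid_guesses[-1])
```

Swapping the concatenation order in `hybrid_dict_mask` would give the corresponding sketch of the Hybrid Mask + Dictionary mode.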

3.3.2 OMEN and OMEN+

A Markov chain is a model which describes a sequence of possible events, where each event depends on one or more of the most recent previous events. If the likelihood of an event occurring depends on the last n events, the Markov chain is called an n-gram Markov chain. The idea of using Markov chains to guess passwords was first proposed by Narayanan et al. in 2005 [24]. They used 0-gram and 1-gram Markov chains to generate passwords, where each letter appearing was the event. For example, a 1-gram Markov chain would look at the most recent letter generated, and output the next letter with the highest probability given that letter.

This idea was later extended by Duermuth et al. [9] into a program known as Ordered Markov ENumerator (OMEN). OMEN is a password guesser which outputs the guesses in order of probability, given a model of a Markov chain [4]. It uses training data in the form of previously found passwords to determine the n-gram Markov model. n is set by the attacker and can go arbitrarily high, but bigger numbers increase computation time dramatically. The attacker can also provide an alphabet of the characters OMEN should consider, as well as the ‘level’ of computation. A level is a notation for how inaccurate the guesses should be before the program quits, where a low level implies that only high probability guesses will be made. For more details, see the original paper by Duermuth et al. [9]. That paper used 4-grams, a 72 character alphabet, and 10 levels, and found those numbers to be a good trade-off between computation time and password guess accuracy.

OMEN was later extended into OMEN+, which takes the personal information of the target into consideration [7]. The authors tried various types of personal information, but decided that the most worthwhile four are first name, username, date of birth and email address. OMEN+ is implemented in the same binary as OMEN.
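The Markov idea can be sketched in a few lines of Python. The example below trains a character model that conditions on the last n characters (following the n-gram definition used in this section) and uses it to rank candidate passwords by probability. It is a simplified illustration, not OMEN's enumeration algorithm: the boundary markers `^` and `$`, the function names, and the three-password training list are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable

START, END = "^", "$"  # assumed boundary markers, not taken from OMEN


def train(passwords: Iterable[str], n: int = 1) -> Dict[str, Counter]:
    """Count which character follows each context of the n previous characters."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for pw in passwords:
        padded = START * n + pw + END
        for i in range(n, len(padded)):
            counts[padded[i - n:i]][padded[i]] += 1
    return counts


def probability(model: Dict[str, Counter], pw: str, n: int = 1) -> float:
    """Multiply the conditional probability of each character given its context."""
    padded = START * n + pw + END
    p = 1.0
    for i in range(n, len(padded)):
        context = model.get(padded[i - n:i])
        if not context:
            return 0.0  # context never seen in training
        p *= context[padded[i]] / sum(context.values())
    return p


model = train(["abc", "abd", "abc"], n=1)
ranked = sorted(["abd", "xyz", "abc"],
                key=lambda pw: probability(model, pw),
                reverse=True)
print(ranked)
```

A real guesser would enumerate candidates directly from the model instead of scoring a fixed list, and would smooth the counts so that unseen transitions do not receive probability zero.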

3.3.3 PassGAN

A Generative Adversarial Network (GAN) is made of two neural networks. One of the networks, the generative network, tries to learn the statistical structure of some data set in order to generate new, statistically similar, samples. The other network, known as the discriminating network, tries to detect which samples come from the generative network and which samples come from the original data. The training is complete when the discriminating network is unable to detect the source of the samples it receives. PassGAN is an implementation of a GAN which is used to generate new password guesses, from a large amount of other passwords [12].

3.3.4 PCFG

A context-free grammar is a set of rules on how a symbol can be transformed. The rule X → Y means that the abstract symbol ‘X’ can be transformed into the abstract symbol ‘Y’. There can be multiple such rules for a given symbol, which makes each transformation possible. If there are multiple rules which are assigned a probability of occurring, it is known as a probabilistic context-free grammar.

The concept of using probabilistic context-free grammars to generate password guesses was first studied by Weir et al. in 2009 [34]. Since then, the concept has been studied and extended multiple times in the literature [13] [32]. The idea is to use training data of passwords to generate a probabilistic context-free grammar, which can then be used to generate password guesses. The PCFG password guesser looks at a large amount of passwords, in order to determine which transformations are the most likely for the general symbol ‘Password’. The possible abstract symbols are: Alpha Letters (A), Digits (D), Capital Letters (U), Small Letters (L), Keyboard Patterns (K), and Special Characters (O). In the original paper by Weir et al., a symbol was indexed by how many times it repeated. So UL3D2 has the meaning ‘capital letter followed by three lowercase letters and two digits’, and would match ‘Pass23’, among others.
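As a rough illustration of the training step, the Python sketch below maps each password to its structure (for example ‘Pass23’ to UL3D2) and estimates the probability of each structure from its relative frequency. It is a simplified sketch, not the PCFG program itself: only the symbols U, L, D and O are handled, keyboard patterns (K) and the combined alpha symbol (A) are left out, and the function names and the four-password training list are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, Iterable


def symbol(ch: str) -> str:
    """Map one character to an abstract symbol (subset of those in the text)."""
    if ch.isdigit():
        return "D"
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    return "O"  # everything else is treated as a special character


def structure(password: str) -> str:
    """Collapse runs of the same symbol, writing the run length when above 1."""
    parts = []
    for sym, run in groupby(password, key=symbol):
        length = len(list(run))
        parts.append(sym if length == 1 else f"{sym}{length}")
    return "".join(parts)


def structure_probabilities(passwords: Iterable[str]) -> Dict[str, float]:
    """Estimate the probability of each structure by its relative frequency."""
    counts = Counter(structure(pw) for pw in passwords)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


probs = structure_probabilities(["Pass23", "Word11", "hello!", "Secr3t"])
print(structure("Pass23"), probs)
```

A full guesser would additionally learn which concrete strings fill each symbol (which digits, which words) and combine those probabilities with the structure probabilities to order the guesses.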

3.3.5 PRINCE

PRINCE stands for PRobability INfinite Chained Elements, and is a stand-alone password guesser by Jens Steube [5]. It uses a dictionary as a basis for the guesses, and combines the dictionary words in various ways. The ‘Infinite’ in its name originates from the fact that it will run until it exhausts the key space, which will take a very long time [27]. Internally, PRINCE reads each word in the dictionary into a table of ‘elements’ (the words), which is sorted by length. The elements are combined to create ‘chains’ of a certain length. As an example, a chain of length four can be created by two elements of length two (2 + 2), or four elements of length one (1 + 1 + 1 + 1), among others. Depending on the length of the chain and the number of elements in the chain, the number of ways to materialise the chain can be different. This number is referred to as the ‘key space’ of the chain. When PRINCE guesses passwords of a length x, it sorts the different chains whose total length is x by their key space size, and exhausts them in order [28].


If PRINCE attempts to guess passwords of length 3, it would try the chains 1 + 1 + 1, 2 + 1, 1 + 2 and 3. Suppose the dictionary fed to PRINCE contains 20 words of length 3, 5 words of length 2, and 3 words of length 1. Then PRINCE would exhaust the chains in the order indicated by table 1, starting with the words of length 2 concatenated to the words of length 1, and finishing with all the combinations of 3 words of length 1.

Chain        Elements     Key space
1 + 2        3 × 5        15
2 + 1        5 × 3        15
3            20           20
1 + 1 + 1    3 × 3 × 3    27

Table 1: An example of chains, elements, and key spaces used by PRINCE
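The ordering in table 1 can be reproduced with a short sketch. This is an illustration of the idea only, not the PRINCE source code: the function names are hypothetical, and the order between chains with equal key space is an assumption (here it simply follows the enumeration order).

```python
from math import prod


def compositions(n):
    """Yield every ordered way to write n as a sum of positive integers."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


def chains_by_key_space(length, words_per_length):
    """Return (key space, chain) pairs for the given total length, smallest key space first."""
    chains = []
    for chain in compositions(length):
        # The key space is the product of the number of words available per element.
        key_space = prod(words_per_length.get(k, 0) for k in chain)
        if key_space > 0:
            chains.append((key_space, chain))
    chains.sort(key=lambda item: item[0])  # stable sort keeps enumeration order on ties
    return chains


# Dictionary from the example: 3 words of length 1, 5 of length 2, 20 of length 3.
counts = {1: 3, 2: 5, 3: 20}
for key_space, chain in chains_by_key_space(3, counts):
    print(" + ".join(str(k) for k in chain), key_space)
```

With the counts from the example, the chains are printed in the same order as in table 1.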

3.3.6 TarGuess

The previously mentioned password guessers are mainly concerned with offline password guessing, and are intended to use huge data sets of leaked passwords to make their guesses. To demonstrate the effectiveness of employing personal information in online password guessing, Wang et al. created TarGuess, a framework for guessing passwords of a user given some target-specific information [32]. They defined personal information as any information which is related to a user, and split it up into several categories: Type-1, which is names, birth dates, phone numbers, and national ID numbers, and Type-2, which is gender, age, and language. Furthermore, they also used user identification credentials, such as previous passwords, personal ID numbers, user names and email addresses. Type-2 was assumed to have an implicit role in passwords, which means that it impacts how a user picks their passwords. The other categories were instead supposed to have an explicit role, which means that the information occurred directly in the picked password. All their versions of TarGuess were demonstrated to be better than previous attempts at using personal information, but unfortunately they did not release any source code or pre-trained models of their work.

They modelled four different versions of TarGuess, each for a different attack scenario. They are briefly described here, but for more details, see the original paper by Wang et al.

TarGuess-I

TarGuess-I deals with the scenario of an attacker getting access to the name and birth date of the user, and thus employs some Type-1 Personal Information. It is built on PCFG (described in section 3.3.4), and extends the model to use the symbol N for name and B for birth date. They trained their model on a set of passwords with corresponding personal information for each user, by looking for matches between the personal information and the substrings of the password for each user.

TarGuess-II

TarGuess-II deals with the scenario of an attacker getting access to one sister password of the targeted user. With training data of multiple different pairs of passwords, from the same user but for different sites, they tried to determine how the first password can be modified into the second. The modification rules they used were insertion, deletion, capitalisation, leet speak², substring movement and reversing. The result was a set of transformation sequences, each associated with a probability of occurring. This set was then used to make guesses, by applying each transformation sequence to the sister password of the targeted user in order of decreasing probability.

TarGuess-III

TarGuess-III deals with the scenario of an attacker getting access to the name, birth date (Type-1 information), and one sister password of the targeted user. It does so by combining the grammar of TarGuess-I and the grammar of TarGuess-II directly, and can thus be seen as a combination of them.

TarGuess-IV

TarGuess-IV deals with the scenario of an attacker getting access to the name, birth date (Type-1 information), one sister password, and either the gender, age, or language (Type-2 information) of the user. Combining the other versions of TarGuess, with some help of Bayesian theory, yielded a trained model which uses all the available information.

3.4 Other programs and techniques

This section lists some of the programs and techniques which are found in the literature but are not considered in this paper. The reason is either that more programs would dramatically increase the scope of the paper or that the programs are no longer relevant.

² This is when you transform a letter into a similar-looking digit, such as changing an ‘E’ to a ‘3’.


3.4.1 Rainbow Tables

Rainbow tables are large pre-computed tables of passwords and hashes [26], used to quickly look up the password which was used to calculate the hash. Rainbow tables have been studied for a long time in the literature, but will not be considered in this paper. This is because they only work against passwords which have been hashed without salt, which is no longer common. When a salt is used, each guess must be hashed with the salt before it is compared with the targeted hash, which can be done when a password cracker is used but not when rainbow tables are used. Modern hardware also guesses passwords quickly enough that storing a large table of hashes on disk will not save much time.
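The effect of a salt on pre-computed lookups can be shown with a minimal sketch. It makes several simplifying assumptions: a plain dictionary stands in for a real rainbow table (which stores hash chains rather than every pair), SHA-256 is used only as an example hash function, and the candidate list and salt value are made up.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Pre-computed table of unsalted hashes (a stand-in for a rainbow table).
candidates = ["123456", "password", "letmein"]
table = {sha256_hex(p.encode()): p for p in candidates}

unsalted_hash = sha256_hex(b"password")
salt = b"x9f2"  # hypothetical per-user salt
salted_hash = sha256_hex(salt + b"password")

print(table.get(unsalted_hash))  # found: the table contains this hash
print(table.get(salted_hash))    # None: the salted hash is not in the table

# A password cracker instead hashes every guess together with the salt.
cracked = next(
    (p for p in candidates if sha256_hex(salt + p.encode()) == salted_hash),
    None,
)
print(cracked)
```

The lookup fails for the salted hash even though the password is in the table, while the guess-and-hash loop still finds it.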

3.4.2 Other password guessers

There are multiple other password guessers which can be found both in the literature and on the web. One such program is GENPass, which is based on recurrent neural networks and PCFG [20]. GENPass was not included because there was no implementation available, and because programs based on PCFG and neural networks were already included. A program developed by Melicher et al., which is also based on recurrent neural networks, is another program which was not included [23] [3]. A search for ‘password cracker’ on the code hosting service GitHub reveals hundreds of password crackers [1]. Naturally, testing all of these programs would be out of scope for this paper. Instead, this paper describes tests of programs which were either popular or commonly cited in the literature, and which work differently from each other.


4 Method

This chapter walks through the methods which were used to answer the research questions.

4.1 How password guessing is performed

This section describes the method used to answer the first research question, ‘how are password guessing attacks performed by people with professional experience?’.

4.1.1 Interviews

Several interviews were carried out with people with professional experience in password guessing. The interviewees were selected to be people working with some sort of password guessing at either a company or an agency. The goal of the interviews was to get a clear idea of what programs are commonly used, how they are used, and what assumptions are usually made when guessing passwords. Interviews were chosen as the method, as they make it possible to get qualitative data from a relatively small sample size. Since the goal was to get an idea of how someone with professional experience would approach the problem, the interviews were intended to be semi-structured, with questions that allow for open and flexible answers. Since the goal was to get answers from well-informed people, the interviews were of the type informant interview. The interviews were transcribed and summarised, as has been described by Sarah Tracy among others [30].


The interview started with getting consent and stating the goals of the interview. The opening questions below were used to get to know the interviewee and their relation to password cracking.

Opening questions

1. What is your current profession?

2. How much of that is related to password cracking?

Generative questions are asked early in the interview, after the opening questions. Their purpose is to get the interviewee to talk freely about the topic and provide a framework for further questions [30].

Generative questions

1. Can you tell me about a job or assignment you had, which was related to password cracking?

Directive questions were asked, depending on the outcome of the generative questions. The reason for them was to look into some specific programs.

Directive questions

1. What type of hardware do you use?

2. Which programs do you typically use?

3. How are the programs used?

4. What information do you usually have access to?

5. How is that information used?

6. Which hash functions do you typically encounter?

In order to finish off the interview, some catch-all questions were asked. The purpose was to get the interviewee to tie up any loose ends [30], and to let the interviewer know if they had missed anything.

Closing questions

1. Is there anything else you think is important for me to know about?

These questions were tested in a test session, where they were asked to someone with some experience on the topic. This was done in order to see if they worked as intended.


4.2 Overall design of password guessing programs

This section describes the method used to answer the second research question, ‘what are the overall designs of current state-of-the-art automated password guessing programs?’. As described in chapter 3, there are multiple considerations to take into account when implementing an automated password guessing system.

4.2.1 Input parameters

This section describes how the programs were compared with regard to the considerations they take into account. As described in section 3.2, there are some common patterns in user-picked passwords. It was therefore decided that the programs should be tested for how well they can consider the different patterns when they make guesses. The following attributes were considered, since they are the ones considered to be the most important in the literature. The programs Hashcat, OMEN, PassGAN, PCFG, PRINCE, and TarGuess (listed in section 3.3) were checked. They were picked out either because of their frequency in the published literature, or because the interviewees (see section 4.1.1) mentioned them.

• Password structure - The structure of how a password is made. This refers to groups of characters and their position, but not the meaning the characters had for the person who came up with the password.

• Dictionary words - This refers to regular words being used as part of the guesses.

• Sister Passwords - One or more passwords which have been used elsewhere by the targeted user.

• Policies - The requirements the system puts on passwords. The two most common requirements on passwords are a minimum number of characters and an inclusion of characters from different groups (such as numbers and capital letters). For this reason, this attribute was split up into Length and Characters.

• Personal Information - Non-public data which is related to the real-life person behind the password. This attribute was split up into Names (which includes, but is not limited to, the real-life name of the user), Birthdays (both the user's own and others'), ID numbers and Email addresses.

The programs can use the data of the above attributes in different ways. Because of this, it was decided that how a program uses data of an attribute should be categorised. For each of the attributes that was considered, it was decided that the following categorisation should be used.


• Automatically - This means that the program makes some automatic deduction of how to use the attribute, and if it will be included or not. Typically, it involves some sort of training phase.

• Directly - This means that the attribute is specified by the attacker as a setting to the program. If some data related to the attribute is fed to the program when it starts to generate guesses, and the program makes a distinction between this data and data of a different attribute, then the program is said to use the attribute directly.

• Indirectly - This means that the attacker can run the program in some way which makes it base the guesses on the data. If the attacker can feed data related to the attribute in some form to the program, and the program makes guesses which are based on the data in some form, then it is used indirectly. The main distinction between ‘directly’ and ‘indirectly’ is that attributes which are used indirectly were probably an unintended consequence of how the program was designed, and the program makes no distinction between data of the attribute and data of other attributes.

• None - This means that the program is entirely unable to use the attribute when making guesses. If you have access to some data of the attribute, it would not make any difference for the generated guesses or how you run the program.

4.3 Overall effectiveness of password guessing programs

This section describes the method used to answer the third research question, ‘how do the password guessing programs compare in terms of effectiveness?’. Section 4.3.1 defines the different metrics which are considered for the tests, which are described in section 4.3.3. Section 4.3.2 specifies which sets of data have been used in the tests.

4.3.1 Metrics for measuring effectiveness

This section describes the metrics used to measure effectiveness. The different metrics are used in the various tests, described in section 4.3.3.

Number of passwords cracked

One metric which is often used to compare password guessers is the number of guesses required to reach a certain number of cracked passwords in a big data set [34] [9] [23] [29]. This metric was considered here as well, since it measures the quality of guesses in a hardware-independent manner.


Number of cracked passwords per second

This metric encapsulates what an attacker would be interested in. Password guessing attacks are typically made with run-time in mind, and not necessarily the number of guesses. This metric can thus be seen as a hardware-dependent version of ‘number of passwords cracked’, as described above.

Uniqueness of guesses

If a program makes duplicate guesses, the effective time to correctly guess a password goes up, which is why a high number of unique guesses is desirable. For that reason, uniqueness of guesses was a metric which was considered in this thesis.

Uniqueness can also be computed in pairs, by calculating the number of overlapping guesses between two programs. This metric is useful to see how similar two programs are in terms of guesses. If two programs generate very similar guesses, one of them could be replaced by the other with similar results, which is why it is worth considering.

Finally, uniqueness can also be computed among all the programs. This is done by calculating how many guesses a program made which were not made by any other program.

This thesis used all three measures of uniqueness.
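The three measures can be expressed with set operations, as in the sketch below. The function names and the tiny guess lists are hypothetical and are not results from the thesis; real runs would involve far larger lists, for which this in-memory approach may need to be adapted.

```python
def uniqueness_ratio(guesses):
    """Share of one program's guesses that are distinct."""
    return len(set(guesses)) / len(guesses)


def overlap(guesses_a, guesses_b):
    """Number of distinct guesses made by both programs."""
    return len(set(guesses_a) & set(guesses_b))


def exclusive_guesses(all_guesses):
    """For each program, count the distinct guesses that no other program made."""
    result = {}
    for name, guesses in all_guesses.items():
        others = set()
        for other_name, other_guesses in all_guesses.items():
            if other_name != name:
                others.update(other_guesses)
        result[name] = len(set(guesses) - others)
    return result


# Made-up guess lists for three hypothetical programs.
runs = {
    "a": ["123456", "password", "123456", "qwerty"],
    "b": ["password", "letmein"],
    "c": ["qwerty", "dragon", "letmein"],
}
print(uniqueness_ratio(runs["a"]))
print(overlap(runs["a"], runs["b"]))
print(exclusive_guesses(runs))
```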

Length distributions of guesses

Password guessers should make guesses which have some similarities to the training data. One metric for that is the length of the passwords each program generates. This metric tells us if the program considers the length of the training data when it makes its guesses.
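A length distribution of this kind can be computed as in the following sketch, where the function name and the sample list are hypothetical. Computing it for both the training data and the generated guesses makes the two directly comparable.

```python
from collections import Counter


def length_distribution(passwords):
    """Return the share of passwords per length, ordered by length."""
    counts = Counter(len(p) for p in passwords)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


print(length_distribution(["abc", "abcd", "xyz", "12345678"]))
```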

Guessing speed

As described in chapter 3, a full password cracking solution must generate guesses which are then hashed and tested. If either the guessing speed or the cracking speed is slower than the other, the throughput will be limited by the slowest of them. Therefore, the guessing speed for each program is a useful metric to determine the effectiveness of a program.


4.3.2 Used sets of data

This section presents the sets of data used for the tests, as well as some motivation for why each set was used.

LinkedIn

In 2012, social networking site LinkedIn was attacked and millions of password hashes were leaked. This thesis used the 60572669 unique passwords which have been cracked and can be found on hashes.org, as of February 2019. The LinkedIn data set has been used in multiple different research papers [12] [20]. The passwords can be assumed to have been picked to protect some mildly sensitive user information, given the purpose of the website. For these reasons, LinkedIn was considered to be suitable for this thesis.

RockYou

Gaming website RockYou was targeted in 2009 and 14341564 unique plain-text passwords were leaked. RockYou was used for a few of the tests in this thesis. This is because it is commonly used by researchers [9] [20], and easily accessible by real-world attackers.

LastFm

In 2016, music website LastFm was breached and several million password hashes were leaked. Roughly 20 million of them have been cracked and can be found on hashes.org. This dataset was used in a couple of the tests, and is suitable as it is only a few years old and sufficiently large.

Bloggtrafik

Swedish website Bloggtrafik was hacked and shut down in 2016. Roughly 144400 users had their credentials leaked in plain text, but only 85027 unique passwords were used between them. This thesis used the Bloggtrafik data, as it is of a reasonable size, and contains passwords set by Swedish-speaking users.

4.3.3 Studied scenarios and tests

This section presents the specific tests which were used to study the effectiveness. The tested programs were Hashcat, OMEN, PassGAN, PCFG, and PRINCE, as listed in section 3.3. TarGuess was excluded as it had no available implementation and because no dataset containing personal information could be used for this thesis. How the programs were executed in each test, as well as specifications for the used hardware, can be found in the appendix.

Unless otherwise noted, the word mangling rules used by Hashcat were OneRuleToRuleThemAll. OneRuleToRuleThemAll is a set of rules experimentally found by combining the top-performing rules from various other well-known sets of rules [25].

Because of computational limitations, PassGAN had to be limited to only consider passwords up to a certain length. This means that it did not have exactly the same input as the other programs. The number of passwords ignored by PassGAN, as well as how much of the total training data this represents, has been noted for each test.

Test to measure guessing speed

The purpose of this test was to study the guessing speed of each program.

Each program was run for 10 minutes, and the number of guesses made during that time was measured. From this, the number of guesses made per second was calculated. If the program required any sort of training, the time taken for that was measured as well.

In addition to the guessing speed metric, the speed for some common hash functions was calculated as well. This was determined by running Hashcat in its benchmark mode. This does not test all of the available hash functions in Hashcat, but it provides a reasonable sample.

Test with password list

The purpose of this test was to study the quality of the guesses made by each password guesser, when run in similar circumstances. In particular, the effectiveness of each program when using a password list for training or as a dictionary was measured. The considered metrics were uniqueness of guesses, length distributions, and number of passwords cracked, as described in section 4.3.1. Since the metrics are hardware-independent, each program was set to generate the same number of guesses, in this case 100 million.

This test used passwords from LinkedIn as both training data and test data. First, the passwords were split into 10 different groups, each assigned a number between 0 and 9. Each group was defined to be test data for the corresponding test session, with the remaining 9 groups defined as training data for the same test session. This means that the training data for each of the 10 test sessions was 54515402 words large, and each set of test data was 6057267 words large. In general, this is known as a k-fold cross-validation with k = 10. This approach is common in validation of machine learning models, and the purpose is to reduce bias from data selection. The choice k = 10 is considered to be standard in the field, but other numbers are common as well [18].
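The split into test sessions can be sketched as follows. How the passwords were assigned to the 10 groups is not stated here, so the round-robin assignment below is an assumption, and the function name and sample data are hypothetical.

```python
def k_fold_sessions(passwords, k=10):
    """Yield (session, training, test), using group i as the test data of session i."""
    # Assumption: passwords are dealt out round-robin into k groups.
    groups = [passwords[i::k] for i in range(k)]
    for i in range(k):
        training = [p for j, group in enumerate(groups) if j != i for p in group]
        yield i, training, groups[i]


# Small made-up list standing in for the password set.
sample = [f"pw{i}" for i in range(25)]
for session, training, test in k_fold_sessions(sample, k=10):
    print(session, len(training), len(test))
```

Every password ends up in the test data of exactly one session and in the training data of the other nine.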

Hashcat used the set of word mangling rules known as ‘Best64’. This is because Best64 is shipped with Hashcat, and is commonly featured in usage examples in the documentation and on the web. Some of the literature in which Hashcat is used also uses the Best64 rules [13] [12]. For these reasons, the Best64 set of rules can reasonably be called the ‘default’ set of rules.

PassGAN was limited to only consider passwords of length 16 or less. This resulted in less than 1% of the passwords being removed.

Test with dictionary words and password list

The purpose of this test was to study to which extent regular dictionary words are useful as a basis for password guesses. Since the programs which use some training data (PassGAN, PCFG, OMEN) are assumed to be run on lists of passwords, it was decided to see how they perform when given a list of regular words. As described in section 3.2, regular words are often found in passwords, so it can be assumed that both regular dictionaries and password lists are useful. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

This test used passwords from LastFm as test data, and passwords from RockYou as training data.

The traditional dictionary was constructed by fetching data from the articles on English-language Wikipedia. A copy of all current articles (as of January 2019) was downloaded and each string of alpha characters between two whitespace characters was extracted. This approach has been used by other researchers to construct a general dictionary [21]. English-language Wikipedia contains technical terms, names, and some common numbers, which makes it more suitable for password guessing than a general English language dictionary. Duplicated words were ignored. This resulted in a list of 14353606 words being used for this test.
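The extraction step can be sketched as below. This is one possible reading of the description, not the script used for the thesis: a token is kept only if it consists entirely of letters, so tokens with attached punctuation are dropped, and the function name and sample text are hypothetical.

```python
def alpha_words(text):
    """Return whitespace-delimited tokens made up only of letters, without duplicates."""
    seen = set()
    words = []
    for token in text.split():
        if token.isalpha() and token not in seen:
            seen.add(token)
            words.append(token)
    return words


sample_text = "Linköping is a city. Linköping University is in Linköping"
print(alpha_words(sample_text))
```

Keeping the first occurrence of each word preserves the order in which words appear in the source text.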

Three different tests were run, using different dictionaries.

• One run with only the Wikipedia dictionary as input.

• One run with the RockYou dump as input.

• One run with the RockYou dump and Wikipedia dictionary merged together.

Each program ran for 10 minutes, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.36% of the words being removed from the Wikipedia dictionary, 0.03% from the RockYou set, and 0.20% from the combined set.


Test using dictionaries of different languages

The purpose of this test was to see to which extent the language of the dictionary influences the effectiveness of the password guessers. The research mentioned in section 3.2.5 showed that a dictionary in the native language of the targeted user could yield better results. To test this, each program targeted passwords from a Swedish website with a dictionary in Swedish and a dictionary in English. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

The dictionaries picked for this test were ‘English (British)’ and ‘Swedish’, found in a repository on GitHub.¹ Including all forms, the Swedish dictionary included 400468 words and the English dictionary included 145390 words.

Each program ran for one hour, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.002% of the words in both the English and the Swedish dictionary being removed.

Test against passwords with a minimum length policy

The purpose of this test was to study the effectiveness of the programs when targeting passwords of a known minimum length. In this test, each program targeted passwords of a known minimum length, using either other passwords of the same minimum length or all the available passwords. The metric used to study the effectiveness of each option was number of cracked passwords per second, as described in section 4.3.1.

This test used passwords from LastFm as test data and training data. Each password in the set was put into one or more of the following groups:

• All of the passwords (10234857)

• Passwords of length over 8 (7406354)

• Passwords of length over 12 (1099628)

• Passwords of length over 16 (124271)

• Passwords of length over 20 (16652)

The current NIST recommendation for password policies is that they should require at least 8 characters [11], which is why it was picked as the smallest length. Passwords of length 20 or longer are fairly uncommon (around 0.16% of the LastFm passwords), which is why it was decided not to target any passwords longer than that. The steps in between were picked to give some data for the policies between the shortest and longest policy.

¹ https://github.com/titoBouzout/Dictionaries


Each group was split into two random groups, one of which was training data (or dictionary) and one of which was test data (targeted passwords). Excluding the test set based on all of the passwords, each test set was targeted twice by each program: one time with training data based on all of the passwords, and one time with training data based on only the passwords of the corresponding targeted minimum length.

Each program ran for 10 minutes, for each test.

PassGAN was limited to only consider passwords of length 30 or less. This resulted in 0.006% of the passwords of length 8 or longer being removed, 0.04% of the passwords of length 12 or longer being removed, 0.33% of the passwords of length 16 or longer being removed, and 2.5% of the passwords of length 20 or longer being removed.

4.4 The implementation of the system

This section describes how the system was implemented in the Cyber Range at FOI, and thus answers the fourth research question, ‘how can an automated password guessing system be implemented in the Cyber Range at FOI?’.

4.4.1 Overview

This section will refer to the system implemented in the FOI Cyber Range as Svedcrack. Svedcrack is a concatenation of ‘Sved’ (the sub-system in the Cyber Range which uses the program) and ‘crack’.

Section 4.4.2 describes the requirements on Svedcrack. Sections 4.4.3 through 4.4.5 describe the design of Svedcrack, and how the additional tests were performed.

4.4.2 The requirements

The requirements of the automated system were discussed, and it was decided that the following things should be considered by Svedcrack.

• The targeted hash (may be multiple)

• Hash function used (A)

• Time to run (t)

• Minimum password length policy (l)

• The native language of the user (s) (simplified to ‘International’ and ‘Swedish’)

Both the theory and the results of the other tests showed that these were reasonable things to consider.


4.4.3 Expected number of guesses

Using the results from the test described in section 4.3.1, it is possible to use $A$ and $t$ to calculate the expected number of guesses a program will be able to test in a given time.

$E_g(\mathit{program}, t)$ = the expected number of guesses a program will make during the given time.

$E_h(A, t)$ = the expected number of hash function calculations which can be made during the given time.

$E_t(\mathit{program}, A, t)$ = the expected number of tested guesses during the given time, calculated as $E_t(\mathit{program}, A, t) = \min(E_g(\mathit{program}, t), E_h(A, t))$. This is because either the guessing speed or the hash calculation speed will be a bottleneck for the throughput of actual tested guesses.

Calculating $E_t(\mathit{program}, A, t)$ for each $\mathit{program} \in \{\mathit{hashcat}, \mathit{pcfg}, \mathit{prince}\}$ yields the expected number of tested guesses for each program.
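The bottleneck calculation can be written as a short sketch. It assumes that both the guessing speed and the hash speed are constant over the run, which is a simplification, and the function name and the rates used in the example are made up rather than measured values.

```python
def expected_tested_guesses(guesses_per_second, hashes_per_second, seconds):
    """Expected number of tested guesses: the slower of the two stages limits throughput."""
    expected_guesses = guesses_per_second * seconds  # E_g under a constant-rate assumption
    expected_hashes = hashes_per_second * seconds    # E_h under a constant-rate assumption
    return min(expected_guesses, expected_hashes)


# Hypothetical rates: the hash function is the bottleneck in the first case,
# the guesser in the second.
print(expected_tested_guesses(1000, 250, 60))
print(expected_tested_guesses(100, 250, 60))
```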

4.4.4 Expected number of correct guesses

From running tests, it is possible to determine how many passwords a program is able to guess after a certain number of guesses. When you divide this number by the total number of passwords in the set, you get the percentage of the passwords cracked. That percentage could also be seen as the probability of a program being able to crack a single password after the number of guesses, under the assumption that a targeted password follows the same statistical distribution as the passwords in the training data and the test data.

This reasoning was applied when designing Svedcrack. By determining a function which predicts the probability of a program cracking a password hash after a certain number of guesses, you can find out which program has the highest probability of successfully cracking a password hash. Let this function be called $E_c(\mathit{program}, \mathit{guesses}, l, s)$, where $l$ and $s$ are the password policy minimum length and the language of the user, respectively.

This function was determined by plotting the number of guesses against the percentage of passwords guessed so far, and then determining the exponential function $f(x) = a \cdot e^{-b \cdot x} + c$ which best fitted it in the least-squares sense. This function was used because of previous observations that the curve tended to increase rapidly in the beginning before it stopped increasing completely. $E_c(\mathit{program}, \mathit{guesses}, l, s)$ can now be described as $E_c(\mathit{program}, \mathit{guesses}, l, s) = f_{tls}(\mathit{guesses})$, where $f_{tls}(x)$ is $f(x)$ with different parameters for $a$, $b$, and $c$, depending on the program, targeted language, and targeted length.

By first calculating $\mathit{guesses} = E_t(\mathit{program}, A, t)$ as described in section 4.4.3, it is possible to find out $E_c(\mathit{program}, \mathit{guesses}, l, s)$ for each program.
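A least-squares fit of the curve a · e^(−b · x) + c can be sketched without external libraries. The approach below is an assumption and not necessarily the fitting procedure used for Svedcrack: for a fixed b the model is linear in a and c, so those two are solved in closed form, and b is chosen from a list of candidate values. The function names and the synthetic data points are hypothetical.

```python
import math


def fit_for_b(xs, ys, b):
    """Least-squares a and c for a fixed b; returns (squared error, a, b, c)."""
    us = [math.exp(-b * x) for x in xs]
    n = len(xs)
    su, sy = sum(us), sum(ys)
    suu = sum(u * u for u in us)
    suy = sum(u * y for u, y in zip(us, ys))
    det = n * suu - su * su  # zero only if all exp(-b * x) values are equal
    a = (n * suy - su * sy) / det
    c = (sy - a * su) / n
    sse = sum((a * u + c - y) ** 2 for u, y in zip(us, ys))
    return sse, a, b, c


def fit_curve(xs, ys, b_candidates):
    """Pick the candidate b (with its best a and c) that minimises the squared error."""
    _, a, b, c = min(fit_for_b(xs, ys, b) for b in b_candidates)
    return a, b, c


# Synthetic points generated from a = -0.3, b = 0.5, c = 0.3 (not thesis data).
xs = [0, 1, 2, 3, 4, 5]
ys = [-0.3 * math.exp(-0.5 * x) + 0.3 for x in xs]
a, b, c = fit_curve(xs, ys, [0.1 * i for i in range(1, 21)])
print(a, b, c)
```

A curve that rises quickly and then flattens corresponds to a negative a and a positive b, with c as the level the curve approaches. With real guess counts, which can be in the hundreds of millions, x would probably need to be rescaled before fitting.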


4.4.5 Tests to determine the parameters for the curves

The results from the other tests (section 4.3.3) were intended to be used to find the parameters to f(x), as described in section 4.4.4. Unfortunately, because of how one of the programs generated data, the test results could not be used as intended and had to be redone. However, the test results showed that some of the programs performed better than others, so the new tests could focus only on the relevant programs.

Specifically, the programs Hashcat (with dictionary rules), PRINCE, and PCFG were considered. The programs were tested against every combination of language (‘International’ and ‘Swedish’) and minimum password length in the set {8, 12, 16, 20}. This means that 8 different targets were considered. Each program ran for one hour before the execution was terminated.

The Swedish data was combined from leaks from anstalten.nu, bloggtrafik.nu, gratisbio.se, and an unknown leak commonly known as ‘hoppstylta’, for a total of 1068211 passwords. The international data was combined from leaks from elitehacker, myspace, faithwriters, phpbb, hak5, honeynet, rockyou, muslimmatch, singles.org, and an unidentified porn site, for a total of 34424159 passwords. They were picked because they contained duplicate passwords, unlike the previously considered sets of data. Both the Swedish and the international data were put into groups containing only passwords of the minimum length and of the target language, similar to how the data was treated in the test described in section 4.3.3. Each of these sets was split into two, one for training and one for testing.

4.4.6 Putting together the material into a program

Once a function which describes the expected number of correct guesses when given the input listed in section 4.4.2 had been made, SvedCrack could be implemented. SvedCrack simply calculates the expected probability that a program will be able to guess the password(s) in the given time, and starts the one with the highest probability.
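
The selection step can be sketched as follows. This is an illustration under stated assumptions, not SvedCrack's actual code: the curve parameters, program names and guess budgets are hypothetical, and the number of guesses each program manages in the available time is taken as given rather than computed by Et.

```python
import math


def expected_crack_probability(params, guesses):
    """Evaluate f(x) = a * exp(-b * x) + c for fitted parameters (a, b, c)."""
    a, b, c = params
    return a * math.exp(-b * guesses) + c


def pick_program(curves, guess_budget):
    """Return the program with the highest expected probability, plus all scores.

    curves:       program name -> (a, b, c) for the targeted length and language
    guess_budget: program name -> number of guesses it can make in the given time
    """
    scores = {
        program: expected_crack_probability(curves[program], guess_budget[program])
        for program in curves
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical fitted curves for two made-up programs.
curves = {"program_a": (-0.3, 1e-9, 0.3), "program_b": (-0.5, 1e-10, 0.5)}
short_run, _ = pick_program(curves, {"program_a": 1e9, "program_b": 1e9})
long_run, _ = pick_program(curves, {"program_a": 1e9, "program_b": 1e11})
```

With these made-up numbers the choice depends on the budget: the program whose curve rises faster wins the short run, while the program with the higher plateau wins once it is given enough guesses.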

5 Result

This chapter describes the results.

5.1 How password guessing is performed

This section describes the results for the first research question, ‘how are password guessing attacks performed by people with professional experience?’.

5.1.1 Interviews

Two interviews were carried out. The names and workplaces of the interviewees have been removed.

Interview One

The first interview was made at FOI and took about 45 minutes. The interviewee had been working with password cracking for about six years, and they mostly targeted hard drive encryption schemes. The questions listed in section 4.1.1 were not followed strictly, as the answer to the generative question also answered most of the directive questions.

Summary When asked to describe a previous job or assignment related to password cracking, the interviewee stated that they often started out by running programs with default and known ‘generic’ settings. This was done until more information related to the user whose password was to be cracked had been gathered. When asked about what kind of information that would be, names and numbers related to the user were mentioned. The names could be the names of the user, family members, places, addresses, and pets. The numbers could be dates which are important to the user.

When asked about programs, Hashcat, John the Ripper, PRINCE, and Passware Kit were mentioned. When asked to compare them, the interviewee said that Hashcat was generally the best. John the Ripper was described as having a good community with many plugins, so it was sometimes used to target systems which were not supported by Hashcat. Passware Kit was used by colleagues of the interviewee, and was good as it was heavily automated and simple to use. PRINCE was also used by the colleagues of the interviewee, and as far as he knew it was considered to be good. The interviewee said that he preferred more control over the programs he used. Rainbow tables were also mentioned, as something they had been using earlier but had stopped using about a year ago. The reason was that many contemporary hashes they found used salts, which is a way to defeat rainbow tables. It was also mentioned that modern hardware checks hashes so quickly that having large pre-computed tables did not save much time. The interviewee also described a system in which two networks compete to create and detect password guesses, which he said was interesting and worth looking into. The interviewer assumed this was a description of a GAN, even though it was not mentioned by name.

When asked about how the programs were being used, the interviewee said that a cracking session sometimes started with plain brute-force to remove the shortest and simplest candidates. A cracking session often ended with brute-force as well, when all the easier options had been exhausted. In between, word lists with mangling rules were often used. The rules they used were something they had developed by themselves over the years. The word lists used were created for each specific targeted user, but also contained a generic Swedish dictionary, names, idioms and common misspellings. They also categorised the words into different classes, which they used differently, but the interviewee was not allowed to say exactly how they did that. The interviewee mentioned that you could be creative when you created the word list. The interviewee also pointed out multiple times in the interview that a feedback loop was important, and that new information discovered should be added to the dictionary. It was also mentioned that the speed of the hash function determined how they used their programs, and that for slow hashes a slower guesser could be used.

A topic which was brought up was training data. The interviewee mentioned that the model of passwords that many programs used was based on the RockYou data leak. He said that RockYou was not very good as it was old at this point, and that it was also too web specific. Since they typically try to guess passwords for decrypting hard drives, their targets picked different passwords than they probably would for a website. He mentioned that the RockYou data set did not entirely correlate with how the users they targeted picked their passwords. The interviewee did mention that the good thing about using database leaks was that the passwords were picked by real humans, and thus could show some tendencies people have when they pick passwords.

When asked about hardware used, the interviewee said that most of the work is done on GPUs, specifically GTX 1080 Tis. They had previously used custom FPGAs, which they stopped using as they were too expensive for what they offered. The interviewee stated that building a computing cluster with CPUs might be relevant, since one of the new hashing functions they encountered required a lot of memory, which GPUs did not have enough of.

The topic of password content was brought up at multiple points in the interview. Personal information was something they often used when guessing passwords. Previous passwords were something they often looked at as well, if they could find them. The interviewee said that they typically try to target the weakest point of the targeted user, since it might reveal information or other passwords. It was mentioned that you could often see the passwords of a target evolve over time, with new parts being added at the end. For that reason, the interviewee considered it to be important to label and classify the personal information they found, since the point in time it was used might be relevant. It was also mentioned that there might be cultural differences in how users of different groups select their passwords. The groups mentioned were age and type of work, but other profiling might be relevant as well. Keyboard walks were mentioned as something they had found in passwords, but they were not very common to encounter.

Interview Two

The second interview was carried out at a company and took about 25 minutes. The interviewee had been working with password cracking for multiple years, although it was a fairly minor part of his work. Most of their work was related to examining IT-related crimes, so hard drive encryption which used some password was often targeted.

Summary When asked about the hardware they used, the interviewee stated that they had two GTX 1080 Tis.

When asked about programs, the interviewee mentioned that Hashcat was often used on the lowest level. However, they did not use Hashcat directly, but instead used a graphical wrapper known as Password Recovery Toolkit (PRTK). PRTK let the attackers build combinations of mangling rules with a graphical editor, and manage and merge different lists of words. The interviewee mentioned that it was not always entirely clear how PRTK called on Hashcat, but that it at least started with exhausting easy searches before it moved on to searches which would take a longer time.

When asked about how the programs were being used, dictionaries and mangling rules were mentioned. They used a regular dictionary which also included many combinations of digits as well as some personal data of the victim. The personal information could include things taken from the social media of the user, such as names (both of the target and of people related to the target) and important dates. This also answered the question about what information they usually have access to. The interviewee mentioned that you could be creative when you came up with the word list.

When asked about which hash functions they usually encounter, the interviewee mentioned NTLM. But since his work was mostly related to showing weaknesses in the systems he tested, it was often enough to simply tell the operators that they had used an insecure hash function, and in such cases there was no need to try to guess the password.

The topic of password re-use was brought up. The interviewee stated that it was not something they used directly themselves, but his colleagues had sometimes used old passwords of the target. Typically, users added something to either the beginning or the end of an old password they had used.

The interviewee mentioned that it was easier to target a group of people instead of an individual, since in a group it is likelier that one of them will use a bad password.

5.2 Overall design of password guessing programs

This section describes the results for the second research question, ‘what are the overall designs of current state-of-the-art automated password guessing programs?’.

5.2.1 Summary

Table 2 summarises the results. An ‘A’ in the table means that the attribute is automated, a ‘D’ means that it is used directly, an ‘I’ means that it is used indirectly, and a ‘-’ means that it is unused.

5.2.2 Hashcat

Hashcat is not intended to be an automated program, and none of the attributes are considered in an automated fashion. An attacker that uses Hashcat can, however, decide if and how to use most of the attributes.

Attribute        Hashcat  OMEN(+)  PassGAN  PCFG  PRINCE  TarGuess
Structure        D        A        A        A     -       A
Dictionary       D        I        I        I     D       -
Sister PW        I        D        I        D     I       A
Policy: Length   D        D        D        I     I       I
Policy: Chars    D        I        I        I     I       I
PII: Names       I        D        I        I     I       A
PII: Birthdays   I        D        I        I     I       A
PII: IDs         I        I        I        I     I       A
PII: Emails      I        D        I        I     I       A

Table 2: The extent to which different password guessers consider different attributes commonly found in passwords

Password Structure

An attacker using Hashcat can feed it password structures, and Hashcat will proceed to make guesses which fulfil those structures. This is what the ‘brute force’ mode is for. A structure is specified in terms of a ‘mask’, which is a sequence of character groups that matches the password to be guessed. Since the mask fed to Hashcat is used directly and purposefully, this attribute is considered to be used directly. It is not possible to use Hashcat to come up with password structures automatically.

Dictionary Words

An attacker using Hashcat can feed it a dictionary which will be used as the basis for the guesses, which is what the ‘Straight’ mode is for (as described in section 3.3.1). It is therefore reasonable to claim that this attribute is used directly by the program.

Sister Passwords

It is possible that the list of words which Hashcat uses contains sister passwords of the user, which will then be used as a basis for new password guesses. Hashcat considers these to be regular words, and not sister passwords, so no distinction is made. For that reason, this attribute is used indirectly by Hashcat.

Policies

When it comes to the sub-attribute length, it is possible to set a maximum length in some of the modes, which is one reason to classify length as something directly used by Hashcat. When it comes to the sub-attribute special characters, it is not possible to specify a password policy to Hashcat and have it generate only guesses which are accepted by the policy. You can, however, specify different password structures (masks) which will only match passwords accepted by a given policy. This requires some additional work, but it can be said that Hashcat uses this sub-attribute directly.
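
The additional work mentioned above can be illustrated with a small sketch that enumerates every mask of a given length containing each required character group at least once. The placeholders `?l`, `?u`, `?d` and `?s` are Hashcat's built-in mask character sets for lowercase letters, uppercase letters, digits and special characters; the function name and the example policy are assumptions made for illustration.

```python
from itertools import product

# Hashcat's built-in mask character sets:
# ?l lowercase letters, ?u uppercase letters, ?d digits, ?s special characters.
CHARSETS = ("?l", "?u", "?d", "?s")


def masks_for_policy(length, required):
    """Return every mask of the given length that uses each required charset at least once."""
    masks = []
    for combo in product(CHARSETS, repeat=length):
        if all(charset in combo for charset in required):
            masks.append("".join(combo))
    return masks


# Hypothetical policy: exactly 3 characters, at least one uppercase letter and one digit.
policy_masks = masks_for_policy(3, ("?u", "?d"))
```

The number of masks grows as 4 to the power of the length, so for realistic lengths an attacker would likely keep only the masks considered most probable instead of trying them all.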

Personal Information

The dictionary which is used by Hashcat can contain personal information, which is then used to create password guesses. However, Hashcat does not distinguish between this attribute and, for example, sister passwords, which is why this attribute is classified as being used indirectly by Hashcat.

5.2.3 OMEN and OMEN+

OMEN comes up with password structures and corresponding candidates automatically, and allows length and some personal information to be specified in addition to that.

Password Structure

OMEN uses two phases, a training phase and a generation phase. The training phase takes a set of passwords and comes up with a Markov chain which is used to generate password candidates in the generation phase. Password structure is therefore considered to be used automatically.
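
To make the two phases concrete, the sketch below trains a first-order character-level Markov model on a list of passwords and then scores candidates with it. It is a deliberately simplified illustration of the general idea and does not reproduce OMEN's actual model or its ordering of candidates; all names and the toy training data are assumptions.

```python
from collections import defaultdict

# Multi-character sentinels cannot collide with single password characters.
START, END = "<s>", "</s>"


def train_bigram_model(passwords):
    """Training phase: estimate P(next character | previous character) from passwords."""
    counts = defaultdict(lambda: defaultdict(int))
    for password in passwords:
        chars = [START] + list(password) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {ch: n / total for ch, n in followers.items()}
    return model


def probability(model, candidate):
    """Score a candidate as the product of its transition probabilities."""
    chars = [START] + list(candidate) + [END]
    p = 1.0
    for prev, nxt in zip(chars, chars[1:]):
        p *= model.get(prev, {}).get(nxt, 0.0)
    return p


# Toy training set; a generation phase would emit candidates in order of decreasing score.
model = train_bigram_model(["ab", "ab", "ac"])
ranked = sorted(["ac", "ab", "ba"], key=lambda pw: probability(model, pw), reverse=True)
```

Because the model only sees character transitions, any string in the training set, whether a password or a dictionary word, influences the output in the same way.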

Dictionary Words

It is possible to include normal dictionary words in the data set which is used in the training phase. However, this is not intended, and the program will consider the words as passwords and generate new words which follow the same structure. Since including dictionary words in the data set would still have some effect on the guesses generated, this attribute is considered to be used indirectly.

Sister Passwords

Since you feed passwords to the program during the training phase and the passwords will be used directly when generating the Markov model, this attribute can be considered to be used directly. It is worth noting that the program does not make any distinction between the sister passwords of a targeted user and a password used by someone else.

Policies

The implementation used does not allow the attacker to specify requirements on special characters, but it is possible to specify the length of the password candidates to be guessed. Similar to PCFG, it is also possible to use only passwords which fulfil the policy in the training phase, which makes it more likely that password candidates allowed under the policy are generated. So the sub-attribute length is used directly by the program, while the sub-attribute special characters is used indirectly.

Personal Information

OMEN+ (which is an extension of OMEN and is part of the implementation used) works with so-called ‘hint files’, which are used as hints for targeted password guessing. These files can contain the first name, username, date of birth, and email address of the user. These attributes can therefore be said to be used directly by the program. Other personal attributes can still be included in the normal set of passwords and are then used indirectly, similar to dictionary words.

5.2.4 PassGAN

PassGAN comes up with password structures and corresponding candidates automatically, and allows length to be specified as well.

Password Structure

PassGAN uses a training phase and a guessing phase. During the training phase, a list of passwords is used to train the model that is then used to generate passwords during the guessing phase. This attribute can therefore be considered to be used automatically by PassGAN.

Dictionary Words

It is possible to include normal dictionary words in the data set which is used in the training phase. However, this is not intended, and the program will consider the words as passwords and generate new words which follow the same structure. Since including dictionary words in the data set would still have some effect on the guesses generated, this attribute is considered to be used indirectly.

Sister Passwords

Similar to dictionary words, the list of passwords could contain sister passwords of the user. These will not be directly considered as sister passwords, but they will still influence guesses and make them more likely to contain data of the attribute, which is why it is considered indirectly by the program.

Policies

It is possible to specify the maximum length of passwords to be guessed when the model is trained, so the sub-attribute length is directly considered. Requirements on special characters cannot be stated to the program directly, but the list of passwords used to train the model can be picked to only contain passwords which are accepted by the policy. The sub-attribute special characters is therefore considered to be used indirectly.

Personal Information

Similar to dictionary words, the list of passwords could contain the personal information of the user. This information will not be directly considered as personal information, but it will still influence guesses and make them more likely to contain data of the attribute, which is why it is considered indirectly by the program.

5.2.5 PCFG

PCFG comes up with password structures and corresponding candidates automatically, but does not allow for any direct augmentation with additional attributes.

Password Structure

PCFG uses two phases, a training phase and a generation phase. The training phase takes a set of passwords and comes up with a probabilistic context-free grammar which describes the structure. This PCFG is then used to generate password guesses entirely automatically, which is why password structure is considered to be used automatically.
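
As a rough illustration of what "structure" means here, the sketch below reduces each training password to runs of character classes and estimates how often each structure occurs. It covers only this first step of the training phase; a real PCFG also learns the probabilities of the strings that fill each run. The class letters, including the catch-all class ‘O’, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import groupby


def char_class(ch):
    """Map a character to a class: U(ppercase), L(owercase), D(igit) or O(ther)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "D"
    return "O"


def base_structure(password):
    """Collapse a password into runs of classes, e.g. 'Pass123' -> 'U1L3D3'."""
    return "".join(
        f"{cls}{len(list(run))}" for cls, run in groupby(password, key=char_class)
    )


def structure_probabilities(passwords):
    """Estimate the probability of each structure from its relative frequency."""
    counts = Counter(base_structure(pw) for pw in passwords)
    total = sum(counts.values())
    return {structure: n / total for structure, n in counts.items()}


# Toy training set.
probabilities = structure_probabilities(["abc1", "xyz2", "Hello"])
```

Guess generation would then start from the most probable structures and fill each run with the most probable strings of that class and length.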

Dictionary Words

It is possible to include normal dictionary words in the data set which is used to come up with the PCFG. However, this is not intended, and the program will consider the words as passwords and generate new words which follow the same structure. Since including dictionary words in the data set would still have some effect on the guesses generated, this attribute is considered to be used indirectly.

Sister Passwords

Since you feed passwords to the program during the training phase and the passwords will be used directly when generating the PCFG, this attribute can be considered to be used directly. It is worth noting that the program does not make any distinction between the sister passwords of a targeted user and a password used by someone else.

Policies

The PCFG implementation does not allow the attacker to specify length or character set manually. It is possible to feed it a password set which fulfils the policy, which makes it more likely that password candidates fulfilling the policy are generated. This attribute can therefore be considered to be used indirectly.

Personal Information

This information can be included in the dictionary, but the PCFG implementation does not consider it directly. The PCFG implementation used only looks at the uppercase, lowercase and numeric character groups. Since including personal information would change the guesses being generated, it can be considered to be indirectly used, just like dictionary words. A side note of interest is that extensions to PCFG which use personal information have been proposed. Li et al. showed in 2017 how the PCFG method could be updated to include personal information [19]. In addition to the previously mentioned symbols, they also included Name, Email address, Phone number, Birth date, Account name, and ID number.

5.2.6 PRINCE

PRINCE is intended to be an entirely automated program.

Password Structure

You cannot specify a password structure to PRINCE. PRINCE takes a set of words and combines them in various ways to form guesses. The user cannot specify a structure, neither directly nor indirectly, so this attribute is not considered.
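
The idea of combining words into guesses can be sketched as below: every chain of up to a few dictionary words whose total length matches a target length becomes a candidate. This only illustrates the combination step under simplifying assumptions; it does not reproduce the order in which PRINCE actually emits its guesses, and the function name and the limit on chain elements are hypothetical.

```python
from itertools import product


def chains_of_length(words, target_length, max_elements=3):
    """Yield concatenations of 1..max_elements words whose total length is target_length."""
    for n in range(1, max_elements + 1):
        for combo in product(words, repeat=n):
            if sum(len(word) for word in combo) == target_length:
                yield "".join(combo)


# Toy dictionary: 'abc' is produced twice, once as a single word and once as 'ab' + 'c'.
candidates = list(chains_of_length(["ab", "c", "abc"], 3, max_elements=2))
```

The duplicate in the toy output shows why a chaining approach can produce the same guess more than once when different word combinations concatenate to the same string.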

Dictionary Words

Using dictionary words as a base for password guessing is what PRINCE was made for, so this attribute can be said to be used directly.

Sister Passwords

Sister passwords can be included in the dictionary, but PRINCE will not distinguish between these words and any other words, or use them in any special way. So this attribute can be said to be used indirectly.

Policies

PRINCE uses the input dictionary to calculate the most common length, and proceeds to generate guesses of that length first. As such, it can be considered to use the sub-attribute length indirectly. There does not appear to be a way to get PRINCE to take policies on special character requirements into account. However, using a dictionary with only ‘words’ that fulfil the character requirements will likely lead to guesses which are accepted by the policy. For that reason, the sub-attribute special characters was decided to be indirectly considered as well.

Personal Information

Personal information can be included in the dictionary, but PRINCE will not distinguish between these words and any other words, or use them in any special way. So this attribute can be said to be used indirectly.

5.2.7 TarGuess

TarGuess comes up with password structures automatically, and in addition to that also attaches meaning to the structure in the form of allowing some personal information and sister passwords to be part of it.

Password Structure

The four different versions of TarGuess use a training phase and a generation phase, in which some structure is first learned and then used as the basis for guesses. This attribute can therefore be considered to be used automatically.

Dictionary Words

Dictionary words are not mentioned in the original TarGuess paper. While it may be possible to include normal dictionary words in the training data, it is not clear how, since the training data consists of passwords with related personal information. This attribute is therefore considered not to be taken into account at all.

Sister Passwords

TarGuess-II, TarGuess-III, and TarGuess-IV use a sister password of the user in some form. How a sister password should be used is determined by the training phase, which looks for common transformation patterns from one password to a different password. This information is then used when making guesses. Sister passwords are thus considered automatically, since their usage is determined in the training phase.

Policies

The authors of the original TarGuess paper did mention password policies a few times. Specifically, they mention that the process of how users transform a sister password from one service to another depends on the password policy (among other things). When they implemented this information in their programs, they used different password lists depending on the policy of the target site. This is a reason to classify the usage of this information as indirect, since it is similar to how OMEN and PassGAN can treat password policies.

Personal Information

In TarGuess-I, TarGuess-III and TarGuess-IV, all the specific sub-attributes of personal information are specified and used automatically. In addition to the sub-attributes considered in this report, the original TarGuess paper also looked at attributes like the gender, age, username and language of the user. These attributes can be said to be used automatically by the framework.

5.3 Overall effectiveness of password guessing programs

This section describes the results for the third research question, ‘how does the password guessing programs compare in terms of effectiveness?’.

5.3.1 Test to measure guessing speed

Table 3 shows how many guesses each program had generated after running for 10 minutes. In addition, the time it took to train each program on the RockYou dataset is included as well.

Program  Time to train     Generated guesses  Guesses/second
Hashcat  N/A               2002511070         3337518
OMEN     5 s               517330625          862217
PassGAN  14 h, 12 m, 38 s  22464064           37440
PCFG     9 m, 58 s         100029606          166716
PRINCE   N/A               65761893094        109603155

Table 3: The time it took to train each program, the number of guesses generated after 10 minutes, and the corresponding guesses/second

PassGAN was by far the slowest program, both in terms of training and generating guesses. The program which managed to generate the most guesses was PRINCE, with 65.8 billion guesses made in 10 minutes, followed by Hashcat (2.0 billion guesses).
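
The guesses/second column can be reproduced from the generated guesses by dividing by the 600 seconds of the run. The short check below uses the figures from Table 3; that the rates are truncated rather than rounded is an inference from the numbers (for OMEN, 517330625 / 600 is approximately 862217.7), not something stated in the text.

```python
RUN_SECONDS = 10 * 60  # each program ran for 10 minutes

# Generated guesses per program, taken from Table 3.
generated = {
    "Hashcat": 2002511070,
    "OMEN": 517330625,
    "PassGAN": 22464064,
    "PCFG": 100029606,
    "PRINCE": 65761893094,
}

# Integer division truncates, which matches the rates reported in the table.
rates = {program: count // RUN_SECONDS for program, count in generated.items()}
```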

When Hashcat ran in benchmark mode, the speeds in table 4 were obtained. The rounding in the table was performed by Hashcat, and was part of the output. NTLM and MD5 were the fastest, with more than 10 billion hashes calculated each second on the reference system.

Hash algorithm                          Hashes per second
NTLM                                    17211300000
MD5                                     10360900000
NetNTLMv1                               10264200000
LM                                      9800500000
SHA1                                    4007700000
SHA2-256                                1488300000
NetNTLMv2                               843800000
SHA2-512                                495500000
descrypt, DES (Unix), Traditional DES   415300000
Kerberos 5 TGS-REP etype 23             152600000
Kerberos 5 AS-REQ Pre-Auth etype 23     150100000
md5crypt, MD5 (Unix)                    4383800
WPA-EAPOL-PBKDF2                        188100
TrueCrypt PBKDF2-HMAC-RIPEMD160         134300
LastPass + LastPass sniffed             113800
Sha512crypt $6$, SHA512                 68338
KeePass 1                               66655
DPAPI masterkey file v1                 32214
RAR3-hp                                 21715
DPAPI masterkey file v2                 21291
RAR5                                    17887
Bcrypt $2*$, Blowfish (Unix)            7684
macOS v10.8+                            5872
7-Zip                                   4180
Bitcoin/Litecoin wallet.dat             2189

Table 4: Benchmark result on the reference system
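
Taken together, the guessing speeds in Table 3 and the hashing speeds in Table 4 hint at which stage limits an attack. The sketch below uses a deliberately simple model, stated here as an assumption: the guesser and the hash function form a pipeline, both speeds carry over unchanged when the two are combined, and the slower stage sets the overall speed. The function names and the shortened hash names are hypothetical.

```python
# Guesses per second from Table 3 and hashes per second from Table 4.
GUESS_RATES = {"Hashcat": 3337518, "OMEN": 862217, "PassGAN": 37440,
               "PCFG": 166716, "PRINCE": 109603155}
HASH_RATES = {"NTLM": 17211300000, "md5crypt": 4383800, "bcrypt": 7684}


def effective_rate(guesses_per_second, hashes_per_second):
    """Assumed model: the slower of the two pipeline stages limits the attack."""
    return min(guesses_per_second, hashes_per_second)


def bottleneck(guesses_per_second, hashes_per_second):
    """Name the limiting stage under the same assumed model."""
    return "guesser" if guesses_per_second < hashes_per_second else "hash function"


fast_hash = bottleneck(GUESS_RATES["PassGAN"], HASH_RATES["NTLM"])
slow_hash = bottleneck(GUESS_RATES["PassGAN"], HASH_RATES["bcrypt"])
```

Under this model even the slowest guesser in Table 3 outpaces a slow hash such as Bcrypt, while against NTLM every guesser in the table is the limiting stage.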

5.3.2 Test with password list

This section describes the results acquired from performing the tests on the LinkedIn data, described in section 4.3.3. The result for each considered metric is presented in its own section.

Number of passwords cracked

Each generated set of password guesses was checked against the corresponding test data. Figure 2 shows the distribution of correctly guessed passwords over time. For PassGAN, the difference between the highest and lowest number of correct guesses has been marked. For the other programs, the difference between the highest and lowest number of correct guesses was not big enough to be visible in the graph.

Figure 2: Number of passwords cracked for the first 100 000 000 guesses

The highest and lowest number of correctly guessed passwords for each program can also be seen in table 5.

Value     Hashcat  OMEN    PassGAN  PCFG     PRINCE
Smallest  229025   334997  114558   1178905  2797
Largest   230190   336272  168787   1182099  5369

Table 5: The best and worst number of correct passwords guessed after 100 000 000 guesses, for each program

Uniqueness of guesses

Among the 100 million password guesses generated by Hashcat, 77.41% were unique. For OMEN this number was 100.00%, for PassGAN 95.36%, for PCFG 95.44%, and for PRINCE 99.07%. This can be seen in table 6.

Training set  Hashcat   OMEN       PassGAN   PCFG      PRINCE
1             77411468  100000000  95728342  95435212  99195010
2             77404515  100000000  94835574  95439577  98149690
3             77407381  100000000  94835255  95440647  99020852
4             77410591  100000000  96214933  95434682  98986751
5             77415660  100000000  94702496  95437952  97572745
6             77465855  100000000  95357165  95441415  99395231
7             77414766  100000000  95480193  95438189  99460677
8             77403445  100000000  96896144  95441464  99558971
9             77412057  100000000  93857816  95440526  99553969
10            77398180  100000000  95678352  95438897  99803268
Average %     77.41%    100.00%    95.36%    95.44%    99.07%

Table 6: Number of unique password guesses that each program generated for each training set
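
The uniqueness figures can be understood as the number of distinct guesses divided by the number of guesses made. The sketch below shows that computation on a toy list and then recomputes the Hashcat average from the ten values in Table 6; the function name is an assumption, and holding 100 million guesses in a set as done here may need more memory than is practical.

```python
def unique_fraction(guesses):
    """Fraction of the guesses that are distinct from one another."""
    guesses = list(guesses)
    return len(set(guesses)) / len(guesses)


toy_fraction = unique_fraction(["a", "b", "a", "c"])

# Unique guesses out of 100 million for Hashcat, one value per training set (Table 6).
hashcat_unique = [77411468, 77404515, 77407381, 77410591, 77415660,
                  77465855, 77414766, 77403445, 77412057, 77398180]
average_percent = 100 * (sum(hashcat_unique) / len(hashcat_unique)) / 100_000_000
```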

The amount of overlap between the unique guesses of each pair of programs can be seen in table 7. This can also be seen in table 8, as a percentage of the unique guesses of each program, when compared with the other program. Overall, PCFG and Hashcat appear to have the most similarities in generated guesses, as 9.63% of the unique guesses generated by Hashcat were also found in the set of unique guesses made by PCFG. OMEN and PRINCE appear to have the least in common, with only 0.57% of the unique guesses generated by PRINCE being generated by OMEN as well.

Program  Hashcat    OMEN       PassGAN    PCFG       PRINCE
Hashcat  -          1749240.3  1093072.4  7452723.8  1669732
OMEN     1749240.3  -          2377058.5  7851419.3  564545.8
PassGAN  1093072.4  2377058.5  -          4214922.8  254132.0
PCFG     7452723.8  7851419.3  4214922.8  -          4060475.2
PRINCE   1669732    564545.8   254132.0   4060475.2  -

Table 7: Average number of overlapping guesses between each pair of programs

Program A \ Program B  Hashcat  OMEN   PassGAN  PCFG   PRINCE
Hashcat                -        2.26%  1.41%    9.63%  2.16%
OMEN                   1.75%    -      2.38%    7.85%  0.56%
PassGAN                1.27%    2.77%  -        4.91%  0.30%
PCFG                   7.81%    8.23%  4.42%    -      4.25%
PRINCE                 1.69%    0.57%  0.26%    4.10%  -

Table 8: Percentage of the guesses generated by a program (Program A) which were also generated by another program (Program B)
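
The relation between tables 7 and 8 can be sketched with set operations: the overlap is the size of the intersection of two sets of unique guesses, and each percentage divides that overlap by the number of unique guesses of one of the two programs. The function name and the toy sets are assumptions made for illustration.

```python
def overlap_percentages(unique_a, unique_b):
    """Return the overlap count and its share of each program's unique guesses."""
    shared = len(unique_a & unique_b)
    percent_of_a = 100 * shared / len(unique_a)
    percent_of_b = 100 * shared / len(unique_b)
    return shared, percent_of_a, percent_of_b


# Toy sets of unique guesses for two made-up programs.
program_a = {"a", "b", "c", "d"}
program_b = {"c", "d", "e"}
shared, percent_of_a, percent_of_b = overlap_percentages(program_a, program_b)
```

The same overlap gives two different percentages because the denominators differ, which is why table 8 is not symmetric even though table 7 is.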

When each program was checked against every other program, it turned out that PRINCE had generated the most guesses that no other program made, and Hashcat the fewest. This can be seen in table 9.

Training set   Hashcat     OMEN        PassGAN     PCFG        PRINCE
1              68458281    90453028    90727927    76151234    94139428
2              68278935    90201740    88103594    75355296    92948493
3              68484623    90259512    88915144    76050177    94423173
4              68521109    90304145    91124505    76499284    94387456
5              68482729    90307528    88249573    75471528    92838958
6              68551969    90451980    89964423    75997624    94468798
7              68478811    90170904    89288505    75924597    94779952
8              68545369    90298826    91999523    76734957    95081538
9              68483558    90036132    87829985    75815597    94585843
10             68489753    90483978    90629133    76157537    94853715
Average %      68.48%      90.30%      89.68%      76.02%      94.25%

Table 9: Number of unique guesses when compared to the guesses of all the other programs
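A count like the one in table 9 can be obtained with ordinary set operations. The sketch below is only an illustration of the idea, not the tooling used for the thesis; the function name and the tiny guess lists are hypothetical.

```python
def unique_to_each(guess_sets):
    """For each program, count the guesses that no other program generated."""
    result = {}
    for name, guesses in guess_sets.items():
        # Union of the guesses made by every other program.
        others = set().union(*(g for n, g in guess_sets.items() if n != name))
        result[name] = len(guesses - others)
    return result


# Tiny hypothetical guess sets, for illustration only.
guess_sets = {
    "A": {"password", "123456", "qwerty", "dragon1", "sunshine"},
    "B": {"password", "letmein", "123456"},
    "C": {"monkey", "qwerty"},
}
print(unique_to_each(guess_sets))  # {'A': 2, 'B': 1, 'C': 1}
```

For 100 million guesses per program the sets would be large, so a real run would probably work on sorted files instead of in-memory sets.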

Length distributions

The length distribution for the training data and the guesses made by each program can be seen in figure 3. Hashcat generally guessed passwords of the most common lengths, but it does not follow the distribution closely. OMEN heavily based its first 100 million guesses on the most common lengths in the training data. PassGAN appears to follow the overall length distribution closely. PCFG follows the overall length distribution of the training data fairly well, except that the number of guesses around length 7 is flattened out. PRINCE skewed its guesses toward short passwords, but with only a few shorter than six characters.


Figure 3: Length distributions of the training data, compared with the passwords guessed by each program
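A length distribution such as the ones compared in figure 3 can be computed by counting how often each length occurs and normalising by the total. This is a minimal sketch with a hypothetical sample list; it is not the script used to produce the figure.

```python
from collections import Counter


def length_distribution(passwords):
    """Return {length: share of the passwords that have that length}."""
    counts = Counter(len(p) for p in passwords)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}


# Hypothetical sample, for illustration only.
sample = ["123456", "password", "qwerty", "letmein1", "abc123", "iloveyou"]
print(length_distribution(sample))  # {6: 0.5, 8: 0.5}
```

Running the same function on the training data and on the guesses of a program gives two distributions that can be plotted against each other.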


5.3.3 Test with dictionary words and password list

This section describes the results acquired from performing the test described in section 4.3.3. For all of the programs, basing the guesses on RockYou was far superior to basing guesses on dictionary words. Combining the two sets of words did not increase the number of correct passwords guessed in the time span, except for PassGAN which did slightly better than using RockYou alone.

Wordlist    Hashcat   OMEN    PassGAN   PCFG    PRINCE
Wikipedia   5870      707     203       2747    981
RockYou     16231     4861    330       8952    8500
Combined    10604     2647    353       7815    3350

Table 10: Correct passwords per second

In all the tests, Hashcat found the most passwords, followed by PCFG, PRINCE, OMEN and PassGAN.

5.3.4 Test using dictionaries of different languages

This section describes the results acquired from performing the test described in section 4.3.3. The average number of correct password guesses per second can be seen in table 11. These numbers were calculated by dividing the number of correct guesses after one hour by 3600. Hashcat, OMEN, and PassGAN did better when using an English dictionary compared to using a Swedish one. In terms of correct guesses per second, they had an increase of 14%, 20% and 13%, respectively, when switching from Swedish to English. PCFG and PRINCE did better with a Swedish dictionary. They had a 50% and 68% increase in correct guesses, respectively, when switching to Swedish.

Wordlist   Hashcat   OMEN   PassGAN   PCFG   PRINCE
English    6.37      0.64   0.89      2.75   2.33
Swedish    5.57      0.53   0.79      4.14   3.91

Table 11: Correct passwords per second

In both tests, Hashcat found the most passwords, followed by PCFG, PRINCE, PassGAN and OMEN.
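The rate and the relative increases can be recomputed from table 11 as sketched below. The function names are made up for this illustration. Since the table is rounded to two decimals, the recomputed increases can differ by about one percentage point from the figures quoted in the text, which were presumably based on unrounded counts.

```python
SECONDS_PER_HOUR = 3600


def guesses_per_second(correct_after_one_hour):
    """Average number of correct guesses per second over a one-hour run."""
    return correct_after_one_hour / SECONDS_PER_HOUR


def percent_increase(old, new):
    """Relative increase from old to new, in percent."""
    return 100 * (new - old) / old


# Correct passwords per second, from table 11.
english = {"Hashcat": 6.37, "OMEN": 0.64, "PassGAN": 0.89, "PCFG": 2.75, "PRINCE": 2.33}
swedish = {"Hashcat": 5.57, "OMEN": 0.53, "PassGAN": 0.79, "PCFG": 4.14, "PRINCE": 3.91}

for program in english:
    worse, better = sorted((english[program], swedish[program]))
    language = "English" if english[program] > swedish[program] else "Swedish"
    print(f"{program}: {percent_increase(worse, better):.1f}% better with {language}")
```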

5.3.5 Test against passwords with a minimum length policy

This section describes the results acquired from performing the test described in section 4.3.3. The result can be seen in table 12.


All the programs performed worse when they had to guess longer passwords.

Hashcat performed better when it used input passwords of any length, compared to using only passwords of the targeted length. The difference between them decreased as the length of the targeted passwords increased.

OMEN did better when it trained on data with the same minimum length as the targeted passwords.

PassGAN did better when it trained on data with the same minimum length as the targeted passwords, except in the case of the minimum length being 12.

PCFG did better when using all the passwords instead of just the ones of the correct length. The relative difference between them increased as the length of the targeted passwords increased.

PRINCE did not find any password when it used the filtered password lists.

Targeted length   Training data    Hashcat   OMEN      PassGAN   PCFG      PRINCE
8+                Any length       541.10    1647.99   39.51     4131.80   6167.70
8+                Longer than 8    246.84    1648.36   54.20     2901.87   0.00
12+               Any length       49.39     12.78     3.07      393.00    849.55
12+               Longer than 12   13.78     45.72     0.16      255.55    0.00
16+               Any length       5.04      0.10      1.12      39.45     50.21
16+               Longer than 16   4.75      0.98      1.23      10.51     0.00
20+               Any length       0.50      0.00      0.41      4.75      0.00
20+               Longer than 20   0.74      0.00      0.64      0.46      0.00

Table 12: Correct passwords per second

Against all targeted lengths except for 20+, PRINCE found the most passwords, followed by PCFG. When the targeted length was over 20, PCFG found the most passwords in the given time. For shorter targeted lengths, OMEN performed better than Hashcat, while Hashcat performed better than OMEN when the targeted length was longer than 12.
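The comparison between training on passwords of any length and on filtered data can be expressed both as an absolute difference and as a ratio. The sketch below computes both from the Hashcat and PCFG columns of table 12; the names `TABLE_12` and `compare` are hypothetical.

```python
# Correct passwords per second from table 12, as (any length, filtered) pairs
# keyed by the minimum targeted length.
TABLE_12 = {
    "Hashcat": {8: (541.10, 246.84), 12: (49.39, 13.78),
                16: (5.04, 4.75), 20: (0.50, 0.74)},
    "PCFG": {8: (4131.80, 2901.87), 12: (393.00, 255.55),
             16: (39.45, 10.51), 20: (4.75, 0.46)},
}


def compare(program):
    """Return {targeted length: (absolute difference, ratio)} for one program."""
    return {
        length: (any_len - filtered, any_len / filtered)
        for length, (any_len, filtered) in TABLE_12[program].items()
    }


for program in TABLE_12:
    for length, (diff, ratio) in compare(program).items():
        print(f"{program} {length}+: difference {diff:.2f}, ratio {ratio:.2f}")
```

For Hashcat the absolute difference shrinks with the targeted length, while for PCFG the ratio between the two training sets grows.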

5.4 The implementation of the system

This section describes the results for the fourth research question, ‘how can an automated password guessing system be implemented in the Cyber Range at FOI?’.

The system was implemented as described in section 4.4.


5.4.1 Expected number of correct guesses

The figures below show the real guesses (‘Target’) as well as the fitted curve which describes the percentage of passwords guessed as a function of the number of guesses. Not all of the figures are included, for the sake of brevity. Figure 4 shows the result of Hashcat targeting all Swedish passwords, when it used a dictionary of all international passwords. Figure 5 shows the result of PCFG targeting all international passwords of length 8 or more, when it used a dictionary of all international passwords. Similar curve fits were obtained for each program and combination of settings, as described in section 4.4.

Figure 4: Fitted curve for Hashcat, when using all international passwords against all Swedish passwords


Figure 5: Fitted curve for PCFG, when using all international passwords against all international passwords of length 8 or more
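To give an idea of what fitting such a curve involves, the sketch below fits a curve to (number of guesses, percentage cracked) points by least squares. The functional form used in the thesis is the one described in section 4.4 and is not restated here; the logarithmic model p(n) = a + b ln(n) below is purely an assumption made for this illustration, and the data points are synthetic rather than measured.

```python
import math


def fit_log_curve(guess_counts, cracked_percent):
    """Least-squares fit of p(n) = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in guess_counts]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(cracked_percent) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cracked_percent))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic points generated from a known curve (a = 1.0, b = 2.5), not measured data.
guess_counts = [10**3, 10**4, 10**5, 10**6, 10**7]
cracked_percent = [1.0 + 2.5 * math.log(n) for n in guess_counts]

a, b = fit_log_curve(guess_counts, cracked_percent)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the fit would not be exact, and the residuals would indicate how well the chosen model describes a given program.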

The interesting thing here is how the programs are selected. Figure 6 and figure 7 show some fitted curves which Svedcrack uses to calculate which program to run. In both figures, PCFG and PRINCE used all the available international passwords, and Hashcat used all international passwords of length 20 or more. In figure 6, Blowfish is assumed to have been used. If Blowfish is used, the figure shows that PCFG should be used if the time to run is less than around two hours. In figure 7, the same scenario is calculated but for WPA-EAPOL-PBKDF2 hashes. If WPA-EAPOL-PBKDF2 hashes are targeted, the figure shows that PRINCE should be used if the time to run is more than a couple of minutes.


Figure 6: Expected result of Hashcat, PCFG, and PRINCE guessing Blowfish (Unix) hashes, which are international and of length 16 or more

Figure 7: Expected result of Hashcat, PCFG, and PRINCE guessing WPA-EAPOL-PBKDF2 hashes, which are international and of length 16 or more


6 Discussion

This chapter discusses and analyses the results and the methods used in this thesis. It also talks about societal and ethical aspects of this thesis.

6.1 Result

This section discusses the results, and in particular the relation between the results and the theory.

6.1.1 Interviews

The result of the interviews matched the theory fairly well. What the theory said was important was also what the interviewees mentioned. In particular, the method of using dictionaries with word mangling rules in Hashcat was stressed both by the theory and by the interviewees. All of the tested programs except for OMEN were mentioned by the first interviewee (directly or indirectly), which implies that the selection of programs was reasonable.

6.1.2 Design

Most of the programs are not automated, but instead leave it up to the attacker to run the program in the way that makes the most sense. This shows that entirely automated password cracking solutions which employ all the available theory are lacking or non-existent.


6.1.3 Test with password list

This section looks at the result of the test which used a password list as training data. Each metric is discussed below.

Number of passwords cracked

It is clear that the model used by PCFG outperformed the other programs in this test. This indicates that PCFG is the strongest in terms of generalising knowledge of a password list and using it to generate high-quality guesses.

PRINCE found by far the fewest passwords after 100 million guesses. This indicates that the algorithm does not necessarily care about putting the highest-quality guesses first.

Uniqueness of guesses

That Hashcat made relatively few unique guesses, compared to the other programs, depends on the choice of dictionary and mangling rules. A careful selection of dictionary words and mangling rules could guarantee that no overlapping guesses are made. But it is generally expected that a lot of overlapping guesses are generated, since a lot of passwords can be found by simple modifications of other passwords.

Length distributions

Hashcat applies various rules which append and remove characters from the input words, which appears to have flattened out the length distribution. All the other programs followed the length distribution in some manner.

6.1.4 Test with dictionary words and password list

This test shows that password lists perform much better than dictionary words in nearly all of the cases. The implication of this is that password lists contain more information about other passwords than dictionary words do, which is what you can expect. Password lists used for training will most likely contain many regular words, as well as additional information on the content of passwords.

6.1.5 Test using dictionaries of different languages

It is remarkable that Hashcat, OMEN and PassGAN did better with an English dictionary than a Swedish dictionary, when it comes to guessing passwords set by Swedish-speaking users. It is unclear why this is the case, but it does indicate that passwords picked by Swedish-speaking users have much in common with passwords picked by international users.

It can also be noted that PCFG and PRINCE had a large increase in correctly guessed passwords when switching from English to Swedish, which was expected. This increase was much larger (50% and 68%) than the increase the other three programs had when switching in the opposite direction (14%, 20%, and 13%).

6.1.6 Test against passwords with a minimum length policy

That all programs performed worse when guessing longer passwords was to be expected, due to the increased number of possible passwords when the required length increases. Discussion on the result of specific programs can be found below.

Hashcat

As can be seen in section 5.3.2, Hashcat guesses passwords which are fairly close to following the length distribution of the input dictionary, but with the output length distribution being ‘smoothed’ out. Since the length distribution of the output guesses does not follow the length distribution of the input, it can be expected that using more passwords for input would be better, which is what the result shows.

OMEN

Looking at the graph in section 5.3.2, we can see that OMEN appears to first and foremost guess passwords of the most common length. Since the most common passwords are of a rather short length, removing them from the training input would increase the length of the guessed passwords. This means that the result in table 12 is expected.

PassGAN

The most notable thing about PassGAN in this test is the result for the minimum length being 12. For all the other lengths, performance was better when using the filtered word lists. It is unclear why this is the case.

PCFG

Given that the guesses made by PCFG closely follow the length distribution of the training list, it was unexpected that using only passwords with the minimum targeted length would decrease the performance. One conclusion to draw from this is that even passwords of a shorter length than the targeted password length contain enough information to be useful for the PCFG model.

PRINCE

That PRINCE would not be able to guess many passwords using the filtered data sets was expected, as its output guesses are concatenations of words in the input list. If PRINCE uses a dictionary of words of length 8 or more, it would only be able to find passwords of length 16 or more. Passwords of longer lengths are harder to guess, so the probability of correctly guessing a password decreases if PRINCE uses a filtered word list.
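The length argument above can be illustrated with a simplified model in which every guess is the concatenation of exactly two input words. This is not PRINCE's actual algorithm, which builds chains with a varying number of elements; the function name and the word list are hypothetical.

```python
from itertools import product


def two_word_candidates(words):
    """Concatenate every ordered pair of words (a simplification, not PRINCE itself)."""
    return [first + second for first, second in product(words, repeat=2)]


# Hypothetical filtered word list in which every word has length 8 or more.
filtered_words = ["password", "sunshine1", "football"]
candidates = two_word_candidates(filtered_words)

print(len(candidates))                  # 9
print(min(len(c) for c in candidates))  # 16
```

Under this simplified model, no candidate can be shorter than twice the minimum word length of the input list.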

6.2 Method

This section discusses the methods used to answer the research questions. In particular, it talks about the faults of, and alternatives to, the selected methods.

6.2.1 Criticism of sources

This thesis, and in particular chapter 3, refers to sources found both on the web and in published scientific papers. The published papers were found by searching on aggregate sites such as Semantic Scholar and Google Scholar, but also directly on sites of specific scientific associations, such as IEEE and ACM. These papers are generally peer-reviewed with many citations and references, and can therefore be considered to be high-quality sources and thus suitable for this thesis. In particular, the papers behind OMEN [9], PassGAN [12], PCFG [34], and TarGuess [32] were found this way. Furthermore, the majority of the papers cited in this thesis are at most a couple of years old, and are still relevant in the field.

Other sources are of lower quality. PRINCE, for example, is a program which was not mentioned in the scientific literature, and the main sources are web pages and forum posts. The informal literature (such as [27]) makes reasonable claims about the strengths of PRINCE, which is why it was decided that it was worthwhile to put it to the test. Some of the sources about password content (such as [14]) are blog posts which have not been formally peer-reviewed, but describe methods and results in a way that makes them verifiable and believable. For that reason, such sources were considered in this thesis as well.

Overall, this thesis used a mix of academic resources and informal sources which seemed high-quality enough to be included.


6.2.2 Interviews

Using interviews as the method fitted well with the related research question, ‘How are password guessing attacks performed by people with professional experience?’ Unfortunately, the author could not find more than two interviewees, which makes the sample size rather small. For a more valid result, more interviews should have been carried out. Both the interviewees were experienced in the subject, so the result that was obtained can be assumed to be reliable.

There are some alternative methods which could have been used to answer the research question. An informal search could have been carried out on the Internet, to find blog posts and opinion pieces by people working with password guessing. This would have led to a more quantitative result, but a less qualitative and reliable one, which is why the method was not used. Another alternative would be to use questionnaires sent out to a number of people.

6.2.3 The design of the programs

This paper used a rather informal examination of the programs in order to answer the research question ‘What are the overall designs of current state-of-the-art automated password guessing programs?’ In particular, the measurements for degree of automation (‘Automatically’, ‘Directly’, ‘Indirectly’, and ‘None’) were made up for this paper and do not appear to have been used anywhere else. Despite the author's best efforts, no formal method to measure the degree of automation in a satisfactory manner could be found. Ideally, such a method should have been found and used to strengthen the validity of the result.

6.2.4 The effectiveness of the programs

To answer the research question ‘How does the password guessing programs compare in terms of effectiveness?’, multiple tests were carried out. In particular, tests involving some data set of leaked passwords, split up into multiple parts, were often used. The author believes that this was in general the right choice, as it closely emulates real attacks. The results can thus be considered to have a decent reliability and validity. However, some of the specifics involved, mentioned below, could have been done differently.

Test to measure guessing speed

In the guessing speed test, Hashcat was used in stdout mode, which prints the guesses. This means that data will have to be copied from the GPU to the CPU, which can be very slow. Hashcat is intended to be used entirely on the GPU, so this result is worse than what can be expected in normal operation.


Used sets of data

This thesis used lists of passwords which have been leaked from websites. Although such lists are common in the literature and easy to get access to, they are likely to contain different passwords than those used to protect important systems. In particular, RockYou and Bloggtrafik protected no valuable information at the time of attack, so it is possible the users picked weak and memorable passwords. However, more research about the differences between system passwords and website passwords would be needed to conclude to what extent this would affect the result of this report.

Different dictionaries

For the test comparing dictionaries and password lists, a dictionary using Wikipedia words was constructed. For the test using dictionaries, a dictionary of the same language was extracted from OpenOffice. To make the result comparable between both tests, the same general purpose dictionary should have been used for both.

Since the two tests had different purposes, the author felt that this was not required, and both ways to construct a dictionary are valid.

Usage of real passwords

This paper used various sets of data on real passwords picked by real people. The data was not intended to be used for security research when the users came up with the passwords, and can be considered to be private information. The data was also stolen from the websites by illicit means. However, given the ease of access to the password lists used in this report, as well as their wide usage in other papers on password guessing, it was decided that the information was okay to use.

Lack of personal information in tests

The relevant theory in section 3.2.1 discussed the importance of using personal information when guessing passwords of a user. The interviewees in section 5.1.1 also mentioned the importance of employing personal information when guessing passwords of a user. Yet, this information was not used when the practical tests were designed. This is because the personal information was considered to be sensitive information, which we would not analyse unless we could obtain consent from the relevant users. This would not be possible to accomplish as part of this thesis work, so it was decided to leave out any tests that use personal information in some form. The sets of passwords used for the tests were not considered to be personal information, since a person cannot generally be uniquely identified from a password alone.

One way this is solved in the literature is to ask users to come up with a password and hand it over to the researchers [35]. This was not done, as the data would not be as realistic as analysing a set of real passwords. This would thus make the result less reliable than using real passwords.

How the programs were run

Hashcat was only used in dictionary mode. While this mode is commonly used, it is not the only mode, and the mask-based mode could have been tested as well. That would require a selection of masks to be made, which would require more effort than picking the set of word mangling rules.

The reason why Hashcat was only used in dictionary mode was to make the result somewhat comparable with the other programs. The other programs use dictionaries, typically intended to contain passwords, to train some model from which guesses are then generated. Using Hashcat in dictionary mode was the most similar way to run the programs, in that it used the same information as the other programs to make its guesses. So as a way to compare the different programs to each other, using dictionary mode is more reliable than using a mask attack. But as a way to measure the effectiveness of Hashcat as a password cracker, both dictionary attacks and mask attacks should have been considered for a more valid result.

Password policies

One test related to password policies was performed, and it considered the length part. As mentioned by Komanduri et al. [17] (see section 3.2.2), length influences password strength more than requirements on special characters. This is why length was considered, and special characters were not. For a more complete result on how password policies can be considered by password guessers, requirements on special characters should have been considered as well.

6.2.5 The development and implementation of the system

The method to answer the research question ‘How can an automated password guessing system be implemented in the Cyber Range at FOI?’ was straightforward: use the result and observations from the other tests and implement a system. However, there are likely other ways to use the knowledge gained from the other tests. In particular, more tests could have been carried out in order to figure out the margin of error of the results.


6.3 The work in a wider context

The goal of this paper was to examine the ways passwords can be guessed in an automated manner, with the intention of creating a system which can be used for defensive training. However, the material presented in this paper could be used by an attacker to gain illicit access to the personal data of a user, even though this was not intended. The implemented system is for training of defensive users only, and is not released publicly, which minimises the risk of the material being used in such a way. Furthermore, it was decided that it is good that weaknesses of systems, regardless of whether they are technical or social, are brought to light so that they can be fixed.


7 Conclusions

This chapter concludes the paper, answers the research questions and poses questions for further research.

7.0.1 Answers to the research questions

This section answers the specific research questions asked in section 1.3, and summarises the results of the thesis.

How are password guessing attacks performed by people with professional experience?

Typically, password cracking sessions are carried out on GPUs running a program like either John the Ripper or Hashcat. The most commonly used type of attack was the dictionary attack, where each word is modified in multiple ways before being hashed and tested. The dictionaries used are often tailored specifically for the targeted user, with words with some relation to the targeted user being put in the dictionary. In addition to dictionary attacks, brute force attacks were often used early to eliminate easy candidates and at the end when all faster options have run out.


What are the overall designs of current state-of-the-art automated password guessing programs?

This paper looked into the designs of Hashcat, PRINCE, PCFG, OMEN, PassGAN, and TarGuess. In particular, it determined to what extent the programs consider password structures, dictionary words, sister passwords, password policies, and personal information when they make their guesses.

It was determined that most of them do not automatically consider these attributes. The exception to this is TarGuess, which considers common password structures, as well as sister passwords and personal information of the targeted user automatically.

OMEN, PassGAN, and PCFG learn structures of input words (typically lists of passwords) and generate new candidates based on them. As such, they only consider password structures automatically, but can implicitly make use of some personal attributes if the attacker chooses to put them in the dictionary. Some of the programs also have flags and options for things like minimum length, which can be set by the attacker using the program.

PRINCE is intended to be an entirely automated solution, and only takes a list of words as input. These words are then concatenated in various ways. Even though it runs with minimal interaction from the attacker, it cannot be said to consider the above attributes automatically. Personal information can be put in the list of words, which would make PRINCE include it in the password guesses implicitly, similar to OMEN, PassGAN, and PCFG.

Hashcat, being a general-purpose password cracker, has many options and settings which are set by the attacker, but does not consider any of the attributes in an automated fashion. Instead, it is up to the attacker to run the program in the way that best generates guesses.

How does the password guessing programs compare in terms of effectiveness?

In terms of guessing speed, PRINCE and Hashcat were the fastest on the system used, with over 100 million guesses being made by PRINCE each second. PassGAN was by far the slowest, with only about 40000 guesses per second.

The experiments on the LinkedIn data show that password guessing using PCFG generates the highest quality guesses among the tested programs. However, given that the PCFG implementation used is slower than many of the other tested programs, it could be worth using other programs as well. In particular, if hashing the guesses is slow, a slower but higher-quality program such as PCFG can be used. If hashing the guesses is fast, a faster but lower-quality program can be used.


Uniqueness among the password guesses was considered. In terms of no duplicates being generated, each program had more than 75% unique guesses. When overlap between the various programs was tested, the result showed that each program made fairly unique guesses, with between 68% and 94% of guesses not being made by any other program. This means that running multiple programs could be a viable strategy, since each program will be able to find passwords that the other programs do not.

The test using Wikipedia and RockYou as basis for guesses shows that a set of leaked passwords performs better than a set of general encyclopedia words for all of the tested programs. Combining the two sets would reduce performance in nearly all the cases. This shows that sets of leaked passwords contain more information about passwords than a general list of words does.

The test using both an English and a Swedish dictionary against a set of passwords from a Swedish-language website shows that Hashcat, OMEN, and PassGAN perform better with an English dictionary. PCFG and PRINCE both performed a lot better when they used a Swedish dictionary, which indicates that these two programs should be using a dictionary with the same language as the targeted user.

The test using passwords of different lengths shows that it is generally not worth filtering training data to only contain the targeted lengths, unless you are using OMEN or PassGAN. In the case of Hashcat, it could be worth using a dictionary with longer words, if the targeted passwords are long as well (more than 16 characters). If PCFG or PRINCE is being used, the dictionary should not be filtered to only contain words of the targeted password length.

In terms of correct password guesses per second, Hashcat (in dictionary mode), PCFG, and PRINCE were generally the most effective.

How can an automated password guessing system be implemented in the Cyber Range at FOI?

This paper has shown how an automated system based on the available theory, expert opinions, and test results can be built. The program was named Svedcrack. Svedcrack uses benchmarks of hashing speed and guessing speeds to determine how many guesses each program is able to make in a given time frame. It then proceeds to map the number of guesses for a particular program to a likelihood that the program will be able to find a password. Then the program with the highest likelihood is started, and the generated guesses are hashed and checked against the targeted password hash by Hashcat.
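The selection step described above can be sketched roughly as follows. This is not the Svedcrack implementation: the program names, rates and curve parameters are hypothetical, the logarithmic curve is an assumed stand-in for the fitted curves, and treating the slower of guess generation and hashing as the bottleneck is a simplifying assumption.

```python
import math


def guesses_in_time(seconds, guess_rate, hash_rate):
    """Guesses that can be both generated and hashed; the slower stage limits the total."""
    return seconds * min(guess_rate, hash_rate)


def log_curve(a, b):
    """Hypothetical fitted curve: fraction cracked after n guesses, clamped to [0, 1]."""
    def curve(n):
        if n < 1:
            return 0.0
        return max(0.0, min(1.0, a + b * math.log(n)))
    return curve


def pick_program(seconds, hash_rate, programs):
    """Return (name, likelihood) for the program with the highest expected success."""
    best = None
    for name, (guess_rate, curve) in programs.items():
        likelihood = curve(guesses_in_time(seconds, guess_rate, hash_rate))
        if best is None or likelihood > best[1]:
            best = (name, likelihood)
    return best


# Hypothetical programs: (guesses per second, fitted curve).
programs = {
    "fast_low_quality": (1e8, log_curve(0.0, 0.015)),
    "slow_high_quality": (1e3, log_curve(0.0, 0.020)),
}

print(pick_program(3600, 100, programs)[0])  # slow hash: slow_high_quality
print(pick_program(3600, 1e9, programs)[0])  # fast hash: fast_low_quality
```

With a slow hash both programs are limited to the same number of guesses, so the higher-quality guesser wins; with a fast hash the faster guesser can make enough additional guesses to overtake it.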


7.0.2 Future work

This paper demonstrated one way to build a system which employs some sort of knowledge to determine which password guesser is best suited for the task. It remains an open question if this is the best way to build such a system. More research on alternative ways to automate the password guessing process could answer this question.


Appendix


Used system

The programs were tested on an MSI GS73 Stealth 8RE with the following specifications.

• Kali GNU/Linux Rolling, 2019.1

• Nvidia GeForce GTX 1060 with 6 GB GDDR5

• Intel Core i7-8750H @ 2.20 GHz

• 16 GiB SODIMM DDR4 memory @ 2667 MHz


How the programs were used

This appendix describes how the programs were generally used in the tests.

Hashcat Hashcat version 5.1.0 was used. Hashcat ran in ‘straight’ (dictionary) mode, with different mangling rules used depending on the test.

hashcat -a 0 dictionary -r ruleset.rule --stdout

OMEN OMEN version 0.3.0 was used. The dictionary was first used to create the Markov probabilities. Once the training for a dictionary was done, the passwords were generated.

createNG --iPwdList=dictionary
enumNG -p

PassGAN The latest version (as of January 2019) from the GitHub repository was used. The dictionary was first used to train the model. Once the training for a set was done, the passwords were generated.

python2 train.py --output-dir trained --training-data dictionary
python2 sample.py --input-dir trained \
    --checkpoint trained/checkpoints/195000.ckpt \
    --batch-size 1024

PCFG The latest version (as of January 2019) from the GitHub repository was used. The dictionary was first used to train the model. Once the training for a set was done, the passwords were generated.

python3 pcfg_trainer.py --training dictionary --rule trained
python3 pcfg_manager.py --rule trained

PRINCE PRINCE version 0.22 was used.

pp64.bin < dictionary


Bibliography

[1] Search - password cracker - github, March 2019. URL https://github.com/search?q=password+cracker&type=Repositories.

[2] Hashcat - advanced password recovery, January 2019. URL https://hashcat.net/hashcat/.

[3] Code for cracking passwords with neural networks, May 2019. URL https://github.com/cupslab/neural_network_cracking.

[4] A deep learning approach for password guessing, May 2019. URL https://github.com/brannondorsey/PassGAN.

[5] Standalone password candidate generator using the prince algorithm, February 2019. URL https://github.com/hashcat/princeprocessor.

[6] Joseph Bonneau. The science of guessing: Analyzing an anonymized corpus of 70 million passwords. In 2012 IEEE Symposium on Security and Privacy. IEEE, may 2012. doi: 10.1109/sp.2012.49.

[7] Claude Castelluccia, Abdelberi Chaabane, Markus Dürmuth, and Daniele Perito. When privacy meets security: Leveraging personal information for password cracking.

[8] Joseph A. Cazier and B. Dawn Medlin. Password security: An empirical investigation into e-commerce passwords and their crack times. Information Systems Security, 15(6):45–55, dec 2006. doi: 10.1080/10658980601051318.

[9] Markus Dürmuth, Fabian Angelstorf, Claude Castelluccia, Daniele Perito, and Abdelberi Chaabane. OMEN: Faster password guessing using an ordered markov enumerator. In Lecture Notes in Computer Science, pages 119–132. Springer International Publishing, 2015. doi: 10.1007/978-3-319-15618-7_10.


[10] Dinei Florencio and Cormac Herley. A large-scale study of web password habits. In Proceedings of the 16th international conference on World Wide Web - WWW '07. ACM Press, 2007. doi: 10.1145/1242572.1242661.

[11] Paul A Grassi, James L Fenton, Elaine M Newton, Ray A Perlner, Andrew R Regenscheid, William E Burr, Justin P Richer, Naomi B Lefkovitz, Jamie M Danker, Yee-Yin Choong, Kristen K Greene, and Mary F Theofanos. Digital identity guidelines: authentication and lifecycle management. Technical report, jun 2017.

[12] Briland Hitaj, Paolo Gasti, Giuseppe Ateniese, and Fernando Perez-Cruz. PassGAN: A deep learning approach for password guessing.

[13] Shiva Houshmand, Sudhir Aggarwal, and Randy Flood. Next gen PCFG password cracking. IEEE Transactions on Information Forensics and Security, 10(8):1776–1791, aug 2015. doi: 10.1109/tifs.2015.2428671.

[14] Troy Hunt. Science of password selection, July 2011. URL https://www.troyhunt.com/science-of-password-selection/.

[15] Troy Hunt. Have i been pwned: Pwned websites, 2019. URL https://haveibeenpwned.com/PwnedWebsites.

[16] Patrick Gage Kelley, Saranga Komanduri, Michelle L. Mazurek, Richard Shay, Timothy Vidas, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Julio Lopez. Guess again (and again and again): Measuring password strength by simulating password-cracking algorithms. In 2012 IEEE Symposium on Security and Privacy. IEEE, may 2012. doi: 10.1109/sp.2012.38.

[17] Saranga Komanduri, Richard Shay, Patrick Gage Kelley, Michelle L. Mazurek, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Serge Egelman. Of passwords and people. In Proceedings of the 2011 annual conference on Human factors in computing systems - CHI '11. ACM Press, 2011. doi: 10.1145/1978942.1979321.

[18] Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer New York, 2013. doi: 10.1007/978-1-4614-6849-3.

[19] Yue Li, Haining Wang, and Kun Sun. Personal information in passwords and its security implications. IEEE Transactions on Information Forensics and Security, 12(10):2320–2333, oct 2017. doi: 10.1109/tifs.2017.2705627.

[20] Yunyu Liu, Zhiyang Xia, Ping Yi, Yao Yao, Tiantian Xie, Wei Wang, and Ting Zhu. GENPass: A general deep learning model for password guessing with PCFG rules and adversarial generation. In 2018 IEEE International Conference on Communications (ICC). IEEE, may 2018. doi: 10.1109/icc.2018.8422243.

[21] Jacob Löfvenberg. Användning av grafikkort för lösenordstestning. Technical report, December 2009.


[22] Pardon Blessings Maoneke, Stephen Flowerday, and Naomi Isabirye. The influence of native language on password composition and security: A socioculture theoretical view. In ICT Systems Security and Privacy Protection, pages 33–46. Springer International Publishing, 2018. doi: 10.1007/978-3-319-99828-2_3.

[23] William Melicher, Blase Ur, Sean M. Segreti, Saranga Komanduri, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. Fast, lean, and accurate: Modeling password guessability using neural networks. In Proceedings of the 25th USENIX Security Symposium, 2016.

[24] Arvind Narayanan and Vitaly Shmatikov. Fast dictionary attacks on passwords using time-space tradeoff. In Proceedings of the 12th ACM conference on Computer and communications security - CCS '05. ACM Press, 2005. doi: 10.1145/1102120.1102168.

[25] NotSoSecure. One rule to rule them all, June 2017. URL https://www.notsosecure.com/one-rule-to-rule-them-all/.

[26] Philippe Oechslin. Making a faster cryptanalytic time-memory trade-off. In Advances in Cryptology - CRYPTO 2003, pages 617–630. Springer Berlin Heidelberg, 2003. doi: 10.1007/978-3-540-45146-4_36.

[27] Jens Steube. Hashcat forums - practical prince, December 2014. URL https://hashcat.net/forum/thread-3914.html.

[28] Jens Steube. Prince - modern password guessing algorithm, December 2014. URL https://hashcat.net/events/p14-trondheim/prince-attack.pdf.

[29] Chen Su and Yuesheng Zhu. Using personal information to aid in guessing passwords of chinese webs. In 2017 IEEE International Conference on Communications (ICC). IEEE, may 2017. doi: 10.1109/icc.2017.7997248.

[30] Sarah J. Tracy. Qualitative Research Methods: Collecting Evidence, Crafting Analysis, Communicating Impact. JOHN WILEY & SONS INC, 2013. ISBN 1405192038.

[31] Chun Wang, Steve T.K. Jan, Hang Hu, Douglas Bossart, and Gang Wang. The next domino to fall. In Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy - CODASPY '18. ACM Press, 2018. doi: 10.1145/3176258.3176332.

[32] Ding Wang, Zijian Zhang, Ping Wang, Jeff Yan, and Xinyi Huang. Targeted online password guessing. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security - CCS'16. ACM Press, 2016. doi: 10.1145/2976749.2978339.

[33] Rick Wash, Emilee Rader, Ruthie Berman, and Zac Wellmer. Understanding password choices: How frequently entered passwords are re-used across websites. In Proceedings of the Twelfth Symposium on Usable Privacy and Security, 2016.

[34] Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. Password cracking using probabilistic context-free grammars. In 2009 30th IEEE Symposium on Security and Privacy. IEEE, may 2009. doi: 10.1109/sp.2009.8.

[35] J. Yan, A. Blackwell, R. Anderson, and A. Grant. Password memorability and security: empirical results. IEEE Security & Privacy Magazine, 2(5):25–31, sep 2004. doi: 10.1109/msp.2004.81.