Wednesday, February 22, 2012

Get the top 10 passwords of CSDN users with Hadoop + dumbo

  At the end of 2011, CSDN (Chinese Software Developer Network), the biggest programming community in China, had its user database leaked. It holds about 6M records, each including a username, a password, and an email. More importantly, all the data was stored in clear text, for some historical reasons.

  What I am interested in is which string was used most frequently as a password. Someone has posted a blog entry showing how to compute the top 10 passwords with Hadoop in Java (http://www.oschina.net/code/snippet_176897_8863). I will use dumbo (https://github.com/klbostee/dumbo) to do the same thing.

1, Put the dump file into HDFS
    bin/hadoop fs -put ../www.csdn.net.sql csdn.dat

2, Write the code with the dumbo library
The format of each line in the dump file is:
“username    # password     # email”
So we split the line on the ‘#’ character, take the 2nd field, which is the password, and remove the surrounding whitespace with strip().
Here’s the code. It’s much shorter than the Java version, thanks to dumbo:
    # -*- coding: utf-8 -*-
    import dumbo
    from dumbo import sumreducer


    def mapper(key, value):
        # Each line looks like "username # password # email";
        # emit the password (2nd field) with a count of 1.
        pwd = value.split('#')[1].strip()
        yield pwd, 1


    if __name__ == "__main__":
        # sumreducer adds up the counts emitted for each password.
        dumbo.run(mapper, sumreducer)
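
Before submitting the job, the map and sum steps can be checked locally without Hadoop. Below is a minimal sketch in plain Python, assuming the “username # password # email” format described above. The names extract_password and count_passwords are hypothetical helpers of mine, not part of dumbo, and the sample records are made up. Skipping lines with fewer than three fields is also my assumption; the dumbo job above does not do that.

```python
from collections import Counter


def extract_password(line):
    """Return the password field of a 'username # password # email' line."""
    return line.split('#')[1].strip()


def count_passwords(lines):
    """Count how often each password occurs, like mapper + sumreducer."""
    counts = Counter()
    for line in lines:
        # Assumption: ignore lines that do not have all three fields.
        if line.count('#') < 2:
            continue
        counts[extract_password(line)] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, not taken from the real dump.
    sample = [
        "alice # 123456789 # alice@example.com",
        "bob # 12345678 # bob@example.com",
        "carol # 123456789 # carol@example.com",
    ]
    print(count_passwords(sample).most_common(2))
```

Note that a password which itself contains ‘#’ would be cut short by this split, both here and in the dumbo mapper.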

3, Run Hadoop
    dumbo start csdntoppwd.py -hadoop ~/Desktop/hadoop-0.22.0/ -input csdn.dat -output toppwd


4, Sort the result
    dumbo cat toppwd/part* -hadoop ~/Desktop/hadoop-0.22.0/ | sort -t $'\t' -k2,2nr | head -n 10
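
The sort and head step can also be done in Python, which may be handy where a shell with sort is not available. This is only a sketch, assuming each output line is “password<TAB>count” as the sort command above expects. top_passwords is a hypothetical helper name, and the sample rows are three lines copied from my result.

```python
import heapq


def top_passwords(lines, n=10):
    """Return the n (password, count) pairs with the highest counts."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        # Split on the last tab: password on the left, count on the right.
        password, _, count = line.rpartition("\t")
        pairs.append((password, int(count)))
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])


if __name__ == "__main__":
    rows = ["12345678\t212761", "123456789\t235037", "11111111\t76348"]
    for password, count in top_passwords(rows, 2):
        print("%s\t%d" % (password, count))
```

heapq.nlargest keeps only the n best candidates while scanning, so it does not need to sort the whole list of distinct passwords.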

Result (password, count):
    123456789    235037
    12345678    212761
    11111111    76348
    dearbook    46053
    00000000    34953
    123123123    20010
    1234567890    17794
    88888888    15033
    111111111    6995
    147258369    5966

>,< As we can see, even among professional programmers, there are still a lot of people using such weak passwords…
