Wednesday, February 29, 2012

Set up CDH3 on Ubuntu 11.04/11.10 server & CentOS 5

Why CDH3
I was tired of version mismatch errors while setting up a Hadoop Common and HBase environment. I kept running into strange problems even after installing the stable versions of both Hadoop and HBase.
Cloudera is a leading commercial company in Hadoop development, so we should be able to trust its version control.

Install CDH3
follow instructions here:

[Ubuntu 11.04]
Cloudera didn't provide a package for Ubuntu 11.04, which is known by its code name "natty".
We can use "lucid" instead, so your cloudera.list under /etc/apt/sources.list.d/ should look like:
deb lucid-cdh3 contrib
deb-src lucid-cdh3 contrib

Config Java
#remove original gcj
[centos 5 only]
yum -y remove java-1.4.2-gcj-compat

#download jdk1.6, NOT 1.7

#install oracle java
rpm -ivh jre-6u31-linux-amd64-rpm.rpm

[ubuntu 11.04]
(On Ubuntu, OpenJDK works fine.)

Export variable JAVA_HOME
vim /etc/profile
 - export JAVA_HOME=/usr/java/jdk1.6.0_31
 - export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile

Add finger print to SSH 
Since Hadoop is a distributed system, its scripts reach the processes on each node over SSH, so we have to make sure no password is required when the name node connects to the target nodes. In this example, the namenode and the other nodes run on the same machine, so we just make sure it can ssh to localhost without a password.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
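
To double check this step without opening an SSH session, a small Python sketch like the one below can confirm that a public key really ended up in authorized_keys. The helper name is_key_authorized is made up for illustration; it is not part of OpenSSH or Hadoop, and it only compares the key type and key body.

```python
import os


def is_key_authorized(pub_key_path, authorized_keys_path):
    """Return True if the public key (type + body) is listed in authorized_keys."""
    with open(pub_key_path) as f:
        fields = f.read().split()
    if len(fields) < 2:
        return False  # not a "type body [comment]" public key line
    key_type, key_body = fields[0], fields[1]

    if not os.path.exists(authorized_keys_path):
        return False
    with open(authorized_keys_path) as f:
        for line in f:
            parts = line.split()
            # options may precede the key type, so look for the adjacent pair
            for i in range(len(parts) - 1):
                if parts[i] == key_type and parts[i + 1] == key_body:
                    return True
    return False
```

For the key generated above, you would call it with os.path.expanduser("~/.ssh/id_dsa.pub") and os.path.expanduser("~/.ssh/authorized_keys"), since ssh-keygen writes the public half next to the private key with a .pub suffix.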

Install single node configuration
Install the hadoop-0.20-conf-pseudo package (with apt-get on Ubuntu, or yum on CentOS).

Start hadoop
CDH3 does not ship the all-in-one start and stop scripts found in raw Hadoop's bin folder, so we have to start every node process with this command:
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done

Monitor running status
Use the command "jps" to list the Java processes that are currently running.
You should see 5 Hadoop-related processes after you start Hadoop (ignore the process IDs):

4316 SecondaryNameNode
4379 TaskTracker
4241 NameNode
4173 JobTracker
4101 DataNode
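
If you want to check this list automatically, here is a minimal Python sketch that parses jps-style output and reports which of the five daemons are missing. The function name missing_daemons and the sample text are mine, and the sketch assumes each line looks like "<pid> <name>" as in the listing above.

```python
EXPECTED_DAEMONS = {
    "NameNode", "SecondaryNameNode", "DataNode", "JobTracker", "TaskTracker",
}


def missing_daemons(jps_output, expected=EXPECTED_DAEMONS):
    """Return the sorted names of expected daemons not found in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) == 2:  # "<pid> <name>"; skip anything else
            running.add(parts[1])
    return sorted(set(expected) - running)


sample = """4316 SecondaryNameNode
4379 TaskTracker
4241 NameNode
4173 JobTracker
4101 DataNode
4500 Jps"""

print(missing_daemons(sample))  # → []
```

On a real machine you could feed it the text returned by subprocess.check_output(["jps"]) after decoding it to a string.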

 will be collected under folder
You can use "tail -f xxxx" to view the logs in real time.

Error you might meet
" PriviledgedActionException as:hdfs"
Jobtracker might crash...
#change permission of <hadoop.tmp.dir>
sudo chmod 777 /var/lib/hadoop-0.20/

#add user "hadoop" in "hadoop" group
[centos 5]
adduser hadoop -g hadoop

[ubuntu 11.04]
adduser hadoop --ingroup hadoop

#format hdfs by user hadoop
su hadoop
hadoop namenode -format

Verify installation
follow steps in:

Stop hadoop
for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done

Wednesday, February 22, 2012

Get the top 10 passwords CSDN users are using, with Hadoop + dumbo

  At the end of 2011, CSDN (China Software Development Network), one of the biggest programming communities in China, leaked its user database of 6M records, including usernames, passwords, and emails. More importantly, all the data was in clear text for historical reasons.

  What I am interested in is which strings were used most often as passwords. Someone has posted a blog entry showing how to compute the top 10 passwords with Hadoop in Java. I will use dumbo to do the same thing.

1, Put the dump file to HDFS
bin/hadoop fs -put ../ csdn.dat

2, Write code with dumbo library
The format of the dump file is:
“username    # password     # email”
So we split the string on the ‘#’ character first, take the 2nd slice, which is actually the password, and remove the blank characters around it with strip().
Here’s the code. It’s much shorter than the Java version, thanks to dumbo.
# -*- coding: utf-8 -*-
from dumbo import sumreducer


def mapper(key, value):
    # each record is "username # password # email"; slice 1 is the password
    pwd = value.split('#')[1].strip()
    yield pwd, 1


if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, sumreducer)
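
To try the logic without a Hadoop cluster, the following sketch simulates the map and sum-reduce steps in plain Python. It does not use dumbo at all: run_local and sum_reduce are hypothetical helpers written for this illustration, and the three sample records are made up, not taken from the dump.

```python
from collections import defaultdict


def mapper(key, value):
    # same idea as the dumbo mapper: emit (password, 1) for each record
    pwd = value.split('#')[1].strip()
    yield pwd, 1


def sum_reduce(pairs):
    """Group the (key, count) pairs by key and add up the counts."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_local(lines):
    """Feed every line through the mapper, then reduce all emitted pairs."""
    pairs = []
    for offset, line in enumerate(lines):
        pairs.extend(mapper(offset, line))
    return sum_reduce(pairs)


sample = [
    "alice    # 123456789     # alice@example.com",
    "bob    # 12345678     # bob@example.com",
    "carol    # 123456789     # carol@example.com",
]
counts = run_local(sample)
print(counts["123456789"], counts["12345678"])  # → 2 1
```

On the cluster, Hadoop does the grouping between the two steps and the reducer only sums the values of one key at a time; here both happen in sum_reduce for brevity.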

3, Run Hadoop
dumbo start -hadoop ~/Desktop/hadoop-0.22.0/ -input csdn.dat -output toppwd

4, Sort the result
dumbo cat toppwd/part* -hadoop ~/Desktop/hadoop-0.22.0/ | sort -t $'\t' -k2,2nr | head -n 10

123456789    235037
12345678    212761
11111111    76348
dearbook    46053
00000000    34953
123123123    20010
1234567890    17794
88888888    15033
111111111    6995
147258369    5966
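
The sort and head pipeline above can also be sketched in Python, which is handy if the result is already in memory. This assumes each line is "password<TAB>count", as in the piped output; top_n is a name I made up, and the sample reuses four of the rows shown above.

```python
import heapq


def top_n(lines, n=10):
    """Return the n (password, count) pairs with the highest counts."""
    counts = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # split on the last tab so a tab inside a password does not break parsing
        pwd, count = line.rsplit("\t", 1)
        counts.append((pwd, int(count)))
    return heapq.nlargest(n, counts, key=lambda pair: pair[1])


sample = [
    "dearbook\t46053",
    "123456789\t235037",
    "11111111\t76348",
    "12345678\t212761",
]
print(top_n(sample, 2))  # → [('123456789', 235037), ('12345678', 212761)]
```

heapq.nlargest avoids sorting the whole list when only the first few entries are needed.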

>,< As we can see, even among professional programmers, there are still a lot of people using such weak passwords…

Wednesday, February 15, 2012

Reverse proxy setting for Nginx to route by domain name

1, Assuming the server will own 2 domain names: and
You should get the same IP address when you ping these 2 domain names.
2, By default, when you visit it will redirect to
3, If you visit domain name, you will get the response from web server serving on
server
{
        listen 80;
        server_name ~^(www\.)?(?<domain>.+)$;

        location / {
                #default redirect
                proxy_pass;

                #redirect by domain name.
                if ($domain ~* mysite)
                {
                        proxy_pass;
                }
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
        access_log      /var/log/nginx/rproxy.access_log;
        error_log       /var/log/nginx/rproxy.error_log;
}
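
To make the routing decision easier to follow, here is a rough Python model of what the server_name regex and the if block do. It is only an illustration, not how nginx is implemented; the upstream labels DEFAULT_BACKEND and MYSITE_BACKEND and the host names in the test calls are placeholders, because the real addresses are not shown in the config above.

```python
import re

# same pattern as server_name, written with Python's named-group syntax
SERVER_NAME = re.compile(r'^(www\.)?(?P<domain>.+)$')


def pick_upstream(host,
                  default_upstream="DEFAULT_BACKEND",
                  mysite_upstream="MYSITE_BACKEND"):
    """Return the upstream label a request for this Host would be proxied to."""
    match = SERVER_NAME.match(host)
    domain = match.group("domain") if match else ""
    # "~*" in nginx is a case-insensitive regex match anywhere in the string
    if re.search("mysite", domain, re.IGNORECASE):
        return mysite_upstream
    return default_upstream


print(pick_upstream("www.mysite.com"))   # → MYSITE_BACKEND
print(pick_upstream("www.example.com"))  # → DEFAULT_BACKEND
```

Note that the optional www. prefix is stripped before the domain is captured, so both the bare domain and the www form end up at the same upstream.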


Monday, February 13, 2012

send gmail in django

Send Email by GMail in Django
1, Make sure POP3 is enabled in your Google Account

2, Add configuration to
EMAIL_HOST_PASSWORD = 'yourpassword'
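
For reference, a fuller settings sketch might look like the block below. Only EMAIL_HOST_PASSWORD comes from this post; the other values are my assumptions based on Django's standard e-mail settings and Gmail's SMTP endpoint with TLS on port 587, and the account name is a placeholder.

```python
# settings.py (sketch): everything except EMAIL_HOST_PASSWORD is an assumption
EMAIL_USE_TLS = True                        # Gmail SMTP requires an encrypted connection
EMAIL_HOST = 'smtp.gmail.com'               # Gmail's SMTP server
EMAIL_PORT = 587                            # submission port used with STARTTLS
EMAIL_HOST_USER = 'youraccount@gmail.com'   # placeholder account
EMAIL_HOST_PASSWORD = 'yourpassword'
```

With settings like these in place, django.core.mail.send_mail(subject, message, from_email, recipient_list) should send through Gmail.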

Update: you may need to turn on 2-step verification in your Google account settings. Otherwise, you may get a warning, and sending may be blocked by Google.

Using multi-cache backends in Django, and an issue in django-redis-cache

The missing Manual
  It seems the Django documentation missed something about the cache. Django has supported multiple caches since version 1.3, but neither the 1.3 documentation nor the dev documentation mentions how to route between them at all.
  The current documentation does say that a cache object already exists for the default cache:
"The cache module, django.core.cache, has a cache object that's automatically created from the 'default' entry in the CACHES setting: "
  So I guessed there must be an existing function for switching between cache backends: since the object is "created from the 'default' entry", I should be able to create one from another entry >.,<
  And I found something as I wish:
DEFAULT_CACHE_ALIAS = 'default'

cache = get_cache(DEFAULT_CACHE_ALIAS)

  This is the cache object using the default backend.
  Well, now we've found that the "get_cache" method is the missing piece.
Let's try it out. Suppose we have the following in our settings:

CACHES = {
    'default': {
        'BACKEND': 'redis_cache.RedisCache',
        'LOCATION': '',
        'OPTIONS': {
            'DB': 2,
            #'PARSER_CLASS': 'redis.connection.HiredisParser',
        }
    },

    '2ndcache': {
        'BACKEND': 'redis_cache.RedisCache',
        'LOCATION': '',
        'OPTIONS': {
            'DB': 3,
            #'PARSER_CLASS': 'redis.connection.HiredisParser',
        },
    }
}

The code below, wherever it runs, should save to two different Redis dbs.

from django.core.cache import cache, get_cache

cache.set('x', 1)  # default cache object, using redis db 2
get_cache('2ndcache').set('x', 2)  # should save to 2ndcache (redis db 3)

But unfortunately, you will find that both operations write the value to redis db 2.

An issue in redis_cache

That's because of a bug, or maybe we should call it an issue, in redis_cache 0.9.2, which we use as the Django cache backend to connect to Redis.

I've submitted the issue to the author, along with a patch on the issue page.
You can get the fixed code here:

Find the redis_cache package in your Python library path. It may be in XXX\Python26\Lib\site-packages\django_redis_cache-0.9.2-py2.6.egg\redis_cache

Modify the get_connection_pool method in class CacheConnectionPool as follows:

class CacheConnectionPool(object):
    _connection_pools = {}  # db: redis.ConnectionPool

    def get_connection_pool(self, host='', port=6379, db=1,
        password=None, parser_class=None,
        unix_socket_path=None):
        # keep one pool per db so caches on different dbs do not share a pool
        if self._connection_pools.get(db) is None:
            connection_class = (
                unix_socket_path and UnixDomainSocketConnection or Connection
            )
            kwargs = {
                'db': db,
                'password': password,
                'connection_class': connection_class,
                'parser_class': parser_class,
            }
            if unix_socket_path is None:
                kwargs.update({
                    'host': host,
                    'port': port,
                })
            else:
                kwargs['path'] = unix_socket_path
            self._connection_pools[db] = redis.ConnectionPool(**kwargs)
        return self._connection_pools.get(db)
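
To see why keying the pool by db matters, here is a toy model with no Redis involved. FakePool and both registry classes are hypothetical stand-ins. In particular, SharedPoolRegistry reflects my assumption that the unpatched code cached a single pool for all callers, which would explain the behaviour described above.

```python
class FakePool(object):
    """Hypothetical stand-in for redis.ConnectionPool; it only remembers its db."""

    def __init__(self, db):
        self.db = db


class SharedPoolRegistry(object):
    """Assumed pre-patch behaviour: the first pool created is reused by everyone."""
    _pool = None

    def get(self, db):
        if SharedPoolRegistry._pool is None:
            SharedPoolRegistry._pool = FakePool(db)
        return SharedPoolRegistry._pool


class PerDbPoolRegistry(object):
    """Patched behaviour: one pool is cached per db number."""
    _pools = {}

    def get(self, db):
        if db not in self._pools:
            self._pools[db] = FakePool(db)
        return self._pools[db]


shared = SharedPoolRegistry()
print(shared.get(2).db, shared.get(3).db)   # → 2 2

per_db = PerDbPoolRegistry()
print(per_db.get(2).db, per_db.get(3).db)   # → 2 3
```

With the shared registry, the second cache silently gets the pool that was created for db 2, so both writes land in the same database. Keying by db alone is the smallest fix; if your caches also differ by host or port, those would need to be part of the key as well.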