Wednesday, February 29, 2012

Set up CDH3 on Ubuntu 11.04/11.10 server & CentOS 5

Why CDH3
I was tired of version mismatch errors while setting up a Hadoop Common and HBase environment. I kept running into strange problems even after installing the stable releases of both Hadoop and HBase.
Cloudera is a leading commercial company in Hadoop development, so we can rely on its distribution to keep the component versions compatible with each other.

Install CDH3
follow instructions here:

[Ubuntu 11.04]
Cloudera didn't provide a package for Ubuntu 11.04, which is known by its code name "natty".
We can use the "lucid" packages instead. So your cloudera.list under /etc/apt/sources.list.d/ should look like:
deb lucid-cdh3 contrib
deb-src lucid-cdh3 contrib

Config Java
#remove original gcj
[centos 5 only]
yum -y remove java-1.4.2-gcj-compat

#download jdk1.6, NOT 1.7

#install oracle java
rpm -ivh jre-6u31-linux-amd64-rpm.rpm

[ubuntu 11.04]
(on Ubuntu, OpenJDK works fine)

Export variable JAVA_HOME
vim /etc/profile
 - export JAVA_HOME=/usr/java/jdk1.6.0_31
 - export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
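
Since the wrong Java version is an easy mistake to make here, a quick check can help. This is only a sketch: is_java6 is a helper name I made up, and it assumes the usual version banner that "java -version" prints (on stderr) for 1.6 releases.

```shell
# is_java6 VERSION_OUTPUT (hypothetical helper, not part of the JDK or CDH3)
# Succeeds only when the given `java -version` output reports a 1.6 release.
is_java6() {
    case "$1" in
        *'version "1.6.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only runs the check when a java binary is on the PATH.
if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so redirect it into the capture
    if is_java6 "$(java -version 2>&1)"; then
        echo "Java 1.6 detected"
    else
        echo "Warning: the java on PATH is not 1.6"
    fi
fi
```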

Add fingerprint to SSH
Since Hadoop is a distributed system, the working processes on each node communicate over SSH. So we have to make sure no password is required when the name node connects to the target nodes. In this example the namenode and the other nodes run on the same machine, so we just make sure it can ssh to localhost without a password.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/ >> ~/.ssh/authorized_keys
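
If you run the key setup more than once, the same key gets appended again each time. Here is a rough sketch of a repeatable version. authorize_key is a helper name I made up, and the usage lines assume the public key is ~/.ssh/id_dsa.pub, which is the file ssh-keygen writes next to the ~/.ssh/id_dsa private key from the command above.

```shell
# authorize_key PUBKEY AUTHFILE (hypothetical helper)
# Appends PUBKEY to AUTHFILE unless the same line is already there, then
# keeps AUTHFILE private to its owner (sshd may refuse keys from a file
# that group or others can write to).
authorize_key() {
    pubkey=$1
    authfile=$2
    [ -f "$pubkey" ] || { echo "missing $pubkey" >&2; return 1; }
    touch "$authfile"
    # -x whole line, -F fixed string, -f read the pattern from the key file
    if ! grep -qxF -f "$pubkey" "$authfile"; then
        cat "$pubkey" >> "$authfile"
    fi
    chmod 600 "$authfile"
}

# Usage (assumed paths, see the note above):
#   authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes localhost true && echo "passwordless ssh works"
```

Run a plain "ssh localhost" once first to accept the host fingerprint, because BatchMode cannot answer that prompt and the check would fail for the wrong reason.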

Install single node configuration
[centos 5]
yum install hadoop-0.20-conf-pseudo

[ubuntu 11.04]
apt-get install hadoop-0.20-conf-pseudo

Start hadoop
The all-in-one start and stop scripts from raw Hadoop's bin folder are not available here, so we have to start every node process through its init script:
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
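
The one-liner keeps going silently when a service fails to start. A small sketch that reports each result could look like this. run_services is a helper name I made up, and it takes the init script folder as a parameter only so it can be tried against a test folder first.

```shell
# run_services DIR ACTION (hypothetical helper)
# Calls every DIR/hadoop-0.20-* script with ACTION and reports each result.
# Returns non-zero if at least one script failed.
# On a real machine, run it as root so the init scripts are allowed to work.
run_services() {
    dir=$1
    action=$2
    failed=0
    for service in "$dir"/hadoop-0.20-*; do
        # skip non-executables, including the literal pattern when nothing matches
        [ -x "$service" ] || continue
        if "$service" "$action"; then
            echo "ok: ${service##*/} $action"
        else
            echo "FAILED: ${service##*/} $action" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Usage:
#   run_services /etc/init.d start
#   run_services /etc/init.d stop
```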

Monitor running status
Use the command "jps" to list the currently running Java processes.
You should see 5 Hadoop-related processes running after you have started Hadoop (ignore the process IDs):

4316 SecondaryNameNode
4379 TaskTracker
4241 NameNode
4173 JobTracker
4101 DataNode
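
Instead of comparing the list by eye, the check can be scripted. This is a minimal sketch: missing_daemons is a helper name I made up, and it only assumes that jps prints one "<pid> <name>" pair per line, as in the listing above.

```shell
# missing_daemons (hypothetical helper)
# Reads `jps` output on stdin and prints the name of every expected
# Hadoop daemon that is not in the listing.
missing_daemons() {
    listing=$(cat)
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the whole second field, so "NameNode" does not
        # accidentally match "SecondaryNameNode".
        printf '%s\n' "$listing" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$daemon"
    done
}

# Usage:
#   jps | missing_daemons
```

It prints nothing when all five processes are up. Any name it prints is a daemon that did not start or has already crashed.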

 will be collected under folder
You can use "tail -f xxxx" to view the logs in real time.

Errors you might meet
"PriviledgedActionException as:hdfs"
The JobTracker might crash...
#change permission of <hadoop.tmp.dir>
sudo chmod 777 /var/lib/hadoop-0.20/

#add user "hadoop" in "hadoop" group
[centos 5]
adduser hadoop -g hadoop

[ubuntu 11.04]
adduser hadoop --ingroup hadoop

#format hdfs by user hadoop
su hadoop
hadoop namenode -format

Verify installation
follow steps in:

Stop hadoop
for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done

