Wednesday, February 29, 2012

Set up CDH3 on Ubuntu 11.04/11.10 server & CentOS 5


----------------
Why CDH3
----------------
I was tired of version-mismatch errors while setting up a Hadoop Common and HBase environment. Even with both installed at their stable versions, I kept hitting strange problems.
Cloudera is the leading commercial company in Hadoop development, and its distribution ships Hadoop and HBase versions that are tested against each other, so we can trust its version control.




----------------
Install CDH3
----------------
Follow the instructions here:
https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation

[Ubuntu 11.04]
Cloudera didn't provide a package for Ubuntu 11.04, which goes by the code name "natty".
We can use the "lucid" packages instead. So your cloudera.list under /etc/apt/sources.list.d/ should look like:
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
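The two deb lines above can be dropped into place in one step. A minimal sketch — it writes to a local cloudera.list so the result is easy to inspect; on a real box the target is /etc/apt/sources.list.d/cloudera.list (as root), and you still need to run "sudo apt-get update" afterwards:

```shell
# Write the CDH3 repo entries in one shot; point the path at
# /etc/apt/sources.list.d/cloudera.list (as root) on a real machine.
cat > cloudera.list <<'EOF'
deb http://archive.cloudera.com/debian lucid-cdh3 contrib
deb-src http://archive.cloudera.com/debian lucid-cdh3 contrib
EOF
# both entries should be present
grep -c 'lucid-cdh3' cloudera.list
```

The grep at the end should print 2 — one match per repo line.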




----------------
Config Java
----------------
#remove original gcj
[centos 5 only]
yum -y remove java-1.4.2-gcj-compat

#download jdk1.6, NOT 1.7
wget http://download.oracle.com/otn-pub/java/jdk/6u31-b04/jdk-6u31-linux-i586-rpm.bin

#install oracle java (the file name must match what you downloaded)
chmod +x jdk-6u31-linux-i586-rpm.bin
./jdk-6u31-linux-i586-rpm.bin
# the self-extracting .bin installs the rpm for you; if it only
# extracts, finish with rpm -ivh on the extracted .rpm

[ubuntu 11.04]
(on Ubuntu, OpenJDK works fine)




----------------
Export variable JAVA_HOME
----------------
vim /etc/profile
 - export JAVA_HOME=/usr/java/jdk1.6.0_31
 - export PATH=$PATH:$JAVA_HOME/bin
source /etc/profile
echo $JAVA_HOME
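The same edit can be done without opening an editor. A sketch that appends the two exports to a local profile.sh and sources it — on a real box, append to /etc/profile as root instead; the JDK path below matches the 6u31 rpm install above, so adjust it if yours landed elsewhere:

```shell
# Append the exports non-interactively; swap profile.sh for
# /etc/profile (as root) on a real machine.
cat >> profile.sh <<'EOF'
export JAVA_HOME=/usr/java/jdk1.6.0_31
export PATH=$PATH:$JAVA_HOME/bin
EOF
# pick up the change in the current shell and verify it
. ./profile.sh
echo "$JAVA_HOME"
```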




----------------
Add finger print to SSH 
----------------
Since Hadoop is a distributed system, the worker processes on each node communicate over SSH, so the name node must be able to reach the other nodes without a password prompt. In this example the namenode and the other daemons run on the same machine, so we just need to make sure it can ssh to localhost without a password.
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
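One gotcha: sshd silently ignores the key when ~/.ssh or authorized_keys are group- or world-writable, so tighten the permissions after the two commands above. A minimal sketch:

```shell
# sshd refuses keys kept in files writable by others; lock them down
mkdir -p ~/.ssh
chmod 700 ~/.ssh
touch ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# after this, a password-less "ssh localhost" should just work
stat -c %a ~/.ssh/authorized_keys
```

The stat at the end should print 600.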




----------------
Install single-node configuration
----------------
[ubuntu 11.04]
sudo apt-get install hadoop-0.20-conf-pseudo

[centos 5]
sudo yum install hadoop-0.20-conf-pseudo




----------------
Start hadoop
----------------
Unlike a raw Hadoop tarball, CDH3 ships no start-all.sh or stop-all.sh in its bin folder; each daemon is started through its init script instead, so we run them all with one command:
for service in /etc/init.d/hadoop-0.20-*; do sudo $service start; done
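The loop just globs every init script the hadoop-0.20-conf-pseudo package dropped under /etc/init.d and invokes it with "start". The same pattern, sketched against dummy scripts (hypothetical daemon names) so you can see what each iteration runs:

```shell
# Build three stand-in init scripts (hypothetical names) and run the
# same glob loop against them; each one prints "<name> start".
mkdir -p demo-init.d
for name in datanode jobtracker namenode; do
  printf '#!/bin/sh\necho "%s $1"\n' "$name" > "demo-init.d/hadoop-0.20-$name"
  chmod +x "demo-init.d/hadoop-0.20-$name"
done
for service in demo-init.d/hadoop-0.20-*; do "$service" start; done
```

Swapping "start" for "status" or "stop" in the real loop drives every daemon the same way.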




----------------
Monitor running status
----------------
processes
Use the command "jps" to list the currently running Java processes.
You should see five Hadoop-related processes after starting Hadoop (ignore the process IDs):


4316 SecondaryNameNode
4379 TaskTracker
4241 NameNode
4173 JobTracker
4101 DataNode
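A quick way to confirm all five daemons came up is to count the expected names in the jps output. A sketch, using the sample output above as a here-string — on a real node, replace it with a live capture, jps_out="$(jps)":

```shell
# Count the five expected daemons in (sample) jps output; -w keeps
# "NameNode" from also matching the SecondaryNameNode line.
jps_out='4316 SecondaryNameNode
4379 TaskTracker
4241 NameNode
4173 JobTracker
4101 DataNode'
count=0
for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
  echo "$jps_out" | grep -qw "$d" && count=$((count+1))
done
echo "$count/5 daemons up"
```

With the sample input this prints "5/5 daemons up"; anything less means a daemon failed to start and its log is the place to look.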


logs
Logs are collected under the folder
/usr/lib/hadoop-0.20/logs/
You can use "tail -f xxxx" to view logs in real time.




----------------
Error you might meet
"org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs"
----------------
The JobTracker might crash with this exception. To fix it:
#change permission of <hadoop.tmp.dir>
sudo chmod 777 /var/lib/hadoop-0.20/


#add user "hadoop" in "hadoop" group
[centos 5]
adduser hadoop -g hadoop

[ubuntu 11.04]
adduser hadoop --ingroup hadoop

#format hdfs by user hadoop
su hadoop
hadoop namenode -format
exit




----------------
Verify installation
----------------
Follow the steps in:
https://ccp.cloudera.com/download/attachments/6553613/CDH3_Quick_Start_Guide_u3.pdf?version=1&modificationDate=1327713706000




----------------
Stop hadoop
----------------
for service in /etc/init.d/hadoop-0.20-*; do sudo $service stop; done
