Thursday, March 29, 2012

Write Data to HBase over thrift (Python)

[[Find Thrift interface file]]  

The Thrift interface file (Hbase.thrift) should be located under
$HBASE_HOME/src/main/resources/org/apache/hadoop/hbase/thrift
 

[[Install Thrift]]
I tested on both Ubuntu 11.04 x32 & CentOS 5 x64.
+Download
  wget http://mirrors.axint.net/apache/thrift/0.8.0/thrift-0.8.0.tar.gz
  tar -xzvf thrift-0.8.0.tar.gz
+Compile & Install
  cd thrift-0.8.0
  ./configure
  make
  sudo make install

  // Try the thrift command; you should get the usage information.
+Install Thrift library for your language
  Thrift provides libraries for many languages.
  I'll use Python for this example, so install the Python library for Thrift first.
  cd thrift-0.8.0/lib/py
  sudo python setup.py install

  // Verify the installation by running "import thrift" in the Python interactive shell.
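
  For example, a quick sanity check (just an illustrative snippet; the file name is arbitrary):

# check_thrift.py - should print the installed module path instead of raising ImportError
import thrift
print thrift.__file__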

[[Generate the HBase library "header files"]]
  thrift --gen py Hbase.thrift
  // You'll get a folder named "gen-py"; those are the Python "header files".


[[write a script]]
  Let's write a script to: 1, create a table; 2, show the table names; 3, insert some data; 4, read it back.
import sys
sys.path.append('/root/Desktop/working/gen-py')  # path to the generated bindings

from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

# Connect to the Thrift server
transport = TBufferedTransport(TSocket('10.1.2.127', 9090))
transport.open()
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)

# 1, Create a table with a single column family "data:"
columns = []
col = Hbase.ColumnDescriptor()
col.name = 'data:'
columns.append(col)
client.createTable('test', columns)

# 2, Show table names
print client.getTableNames()

# 3, Insert some data
mutations = [Hbase.Mutation(column='data:1', value='value1')]
client.mutateRow('test', 'row1', mutations)

# 4, Read it back
print client.getRow('test', 'row1')

transport.close()


[[test the script]]
  Make sure the Thrift server is running. (In this sample script, the Thrift server runs on the same machine.)
  If you cannot get your Thrift server running in a Cloudera-Manager-managed cluster, see the tail of http://blog.thisisfeifan.com/2012/03/set-up-cdh3-cluster.html
  Run the script ("python t2.py") and you'll get this stdout result:
   ['test']
   [TRowResult(columns={'data:1': TCell(timestamp=1333062795476L, value='value1')}, row='row1')]
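
  Besides getRow, the generated client also exposes scanner calls (scannerOpen / scannerGet / scannerClose). Here is a minimal follow-up sketch, assuming the same gen-py path, table and Thrift server address as in the script above, that scans the whole "test" table:

import sys
sys.path.append('/root/Desktop/working/gen-py')  # same gen-py path as above

from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

transport = TBufferedTransport(TSocket('10.1.2.127', 9090))
transport.open()
client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))

# Open a scanner over the whole table, fetching the "data:" column family.
# scannerGet returns a list with one TRowResult, or an empty list when done.
scanner = client.scannerOpen('test', '', ['data:'])
rows = client.scannerGet(scanner)
while rows:
    print rows[0].row, rows[0].columns
    rows = client.scannerGet(scanner)
client.scannerClose(scanner)

transport.close()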


[[Performance]]
 There's an article comparing the performance of the Thrift Python client with the HBase native Java API (via Jython):
http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html




[[Verify stability when a region server goes down]]
  As we know, HBase stores its data on the HDFS file system, and HDFS keeps replicas of each block across data nodes.
  You can read more about its data flow and coherency model in "Hadoop: The Definitive Guide", Chapter 3 > Data Flow.
  The setting that controls how many replicas are kept is "dfs.replication" in <hdfs-site.xml>. The default value is 3, which means every data block in HDFS is stored on three data nodes (the original plus two extra copies).
  Let's make a test case to verify whether it works as we expect.
 
  Steps:
  1, On region server "REGIONSRV3", create a table named "test" and write some data into it.
  2, Check the table status from the HBase master page: "http://HBASEMASTER:60010/table.jsp?name=test"
     It shows that the "Table Regions" are located on "REGIONSRV3", and the table is enabled.
  3, Then shut the region server "REGIONSRV3" down.
  4, We expect to still be able to query the table content from the cluster, since there are copies of the data on the other live nodes (see the Python sketch after step 5).
     Run "scan 'test'" in the hbase shell and we can see the result. That's what we expected :)


   hbase(main):025:0> scan 'test'
   ROW                 COLUMN+CELL
   row1                column=data:1, timestamp=1332891318009, value=value1
   row2                column=data:2, timestamp=1333049644415, value=value2
   row3                column=data:3, timestamp=1333053002019, value=value3
   3 row(s) in 0.0890 seconds

  5, Check the table properties again at "http://HBASEMASTER:60010/table.jsp?name=test"
     The "Table Regions" of the table have moved to "REGIONSRV4".

Swap any smartphone for a Windows Phone 7...

http://dealsea.com/view-deal/62123

  It sounds like a good deal, though I don't know what I would do with a Windows Phone 7 now...

  There's an MS Store near me, at Valley Fair in Santa Clara. I went there yesterday after work, at around 7:30 PM, with an old Sony Ericsson X8, which is an Android 2.0 phone.
  I was glad to see there were just 5 people in a line near the door. But the MS Store staff told me, rather rudely, that the line had been closed since 5:00 PM and would reopen at 7:00 AM the next day.

 Oh, s***... Forget it. Goodbye, Windows Phone 7... I was so close to you once.

Wednesday, March 28, 2012

Set up a CDH3 cluster

[[Node distribution]]

I'm building a 4-node cluster. Here's the plan:
Node A:
  HDFS: namenode+job_tracker
  HBase: hbase master, zookeeper server(quorum peer)
  Other: Hue
Node B:
  HDFS: datanode1+task_tracker
  HBase: region server1
Node C:
  HDFS: datanode2+task_tracker
  HBase: region server2
Node D:
  HDFS: datanode3+task_tracker
  HBase: region server3

[[Manually set up the cluster on Ubuntu 11.10 x64 servers]]

Follow these steps strictly:  https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster
*configuration management:
  list all profiles
    sudo update-alternatives --display hadoop-0.20-conf
  add new profile based on conf.empty (use a greater value to indicate the profile priority)
    sudo cp -r  /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster
    sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
  remove profile
    sudo update-alternatives --remove hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster

*config DNS
  edit /etc/hosts

*config hostname
  /etc/hostname (Ubuntu)
  /etc/sysconfig/network (CentOS)
  reboot

Problems you might meet:
  > Exception description containing "org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java"
    sol 1, make sure the owner of the dfs.name.dir and dfs.data.dir directories is the hdfs user. (https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster#CDH3DeploymentonaCluster-ConfiguringLocalStorageDirectoriesforUsebyHDFSandMapReduce)
    sol 2, delete the folders configured in <core-site.xml>:hadoop.tmp.dir, <hdfs-site.xml>:dfs.data.dir and <hdfs-site.xml>:dfs.name.dir

  > Exception "java.net.BindException: Problem binding to xxx/ip:port", "ipc.RPC....."
    make sure you're starting job traker on the machine you configured. JobTracker is the one, like "master" while TaskTrackers works like "slaves"


  > Other strange problems:
    you might need to disable IPv6 on Ubuntu
    make sure /etc/hosts & /etc/hostname are correctly configured



[[Deploy & manage the cluster by tool]]

Once you've tried Cloudera Manager, you may regret having done it manually. However, setting up the cluster manually is valuable experience that helps you understand the system and dig into confusing issues.

Requirement:  https://ccp.cloudera.com/display/FREE373/Requirements+for+Cloudera+Manager
  I used CentOS 5 x64 machines to build the cluster.

Steps:
  Follow: https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide

If you're using VMs:
  1, prepare 2 VMs: 1 for the namenode, 1 for a datanode
  2, install Cloudera Manager on vm1, then install the components on vm2 through the Cloudera Manager web UI
  3, configure vm1~vm5 in the DNS file (/etc/hosts) on both vm1 and vm2
  4, configure key-based SSH login from vm1 to vm2
  5, clone vm2 to vm3, vm4, vm5...
  6, configure the IP (/etc/sysconfig/network-scripts/ifcfg-eth0) and hostname (/etc/sysconfig/network) on vm3~vm5, and restart the network service (/sbin/service network restart)
  7, refresh the Cloudera Manager UI; you should see the 5 VMs are ready.

Problems you might meet:
+ERROR 1: cannot recognize the OS version
  > you should use Red Hat-compatible systems or SUSE systems

+ERROR 2: errors while installing components in the cluster
  error msg in the installation detail: "scm agent could not be started, giving up"
  > the host name is not set properly
  follow the instructions here: https://ccp.cloudera.com/display/CDHDOC/CDH3+Deployment+on+a+Cluster#CDH3DeploymentonaCluster-ConfigureNetworkNames
  (If you cannot get the correct result with the command host -v -t A `hostname` as the instructions say, it doesn't matter; forget it.)
  > error msg "remote package cloudera-manager-agent is not available, giving up" while installing the Cloudera Manager agent package
  the target node is not x64; there is no suitable cloudera-manager-agent package for x32

+ERROR 3:
  all components install successfully on a node, but the node does not show up in the hosts list.
  > turn off the firewall
  use "/sbin/service iptables status" to check the firewall status and turn it off if necessary:
  /sbin/service iptables stop
  chkconfig --level 35 iptables off

+ERROR 4:
  permission error "org.apache.hadoop.security.AccessControlException: Permission denied: user=xxx, access=WRITE, inode=" while running the sample jobs in the HUE web UI.
  > there's a warning "This job commonly fails because /user/<your-user-name> is not writable." before you run the sample job "Pi Calculator".
  The default permission of the "/user/" folder in HDFS is "drwxr-xr-x User:hdfs, Group:hadoop",
  so we have to change its permission.
  Now, check this link: https://ccp.cloudera.com/display/CDHDOC/Hue+Installation#HueInstallation-Authentication
  It says "every Hue user who wants to use the Shell application must have a Unix user account with the same name on the server that runs Hue."
  >>
  solution 1: add the user "hdfs" through the "Admin console" in the HUE web UI, and run the job as the user "hdfs".
  solution 2: add the user "hdfs" through the "Admin console" in the HUE web UI, and change the permission of "/user/" to "everyone writable" (777).
  solution 3: in a terminal, run "sudo -u hdfs hadoop fs -chmod 777 /user/",
  then run the job as other users.


[[Add more services on Cloudera Manager-managed cluster]]

  A Cloudera Manager-managed cluster uses a database to store its configuration, so you won't find any meaningful conf under /etc/hadoop or /etc/hbase. When you have to add services manually, you have to fill in the conf files under /etc/hadoop or /etc/hbase; otherwise a manually installed service will not find the configuration the cluster is currently using.
Generate the client configuration
  > Click the "setting" icon in the top right corner.
  Select "Export" and "Download Configuration Script".

Copy the 2 configuration folders to each node, following the 2 readme text files in the folders
  > Hadoop: use the alternatives tool to point hadoop at the new configuration folder "hadoop-conf/"
  > HBase & ZooKeeper: export HBASE_CONF_DIR and ZOOKEEPER_CONF pointing to "hbase_conf/"

Start the service (e.g. the Thrift server)
  run: hbase thrift start
  If you did not follow the previous step, you will get an error like "zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect".
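
  A quick way to confirm that the Thrift server came up with the right configuration is to connect from Python and list the tables. A minimal sketch, assuming the gen-py bindings generated as in the Thrift/Python post above and a Thrift server listening on localhost:9090:

import sys
sys.path.append('/root/Desktop/working/gen-py')  # path to the generated bindings

from thrift.transport.TSocket import TSocket
from thrift.transport.TTransport import TBufferedTransport, TTransportException
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

transport = TBufferedTransport(TSocket('localhost', 9090))
try:
    transport.open()
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    print 'Thrift server is reachable, tables:', client.getTableNames()
except TTransportException as e:
    # The Thrift server is not running or not listening on port 9090
    print 'Cannot reach the Thrift server:', e
finally:
    transport.close()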

Update (good news):
 In CDH4, you can use "Deploy Client Configuration..." in the service actions menu instead of copying configuration files to each cluster member.
e.g.
 after you deploy the client configuration,
 you will see a folder "conf.cloudera.hbase1/" under /etc/hbase, and /etc/hbase/conf will point to "/etc/hbase/conf.cloudera.hbase1/"
 you can start thrift with the command "hbase thrift start" (or install the package hbase-thrift)