Thursday, March 29, 2012

Write Data to HBase over Thrift (Python)

[[Find Thrift interface file]]  

The Thrift interface file (Hbase.thrift) should be located under
$HBASE_HOME/src/main/resources/org/apache/hadoop/hbase/thrift
 

[[Install Thrift]]
I tested this on both Ubuntu 11.04 (32-bit) and CentOS 5 (64-bit).
+Download
  wget http://mirrors.axint.net/apache/thrift/0.8.0/thrift-0.8.0.tar.gz
  tar -xzvf thrift-0.8.0.tar.gz
+Compile & Install
  cd thrift-0.8.0
  ./configure
  make
  sudo make install

  // Try the "thrift" command; you should get the usage information.
+Install Thrift library for your language
  Thrift provides libraries for many languages.
  I'll use Python for this example, so install the Python library for Thrift first.
  cd thrift-0.8.0/lib/py
  sudo python setup.py install

  // Verify the installation by running "import thrift" in the Python interactive shell, as in the quick check below.
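  A quick sanity check from the Python shell (nothing HBase-specific yet, just confirming the Thrift package is importable):

  import thrift
  print thrift.__file__  # prints where the thrift package was installed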

[[Generate Hbase library "header file"]]
  thrift --gen py Hbase.thrift
  // You'll get a folder named "gen-py"; it contains the generated Python modules (the "header files").
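  A quick way to confirm the generation worked (assuming you ran the command in the directory that now contains gen-py):

  import sys
  sys.path.append('./gen-py')   # wherever the gen-py folder landed
  from hbase import Hbase       # should import without errors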


[[write a script]]
  Let's write a script to: 1, create a table; 2, show table names; 3, insert some data; 4, read it back.
  import sys
  sys.path.append('/root/Desktop/working/gen-py')  # path to the generated "gen-py" folder

  from thrift.transport.TSocket import TSocket
  from thrift.transport.TTransport import TBufferedTransport
  from thrift.protocol import TBinaryProtocol
  from hbase import Hbase

  # Connect to the HBase Thrift server (port 9090 by default)
  transport = TBufferedTransport(TSocket('10.1.2.127', 9090))
  transport.open()
  protocol = TBinaryProtocol.TBinaryProtocol(transport)
  client = Hbase.Client(protocol)

  # 1. Create a table with one column family "data"
  col = Hbase.ColumnDescriptor()
  col.name = 'data:'
  columns = [col]
  client.createTable('test', columns)

  # 2. Show table names
  print client.getTableNames()

  # 3. Insert some data
  mutations = [Hbase.Mutation(column='data:1', value='value1')]
  client.mutateRow('test', 'row1', mutations)

  # 4. Read it back
  print client.getRow('test', 'row1')
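  One thing the script above skips: closing the connection when it's done. Using the same transport object created above:

  transport.close()  # close the Thrift transport once we're finished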


[[test the script]]
  Make sure the Thrift server is running. (In this sample script, the Thrift server is running on the same machine.)
  If you cannot get your Thrift server running in a Cloudera-Manager-managed cluster, look at the tail of http://blog.thisisfeifan.com/2012/03/set-up-cdh3-cluster.html
  Run the script with "python t2.py" and you get this stdout result:
  ['test']
  [TRowResult(columns={'data:1': TCell(timestamp=1333062795476L, value='value1')}, row='row1')]


[[Performance]]
 There's an article that compares the performance of the Thrift Python client against the HBase native Java API (via Jython):
http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html




[[Verify the stability when region server down]]
  As we know, HBase is built on top of the HDFS file system, and HDFS keeps replicas of each block on multiple data nodes.
  You can read more in "Hadoop: The Definitive Guide", Chapter 3 > Data Flow.
  The setting that controls the number of replicas is "dfs.replication" in hdfs-site.xml. The default value is 3, meaning every data block in HDFS has 3 copies (the original plus 2 replicas).
  Let's make a test case to verify that it works as we expect.
 
  Steps:
  1, On region server "REGIONSRV3", create a table named "test" and write some data into it.
  2, Check the table status from the HBase master page: "http://HBASEMASTER:60010/table.jsp?name=test"
  It shows that the table's region is located on "REGIONSRV3" and the table is enabled.
  3, Then shut this region server "REGIONSRV3" down.
  4, We expect to still be able to query the table content from the cluster, because there are copies of the data on other live nodes.
   Run "scan 'test'" in the hbase shell and we can see the result. That's what we expected :) (A Thrift-based version of this check is sketched after this section.)


   hbase(main):025:0> scan 'test'
   ROW                COLUMN+CELL
   row1               column=data:1, timestamp=1332891318009, value=value1
   row2               column=data:2, timestamp=1333049644415, value=value2
   row3               column=data:3, timestamp=1333053002019, value=value3
   3 row(s) in 0.0890 seconds

  5, Check the table properties again at "http://HBASEMASTER:60010/table.jsp?name=test"
   The table's region has moved to "REGIONSRV4".
