Tuesday, June 26, 2012

HBase + Thrift performance test 2

Test purpose and design
Nginx works as a load balancer, with 8 tornado instances serving at the back end. Each tornado instance owns its own thrift connection to HBase. Since tornado is a single-threaded web server, there's no "thread safe" issue like the one mentioned in the previous blog post here.

Code & Configuration file
Server side code: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/web%20service%20test/tornado_1.py
Test driver code: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/web%20service%20test/emu_massdata.py
Nginx configuration: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/web%20service%20test/Nginx%20setting/hbasetest
Supervisord configuration: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/web%20service%20test/Supervisord%20setting/supervisord.conf

CPU: Intel(R) Xeon(R) CPU 5150 @ 2.66GHz (4 cores)
Memory: 4GB
Network: LAN

deploy tornado application

configuration file for supervisord
We'll start 8 tornado instances; they will listen on ports 8870~8877.
command=python /root/tornado_1.py 887%(process_num)01d
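Expanded into a full [program] section, the configuration might look like the sketch below. This is an assumption based on the command line above and the process names shown in the supervisorctl output; the actual file is linked in the repo:

```ini
[program:hbasewstest]
; 887%(process_num)01d expands to ports 8870..8877 for processes 0..7
command=python /root/tornado_1.py 887%(process_num)01d
process_name=%(program_name)s_%(process_num)01d
numprocs=8
```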

verify working processes
root@fdcolo8:/etc/nginx/sites-enabled# supervisorctl
hbasewstest:hbasewstest_0        RUNNING    pid 2020, uptime 18:27:55
hbasewstest:hbasewstest_1        RUNNING    pid 2019, uptime 18:27:55
hbasewstest:hbasewstest_2        RUNNING    pid 2034, uptime 18:27:53
hbasewstest:hbasewstest_3        RUNNING    pid 2029, uptime 18:27:54
hbasewstest:hbasewstest_4        RUNNING    pid 2044, uptime 18:27:51
hbasewstest:hbasewstest_5        RUNNING    pid 2039, uptime 18:27:52
hbasewstest:hbasewstest_6        RUNNING    pid 2054, uptime 18:27:49
hbasewstest:hbasewstest_7        RUNNING    pid 2049, uptime 18:27:50

Nginx configuration
Create a new server profile under /etc/nginx/sites-enabled (the upstream servers are the eight tornado instances on ports 8870~8877):
upstream backends {
    server;
    server;
    server;
    server;
    server;
    server;
    server;
    server;
}

server {
    listen 8880;
    server_name localhost;
    location / {
        proxy_pass_header Server;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Scheme $scheme;
        proxy_pass http://backends;
        proxy_next_upstream error;
    }
    access_log /var/log/nginx/hbasewstest.access_log;
    error_log /var/log/nginx/hbasewstest.error_log;
}

Verify Nginx works
After the new Nginx profile is created, make sure Nginx is listening on port 8880:
service nginx reload
lsof -i:8880

The test-driver application starts 10 threads at the beginning, which continually send 300KB data packages via HTTP POST.
The web application splits each 300KB JSON payload into hundreds of 1KB pieces and transforms them into HBase records. The web application uses batch write mode: each incoming JSON payload triggers only one write...
Check the source code (linked above) for more detail.
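The split-and-batch flow can be sketched in a few lines. This is a sketch only; `split_payload` is a hypothetical helper, not a function from the repo:

```python
def split_payload(payload, chunk_size=1024):
    """Split one large JSON payload string into ~1KB pieces,
    one piece per HBase record."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]

# A 300KB POST body becomes 300 chunks, written to HBase in one batch.
chunks = split_payload("x" * 300 * 1024)
print(len(chunks))  # 300
```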

Test Result
Data size                | web app detail                 | time
60K records (60MB)       | 1 instance (port 8870)         | 12 seconds
60K records (60MB)       | nginx (8 instances, port 8880) | 6.22 seconds
6 million records (6GB)  | nginx (8 instances, port 8880) | 768.79 seconds (12.8 mins)

Web server status: CPU time (chart)



Friday, June 22, 2012

HBase + Thrift performance test 1

Thrift transport is not thread safe!
At the beginning, I used only one global thrift connection in my test app, with 10 concurrent threads sending data to HBase over that single connection.

Then, lots of confusing exceptions came my way!
- Server side printed exceptions:
"java.lang.OutOfMemoryError: Java heap space"

- Client side printed exceptions:
"... broken pipe..."

Performance test 
Data size:
  • write 60K records to the database
  • each record is about 1024 bytes => 60MB of data in total
  • each record owns 3 columns "f1:1", "f1:2", "f1:3" in 1 column family "f1"; the value in each column is formatted as "value_%s_endv" % ("x"*(1024/3))
  • the row key is formatted as "RK_%s_%s" % (random.random(), time.time())
  • one record: (HBase row screenshot)
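The record layout above can be reproduced with a short snippet. This is a sketch; `make_record` is a hypothetical name, not from the test code:

```python
import random
import time

def make_record():
    """Build one test record: a random row key plus three columns
    in column family "f1", ~1KB per record in total."""
    rowkey = "RK_%s_%s" % (random.random(), time.time())
    value = "value_%s_endv" % ("x" * (1024 // 3))     # 341 'x' characters
    columns = {"f1:%d" % i: value for i in (1, 2, 3)}
    return rowkey, columns

rowkey, columns = make_record()
print(sorted(columns))       # ['f1:1', 'f1:2', 'f1:3']
print(len(columns["f1:1"]))  # 352 = len("value_") + 341 + len("_endv")
```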

Write mode
  • 10 concurrent threads => each thread is in charge of writing 6K records (6MB)
  • write to the database every 300 records (mutateRows)
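The batching described above can be sketched as follows; the `flush` callable stands in for the actual thrift mutateRows call, so only the buffering logic is shown:

```python
def write_records(records, flush, batch_size=300):
    """Buffer records and flush every batch_size of them,
    mimicking the every-300-records mutateRows batching."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            flush(batch)   # one mutateRows call per 300 records
            batch = []
    if batch:              # flush the remainder
        flush(batch)

calls = []
write_records(range(6000), calls.append)
print(len(calls))  # 20 batches for one thread's 6K records
```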

4 boxes in cluster:
  1. NameNode, Secondary NameNode, HBase Master, Zookeeper Server
  2. DataNode, Region server, Thrift
  3. DataNode, Region server
  4. DataNode, Region server

They're all Ubuntu 12.04 x64 servers with Intel Core2 Quad @2.66GHz CPUs. #1, #3, and #4 are equipped with 8GB of memory; #2 has 16GB because thrift runs on it.

HBase configuration:
Most preferences are kept at their defaults after the Cloudera CDH4 Manager installation. The only two modifications:
HBase Master's Java heap size in bytes: 1073741824 -> 2147483648 (1GB -> 2GB)
HBase client write buffer: 2097152 -> 8388608 (2MB -> 8MB)

create test database "testdb1" with column family "f1"
> create 'testdb1','f1'
0 row(s) in 1.6290 seconds

[Test 1]
Each thread owns its private connection to thrift, so there are 10 connections in total in this test.
test code: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/connectioninthread.py

result: 6.9139 seconds -> 60 000 records (60MB)

[Test 2]
Use one global connection; each thread must acquire the global reentrant lock before writing.
test code: https://github.com/feifangit/hbase-thrift-performance-test/blob/master/sharedconnection.py

result: 16.345 seconds -> 60 000 records(60MB)
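The shared-connection pattern from test 2 boils down to the sketch below; the dict stands in for the single shared thrift connection, and the lock is what serializes every write:

```python
import threading

lock = threading.Lock()
state = {"writes": 0}   # stands in for the one shared thrift connection

def worker(n):
    for _ in range(n):
        with lock:              # serialize access to the shared connection
            state["writes"] += 1

threads = [threading.Thread(target=worker, args=(600,)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state["writes"])  # 6000: correct, but every write was serialized
```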

Uh... Of course, 10 connections (test 1) are much faster than a single connection (test 2). Thrift is not the bottleneck in this test: more thrift connections bring you better performance.

Next week, I'll add a tornado Web application in front of the thrift interface to collect mass data.
I'll try to reach the best performance as I can.

I also did a load test with 6 million records. It cost me 896 seconds (~15 mins), so the average time to store a record is 896s / 6,000,000 ≈ 0.15ms. Impressive performance!!

Here's the server status (of the region server where thrift is located): CPU time (chart)



Thursday, June 14, 2012

Deploy tornado application

Goal: deploy tornado application with Nginx.

Nginx+Tornado is absolutely a perfect combination for a high-performance web service. I usually configure Nginx as a reverse proxy and load balancer in front of multiple tornado instances.

And I use supervisord, a Linux process-management tool, to make sure the tornado applications are up and running.
supervisord: http://supervisord.org/ 

1, install supervisor via Python setuptools:
$easy_install supervisor
Since Ubuntu 12.04, supervisor has been indexed in the apt-get repos. Install it with:
apt-get install supervisor
and put your configuration (files with a .conf extension) in /etc/supervisor/conf.d/

2, generate a config file by template
echo_supervisord_conf > /etc/supervisord.conf

3, append the configuration at the end of the file
(here goes a simple example)
[program:mytornado]                                              ; program name: mytornado
command=python /home/feifan/Desktop/tor.py 888%(process_num)01d  ; you'd better accept a port parameter in the application's entry point
process_name=%(program_name)s_%(process_num)01d                  ; process name format: "mytornado_0" and "mytornado_1"
numprocs=2                                                       ; how many instances

4, start processes
We've registered the tornado processes in the supervisord configuration; now start the service.
(To start supervisord at boot, check out item #8.)
If you installed via apt-get, make sure supervisord is running with:
service supervisor status
Then reload the newly added supervisor configuration file with:
$supervisorctl update
and check that the processes are up and running with:
$supervisorctl status

5, check tornado application running on specified ports
$lsof -i:8880
$lsof -i:8881
Kill one of the instances, and you will get a new instance started up :)

6, config Nginx
upstream backends {
    server;
    server;
}

server {
    listen 8878;
    server_name localhost;

    location /favicon.ico {
        alias /var/www/favicon.ico;
    }

    location / {
        proxy_pass_header Server;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Scheme $scheme;
        proxy_pass http://backends;
        proxy_next_upstream error;
    }

    access_log /var/log/nginx/tor.access_log;
    error_log /var/log/nginx/tor.error_log;
}
7, done
$service nginx reload
Now Nginx is listening on port 8878, and requests sent to this port will be proxied to the tornado applications on ports 8880 and 8881.

8, extra
If you installed supervisor via easy_install rather than apt-get, you probably need to make supervisor work as a service yourself:
a) (Ubuntu)create start script at /etc/init.d/supervisord
#! /bin/bash -e

SUPERVISORD=/usr/local/bin/supervisord   # path when installed via easy_install
PIDFILE=/var/run/supervisord.pid
OPTS="-c /etc/supervisord.conf"

test -x $SUPERVISORD || exit 0

. /lib/lsb/init-functions

export PATH="${PATH:+$PATH:}/usr/local/bin:/usr/sbin:/sbin"

case "$1" in
  start)
    log_begin_msg "Starting Supervisor daemon manager..."
    start-stop-daemon --start --quiet --pidfile $PIDFILE --exec $SUPERVISORD -- $OPTS || log_end_msg 1
    log_end_msg 0
    ;;
  stop)
    log_begin_msg "Stopping Supervisor daemon manager..."
    start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE || log_end_msg 1
    log_end_msg 0
    ;;
  restart)
    log_begin_msg "Restarting Supervisor daemon manager..."
    start-stop-daemon --stop --quiet --oknodo --retry 30 --pidfile $PIDFILE
    start-stop-daemon --start --quiet --pidfile $PIDFILE --exec $SUPERVISORD -- $OPTS || log_end_msg 1
    log_end_msg 0
    ;;
  *)
    log_success_msg "Usage: /etc/init.d/supervisord {start|stop|restart}"
    exit 1
    ;;
esac

exit 0

b) make script executable
chmod +x /etc/init.d/supervisord

c) add to the startups
update-rc.d supervisord defaults

Tuesday, June 5, 2012



Both CDH4 & Cloudera Manager start supporting Ubuntu 10.04 & 12.04 now!

It's really great news for anyone who considers Ubuntu their primary deployment platform :P

Monday, June 4, 2012

Cross domain AJAX

Same-origin policy
The same-origin policy prevents a script loaded from one domain from getting or manipulating properties of a document from another domain.
This policy dates all the way back to Netscape Navigator 2.0.
You will meet failures when getting a resource from a different host, a different port, or even a different protocol.

For example, an error occurs when a page on domain "localhost" tries to get a resource from a different origin. Following is the exception description in the Chrome Dev tools:
XMLHttpRequest cannot load Origin http://localhost:8094 is not allowed by Access-Control-Allow-Origin.

Classic cross-domain communication: suppose we have Server/Domain A and Server/Domain B. The client (usually a browser) loads a web page from server B, and that page is going to get JSON data from server A.

Solution 1: insert script elements dynamically (JSONP)
This solution seems like a bit of a hack ("旁門左道", "an unorthodox path"), but it's currently the most popular and most compatible solution, nicely supported by the jQuery API.
It works because the same-origin policy doesn't prevent dynamic script insertions and treats the scripts as if they were loaded from the domain that provided the web page.

Server A is written in Python and runs on GAE. The handler serves both same-domain GET requests and cross-domain requests, and it takes one parameter, "y".
def makeJSONP(fn):
    def wrapped(itself):
        callback = itself.request.get("callback")
        response = fn(itself)
        if callback:    # cross-domain JSONP request: wrap the JSON in a JS call
            itself.response.headers['Content-Type'] = 'application/x-javascript'
            itself.response.write("%s(%s)" % (callback, response))
        else:           # same-domain request: return the raw JSON
            itself.response.write(response)
    return wrapped

class GetMStat(webapp2.RequestHandler):     # supports JSONP
    @makeJSONP
    def get(self):
        tyear = int(self.request.get("y"))
        yrecords = dbmodel.MonthlyStat.all().filter("ryear =", tyear).fetch(None)
        x = []
        for record in yrecords:
            x.append(...)    # build a serializable value from each record (body elided in the original)
        return json.dumps(x)
A JSONP request URL carries an extra "callback" parameter.
And the server side doesn't return the JSON data directly; it returns a slice of JavaScript code like "func(jsondata)", where func comes from the "callback" parameter.

GetMStat is the HTTP handler code; it passes the raw JSON data to the Python decorator makeJSONP for further processing.
The decorator makeJSONP determines whether there is a "callback" GET parameter. If there is, it returns the JavaScript-style string we just mentioned and sets the response's Content-Type header to JavaScript (otherwise, the browser will give you a MIME-mismatch warning).
If there is no "callback" GET parameter, the request is considered a same-domain request, and the JSON data is returned directly.
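The two branches can be demonstrated without the framework. This is a sketch; `render` is a hypothetical stand-in for the decorated handler:

```python
import json

def render(data, callback=None):
    """Return (content_type, body) the way makeJSONP decides it."""
    payload = json.dumps(data)
    if callback:   # cross-domain JSONP request: wrap JSON in a JS call
        return ("application/x-javascript", "%s(%s)" % (callback, payload))
    return ("application/json", payload)   # same-domain: raw JSON

print(render([1, 2], callback="func"))  # ('application/x-javascript', 'func([1, 2])')
print(render([1, 2]))                   # ('application/json', '[1, 2]')
```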

When client (javascript) get the response, the function "func" will be invoked with jsondata as parameter.
Client B:
$.getJSON('http://localhost:8118/getmstatistics?callback=?', {y: 2012},
    function(jsondata) {
        // handle the returned JSON here
    } // end callback function
);
The jQuery API getJSON handles the JSONP callback function without additional work.

In the Chrome Dev tools, you can see that the client (browser) accesses a URL in which the "callback=?" placeholder has been replaced by an auto-generated function name.
That callback function is actually the 3rd (anonymous function) parameter of getJSON.

Cross-domain communications with JSONP, Part 1: Combine JSONP and jQuery to quickly build powerful mashups
jQuery API: getJSON 

Solution 2: HTTP access control (CORS) 
This solution is much easier than the JSONP one, but it works for modern browsers only. You can try it in Chrome, but not in the IE series.

Check out the tech draft from the W3C: http://www.w3.org/TR/cors/
And try it in your code; it takes just one new line :)
Server A
Add the HTTP header "Access-Control-Allow-Origin" to the JSON response:
response["Access-Control-Allow-Origin"] = "*"
That's all. It means this resource can be accessed from any domain in a cross-domain manner.
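Outside GAE, the same one-line idea looks like this in a plain WSGI app (a minimal sketch, not the author's actual server code):

```python
import json

def app(environ, start_response):
    """Minimal WSGI app whose JSON response is readable cross-domain."""
    body = json.dumps({"ok": True}).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Access-Control-Allow-Origin", "*"),  # the one extra line enabling CORS
    ]
    start_response("200 OK", headers)
    return [body]
```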