Monday, December 23, 2013

MongoDB GridFS performance test

Test code and test result: https://github.com/feifangit/MongoDB-GridFS-test

Purpose

MongoDB GridFS comes with some natural advantages such as scalability (sharding) and HA (replica sets). But as it stores files in chunks inside the database, there's no doubt some performance loss.
I tried 3 different deployments (3 different MongoDB drivers) to read from GridFS, and compared the results against a classic Nginx configuration serving files from disk.


Configurations

1, Nginx

location /files/ {
  alias /home/ubuntu/;
} 
open_file_cache was kept off during the test.

2, Nginx_GridFS

It's an Nginx plugin based on the MongoDB C driver: https://github.com/mdirolf/nginx-gridfs

Compile code & install

I made a quick install script in this repo; run it with sudo. After Nginx is ready, modify the configuration file under /usr/local/nginx/conf/nginx.conf (if you didn't change the path).

Configuration

location /gridfs/ {
   gridfs test1 type=string field=filename;
}
Use /usr/local/nginx/sbin/nginx to start Nginx, and pass the parameter -s reload if you change the configuration file again.

3, Python

library version

  • Flask 0.10.1
  • Gevent 1.0.0
  • Gunicorn 0.18.0
  • pymongo 2.6.3

run application

cd flaskapp/
sudo chmod +x runflask.sh
bash runflask.sh
Script runflask.sh starts Gunicorn in gevent worker mode. The Gunicorn configuration file is in the repo.
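
The actual Flask application lives in the repo's flaskapp/ folder; for reference, here is a minimal sketch of a GridFS download handler in this style. The route, the database name (test1) and the filename lookup are my assumptions for illustration, not necessarily exactly what the repo does.

from flask import Flask, Response, abort
from pymongo import MongoClient
from gridfs import GridFS
from gridfs.errors import NoFile

app = Flask(__name__)
db = MongoClient("localhost", 27017)["test1"]   # database name taken from the test setup below
fs = GridFS(db)

@app.route("/pyapp/<filename>")                 # illustrative route, not necessarily the repo's
def serve_file(filename):
    try:
        # look the file up by its "filename" field, matching the nginx-gridfs config above
        gridout = fs.get_last_version(filename=filename)
    except NoFile:
        abort(404)
    return Response(gridout.read(),
                    mimetype=gridout.content_type or "application/octet-stream")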

4, Node.js

library version

  • Node.js 0.10.4
  • Express 3.4.7
  • mongodb(driver) 1.3.23

run application

cd nodejsapp/
sudo chmod +x runnodejs.sh
bash runnodejs.sh

Test

Test items:

  1. file served by Nginx directly
  2. file served by Nginx_gridFS + GridFS
  3. file served by Flask + pymongo + gevent + GridFS
  4. file served by Node.js + GridFS

Files for downloading:

Run the script insert_file_gridfs.py on the MongoDB server to insert files of the sizes listed below into database test1 (pymongo is required); a minimal sketch of such an insert is shown after this list.
  • 1KB
  • 100KB
  • 1MB
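
The real script is in the repo; roughly, the insert boils down to something like this (the file names and the dummy payload are placeholders of mine):

import gridfs
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["test1"]
fs = gridfs.GridFS(db)

# placeholder file names, dummy contents of the target sizes
for size, name in [(1024, "1KB.bin"), (100 * 1024, "100KB.bin"), (1024 * 1024, "1MB.bin")]:
    fs.put("x" * size, filename=name)   # stored as chunks in fs.files / fs.chunks
    print "inserted", name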

Test Environment

2 servers:
  • MongoDB+Application/Nginx
  • tester(Apache ab/JMeter)
hardware:

Concurrency

100 concurrent requests, total 500 requests.
ab -c 100 -n 500 ...

Result

Throughput

Time per request (download)
File size | Nginx + hard drive | Nginx + GridFS plugin | Python (pymongo+gevent) | Node.js
1KB       | 0.174              | 1.124                 | 1.982                   | 1.679
100KB     | 1.014              | 1.572                 | 3.103                   | 3.708
1MB       | 9.582              | 9.567                 | 15.973                  | 18.317
You can get the full Apache ab reports in the folder testresult.

Server load

The server load was monitored with the command: vmstat 2
(vmstat charts for Nginx, nginx_gridfs, gevent+pymongo and Node.js)

Conclusion

  • Files served by Nginx directly
  • No doubt it's the most efficient option, in both performance and server load.
  • It supports caching: in the real world, the open_file_cache directive should be tuned carefully for better performance.
  • It must also be mentioned that it's the only setup here that supports pausing and resuming downloads (HTTP range support).
  • For the other 3 test items, files are stored in MongoDB but served through different drivers.
  • Serving static files from an application is really not an appropriate choice: it drains too much CPU and the performance is not good.
  • nginx_gridfs (MongoDB C driver): download requests are processed at the Nginx level, which sits in front of the web applications in most deployments, so the web application can focus on dynamic content instead of static content.
  • nginx_gridfs got the best performance compared to the applications written in scripting languages. The performance difference between Nginx and nginx_gridfs shrinks as the file size grows, but you cannot turn a blind eye to the server load.
  • pymongo vs. the Node.js driver: it's a draw. Serving static files this way should be avoided in production applications.

Advantages of GridFS

  • Putting files in the database makes static content management much easier: we no longer need to maintain consistency between files on disk and their metadata in the database.
  • Scalability and HA advantages come along with MongoDB (sharding, replica sets).

Drawbacks of GridFS

  • Poor performance compared to serving files directly from the filesystem.

When should I use MongoDB GridFS

There are only rare use cases I can imagine, especially in a performance-sensitive system, but I may give it a try in some prototype projects.
Here is the answer from the official MongoDB documentation, hope it helps: http://docs.mongodb.org/manual/faq/developers/#faq-developers-when-to-use-gridfs

Monday, December 16, 2013

gevent + pymongo doesn't process replica set host names defined in /etc/hosts

I have a Flask application connecting to a MongoDB replica set via the hostnames "node1" to "node3".

I can run it properly in the dev environment with:
python webservice.py
But in the production environment, it fails to connect to MongoDB when started with a command like:
gunicorn -k gevent webservice:app -b 0.0.0.0:8100
There's no such issue if I change the worker class to sync or eventlet.

Back in the gevent 0.13.x days, I met similar issues caused by gevent's DNS mechanism. At that tough time, gevent did not process /etc/hosts and /etc/resolv.conf at all, so you could not reach 10.1.1.47 by the hostname "node1" defined in your /etc/hosts file. :(

As for this gevent + pymongo issue, here's the bug report, with several workarounds mentioned in its description.


After reading the 1.0 release notes carefully, it looks like a pitfall rather than a bug: the ares resolver is not just a better choice, it is a choice you must make in some cases.


Fix
For supervisord, add to the program configuration:
environment=GEVENT_RESOLVER=ares
For the command line:
env GEVENT_RESOLVER=ares yourcmd...
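
The resolver can also be selected from Python code instead of the environment, as long as the variable is set before gevent is imported; a minimal sketch (assuming your startup code runs early enough for that):

import os
os.environ["GEVENT_RESOLVER"] = "ares"   # must happen before the first "import gevent"

import gevent

# quick sanity check: the hub should now be using the c-ares resolver
print gevent.get_hub().resolver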




Tuesday, October 8, 2013

MongoDB M101P HW5 answers



5.1

db.posts.aggregate([
{$project:{comments:1, author:1}},
{$unwind:"$comments"},
{$group:{_id:"$comments.author",commentnum:{$sum:1}}},
{$sort:{commentnum:-1}},
{$limit:2}
]);

5.2

db.pop.aggregate([
{$group:{_id:{state:"$state",city:"$city"}, allpop:{$sum:"$pop"}}},
{$match:{allpop:{$gt:25000},"_id.state":{$in:["CA","NY"]}}},
{$group:{_id:null, avgpop:{$avg:"$allpop"}}}
]
)

5.3

db.stud.aggregate([
{$unwind:"$scores"},
{$match:{"scores.type":{$in:["exam","homework"]}}},
{$group:{_id:"$class_id",csall:{$avg:"$scores.score"}}},
{$sort:{csall:-1}},
{$limit:5}
]
)

5.4

db.zips.aggregate([
    {$project: {
        fc: {$substr: ["$city", 0, 1]},
        city: 1,
        pop: 1
    }},
    {$match: {fc: {$gte: "0", $lte: "9"}}},
    {$group: {
        _id: null,
        allpop: {$sum: "$pop"}
    }}
])

MongoDB aggregation use cases?

I put this question on the MongoDB University M101P discussion forum: https://education.mongodb.com/courses/10gen/M101P/2013_September/discussion/forum/i4x-10gen-M101P-course-2013_September/threads/5254490dabcee8ba1e0039d2

Q
I'm wondering about the use cases of the aggregation framework. That is, suppose I have one application server and one MongoDB server. As I previously learned in the SQL (MySQL/Informix) world:

we'd better put computing in application instead of database if possible

E.g. if we can get avg value by application logic, we should not ask database to compute it.

The MongoDB aggregation and MapReduce frameworks are able to run natively on a MongoDB cluster, which is an obvious advantage in a distributed environment compared to implementing a distributed framework yourself.

If there's only one MongoDB node, what should I do, or what's the best practice? The only advantage I can imagine is: "the CPU and memory consumption happens on the database server instead of the application server, and if we're lucky and all documents are already in the working set, there may be little extra memory consumption during the data processing."



A (mattcampbell, MongoDB staff)

You are right. We normally tell you to move as much as you can to the application layer so you relieve stress on the database layer.

However, aggregation calculations are normally performed on a large set of data (quite often a whole collection) so it can actually be more detrimental to transfer all that data over the wire to perform the calculations on the app server rather than simply allowing mongodb to do it. The transfer latency alone would normally make it slower unless you have specific servers which are optimised to perform these tasks.

Also you would need to have some kind of aggregation framework setup on your app server to be able to process the large amount of data coming in. That means sourcing such a framework from a software vendor or writing your own.

So there are some serious disadvantages to wanting to move aggregation functions away from mongodb.

What some users do is setup mongod secondary nodes dedicated to serving aggregation queries so this can relieve the stress on the primary.

The other thing to think about is optimising your aggregation queries by using indexes. If your aggregation query can use a covered index and all your indexes can fit in memory this will obviously also be a lot quicker to process aggregation functions on mongodb rather than looping over a cursor on the app server.

So unfortunately there is no right or wrong answer here. You would need to assess the load on your database server and see whether it is actually faster and worth the time and money moving those functions to the app server.

Matt

Friday, October 4, 2013

My viewpoint of the "sort key first + limit" strategy in book "MongoDB The Definitive Guide"

In the book "MongoDB: The Definitive Guide" (2nd edition, May 2013) by Kristina Chodorow, page 87 says:
"Thus, putting the sort key first is generally a good strategy when you're using a limit so MongoDB can stop scanning the index after a couple of matches."
On page 89, the author gives an example showing the power of the "sort key first" plus "limit" combination, which I find a bit misleading.

First, the conclusion is not rigorous, even though the author used the word "generally". The conclusion and example are based on one condition:
the cursor collects enough results very early.

Suppose we have a collection with 1,000,000 documents.
{"x":1, "y":1}
{"x":2, "y":2}
...
{"x":1000000, "y":1000000}
Create the collection with:
for(var i=1;i<=1000000;i++){db.col.insert({x:i,y:i});}

Add 2 compound indexes {x:1,y:1} and {y:1,x:1}
> db.col.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.col",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "x" : 1,
                        "y" : 1
                },
                "ns" : "test.col",
                "name" : "x_1_y_1"
        },
        {
                "v" : 1,
                "key" : {
                        "y" : 1,
                        "x" : 1
                },
                "ns" : "test.col",
                "name" : "y_1_x_1"
        }
]
Now, we're trying to get documents matching a range condition on y, sorted by x and limited to the first 500.
After pre-heating the data, let's run some tests.

Case 1: all qualifying documents are located at the beginning
1, db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 500,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 500,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 1,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
2, db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 999,
        "nscanned" : 999,
        "nscannedObjectsAllPlans" : 999,
        "nscannedAllPlans" : 999,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 9,
        "indexBounds" : {
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #1 is a bit faster than #2.
#1 followed the "putting the sort key first" strategy: it walks the x_1_y_1 index in sort order, and before the cursor reached the 501st entry it had already collected 500 qualifying documents, already in the right order. Task done!
#2 used the index for matching: it first fetched the 999 documents with y less than 1000, then sorted them in memory and returned the first 500.

Case 2: all qualifying documents are located at the tail
3, db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 859591,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 859591,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 3477,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
4, db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 100000,
        "nscanned" : 100000,
        "nscannedObjectsAllPlans" : 100000,
        "nscannedAllPlans" : 100000,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 470,
        "indexBounds" : {
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #3 followed the "putting the sort key first" strategy, but it was much slower than #4.
#3 scanned the index in sort order as #1 did, but unfortunately all qualifying results are located far from the beginning: before entry 900,001 the cursor collected nothing, and only from there on did it finally gather the 500 qualifying results.
#4 first fetched the 100,000 matching documents rapidly via the index, then sorted them in memory and returned the first 500.

Analysis:
The query time of #1 consists of "a lucky, short index scan" + "zero sort time".
The query time of #2 consists of "a little B-tree search time" + "an in-memory sort".
The query time of #3 consists of "an unlucky, very long index scan" + "zero sort time".
The query time of #4 consists of "a little B-tree search time" + "an in-memory sort".

The conclusion in the book does not fit 100% of the cases; run explain() before relying on a MongoDB query plan.
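
For the record, the same experiment can be driven from pymongo (2.x API) instead of the shell; a small sketch that hints each index in turn and prints a few explain fields (the collection and field names follow the example above):

from pymongo import MongoClient, ASCENDING

col = MongoClient("localhost", 27017)["test"]["col"]

for hint in ([("x", ASCENDING), ("y", ASCENDING)],
             [("y", ASCENDING), ("x", ASCENDING)]):
    plan = col.find({"y": {"$lt": 1000}}).sort("x", ASCENDING) \
              .limit(500).hint(hint).explain()
    print hint, plan["cursor"], plan["nscanned"], plan["millis"]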

Tuesday, September 10, 2013

Missing manuals of Gunicorn



Gunicorn provides a very convenient way to deploy Django/Flask applications. All Gunicorn needs is a WSGI app object, plus some parameters to define the I/O model, the number of workers, etc.

The command below will start the web app with 2 processes in sync mode.
$ gunicorn --workers=2 test:app

I ran into a requirement that needs to start the application with a configuration .ini file, but Gunicorn does not provide any interface for passing custom parameters.

I first tried to dig out a solution with os.environ, but later I was inspired by a statement in the Gunicorn documentation here: http://docs.gunicorn.org/en/latest/run.html. It explains the Gunicorn command line format:

We must follow the format: $ gunicorn [OPTIONS] APP_MODULE

APP_MODULE must follow the pattern MODULE_NAME:VARIABLE_NAME, where VARIABLE_NAME should be a callable object that can be found in MODULE_NAME.

A callable object >,<
I checked the source code of Gunicorn, and here's the key line in the import_app method:
https://github.com/benoitc/gunicorn/blob/master/gunicorn/util.py#L365
app = eval(obj, mod.__dict__)

If the variable obj is just an object name, app will be assigned that object. But if obj is a function-call expression, e.g. "func(1,2)", the function func will be executed!
def f(a,b):
  return a+b
Executing eval("f(3,5)") gives you the result 8.

This should enlighten you already. For example, in a Flask application, we previously had the WSGI object:
app = Flask(__name__)

Now, you can add a method in the application, say:
def get_app(arg1, arg2):
  print arg1, arg2 #now, you got the parameters from outside!
  #process sth.
  return app #return the real WSGI object

Start your web application with Gunicorn:
gunicorn "myapp:get_app('myarg1','myarg2')"
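
Back to my original requirement (passing a configuration .ini file): the factory can simply take the path as its argument. The module name myapp, the ini path and the [server] section below are made up for this sketch:

import ConfigParser            # Python 2 stdlib
from flask import Flask

app = Flask(__name__)

def get_app(config_path):
    # config_path arrives from the Gunicorn command line, see below
    parser = ConfigParser.ConfigParser()
    parser.read(config_path)
    app.config["DEBUG"] = parser.getboolean("server", "debug")   # assumes [server] debug = true
    return app

# start it with:
# gunicorn -k gevent "myapp:get_app('/etc/myapp/settings.ini')"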

I do hope this solution helps you :)

Wednesday, July 24, 2013

Python object copy

We ran into a problem which puzzled me at the beginning. After some minutes of debugging, I realized I had fallen into the "object reference" trap which I thought I would never meet :(
Here's the simplified code:

"set a" changed from the 2nd loop iteration on, while we were expecting it to keep a consistent value.

From the output we get some information:
- the memory address of the variable delta is re-allocated in every loop -> lines #14, 18, 22
- the item we changed 3 times at line #9 affected the same object every time, because its id stays the same (lines #15, 19, 23). And that's why the problem occurs.

Analysis result: from the memory addresses we printed, we found that the object we changed (item, taken from delta) is without doubt a member of "set a": the item is an object reference to an element of "set a".


We can create a copy of "delta" instead of a reference into "set a", but there are two kinds of copy in Python: shallow copy and deep copy.
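
Since the original snippet was embedded as an image, here is a small stand-in that reproduces the same trap and the deepcopy fix; the names and values are illustrative only:

import copy

a = [{"val": 1}, {"val": 2}, {"val": 3}]      # stands in for "set a"

for i in range(3):
    delta = list(a)             # shallow copy: a new list object, but the same element objects
    item = delta[0]             # item is a reference to a[0]
    item["val"] += 100          # so this mutates a[0] as well
    print id(delta), id(item), a[0]   # delta's id changes each loop, item's id stays the same

b = copy.deepcopy(a)            # deep copy: a new list AND new element objects
b[0]["val"] = -1
print a[0], b[0]                # a[0] is not affected by the change to b[0]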

Shallow copy



Deep copy




Pseudo code

Each Python operation below is paired with C++-style pseudo code.

Setup
Python code:
a = [1,2]
Pseudo code:
Node** pa=new Node*[3];
pa[0] = new Node(1);
pa[1] = new Node(2);
pa[2] = new Node*[2](1,2);

Reference operation
Python code:
b = a
b is a #True
id(b) == id(a) #True
id(b[0]) == id(a[0]) #True
Pseudo code:
Node**& pb = pa; //C++ reference
printf("%p", pa);
printf("%p", pb);

Shallow copy
Python code:
b = list(a)
b is a #False
id(b) == id(a) #False
id(b[0]) == id(a[0]) #True
Pseudo code:
Node** pb=new Node*[3];
pb[0] = pa[0];
pb[1] = pa[1];
pb[2] = pa[2];

Deep copy
Python code:
import copy
b = copy.deepcopy(a)
id(b) == id(a) #False
id(b[0]) == id(a[0]) #False
Pseudo code:
Node** pb=new Node*[3];
pb[0] = new Node(pa[0]);
pb[1] = new Node(pa[1]);
pb[2] = new Node(pa[2]);

Monday, April 15, 2013

org.apache.hadoop.hdfs.server.namenode.NameNode Exception in namenode join


Upgrading a Hadoop cluster is really a big challenge, especially for a non-IT guy like me.
Even with Cloudera CDH Manager, I still ran into various problems while upgrading the small 4-node cluster :(

Today's problem: after upgrading CDH to the latest version, the NameNode cannot start up.

Component version:



Namenode error log

2:15:59.109 PM FATAL org.apache.hadoop.hdfs.server.namenode.NameNode
Exception in namenode join
java.lang.AssertionError: Should not purge more edits than required to restore: 8113483 should be <= 5202305
at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:132)
at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:950)
at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:935)
at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:872)
at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:852)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:593)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:590)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
12:15:59.111 PM INFO org.apache.hadoop.util.ExitUtil
Exiting with status 1
12:15:59.112 PM INFO org.apache.hadoop.hdfs.server.namenode.NameNode
SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at fdcloud1.fdservice.cloud/10.1.2.128
************************************************************/

Solution:
Thanks to this post:
https://groups.google.com/a/cloudera.org/forum/?fromgroups=#!topic/cdh-user/e1zCdmuIbkw

The poster mentioned adding a setting for the NameNode:

<property>
<name>dfs.namenode.max.extra.edits.segments.retained</name>
<value>1000000</value>
</property>

Add the setting in CDH Manager:
1, find the "Advanced" settings for the NameNode
2, find the "NameNode Configuration Safety Valve for hdfs-site.xml" field and paste the property there
3, restart the HDFS service

Sunday, April 7, 2013

Connect to Cisco VPN in Ubuntu 11 & 12

Install the Cisco VPN library:
sudo apt-get install network-manager-vpnc

Then, in the "Advanced" tab, select MPP.

Thursday, March 28, 2013

Nginx upload module vs Flask

Env

Ubuntu 12.04 32-bit
Nginx 1.1.19 (HTTP server)
Flask 0.9 (light Python framework)
gevent 1.0rc2 (coroutine I/O)
Gunicorn 0.17.2 (WSGI server)

Nginx setting

Nginx listens on port 8666, serves requests to the URL /upload, and also works as a reverse proxy to Flask, which serves requests on port 8222.

Server code

The server application runs with 5 worker processes via Gunicorn + gevent:
gunicorn -k gevent -b 0.0.0.0:8222 -w 5 t:app
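
The actual server code was embedded as a gist; a bare-bones sketch of the kind of handler being benchmarked looks like this (the module name t, the upload directory and the form field name "file" are assumptions of mine):

import os
from flask import Flask, request

app = Flask(__name__)
UPLOAD_DIR = "/tmp/uploads"                     # placeholder target directory

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]                   # multipart form field named "file"
    f.save(os.path.join(UPLOAD_DIR, f.filename))
    return "ok"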


Test code

Start N threads and upload a file to the URL concurrently; a rough sketch of such a client follows.
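
The original test script was also embedded as a gist; an equivalent client could look roughly like this, assuming the requests library and a placeholder path for the 3.3 MB test file:

import threading
import requests                                  # assumed to be installed

URL = "http://127.0.0.1:8666/upload"             # the Nginx front end from the setup above
FILE_PATH = "/tmp/test-3.3MB.bin"                # placeholder test file

def upload_once():
    with open(FILE_PATH, "rb") as f:
        requests.post(URL, files={"file": f})    # multipart/form-data upload

N = 50                                           # concurrency level
threads = [threading.Thread(target=upload_once) for _ in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()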




Result


Performance test result (unit: seconds; file size: 3.3 MB):

Concurrency | Flask 0.9 + gevent 1.0rc2 + gunicorn 0.17.2 + Nginx (5 worker processes) | Nginx + Nginx upload module
50          | 6.029                                                                    | 2.328
100         | 12.788                                                                   | 4.995
200         | 28.828                                                                   | 10.813


As we expected, the Nginx upload module, which is written in C, is about two and a half times as fast as the pure Python stack at processing file uploads.


Wednesday, March 6, 2013

Add disk for datanode in CDH4

First step: add the disk, make the partition, mount to the machine

  • connect HDD to machine
  • ls /dev/[sh]d*
    The one not ending with a number should be the new HDD, e.g. /dev/sdb
  • create partition on the new disk (assume create one partition on the entire disk)
    fdisk /dev/sdb
    => command n   add a new partition
    => select primary
    => default partition number 1
    => default start cylinder
    => default end cylinder
    => command w   write table to disk and exit
    => command q leave fdisk
  • list partitions on the new disk
    fdisk -l /dev/sdb
    you should see /dev/sdb1
  • format the partition
    mkfs -t ext4 /dev/sdb1
  • get the UUID of disk
    blkid
    copy the UUID, assume it's a-b-c-d-e
  • mount the partition, assume we're trying to mount on /data2
    add the line to file </etc/fstab>
    UUID=a-b-c-d-e /data2 ext4 defaults 0 0
  • reboot
  • mount /data2

Step 2: change the ownership of the folder

This step is very important!
chgrp hadoop /data2
chown hdfs /data2

Step 3: add this folder to the DataNode settings via CDH Manager



Step 4: restart HDFS

Check the new space on the NameNode's status page: http://xxxx:50070/dfshealth.jsp

Monday, February 25, 2013

webhdfs unicode issue

One of my Map/Reduce tasks generated a folder with a name in Unicode.

I can access this folder from the web page, I mean, browse it from the port-50075 page.

But I just get "?" instead of those two Unicode characters through the WebHDFS API.
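
A quick way to tell whether the "?" comes from the wire or from the client is to hit WebHDFS directly and inspect the raw bytes; a sketch, with the NameNode host and the folder path as placeholders:

import json
import urllib2

# placeholder NameNode host and folder path
url = "http://namenode:50070/webhdfs/v1/user/output?op=LISTSTATUS"
raw = urllib2.urlopen(url).read()
print repr(raw)                                   # raw bytes: are the characters really '?'?

listing = json.loads(raw)                         # WebHDFS returns UTF-8 JSON
for status in listing["FileStatuses"]["FileStatus"]:
    print status["pathSuffix"].encode("utf-8")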

I'm using HDFS version 2.0.0-CDH4.0.1.
I will work on this issue in the following days.