Tuesday, October 8, 2013

MongoDB M101P HW5 answers



5.1

db.posts.aggregate([
    {$project: {comments: 1, author: 1}},    // keep only the fields the pipeline needs
    {$unwind: "$comments"},                  // one document per comment
    {$group: {_id: "$comments.author", commentnum: {$sum: 1}}},  // count comments per author
    {$sort: {commentnum: -1}},               // most prolific commenters first
    {$limit: 2}
]);

5.2

db.pop.aggregate([
    {$group: {_id: {state: "$state", city: "$city"}, allpop: {$sum: "$pop"}}},  // total population per city
    {$match: {allpop: {$gt: 25000}, "_id.state": {$in: ["CA", "NY"]}}},         // CA/NY cities over 25,000
    {$group: {_id: null, avgpop: {$avg: "$allpop"}}}                            // average across those cities
]);

5.3

db.stud.aggregate([
    {$unwind: "$scores"},                                      // one document per score
    {$match: {"scores.type": {$in: ["exam", "homework"]}}},    // filter AFTER unwinding; matching before $unwind would let quiz scores leak into the average
    {$group: {_id: "$class_id", csall: {$avg: "$scores.score"}}},
    {$sort: {csall: -1}},                                      // highest class average first
    {$limit: 5}
]);

5.4

db.zips.aggregate([
    {$project: {
        fc: {$substr: ["$city", 0, 1]},      // first character of the city name
        city: 1,
        pop: 1
    }},
    {$match: {fc: {$gte: "0", $lte: "9"}}},  // keep cities whose name starts with a digit
    {$group: {_id: null, allpop: {$sum: "$pop"}}}  // total their population
]);

MongoDB aggregation use cases?

I posted this question on the MongoDB University M101P discussion forum. https://education.mongodb.com/courses/10gen/M101P/2013_September/discussion/forum/i4x-10gen-M101P-course-2013_September/threads/5254490dabcee8ba1e0039d2

Q
I'm wondering about the use cases for the aggregation framework. That is, suppose I have one application server and one MongoDB server. As I learned earlier in the SQL (MySQL/Informix) world:

we'd better put computation in the application instead of the database, if possible

E.g., if we can compute an average in application logic, we should not ask the database to compute it.
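
For illustration, the contrast looks like this in the mongo shell (a minimal sketch; the orders collection and its amount field are made up):

// Server-side: the database computes the average; only one tiny
// result document crosses the wire.
db.orders.aggregate([
    {$group: {_id: null, avgAmount: {$avg: "$amount"}}}
]);

// Client-side: every matching document crosses the wire and the
// application does the math.
var sum = 0, count = 0;
db.orders.find({}, {amount: 1}).forEach(function (doc) {
    sum += doc.amount;
    count += 1;
});
var avgAmount = (count > 0) ? (sum / count) : null;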

MongoDB's aggregation and MapReduce frameworks can run natively on a MongoDB cluster, which is an obvious advantage compared with implementing a distributed framework yourself in a distributed environment.

If there is only one MongoDB node, what should I do, and what is the best practice? The only advantage I can imagine is: "the CPU and memory consumption happens on the database server instead of the application server, and if we are lucky and all the documents are already in the working set, there may be little extra memory consumption during the data processing."



A
mattcampbell (STAFF)

You are right. We normally tell you to move as much as you can to the application layer so you relieve stress on the database layer.

However, aggregation calculations are normally performed on a large set of data (quite often a whole collection), so it can actually be more detrimental to transfer all that data over the wire to perform the calculations on the app server rather than simply allowing MongoDB to do it. The transfer latency alone would normally make it slower unless you have specific servers which are optimised to perform these tasks.

Also you would need to have some kind of aggregation framework set up on your app server to be able to process the large amount of data coming in. That means sourcing such a framework from a software vendor or writing your own.

So there are some serious disadvantages to wanting to move aggregation functions away from MongoDB.

What some users do is set up mongod secondary nodes dedicated to serving aggregation queries, which can relieve the stress on the primary.
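
For example, a client can pin its reads to a secondary from the mongo shell (a minimal sketch, assuming a replica-set connection; the pipeline is the one from 5.1 above):

db.getMongo().setReadPref("secondary");   // route this shell's reads to secondaries
db.posts.aggregate([
    {$unwind: "$comments"},
    {$group: {_id: "$comments.author", commentnum: {$sum: 1}}}
]);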

The other thing to think about is optimising your aggregation queries by using indexes. If your aggregation query can use a covered index and all your indexes fit in memory, it will obviously be much quicker to run the aggregation on MongoDB than to loop over a cursor on the app server.
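
A minimal sketch of that idea (the events collection and its type and ts fields are made up): with an index on {type:1, ts:1}, a $match and $sort at the start of the pipeline can walk the index instead of scanning the whole collection.

db.events.ensureIndex({type: 1, ts: 1});
db.events.aggregate([
    {$match: {type: "click"}},   // leading $match can use the index
    {$sort: {ts: 1}},            // order is already provided by the index
    {$group: {_id: null, clicks: {$sum: 1}}}
]);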

So unfortunately there is no right or wrong answer here. You would need to assess the load on your database server and see whether it is actually faster and worth the time and money moving those functions to the app server.

Matt

Friday, October 4, 2013

My viewpoint on the "sort key first + limit" strategy in the book "MongoDB: The Definitive Guide"

In the book "MongoDB:The Definitive Guide" (2nd,2013.5)] by Kristina.Chodorow, page 87. It says:
"Thus, putting the sort key first is generally a good strategy when you're using a limit so MongoDB can stop scanning the index after a couple of matches."
In page 89, author put an example shows the power of the combination of "sort key first" plus "limit", it's a bit misleading.

First, the conclusion is not rigorous, even though the author used the word "generally". The conclusion and the example rest on one condition:
the cursor collects enough results very early in the scan.

Suppose we have a collection with 1,000,000 documents.
{"x":1, "y":1}
{"x":2, "y":2}
...
{"x":1000000, "y":1000000}
Create the collection by:
for(var i=1;i<=1000000;i++){db.col.insert({x:i,y:i});}

Add two compound indexes, {x:1,y:1} and {y:1,x:1}.
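For reference, they can be created in the shell with:
db.col.ensureIndex({x: 1, y: 1});
db.col.ensureIndex({y: 1, x: 1});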
> db.col.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.col",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "x" : 1,
                        "y" : 1
                },
                "ns" : "test.col",
                "name" : "x_1_y_1"
        },
        {
                "v" : 1,
                "key" : {
                        "y" : 1,
                        "x" : 1
                },
                "ns" : "test.col",
                "name" : "y_1_x_1"
        }
]
Now we're going to query on y, sort by x, and limit the result to 500 documents.
After pre-heating the data (so it is already in memory), let's run some tests.

Case 1: all qualifying documents are located at the beginning.
Query #1: db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 500,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 500,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 1,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #2: db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 999,
        "nscanned" : 999,
        "nscannedObjectsAllPlans" : 999,
        "nscannedAllPlans" : 999,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 9,
        "indexBounds" : {
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #1 is a bit faster than #2.
#1 followed the "putting the sort key first" strategy: it walked the {x:1,y:1} index from its very beginning (note the unbounded x range in indexBounds), checking y on each entry. Before the cursor reached the 501st entry it had already collected 500 qualifying documents, already in x order, so no extra sort was needed. Task done!
#2 used the index for the match: it first fetched the 999 documents with y less than 1000, then sorted them in memory (scanAndOrder: true) and returned the first 500.

Case 2: all qualifying documents are located at the tail.
Query #3: db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 859591,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 859591,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 3477,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #4: db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 100000,
        "nscanned" : 100000,
        "nscannedObjectsAllPlans" : 100000,
        "nscannedAllPlans" : 100000,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 470,
        "indexBounds" : {
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
Query #3 followed the "putting the sort key first" strategy, but it was much slower than #4.
#3 scanned the {x:1,y:1} index from the beginning just as #1 did, but unfortunately all qualifying documents sit far from the start: the cursor had to examine 859,591 index entries (nscanned) while fetching only the 500 qualifying documents (nscannedObjects: 500), because nothing qualifies until y exceeds 900000.
#4 first fetched the 100,000 matching documents quickly via the {y:1,x:1} index, then sorted them in memory and returned the first 500.

Analysis:
The time for #1 = a lucky, short index scan + zero sort time.
The time for #2 = a short B-tree range scan + an in-memory sort.
The time for #3 = an unlucky, very long index scan + zero sort time.
The time for #4 = a short B-tree range scan + an in-memory sort.

The conclusion in the book does not fit all cases 100%. Run explain() on a MongoDB query before relying on it.
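
Beyond explain(), a quick way to choose between two candidate plans is simply to time both hints (a small sketch using the collection above; hint() accepts an index name):

["x_1_y_1", "y_1_x_1"].forEach(function (idxName) {
    var t0 = new Date();
    db.col.find({y: {$gt: 900000}}).sort({x: 1}).limit(500).hint(idxName).toArray();
    print(idxName + ": " + (new Date() - t0) + " ms");
});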