Friday, October 4, 2013

My viewpoint of the "sort key first + limit" strategy in book "MongoDB The Definitive Guide"

In the book "MongoDB:The Definitive Guide" (2nd,2013.5)] by Kristina.Chodorow, page 87. It says:
"Thus, putting the sort key first is generally a good strategy when you're using a limit so MongoDB can stop scanning the index after a couple of matches."
In page 89, author put an example shows the power of the combination of "sort key first" plus "limit", it's a bit misleading.

First, the conclusion is not rigorous although the author used the word "generally". The conclusion and example was based on the condition:
cursor collected enough result very early.

Suppose we have a collection with 1,000,000 documents.
{"x":1, "y":1}
{"x":2, "y":2}
...
{"x":1000000, "y":1000000}
create the colleciton by:
for(var i=1;i<=1000000;i++){db.col.insert({x:i,y:i});}

Add 2 compound indexes {x:1,y:1} and {y:1,x:1}
> db.col.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.col",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "x" : 1,
                        "y" : 1
                },
                "ns" : "test.col",
                "name" : "x_1_y_1"
        },
        {
                "v" : 1,
                "key" : {
                        "y" : 1,
                        "x" : 1
                },
                "ns" : "test.col",
                "name" : "y_1_x_1"
        }
]
Now, we're trying to get all documents with x less than 100.
After pre-heat data, let's make some test.

Case 1: all qualified documents are located at beginning 1,
1, db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 500,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 500,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 1,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
2, db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$lt:1000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 999,
        "nscanned" : 999,
        "nscannedObjectsAllPlans" : 999,
        "nscannedAllPlans" : 999,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 9,
        "indexBounds" : {
                "y" : [
                        [
                                -1.7976931348623157e+308,
                                1000
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
#1 query is faster than #2 a bit
#1 followed the strategy "putting sort key first" it perform a table scan,  before the cursor went to 501st object, it had already collected 500 qualified documents already, and sorted correctly. , task done!
#2 used index in matching, it first get 999 documents with field y less than 1000. Then it sort the 999 documents in memory, and return the first 500.

Case 2: all qualified documents are located at tail 
3, db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({x:1,y:1}).explain()
{
        "cursor" : "BtreeCursor x_1_y_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 500,
        "nscanned" : 859591,
        "nscannedObjectsAllPlans" : 500,
        "nscannedAllPlans" : 859591,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 3477,
        "indexBounds" : {
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ],
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
4, db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1})
> db.col.find({"y":{$gt:900000}}).sort({x:1}).limit(500).hint({y:1,x:1}).explain()
{
        "cursor" : "BtreeCursor y_1_x_1",
        "isMultiKey" : false,
        "n" : 500,
        "nscannedObjects" : 100000,
        "nscanned" : 100000,
        "nscannedObjectsAllPlans" : 100000,
        "nscannedAllPlans" : 100000,
        "scanAndOrder" : true,
        "indexOnly" : false,
        "nYields" : 0,
        "nChunkSkips" : 0,
        "millis" : 470,
        "indexBounds" : {
                "y" : [
                        [
                                900000,
                                1.7976931348623157e+308
                        ]
                ],
                "x" : [
                        [
                                {
                                        "$minElement" : 1
                                },
                                {
                                        "$maxElement" : 1
                                }
                        ]
                ]
        },
        "server" : "feifan-server:27017"
}
#3 query followed the strategy "putting sort key first", but it was much slower then #4.
#3 made a table scan as #1 did, but unfortunately, all qualified result are located far away from the beginning. before the 900,001st document, #3 cursor got nothing. From 900001, #3 finally get 500 qualified result.
#4 first get 100,000 results rapidly based on index. then made a sort in memory, and returned the first 500.

Analysis:
query time of #1 is made by "lucky short table scan time" + "0 sort time"
query time of #2 is made by "little b-tree searching time" + "sort in memory time"
query time of #3 is made by  "unlucky a lot of table scan time" + "0 sort time"
while the query time of #4 is made by  "little b-tree searching time" + "sort in memory time"

The conclusion in the book is not 100% fit for all cases, do explain before performing a mongoDB query.

No comments:

Post a Comment