
Napkin math for MongoDB performance

Source: http://rickosborne.org/blog/2010/02/napkin-math-for-mongodb-performance/

As we all know, there are lies, damned lies, and statistics.
What I’m about to present shouldn’t even qualify as statistics—it’s just
a bunch of damned lies. I’m not set up to do any sort of rigorous
performance testing, so these should not be construed as anything but
what they are: one guy’s half-assed and probably flawed measurements.

I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL's COUNT(DISTINCT column) functionality. The short answer is: don't do it. Or, if you do it, figure out a better way than I did. Along the way, I gathered some metrics on what types of operations cause what kinds of performance hits.

The Setup

My setup is a database of 3,397,115 records, all of which look something like this:

{
    "_id"   : 3002827,
    "mm"    : 7,
    "stars" : 5,
    "date"  : "2005-07-18",
    "dd"    : 18,
    "cust"  : 2213,
    "movie" : 14889,
    "yy"    : 2005,
    "title" : "Species",
    "year"  : 1995
}


Yeah, I just took the Netflix prize data and inserted ~3M records. I did the inserts across 3 shard services, all running on the same machine, which led to 9 chunks of roughly equal size. I let MongoDB handle the sharding—I didn't manually split the shards. I ensured one index on the collection, over movie and cust, which isn't really used for the query in question, but I thought it was worth mentioning.

Yeah, I know performance is going to suffer because I'm running 3 shards from the same hard drive. That's kind of the point.
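
For context, a minimal sketch of what that setup looks like from the mongo shell. The netflix database name, the ports, and the _id shard key are illustrative assumptions, not details from the original runs:

// connected to the mongos router; ports and names are assumptions
use admin
db.runCommand({ addshard: "localhost:27018" })
db.runCommand({ addshard: "localhost:27019" })
db.runCommand({ addshard: "localhost:27020" })

// let MongoDB split and balance chunks on its own, no manual splits
db.runCommand({ enablesharding: "netflix" })
db.runCommand({ shardcollection: "netflix.training", key: { _id: 1 } })

// the one index mentioned above, over movie and cust
use netflix
db.training.ensureIndex({ movie: 1, cust: 1 })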

I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo
with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks
while running the tests, but nothing that should have interfered
greatly.

The Queries

Here’s the starting query’s SQL equivalent:

SELECT releaseYear,
       COUNT(*) AS nRecords,
       COUNT(DISTINCT movie) AS mMovies,
       COUNT(DISTINCT cust) AS cCustomers,
       SUM(stars) AS totalStars,
       AVG(stars) AS avgStars
  FROM training
 WHERE (releaseYear = 1990)
 GROUP BY releaseYear


And the MapReduce query itself, as I wrote it:

db.runCommand({
    mapreduce: "training",
    query: {
        year: 1990
    },
    map: function() {
        var m = {}, c = {};
        m[this.movie] = true;
        c[this.cust] = true;
        emit(
            this.year,
            { "stars": this.stars, "n": 1, "m": m, "c": c }
        );
    },
    reduce: function(key, vals) {
        var stars = 0, n = 0, m = {}, c = {};
        for (var i = 0; i < vals.length; i++) {
            var v = vals[i];
            stars += v.stars;
            n += v.n;
            for (var im in v.m) m[im] = true;
            for (var ic in v.c) c[ic] = true;
        }
        return { "stars": stars, "n": n, "m": m, "c": c };
    },
    finalize: function(key, val) {
        val.avg = val.stars / val.n;
        var m = 0, c = 0;
        for (var im in val.m) m++;
        for (var ic in val.c) c++;
        val.m = m;
        val.c = c;
        return val;
    },
    out: "result1",
    verbose: true
});


Those nasty bits with the for-in loops are for the COUNT(DISTINCT column) logic. This query produces the following result set:
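
The idiom at work in those loops: a JavaScript object serves as a makeshift set, with each distinct movie or customer id becoming a key, and finalize counts the keys. A minimal illustration of why duplicates collapse:

// using an object as a set: repeated ids land on the same key
var seen = {};
seen[2213] = true;
seen[2213] = true;  // same customer again; still just one key
seen[9999] = true;

var count = 0;
for (var k in seen) count++;
// count is now 2, the number of distinct ids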

{
    "_id"   : 1990,
    "value" : {
        "stars" : 593179,
        "n"     : 154617,
        "m"     : 7,
        "c"     : 120259,
        "avg"   : 3.8364410123078314
    }
}
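
Because out: "result1" writes the output to an ordinary collection, the result above can be pulled back with a normal query:

// the mapreduce output lands in a regular collection
db.result1.find()

// or fetch a single group's document directly
db.result1.findOne({ _id: 1990 })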


The Results

All times below are in mm:ss format. (Minutes, not hours.)

Query 1: Total Time 10:44 | Shards Time 03:46 | Final Function 06:58

This was the starting query above, as written.

Query 2: Total Time 90:48 | Shards Time 36:26 | Final Function 54:22

I widened the release year restriction from just 1990 to 1990-1999, via { year: { $gte: 1990, $lte: 1999 } }. That's close to a linear relationship between emitted records and time elapsed.

Query 3: Total Time 21:33 | Shards Time 13:53 | Final Function 07:40

I used movechunk to consolidate all of the chunks on one shard server, then shut down the other two. I reduced the release year restriction back to just 1990. It takes 2x longer than the first query, presumably due to disk bottlenecks? One shard trying to reduce 9 chunks at once?

Query 4: Total Time 02:08 | Shards Time 02:08 | Final Function -

I removed the for-in loops and COUNT(DISTINCT) logic, leaving only the plain record count and average, but was still on the one shard server, implying a 10x slowdown for that type of logic.

(The remaining two queries report Map Time and Emit Loop in place of Shards Time and Final Function.)

Query 5: Total Time 00:13 | Map Time 00:06 | Emit Loop 00:13

I connected to the one remaining shard directly, instead of through mongos, and ran the previous query (no for-in). Again, this implies a 10x slowdown due to trying to process chunks simultaneously.

Query 6: Total Time 05:24 | Map Time 00:15 | Emit Loop 01:14

Still connected directly to the one shard (no mongos) with all of the records, I ran the original query (with for-in logic). A slowdown of 25x seems a little high, but I ran the query twice to verify it.
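
To make "connected directly" concrete: queries 1-4 went through the mongos router, while 5 and 6 pointed the shell straight at the remaining shard's own mongod. Something like the following, where the netflix database name and the port numbers are illustrative, not the actual values from these runs:

mongo localhost:27017/netflix    # through the mongos router (queries 1-4)
mongo localhost:27018/netflix    # straight at the shard's mongod (queries 5-6)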

Lessons Learned

Queries scream when a single shard is left to its own devices—but when parallelism is attempted on the same shard
you get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.

Don't try to emulate COUNT(DISTINCT). Really.
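
If the distinct count is all you need, one cheaper route is the built-in distinct command rather than MapReduce, assuming your version supports it against your setup and the distinct value set fits in a single result document:

// count distinct customers for one release year without MapReduce
var res = db.runCommand({
    distinct: "training",
    key: "cust",
    query: { year: 1990 }
});
var cCustomers = res.values.length;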

I have to wonder whether mongos could be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues.
