Napkin math for MongoDB performance
2011-03-29 23:04
162 查看
文章来源:http://rickosborne.org/blog/2010/02/napkin-math-for-mongodb-performance/
As we all know, there are lies, damned lies, and statistics.
What I’m about to present shouldn’t even qualify as statistics—it’s just
a bunch of damned lies. I’m not set up to do any sort of rigorous
performance testing, so these should not be construed as anything but
what they are: one guy’s half-assed and probably flawed measurements.
I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL’s COUNT(DISTINCT column)
functionality. The short answer is: don’t do it
.
Or, if you do it, figure out a better way than I did. Along the way, I
gathered some metrics on what types of operations cause what kinds of
performance hits.
Yeah, I just took the Netflix prize data and inserted ~3M records. I
did the inserts across 3 shard services, all running on the same
machine, which led to 9 chunks of roughly equal size. I let MongoDB
handle the sharding—I didn’t manually split the shards. I ensured one
index on the collection, over movie
and cust
, which isn’t really used for the query in question, but I thought it was worth mentioning.
Yeah, I know performance is going to suffer because I’m running 3 shards from the same hard drive. That’s kindof the point.
I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo
with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks
while running the tests, but nothing that should have interfered
greatly.
And the MapReduce query itself, as I wrote it:
Those nasty bits with the for-in
loops are for the COUNT(DISTINCT column)
logic. This query produces the following result set:
format. (Minutes, not hours.)
you get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.
Don't try to emulate COUNT(DISTINCT)
. Really.
I have to wonder if mongos
can be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues?
推荐阅读:MongoDB: Terrible MapReduce Performance
MongoDB's performance on aggregation queries
Is this Map Reduce performance normal or I am missing something
As we all know, there are lies, damned lies, and statistics.
What I’m about to present shouldn’t even qualify as statistics—it’s just
a bunch of damned lies. I’m not set up to do any sort of rigorous
performance testing, so these should not be construed as anything but
what they are: one guy’s half-assed and probably flawed measurements.
I was playing around with MapReduce on MongoDB, trying to figure out how to code the equivalent of SQL’s COUNT(DISTINCT column)
functionality. The short answer is: don’t do it
.
Or, if you do it, figure out a better way than I did. Along the way, I
gathered some metrics on what types of operations cause what kinds of
performance hits.
The Setup
My set up is a database of 3,397,115 records, all of which look something like this:{ "_id" : 3002827, "mm" : 7, "stars" : 5, "date" : "2005-07-18", "dd" : 18, "cust" : 2213, "movie" : 14889, "yy" : 2005, "title" : "Species", "year" : 1995 }
Yeah, I just took the Netflix prize data and inserted ~3M records. I
did the inserts across 3 shard services, all running on the same
machine, which led to 9 chunks of roughly equal size. I let MongoDB
handle the sharding—I didn’t manually split the shards. I ensured one
index on the collection, over movie
and cust
, which isn’t really used for the query in question, but I thought it was worth mentioning.
Yeah, I know performance is going to suffer because I’m running 3 shards from the same hard drive. That’s kindof the point.
I ran all of this on my MacBook Pro, which is a 2.66 GHz Core 2 Duo
with 4GB of 1067 MHz DDR3. I continued to do other light-duty tasks
while running the tests, but nothing that should have interfered
greatly.
The Queries
Here’s the starting query’s SQL equivalent:SELECT releaseYear, COUNT(*) AS nRecords, COUNT(DISTINCT movie) AS mMovies, COUNT(DISTINCT cust) AS cCustomers, SUM(stars) AS totalStars, AVG(stars) AS avgStars FROM training WHERE (releaseYear = 1990) GROUP BY releaseYear
And the MapReduce query itself, as I wrote it:
db.runCommand({ mapreduce: "training", query: { year: 1990 }, map: function() { var m = {}, c = {}; m[this.movie] = true; c[this.cust] = true; emit( this.year, { "stars": this.stars, "n": 1, "m": m, "c": c } )}, reduce: function(key, vals) { var stars = 0, n = 0, m = {}, c = {}; for(var i = 0; i < vals.length; i++) { var v = vals[i]; stars += v.stars; n += v.n; for (var im in v.m) m[im] = true; for (var ic in v.c) c[ic] = true; } return { "stars": stars, "n": n, "m": m, "c": c }; }, finalize: function(key, val) { val.avg = val.stars / val.n; var m = 0, c = 0; for (var im in val.m) m++; for (var ic in val.c) c++; val.m = m; val.c = c; return val; }, out: "result1", verbose: true });
Those nasty bits with the for-in
loops are for the COUNT(DISTINCT column)
logic. This query produces the following result set:
{ "_id" : 1990, "value" : { "stars" : 593179, "n" : 154617, "m" : 7, "c" : 120259, "avg" : 3.8364410123078314 } }
The Results
All times below are in mm:ssformat. (Minutes, not hours.)
Query | Total Time | Shards Time | Final Function |
---|---|---|---|
1 | 10:44 | 03:46 | 06:58 |
This was the starting query above, as written. | |||
2 | 90:48 | 36:26 | 54:22 |
I widened the release year restriction from just 1990 to 1990-1999, via { year: { $gte: 1990, $lte: 1999 } } . That's close to a linear relationship between emitted records and time elapsed. | |||
3 | 21:33 | 13:53 | 07:40 |
I used movechunk to consolidate all of the chunks on one shard server, then shut down the other two. I reduced the release year restriction back to just 1990. It takes 2x longer than the first query, presumably due to disk bottlenecks? One shard trying to reduce 9 chunks at once? | |||
4 | 02:08 | 02:08 | - |
I removed the for-in loops and COUNT(DISTINCT) logic, leaving only the plain record count and average, but was still on the one shard server, implying a 10x slowdown for that type of logic. | |||
Query | Total Time | Map Time | Emit Loop |
5 | 00:13 | 00:06 | 00:13 |
I connected to the one remaining shard directly, instead of through mongos , and ran the previous query (no for-in ). Again, this implies a 10x slowdown due to trying to process chunks simultaneously. | |||
6 | 05:24 | 00:15 | 01:14 |
Still connected directly to the one shard (no mongos ) with all of the records, I ran the original query (with for-in logic). A slowdown of 25x seems a little high, but I ran the query twice to verify it. |
Lessons Learned
Queries scream when a single shard is left to its own devices—but when parallelism is attempted on the same shardyou get a massive performance hit. Don't run different shards off the same hard drive—no matter how many cores you have.
Don't try to emulate COUNT(DISTINCT)
. Really.
I have to wonder if mongos
can be tweaked to serialize queries against chunks on the same shard, to prevent disk contention issues?
推荐阅读:MongoDB: Terrible MapReduce Performance
MongoDB's performance on aggregation queries
Is this Map Reduce performance normal or I am missing something
相关文章推荐
- MongoDB Performance for more data than memory
- M101P: MongoDB for Developers - Chapter 4: Performance
- 高性能MySQL笔记-第5章Indexing for High Performance-004怎样用索引才高效
- MongoDB - The mongo Shell, Write Scripts for the mongo Shell
- Top 10 SQL Server Counters for Monitoring SQL Server Performance
- Note 23888 - SD performance and tuning: A basis for notes(sap条件类型性能优化NOTE)
- Multithreading For Performance
- Just another high performance string format function for JavaScript
- 连接mongodb时候报:Invalid credentials for database 'xxx'.
- MongoDb Driver For Net
- How To Collect 10046 Trace (SQL_TRACE) Diagnostics for Performance Issues
- Mongodb driver for java 使用样例
- Now you can provide attr "wx:key" for a "wx:for" to improve performance.
- BCC – Dynamic Tracing Tools for Linux Performance Monitoring
- Performance Testing Guidance for Web Applications
- Using C# Yield for Readability and Performance
- M102: MongoDB for DBAs chapter 2学习记录
- M102: MongoDB for DBAs chapter 1学习记录
- MathMagic Pro for InDesign for Mac(强大的公式编辑器)已激活版免注册码 v9.32最新破解版
- DB2性能监控系统,DB2数据库的实时监控和历史分析Foglight Performance Analysis for DB2