
Mongodb Journaling and replica-set oplog

2011-06-29 22:32
Journaling

Journaling mainly improves durability for a single mongod instance. It is enabled with the --journal command-line flag; mongod preallocates space for the journal files, which can be found in the journal/ subdirectory of the data directory. The other point worth noting is group commit: rather than writing the log on every I/O operation, mongod flushes the log to the journal file at a fixed interval (on the order of tens of milliseconds).

Source: http://www.mongodb.org/display/DOCS/Journaling

MongoDB v1.7.5+ supports write-ahead journaling of operations to facilitate fast crash recovery and durability in the storage engine.

Enabling

To enable, use the mongod --journal command line option.

# mongod --journal


MongoDB may determine that it is faster to preallocate journal files than to create them as needed. If MongoDB decides to preallocate the files, it will not start listening on port 27017 until this process completes, which can take a few minutes. This means that your applications and the shell will not be able to connect to the database immediately on initial startup. Check the logs to see if MongoDB is busy preallocating. It will print the standard "waiting for connections" message (naming the port) when it has finished.

Disabling

If journaling has been on, and you would like to run without it, simply (1) shut down mongod cleanly; (2) restart without the --journal option.

Journal Files

With --journal enabled, journal files will be created in a journal/ subdirectory under your chosen db path. These files are write-ahead redo logs. In addition, a last sequence number file, journal/lsn, will be created. A clean shutdown removes all files under journal/.

The Mongo data files (database.ns, database.0, database.1, ...) have the same format as in previous releases. Thus, the upgrade process is seamless, and a rollback would be seamless too. (If you roll back to a pre v1.7.5 release, try to shut down cleanly first. Regardless, remove the journal/ directory before starting the pre v1.7.5 version of mongod.)

Recovery

On a restart after a crash, journal files in journal/ will be replayed before the server goes online. This will be indicated in the log output. You do not need to run a repair.

The journal Subdirectory

You may wish, before starting mongod, to symlink the journal/ directory to a dedicated hard drive to speed the frequent (fsynced) sequential writes which occur to the current journal file.

Group Commits

MongoDB performs group commits (batch commits) when using the --journal option. This means that a series of operations over many milliseconds are committed all at once. This is done to achieve high performance.

Group commits are performed approximately every 100ms in v1.8.0. In future versions, commits will occur more frequently.
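The batching behavior described above can be sketched in a few lines. This is a toy model, not MongoDB's implementation: operations accumulate in a memory buffer and are flushed to the journal in one batch per commit interval, so many operations share a single sequential journal write.

```python
import time

class GroupCommitJournal:
    """Toy model of group commit: operations are buffered in memory
    and flushed to the journal as one batch per commit interval."""

    def __init__(self, interval_ms=100):
        self.interval = interval_ms / 1000.0
        self.buffer = []           # operations awaiting commit
        self.journal = []          # each entry is one committed batch
        self.last_flush = time.monotonic()

    def write(self, op):
        self.buffer.append(op)
        # Flush only when the commit interval has elapsed, so a burst
        # of writes costs one journal write instead of one per op.
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.journal.append(list(self.buffer))  # one batch = one fsync
            self.buffer.clear()
        self.last_flush = time.monotonic()

j = GroupCommitJournal(interval_ms=10)
for i in range(1000):
    j.write({"op": "insert", "doc": i})
j.flush()  # final flush, as on clean shutdown

# Far fewer journal batches (fsyncs) than operations:
print(len(j.journal), sum(len(b) for b in j.journal))
```

The trade-off is the same one the acknowledgement section below discusses: an operation is not durable until its batch is flushed, so durability lags writes by up to one commit interval.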

Commit Acknowledgement

You can wait for group commit acknowledgement with the getLastError command. When running with --journal, the fsync:true option returns after the data is physically written to the journal (rather than actually fsync'ing all the data files). Note that the group commit interval (see above) is considerable: you may prefer to call getLastError without fsync, or with a w: parameter instead when using replication. In releases after 1.8.0 the delay for commit acknowledgement will be shorter.

FAQ

If I am using replication, can some members use --journal and others not?
Yes.

How's performance?
Read performance should be the same. Write performance should be very good, but there is some overhead over the non-durable version as the journal files must be written. If you find a case where there is a large difference in performance between running with and without --journal, please let us know so we can tune it. Additionally, some performance-tuning enhancements in this area are already queued for v1.8.1+.

Can I use the journaling feature to perform safe hot backups?
Not yet, as the journal files are rotated out (unlinked) after data is safely in the data files.

32 bit nuances?
There is extra memory-mapped file activity with the --journal option. This will further constrain the limited db size of 32-bit builds. Thus, for now it is recommended to use --journal with 64-bit systems.

Why isn't --journal the default?
Mainly, just to be conservative for now, as it is new code. It will be the default at some point in the future.

When did the --journal option change from --dur?
In 1.8 the option was renamed to --journal, but the old name is still accepted for backwards compatibility; please change to --journal if you are using the old option.

Will the journal replay have problems if entries are incomplete (like the failure happened in the middle of one)?
Each journal (group) write is consistent and won't be replayed during recovery unless it is complete.

How many times is data written to disk when replication and journaling are both on?
In v1.8, for an insert, four times. The object is written to the main collection, and also the oplog collection (so that is twice). Both of those writes are journaled as a single mini-transaction in the journal file (the files in /data/db/journal). Thus 4 times total.

There is an open item to reduce this by compressing the journal. This will reduce the total from 4x to probably ~2.5x.

The above applies to collection data and inserts, which are the worst-case scenario. Index updates are written to the index and the journal, but not the oplog, so they should be 2X today, not 4X. Likewise, updates with things like $set, $addToSet, $inc, etc. are compactly logged all around, so those are generally small.
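The write-count arithmetic above can be made explicit with a small accounting sketch (the function and its parameters are illustrative, not a MongoDB API): each data-file structure the operation touches is one write, the oplog entry is itself a data-file write, and every data-file write gets a copy in the journal.

```python
# Toy accounting of disk writes per operation in v1.8 with journaling
# and replication both on, following the figures given above.

def disk_writes(data_file_targets, journaled=True, oplogged=True):
    """data_file_targets: the data-file structures the op touches
    (collection extents, index, ...). The oplog entry is itself a
    data-file write, and each data-file write is also journaled."""
    targets = list(data_file_targets)
    if oplogged:
        targets.append("oplog")
    journal_copies = len(targets) if journaled else 0
    return len(targets) + journal_copies

# An insert touches the collection and the oplog, both journaled: 4 writes.
print(disk_writes(["collection"]))             # 4
# An index update is journaled but not oplogged: 2 writes.
print(disk_writes(["index"], oplogged=False))  # 2
```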

Replica Sets - Oplog

This section explains how a replica set implements replication through the oplog: the master writes a record of each operation to the oplog, and each replica pulls those records over and replays them to reproduce the master's state. The key point is that the oplog is a capped collection, so it has a capped collection's characteristics: fixed size and sequential writes. If a replica falls too far behind, the beginning of the master's oplog will already have been overwritten, and MongoDB will consider that replica stale. A stale node is no longer allowed to sync and replay the oplog; instead it is treated like a brand-new node and must do a full data sync from the master, which can take a very long time when the master's data set is large. It is therefore best to estimate an appropriate oplog size in advance, taking into account factors such as operation frequency and network conditions, and then set it with the --oplogSize command-line parameter.

Source: http://www.mongodb.org/display/DOCS/Replica+Sets+-+Oplog

The basic replication process

All write operations are sent to the server (Insert, Update, Remove, DB/Collection/Index creation/deletion, etc.)

That operation is written to the database.

That operation is also written to the oplog.

Replicas (slaves) listen to the oplog for changes (known as “tailing the oplog”).

Each secondary copies the (idempotent) operation to their own oplog and applies the operations to their data.

This read-and-apply step is repeated continuously.
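The steps above can be sketched as a tail-and-apply loop. The op format and function names here are illustrative, not MongoDB's actual oplog format; the point is the mechanism: copy each new entry into the local oplog, then apply it, relying on ops being idempotent.

```python
# Minimal sketch of a secondary tailing and applying a primary's oplog.

def apply_op(store, op):
    """Ops are idempotent: applying the same op twice leaves the same
    result, so a secondary can safely re-apply after a restart."""
    if op["type"] in ("insert", "set"):
        store[op["key"]] = op["value"]
    elif op["type"] == "remove":
        store.pop(op["key"], None)

def sync(secondary_store, secondary_oplog, primary_oplog, applied_ts):
    # "Tail" the primary's oplog: take every entry newer than the last
    # timestamp we applied, copy it to our own oplog, then apply it.
    for op in primary_oplog:
        if op["ts"] > applied_ts:
            secondary_oplog.append(op)
            apply_op(secondary_store, op)
            applied_ts = op["ts"]
    return applied_ts

primary_oplog = [
    {"ts": 1, "type": "insert", "key": "a", "value": 1},
    {"ts": 2, "type": "set",    "key": "a", "value": 2},
    {"ts": 3, "type": "remove", "key": "a"},
]
store, local_oplog = {}, []
ts = sync(store, local_oplog, primary_oplog, 0)
print(store, ts)   # {} 3  -- insert, then update, then remove
```

Re-running sync with the same oplog applies nothing new, which is why the loop can simply resume from the last applied timestamp after a restart.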

The oplog collection

The oplog is a special collection type known as a capped collection. The oplog is a collection of fixed size containing information about the operation and a timestamp for that operation.

Because the oplog has a fixed size, it over-writes old data to make room for new data. At any given time, the oplog only contains a finite history of operations.
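The overwrite behavior of a capped collection can be modeled with a bounded queue. This is a sketch, not MongoDB's storage format: Python's deque with a maxlen has exactly the fixed-size, insertion-order, oldest-overwritten-first semantics described above.

```python
from collections import deque

# Toy capped collection: fixed size, sequential writes, and the
# oldest entries silently overwritten to make room for new ones.
oplog = deque(maxlen=5)

for ts in range(1, 9):          # write 8 ops into a 5-slot oplog
    oplog.append({"ts": ts})

# Only the newest 5 operations survive; ts 1-3 have been overwritten,
# so any secondary still needing them can no longer catch up here.
print([op["ts"] for op in oplog])   # [4, 5, 6, 7, 8]
```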

Falling Behind

Each secondary keeps track of which oplog items have been copied and applied locally. This allows the secondary to have a copy of the primary's oplog, which is consistent across the replica set.

If a secondary falls behind for a short period of time, it will make a best effort to “catch-up”.

Example:

A secondary needs 5 minutes of downtime to be rebooted.

When this computer comes online, the mongod process will compare its oplog to that of the Master.

mongod will identify that it is 5 minutes behind.

mongod will begin processing the primary’s oplog sequentially until it is “caught up”.

This is the ideal situation. The oplog has a finite length, so it can only contain a limited amount of history.

Becoming Stale

If a secondary falls too far behind the primary's oplog, that node will become stale.

A stale member will stop replication since it can no longer catch up through the oplog.

Example:

an oplog contains 20 hours of data

a secondary is offline for 21 hours

that secondary will become stale and will stop replicating
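The staleness test in the example reduces to one comparison (a sketch with illustrative names, using timestamps as a stand-in for hours of history): a secondary is stale exactly when its last applied entry is older than the oldest entry still present in the primary's oplog.

```python
# Sketch: a secondary is stale when the ops it still needs have
# already been overwritten in the primary's capped oplog.

def is_stale(last_applied_ts, primary_oplog):
    oldest_available = min(op["ts"] for op in primary_oplog)
    return last_applied_ts < oldest_available

# The oplog retains ts 101..120 (think: 20 hours of history).
primary_oplog = [{"ts": t} for t in range(101, 121)]

# Offline too long -- last applied op (ts 100) is already gone: stale.
print(is_stale(100, primary_oplog))   # True
# Only a little behind -- ts 105 is still in the oplog: can catch up.
print(is_stale(105, primary_oplog))   # False
```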

If you have a stale replica, see the documents for resyncing.

Preventing a Stale Replica

The oplog should be large enough to allow for unplanned downtime, replication lag (due to network or machine load issues), and planned maintenance.

The size of the oplog is configured at startup using the --oplogSize command-line parameter. This value is used when you initialize the set, which is the time when the oplog is created (1.7.2+). If you change the --oplogSize parameter later, it has no effect on your existing oplog.

There is no easy formula for deciding on an oplog size. The size of each write operation in the oplog is not fixed.

Running your system for a while is the best way to estimate the space required. The size of your oplog, and the amount of time it covers, depend on the types and frequency of your writes and updates.
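Once you have measured an average write rate and op size for your workload, a back-of-the-envelope estimate follows directly. The function and the numbers below are made up for illustration; only measurement of your own workload gives real inputs.

```python
# Rough oplog sizing from measured workload numbers (all illustrative).

def oplog_size_mb(writes_per_sec, avg_op_bytes, hours_of_history):
    """MB of oplog needed to retain hours_of_history of operations."""
    return writes_per_sec * avg_op_bytes * hours_of_history * 3600 / 2**20

# e.g. 200 writes/s of ~512-byte ops, keeping 24h of history so a
# secondary can survive a day-long outage or maintenance window:
size = oplog_size_mb(writes_per_sec=200, avg_op_bytes=512,
                     hours_of_history=24)
print(size)   # 8437.5 -> round up, e.g. --oplogSize 8500
```

Rounding up leaves headroom for bursts; undersizing is the expensive mistake, since a stale secondary forces a full resync.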

Recovering a stale replica is similar to adding a new replica.