
Cockroach Design Translation (Part 7): Conflict Resolution

2016-12-06 08:10

9 Conflict Resolution

Things get more interesting when a reader or writer encounters an intent record or newly-committed value in a location that it needs to read or write. This is a conflict, usually causing either of the transactions to abort or restart depending on the type of conflict.

Transaction restart:
This is the usual (and more efficient) type of behaviour and is used except when the transaction was aborted (for instance by another transaction). In effect, that reduces to two cases. The first is the one outlined above: an SSI transaction that finds upon attempting to commit that its commit timestamp has been pushed. The second case involves a transaction actively encountering a conflict, that is, one of its readers or writers encounters data that necessitates conflict resolution (see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its timestamp forward depending on data tied to the conflict, and begins anew reusing the same txn id. The prior run of the transaction might have written some write intents, which need to be deleted before the transaction commits, so as to not be included as part of the transaction. These stale write intent deletions are done during the reexecution of the transaction, either implicitly, through writing new intents to the same keys as part of the reexecution of the transaction, or explicitly, by cleaning up stale intents that are not part of the reexecution of the transaction. Since most transactions will end up writing to the same keys, the explicit cleanup run just before committing the transaction is usually a NOOP.
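The split between implicit and explicit cleanup can be illustrated with a minimal Python sketch. The `Txn` class, its field names, and the helper functions are hypothetical and exist only for illustration; they are not Cockroach's actual structures. The sketch assumes integer timestamps and models intents as a set of keys.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    """Toy transaction state; field names are illustrative assumptions."""
    txn_id: str
    timestamp: int
    priority: int
    intents: set = field(default_factory=set)  # keys holding this txn's write intents


def restart(txn, new_timestamp, new_priority=None):
    """Begin a new run under the same txn id and return the prior run's intent keys."""
    prior_intents = set(txn.intents)
    txn.intents = set()
    # The timestamp only ever moves forward on restart.
    txn.timestamp = max(txn.timestamp, new_timestamp)
    if new_priority is not None:
        txn.priority = new_priority
    return prior_intents


def stale_intents(prior_intents, txn):
    """Keys written by the prior run that the current run did not write again.

    Keys that were rewritten are cleaned up implicitly, because the new intent
    replaces the old one. Only the remainder needs an explicit delete before commit.
    """
    return prior_intents - txn.intents


txn = Txn(txn_id="txn-1", timestamp=10, priority=3, intents={"a", "b"})
prior = restart(txn, new_timestamp=15)
txn.intents.update({"a", "b"})            # the rerun writes the same keys
explicit_cleanup = stale_intents(prior, txn)  # empty here, so the cleanup is a no-op
```

When the rerun writes the same keys as the prior run, the set difference is empty, which corresponds to the NOOP case described above.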

Transaction abort:
This is the case in which a transaction, upon reading its transaction record, finds that it has been aborted. In this case, the transaction cannot reuse its intents; it returns control to the client before cleaning them up (other readers and writers would clean up dangling intents as they encounter them) but will make an effort to clean up after itself. The next attempt (if applicable) then runs as a new transaction with a new txn id.

Transaction interactions:
There are several scenarios in which transactions interact:

- Reader encounters write intent or value with newer timestamp far enough in the future: This is not a conflict. The reader is free to proceed; after all, it will be reading an older version of the value and so does not conflict. Recall that the write intent may be committed with a later timestamp than its candidate; it will never commit with an earlier one. Side note: if an SI transaction reader finds an intent with a newer timestamp which the reader's own transaction has written, the reader always returns that intent's value.

- Reader encounters write intent or value with newer timestamp in the near future: In this case, we have to be careful. The newer intent may, in absolute terms, have happened in our read's past if the clock of the writer is ahead of the node serving the values. In that case, we would need to take this value into account, but we just don't know. Hence the transaction restarts, using instead a future timestamp (but remembering a maximum timestamp used to limit the uncertainty window to the maximum clock skew). In fact, this is optimized further; see the details under "Choosing a Timestamp" below.

- Reader encounters write intent with older timestamp: The reader must follow the intent's transaction id to the transaction record. If the transaction has already been committed, then the reader can just read the value. If the write transaction has not yet been committed, then the reader has two options. If the write conflict is from an SI transaction, the reader can push that transaction's commit timestamp into the future (and consequently not have to read it). This is simple to do: the reader just updates the transaction's commit timestamp to indicate that when/if the transaction does commit, it should use a timestamp at least as high. However, if the write conflict is from an SSI transaction, the reader must compare priorities. If the reader has the higher priority, it pushes the transaction's commit timestamp (that transaction will then notice its timestamp has been pushed, and restart). If it has the lower or same priority, it retries itself using as a new priority max(new random priority, conflicting txn's priority - 1).

- Writer encounters uncommitted write intent: If the other write intent has been written by a transaction with a lower priority, the writer aborts the conflicting transaction. If the write intent has a higher or equal priority, the transaction retries, using as a new priority max(new random priority, conflicting txn's priority - 1); the retry occurs after a short, randomized backoff interval.

- Writer encounters newer committed value: The committed value could also be an unresolved write intent made by a transaction that has already committed. The transaction restarts. On restart, the same priority is reused, but the candidate timestamp is moved forward to the encountered value's timestamp.

- Writer encounters more recently read key: The read timestamp cache is consulted on each write at a node. If the write's candidate timestamp is earlier than the low water mark on the cache itself (i.e. its last evicted timestamp) or if the key being written has a read timestamp later than the write's candidate timestamp, this later timestamp value is returned with the write. A new timestamp forces a transaction restart only if the transaction is serializable.
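The priority-based rules in the scenarios above can be condensed into a small decision sketch. This is a hedged illustration in Python, not Cockroach's implementation: the `Action` enum, the function names, and the priority range `1..max_priority` are assumptions made for the example. Only the rule max(new random priority, conflicting txn's priority - 1) and the push/abort/retry outcomes come from the text.

```python
import random
from enum import Enum


class Action(Enum):
    READ_VALUE = "read the committed value"
    PUSH_TIMESTAMP = "push the writer's commit timestamp"
    ABORT_OTHER = "abort the conflicting transaction"
    RETRY_SELF = "retry own transaction with a new priority"


def retry_priority(conflicting_priority, rng=random, max_priority=1000):
    """max(new random priority, conflicting txn's priority - 1).

    The range 1..max_priority is an assumption for this sketch.
    """
    return max(rng.randint(1, max_priority), conflicting_priority - 1)


def reader_meets_older_intent(reader_priority, writer_committed,
                              writer_is_ssi, writer_priority):
    """Reader finds a write intent with an older timestamp."""
    if writer_committed:
        return Action.READ_VALUE
    if not writer_is_ssi:
        # SI writer: push its commit timestamp, no priority comparison needed.
        return Action.PUSH_TIMESTAMP
    if reader_priority > writer_priority:
        # SSI writer with lower priority: push it (it will notice and restart).
        return Action.PUSH_TIMESTAMP
    return Action.RETRY_SELF


def writer_meets_uncommitted_intent(writer_priority, other_priority):
    """Writer finds another transaction's uncommitted write intent."""
    if other_priority < writer_priority:
        return Action.ABORT_OTHER
    # Higher or equal priority: back off briefly, then retry.
    return Action.RETRY_SELF
```

Because the retrying transaction's new priority is at least the conflicting priority minus one, repeated retries quickly close the gap with the winner, which is how the scheme avoids starving a transaction indefinitely.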

Transaction management
Transactions are managed by the client proxy (or gateway in SQL Azure parlance). Unlike in Spanner, writes are not buffered but are sent directly to all implicated ranges. This allows the transaction to abort quickly if it encounters a write conflict. The client proxy keeps track of all written keys in order to resolve write intents asynchronously upon transaction completion. If a transaction commits successfully, all intents are upgraded to committed. In the event a transaction is aborted, all written intents are deleted. The client proxy doesn't guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is committed, the dangling transaction would continue to "live" until aborted by another transaction. Transactions periodically heartbeat their transaction record to maintain liveness. Transactions encountered by readers or writers with dangling intents which haven't been heartbeat within the required interval are aborted. In the event the proxy restarts after a transaction commits but before the asynchronous resolution is complete, the dangling intents are upgraded when encountered by future readers and writers and the system does not depend on their timely resolution for correctness.
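How a reader or writer might treat a dangling intent can be sketched as follows. This is an illustrative Python sketch only: the `TxnRecord` fields, the status strings, and the returned descriptions are assumptions for the example, and time is modelled as plain seconds.

```python
from dataclasses import dataclass


@dataclass
class TxnRecord:
    """Toy transaction record; names are illustrative assumptions."""
    status: str            # "PENDING", "COMMITTED" or "ABORTED"
    last_heartbeat: float  # time of the last heartbeat, in seconds


def resolve_dangling_intent(record, now, heartbeat_interval):
    """Decide what to do with an intent whose owner may have gone away."""
    if record.status == "COMMITTED":
        # The proxy died after commit: whoever finds the intent upgrades it.
        return "upgrade intent to committed value"
    if record.status == "ABORTED":
        return "delete intent"
    if now - record.last_heartbeat > heartbeat_interval:
        # No heartbeat within the required interval: abort the owner.
        record.status = "ABORTED"
        return "delete intent"
    # The owner is still alive, so the usual conflict rules apply.
    return "resolve as a normal conflict"
```

Correctness does not depend on when this cleanup happens, since the transaction record alone determines whether an intent counts as committed.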

An exploration of retries with contention and abort times with abandoned transactions is here:

https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing

Transaction Records
Please see pkg/roachpb/data.proto for the up-to-date structures, the best entry point being message Transaction.

Pros
- No requirement for reliable code execution to prevent stalled 2PC protocol.

- Readers never block with SI semantics; with SSI semantics, they may abort.

- Lower latency than traditional 2PC commit protocol (w/o contention) because second phase requires only a single write to the transaction record instead of a synchronous round to all transaction participants.

- Priorities avoid starvation for arbitrarily long transactions and always pick a winner from between contending transactions (no mutual aborts).

- Writes not buffered at client; writes fail fast.

- No read-locking overhead required for serializable SI (in contrast to other SSI implementations).

- Well-chosen (i.e. less random) priorities can flexibly give probabilistic guarantees on latency for arbitrary transactions (for example: make OLTP transactions 10x less likely to abort than low priority transactions, such as asynchronously scheduled jobs).

Cons
- Reads from non-lease holder replicas still require a ping to the lease holder to update the read timestamp cache.

- Abandoned transactions may block contending writers for up to the heartbeat interval, though average wait is likely to be considerably shorter (see graph in link). This is likely considerably more performant than detecting and restarting 2PC in order to release read and write locks.

- Behavior different than other SI implementations: no first writer wins, and shorter transactions do not always finish quickly. Element of surprise for OLTP systems may be a problematic factor.

- Aborts can decrease throughput in a contended system compared with two phase locking. Aborts and retries increase read and write traffic, increase latency and decrease throughput.

Choosing a Timestamp
A key challenge of reading data in a distributed system with clock skew is choosing a timestamp guaranteed to be greater than the latest timestamp of any committed transaction (in absolute time). No system can claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations) accessing a single node is easy. The timestamp is assigned by the node itself, so it is guaranteed to be at a greater timestamp than all the existing timestamped data on the node.

For multiple nodes, the timestamp t of the node coordinating the transaction is used. In addition, a maximum timestamp t+ε is supplied to provide an upper bound on timestamps for already-committed data (ε is the maximum clock skew). As the transaction progresses, any data read which has a timestamp greater than t but less than t+ε causes the transaction to abort and retry with the conflicting timestamp t_c, where t_c > t. The maximum timestamp t+ε remains the same. This implies that transaction restarts due to clock uncertainty can only happen on a time interval of length ε.
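The uncertainty window can be made concrete with a short sketch. This is a minimal Python illustration that assumes integer timestamps; the function names and return strings are hypothetical. The boundary handling follows the wording above (strictly greater than t and strictly less than t+ε).

```python
def classify_read(value_ts, read_ts, max_ts):
    """Classify a value seen by a read at read_ts with upper bound max_ts = t + eps."""
    if value_ts <= read_ts:
        return "visible"   # safely in the past, the read can return it
    if value_ts < max_ts:
        return "restart"   # inside the uncertainty window
    return "ignore"        # far enough in the future, not a conflict


def restart_after_uncertain_read(read_ts, max_ts, conflicting_ts):
    """Retry at the conflicting timestamp t_c; the maximum timestamp stays fixed."""
    if not read_ts < conflicting_ts < max_ts:
        raise ValueError("conflicting timestamp must lie inside the window")
    return conflicting_ts, max_ts


t, eps = 100, 10
read_ts, max_ts = t, t + eps
if classify_read(104, read_ts, max_ts) == "restart":
    read_ts, max_ts = restart_after_uncertain_read(read_ts, max_ts, 104)
# The window has shrunk from (100, 110) to (104, 110).
```

Each restart moves the read timestamp forward while the upper bound stays put, so the window can only shrink, which is why such restarts are confined to an interval of length ε.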

We apply another optimization to reduce the restarts caused by uncertainty. Upon restarting, the transaction not only takes into account t_c, but also the timestamp of the node at the time of the uncertain read, t_node. The larger of those two timestamps t_c and t_node (likely equal to the latter) is used to increase the read timestamp. Additionally, the conflicting node is marked as "certain". Then, for future reads to that node within the transaction, we set MaxTimestamp = Read Timestamp, preventing further uncertainty restarts.
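A sketch of this optimization, under the same assumptions as before (integer timestamps, illustrative names): the `ReadTxn` class and its methods are hypothetical, while the rule itself (advance to the larger of t_c and t_node, mark the node "certain", and use MaxTimestamp = Read Timestamp for that node) is taken from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class ReadTxn:
    """Toy read-side transaction state; names are illustrative assumptions."""
    read_ts: int
    max_ts: int
    certain_nodes: set = field(default_factory=set)

    def max_ts_for(self, node):
        # For a "certain" node, MaxTimestamp = Read Timestamp.
        return self.read_ts if node in self.certain_nodes else self.max_ts

    def is_uncertain(self, node, value_ts):
        # With max_ts_for(node) == read_ts this range is empty, so a certain
        # node can never trigger another uncertainty restart.
        return self.read_ts < value_ts < self.max_ts_for(node)

    def restart_on_uncertainty(self, node, conflicting_ts, node_ts):
        # Advance to the larger of t_c and t_node, then mark the node certain.
        self.read_ts = max(self.read_ts, conflicting_ts, node_ts)
        self.certain_nodes.add(node)


txn = ReadTxn(read_ts=100, max_ts=110)
if txn.is_uncertain("n1", 104):
    txn.restart_on_uncertainty("n1", conflicting_ts=104, node_ts=106)
# After the restart, a value at 108 on n1 is simply treated as being in the
# future, while the same value on another node would still be uncertain.
```

Using t_node rather than only t_c may advance the read timestamp further than strictly needed, in exchange for at most one uncertainty restart per node.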

Correctness follows from the fact that we know that at the time of the read, there exists no version of any key on that node with a higher timestamp than t_node. Upon a restart caused by the node, if the transaction encounters a key with a higher timestamp, it knows that in absolute time, the value was written after t_node was obtained, i.e. after the uncertain read. Hence the transaction can move forward reading an older version of the data (at the transaction's timestamp). This limits the time uncertainty restarts attributed to a node to at most one. The tradeoff is that we might pick a timestamp larger than the optimal one (> highest conflicting timestamp), resulting in the possibility of a few more conflicts.

We expect retries will be rare, but this assumption may need to be revisited if retries become problematic. Note that this problem does not apply to historical reads. An alternate approach which does not require retries makes a round to all node participants in advance and chooses the highest reported node wall time as the timestamp. However, knowing which nodes will be accessed in advance is difficult and potentially limiting. Cockroach could also potentially use a global clock (Google did this with Percolator), which would be feasible for smaller, geographically-proximate clusters.

(Translator's note: the global clock here refers to something like a Time Oracle or the Chubby lock service. Google runs large, globally distributed clusters and therefore uses atomic clocks; a global clock of the kind described for Cockroach would not work at that scale.)
