
Designing Data-Intensive Applications (O'Reilly, 2017) — Reading Notes

2017-04-05 19:01

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Contents

1. Reliability, Scalability, Maintainability
2. Data Models and Query Languages
3. Storage and Retrieval
4. Encoding and Evolution
5. Replication
6. Partitioning
7. Transactions
8. The Trouble with Distributed Systems
9. Consistency and Consensus
10. Batch Processing
11. Stream Processing
12. The Future of Data Systems

Reliability, Scalability, Maintainability

Data Models and Query Languages

Schema updates:
Relational/SQL: ~ statically typed languages
NoSQL: ~ dynamically typed languages
Document stores: MongoDB's BSON --> rewrite the whole document vs. update in place? --> incremental snapshots?

grouping related data --> locality
Spanner's "nested (interleaved) tables"
Oracle's "multi-table index cluster tables"

Convergence of document and relational models: RethinkDB --> relational-like joins?
MongoDB 2.2+ aggregation pipeline: a declarative flavor of map-reduce

Property graphs: Neo4j, Titan, InfiniteGraph
Triple-stores: Datomic, AllegroGraph

Declarative graph query languages: Cypher, SPARQL (RDF-based), Datalog (a Prolog-like language; mostly academic)
vs CODASYL (the network model)

Storage and Retrieval

Append-only log --> segments with an in-memory hash index --> compaction deduplicates keys
Range queries are inefficient with hash indexes

SSTable --> memtable (in-memory sorted key-value map) + write-ahead log (sketch after this block)
The idea comes from Bigtable; e.g. LevelDB, RocksDB

LSM-tree
also used by full-text indexes (to store the inverted index, e.g. in Lucene)

Bloom filter: rules out non-existent keys early (? though when it can't rule a key out, the lookup still costs the same)
When to compact & merge:
size-tiered
leveled
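
A minimal sketch of this read/write path under toy assumptions (in-memory only; TinyLSM and its parameters are mine, not from the book): writes go into a sorted memtable that is flushed as an immutable sorted segment when full; reads check the memtable, then segments newest-first, which is where a real store would consult a Bloom filter before touching each segment.

    import bisect

    class TinyLSM:
        def __init__(self, memtable_limit=4):
            self.memtable = {}            # mutable, holds the most recent writes
            self.segments = []            # immutable sorted (key, value) lists, oldest first
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.memtable[key] = value
            if len(self.memtable) >= self.memtable_limit:
                # flush: this would be an SSTable segment on disk in a real system
                self.segments.append(sorted(self.memtable.items()))
                self.memtable = {}

        def get(self, key):
            if key in self.memtable:
                return self.memtable[key]
            for seg in reversed(self.segments):   # newest segment first
                # a real store would check a Bloom filter here to skip segments
                i = bisect.bisect_left(seg, (key,))
                if i < len(seg) and seg[i][0] == key:
                    return seg[i][1]
            return None

    db = TinyLSM()
    for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 9)]:
        db.put(k, v)
    assert db.get("a") == 9    # the newest write wins over the flushed segment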

B-tree: the standard index implementation
Making it reliable: WAL (redo log)
Are B-tree writes unfriendly to SSDs?
Concurrency control: latches
Optimizations*

LSM-tree background compaction can hurt performance; less predictable than a B-tree (essentially the same work, amortized)
Other index structures
secondary (conceptually Map<K, List<V>>)
clustered (the value is stored in the index)
multi-column (concatenated)
full-text and fuzzy

In-memory databases
Data warehouses: analytic queries
Vendor products: Teradata, Vertica, SAP HANA, ParAccel; AWS Redshift; SQL-on-Hadoop (Hive, Spark SQL, Impala, Presto, Tajo, and Drill, which is based on Dremel)
Star schema: fact table (an event log?) + dimension tables (further derived attributes?)
Snowflake schema

Column-oriented storage
Column compression: bitmap encoding (one bitmap per distinct value), well suited to IN (...) queries? (sketch below)
Multiple sort orders?
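
A small sketch of the bitmap-encoding idea (example data and names are hypothetical): one bitmap per distinct value of a low-cardinality column, so a query like WHERE kind IN ('click', 'view') becomes a bitwise OR of bitmaps.

    column = ["click", "view", "click", "buy", "view", "click"]

    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, 0)
        bitmaps[value] |= 1 << row        # set this row's bit in the value's bitmap

    def rows_in(*values):
        mask = 0
        for v in values:
            mask |= bitmaps.get(v, 0)     # OR together the bitmaps of the wanted values
        return [row for row in range(len(column)) if mask >> row & 1]

    print(rows_in("click", "view"))       # -> [0, 1, 2, 4, 5]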

Aggregation: data cubes & materialized views*

Encoding and Evolution

JSON; binary variants (Base64 for blobs?)
BSON: extends the set of datatypes
Thrift & Protocol Buffers (open-sourced in 2007-08)
no field-name strings; field tags (small integers) instead

Avro: length prefix + UTF-8 bytes
reader's & writer's schema
schema evolution rule: fields must have default values (sketch below)
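
A toy illustration of reader/writer schema resolution (this is not Avro's actual binary encoding; the resolve function and the schema representation are simplifications of mine): fields the reader doesn't know are dropped, and fields the writer didn't write are filled from the reader schema's defaults, which is why added or removed fields must carry defaults.

    def resolve(record, reader_schema):
        out = {}
        for field, default in reader_schema:
            out[field] = record.get(field, default)   # missing field -> default value
        return out

    writer_record = {"name": "alice", "age": 30}      # encoded with the old (writer's) schema
    reader_schema = [("name", ""), ("age", 0), ("email", None)]  # new schema adds "email"
    print(resolve(writer_record, reader_schema))
    # {'name': 'alice', 'age': 30, 'email': None}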

ASN.1: e.g. DER
Distributed actor frameworks: Akka, Orleans, Erlang OTP

Replication

Consistency models under replication lag:
Read-after-write consistency
Monotonic reads
Consistent prefix reads (causal ordering)

Leaderless, Dynamo-style (no guaranteed write ordering? but what about application-level hash routing? if all data is potentially related, as in a social network, that won't work...)
The client reads from several nodes in parallel and takes the value with the highest version?
Then the client writes the newer value back to the stale replicas (read repair)
Or a server-side anti-entropy background process keeps replicas in sync?

Quorums: w + r > n (sketch below)
Limitations of quorum consistency (omitted)
Monitoring staleness
measuring replication lag
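
A minimal quorum-read sketch (all data illustrative): with w + r > n, every read set overlaps every write set, so at least one of the r responses carries the latest version and the client keeps the highest-versioned value.

    n, w, r = 3, 2, 2
    assert w + r > n, "quorum condition violated"

    replicas = [
        {"version": 7, "value": "new"},   # reached by the last write
        {"version": 7, "value": "new"},   # reached by the last write
        {"version": 6, "value": "old"},   # lagging replica
    ]

    def quorum_read(replicas, r):
        responses = replicas[:r]          # pretend these r replicas answered
        best = max(responses, key=lambda x: x["version"])
        # read repair would now write `best` back to the stale replicas
        return best["value"]

    print(quorum_read(replicas, r))       # -> "new"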

Sloppy Quorums and Hinted Handoff
increase write availability, but weaken read guarantees
"Sloppy quorums are optional in all common Dynamo implementations. In Riak they are enabled by default, and in Cassandra and Voldemort they are disabled by default" (oddly divergent defaults?)

Detecting Concurrent Writes (multiple clients writing the same key at the same time)
Last write wins (discarding concurrent writes) -- application-level versioning?
The "happens-before" relationship and concurrency (the same term appears in programming-language memory models)
if two operations don't depend on each other, they are concurrent; no global timestamp is needed (the network ~ the speed of light)

Capturing the happens-before relationship
version numbers: you must read before you write, obtaining the current version number (a snapshot), and the client is responsible for merging siblings before writing back
which is exactly the usual SVN/Git workflow, isn't it

Merging concurrently written values
Such a `deletion marker` is known as a `tombstone`.

Version vectors (automatic branching...) (sketch below)
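
A sketch of the single-node version-number scheme described above (class and method names are mine): each write carries the version the client last read; the server discards only the siblings that client had already seen, keeping concurrent writes as siblings for a later merge.

    class VersionedKey:
        def __init__(self):
            self.version = 0
            self.siblings = []        # list of (version, value)

        def read(self):
            return self.version, [v for _, v in self.siblings]

        def write(self, value, based_on_version):
            self.version += 1
            # drop everything the writing client had already seen...
            self.siblings = [(ver, v) for ver, v in self.siblings
                             if ver > based_on_version]
            # ...and keep the new value alongside any concurrent siblings
            self.siblings.append((self.version, value))
            return self.version

    cart = VersionedKey()
    cart.write(["milk"], 0)               # client A, had read nothing
    v2 = cart.write(["eggs"], 0)          # client B, concurrent with A
    print(cart.read())                    # two siblings: ['milk'] and ['eggs']
    cart.write(["milk", "eggs"], v2)      # A merges the siblings and writes back
    print(cart.read())                    # only the merged value remains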

Partitioning

Partitioning strategies
by key range: hot-spot problem
by hash of key: no efficient key-range queries
Consistent hashing: "consistent" here describes the rebalancing approach
Cassandra: compound primary key; only the first column is hashed (hash + range?)

Partitioning cannot help with massive concurrent reads/writes of a single key: Skewed Workloads and Relieving Hot Spots

Secondary indexes (document attributes in a document database)
document-partitioned (local index): scatter/gather reads
in theory, if you partitioned by a hash of the combination of these secondary attributes instead of by document id, scatter/gather could be avoided

term-partitioned: a global index, itself partitioned
writes become more complex: in practice, databases don't support distributed transactions across index partitions?

Rebalancing
don't use the hash mod N approach; it moves far too much data! (sketch after this block)
Fix 1: assign many partitions to each physical node
fix the total number of partitions up front? does the node-partition assignment need manual maintenance?
how do you choose the right total number of partitions??
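
A quick demonstration of both points (numbers illustrative; Python's built-in hash stands in for a real hash function): with mod N, adding one node remaps most keys, whereas with a fixed partition count a key never changes partition and only whole partitions are handed between nodes.

    keys = [f"user{i}" for i in range(10000)]

    moved = sum(hash(k) % 4 != hash(k) % 5 for k in keys) / len(keys)
    print(f"mod-N, 4 -> 5 nodes: {moved:.0%} of keys move")   # typically ~80%

    PARTITIONS = 256                       # fixed up front; key -> partition never changes
    assignment = {p: p % 4 for p in range(PARTITIONS)}   # partition -> node, 4 nodes
    new_node = 4
    for p in range(0, PARTITIONS, 5):      # the new node steals ~1/5 of the partitions
        assignment[p] = new_node
    # only the stolen partitions (~20% of the data) move, and the
    # key -> partition mapping is untouched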

Dynamic partitioning

Request Routing
routing tier: acts as a partition-aware load balancer
ZooKeeper: keeps the routing info up-to-date

Transactions

ACID
The word `consistency` is terribly overloaded (it has to be guaranteed by the application?)
Isolation
serializable: performance problems; Oracle doesn't even implement it!
snapshot isolation (MVCC)

The author's criticism: ... such as Rails's ActiveRecord and Django don't retry aborted transactions

Weak Isolation Levels
Read Committed: no dirty reads, no dirty writes
dirty write: one transaction's write overwrites another transaction's uncommitted write
implementation:
row-level locks

Snapshot Isolation and Repeatable Read
situations that cannot tolerate even temporary inconsistency:
Backups
Analytic queries and integrity checks

Implementation:
key principle: readers never block writers, and writers never block readers
txid: each row carries created_by and deleted_by fields (with a separate GC process); an update becomes delete + create?
object visibility rules (sketch below)
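
A simplified sketch of those visibility rules (the row and snapshot representations are mine; real databases also track in-progress and aborted transactions): a row version is visible iff its creating transaction had committed before the reader's snapshot and its deleting transaction had not.

    def visible(row, snapshot):
        """snapshot: set of txids committed when the reader's transaction started."""
        created_ok = row["created_by"] in snapshot
        deleted = row["deleted_by"] is not None and row["deleted_by"] in snapshot
        return created_ok and not deleted

    rows = [  # an update appears as delete + create of a new row version
        {"value": 500, "created_by": 3, "deleted_by": 7},
        {"value": 400, "created_by": 7, "deleted_by": None},
    ]
    before_tx7 = {1, 2, 3, 4, 5, 6}
    after_tx7 = {1, 2, 3, 4, 5, 6, 7}
    print([r["value"] for r in rows if visible(r, before_tx7)])   # [500]
    print([r["value"] for r in rows if visible(r, after_tx7)])    # [400]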

Indexes: append-only B-trees
... As a result, nobody really knows what repeatable read means.

Preventing Lost Updates
two concurrent read-modify-write cycles
atomic write operations
application-side explicit locking: SELECT ... FOR UPDATE
automatic detection and abort
Compare-and-set (sketch below)
the replication problem: conflict resolution and replication (atomic operations need to be commutative)
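
A compare-and-set sketch (the in-process lock only stands in for the database's atomic conditional update; in SQL this is roughly UPDATE counters SET value = :new WHERE id = :id AND value = :old): the read-modify-write retries until no concurrent writer interfered, so no update is lost.

    import threading

    store = {"counter": 0}
    db_atomicity = threading.Lock()       # stands in for the database's atomicity

    def compare_and_set(key, expected, new):
        with db_atomicity:                # the DB evaluates the condition atomically
            if store[key] != expected:
                return False              # someone else won; the caller must retry
            store[key] = new
            return True

    def increment(key):
        while True:                       # read-modify-write with a CAS retry loop
            old = store[key]
            if compare_and_set(key, old, old + 1):
                return

    threads = [threading.Thread(target=lambda: [increment("counter") for _ in range(1000)])
               for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(store["counter"])               # 4000: no lost updates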

Write Skew and Phantoms
two transactions read the same objects from a snapshot, then each performs its own update (a concurrency anomaly)
=> SELECT ... FOR UPDATE (but this can't help when the later write isn't against an object the earlier query read: a phantom)
Materializing conflicts

Serializability
Literally executing transactions in a serial order (in-memory databases? Redis)
systems with single-threaded serial transaction processing don't allow interactive multi-statement transactions (hence stored procedures?)
the author plugs VoltDB a bit here~

2PL
2PL is not 2PC
shared locks -> exclusive locks
deadlock

Predicate locks (locking a 'search condition')
Index-range locks

Optimistic concurrency control techniques such as serializable snapshot isolation (SSI)
optimistic concurrency control performs poorly under high contention?
detecting an outdated premise
Detecting stale MVCC reads (???)
the question here: how does SSI know which data was an uncommitted write when the snapshot was created but has since been committed?
the database needs to track when a transaction ignores another transaction's writes due to MVCC visibility rules (???)

Detecting writes that affect prior reads

The Trouble with Distributed Systems

Monotonic Versus Time-of-Day Clocks
confidence interval
Google’s TrueTime API in Spanner:[earliest, latest]

Knowledge, Truth, and Lies
The Truth Is Defined by the Majority: but a quorum of quorums, doesn't that lead to dictatorship?
Fencing tokens (sketch after this list)
Byzantine Faults
hardware errors caused by cosmic radiation? a single flipped bit?
Most Byzantine fault-tolerant algorithms require a supermajority of more than 2/3 of the nodes to be functioning correctly
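
A fencing-token sketch (class names mine): the lock service hands out a monotonically increasing token with every grant, and the storage service rejects writes carrying a token older than one it has already seen, so a paused client holding an expired lease cannot clobber newer writes.

    class LockService:
        def __init__(self):
            self.token = 0

        def acquire(self):
            self.token += 1               # every grant gets a strictly larger token
            return self.token

    class Storage:
        def __init__(self):
            self.max_token = 0
            self.data = None

        def write(self, value, token):
            if token < self.max_token:
                raise PermissionError(f"stale fencing token {token}")
            self.max_token = token
            self.data = value

    locks, storage = LockService(), Storage()
    t1 = locks.acquire()                  # client 1 gets token 1, then pauses (GC, ...)
    t2 = locks.acquire()                  # lease expires; client 2 gets token 2
    storage.write("from client 2", t2)
    try:
        storage.write("from client 1", t1)
    except PermissionError as e:
        print(e)                          # stale fencing token 1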

system models (models of faults), e.g. for timing:
Synchronous model: bounded delays? not very realistic
Partially synchronous model
Asynchronous model

Safety and liveness

Consistency and Consensus

stronger consistency: worse performance or less fault tolerance
Linearizability
linearizability is a recency guarantee (once a read of x returns the new value, all subsequent reads must also return the new value)
cas: update x to new only if its current value is old, otherwise fail
vs Serializability
serializability is an isolation property of transactions, whereas linearizability is a recency guarantee on reads and writes of a register; it does not prevent write skew
serializable snapshot isolation (SSI) is not linearizable

Uniqueness constraints
quote: Strictly speaking, ZooKeeper and etcd provide linearizable writes, but reads may be stale ...

Implementations:
Single-leader replication (potentially linearizable)
Consensus algorithms (linearizable)
Multi-leader replication (not linearizable)
Leaderless replication (probably not linearizable)

CAP
Consistency, Availability, Partition tolerance: pick 2 out of 3
a network partition is a kind of fault; it isn't something you can choose to opt out of (there is no choice to make)

Ordering Guarantees
If a system obeys the ordering imposed by causality, we say that it is causally consistent.
linearizability is a total order, whereas causality is only a partial order (could there be something that is "linearizable + horizontally scalable"?)
Sequence Number Ordering
without a single leader: noncausal sequence number generators
generating globally unique ids is feasible (preallocate blocks of ids; or use a high-resolution global clock); drawback: not consistent with causality

Lamport timestamps: (counter, node ID) (sketch below)
every node and every client keeps track of the maximum counter value it has seen so far, and includes that maximum on every request.
Lamport timestamps are sometimes confused with version vectors (but Lamport timestamps enforce a total ordering, while version vectors can only detect concurrency)
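
A minimal Lamport-clock sketch (class name mine): stamp events with (counter, node id) and fast-forward the local counter whenever a larger one is observed; sorting the stamps then yields a total order consistent with causality.

    class LamportClock:
        def __init__(self, node_id):
            self.counter = 0
            self.node_id = node_id

        def tick(self):                   # local event or message send
            self.counter += 1
            return (self.counter, self.node_id)

        def observe(self, stamp):         # on receiving a message or response
            self.counter = max(self.counter, stamp[0])

    a, b = LamportClock("A"), LamportClock("B")
    s1 = a.tick()             # (1, 'A')
    b.observe(s1)             # B learns A's counter
    s2 = b.tick()             # (2, 'B'): causally after s1, and ordered after it
    s3 = a.tick()             # (2, 'A'): concurrent with s2; tie broken by node id
    print(sorted([s1, s2, s3]))   # [(1, 'A'), (2, 'A'), (2, 'B')]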

Timestamp ordering is not sufficient (two users registering concurrently to grab the same userid? a slightly contrived problem)
This idea of knowing when your total order is finalized is captured in the topic of "total order broadcast". Interesting; how have I not heard of this?

Total order broadcast (atomic broadcast)
the algorithm must guarantee two safety properties:
Reliable delivery (no message is lost; every node receives every message)
Totally ordered delivery (every node receives the messages in the same order); drawback: the order cannot be changed retroactively??

Implementing linearizable storage using total order broadcast
implementing a linearizable CAS: ... Read the log, and wait for the message you appended to be delivered back to you (??? the description here is a bit vague; sketch below)
footnote: If you don't wait, but acknowledge the write immediately after it has been enqueued, you get something similar to the memory consistency model of multi-core x86 processors [43]. That model is neither linearizable nor sequentially consistent.
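
My reading of that recipe as a sketch, using the book's username-uniqueness example (the list stands in for a totally ordered, reliably delivered log): append a claim, then consume the log in delivery order; your claim succeeded iff it is the first claim for that name by the time your own message is delivered back to you.

    log = []                              # stands in for a total order broadcast log

    def append(msg):
        log.append(msg)                   # TOB: delivered to everyone in this order

    def claim(username, client):
        append(("claim", username, client))
        owner = None
        for op, name, who in log:         # replay in delivery order
            if op == "claim" and name == username:
                if owner is None:
                    owner = who           # the first claim in the total order wins
                if who == client:
                    return owner == client  # our message came back: did we win?

    append(("claim", "alice", "client-1"))  # a concurrent claim, earlier in the log
    print(claim("alice", "client-2"))       # False: client-1's claim is first
    print(claim("bob", "client-2"))         # True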

Distributed Transactions and Consensus
FLP: The Impossibility of Consensus (in the asynchronous system model)
2PC and atomic commit
coordinator: prepare/commit
A system of promises: a bit more detail
a globally unique transaction ID
every participant's reads and writes are tagged with this transaction id
By replying "yes" to the coordinator, the node promises to commit the transaction without error if requested. In other words, the participant surrenders the right to abort the transaction, but without actually committing it (but is this promise reliable? what about a hardware failure caused by some unavoidable external disaster??)
Once the coordinator's decision has been written to disk, the commit or abort request is sent to all participants. If this request fails or times out, the coordinator must retry forever until it succeeds. (can this even be called an algorithm??? come on)

3PC: blocking --> nonblocking
! 3PC assumes a network with bounded delay and nodes with bounded response times
it requires a perfect failure detector (but a reliable one is basically impossible, since in a distributed system a network timeout is indistinguishable from a node failure)

Distributed transactions in practice
internal (all nodes run the same protocol) vs heterogeneous
Exactly-once message processing
XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator.
so receiving the prepare notification amounts to registering a callback on the XA driver's C API??? ugh

Holding locks while in doubt
Recovering from coordinator failure (if the commit log is lost, commit/abort can no longer be decided)
The only way out is for an administrator to manually decide whether to commit or roll back the transactions. (manual intervention)

Fault-Tolerant Consensus
properties the algorithm must satisfy:
Uniform agreement: no two nodes decide differently.
Integrity: no node decides twice.
Validity: if a node decides value v, then v was proposed by some node.
Termination: every node that does not crash eventually decides some value.

The best-known fault-tolerant consensus algorithms: Viewstamped Replication (VSR), Paxos, Raft, and Zab
Epoch numbering and quorums (within each epoch the leader is unique, but different epochs need not have the same leader)
... can't decide by itself; instead, it must collect votes from a quorum of nodes ...
two rounds of voting: once to choose a leader, and once to vote on the leader's proposal.

Limitations of consensus
... designing algorithms that are more robust to unreliable networks is still an open research problem

Membership and Coordination Services
HBase, Hadoop YARN, OpenStack Nova, and Kafka all rely on ZooKeeper running in the background.
ZooKeeper is modeled after Google's Chubby lock service, implementing not only total order broadcast but also:
Linearizable atomic operations
Total ordering of operations(zxid)
Failure detection
Change notifications

ZooKeeper, etcd, and Consul are also often used for service discovery
read-only caching replicas: they only cache the results of the consensus algorithm and do not vote

Note: this chapter does not describe the details of any consensus algorithm (such as Paxos or Raft)

Batch Processing

MapReduce workflows (chained via hard-coded HDFS output paths?)
a single MapReduce job cannot solve the Top-N problem efficiently?

Reduce-Side Joins and Grouping
Sort-merge joins
Bringing related data together in the same place
GROUP BY (aggregation queries)
Handling skew
the hot-keys problem:
skewed join in Pig (a sampling pre-pass finds the hot keys? records joining a hot key must be replicated to every reducer handling it)
sharded join in Crunch: similar, but the hot keys must be specified explicitly rather than sampled (?)
Hive: map-side join for the hot keys

Map-Side Joins
Broadcast hash joins: joining a large dataset with a small one (the small side fits entirely in an in-memory hash table) (sketch below)
Partitioned hash joins (if both inputs are partitioned by the join key in the same way)
Map-side merge joins
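
A broadcast hash join sketch (example data hypothetical): the small side is loaded whole into an in-memory dict on each mapper, and the large side streams past it, so no sorting or shuffling is needed.

    user_ages = {"u1": 33, "u2": 25}          # small side: fits in memory

    activity_events = [                       # large side: streamed record by record
        {"user": "u1", "url": "/a"},
        {"user": "u2", "url": "/b"},
        {"user": "u1", "url": "/c"},
    ]

    def broadcast_hash_join(events, lookup):
        for event in events:                  # one pass over the big input
            age = lookup.get(event["user"])   # hash lookup instead of a shuffle
            if age is not None:
                yield {**event, "age": age}

    for joined in broadcast_hash_join(activity_events, user_ages):
        print(joined)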

The Output of Batch Workflows
Key-value stores as batch process output
immutable input:buggy code can be corrected and re-run
using more structured file formats: Avro, Parquet

Hadoop vs MPP databases
advantage: data can be loaded into HDFS quickly and interpreted later... (MPP systems require structured modeling up front)
overcommitting resources is allowed
At Google, a MapReduce task that runs for an hour has an approximately 5% risk of being terminated to make space for a higher-priority process

Beyond MapReduce
"simple abstraction on top of a distributed filesystem": easy to understand, but actually implementing jobs on it is not simple (MapReduce is a very low-level API)
Materialization of Intermediate State
drawbacks of MapReduce's materialization of intermediate state (vs Unix pipes): omitted
Dataflow engines: Spark, Tez, Flink
avoid expensive sorting where it isn't needed?
*JVM processes can be reused (in Hadoop, every task starts a fresh JVM)

Fault tolerance in dataflow systems:
how to describe the lineage of a piece of data? (Spark RDDs)
make operators deterministic

Graphs and Iterative Processing
PageRank
the Pregel processing model: bulk synchronous parallel (BSP); vertices send messages to one another (sketch below)
The Pregel model guarantees that all messages sent in one iteration are delivered in the next iteration; the prior iteration must completely finish, and all of its messages must be copied over the network, before the next one can start.
Fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration
problem: the communication overhead between machines is considerable
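
A toy BSP loop in the Pregel style (not Pregel's actual API; propagating the minimum vertex id is a classic example): each superstep consumes the previous superstep's messages, updates vertex state, and emits messages that are delivered only after the barrier.

    graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}   # adjacency lists
    state = {v: v for v in graph}                     # each vertex starts with its own id

    inbox = {v: [] for v in graph}
    for v in graph:                                   # superstep 0: announce own id
        for n in graph[v]:
            inbox[n].append(state[v])

    while any(inbox.values()):                        # run until no messages flow
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if messages and min(messages) < state[vertex]:
                state[vertex] = min(messages)         # vertex state update
                for neighbor in graph[vertex]:        # delivered next superstep
                    outbox[neighbor].append(state[vertex])
        inbox = outbox                                # the barrier between supersteps

    print(state)    # {1: 1, 2: 1, 3: 1, 4: 1}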

High-Level APIs and Languages
code generation: Spark generates JVM bytecode and Impala uses LLVM to generate native code for these inner loops
Specialization for different domains: Mahout; spatial nearest-neighbor (kNN) search

Stream Processing

Transmitting Event Streams
Messaging Systems: pub/sub
Message brokers (message queues): allow a consumer to be temporarily offline
Multiple consumers: fan-out
Acknowledgments and redelivery
the combination of load balancing with redelivery inevitably leads to messages being reordered

Partitioned Logs
log-based message brokers: Apache Kafka
a typical large disk: 6TB at a sustained write throughput of 150MB/s takes about 11 hours to fill

Databases and Streams
Change Data Capture (CDC): capture the changes made to a database and emit them as a stream (requires parsing the redo log/WAL)
Initial snapshot
Log compaction
API support for change streams

Event Sourcing (from the DDD community): modeling at a higher level?
Deriving current state from the event log
Commands and events

State, Streams, and Immutability
separating the form of reads from the form of writes: command query responsibility segregation (CQRS)
append another event to the log and pretend the data has been "deleted"? hah

Processing Streams
Complex event processing (CEP): the queries (declarative pattern searches) become long-lived
Stream analytics: windows, probabilistic algorithms
Maintaining materialized views
Search on streams
vs other messaging/RPC (e.g. the actor model)
a hybrid: Apache Storm's distributed RPC feature

Event time versus processing time
Types of windows (the discussion here feels a bit dull; sketch after this list)
Tumbling window
Hopping window: allows overlap
Sliding window
Session window
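
A sketch of window assignment by event time (function names mine): a tumbling window maps an event to exactly one fixed bucket, while a hopping window of the same size with a smaller hop maps it to every overlapping bucket.

    def tumbling_window(ts, size):
        start = ts - ts % size
        return [(start, start + size)]

    def hopping_windows(ts, size, hop):
        windows = []
        start = (ts - ts % hop) - size + hop   # earliest window containing ts
        while start <= ts:
            if ts < start + size:              # guard; always true by construction
                windows.append((start, start + size))
            start += hop
        return windows

    ts = 425                                   # event time, in seconds
    print(tumbling_window(ts, 60))             # [(420, 480)]
    print(hopping_windows(ts, 300, 60))        # five 5-minute windows, hopping by 1 minute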

Stream Joins (this discussion also feels a bit dull)
Stream-stream join (window join): omitted
Stream-table join (stream enrichment)
Table-table join (materialized view maintenance)

Time-dependence of joins

Fault Tolerance
Microbatching and checkpointing
Atomic commit revisited(‘exactly-once’)
Idempotence
Rebuilding state after a failure

much of this part feels rather hand-wavy; probably worth a careful look at Kafka's actual design??

The Future of Data Systems

Data Integration
Schema Migrations on Railways (gradual migration!)
The lambda architecture: immutable append-only data + stream + batch?

Unbundling Databases
Transactions within a single storage or stream processing system are feasible, but when data crosses the boundary between different technologies, I believe that an asynchronous event log with idempotent writes is a much more robust and practical approach
Observing Derived State
Taken together, the write path and the read path encompass the whole journey of the data ...
Materialized views and caching
Stateful, offline-capable clients
offline-first
In particular, we can think of the on-device state as a cache of state on the server. The pixels on the screen are a materialized view onto model objects in the client app; the model objects are a local replica of state in a remote datacenter

End-to-end event streams
Reads are events too (has a decentralized flavor? one of the author's flights of fancy?)
Multi-partition data processing

Aiming for Correctness
"exactly once", duplicate suppression, operation identifiers
The end-to-end argument (is an end-to-end transaction ID needed?)
Applying end-to-end thinking in data systems
the author says nothing concrete here. Is re-exploring the "transaction" abstraction the right direction?

Enforcing Constraints
multi-partition atomic commit is not needed: an event-based log where each event carries a "Request ID" suffices (traceable & auditable) (sketch below)
note: event logs are in fact used in distributed, P2P, decentralized applications such as Bitcoin (although a request-ID mechanism is not prominent in Bitcoin's transaction model, since the blockchain imposes a globally linearized order...)
another question: who assigns the Request ID? it should be the app client side, not the server front end (isn't this just DDD)
in fact, an end-to-end Request ID isn't strictly required, as long as every domain translation (a DDD term) maps the domain object ids correctly
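
A sketch of request-ID duplicate suppression (all names hypothetical): the processor remembers which request ids it has already applied, so redelivered events are ignored and at-least-once delivery becomes effectively idempotent.

    processed_ids = set()
    balance = {"alice": 100}

    def apply_event(event):
        if event["request_id"] in processed_ids:
            return "duplicate, ignored"         # safe under at-least-once redelivery
        processed_ids.add(event["request_id"])  # in practice, atomically with the effect
        balance[event["account"]] += event["amount"]
        return "applied"

    event = {"request_id": "c1-7f3a", "account": "alice", "amount": -30}
    print(apply_event(event))    # applied
    print(apply_event(event))    # duplicate, ignored (e.g. a retry after a timeout)
    print(balance)               # {'alice': 70}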

Consistency: Timeliness and Integrity
Timeliness means ensuring that users observe the system in an up-to-date state.
Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data
event-based dataflow systems decouple the two? (since event processing is asynchronous, the sender does not immediately see the result)
Passing a client-generated request ID through all these levels of processing, enabling end-to-end duplicate suppression and idempotence

Similarly, many airlines overbook airplanes in the expectation that some passengers will miss their flight, ... (the business model sometimes deliberately breaks data-model consistency, introducing compensating transactions or manual intervention outside the system)
apology

Coordination-avoiding data systems
deployable across datacenters in a multi-leader, asynchronously replicated configuration; consistency can always be temporarily violated and repaired later (the end-to-end traceable log makes this possible)

Trust, but Verify
the "system model": computers don't make mistakes, data fsync'd to disk is never lost, ...
but: hardware bit-flip errors, "rowhammer" attacks

If you want to be sure that your data is still there, you have to actually read it and check.
Designing for auditability
Tools for auditable data systems
use cryptographic tools
(the author's view) proof of work is wasteful?
Merkle trees (sketch after this list)
certificate transparency
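
A minimal Merkle-root sketch: hash the records, then repeatedly hash pairs upward; the root commits to the entire dataset, and two replicas whose roots differ can walk down the subtree hashes to locate the differing records in O(log n) comparisons.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(records):
        level = [h(r.encode()) for r in records]    # leaf hashes
        while len(level) > 1:
            if len(level) % 2:                      # duplicate the last node if odd
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    a = merkle_root(["r1", "r2", "r3", "r4"])
    b = merkle_root(["r1", "r2", "rX", "r4"])       # one corrupted record
    print(a.hex() != b.hex())                        # True: the roots differ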

Doing the Right Thing
A technology is not good or bad in itself—what matters is how it is used and how it affects people.
e.g.: Predictive Analytics
Bias and discrimination
Responsibility and accountability
Feedback loops

Privacy and Tracking (note: especially in medical systems)
Surveillance
Consent and freedom of choice (de facto mandatory)
Having privacy does not mean keeping everything secret; it means having the freedom to choose which things to reveal to whom, what to make public, and what to keep secret.
note: the essence of the privacy problem is that once data is stored, it can in theory be kept forever (barring hardware failure), whereas human memory can selectively forget

Legislation and self-regulation