Book - Designing Data-Intensive Applications
- Author: Martin Kleppmann
- Reliability
- Two ways things can go wrong
- A fault is when a component inside the system fails to function properly.
- A failure is when the system as a whole breaks, as a result of one or more faults.
- Prevent faults when you can, but it's not possible to prevent 100% of faults.
- The big idea of reliability is to keep faults from turning into failures.
- The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason.
- Chaos Monkey-style testing helps here. As with backups, you only know whether your error handling really works if you actually test it. (A fault-injection sketch follows below.)
- Monitoring can help a lot to catch small problems before they balloon into big problems.
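A minimal sketch of this kind of deliberate fault injection. The decorator name and failure rate are made up for illustration; the point is just to wrap a call so it fails at random, forcing the surrounding error handling to be exercised in tests:

```python
import functools
import random

def inject_faults(failure_rate=0.01):
    """Wrap a function so it raises an injected error at the given rate."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                # simulate a flaky dependency so callers must handle it
                raise ConnectionError("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(failure_rate=0.1)
def fetch_profile(user_id):
    # stand-in for a real call to a downstream service
    return {"id": user_id}
```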
- Scalability
- To talk about scalability we have to think about what we are scaling and what the load on the system actually comprises.
- Describing Load
- Is it read-heavy? Write-heavy? Are there a lot of active simultaneous users?
- What kind of SLAs do we have?
- Both internal and external/contractual
- Do we care more about the average case or the tails?
- How do changes that we make affect the different types of load in the system?
- The nature of the load and what we want to optimize for makes a big difference in the design of the system.
- In extreme cases, a hybrid design may even work best.
- Twitter uses an approach in which each user has a timeline cache: when someone tweets, the tweet is replicated out to the timeline caches of all their followers at the time the tweet is made, even if a follower isn't currently using Twitter.
- But this falls apart at the top end, where some users have tens of millions of followers. For those specific users Twitter uses a read-based approach instead: their tweets are fetched when a follower actually opens Twitter and merged into the rest of the cached timeline. (A rough sketch of the hybrid follows.)
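Roughly, the hybrid looks like this. Everything below (the names, the follower threshold, the data structures) is a made-up illustration of the idea, not Twitter's actual implementation:

```python
from collections import defaultdict

FANOUT_LIMIT = 10_000            # made-up threshold for "celebrity" accounts

timelines = defaultdict(list)    # follower id -> cached timeline entries
celebrity_tweets = defaultdict(list)  # author id -> tweets merged at read time

def post_tweet(author, followers, tweet):
    if len(followers) <= FANOUT_LIMIT:
        # write path: push the tweet into every follower's timeline cache
        for follower in followers:
            timelines[follower].append(tweet)
    else:
        # celebrity path: defer the work to read time
        celebrity_tweets[author].append(tweet)

def read_timeline(user, followed_celebrities):
    # read path: cached entries plus a merge of any celebrity tweets
    merged = list(timelines[user])
    for celeb in followed_celebrities:
        merged.extend(celebrity_tweets[celeb])
    return merged
```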
- Describing Performance
- When you increase load, how is the performance of the system affected?
- When you increase load, how much do you need to increase the underlying resources (CPU, RAM, etc.) to keep performance the same?
- Which performance metric is relevant?
- Response time
- Latency to start serving a response
- Throughput of a stream
- Of course all these are distributions as well and our thinking needs to encompass that.
- Your best users may have the most data or the most complex queries, and thus experience the longest response times. This is an especially good reason to pay attention to the tail of the distribution.
- At some point the tail of the distribution is dominated by random events that you can't really control. AWS decided that optimizing for the 99.9th percentile to meet their response-time SLA was useful, but optimizing for the 99.99th percentile (1 in 10,000 requests) was not.
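Given raw response-time measurements, these percentiles are straightforward to compute. A small sketch, assuming numpy is available and using synthetic sample data:

```python
import numpy as np

# made-up latency sample; real data would come from your metrics pipeline
response_times_ms = np.random.lognormal(mean=3.0, sigma=0.8, size=100_000)

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(response_times_ms, p):.1f} ms")
```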
- Head-of-Line Blocking
- When slow requests occupy the server, faster requests queued behind them end up taking a long time as well, even though they are cheap to process themselves. (A toy simulation follows.)
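A toy single-worker simulation makes the effect visible; the service times are invented. The 10 ms requests all report multi-second response times because they are stuck behind the 5 s request at the head of the line:

```python
# one slow request followed by fast ones, all arriving at t=0 and served
# by a single worker in FIFO order
service_times = [5.0, 0.01, 0.01, 0.01]

clock = 0.0
for i, service in enumerate(service_times):
    waited = clock            # time spent queued behind earlier requests
    clock += service
    print(f"request {i}: queued {waited:.2f}s, response time {clock:.2f}s")
```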
- Dealing with Load
- Each factor of 10 increase in load tends to require a change in architecture.
- The architecture depends a lot on how the load is distributed, read-heavy vs. write-heavy, OLAP vs OLTP, and so on.
- Maintainability
- Operability — Keeping things running smoothly shouldn’t be a chore.
- Monitoring
- Tracking down problems
- Updating systems software, OS, etc.
- How do systems affect and depend on each other?
- Anticipating future problems, and having the visibility to do so. (Scaling up servers when load is getting high, but before it becomes a problem.)
- Good practices for deployment, configuration management, etc.
- Rearranging how applications are deployed in the system and other complex tasks
- Avoiding dependency on individual machines
- Maintaining security as things change
- Defining processes for all the above
- Writing knowledge down
- Simplicity — Easy for new engineers to understand the system and its components.
- Sources of complexity
- Explosion of the state space
- Tightly-coupled modules
- Tangled dependencies
- Inconsistent naming
- Performance problems solved via hacks that never go away
- Special cases to work around issues in other modules
- Simplicity makes it easier to write and deploy performant, correct software.
- Good abstractions enable simplicity of the layers above. Bad, leaky abstractions can hurt more than they help.
- Evolvability — Easy to make changes in the future and adapt to new requirements.
- This is closely linked to the simplicity of the system. Simpler systems are easier to change.
- Highlights
- Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. (Location 253)
- Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. (Location 260)
- there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (Location 294)
- The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason (Location 316)
- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right. (Location 331)
- Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue. (Location 345)
- You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly. (Location 2063)
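A sketch of such a sparse index, assuming a segment file sorted by key; the keys and byte offsets here are made up:

```python
import bisect

# one (key, byte offset) entry for every few kilobytes of the segment file
sparse_index = [("apple", 0), ("mango", 4096), ("tango", 8192)]
index_keys = [k for k, _ in sparse_index]

def scan_start_offset(key):
    # find the last indexed key <= the target, then scan forward from there
    i = bisect.bisect_right(index_keys, key) - 1
    return sparse_index[max(i, 0)][1]

print(scan_start_offset("peach"))  # 4096: scan the segment from "mango" onward
```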
- In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space. (Location 2129)
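In both schemes the core merge rule is the same: when segments are combined, the entry from the newer segment wins. A compaction sketch under that rule, modeling segments as plain dicts rather than real sorted SSTable files:

```python
def compact(segments_newest_first):
    """Merge segments into one, newer entries winning on key collisions."""
    merged = {}
    for segment in reversed(segments_newest_first):  # apply oldest first
        merged.update(segment)                       # newer values overwrite older
    return dict(sorted(merged.items()))              # output stays sorted by key

# the value for "a" from the newer segment wins
print(compact([{"a": 2}, {"a": 1, "b": 1}]))  # {'a': 2, 'b': 1}
```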
- It is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. (Location 2190)
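A minimal sketch of the WAL discipline (the file name and record format are made up): append and fsync the log record, and only then touch the page, so a crash can be recovered by replaying the log:

```python
import json
import os

def apply_with_wal(pages, page_id, new_value, wal_path="btree.wal"):
    record = json.dumps({"page": page_id, "value": new_value})
    with open(wal_path, "a") as wal:
        wal.write(record + "\n")
        wal.flush()
        os.fsync(wal.fileno())   # the change is durable in the log first
    pages[page_id] = new_value   # only now is the page itself modified
```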
- Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme [21]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in “Snapshot Isolation and Repeatable Read” (Location 2203)
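An in-memory sketch of the copy-on-write idea, far simpler than LMDB's actual page management: instead of mutating a page, build a new copy and a new parent pointing at it, leaving the old version intact for concurrent readers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    keys: tuple
    children: tuple = ()   # empty for leaf pages

def cow_replace_child(root, child_index, new_child):
    # write a fresh parent pointing at the new child; the old root and
    # old child are untouched, so readers holding the old root still
    # see a consistent version of the tree
    children = list(root.children)
    children[child_index] = new_child
    return Node(keys=root.keys, children=tuple(children))
```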
- LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree (Location 2245)
- A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. (Location 2258)
- The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance”) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable (Location 2261)
- Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation (Location 2270)
- An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree (Location 2273)
- A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table’s columns within the index (Location 2316)
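SQLite gives a concrete way to see this. It has no INCLUDE clause, but a composite index whose columns satisfy the entire query acts as a covering index, and EXPLAIN QUERY PLAN should report it as one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.execute("CREATE INDEX idx_email_name ON users (email, name)")

# the query needs only columns stored in the index, so the table itself
# is never touched
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM users WHERE email = ?", ("a@b.c",)
).fetchall()
print(plan)  # the plan should report a COVERING INDEX search
```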
Thanks for reading! If you have any questions or comments, please send me a note on Twitter. And if you enjoyed this, I also have a newsletter where I send out interesting things I read and the occasional nature photo.