Big Data Storage Course Notes

Chapter 1


Horizontal scaling, extend horizontally; use more nodes to support a larger volume of requests. Vertical scaling, extend vertically; expand the capabilities of a single node to support a larger volume of requests. Characteristics of big data: volume, velocity, variety, value.

The demand for horizontal scaling, system reliability and availability, and consistency requirements cannot be effectively resolved under the traditional relational model.

What kind of storage does big data need?

For big data storage in a cluster system, it needs to:

  1. Be able to manage, schedule, and monitor computational and storage resources within the cluster in a unified manner
  2. Be able to disperse and manage the data within the cluster in a unified manner
  3. Computers within the cluster can work together on a task, dividing the work cooperatively, achieving load balancing
  4. When a computer in the cluster fails, the cluster can ensure the effectiveness of its functionality and that no data is lost (partition fault tolerance)
  5. The cluster can be deployed, expanded, and have failing nodes replaced in a simple way (scalability)

Technology classification

  • Divided by metadata management approach: Peer node strategy, due to the absence of a central node, offers higher availability. Central node strategy, offers better scalability.
  • Divided by data model: Databases for different business models with different data models
  • Distributed architecture Partition All - Database and table partitioning Partition Engine - Share Nothing Partition Storage - Share Storage, separation of storage and computation

NoSQL and newSQL

NoSQL is a non-relational database, mainly used to solve the scalability issues of SQL. It does not guarantee ACID properties, lacks transaction management, and can be horizontally scaled without extra operations;

NewSQL is a relational database, combining the massive storage management capabilities of NoSQL databases with the ACID features and SQL convenience of relational databases.

Chapter 2

Hierarchical architecture based on C/S

AP and DP

AP stands for Application Processor, which is a software for data processing, capable of forwarding user requests and data to CM.
DP stands for Data Processor, responsible for data management, similar to a centralized database management system, receiving commands and data from CM and providing feedback.

Changes in AP functionality

Centralized library - Does not store data, equivalent to input, all software runs on the main machine

Multiple clients/single server - Application, client services (query), communication

Multiple clients/multiple servers - Application, client services (catalog management, cache management, query processing, commit protocol), communication

Thin client/server - Client-side Server2Server, the above functions transferred to the server, only retaining the SQL interface and program interface; AP is related to computation: catalog management, cache management, query processing, transaction processing, lock management, access control. Additionally, the server side also has storage-related functions: log replay, fault recovery, index design, physical storage.

Three architectures

Personal understanding, correctness cannot be guaranteed. Reference article

  1. Three types of distributed architectures discussed based on the server-to-server thin client/server architecture
  2. server-search engine - query, optimization, extraction; concurrency control; transaction commit; engine-transaction part (for recovery, because it needs logs, is stateful), index, log, failure recovery, physical storage
  3. The definition of the three architectures
  4. The scalability and compatibility characteristics of each part

The essence of the three architectures is that newSQL wants to have both the strong consistency and transaction support of SQL and the easy scalability of NoSQL, but it falls into a tug-of-war between scalability and compatibility.

Hierarchical Structure

From top to bottom, they are Partition ALL (database and table partitioning), Partition Engine (share nothing), Partition Storage (separation of computing and storage); from top to bottom, scalability decreases in exchange for increased ecosystem compatibility.

Database and table partitioning have excellent scalability and high performance; however, they have a large business coupling, usually requiring design according to business scenarios, and users need to handle sharding strategies, distributed transactions, and distributed queries themselves, making it less universally applicable. Each node is a complete DBMS.

Share Nothing only partitions at the engine layer, with nodes being relatively independent. Compared to traditional database and table partitioning, it internalizes issues such as distributed transactions and distributed queries within the database, hiding details of distributed transactions from users and providing a unified database service, simplifying user usage.

Embarrassingly, most implementations of database and table partitioning also use middleware (data processing, data management, load balancing, driving databases) to hide the details of implementing distributed transactions, similarly adopting consistency protocols like Multi Paxos to ensure replica consistency, also providing users with unified database access. Thus, the strategic advantage of Partition Engine seems not so significant in comparison.

Continuing to move the partitioning boundary downward, to the lower layer of transactions and index systems. At this point, since Part 1 retains a complete transaction system, it is not stateless and usually keeps separate nodes for service handling. Thus, Part 1 mainly retains computational logic, while Part 2 deals with storage-related aspects like REDO, dirty flushing, and failure recovery. Therefore, this structure is what we often refer to as the compute-storage separation architecture, also known as the Share Storage architecture.

Because it retains a complete computing layer, it requires very little change perceived by users compared to traditional databases, allowing for greater ecosystem compatibility. At the same time, only the storage layer is partitioned and dispersed, making this solution less scalable than the two previously mentioned.

The Component Structure of DDBS


– Application Processor (AP) Functions:
User Interface: Verify user identity, accept user commands (such as SQL)

Semantic Data Controller: Some constraints (view management, security control, semantic integrity control)

Global Query Processor: Translates user commands into database commands; generates global query plans; collects partial query results and returns them to the user

Global Execution Monitor (Global Transaction Manager): Schedules and monitors AP and DP; ensures consistency of replicated data; ensures atomicity of global transactions

– Data Processor (DP) Functions:
Local Query Processor: Global commands → Local commands; selects the best access path for execution

Local Transaction Manager: Schedules execution on a per-local subtransaction basis

Local Scheduler Manager: Responsible for concurrent control at the local site

Local Recovery Manager: Maintains local database consistency for fault recovery

Storage Manager: Accesses database; controls database cache manager; returns local execution results

DDBS Schema Structure


  • Global External Schema (GES): The global external schema, or global user view, is the highest-level abstraction of the distributed database by global users.
  • Global Conceptual Schema (GCS): The global conceptual schema, or global conceptual view, is the overall abstraction of the distributed database, including all data characteristics and logical structure. The global conceptual schema is then mapped to local conceptual schemas through fragmentation and allocation patterns.
  • Fragmentation Schema is the view that describes the logical division of global data. That is, the global data logical structure is divided into local data logical structures based on certain conditions, with each division becoming a fragment. In relational databases, a subrelation of a relation is called a fragment of that relation.
  • Allocation Schema describes the local physical structure of the logical local data, that is, the physical allocation view of the fragments after division.
  • Local Conceptual Schema (LCS): The local conceptual schema is a subset of the global conceptual schema. It describes the logical structure of local data on a local site. When the global data model and local data model are different, it also involves data model transformation.
  • Local Internal Schema (LIS): Defines the local physical view, describing the physical database, similar to the internal layer of a centralized database.

DDBS Data Transparency

  • Fragmentation Transparency: Fragmentation involves dividing a relation into several subrelations, each called a fragment. The property that users do not need to consider to which fragment the data belongs is called fragmentation transparency. Located between the global conceptual schema and fragmentation schema.
  • Allocation Transparency: Distributed database supports controlled data redundancy, that is, data can be repetitively stored at different sites. The property that users do not need to consider the storage site of each fragment is called allocation transparency. Located between the fragmentation schema and allocation schema.
  • Local Mapping Transparency: The property that users do not need to consider the local storage form of data is called local mapping transparency. Located between the allocation schema and local conceptual schema.
select . from S --Fragmentation transparency
select . from S1 & S2 --Allocation transparency
select . from S1 at site1 --Local mapping transparency
Execute:$SUPIMS($SNO,$FOUND,$SNAME) at L1 --Opaque

MDBS V.S. DDBS 4 Points

Distributed Database Systems: Are designed from the top-down, allowing for flexible fragmentation and allocation design. However, distributed database systems have a limit on the number of database components, typically no more than a few dozen. Multi Database Integration Systems: Data and databases already exist, integrating local data at various sites from a bottom-up approach. Data integration systems, by restricting data management capabilities (support only reading), can expand the number of database components to hundreds. Both need to provide users with a unified data access environment, with data being scattered storage. The differences lie in:

  • Whether data schemas are predefined
  • Whether DBMSs are homogeneous
  • Whether query optimization strategies are generated automatically
  • Whether local users necessarily exist (MDBS does)

Chapter 3

3.1 Distributed Database Design (Fragmentation, Allocation, Replication)

3.1.1 Design Strategies and Steps

Top-down, requirement analysis->conceptual design->distribution design->physical design->performance tuning

3.1.2 Definition and Role of Fragmentation

Fragmentation: The logical division of global data. Allocation: The designation of storage sites for fragments, known as allocation. When a fragment is stored at more than one site, it is called data replication. If each fragment is stored at only one site, it is called data partition storage.

The Role of Sharding:

  • Reducing the amount of data transferred over the network
  • Increasing the locality of transaction processing
  • Improving the efficiency of data querying and system reliability
  • Balancing the load The sharding process involves logically dividing and physically allocating global data. The global data is defined by the sharding pattern into various data segments, which are stored at various sites as defined by the allocation pattern.

Sharding Principles: Completeness (no data loss), Reconstructability (no relationship loss), Non-overlapping (formal description)

Types of Sharding: Horizontal sharding (by tuple), Vertical sharding (by attribute), Hybrid sharding

3.1.3 Horizontal Sharding

Horizontal sharding: Selection

Derived formula: Semi-join

The design is based on the sharding requirement information, which comes from application factors and database factors

Design criteria: Define a set of simple predicates with completeness and minimality

3.1.4 Vertical Sharding

Sharding representation: Projection operation

Completeness proof: Union operation (attribute)

Reconstructability proof: Join operation

Non-overlapping proof: Intersection operation, the result is not empty, it is the primary key

3.1.5 Representation Methods for Sharding

Graphical representation (table) and tree representation



3.1.6 Allocation Design

The process of mapping fragments to physical site storage is called the allocation design process.

  • Non-replicated allocation If each fragment is stored at only one site, it is called partitioned distribution, and the corresponding distributed database is called fully partitioned distributed database.
  • Replicated allocation If each fragment has copies at every site, it is called full replication allocation, and the corresponding distributed database is called fully replicated distributed database. If each fragment has copies only at some sites, it is called partial replication allocation, and the corresponding distributed database is called partially replicated distributed database.

3.1.7 Data Replication Techniques

Synchronous replication and asynchronous replication; Master-slave replication and peer-to-peer replication;

3.2 Distributed Query Optimization

(Reflecting key steps, unfolding fragments, etc., ∪ is a binary operation, circle empty)

3.2.1 The Significance of Query Optimization

3.2.2 Query Processor

3.2.3 Query Decomposition

Decompose calculus queries into algebraic queries based on the global conceptual schema. Obtain a global logical query plan tree. The following five steps:

  1. Query normalization (laws of commutation, association, distribution)
  2. Syntax and semantic analysis (syntax errors, meaningless queries, no permissions, via query graph)
  3. Query reduction
  4. Query rewriting

3.2.4 Data Localization

Decompose global tables into local tables using AND and JOIN operations First draw the global tree, optimize the global tree, convert it into a fragment query tree, Place selection and join operations on hold in a timely manner, and move ∞ before the execution of ∪ (using the distributive law)

3.2.5 Optimization of Fragment Queries

3.3 Distributed Access Optimization

3.3.1 Basic Concepts

3.3.2 Theoretical Basis for Optimization

Basis of a Relation: Refers to the number of tuples contained in relation R, denoted as Card(R)

  • Length of an Attribute: Refers to the number of bytes that define the value of attribute A, denoted as Length(A)
  • Length of a Tuple: The number of bytes per tuple in relation R, denoted as Length(R), Length(R)=∑Length(Ai)
  • Size of a Relation: The total number of bytes contained in relation R, denoted as Size(R) Size(R)=Card(R)×Length(R)
  • Distinctiveness of an Attribute: Refers to the number of distinct attribute values of attribute A in relation R, denoted as Val(A)
  • Maximum and Minimum Values of Attribute A: Denoted as Max(A) and Min(A)

Selection Operation: Cardinality: Card(S)=ρ Card(R)* ρ calculation: Only considers the case of equality conditions for selection attribute A, where A is an attribute of R, and X is a constant. Then, $\rho = \frac{1}{Val(R,A)}$

Calculation of Val(S,B): When attribute B belongs to the selection condition, Val(S,B)=1 When attribute B is a key (primary key), Val(S,B)=ρ×Val(R,B) When attribute B does not belong to the selection predicates, $$Val(S,B)=\begin{cases} Card(S), \quad if \quad Card(S) <= \frac{Val(R,B)}{2} \ \frac{Card(S)+Val(R,B)}{3}, \ Val(R,B), \quad if \quad Card(S) \ge 2Val(R,B) \end{cases}$$

Join Operation: Image

Semi-Join Operation: 202311022155

3.3.3 Optimization Methods for Semi-Join

202311022156 It’s about evaluating if going such a circuit, the cost of semi-join is more or less than that of a full join.

Homework 1 - Fragmentation Design

There is a shopping system containing two global relations: USER table (UID, UNAME, ADDRESS, HOBBY, CITY) and ORDER table (UID, PID, PRICE), where UID is the user number, UNAME is the user’s name, CITY is the user’s city. PID is the product number, and PRICE is the total order price. UID is a primary key in the USER table and a foreign key in the ORDER table. To establish a distributed database, the rules for fragmentation are:

(1) The USER relation is vertically fragmented by sensitivity of the attributes U1 contains non-sensitive attributes: UID, UNAME, CITY U2 contains sensitive attributes: ADDRESS, HOBBY

(2) All non-sensitive attributes in USER are further horizontally fragmented based on CITY U11: CITY IN {Beijing, Shanghai, Guangzhou, Shenzhen} U12: CITY NOT IN {Beijing, Shanghai, Guangzhou, Shenzhen}

(3) The ORDER relation is fragmented based on its join relationship with USER, resulting in O1 and O2.

Homework 2 - Query Optimization

Query Q: “Find all orders for users in Xuzhou city buying product ‘P1’, obtaining the user number, user name, product number, and total price of the order.”

(1) Write the relational algebra expression for query Q and transform it into a fragment query

(2) Optimize the fragment query tree

Homework 3 - Access Optimization

Refer to the diagram below


Chapter 4 HBase

Problems with HDFS

  1. Doesn’t support random rewriting of data
  2. HDFS lacks the concept of data packets
  3. HDFS is unable to perform quick operations for common data query functions such as row count statistics, filtering scans, etc., generally requiring the use of Mapreduce.
  4. (Advantages include large file storage, multiple replicas, automatic partitioning)

Features of HBase

  1. Uses HDFS for underlying storage, but file structure and metadata are maintained by itself.
  2. Adopts a storage model oriented towards columns + key-value pairs
  3. Enables convenient horizontal expansion
  4. Can implement automatic data fragmentation
  5. Relatively strict read-write consistency and automatic fault transfer
  6. Supports full-text search and filtering The advantage lies in handling a large amount of data input, high performance, high reliability, scalability; the downside is the lack of support for associative query analysis of tables.


The Region server is a container for Regions; a Region is a part of the table’s data, akin to a shard in a relational database. A table might be stored across different Regions. Characteristics:

  1. A Region cannot span servers, a Region Server might host one or more Regions;
  2. As data volume increases, a Region might split;
  3. For the purpose of load-balancing, Regions might migrate;
  4. All data access operations in a Region are implemented through the HDFS client interface.

How to understand the statement that data from different rows of the same table can be stored in different servers, and data from the same row of the same table can also be stored in different servers? (I don’t understand, I think there is a problem with the latter part of the statement.)

A server is a storage structure for a Region, but storing a Region doesn’t mean storing a whole table; each Region contains several Stores, a Store is a family of columns, stored as objects, not necessarily as an entire table, possibly as shards of different tables.

Write-Ahead Logging (WAL): First written to the WAL (one WAL per Regionserver), then loaded into memStore;

inside each region, there are multiple store instances, each corresponding to a column family; each store has one memStore instance, when the memStore is full, a new HFile is generated on HDFS (using an LSM tree for storage, which is quick-sorted before the final flush to turn random writes into sequential storage, improving read efficiency); memStore serves as a cache in memory, and both reads and writes first consult the memStore.


CRUD Operations

  • Adding a new cell in HBase results in adding a new data entry in HDFS.
  • Modifying a cell in HBase results in adding a new data entry in HDFS with a version number higher than the previous one.
  • Deleting a cell in HBase results in adding a new data entry in HDFS without a value, marked as a Delete type, known as a Tombstone marker. These records are actually deleted during the merging of HFile.

Read and Write Process

Reference article: HBase Read and Write Process

Zookeeper(ROOT) -> RegionServer(META) -> Region -> memStore


  1. The client accesses the /hbase/meta-region-server node on Zookeeper to find out which RegionServer has the hbase:meta table.

  2. The client connects to the RegionServer that contains the hbase: meta table. The hbase: meta table stores the row key range information for all Regions, allowing the client to find the Region where the requested row key resides, and the RegionServer it is on.

  3. On the corresponding RegionServer, the client searches for the required information in MemStore, then in HFile.

  4. After the first access, the client caches the meta information (BlockCache), so the next operations directly look up the meta information from the BlockCache. image.png

  5. The client accesses the /hbase/meta-region-server node on Zookeeper to find out which RegionServer has the hbase:meta table.

  6. The client connects to the RegionServer that contains the hbase: meta table. The hbase: meta table stores the row key range information for all Regions, enabling the lookup of the Region containing the requested row key, and the corresponding RegionServer.

  7. On the specified RegionServer, the client writes data to both Hlog and memstore.

  8. When memstore reaches a threshold, it flushes the data into an HFile. When compacting occurs, it gradually forms larger HFiles until triggering a split, dividing the current HFile into two. This effectively splits a large region into two smaller ones.

  9. If there is data loss in the MemStore, it can be recovered from the HLog. When multiple HFile files reach a certain size, it triggers the Compact operation, merging them into a single HFile. This process also involves version merging and data deletion.

Rowkey Design

  • Three principles: Length principle (shorter is better), Hash principle (even data distribution), and Uniqueness principle.
  • Salting, allocating random numbers before the rowkey; a random prefix allows them to be distributed into different Regions.
  • Pre-splitting, addresses hot-spot problems and split-merge issues caused by automatic region splitting (also allows for future expansion); for example, generating random numbers from 0-499, defining ranges for Regions like 0-50, 50-100, etc.
  • Hashing, addresses the need for data from the same user on the same day to be stored together; hash a particular parameter (e.g., uid, date), take the result modulo 500, and prepend the remainder. This can be combined with pre-splitting to distribute data evenly across RegionServers while meeting specific needs.
  • Reversal, sacrifices the orderliness of rowkey for randomness; solves problems with hotspot issues where the beginning is fixed and the end changes, like with phone numbers; used for time (Long.MAX_VALUE - timestamp) to prioritize the most recent records.

Chapter 5: Big Data Index Structures

The three basic data storage engines are Hash (efficient for random lookups), B-tree (efficient for range lookups), and LSM tree (Log-Structured Merge Tree). A column family in HBase is an LSM tree, with its in-memory part being a skip list, while the external part utilizes a Bloom filter for quick determination.

Skip List

Skip List is a memory data structure that can efficiently implement insert, delete, and search operations, with the complexity of these operations being $O(logN)$. Application scenarios: Fast writing, low update cost, supports range queries; differs from B+ trees in its lower update cost, making it suitable for big data scenarios. Construction of the skip list:

  1. A given ordered linked list.
  2. Select the largest and smallest elements in the linked list, then randomly select some other elements according to a certain algorithm (randomly) to form a new ordered linked list. This new linked list is called one level, and the original linked list is called the next level down.
  3. Add a pointer field for each element just selected, this pointer points to the element in the next level down that has the same value as itself. The Top pointer points to the first element of that level.
  4. Repeat steps 2 and 3 until no further elements can be selected apart from the largest and smallest ones.

Insertion process in a skip list:

When inserting into the skip list, a new node must generate an index on the upper level with a certain probability. Find the predecessor node of the element to be inserted -> insert -> randomly generate a height value -> modify the index according to the height value

LSM Tree

Why is the LSM tree considered a write-friendly data structure?

The LSM tree is more write-friendly because write operations are sequential, leveraging the advantages of HDFS.

  1. Sequential writing: LSM tree writes are performed in a sequential manner. This is because new data is appended to a sequential log on disk (SSTables), rather than being directly written to the original data file. Compared to traditional random write operations, sequential writing has a lower overhead and can greatly improve write performance.

  2. Delayed merging: Merging operations in LSM trees are usually delayed, meaning that the merging of multiple SSTables does not occur immediately after each write operation. This avoids frequent merging during the writing process, thus reducing the delay and overhead of writing.

  3. Memory cache: LSM trees often maintain a data cache in memory to store recently written data. Even when flushing to disk, a new memstore is opened in memory to serve new writes. This avoids the need for disk access for every write, enhancing write performance. Also, the data in the memory cache can be periodically flushed to the SSTables on disk, ensuring data durability.

Application scenarios: High throughput (sequential) writing, random searching, scalability (LSM tree allows for data partitioning).

Compaction: Globally brings together data with the same key value, selecting the proper version to present to the user.

There are two main types:

  1. major compact: should not be used frequently

Advantages: After merging, there is only one file, offering the highest read performance. Disadvantages: Merging all files takes a long time and consumes a lot of bandwidth. 2. minor compact:

Advantages: Local compact operation, reduced I/O, fewer files, improved read performance. Disadvantages: Global operation, cannot be completed during the merge process.

Bloom Filter

Type of problem solved: Effectively excludes data that definitely is not in the database;

Principle: Implemented through an array and multiple hash functions. For each piece of data, perform k hash operations, each hash result corresponds to a position in the array set to 1; if querying data and finding all hash result indicated positions are 1, then the data might be in the database, otherwise, it is definitely not.

Why is HBase considered a “sequential write, random search” distributed database?

Random search: Although HBase uses an LSM tree for its index structure, HBase’s query operations are not based on the LSM tree but are performed based on the row keys in HBase tables. Organized by Region metadata.

Chapter 6 Consistency

Distributed Transactions

A transaction in a database is a sequence of operations that are either all completed or not completed; it consists of three parts: start identifier, operations, and end identifier (commit or abort); according to the structural composition, it can be divided into flat transactions (autonomous, independent transactions) and nested transactions (the execution of one transaction includes another transaction, inner child, and outer parent);

Characteristics of Nested Transactions

  • Commit dependency: The commit of a child transaction must wait for the commit of the parent transaction;
  • Abort dependency: If the parent transaction is aborted, the child transaction must be aborted;

Consistency of Distributed Transactions

This issue arises from data replication in distributed databases (which also brings reliability and availability); Three levels:

  • Strong consistency: Immediately accessible post-update
  • Eventual consistency: Accessible after some time
  • Weak consistency: Inaccessible (online shopping reviews, advertisements)


A distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance, at most two;

  • Consistency is the characteristic that data is kept consistent among multiple copies;
  • Availability is the state where the service provided is always available — returning a result within a limited time;
  • Partition tolerance means maintaining service availability and consistency in the event of a network partition;

For example, a successful write to both the Beijing and Guangzhou DBs returns success only if both are successful, and provides degraded service (inaccessible) during network failures, meeting CP requirements.


Basically Available, Soft State, Eventually consistent; it represents a trade-off between availability and consistency in CAP;

In case of failure, it allows a loss of some availability; Soft state allows for temporary inconsistencies in data, assuming it doesn’t affect availability; All data copies will eventually reach a consistent state after some time;

Overall, BASE theory targets large-scale highly available scalable distributed systems, which is opposite to the traditional transactional ACID properties. It’s entirely different from the strong consistency model of ACID, sacrificing strong consistency for availability, and it allows data to be inconsistent for a certain period but eventually reaches a consistent state.

HBase’s ACID Properties (For Reference)

Atomicity: Only guarantees the atomicity of the WAL;

Consistency: Strong consistency;

2PC (key point)

The global transaction in a distributed database is decomposed into subtransactions executed at various sites. Only when the subtransactions on all sites execute correctly can the global transaction commit. If any subtransaction cannot commit, the global transaction should be discarded, and subsequently, all subtransactions should also be discarded. Therefore, the correct commitment of all subtransactions is a prerequisite for the distributed transaction commitment.

Execution Flow

Decision Phase: The coordinator sends a prepare command and waits for participants’ responses. If all participants return “ready to commit”, the coordinator decides to commit; if at least one participant returns “ready to abort”, the coordinator decides to abort.

Execution Phase: The coordinator sends its decision from the Decision Phase to the participants. If the coordinator sends a “Commit” command, the participants execute commit; if the coordinator sends an “Abort” command, the participants execute abort, undoing changes to the database. Regardless of “Commit” or “Abort”, after execution, each participant returns an “Ack” acknowledgment to the coordinator to signal the end of the execution.


Existing Problems

Synchronous blocking, single point of failure, inconsistency, overly conservative. Coordinator failure, participants hold resources but cannot proceed with transactions, entering blocked state; can avoid with the three-phase commit protocol, if already blocked, use termination protocol for recovery.

Step 1: Select a participant PT as the new coordinator.

Step 2: PT sends a “request status” command to all participants, and each participant returns its state.

Step 3: PT makes a decision based on the current state of each participant:

  1. If some participants are in the “initial” state and some in the “ready” state, then PT sends an abort command;
  2. If all participants are in the “ready” state, then PT sends a Commit command;
  3. If at least one participant is in the “Commit” state, then PT sends a Commit command;
  4. If at least one participant is in the “Abort” state, then PT sends an abort command;



Proposer, Acceptor, Learner

Chapter 7 Concurrency Control


The main purpose of concurrency control is to ensure transaction isolation, ultimately ensuring data consistency; to solve problems brought by concurrency such as lost modifications, non-repeatable reads, dirty reads; concurrency control is about scheduling concurrent operation sequences in the right way, avoiding data inconsistencies; ensuring a transaction’s execution is not interfered with by other transactions, ensuring serializability of concurrent transactions.


A schedule is considered serial if the last operation of one transaction occurs before the first operation of another, or vice versa. Equivalence criterion: Consistent order of conflicting operations


Serializable: Equivalent to Serial Scheduling

Serializability in Distributed Transactions:

A concurrent sequence of n transactions on m sites is denoted by E; when each site’s local scheduling is serializable and in the total order, if $T_{i}<T_{j}$, such a relationship must also exist in each local schedule.

Given data items a, b located at site S1, and x, y located at site S2. Considering distributed transactions T1 and T2, determine whether each execution below is locally serializable, globally serializable, and explain the reasons for each.
1. Execution 1: at site S1 R1(a) R2(a) W2(b) W1(a),
               at site S2 R1(x) W1(x) R2(y) W2(x),
2. Execution 2: at site S1 R1(a) R2(a) W1(a) W2(b),
               at site S2 W2(x) R1(x) R2(y) W1(x).

Distributed Concurrency Control

Distributed database concurrency control refers to the process of controlling concurrent accesses in a distributed database system, with the aim of ensuring data consistency and integrity while meeting users’ needs for concurrent access, through adopting certain technical measures. It mainly addresses the following aspects:

  1. Data consistency issue: In a distributed environment, as data might be dispersed across multiple nodes, measures must be taken to ensure data consistency across nodes, to avoid data conflicts and inconsistency issues.

  2. Concurrency control issue: Multiple users may read and write the same data simultaneously, thus requiring concurrency control strategies to ensure the correctness and integrity of data, while maximizing the system’s concurrent processing capability.

Three Typical Types of Distributed Locks

Database (MySQL) method: Use a table as a lock, locking by inserting a record using the resource ID as the primary key, and unlocking by deleting the record; Redis distributed lock: setnx, which is an abbreviation for set if not exists; if the key does not exist, the key’s value is set to the value; when the key exists, no action is taken. To unlock, delete the key;

Zookeeper distributed lock: Create a directory for lock management, add a lock by creating a temporary sequential node in this directory, gain the lock if the sequence number is the smallest, otherwise listen to the directory and wait; unlock by deleting the node;


  • From the perspective of ease of understanding (from low to high): Database > Cache > Zookeeper
  • From a performance perspective (from high to low): Cache > Zookeeper >= Database
  • From a reliability perspective (from high to low): Zookeeper > Cache > Database
Buy me a coffee~
Tim AlipayAlipay
Tim PayPalPayPal
Tim WeChat PayWeChat Pay