Big Data Storage Course Notes

Chapter 1

Background

Horizontal scaling (scale-out): adding more nodes to handle a larger number of requests.

Vertical scaling (scale-up): increasing the capacity of a single node to handle a larger number of requests.

Characteristics of big data: volume, velocity, variety, value

Under the traditional relational model, the needs for horizontal scaling, system reliability, availability, and consistency cannot be met effectively.

What kind of storage does big data need

A cluster system for big data storage needs to meet:

  1. The ability to manage, schedule, and monitor computer and storage resources within the cluster uniformly
  2. The ability to store and manage data within the cluster in a distributed manner
  3. Computers within the cluster can complete a task together, collaborating and load balancing
  4. When a failure occurs on a computer within the cluster, the cluster can keep its functions working correctly and ensure that no data is lost (partition tolerance)
  5. The ability to deploy clusters, expand clusters, and replace faulty nodes in a simple way (scalability)

Technical Classification

  • Classified by metadata management method: the peer-node strategy offers higher availability because there is no central-node constraint, while the central-node strategy offers better scalability.
  • Classified by data model: Databases with different data models for different business models
  • Distributed architecture Partition All - database and table partitioning

Partition Engine - Share Nothing

Partition Storage - Share Storage, separating storage and computation

NoSQL vs. NewSQL

NoSQL refers to non-relational databases, mainly aimed at solving SQL's scalability problem. It does not guarantee ACID properties, has no transaction management, and can scale horizontally without extra work;

NewSQL is a relational database that combines the massive storage management capabilities of NoSQL databases with the ACID properties and SQL convenience of relational databases.

Chapter 2

C/S-based Hierarchical Structure

AP and DP

AP is the user-oriented application processor: it runs the application's data-processing software and passes user requests and data on to the CM (communication manager).

DP is the data processor responsible for data management, similar to a centralized database management system: it accepts the commands and data passed on by the CM and returns feedback.

Changes in AP Functionality

Centralized database - the terminal does not store data and is equivalent to an input device; the host runs all the software

Multi-client/single-server - application, client service (query), communication

Multi-client/multi-server - application, client service (directory management, cache management, query processing, commit protocol), communication

Thin client/server (Server-to-Server) - the functions above are moved from the client to the server, and the client keeps only the SQL interface and program interface. On the server, the AP (computation-related) functions are: directory management, cache management, query processing, transaction processing, lock management, and access control. In addition, the server has storage-related functions: log replay, failure recovery, index design, and physical storage.

Three Architectures

Personal understanding, correctness cannot be guaranteed. Reference article

  1. These three distributed architectures are discussed based on the Server-to-server thin client/server architecture
  2. server: a query-engine part (query, optimization, extraction; concurrency control; transaction commit) and an engine part (the transaction component, which is stateful because logs are needed for recovery, plus index, log, failure recovery, and the physical storage part)
  3. Definitions of the three architectures
  4. Scalability and compatibility characteristics of each part

In essence, the three architectures reflect NewSQL's attempt to combine SQL's strong consistency and transaction support with NoSQL's easy scalability, which results in a tug-of-war between scalability and compatibility.

Hierarchical Structure

From top to bottom they are Partition All (database and table sharding), Partition Engine (share nothing), and Partition Storage (compute-storage separation); going down, scalability decreases in exchange for better ecosystem compatibility. Database and table sharding offers excellent scalability and high performance, but it is tightly coupled to the business: the design usually has to follow the business scenario, and users must handle the sharding strategy, distributed transactions, and distributed queries themselves, so generality is poor; each node is a complete DBMS. Share Nothing shards only at the engine layer, with relatively independent nodes. Compared with traditional sharding, distributed transactions and distributed queries are handled inside the database, shielding users from the details of distributed transactions and providing a unified database service that simplifies use.

Awkwardly, most sharding implementations also shield users from distributed-transaction details by introducing middleware (data processing, data management, load balancing, database drivers), use consensus protocols such as Multi-Paxos to keep replicas consistent, and expose unified database access to users, so the strategic advantage of Partition Engine does not seem very significant.

Continue moving the sharding boundary down, below the transaction and index subsystem. At this point, because Part 1 retains a complete transaction system, it is no longer stateless and usually keeps dedicated nodes to serve requests. Part 1 therefore mainly retains the computation-related logic, while Part 2 is responsible for storage-related work such as REDO, flushing dirty data, and failure recovery. This structure is therefore known as the compute-storage separation architecture, also called the Share Storage architecture.

Because a complete compute layer is retained, the changes users need to perceive compared with a traditional database are minimal, which gives greater ecosystem compatibility. At the same time, because only the storage layer is sharded and spread out, scalability is not as good as in the two approaches above.

DDBS Component Structure


– Application Processor (AP) Functions:

User Interface: Check user identity, accept user commands (such as SQL)

Semantic Data Controller: Some constraints (view management, security control, semantic integrity control)

Global Query Processor: Translate user commands into database commands; generate global query plans; collect local query results and return to users

Global Execution Monitor (Global Transaction Manager): Schedule and monitor AP and DP; ensure consistency of replicated data; ensure atomicity of global transactions

– Data Processor (DP) Functions:

Local Query Processor: Global command —> Local command; choose the best access path to execute

Local Transaction Manager: Schedule execution based on local sub-transactions

Local Scheduling Manager: Responsible for concurrency control at the local site

Local Recovery Manager: Maintain local database consistency in fault recovery

Storage Manager: Access the database; control the database cache manager; return local execution results

DDBS Schema Structure


  • Global External Schema (GES): The global external schema is the global user view, the highest level of abstraction for distributed databases for global users.
  • Global Conceptual Schema (GCS): The global conceptual schema is the global conceptual view, an overall abstraction of the distributed database, containing all data characteristics and logical structure. The global conceptual schema is mapped to the local conceptual schema through fragmentation and allocation schemas.
  • Fragmentation Schema: describes the logical division of global data, i.e., the logical structure of the global data is divided into local logical structures according to certain conditions; each division is called a fragment. In a relational database, a sub-relation of a relation is called a fragment of that relation.
  • Allocation Schema: describes the physical placement of the logically divided data, i.e., the physical allocation view of the fragments.
  • Local Conceptual Schema (LCS): The local conceptual schema is the local conceptual view, a subset of the global conceptual schema. The local conceptual schema is used to describe the logical structure of local data at local sites. When the global data model is different from the local data model, data model conversion and other content are also involved.
  • Local Internal Schema (LIS): Defines the local physical view, a description of the physical database, similar to the inner layer of a centralized database.

DDBS Data Transparency

  • Fragmentation Transparency: fragmentation divides a relation into several sub-relations, each called a fragment. Users do not need to know which fragment the data belongs to; this is fragmentation transparency. It sits between the global conceptual schema and the fragmentation schema.
  • Allocation Transparency: Distributed databases support controlled data redundancy, meaning data can be stored redundantly at different sites. Users do not need to consider the storage site of each fragment, which is called allocation transparency. It is between the fragmentation schema and the allocation schema.
  • Local Mapping Transparency: Users do not need to consider the local storage form of data, which is called local mapping transparency. It is between the allocation schema and the local conceptual schema.
```sql
select * from S            -- Fragmentation Transparency
select * from S1 & S2      -- Allocation Transparency
select * from S1 at site1  -- Local Mapping Transparency
Execute: $SUPIMS($SNO, $FOUND, $SNAME) at L1  -- Non-transparent
```

MDBS vs. DDBS 4 Points

Distributed Database System: designed top-down, allowing flexible fragmentation and allocation design. However, a distributed database system is limited in the number of component databases, usually no more than a few dozen.

Multi-database Integration System: the data and databases already exist and are integrated bottom-up. By constraining data management capability (supporting reads only), a data integration system can scale to hundreds of component databases.

Both need to provide users with a unified data access environment, and data is stored in a distributed manner. The differences are:

  • Whether the data schema is predefined
  • Whether DBMS is homogeneous
  • Whether query optimization strategies are automatically generated
  • Whether local users necessarily exist (MDBS does)

Chapter 3

3.1 Distributed Database Design (Fragmentation, Allocation, Replication)

3.1.1 Design Strategies and Steps

Top-down, requirement analysis -> conceptual design -> distribution design -> physical design -> performance tuning

3.1.2 Definition and Role of Fragmentation

Fragmentation: Logical division of global data.

Allocation: The process of specifying the storage site for fragments, called allocation. When a fragment is stored at more than one site, it is called data replication. If each fragment is stored at only one site, it is called data partition storage.

Role of Fragmentation:

  • Reduce the amount of data transmitted over the network
  • Increase the locality of transaction processing
  • Improve data query efficiency and system reliability
  • Achieve load balancing

The fragmentation process is the process of logically dividing global data and then physically allocating it. Global data is defined into fragments by the fragmentation schema, and each fragment is assigned to sites by the allocation schema.

Principles of Fragmentation: Completeness (no data loss), Reconstructability (no relation loss), Disjointness (formal description)

Types of Fragmentation: Horizontal fragmentation (by tuples), Vertical fragmentation (by attributes), Mixed fragmentation

3.1.3 Horizontal Fragmentation

Horizontal Fragmentation: Selection

Derived: Semi-join

Design based on fragmentation demand information, sourced from application factors and database factors

Design criteria: Define a set of simple predicate sets with completeness and minimality
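
As a small worked illustration with generic relations R and S (the predicates F1, F2 are made up): primary horizontal fragments are selections whose predicates together are complete and disjoint, and a derived fragment follows the fragments of the relation it joins with via a semi-join:

$$R_1=\sigma_{F_1}(R),\quad R_2=\sigma_{F_2}(R),\qquad F_1\vee F_2=\text{true}\ (\text{completeness}),\quad F_1\wedge F_2=\text{false}\ (\text{disjointness}),\quad R=R_1\cup R_2\ (\text{reconstruction})$$

$$S_i=S\ltimes R_i \quad (\text{derived horizontal fragment of } S \text{ following the fragments of } R)$$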

3.1.4 Vertical Fragmentation

Fragmentation representation: Projection operation

Completeness proof: Union operation (attributes)

Reconstructability proof: Join operation

Disjointness proof: intersection of the attribute sets; the result is not empty but contains only the primary key
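
A small worked illustration with a generic relation R whose key is K (the attribute grouping is made up), showing what the three proofs look like:

$$R_1=\Pi_{K,A_1,A_2}(R),\qquad R_2=\Pi_{K,A_3}(R)$$

$$\{K,A_1,A_2\}\cup\{K,A_3\}=\text{attr}(R)\ (\text{completeness}),\qquad R_1\bowtie R_2=R\ (\text{reconstructability}),\qquad \{K,A_1,A_2\}\cap\{K,A_3\}=\{K\}\ (\text{disjointness})$$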

3.1.5 Fragmentation Representation Methods

Graphical representation (table) and tree representation


3.1.6 Allocation Design

The process of mapping fragments to physical site storage is called the allocation design process.

  • Non-replicated allocation If each fragment is stored at only one site, it is called partitioned distribution, and the corresponding distributed database is called a fully partitioned distributed database.
  • Replicated allocation If each fragment has a copy at each site, it is called fully replicated allocation, and the corresponding distributed database is called a fully replicated distributed database. If each fragment has a copy at only some sites, it is called partially replicated allocation, and the corresponding distributed database is called a partially replicated distributed database.

3.1.7 Data Replication Technology

Synchronous replication and asynchronous replication; master-slave replication and peer-to-peer replication;

3.2 Distributed Query Optimization

(Reflects key steps, fragmentation expansion, etc., ∪ is a binary operation, empty circle)

3.2.1 Significance of Query Optimization

3.2.2 Query Processor

3.2.3 Query Decomposition

Decompose calculus queries into algebraic queries based on the global conceptual schema, producing a global logical query plan tree. The main steps:

  1. Query normalization (commutative, associative, distributive laws)
  2. Syntax and semantic analysis (detect syntax errors, meaningless queries, and missing permissions, using the query graph)
  3. Query reduction
  4. Query rewriting

3.2.4 Data Localization

  • Expand global relations into their fragments using union and join operations
  • Draw the global query tree first, optimize it, and then convert it into a fragment query tree
  • Push selection operations down early, prune fragment sub-queries whose joins are empty, and push ⋈ down below ∪ so that joins execute before the union (using the distributive law)

3.2.5 Fragment Query Optimization

3.3 Distributed Access Optimization

3.3.1 Basic Concepts

3.3.2 Theoretical Basis for Optimization

  • Cardinality of a relation: the number of tuples contained in relation R, denoted Card(R)
  • Attribute length: the number of bytes defined for attribute A, denoted Length(A)
  • Tuple length: the number of bytes of each tuple of R, denoted Length(R), where $Length(R)=\sum_{i} Length(A_i)$
  • Size of a relation: the number of bytes contained in R, denoted Size(R), where $Size(R)=Card(R)\times Length(R)$
  • Distinct values of an attribute: the number of different values attribute A takes in R, denoted Val(A)
  • Maximum and minimum values of attribute A: denoted Max(A) and Min(A)

Selection operation ($S=\sigma_{F}(R)$): cardinality $Card(S)=\rho \cdot Card(R)$. Computing ρ: consider only an equality condition $A = X$ on the selection attribute, where A is an attribute of R and X is a constant; then $\rho = \frac{1}{Val(R,A)}$.

Computing Val(S,B):

  • When attribute B appears in the selection predicate: $Val(S,B)=1$
  • When attribute B is a key (primary key): $Val(S,B)=\rho \cdot Val(R,B)$
  • When attribute B does not appear in the selection predicate:

$$Val(S,B)=\begin{cases} Card(S), & \text{if } Card(S) \le \frac{Val(R,B)}{2} \\ \frac{Card(S)+Val(R,B)}{3}, & \text{if } \frac{Val(R,B)}{2} < Card(S) < 2\,Val(R,B) \\ Val(R,B), & \text{if } Card(S) \ge 2\,Val(R,B) \end{cases}$$
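
A rough sketch of using these estimates (the numbers and function names are made up, and only the equality-predicate case is handled):

```python
# Sketch of the selection-cardinality estimates above (illustrative only).
def card_after_selection(card_r: int, val_r_a: int) -> float:
    """Card(S) = rho * Card(R), with rho = 1 / Val(R, A) for an equality predicate A = X."""
    rho = 1.0 / val_r_a
    return rho * card_r

def val_after_selection(card_s: float, val_r_b: int) -> float:
    """Estimate Val(S, B) for an attribute B that does not appear in the selection predicate."""
    if card_s <= val_r_b / 2:
        return card_s
    if card_s >= 2 * val_r_b:
        return val_r_b
    return (card_s + val_r_b) / 3

# Example: Card(R) = 10_000, Val(R, A) = 50, Val(R, B) = 300
card_s = card_after_selection(10_000, 50)        # 200.0
# Card(S) = 200 lies between Val(R,B)/2 and 2*Val(R,B), so the middle band applies: (200+300)/3
print(card_s, val_after_selection(card_s, 300))
```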

Join operation: (figure omitted)

Semi-join operation: (figure omitted)

3.3.3 Semi-join Optimization Method

(figure omitted) The point is whether the cost paid for the semi-join in a given round is less than the cost of the full join it replaces.
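
A rough sketch of that comparison, counting only data-transfer volume as in the notes (the figures are hypothetical):

```python
# Rough comparison of "ship R to S's site and join" vs. "semi-join reduction first".
# Only transmission volume is counted; all numbers below are invented.

def full_join_cost(card_r: int, length_r: int) -> int:
    """Ship all of R to the site of S."""
    return card_r * length_r

def semi_join_cost(val_s_a: int, length_a: int,
                   card_r_reduced: int, length_r: int) -> int:
    """Ship the projection of S on the join attribute A to R's site,
    then ship the reduced relation R semi-joined with S back to S's site."""
    return val_s_a * length_a + card_r_reduced * length_r

# Hypothetical numbers: R has 100_000 tuples of 100 bytes; S has 2_000 distinct
# join values of 4 bytes; only 5_000 tuples of R survive the semi-join.
print(full_join_cost(100_000, 100))              # 10_000_000
print(semi_join_cost(2_000, 4, 5_000, 100))      # 508_000 -> the semi-join pays off
```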

Assignment 1 Fragmentation Design
There is a commodity shopping system, which contains two global relations: user table USER(UID, UNAME, ADDRESS, HOBBY, CITY) and order table ORDER(UID, PID, PRICE), where UID is the user number, UNAME is the user name, CITY is the city. PID is the product number, PRICE is the total price of the order, UID is the primary key in the USER table, and a foreign key in the ORDER table. To establish a distributed database, the fragmentation rules are:
(1) The USER relation is vertically fragmented according to attribute sensitivity
U1 contains non-sensitive attributes: UID, UNAME, CITY
U2 contains sensitive attributes: ADDRESS, HOBBY
(2) All non-sensitive attributes in USER are horizontally fragmented according to CITY
U11: CITY IN { Beijing, Shanghai, Guangzhou, Shenzhen}
U12: CITY NOT IN { Beijing, Shanghai, Guangzhou, Shenzhen}
(3) The ORDER relation is fragmented according to its join relationship with USER (derived fragmentation), resulting in O1 and O2.

Assignment 2 Query Optimization
Query Q: "Query all orders purchased by users in 'Xuzhou City' with product number 'P1', and get the user number, user name, product number, and total order price in the order."
(1) Write the relational algebra expression for query Q and transform it into a fragment query
(2) Optimize the fragment query tree

Assignment 3 Access Optimization
As shown in the figure below

(figure omitted)

Chapter 4 HBase

Problems with HDFS

  1. Does not support random rewriting of data
  2. HDFS does not have the concept of data packets
  3. HDFS cannot quickly perform common data query functions such as row count statistics and filter scanning, generally requiring MapReduce to achieve.
  4. (Advantages are large file storage, multiple copies, automatic partitioning)

Characteristics of HBase

  1. Uses HDFS storage at the bottom, but maintains its own file structure and metadata.
  2. Uses a column + key-value pair storage mode
  3. Can achieve convenient horizontal scaling
  4. Can achieve automatic data fragmentation
  5. Strict read-write consistency and automatic failover
  6. Full-text retrieval and filtering

Advantages: good at handling massive data writes, high performance, high reliability, and scalability. Disadvantages: no support for table joins or analytical queries, etc.

Region

Region server is a container for storing Regions; a Region is a part of a table’s data, equivalent to a shard of a table in a relational database. A table may be stored in different Regions. Characteristics:

  1. A Region cannot span servers, a Region Server can have one or more Regions;
  2. When the data volume increases, Regions will split;
  3. For load balancing needs, Regions will migrate;
  4. All data access operations of a Region are implemented by calling the HDFS client interface.

Different rows of the same table can be stored on different servers, and the same row of a table can also be stored on different servers. How should this sentence be understood? (I do not understand it; I think the second half is questionable.) A server stores Regions rather than whole tables: each Region contains several Stores, a Store corresponds to one column family and is stored as its own object, so what a server holds is not necessarily a whole table but possibly shards of different tables.

Write-Ahead Log (WAL): writes go to the WAL first (a RegionServer has only one WAL) and are then written to the MemStore;

Each Region internally has multiple Store instances, one per column family; each Store has one MemStore instance. When the MemStore is full, it is flushed to HDFS as a new HFile (stored as an LSM tree: data is sorted before the final write, so randomly written data ends up stored sequentially, improving read efficiency). The MemStore can be seen as an in-memory cache: both reads and writes consult the MemStore first.


Add/Delete/Modify Operations

  • HBase adds a cell: a new piece of data is appended to HDFS
  • HBase modifies a cell: a new piece of data is appended to HDFS, with a version number larger than before
  • HBase deletes a cell: a new piece of data is appended to HDFS with no value and type Delete, which is a tombstone marker; the cell is physically removed only when HFiles are merged (see the sketch below)
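
A toy sketch of this append-only, versioned model (purely illustrative; the Cell layout and helper names are invented, not HBase's API):

```python
# Toy model of HBase-style cells: every add/modify/delete is an append;
# reads resolve the newest version and honour tombstones.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cell:
    row: str
    column: str      # "family:qualifier"
    ts: int          # version (timestamp)
    value: Optional[str]
    kind: str        # "Put" or "Delete" (tombstone)

log: list[Cell] = []

def put(row, column, ts, value):
    log.append(Cell(row, column, ts, value, "Put"))

def delete(row, column, ts):
    log.append(Cell(row, column, ts, None, "Delete"))   # tombstone, value-less

def get(row, column) -> Optional[str]:
    cells = [c for c in log if c.row == row and c.column == column]
    if not cells:
        return None
    newest = max(cells, key=lambda c: c.ts)
    return None if newest.kind == "Delete" else newest.value

put("u1", "cf:name", 1, "Alice")
put("u1", "cf:name", 2, "Alice Chen")   # "modify" = put with a larger version
delete("u1", "cf:name", 3)              # "delete" = append a tombstone
print(get("u1", "cf:name"))             # None: the tombstone wins until compaction purges it
```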

Read/Write Process

Reference article: HBase Read/Write Process. Lookup path: ZooKeeper (ROOT) -> RegionServer (META) -> Region -> MemStore.

Read process:

  1. The client queries the /hbase/meta-region-server znode in ZooKeeper to find which RegionServer holds the hbase:meta table.

  2. The client connects to the RegionServer holding hbase:meta. This table stores the row-key ranges of all Regions, so the client can look up the Region containing the requested row key and the RegionServer serving that Region.

  3. The client reads the required data from that RegionServer, first from the MemStore and then from the HFiles.

  4. After the first access, the client caches the meta information (BlockCache); subsequent operations look the meta information up directly in the cache.

Write process:

  5. The client queries the /hbase/meta-region-server znode in ZooKeeper to find which RegionServer holds the hbase:meta table.

  6. The client connects to that RegionServer and, as in the read path, uses hbase:meta to find the target Region and the RegionServer serving it.

  7. The client writes the data to both the HLog (WAL) and the MemStore on that RegionServer.

  8. When the MemStore reaches its threshold, the data is flushed into an HFile; through compaction the HFiles gradually grow larger, eventually triggering a split that divides the large Region into two Regions.

  9. If data in the MemStore is lost, it can be recovered from the HLog. When multiple HFiles reach a certain size or number, a compaction is triggered that merges them into one HFile, performing version merging and physical deletion at the same time.

Rowkey Design

  • Three principles: length principle (the shorter the better), hash principle (data evenly distributed), uniqueness principle
  • Salting: prepend a random number to the rowkey; random prefixes spread rows across different Regions.
  • Pre-partitioning: solves the hotspot, split, and merge problems caused by automatic Region splitting (and can reserve room for expansion); for example, generate random numbers from 0-499 and assign ranges such as 0-50, 50-100, etc. to the Regions.
  • Hashing: solves the need to keep related data together (e.g., the same user's data for the same day); hash a parameter (such as uid + date), take the result modulo 500, and prepend the remainder; combined with pre-partitioning this meets the requirement while spreading data evenly across RegionServers.
  • Reversing: sacrifices rowkey ordering for randomness; solves hotspots where the prefix is fixed and the suffix varies, such as phone numbers; applied to time, Long.MAX_VALUE - timestamp puts the most recent records first (see the sketch after this list).
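
A sketch of the salting, hashing, and reversed-time strategies above (the bucket count of 500 and the key layout are illustrative assumptions):

```python
# Sketch of the salting / hashing / reversing rowkey strategies.
import hashlib
import random

BUCKETS = 500

def salted_rowkey(raw_key: str) -> str:
    """Salting: a random prefix spreads writes across Regions (range scans become harder)."""
    return f"{random.randint(0, BUCKETS - 1):03d}_{raw_key}"

def hashed_rowkey(uid: str, date: str) -> str:
    """Hashing: the prefix is deterministic, so one user's data for one day stays in one bucket."""
    bucket = int(hashlib.md5(f"{uid}{date}".encode()).hexdigest(), 16) % BUCKETS
    return f"{bucket:03d}_{uid}_{date}"

def reversed_time_key(uid: str, timestamp_ms: int) -> str:
    """Reversing time: Long.MAX_VALUE - timestamp makes the newest records sort first."""
    long_max = (1 << 63) - 1          # Java Long.MAX_VALUE
    return f"{uid}_{long_max - timestamp_ms}"

print(salted_rowkey("u42"))
print(hashed_rowkey("u42", "20231102"))
print(reversed_time_key("u42", 1_698_940_800_000))
```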

Chapter 5 Big Data Index Structure

The three basic data storage engines are the hash table (efficient random lookup), the B-tree (efficient range lookup), and the LSM tree (Log-Structured Merge Tree). An HBase column family (Store) is one LSM tree; its in-memory part is a skip list, and Bloom filters are chosen for fast membership checks.

Skip List

Skip List is a memory data structure that can efficiently implement insert, delete, and lookup operations, all of which have a complexity of $O(logN)$.

Applicable scenarios: fast writes, low update cost, and support for range queries; the difference from B+ trees lies in the low update cost, which makes skip lists well suited to big-data scenarios.

Construction of Skip List:

  1. Given an ordered linked list.
  2. Select the largest and smallest elements from the linked list, then randomly select some elements according to a certain algorithm (randomly) to form an ordered linked list. This new linked list is called a level, and the original linked list is called its lower level.
  3. Add a pointer field to each selected element, pointing to the element with the same value in the lower level. The Top pointer points to the first element of this level.
  4. Repeat steps 2 and 3 until no more elements other than the largest and smallest can be selected.

Insertion Process of Skip List:

When inserting into a skip list, the new node should generate an index at the upper level with a certain probability. Find the predecessor node of the element to be inserted -> Insert -> Randomly generate height value -> Modify index according to height value
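
A minimal skip-list sketch, assuming the usual promote-with-probability-1/2 scheme (illustrative, not production code):

```python
# Minimal skip list: search and insert are expected O(log N) because each level
# keeps roughly half of the elements of the level below.
import random

MAX_LEVEL = 16
P = 0.5  # probability of promoting a node one level up

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)   # forward[i] = next node on level i

class SkipList:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)
        self.level = 0                        # highest level currently in use

    def _random_level(self):
        lvl = 0
        while random.random() < P and lvl < MAX_LEVEL:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level, -1, -1):          # walk down from the top level
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * (MAX_LEVEL + 1)       # predecessors on each level
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()                   # randomly generated height
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):                     # splice into every level up to lvl
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for k in [3, 7, 1, 9, 5]:
    sl.insert(k)
print(sl.search(7), sl.search(4))   # True False
```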

LSM Tree

Why is LSM Tree considered a write-friendly data structure? LSM Tree is more friendly to writes, as write operations are sequential, leveraging the advantages of HDFS.

  1. Sequential Writing: LSM Tree’s write operations are performed in a sequential write manner. This is because new data is appended to sequential logs (SSTables) on disk, rather than directly writing to the original data files. Compared to traditional random write operations, sequential writing has lower overhead, greatly improving write performance.

  2. Delayed Merging: LSM Tree’s merge operations are usually delayed, meaning that merges between multiple SSTables are not executed immediately after each write operation. This avoids frequent merge operations during writes, reducing write latency and overhead.

  3. Memory Cache: LSM Tree usually maintains a data cache area in memory to store recently written data. Even when flushed to the hard disk, a new memstore is created in memory to serve new writes. This avoids accessing the disk for each write, improving write performance. Meanwhile, data in the memory cache can also be periodically flushed to SSTables on disk to ensure data persistence.

Applicable Scenarios: High throughput (sequential) writing, random lookup, scalability (LSM tree allows data partitioning).
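
A toy sketch of the write path just described, assuming invented names and a tiny flush threshold: append to a WAL, buffer in an in-memory memtable, and flush to an immutable sorted run when full:

```python
# Toy LSM write path: WAL append + in-memory memtable, flushed to a sorted,
# immutable "SSTable" when it grows past a threshold. Illustrative only.
import bisect

MEMTABLE_LIMIT = 4          # tiny threshold so the example flushes quickly

wal: list[tuple[str, str]] = []              # sequential log of every write
memtable: dict[str, str] = {}                # recent writes, kept in memory
sstables: list[list[tuple[str, str]]] = []   # flushed, sorted, immutable runs

def put(key: str, value: str) -> None:
    wal.append((key, value))             # 1) sequential WAL append (durability)
    memtable[key] = value                # 2) buffer in memory
    if len(memtable) >= MEMTABLE_LIMIT:  # 3) flush: one sequential disk write
        sstables.append(sorted(memtable.items()))
        memtable.clear()

def get(key: str):
    if key in memtable:                  # newest data first
        return memtable[key]
    for run in reversed(sstables):       # then newer SSTables before older ones
        keys = [k for k, _ in run]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return run[i][1]
    return None

for i in range(10):
    put(f"k{i:02d}", f"v{i}")
print(get("k03"), get("k09"), get("zzz"))   # v3 v9 None
```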

Compaction: Globally integrate data with the same key value, select the appropriate version for the user.

There are mainly two types:

  1. Major compaction: should not be used frequently. Advantage: after merging there is only one file per Store, so read performance is highest. Disadvantage: merging all the files takes a long time and consumes a lot of bandwidth.
  2. Minor compaction: Advantage: a local compaction with less I/O; it reduces the number of files and improves read performance. Disadvantage: global operations (such as physically removing deleted or expired versions) cannot be completed during the merge. A sketch of the merge step follows below.
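
A sketch of the merge step referenced above (illustrative; real compactions stream HFiles and apply many more rules): several sorted runs are merged in one sequential pass, only the newest version of each key is kept, and tombstones are purged only in a major compaction:

```python
# Sketch of the compaction merge step.
import heapq

TOMBSTONE = None

def compact(runs, major=False):
    """runs: list of sorted lists of (key, ts, value); returns one merged sorted run."""
    merged = {}
    for key, ts, value in heapq.merge(*runs):             # single sequential pass
        if key not in merged or ts > merged[key][0]:      # keep the newest version
            merged[key] = (ts, value)
    out = []
    for key in sorted(merged):
        ts, value = merged[key]
        if major and value is TOMBSTONE:                  # purge deletes only in a major compaction
            continue
        out.append((key, ts, value))
    return out

run1 = [("a", 1, "x"), ("b", 1, "y")]
run2 = [("a", 2, TOMBSTONE), ("c", 1, "z")]               # newer delete of "a"
print(compact([run1, run2]))              # minor: the tombstone for "a" is kept
print(compact([run1, run2], major=True))  # major: "a" disappears for good
```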

Bloom Filter

Problem type solved: efficiently ruling out data that is definitely not in the database. Implementation principle: a bit array plus multiple hash functions; each data item is hashed k times and the array position indicated by each hash result is set to 1. When querying an item, if all k positions indicated by its hashes are 1, the item may be in the database; otherwise it is definitely not.
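
A minimal Bloom-filter sketch along these lines (the sizes and the double-hashing trick are illustrative choices, not HBase's implementation):

```python
# Minimal Bloom filter: m-bit array + k hash functions. No false negatives,
# only false positives.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits)      # one byte per bit, for simplicity

    def _positions(self, item: str):
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big")
        # derive k indices from two hashes (a common double-hashing scheme)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-123")
print(bf.might_contain("row-123"))   # True  (definitely added)
print(bf.might_contain("row-999"))   # usually False; True would be a false positive
```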

Why is HBase considered a "sequential write, random lookup" distributed database? Sequential write: all writes are appends (to the WAL and, via MemStore flushes, to new HFiles). Random lookup: although HBase uses the LSM-tree storage structure, queries are not scans over the LSM tree but lookups by row key, using the meta information organized by Region to locate the data directly.

Chapter 6 Consistency

Distributed Transactions

A transaction is a sequence of database operations that are either all performed or none performed. It consists of three parts: a begin marker, the operations, and an end marker (commit or abort). Transactions can be divided into flat transactions (autonomous, independent) and nested transactions (the execution of one transaction contains another; the inner one is the child and the outer one the parent).

Characteristics of Nested Transactions

  • Commit Dependency: The commit of a subtransaction must wait for the commit of the parent transaction;
  • Abort Dependency: If the parent transaction aborts, the subtransaction must abort;

Consistency of Distributed Transactions

This problem is caused by data replication in distributed databases (which also brings reliability and availability); Three Levels:

  • Strong Consistency: Accessible immediately after update
  • Eventual Consistency: Accessible after a period of time
  • Weak Consistency: no guarantee on when (or whether) the update becomes visible (e.g., online shopping reviews, advertisements)

CAP

A distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance, at most two;

  • Consistency is the characteristic of keeping data consistent across multiple replicas;
  • Availability is the service provided is always available - returning results within a limited time;
  • Partition Tolerance: the system continues to guarantee availability and consistency when a network partition occurs. For example, a system that returns success only after writes to both the Beijing and Guangzhou DBs succeed, and that provides degraded service (no access) during a network failure, satisfies CP.

BASE

Basically Available, Soft State, Eventually Consistent; the result of trading off availability against consistency in CAP;

Basic availability allows a partial loss of availability during failures; soft state means data is allowed to sit in an inconsistent intermediate state, on the assumption that this does not affect availability; eventual consistency means all data replicas reach a consistent state after a period of time;

Overall, BASE theory targets large-scale, highly available, scalable distributed systems and is the opposite of the traditional ACID transaction properties: unlike ACID's strong-consistency model, it gains availability by sacrificing strong consistency, allowing data to be inconsistent for a period of time before eventually reaching a consistent state.

HBase’s ACID Characteristics (Understanding)

Atomicity: Only guarantees the atomicity of WAL; Consistency: Strong consistency;

2PC (Key Point)

Global transactions in a distributed database are composed of subtransactions executed at each site. The global transaction can commit only if the subtransactions at all sites execute correctly; if even one subtransaction cannot commit, the global transaction must abort, and all the other subtransactions must abort as well. Therefore, all subtransactions being able to commit correctly is the precondition for committing a distributed transaction.

Execution Process

Decision Phase: The coordinator sends a pre-commit (prepare) command, then waits for responses from participants. If all participants return “ready to commit”, the coordinator makes a commit decision; if at least one participant returns “ready to abort”, the coordinator makes an abort decision.

Execution Phase: The coordinator sends the decision made in the decision phase to participants. If the coordinator sends a “commit” command to participants, they execute the commit; if the coordinator sends an “abort” command, participants execute the abort, canceling modifications to the database. Whether “commit” or “abort”, participants must return an “acknowledgment” (Ack) response to the coordinator after execution, notifying the coordinator that execution is complete.
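
A sketch of this two-phase control flow (message passing, timeouts, and logging are omitted; the participant interface prepare/commit/abort/ack is hypothetical):

```python
# Sketch of two-phase commit as seen from the coordinator.

def two_phase_commit(participants) -> str:
    # Phase 1 (decision phase): the coordinator sends "prepare" and collects votes.
    votes = [p.prepare() for p in participants]   # each vote: "ready-commit" or "ready-abort"
    decision = "commit" if all(v == "ready-commit" for v in votes) else "abort"

    # Phase 2 (execution phase): broadcast the decision; every participant must Ack.
    acks = []
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()            # cancel the modifications made to the local database
        acks.append(p.ack())     # acknowledgement back to the coordinator
    assert all(acks), "the coordinator waits until every participant has acknowledged"
    return decision
```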


Existing Problems

Synchronous blocking, single point problem, data inconsistency, overly conservative

If the coordinator fails, participants holding resources cannot complete their transactions and enter a blocked state. This can be avoided with the three-phase commit protocol; if blocking has already occurred, a termination protocol is used to recover.

Step 1: Select participant PT as the new coordinator.

Step 2: PT sends an “access state” command to all participants, each participant returns its own state.

Step 3: PT makes a decision based on the current state of each participant (a sketch of these rules follows the list):

  1. If some participants are in the “initial” state, and some are in the “ready” state, PT sends an abort command;
  2. If all participants are in the “ready” state, PT sends a commit command;
  3. If at least one participant is in the “commit” state, PT sends a commit command;
  4. If at least one participant is in the “abort” state, PT sends an abort command;
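
A sketch of these decision rules as a single function (participant states are the 2PC states above; the function name is made up):

```python
# Sketch of the termination-protocol decision made by the new coordinator PT.

def termination_decision(states: list[str]) -> str:
    """states: one of 'initial', 'ready', 'commit', 'abort' per surviving participant."""
    if "commit" in states:                 # someone already committed -> must commit
        return "commit"
    if "abort" in states:                  # someone already aborted -> must abort
        return "abort"
    if "initial" in states:                # some never voted -> safe to abort
        return "abort"
    return "commit"                        # everyone is 'ready' -> commit

print(termination_decision(["ready", "initial"]))   # abort
print(termination_decision(["ready", "ready"]))     # commit
print(termination_decision(["ready", "commit"]))    # commit
```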

Paxos

Prepare -> Accept -> Learn; roles: Proposer, Acceptor, Learner
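
A minimal single-decree acceptor sketch matching the Prepare/Accept phases named above (proposal numbering, networking, and the learner are simplified assumptions):

```python
# Minimal single-decree Paxos acceptor.

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number promised so far
        self.accepted_n = -1       # number of the proposal accepted, if any
        self.accepted_v = None     # value of the proposal accepted, if any

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # reply with any previously accepted proposal so the proposer adopts its value
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", None, None)

    def accept(self, n, value):
        """Phase 2b: accept the proposal unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, value
            return "accepted"      # learners would then be told about (n, value)
        return "rejected"

a = Acceptor()
print(a.prepare(1))        # ('promise', -1, None)
print(a.accept(1, "x"))    # 'accepted'
print(a.prepare(2))        # ('promise', 1, 'x')  -> a new proposer must re-propose "x"
```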

Chapter 7 Concurrency Control

Purpose

The main purpose of concurrency control is to guarantee transaction isolation and, ultimately, data consistency. It solves concurrency anomalies such as lost updates, non-repeatable reads, and dirty reads by scheduling concurrent operation sequences in a correct way, avoiding data inconsistency, ensuring that a transaction's execution is not disturbed by other transactions, and guaranteeing the serializability of concurrent transaction execution.

Serializability

If the last operation of one transaction precedes the first operation of another transaction (or vice versa), the schedule is serial. Equivalence test: the order of conflicting operations is consistent.


Serializable: equivalent to some serial schedule. Serializability of distributed transactions: the concurrent execution of n transactions at m sites is denoted E; E is serializable when the local schedule at every site is serializable and, if $T_{i}<T_{j}$ in the total order, the same relationship also holds in every local schedule.

Exercise: data items a and b are stored at site S1, and x and y at site S2. There are distributed transactions T1 and T2; determine whether each execution below is locally serializable and globally serializable, and explain why.

  1. Execution 1: at S1: R1(a) R2(a) W2(b) W1(a); at S2: R1(x) W1(x) R2(y) W2(x)
  2. Execution 2: at S1: R1(a) R2(a) W1(a) W2(b); at S2: W2(x) R1(x) R2(y) W1(x)
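
A small sketch of checking conflict serializability with a precedence graph, which can be applied to the local schedules above (the schedule encoding and helper names are made up):

```python
# Conflict-serializability check via a precedence graph: add an edge Ti -> Tj for
# every pair of conflicting operations (same item, at least one write, Ti's op
# first), then test for cycles. A schedule is a list of (txn, op, item) tuples.
from itertools import combinations

def precedence_edges(schedule):
    edges = set()
    for (t1, op1, x1), (t2, op2, x2) in combinations(schedule, 2):
        if t1 != t2 and x1 == x2 and "W" in (op1, op2):
            edges.add((t1, t2))           # earlier conflicting op -> edge t1 -> t2
    return edges

def has_cycle(edges):
    nodes = {n for e in edges for n in e}
    def reachable(src, dst, seen=()):
        return any(b == dst or (b not in seen and reachable(b, dst, seen + (b,)))
                   for a, b in edges if a == src)
    return any(reachable(n, n) for n in nodes)

# Execution 1 at site S1 from the exercise above: R1(a) R2(a) W2(b) W1(a)
s1 = [("T1", "R", "a"), ("T2", "R", "a"), ("T2", "W", "b"), ("T1", "W", "a")]
edges = precedence_edges(s1)
print(edges, "serializable:", not has_cycle(edges))   # {('T2','T1')} serializable: True
```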

Distributed Concurrency Control

Distributed database concurrency control refers to the process of adopting certain technical means to control concurrent access in a distributed database system to ensure data consistency and integrity while meeting user concurrent access needs. It mainly solves the following issues:

  1. Data Consistency Problem: In a distributed environment, since data may be distributed across multiple nodes, measures must be taken to ensure data consistency across multiple nodes, avoiding data conflicts and inconsistencies.

  2. Concurrency Control Problem: Multiple users may simultaneously perform read and write operations on the same data, so concurrency control strategies need to be adopted to ensure data correctness and integrity while maximizing the system’s concurrent processing capabilities.

Three Typical Distributed Locks

Database (MySQL) method: Use a table as a lock, locking is inserting a record with a resource ID as the primary key, unlocking is deleting;

Redis distributed lock: SETNX ("set if not exists"): if the key does not exist, set it to the given value; if the key already exists, do nothing. Unlock with DEL key (a sketch follows these three approaches);

Zookeeper Distributed Lock: Create a directory for lock management, locking is creating a temporary sequential node under this directory, if the sequence number is the smallest, the lock is obtained, otherwise, wait for the directory to change; unlocking is deleting the node;
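
A sketch of the Redis SETNX-style lock using redis-py (the key name, TTL, and token check are illustrative; a production lock needs more care, e.g. an atomic release script or Redlock):

```python
# Redis SETNX-style distributed lock (simplified).
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire(resource: str, ttl_seconds: int = 30):
    token = str(uuid.uuid4())
    # SET key value NX EX ttl  ==  SETNX plus an expiry, in one atomic command
    if r.set(f"lock:{resource}", token, nx=True, ex=ttl_seconds):
        return token          # lock acquired; keep the token to release it later
    return None               # someone else holds the lock

def release(resource: str, token: str) -> bool:
    key = f"lock:{resource}"
    # only the holder should delete the key; this check-then-delete is simplified
    if r.get(key) == token.encode():
        r.delete(key)
        return True
    return False

tok = acquire("order-42")
if tok:
    try:
        pass              # ... critical section ...
    finally:
        release("order-42", tok)
```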

Comparison:

  • Ease of understanding (easiest first): Database > Cache > Zookeeper
  • Performance (highest first): Cache > Zookeeper >= Database
  • Reliability (highest first): Zookeeper > Cache > Database