Big Data Storage Review Class
1 Exam Question Types
- Essay Questions (30 points) - Discuss your understanding of concepts
- Distributed Database Design and Query Optimization (20) - Design of distributed databases, sharding design, definition, corresponding query optimization
- Distributed Access Optimization (30) - Physical characteristic indicators, calculation of transmission costs
- Storage Structure Design (10) - HBase design, Bloom filter design (PPT)
- Distributed Transactions (10) - Consistency, concurrency control
The three assignments correspond to items 2, 3, and 4.
2 Chapter Review
2.1 Chapter 1
- The origin of big data (Why did big data storage systems arise? The traditional relational model cannot effectively meet the needs for horizontal scaling, system reliability and availability, and consistency.)
- Characteristics of big data
- What kind of storage system does big data require
2.2 Chapter 2
- Client/Server Architecture (Changes in AP functionality in different architectures)
- Relationship between the shared-nothing architecture, the database and table partitioning (sharding) architecture, the storage-compute separation architecture, and the client/server architecture (open question; combine the PPT with your own understanding; see the reference article)
- Schema structure of relational distributed database systems
- Data transparency of distributed database systems (three types, definitions, examples; determine the type of transparency for operation statements)
- Differences and connections between multi-database systems and distributed database systems
2.3 Chapter 3
- Sharding principles, definitions (operations), and representation methods in distributed database design
- Query optimization strategies and fragment query optimization methods in distributed databases
- Access optimization methods for distributed queries, calculation of characteristic parameters (selection operations, projection operations, natural join operations, semi-join operations)
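For the transmission cost calculations in this chapter, a small worked comparison of a direct join against a semi-join plan may help. This is a sketch under stated assumptions: the linear cost model `c0 + c1 * bytes`, the constants, the `Relation` helper, and the sample data are all made up for illustration and are not taken from the course material.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    rows: list     # list of dict tuples
    widths: dict   # attribute name -> bytes per value (assumed fixed width)

    def size(self):
        """Bytes needed to ship every row of this relation."""
        return len(self.rows) * sum(self.widths.values())


def transfer_cost(nbytes, c0=10, c1=1):
    """Assumed linear model: fixed setup cost c0 plus c1 per byte."""
    return c0 + c1 * nbytes


def project_distinct(rel, attr):
    """pi_attr(rel) with duplicates removed."""
    values = sorted({r[attr] for r in rel.rows})
    return Relation([{attr: v} for v in values], {attr: rel.widths[attr]})


def semi_join(rel, other, attr):
    """rel semi-join other: rows of rel whose attr value appears in other."""
    keys = {r[attr] for r in other.rows}
    return Relation([r for r in rel.rows if r[attr] in keys], rel.widths)


def compare_plans(r, s, attr):
    """R is at site 1, S at site 2, and the join result is needed at site 2."""
    direct = transfer_cost(r.size())          # ship all of R to site 2
    keys = project_distinct(s, attr)          # computed at site 2, shipped to site 1
    reduced = semi_join(r, keys, attr)        # computed at site 1, shipped to site 2
    semi = transfer_cost(keys.size()) + transfer_cost(reduced.size())
    return direct, semi


R = Relation(
    [{"id": i, "dept": i % 50, "name": f"emp{i}"} for i in range(1000)],
    {"id": 4, "dept": 4, "name": 32},
)
S = Relation(
    [{"dept": d, "dname": f"dept{d}"} for d in range(5)],
    {"dept": 4, "dname": 20},
)
direct, semi = compare_plans(R, S, "dept")
print(direct, semi)   # 40010 4040
```

The semi-join plan pays for two transfers, so it only wins when the join attribute is selective enough that the reduced relation is much smaller than the original.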
2.4 Chapter 4
- What problems does HBase solve with HDFS? What are its characteristics?
- Meaning and characteristics of regions in HBase databases. Different data of the same table can be stored on different servers, and the same data of the same table can also be stored on different servers. How to understand this statement?
A server is where Regions are stored, but storing a Region does not mean storing a whole table. Each Region contains several Stores, and each Store corresponds to one column family and is stored as its own object. A server therefore does not necessarily hold a complete table, and it may hold shards of different tables.
- What are the actual operations of CRUD in HBase?
- HBase read-write process
Advantages of HDFS: large file storage, multiple replicas, automatic partitioning.
- If only HDFS is used for data management, there are some problems:
- HDFS does not support random rewriting of data
- HDFS has no concept of data tables
- HDFS cannot perform common data queries such as row count statistics, filtering, and scanning
- Performing such operations quickly generally requires implementing them through MapReduce
HBase uses HDFS for storage at the bottom layer, but maintains its own file structure and metadata. Specifically, it has the following characteristics:
- Uses a column-oriented storage model with key-value pairs
- Can achieve convenient horizontal scaling
- Can achieve automatic data sharding
- Implements relatively strict read-write consistency and automatic failover
- Supports table scans and filtering (filters)
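The statement about Regions reviewed in this chapter can be made concrete with a toy model. The sketch below is not HBase code: the `Table` class, the split keys, the server names, and the round-robin placement are all assumptions chosen for illustration. It only shows how a row key maps to a Region by key range, and how one server can end up hosting Regions of different tables.

```python
import bisect


class Table:
    """Toy model of a table split into Regions by row-key range (hypothetical)."""

    def __init__(self, name, split_keys, servers):
        self.name = name
        # Region i covers [start_keys[i], start_keys[i + 1]); the first starts at "".
        self.start_keys = [""] + sorted(split_keys)
        # Assumed placement policy: round-robin over the given servers.
        self.region_server = [
            servers[i % len(servers)] for i in range(len(self.start_keys))
        ]

    def locate(self, row_key):
        """Return (region index, server) responsible for row_key."""
        idx = bisect.bisect_right(self.start_keys, row_key) - 1
        return idx, self.region_server[idx]


orders = Table("orders", split_keys=["g", "p"], servers=["rs1", "rs2"])
users = Table("users", split_keys=["m"], servers=["rs2", "rs3"])

# Different rows of the same table live on different servers.
print(orders.locate("apple"))   # (0, 'rs1')
print(orders.locate("kiwi"))    # (1, 'rs2')
# The same server (rs2) also hosts a Region of another table.
print(users.locate("alice"))    # (0, 'rs2')
```

Replication of the same data onto several servers is handled by the HDFS layer underneath and is deliberately left out of this model.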
2.5 Chapter 5
For each data structure: what kind of problem does it mainly solve (scenarios), and what are its implementation principles? For example, skip lists mainly support fast writes and range queries, with low update costs. B+ trees also support these operations, but their update costs are high, which makes them a poor fit for big data scenarios. LSM trees combine skip lists (in memory) with multi-way file merging and Bloom filters (on external storage).
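The skip list properties mentioned above (fast inserts, cheap updates, range queries) can be seen in a minimal sketch. The class and method names, the maximum level of 16, and the promotion probability of 0.5 are assumptions for illustration, not a reference implementation.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level   # forward[i] = next node on level i


class SkipList:
    MAX_LEVEL = 16
    P = 0.5   # probability of promoting a node one level up

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.head = Node(None, None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and self.rng.random() < self.P:
            lvl += 1
        return lvl

    def _predecessors(self, key):
        """For each level, the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):   # top level down to level 0
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def search(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node.value if node is not None and node.key == key else None

    def insert(self, key, value):
        update = self._predecessors(key)
        node = update[0].forward[0]
        if node is not None and node.key == key:
            node.value = value            # update in place, no restructuring
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, value, lvl)
        for i in range(lvl):              # splice the node into each level
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def scan(self, lo, hi):
        """Range query: walk level 0 from the first key >= lo up to hi."""
        node = self._predecessors(lo)[0].forward[0]
        while node is not None and node.key <= hi:
            yield node.key, node.value
            node = node.forward[0]


sl = SkipList(seed=1)
for k in [5, 1, 9, 3, 7]:
    sl.insert(k, f"v{k}")
print(sl.search(3), list(sl.scan(3, 7)))
```

An insert only rewires a few forward pointers, with no node splitting or rebalancing as in a B+ tree, which is one way to read the "low update cost" claim.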
(1) Skip Lists
- Types of problems solved (fast writes, low update costs, support for range queries)
- Search and insert process (implementation principles)
Skip lists are the in-memory structure of LSM trees.
(2) LSM Trees
- Types of problems solved (“sequential writes, random lookups”)
- What is compaction? What are the two types? Advantages and disadvantages.
- Why is LSM tree considered a write-friendly data structure?
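The write path and compaction asked about above can be sketched in a few lines. This is a toy model under assumptions: the memtable is a plain dict (real systems typically use a skip list), "files" are in-memory sorted lists, and `compact` performs one full merge of all runs, which is only one possible compaction policy. All names are hypothetical.

```python
import bisect
import heapq


class LSMTree:
    TOMBSTONE = object()   # marker written in place of a deleted value

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []     # newest first; each run is (sorted keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory; disk is written in bulk at flush time.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)

    def flush(self):
        """Write the memtable out as one immutable, sorted run."""
        if self.memtable:
            keys = sorted(self.memtable)
            self.runs.insert(0, (keys, [self.memtable[k] for k in keys]))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is self.TOMBSTONE else value
        for keys, values in self.runs:
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is self.TOMBSTONE else values[i]
        return None

    def compact(self):
        """Multi-way merge of all runs; the newest version of each key wins."""
        streams = [
            [(k, age, v) for k, v in zip(keys, values)]
            for age, (keys, values) in enumerate(self.runs)   # age 0 = newest
        ]
        out_keys, out_values, last = [], [], object()
        for k, _age, v in heapq.merge(*streams):
            if k == last:
                continue               # older version of a key already handled
            last = k
            if v is not self.TOMBSTONE:
                out_keys.append(k)
                out_values.append(v)
        self.runs = [(out_keys, out_values)] if out_keys else []


tree = LSMTree(memtable_limit=2)
tree.put("a", 1)
tree.put("b", 2)     # memtable full: flushed as the first run
tree.put("a", 10)
tree.delete("b")     # flushed as a second, newer run
tree.put("c", 3)
print(tree.get("a"), tree.get("b"), tree.get("c"))   # 10 None 3
```

Writes never modify existing runs, which is why they stay sequential; the cost moves to reads, which may have to probe several runs, and to compaction, which rewrites data in the background.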
(3) Bloom Filters
- Types of problems solved (quickly rule out objects that are definitely not in a set)
- Construction method and query process (implementation principles)
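The construction and query process of a Bloom filter can be sketched as follows. The bit-array size, the number of hash functions, and the use of salted SHA-256 as the hash family are assumptions for illustration; the formula in `expected_fp_rate` is the usual approximation `(1 - e^(-kn/m))^k`.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        """k bit positions for item, derived from k salted hashes."""
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Construction: set all k bits for the item.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # Query: any unset bit proves absence; all bits set means "maybe".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def expected_fp_rate(m_bits, k_hashes, n_items):
    """Approximate false-positive probability after n_items insertions."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


bf = BloomFilter()
for key in ["row-1", "row-2", "row-3"]:
    bf.add(key)
print(bf.might_contain("row-2"))   # True: added items are never reported absent
```

A negative answer is always correct, so a storage engine can skip reading a file whose filter rules the key out; a positive answer still has to be confirmed by an actual lookup.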
(4) Why is HBase considered a “sequential write, random lookup” distributed database?
2.6 Chapter 6
- Concept of nested transactions
- Content of consistency levels in distributed databases, with examples
- CAP theory and BASE theory of distributed databases (with examples)
- Distributed transaction commit protocol (execution process of the two-phase commit protocol; its problem: blocking; the solution: termination protocol)
- How HBase implements its ACID consistency characteristics (understanding)
- Distributed consistency algorithm Paxos (main process)
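The execution process of the two-phase commit protocol listed in this chapter can be sketched as a single-process simulation. The `Participant` class, its state names, and the `can_commit` flag are hypothetical; there is no real messaging, logging, or timeout handling here.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # assumed outcome of local validation
        self.state = "INIT"

    def prepare(self):
        """Phase 1: vote. A yes vote moves the participant to READY."""
        self.state = "READY" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2 (decision): the coordinator sends the global decision to everyone.
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision


ok = [Participant("site1"), Participant("site2")]
bad = [Participant("site1"), Participant("site2", can_commit=False)]
print(two_phase_commit(ok), two_phase_commit(bad))   # COMMIT ABORT
```

The blocking problem sits between the two phases: a participant in `READY` has given up the right to abort on its own, so if the coordinator fails before sending the decision, that participant must wait or run a termination protocol with its peers.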
2.7 Chapter 7
- Basic concepts of concurrency control (problems solved, serializable scheduling)
- Problems solved by distributed concurrency control (three application scenarios of distributed locks, solution ideas)
- Determination of serializability of distributed transactions (questions)
- Application scenarios and specific solutions of three types of distributed locks
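For the serializability questions, one common check is conflict serializability via a precedence graph: draw an edge from Ti to Tj whenever an operation of Ti conflicts with a later operation of Tj, merge the edges from all sites, and test for a cycle. The sketch below assumes schedules are given as lists of `(transaction, operation, item)` tuples with operations `"r"` and `"w"`; the function names are hypothetical.

```python
def precedence_edges(schedule):
    """Edges Ti -> Tj for conflicting operations where Ti's comes first."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges


def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}   # node -> 1 (on the current DFS path) or 2 (finished)

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (state.get(nxt) is None and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state.get(n) is None and visit(n) for n in graph)


def globally_serializable(site_schedules):
    """Union the precedence graphs of all sites and require it to be acyclic."""
    edges = set().union(*(precedence_edges(s) for s in site_schedules))
    return not has_cycle(edges)


site1 = [("T1", "w", "x"), ("T2", "r", "x")]        # T1 -> T2
site2_bad = [("T2", "w", "y"), ("T1", "r", "y")]    # T2 -> T1
site2_ok = [("T1", "w", "y"), ("T2", "r", "y")]     # T1 -> T2

print(globally_serializable([site1, site2_bad]))    # False
print(globally_serializable([site1, site2_ok]))     # True
```

Each local schedule in the first case is serializable on its own; the cycle only appears once the two sites are combined, which is why serializability has to be judged globally in a distributed setting.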


