Big Data Storage Review Class

1 Exam Question Types

  1. Essay (30 points) - Discuss your understanding of concepts
  2. Distributed Database Design and Query Optimization (20) - Design of distributed databases, design and definition of sharding, corresponding query optimization
  3. Distributed Access Optimization (30) - Physical characteristic indicators, calculation of transmission costs
  4. Storage Structure Design (10) - HBase design, Bloom filter design (PPT)
  5. Distributed Transactions (10) - Consistency, concurrency control

The three assignments correspond to items 2, 3, and 4 respectively.

2 Chapter Review

2.1 Chapter 1

  • Origin of Big Data (Why have big data storage systems emerged? The demands for horizontal scaling, system reliability and availability, and consistency cannot all be met effectively under traditional relational models)
  • Characteristics of Big Data
  • What kind of storage system does big data need?

2.2 Chapter 2

  • Client/Server architecture (Changes in AP functions across different architectures)
  • Relationship between the shared-nothing architecture, the database and table partitioning architecture, the compute-storage separation architecture, and the client/server architecture (Open question, combine PPT and your own understanding) Reference article
  • Schema structure of relational distributed database systems
  • Data transparency in distributed database systems (Three types, definition, examples; determine the type of transparency for given statements)
  • Difference and connection between multi-database systems and distributed database systems

2.3 Chapter 3

  • Principles of distributed database design, definition (operations), representation methods
  • Query optimization strategies for distributed databases and fragment query optimization methods
  • Access optimization methods for distributed queries, calculation of characteristic parameters (selection operation, projection operation, natural join operation, semi-join operation)
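The transmission-cost comparison between a plain join and a semi-join can be sketched as follows. This is a minimal sketch, assuming the linear cost model TC(x) = C0 + C1·x that textbooks commonly use; the function names, the constants `c0` and `c1`, and the sample numbers are all hypothetical and not taken from the course slides.

```python
def transfer_cost(num_bytes, c0=1.0, c1=0.001):
    """Linear transmission cost: fixed setup cost c0 plus c1 per byte (assumed model)."""
    return c0 + c1 * num_bytes


def full_join_cost(card_r, size_r):
    """Ship all of R to the site of S and join there."""
    return transfer_cost(card_r * size_r)


def semi_join_cost(val_s_a, size_a, card_r, size_r, selectivity):
    """Ship the distinct join values of S to R's site, then ship only the matching R tuples."""
    send_keys = transfer_cost(val_s_a * size_a)
    reduced_card = card_r * selectivity  # estimated cardinality of R semi-join S
    send_reduced = transfer_cost(reduced_card * size_r)
    return send_keys + send_reduced


# Hypothetical numbers: R has 10000 tuples of 100 bytes, S has 500 distinct
# join values of 4 bytes each, and 10% of R's tuples find a match in S.
full = full_join_cost(card_r=10000, size_r=100)
semi = semi_join_cost(val_s_a=500, size_a=4, card_r=10000, size_r=100, selectivity=0.1)
print(round(full, 3), round(semi, 3))
```

Under these assumed numbers the semi-join ships far fewer bytes; with a selectivity close to 1 the extra round trip would make it the more expensive plan.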

2.4 Chapter 4

  • What problems of HDFS does HBase solve? What are its features?
  • The meaning and characteristics of regions in HBase databases. How to understand that data of different rows of the same table can be stored on different servers, and data of the same row of the same table can also be stored on different servers?

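A server hosts Regions, but hosting a Region is not the same as hosting a table: a Region is a contiguous row-key range of one table, so different rows of the same table can live in Regions on different servers. Each Region contains several Stores, one per column family, so data is stored by column family rather than by table, and the files of each Store are persisted separately on HDFS; this is why the data of a single row can also end up on different machines. A server may therefore hold shards of several different tables.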
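How a row key maps to a Region, and a Region to a server, can be sketched with a sorted list of Region start keys. This is only an illustration under assumed names: `REGION_START_KEYS`, `REGION_SERVERS`, and `locate_region` are hypothetical and do not correspond to HBase's actual client API or metadata layout.

```python
import bisect

# Hypothetical region table for one table: region i covers row keys in
# [REGION_START_KEYS[i], REGION_START_KEYS[i + 1]) and is hosted by REGION_SERVERS[i].
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" means "from the first key"
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # one server may host several regions


def locate_region(row_key):
    """Return (region_index, server) for the region whose key range contains row_key."""
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


for key in ["apple", "grape", "zebra"]:
    print(key, locate_region(key))
```

Rows "apple" and "zebra" belong to the same table but resolve to different servers, which is the first half of the question above.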

  • What actually happens underneath HBase’s CRUD (Create, Read, Update, Delete) operations?
  • The read and write process of HBase

Advantages of HDFS: large file storage, multiple replicas, automatic partitioning.

If HDFS alone is used for data management, there are some issues:

  1. HDFS does not support random rewriting of data
  2. HDFS lacks the concept of data tables
  3. HDFS cannot perform common data queries such as row count statistics and filter scanning
  4. Efficient operations generally require MapReduce.

HBase uses HDFS for storage at the bottom layer, but maintains its own file structure and metadata. Specifically, it has the following characteristics:

  1. Uses a column-oriented and key-value storage model
  2. Enables convenient horizontal scaling
  3. Can implement automatic data sharding
  4. Ensures relatively strict consistency of reads and writes and automatic failover
  5. Supports table scans with server-side filtering (filters)

2.5 Chapter 5

What problems do various data structures primarily solve (scenarios)? What are their underlying principles? For example, the skip list supports fast writes, interval queries, and has a low update cost. Although B+ trees also support these, they have a high update cost and are not suitable for big data scenarios. LSM trees combine the skip list (in-memory) with multi-way file merging and Bloom filters (external storage).
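The skip list properties mentioned above (cheap inserts, in-place updates, interval queries) can be illustrated with a small in-memory implementation. This is a minimal sketch rather than the structure used by any particular system; `MAX_LEVEL`, the promotion probability `p`, and the `seed` parameter are assumptions chosen for illustration.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.level = 1
        self.head = Node(None, None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _find_predecessors(self, key):
        """For each level, find the last node whose key is strictly less than key."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):  # start high, drop down level by level
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def search(self, key):
        candidate = self._find_predecessors(key)[0].forward[0]
        if candidate is not None and candidate.key == key:
            return candidate.value
        return None

    def insert(self, key, value):
        update = self._find_predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            candidate.value = value  # existing key: update in place
            return
        level = self._random_level()
        self.level = max(self.level, level)
        node = Node(key, value, level)
        for i in range(level):  # splice the new node in on each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def range(self, lo, hi):
        """Yield (key, value) pairs with lo <= key <= hi in key order."""
        node = self._find_predecessors(lo)[0].forward[0]
        while node is not None and node.key <= hi:
            yield node.key, node.value
            node = node.forward[0]
```

An insert only rewires a handful of pointers near the new node, whereas a B+ tree insert may split pages; that difference is the "low update cost" the paragraph refers to.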

(1) Skip list

  • Types of problems solved (fast writing, low update cost, supports interval queries)
  • The process of searching and inserting (underlying principle)

The skip list is the in-memory structure of LSM trees.

(2) LSM tree

  • Types of problems solved (“sequential writes, random searches”)
  • What is compaction? What are the two types? Advantages and disadvantages.
  • Why is the LSM tree considered write-friendly?
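The write path, read path, and compaction asked about above can be sketched with a toy LSM tree. This is a deliberately simplified sketch under stated assumptions: `TinyLSM` is a hypothetical class, the memtable is a plain dict instead of a skip list, the sorted runs live in memory instead of on disk, and `compact` merges every run at once, which resembles a major compaction only in spirit.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a dict memtable plus immutable sorted runs, newest first."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.runs = []  # each run is a sorted list of (key, value); runs[0] is the newest

    def put(self, key, value):
        self.memtable[key] = value  # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted run (a sequential write on real storage)."""
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # the newest run that holds the key wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one, keeping only the newest value per key."""
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer values overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []
```

A `put` never rewrites an existing run, which is why writes stay sequential, while a `get` may have to probe several runs until compaction reduces their number.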

(3) Bloom filter

  • Types of problems solved (effectively excluding some objects)
  • Construction method and querying process (underlying principle)
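The construction and query process can be sketched as a bit array plus several hash functions. This is a minimal sketch: the class name, the default sizes, and the choice of deriving the hash functions by salting SHA-256 are assumptions for illustration, not the design from the course slides.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions by salting SHA-256 with the hash index."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A negative answer is always correct, so a store can skip files whose filter rejects the key; a positive answer may be a false positive and still requires reading the file.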

(4) Why is HBase described as a “sequential write, random read” distributed database?

2.6 Chapter 6

  1. Concept of Nested Transactions
  2. Content on the consistency levels of distributed databases, with examples
  3. CAP theory and BASE theory of distributed databases (with examples)
  4. Distributed Transaction Commit Protocols (execution process of the Two-Phase Commit Protocol, its blocking problem, and the termination protocol that addresses it)
  5. Methods for achieving ACID consistency features in HBase (for understanding)
  6. Distributed Consistency Algorithm Paxos (main process)
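The two phases of the Two-Phase Commit Protocol from item 4 can be sketched from the coordinator's point of view. This is a simplified, single-process sketch: `two_phase_commit` and the participant callables are hypothetical, and logging, timeouts, and the termination protocol are left out.

```python
def two_phase_commit(participants):
    """participants: dict mapping a name to a callable that returns True (vote commit)
    or False (vote abort)."""
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = {}
    for name, prepare in participants.items():
        try:
            votes[name] = bool(prepare())
        except Exception:
            votes[name] = False  # a failed or unreachable participant counts as "abort"

    # Phase 2 (decision): commit only if every vote was "commit".
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    return decision, votes
```

The blocking problem arises between the two phases: a participant that has voted to commit cannot decide on its own and must wait if the coordinator fails before announcing the decision.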

2.7 Chapter 7

  • Basic concepts of concurrency control (problems solved, serializable scheduling)
  • Problems solved by distributed concurrency control (application scenarios of three types of distributed locks, solution ideas)
  • Determining serializability of distributed transactions (exam questions)
  • Application scenarios and specific solutions for three types of distributed locks
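For the serializability exam questions, one common approach is to build a precedence graph per site and check whether the union of those graphs contains a cycle. The sketch below assumes conflict serializability and represents each site's schedule as a list of (transaction, operation, item) triples; the function names and the sample schedules are hypothetical.

```python
def precedence_edges(schedule):
    """schedule: list of (txn, op, item) with op in {"r", "w"}, in execution order."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))  # t1's conflicting operation comes first
    return edges


def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True  # back edge: a cycle exists
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)


def is_serializable(site_schedules):
    """Treat the distributed schedule as serializable if the union of the
    per-site precedence graphs is acyclic."""
    edges = set()
    for schedule in site_schedules:
        edges |= precedence_edges(schedule)
    return not has_cycle(edges)


site1 = [("T1", "w", "x"), ("T2", "w", "x")]  # T1 before T2 at site 1
site2 = [("T2", "w", "y"), ("T1", "w", "y")]  # T2 before T1 at site 2
print(is_serializable([site1]), is_serializable([site2]), is_serializable([site1, site2]))
```

Each site's schedule is serializable on its own, yet the two sites order T1 and T2 differently, so the combined graph has a cycle and the global schedule is not serializable.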