Tim included in Learning Notes Data Science & Machine Learning

2023-02-13 2024-07-25 About 800 words 4 minutes


Review Class

Exam Question Types

Essay (30 points) - Discuss your understanding of concepts
Distributed Database Design and Query Optimization (20) - Design of distributed databases, design and definition of sharding, corresponding query optimization
Distributed Access Optimization (30) - Physical characteristic indicators, calculation of transmission costs
Storage Structure Design (10) - HBase design, Bloom filter design (PPT)
Distributed Transactions (10) - Consistency, concurrency control

The three assignments correspond to items 2, 3, and 4 respectively.

Chapter Review

Chapter 1

Origin of Big Data (Why did big data storage systems emerge? Traditional relational models cannot effectively meet the demands for horizontal scaling, system reliability and availability, and consistency.)
Characteristics of Big Data
What kind of storage system does big data need?

Chapter 2

Client/Server architecture (Changes in AP functions across different architectures)
Relationship between the shared-nothing architecture, the database and table partitioning (sharding) architecture, the compute-storage separation architecture, and the client/server architecture (open question; combine the PPT with your own understanding). Reference article
Schema structure of relational distributed database systems
Data transparency in distributed database systems (Three types, definition, examples; determine the type of transparency for given statements)
Difference and connection between multi-database systems and distributed database systems

Chapter 3

Principles of distributed database design, definition (operations), representation methods
Query optimization strategies for distributed databases and fragment query optimization methods
Access optimization methods for distributed queries, calculation of characteristic parameters (selection operation, projection operation, natural join operation, semi-join operation)
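
The semi-join calculation mentioned above can be sketched as follows. This is a minimal illustration, not the course's official formula sheet: it assumes a linear transmission cost model T(x) = C0 + C1 * x and estimates the semi-join selectivity as val(A[S]) / val(dom(A)). The function and parameter names are hypothetical.

```python
def transfer_cost(volume, c0=0.0, c1=1.0):
    """Assumed linear cost model: fixed setup cost plus a per-unit data cost."""
    return c0 + c1 * volume


def semijoin_estimate(card_r, size_r, val_a_s, size_a, val_dom_a, c0=0.0, c1=1.0):
    """Estimate cost and benefit of reducing R with the semi-join R ⋉ S on A.

    card_r    : number of tuples in R
    size_r    : width of one R tuple (bytes)
    val_a_s   : number of distinct A values in S
    size_a    : width of attribute A (bytes)
    val_dom_a : number of distinct values in the domain of A
    """
    rho = val_a_s / val_dom_a                            # assumed selectivity estimate
    cost = transfer_cost(size_a * val_a_s, c0, c1)       # ship the projection of S on A to R's site
    benefit = (1 - rho) * card_r * size_r * c1           # R data that no longer needs shipping
    return {
        "selectivity": rho,
        "cost": cost,
        "benefit": benefit,
        "reduced_card_r": rho * card_r,
        "worthwhile": benefit > cost,
    }


plan = semijoin_estimate(card_r=10_000, size_r=100, val_a_s=100, size_a=4, val_dom_a=1_000)
print(plan)
```

The semi-join is only worth executing when the estimated benefit exceeds its cost, which is the comparison the `worthwhile` field makes.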

Chapter 4

What problems of HDFS does HBase solve? What are its features?
The meaning and characteristics of regions in HBase databases. How to understand that data of different rows of the same table can be stored on different servers, and data of the same row of the same table can also be stored on different servers?

A Region server is where Regions are stored, but storing a Region does not mean storing a whole table. A Region is a range of rows of one table, and a server may hold Regions (shards) belonging to different tables. Each Region contains several Stores, and each Store corresponds to one column family, so data is stored by column family rather than by table.
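
The Region and Store layout can be pictured with a toy model. This is a simplified sketch and not HBase's actual API: the `Region` and `Table` classes, the round-robin server assignment, and the split keys are all assumptions made for illustration.

```python
from bisect import bisect_right


class Region:
    """A contiguous row-key range of one table, hosted on one server."""

    def __init__(self, table, start_key, server, families):
        self.table = table
        self.start_key = start_key
        self.server = server
        # One Store per column family; each Store maps (row, qualifier) -> value.
        self.stores = {cf: {} for cf in families}


class Table:
    def __init__(self, name, families, split_keys, servers):
        self._starts = [""] + sorted(split_keys)
        # Hypothetical placement policy: assign Regions to servers round-robin.
        self.regions = [
            Region(name, start, servers[i % len(servers)], families)
            for i, start in enumerate(self._starts)
        ]

    def locate(self, row_key):
        """Find the Region whose key range contains row_key."""
        return self.regions[bisect_right(self._starts, row_key) - 1]

    def put(self, row_key, family, qualifier, value):
        region = self.locate(row_key)
        region.stores[family][(row_key, qualifier)] = value
        return region.server


users = Table("users", ["info", "log"], split_keys=["m"], servers=["rs1", "rs2"])
print(users.put("alice", "info", "name", "Alice"))  # row lands in the first Region
print(users.put("zoe", "info", "name", "Zoe"))      # row lands in the second Region
```

In this model, rows of the same table end up on different servers because they fall into different Regions, and within a Region the cells of one row are split across Stores by column family.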

What operations actually happen underneath HBase’s CRUD (Create, Read, Update, Delete)?
The read and write process of HBase

Advantages of HDFS: large file storage, multiple replicas, automatic partitioning

If only using HDFS for data management, there are some issues:
HDFS does not support random rewriting of data
HDFS lacks the concept of data tables
HDFS cannot perform common data queries like row count statistics and filter scanning
Implementing such operations efficiently generally requires MapReduce.

HBase uses HDFS for storage at the bottom layer, but maintains its own file structure and metadata. Specifically, it has the following characteristics:

Uses a column-oriented and key-value storage model
Enables convenient horizontal scaling
Can implement automatic data sharding
Ensures relatively strict consistency of reads and writes and automatic failover
Supports scanning and filtering of data (filters)

Chapter 5

What problems do various data structures primarily solve (scenarios)? What are their underlying principles? For example, the skip list supports fast writes, interval queries, and has a low update cost. Although B+ trees also support these, they have a high update cost and are not suitable for big data scenarios. LSM trees combine the skip list (in-memory) with multi-way file merging and Bloom filters (external storage).
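
To make the skip list claims concrete, here is a minimal sketch of search, insert, and interval query. It is an illustrative implementation, not the one from the course slides; the maximum level of 16 and the promotion probability of 0.5 are assumed defaults.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key, self.value = key, value
        self.forward = [None] * level  # forward[i] is the next node on level i


class SkipList:
    MAX_LEVEL = 16  # assumed cap on tower height

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.level = 1
        self.head = Node(None, None, self.MAX_LEVEL)

    def _random_level(self):
        lvl = 1
        while lvl < self.MAX_LEVEL and self.rng.random() < self.p:
            lvl += 1
        return lvl

    def _predecessors(self, key):
        """For each level, find the last node whose key is smaller than key."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def insert(self, key, value):
        update = self._predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value  # update in place: only one node is touched
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        node = Node(key, value, lvl)
        for i in range(lvl):  # splice the new node into each level it occupies
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def search(self, key):
        node = self._predecessors(key)[0].forward[0]
        return node.value if node is not None and node.key == key else None

    def range(self, lo, hi):
        """Interval query: walk the bottom level from the first key >= lo."""
        node = self._predecessors(lo)[0].forward[0]
        out = []
        while node is not None and node.key <= hi:
            out.append((node.key, node.value))
            node = node.forward[0]
        return out


sl = SkipList(seed=42)
for k in [5, 1, 9, 3, 7]:
    sl.insert(k, f"v{k}")
print(sl.search(3), sl.range(3, 7))
```

An insert only rewires a few pointers around the new node, which is why the update cost stays low compared with the page splits a B+ tree may need.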

(1) Skip list

Types of problems solved (fast writing, low update cost, supports interval queries)
The process of searching and inserting (underlying principle)

The skip list is the in-memory structure of LSM trees.

(2) LSM tree

Types of problems solved (“sequential writes, random searches”)
What is compaction? What are the two types? Advantages and disadvantages.
Why is the LSM tree considered write-friendly?
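
The write path and compaction can be sketched with a toy LSM tree. This is a simplified model under stated assumptions: a plain dict stands in for the in-memory skip list, in-memory sorted lists stand in for on-disk files, and `compact` merges all runs at once (similar in spirit to a major compaction). The class name `TinyLSM` and its parameters are hypothetical.

```python
from bisect import bisect_left

TOMBSTONE = object()  # marker written in place of a deleted value


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # stand-in for the in-memory skip list
        self.memtable_limit = memtable_limit
        self.runs = []              # immutable sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value  # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)    # a delete is just another write

    def flush(self):
        """Write the memtable out sequentially as one sorted run."""
        if not self.memtable:
            return
        keys = sorted(self.memtable)
        self.runs.append((keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for keys, values in reversed(self.runs):  # newest run wins
            i = bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is TOMBSTONE else values[i]
        return None

    def compact(self):
        """Merge all runs, keeping the newest value and dropping tombstones."""
        merged = {}
        for keys, values in self.runs:  # oldest first, so newer values overwrite
            merged.update(zip(keys, values))
        live = sorted(k for k, v in merged.items() if v is not TOMBSTONE)
        self.runs = [(live, [merged[k] for k in live])] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable full: flushed as run 1
db.put("a", 3)
db.delete("b")   # memtable full again: flushed as run 2
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.runs)
```

Updates and deletes never modify an existing run, so every disk write is sequential. The price is paid on reads, which may have to check several runs until compaction merges them.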

(3) Bloom filter

Types of problems solved (effectively excluding some objects)
Construction method and querying process (underlying principle)
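
The construction and query process can be sketched as below. This is a minimal illustration, assuming a bit array of `m_bits` bits and `k_hashes` hash functions derived from SHA-256 with different prefixes; real systems choose faster hash functions and size the filter from the expected item count.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        """Derive k bit positions for an item (assumed hashing scheme)."""
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Construction: set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Query: any unset bit proves the item was never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


bf = BloomFilter()
bf.add("row-1")
bf.add("row-2")
print(bf.might_contain("row-1"))    # an added item is never reported absent
print(bf.might_contain("row-999"))  # usually False; a false positive is possible
```

A negative answer is always correct, so a storage engine can skip reading a file entirely when its filter says a key is absent. A positive answer still requires checking the file.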

(4) Why is HBase described as a “sequential write, random read” distributed database?

Chapter 6

Concept of Nested Transactions
Content on the consistency levels of distributed databases, with examples
CAP theory and BASE theory of distributed databases (with examples)
Distributed Transaction Commit Protocols (Two-Phase Commit Protocol execution process, existing problems - blocking, and solution method - termination protocol)
Methods for achieving ACID consistency features in HBase (for understanding)
Distributed Consistency Algorithm Paxos (main process)
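
The Two-Phase Commit execution process and its blocking problem listed above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: there is no network, logging, or timeout handling, and the termination protocol is not implemented. The class and function names are hypothetical.

```python
class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name = name
        self.will_vote_yes = will_vote_yes
        self.state = "INITIAL"

    def prepare(self):
        """Phase 1: vote, and enter READY if the vote is yes."""
        self.state = "READY" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants, coordinator_crashes_after_voting=False):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]

    if coordinator_crashes_after_voting:
        # No decision is sent. Participants that voted yes stay in READY
        # and cannot safely commit or abort on their own: they are blocked.
        return "UNKNOWN"

    # Phase 2 (decision): commit only if every vote was yes.
    decision = "COMMIT" if all(votes) else "ABORT"
    for p in participants:
        if decision == "COMMIT":
            p.commit()
        else:
            p.abort()
    return decision


nodes = [Participant("A"), Participant("B")]
print(two_phase_commit(nodes), [p.state for p in nodes])

blocked = [Participant("A"), Participant("B")]
print(two_phase_commit(blocked, coordinator_crashes_after_voting=True),
      [p.state for p in blocked])
```

The second run shows the blocking case: both participants remain in READY, holding their locks, until something (such as a termination protocol) tells them the outcome.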

Chapter 7

Basic concepts of concurrency control (problems solved, serializable scheduling)
Problems solved by distributed concurrency control (application scenarios of three types of distributed locks, solution ideas)
Determining serializability of distributed transactions (exam questions)
Application scenarios and specific solutions for three types of distributed locks
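
For the serializability questions, one common approach is to build a precedence graph from conflicting operations and check it for cycles. The sketch below assumes conflict serializability and represents each site's schedule as a list of `(transaction, operation, item)` tuples; this encoding and the function names are assumptions for illustration, not the course's notation.

```python
def precedence_edges(schedule):
    """Edges Ti -> Tj for each pair of conflicting operations, Ti's first."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            # Two operations conflict if they come from different transactions,
            # touch the same item, and at least one of them is a write.
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges


def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)


def is_serializable(site_schedules):
    """Globally serializable if the union of all local graphs is acyclic."""
    edges = set()
    for schedule in site_schedules:
        edges |= precedence_edges(schedule)
    return not has_cycle(edges)


site1 = [("T1", "w", "x"), ("T2", "r", "x")]  # T1 before T2 at site 1
site2 = [("T2", "w", "y"), ("T1", "r", "y")]  # T2 before T1 at site 2
print(is_serializable([site1]), is_serializable([site2]))
print(is_serializable([site1, site2]))
```

Each site's schedule is serializable on its own, but the two sites order T1 and T2 differently, so the combined graph has a cycle and the distributed schedule is not serializable.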


Big Data Storage Review Class