Data Mining Course Review Notes


1. Understanding Data Mining

  1. Definition of Data Mining
  2. Supervised Learning and Unsupervised Learning
  3. The Process of Data Mining

2. Basic Data Mining Techniques

  1. Concept of Decision Trees and General Process of C 4.5 Algorithm
  2. Key Techniques for Decision Trees: Maximum Gain Ratio
  3. Decision Tree Rules: Decision Trees, Production Rules, Accuracy and Coverage
  4. Basic Idea of Apriori Algorithm
  5. Association Rules and Their Confidence and Support
  6. Basic Idea of K-means Algorithm
  7. Example of K-means Clustering Analysis

3. Knowledge Discovery in Databases

  1. Definition of KDD (To understand)
  2. Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing

5. Evaluation Techniques

  1. Evaluating Categorical Output Models: Confusion Matrix, Classification Accuracy, Precision and Recall
  2. Evaluating Numeric Output Models: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error

6. Neural Networks

  1. Artificial Neuron Model (To draw and calculate)
  2. BP Neural Network Structure (To draw, e.g., one input, one output, two hidden layers containing three neurons each), Forward Computation Process
  3. General Procedure of the BP Algorithm
  4. Basic Operations of Convolutional Neural Networks - Convolution and Pooling, Convolutional Features (Local Receptive Fields, Weight Sharing)

7. Statistical Techniques

  1. Simple Linear Regression, Multiple Linear Regression - Least Squares Method
  2. Bayesian Analysis: Bayesian Classifier
  3. General Steps and Examples of Agglomerative Clustering Algorithm
  4. Cobweb Hierarchical Clustering Algorithm: Calculation of CU Value

Chapter 1 Understanding Data Mining

Definition of Data Mining

Technical perspective: The process of automatically analyzing and extracting information from data using computer technology; the goal is to uncover valuable information hidden in the data; typically employs methods from machine learning, statistics, pattern recognition, and more.

Disciplinary perspective: An interdisciplinary field, encompassing artificial intelligence, statistics, visualization, parallel computing, etc.

Business perspective: Reveals hidden, unknown patterns or validates known patterns;

Supervised Learning and Unsupervised Learning

The basic methods in machine learning include: concept learning, inductive learning, supervised learning, and unsupervised learning.

Supervised Learning: By training on a large number of instances with known classifications or outputs, the structure of the classification model is adjusted to achieve the purpose of establishing an accurate classification or prediction model; this inductive process of concept learning is known as supervised learning.

Unsupervised Learning: Before the learning and training, the instances are not predefined into categories. Instances are clustered into a cluster based on some similarity measure, calculating the degree of similarity, and then interpret and understand the meaning of each cluster, discovering the significance of clusters. (K-Means, Agglomerative Clustering, Conceptual Clustering Cobweb, EM Algorithm)

The Process of Data Mining

Four steps:

  1. Prepare the data, including training data and validation data;
  2. Choose a data mining technique or algorithm, and submit the data to the mining software;
  3. Interpret and evaluate the model;
  4. Application of the model; Data preparation can collect from databases, data warehouses, flat files;

When choosing techniques and algorithms, it is necessary to consider:

  • Whether the learning is supervised or unsupervised;
  • How to divide the training, testing, and validation data in the dataset;
  • How to set the parameters of data mining algorithms;

The Role of Data Mining (Categories Classification, Non-technical)

  • Supervised Learning
    • Classification
    • Estimation
    • Prediction
  • Unsupervised Clustering
  • Analysis of Association Relationships

Chapter 2 Basic Data Mining Techniques

Decision Trees

  1. Concept of decision trees and the general process of the C 4.5 algorithm
  2. Key techniques of decision trees: Maximum gain ratio
  3. Rules of decision trees: Decision trees, production rules, accuracy, and coverage


A decision tree is a model structured like a tree, where each node represents an attribute of an object, branches represent a possible value of the attribute, and the value of the leaf node is the output result of the decision tree.

C4.5 Algorithm

  1. Given a dataset T in the “attribute-value” format. The dataset consists of instances with multiple input attributes and one output attribute.
  2. Choose an input attribute that best distinguishes the instances in T. C4.5 uses the gain ratio to select this attribute.
  3. Create a tree node for that attribute and create branches for all possible values of that node.
  4. Use these branches to classify the instances in the dataset into subdivided subclasses.
  5. Set the instance set of the current subclass as T, and repeat steps (2) and (3) for the remaining attributes in the dataset until one of the following two conditions is met, at which point the process stops. Create a leaf node, which is the category expressed along this branch, its value being the value of the output attribute.
  • The instances in this subclass meet predefined criteria, such as all being in one output class or reaching a certain ratio in an output class;
  • No remaining attributes.

Key Techniques

Method of Choosing the Most Distinguishing Attribute

Preferably choose the attribute with the highest gain ratio to maximize the degree of generalization of the data (minimal hierarchy and number of nodes).

Information Entropy

Entropy is a measure of uncertainty; the greater it is, the larger the amount of information, and the more information transmitted;

$$H(x)=-\sum_{i=1}^n {p(x_{i})log_{2}(p(x_{i}))}$$

Information Gain

The greater the gain, the faster the entropy decreases, which is more conducive to classification.

$$Info(I)=-\sum\frac{Number of instances in class i}{Total number of instances}log_{2}\frac{Number of instances in class i}{Total number of instances}$$ $Info(I)$ is the total amount of information expressed by all instances in the current dataset; $$Info(I,A)=\sum\limits_{j=1}^{k}\frac{Number of instances in class j}{Total number of instances}Info(class j)$$ $Info(I,A)$ is the amount of information after classifying I according to the k values of attribute A; $$SplitsInfo(A)=-\sum\limits_{i}^{k}\frac{Number of instances in class j}{Total number of instances}log_{2}\frac{Number of instances in class j}{Total number of instances}$$ $SplitsInfo(A)$ is the normalization of the gain value for attribute A; $$Gain(A)=Info(A)-Info(I,A)$$ $$GainRatio(A)= \frac{Gain(A)}{SplitsInfo(A)}$$

Pruning Methods

Testing Methods

  • Percentage test
  • Cross-validation k-fold Hold-out Leave-one-out

Decision Tree Rules

Translate the decision tree into rules, calculate the accuracy and coverage;

IF Courses ≤ 5 and Weather = Rain THEN Play = No
Accuracy: 3/5 = 60% Coverage: 3/8 = 37.5%
Accuracy: satisfied sentence/satisfied first half sentence Coverage: satisfied sentence/satisfied second half sentence

Advantages and Disadvantages (Overview)


  1. Easy to understand and interpret, and can be mapped into a set of more appealing production rules.
  2. Does not require any assumptions about the nature of the data prior.
  3. Can use datasets that include both numerical data and categorical data.


  1. The output attribute must be categorical, and there should only be one output attribute.
  2. Decision tree algorithms are unstable.
  3. Trees created with numerical datasets are more complex (as seen in the uncontoured decision tree in example 2.3) because attribute splits for numerical data are usually binary splits.

Association Rules

Basic Idea of the Apriori Algorithm

  • Generate item sets. Item sets are combinations of “attribute-value” that meet a certain support requirement. Combinations that do not meet the support requirement are discarded, hence the rule generation process can be completed within a reasonable time frame.
  • Use the generated item sets to create a group of association rules based on confidence.

Association Rules and their Confidence and Support

Support: satisfied entire sentence/satisfied second half sentence

Confidence: satisfied entire sentence/satisfied first half sentence

In practical use, single, double, and triple item sets are generated in sequence to create association rules based on the item sets generated.

From the first item in the double item sets, two rules are generated ——
IF Book = 1 THEN Earphone = 1 (Confidence: 4/5 = 80%, retain)
IF Earphone = 1 THEN Book = 1 (Confidence: 4/7 = 57.1%, delete)
From the first item in the triple item sets, three rules are generated ——
IF Book = 1 & Earphone = 1 THEN DVD = 1 (Confidence: 4/4 = 100%, retain)
IF Book = 1 & DVD = 1 THEN Earphone = 1 (Confidence: 4/4 = 100%, retain)
IF Earphone = 1 & DVD = 1 THEN Book = 1 (Confidence: 4/6 = 66.7%, delete)

Clustering Techniques

Basic Idea of the K-means Algorithm

  1. Randomly select a K value to determine the total number of clusters.
  2. Arbitrarily choose K instances in the dataset as the initial cluster centers.
  3. Calculate the simple Euclidean distance between these K cluster centers and the other remaining instances. Use this distance as a measure of similarity between instances, assigning instances with high similarity to a particular cluster, becoming its member.
  4. Use the instances within each cluster to calculate new cluster centers.
  5. If the newly calculated cluster centers are equal to the cluster centers from the previous iteration, terminate the algorithm. Otherwise, repeat steps 3 to 5 using the new cluster centers. image.png

K-means Clustering Analysis Example

Advantages and Limitations

Advantages A very popular algorithm, easy to understand, and simple to implement. Limitations

  1. Can only handle numerical data. If the dataset contains categorical attributes, either remove these attributes or convert them to equivalent numerical data.
  2. Before starting the algorithm, it is necessary to randomly choose a K value as the initial number of clusters (with arbitrariness, an incorrect choice will affect the clustering effect). Usually, different K values are selected for repeated experiments, hoping to find the optimal K value.
  3. The K-means algorithm works best when the clusters are approximately equal in size.
  4. Attributes that contribute little to clustering may affect the clustering effect (outliers). It is necessary to select attributes before clustering.
  5. Interpreting clusters can be difficult. Guided mining tools can be used to further interpret the nature of clusters formed by unsupervised clustering algorithms.

Chapter 3 Knowledge Discovery in Databases

Definition of KDD (Understanding)

Knowledge Discovery in Data, the process of extracting credible, novel, potentially valuable, understandable, and non-trivial processing from datasets (most steps are performed automatically by the system).

KDD Process Model

Classic Model

CRISP-DM Business Model

Online Analytical Mining (OLAM)

Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing

Binning method: (noise processing -> data smoothing)

Sort the ==data==, such as 3, 6, 12, 22, 24, 26, 27, 30, 30

Divide into three bins “of equal height,” each bin containing an equal number of data points;

“Of equal width,” each bin spans a fixed range;

Smooth using methods like mean, extremum, retaining extremum for boundaries while others converge.

Histogram Reduction

Dividing into equal-width or equal-depth bins, counting, and drawing histograms.

Data Standardization

  1. Min-Max Scaling: Scales the data within the range of 0 to 1 by applying a linear transformation to the original data. Specifically, Min-Max Scaling transforms the original data v into a new value range v’ using the formula $v_{i}^{’}=\frac{v_{i}-minA}{maxA-minA}(b-a)+a$, where b-a is the interval length, and a is the starting point of the interval. This allows mapping the standardized result within a certain interval. If not specified, the interval parameters can be ignored.

  2. Z-score Normalization: Scales data to a distribution with a mean of 0 and a standard deviation of 1, making data distributed normally with 0 as the mean and 1 as the standard deviation. Specifically, Z-score normalization transforms the original data x into a new value range y using the formula y = (x - μ) / σ, where μ and σ are the mean and standard deviation of the original data respectively.

  3. Decimal Scaling: Scales data within a range of [-1,1] or [-0.5,0.5] by moving the decimal point’s position. Specifically, Decimal Scaling multiplies the original data x by 10 to the power of k to obtain a new value range y, using the formula y = x / 10^k, where k is an integer satisfying 10^(k-1) ≤ |x| < 10^k. The number of positions the decimal moves is determined by the maximum value.

  4. Logarithmic Standardization is a data preprocessing method that converts data into logs, commonly used to smooth non-negative data for easier comparison and analysis. Specifically, logarithmic standardization transforms the original data x into a new value range y using the formula y = log(x).

  5. Normalization: Scales data to a unit length or unit range. Common normalization methods include L1 norm normalization and L2 norm normalization. Specifically, L1 norm normalization scales the original data x into a new value range y using the formula y = x / (|x1| + |x2| + … + |xn|); L2 norm normalization scales the original data x into a new value range y using the formula y = x / sqrt(x1^2 + x2^2 + … + xn^2).

Chapter 5 Evaluation Techniques

Evaluating Categorical Output Models

True Positive True Negative
Predicted Positive TP FP
Predicted Negative FN TN

$$Accuracy=\frac{TP+TN}{ALL}$$ $$Error Rate=\frac{FN+FP}{ALL}$$ $$Precision=\frac{TP}{TP+FP}$$ $$Recall=\frac{TP}{TP+FN}$$

Evaluating Numerical Output Models

$$MAE=\frac{|a_{1}-c_{1}|+|a_{2}-c_{2}|+…+|a_{n}-c_{n}|}{n}$$ $$MSE=\frac{(a_{1}-c_{1})^2+(a_{2}-c_{2})^2+…+(a_{n}-c_{n})^2}{n}$$ $$RMS=\sqrt{MSE}=\sqrt{\frac{(a_{1}-c_{1})^2+(a_{2}-c_{2})^2+…+(a_{n}-c_{n})^2}{n}}$$

Chapter 6 Neural Networks

Artificial Neuron Model


R inputs $p_i∈R$, that is, an R-dimensional input vector p n: net input, $n=wp+b$. R weights $wi∈R$, that is, an R-dimensional weight vector w Threshold b Output a=f(n), f: transfer function Common activation functions $$a=f(n)=hardlim(n)=\begin{cases} {1},n>=0 \ {0} ,n<0\ \end{cases}$$ $$a=f(n)=hardlim \quad s(n)=\begin{cases} {1},n>=0 \ {-1} ,n<0\ \end{cases}$$ $$a=f(n)=n$$ $$a=f(n)=\frac{1}{1+e^{-n}}$$

The most important feature of the Sigmoid function is that it is infinitely differentiable.

BP Neural Networks

BP Neural Network Structure, Forward Computation Process


Note that each neuron is fully connected with the next layer, but not with the same layer.

General Process of the BP Algorithm

The learning process of the BP algorithm is as follows:

  1. Select a set of training examples, each consisting of input information and the desired output result.
  2. Take one example from the training set and input the information into the network.
  3. Calculate the output of nodes in each layer after processing by neurons.
  4. Calculate the error between the network’s actual output and the expected output.
  5. Calculate backwards from the output layer to the first hidden layer, and adjust the connection weights of neurons in the network according to some principle that can reduce the error.
  6. Repeat steps (2) to (5) for each example in the training set until the error for the entire training set meets the requirements.




The first feature is called local receptive fields. In an image, pixels that are closer to each other have tighter connections, while pixels that are farther apart have weaker relevancies. Therefore, each neuron doesn’t need to perceive the entire image; it only needs to sense locally and then synthesize local information at higher levels to obtain global information.

The second feature is weight sharing.

What does weight sharing mean? We can think of the convolution operation as a way to extract features, which is independent of position. This implies that the statistical properties of one part of the image are the same as those of other parts. This also means that the features learned in one part can be used in other parts; thus, for all positions in this image, we can use the same learned features.

What are the benefits of weight sharing? On one hand, repeating units can recognize features regardless of their position in the visual field. On the other hand, weight sharing allows more effective feature extraction because it significantly reduces the number of free variables that need to be learned. By controlling the size of the model, convolutional networks can have good generalization capabilities for visual problems.

The best way to explain the convolutional layer is to imagine a flashlight shining in the upper left corner of an image, assuming the area illuminated by the flashlight is a 5 x 5 range. Now imagine this flashlight sliding across different areas of the input image. In machine learning terminology, this flashlight is called a filter (sometimes also referred to as a neuron or convolutional kernel), and the area it illuminates is called the receptive field. This filter is also a series of data (these data are referred to as weights or parameters).



Max pooling selects the maximum value in the pooling window as the sampling value

Average pooling takes the average of all values in the pooling window as the sampling value. Pooling can reduce the amount of parameters, decrease the risk of overfitting.

Chapter 7 Statistical Techniques

Regression Analysis

Regression analysis is a statistical analysis method used to determine the quantitative relationship between two or more variables and to establish a mathematical equation to generalize a set of numerical data.

Least Squares Method

Construct an error function as the objective function (generally quadratic), take partial derivatives with respect to the independent variables.

Bayesian Analysis

A parameter estimation method that combines prior information about an unknown parameter with sample information (tabulated), calculates conditional probabilities according to Bayes’ theorem, obtains posterior information, and thus infers the unknown parameter;

$$P(H|E)=\frac{P(E|H)P(H)}{P(E)}$$ Problems:

  1. Probability equals 0 – add a small constant.
  2. Missing data – simply ignore, treat probability as 1.

Clustering Techniques

There are three main types of clustering: Partitioning clustering (K-Means), Hierarchical clustering (Agglomerative, Divisive), and Model-based clustering (EM algorithm).

Agglomerative Clustering Algorithm

An unsupervised clustering algorithm.

Initially, each instance is in a different category, each forming a cluster;

Find the two most similar clusters, merge them until the requirements are met or merged into one category.

Cobweb Hierarchical Clustering An incremental hierarchical clustering algorithm. It classifies instances using a classification tree, and the construction of the classification tree is a process of conceptual hierarchy. The root of the tree represents the highest level of concepts and contains summary information for all domain instances; except for leaf nodes, other nodes are the base level of the tree, reflecting the hierarchical division of concepts. An evaluation function is used to measure the quality of the conceptual hierarchy. The objects to be classified must be of a categorical type.


  • Establish a class, using the first instance as its only member;
  • For the remaining instances, at each tree level (concept hierarchy), use the evaluation function to decide one of the following actions:
    • Place the new instance into an existing cluster;
    • Create a new conceptual cluster with only this new instance;

Calculation of CU value:

$$CU=\frac{\sum\limits_{k=1}^{m}P(C_k)(\sum\limits_{i}\sum\limits_{j}P(A_{i}=V_{ij}|C_{k})^{2}-\sum\limits_{i}\sum\limits_{j}({A_{i}=V_{ij})^2})}{m}$$ $P(A_{i}=V_{ij}|C_k)$ represents the probability that attribute $A_i$ takes the value $V_{ij}$ among all members of class $C_k$ $P(A_{i}=V_{ij})$ represents the probability that attribute $A_i$ takes the value $V_{ij}$ in the entire dataset $P(C_k)$ represents the probability of each class $C_k$

Advantages: It can automatically adjust the number of classes (clusters) and does not suffer from poor clustering results due to unreasonable random selection of the number of classifications. Limitations:

  • Sensitive to the order of instances, requiring the introduction of two additional operations—merging and splitting—to reduce sensitivity.
  • Assumes that the probability distribution of each attribute is independent of each other, which is not always the case in practice;
  • The complexity of representing, updating, and storing the probability distribution of classes (clusters) depends on the number of values each attribute can take. When an attribute can take many values, the algorithm’s time and space complexity can become significant.
  • Skewed instance data can cause an imbalance in the height of the conceptual hierarchy tree, also resulting in drastic changes in time and space complexity.
Buy me a coffee~
Tim AlipayAlipay
Tim PayPalPayPal
Tim WeChat PayWeChat Pay