Data Mining Course Review Notes
Outline
1. Understanding Data Mining
 Definition of Data Mining
 Supervised Learning and Unsupervised Learning
 The Process of Data Mining
2. Basic Data Mining Techniques
 Concept of Decision Trees and General Process of C 4.5 Algorithm
 Key Techniques for Decision Trees: Maximum Gain Ratio
 Decision Tree Rules: Decision Trees, Production Rules, Accuracy and Coverage
 Basic Idea of Apriori Algorithm
 Association Rules and Their Confidence and Support
 Basic Idea of Kmeans Algorithm
 Example of Kmeans Clustering Analysis
3. Knowledge Discovery in Databases
 Definition of KDD (To understand)
 Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing
5. Evaluation Techniques
 Evaluating Categorical Output Models: Confusion Matrix, Classification Accuracy, Precision and Recall
 Evaluating Numeric Output Models: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
6. Neural Networks
 Artificial Neuron Model (To draw and calculate)
 BP Neural Network Structure (To draw, e.g., one input, one output, two hidden layers containing three neurons each), Forward Computation Process
 General Procedure of the BP Algorithm
 Basic Operations of Convolutional Neural Networks  Convolution and Pooling, Convolutional Features (Local Receptive Fields, Weight Sharing)
7. Statistical Techniques
 Simple Linear Regression, Multiple Linear Regression  Least Squares Method
 Bayesian Analysis: Bayesian Classifier
 General Steps and Examples of Agglomerative Clustering Algorithm
 Cobweb Hierarchical Clustering Algorithm: Calculation of CU Value
Chapter 1 Understanding Data Mining
Definition of Data Mining
Technical perspective: The process of automatically analyzing and extracting information from data using computer technology; the goal is to uncover valuable information hidden in the data; typically employs methods from machine learning, statistics, pattern recognition, and more.
Disciplinary perspective: An interdisciplinary field, encompassing artificial intelligence, statistics, visualization, parallel computing, etc.
Business perspective: Reveals hidden, unknown patterns or validates known patterns;
Supervised Learning and Unsupervised Learning
The basic methods in machine learning include: concept learning, inductive learning, supervised learning, and unsupervised learning.
Supervised Learning: By training on a large number of instances with known classifications or outputs, the structure of the classification model is adjusted to achieve the purpose of establishing an accurate classification or prediction model; this inductive process of concept learning is known as supervised learning.
Unsupervised Learning: Before the learning and training, the instances are not predefined into categories. Instances are clustered into a cluster based on some similarity measure, calculating the degree of similarity, and then interpret and understand the meaning of each cluster, discovering the significance of clusters. (KMeans, Agglomerative Clustering, Conceptual Clustering Cobweb, EM Algorithm)
The Process of Data Mining
Four steps:
 Prepare the data, including training data and validation data;
 Choose a data mining technique or algorithm, and submit the data to the mining software;
 Interpret and evaluate the model;
 Application of the model; Data preparation can collect from databases, data warehouses, flat files;
When choosing techniques and algorithms, it is necessary to consider:
 Whether the learning is supervised or unsupervised;
 How to divide the training, testing, and validation data in the dataset;
 How to set the parameters of data mining algorithms;
The Role of Data Mining (Categories Classification, Nontechnical)
 Supervised Learning
 Classification
 Estimation
 Prediction
 Unsupervised Clustering
 Analysis of Association Relationships
Chapter 2 Basic Data Mining Techniques
Decision Trees
 Concept of decision trees and the general process of the C 4.5 algorithm
 Key techniques of decision trees: Maximum gain ratio
 Rules of decision trees: Decision trees, production rules, accuracy, and coverage
Concept
A decision tree is a model structured like a tree, where each node represents an attribute of an object, branches represent a possible value of the attribute, and the value of the leaf node is the output result of the decision tree.
C4.5 Algorithm
 Given a dataset T in the “attributevalue” format. The dataset consists of instances with multiple input attributes and one output attribute.
 Choose an input attribute that best distinguishes the instances in T. C4.5 uses the gain ratio to select this attribute.
 Create a tree node for that attribute and create branches for all possible values of that node.
 Use these branches to classify the instances in the dataset into subdivided subclasses.
 Set the instance set of the current subclass as T, and repeat steps (2) and (3) for the remaining attributes in the dataset until one of the following two conditions is met, at which point the process stops. Create a leaf node, which is the category expressed along this branch, its value being the value of the output attribute.
 The instances in this subclass meet predefined criteria, such as all being in one output class or reaching a certain ratio in an output class;
 No remaining attributes.
Key Techniques
Method of Choosing the Most Distinguishing Attribute
Preferably choose the attribute with the highest gain ratio to maximize the degree of generalization of the data (minimal hierarchy and number of nodes).
Information Entropy
Entropy is a measure of uncertainty; the greater it is, the larger the amount of information, and the more information transmitted;
$$H(x)=\sum_{i=1}^n {p(x_{i})log_{2}(p(x_{i}))}$$
Information Gain
The greater the gain, the faster the entropy decreases, which is more conducive to classification.
$$Info(I)=\sum\frac{Number of instances in class i}{Total number of instances}log_{2}\frac{Number of instances in class i}{Total number of instances}$$ $Info(I)$ is the total amount of information expressed by all instances in the current dataset; $$Info(I,A)=\sum\limits_{j=1}^{k}\frac{Number of instances in class j}{Total number of instances}Info(class j)$$ $Info(I,A)$ is the amount of information after classifying I according to the k values of attribute A; $$SplitsInfo(A)=\sum\limits_{i}^{k}\frac{Number of instances in class j}{Total number of instances}log_{2}\frac{Number of instances in class j}{Total number of instances}$$ $SplitsInfo(A)$ is the normalization of the gain value for attribute A; $$Gain(A)=Info(A)Info(I,A)$$ $$GainRatio(A)= \frac{Gain(A)}{SplitsInfo(A)}$$
Pruning Methods
Testing Methods
 Percentage test
 Crossvalidation kfold Holdout Leaveoneout
Decision Tree Rules
Translate the decision tree into rules, calculate the accuracy and coverage;


Advantages and Disadvantages (Overview)
Advantages
 Easy to understand and interpret, and can be mapped into a set of more appealing production rules.
 Does not require any assumptions about the nature of the data prior.
 Can use datasets that include both numerical data and categorical data.
Limitations
 The output attribute must be categorical, and there should only be one output attribute.
 Decision tree algorithms are unstable.
 Trees created with numerical datasets are more complex (as seen in the uncontoured decision tree in example 2.3) because attribute splits for numerical data are usually binary splits.
Association Rules
Basic Idea of the Apriori Algorithm
 Generate item sets. Item sets are combinations of “attributevalue” that meet a certain support requirement. Combinations that do not meet the support requirement are discarded, hence the rule generation process can be completed within a reasonable time frame.
 Use the generated item sets to create a group of association rules based on confidence.
Association Rules and their Confidence and Support
Support: satisfied entire sentence/satisfied second half sentence
Confidence: satisfied entire sentence/satisfied first half sentence
In practical use, single, double, and triple item sets are generated in sequence to create association rules based on the item sets generated.


Clustering Techniques
Basic Idea of the Kmeans Algorithm
 Randomly select a K value to determine the total number of clusters.
 Arbitrarily choose K instances in the dataset as the initial cluster centers.
 Calculate the simple Euclidean distance between these K cluster centers and the other remaining instances. Use this distance as a measure of similarity between instances, assigning instances with high similarity to a particular cluster, becoming its member.
 Use the instances within each cluster to calculate new cluster centers.
 If the newly calculated cluster centers are equal to the cluster centers from the previous iteration, terminate the algorithm. Otherwise, repeat steps 3 to 5 using the new cluster centers.
Kmeans Clustering Analysis Example
Advantages and Limitations
Advantages A very popular algorithm, easy to understand, and simple to implement. Limitations
 Can only handle numerical data. If the dataset contains categorical attributes, either remove these attributes or convert them to equivalent numerical data.
 Before starting the algorithm, it is necessary to randomly choose a K value as the initial number of clusters (with arbitrariness, an incorrect choice will affect the clustering effect). Usually, different K values are selected for repeated experiments, hoping to find the optimal K value.
 The Kmeans algorithm works best when the clusters are approximately equal in size.
 Attributes that contribute little to clustering may affect the clustering effect (outliers). It is necessary to select attributes before clustering.
 Interpreting clusters can be difficult. Guided mining tools can be used to further interpret the nature of clusters formed by unsupervised clustering algorithms.
Chapter 3 Knowledge Discovery in Databases
Definition of KDD (Understanding)
Knowledge Discovery in Data, the process of extracting credible, novel, potentially valuable, understandable, and nontrivial processing from datasets (most steps are performed automatically by the system).
KDD Process Model
Classic Model
CRISPDM Business Model
Online Analytical Mining (OLAM)
Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing
Binning method: (noise processing > data smoothing)
Sort the ==data==, such as 3, 6, 12, 22, 24, 26, 27, 30, 30
Divide into three bins “of equal height,” each bin containing an equal number of data points;
“Of equal width,” each bin spans a fixed range;
Smooth using methods like mean, extremum, retaining extremum for boundaries while others converge.
Histogram Reduction
Dividing into equalwidth or equaldepth bins, counting, and drawing histograms.
Data Standardization

MinMax Scaling: Scales the data within the range of 0 to 1 by applying a linear transformation to the original data. Specifically, MinMax Scaling transforms the original data v into a new value range v’ using the formula $v_{i}^{’}=\frac{v_{i}minA}{maxAminA}(ba)+a$, where ba is the interval length, and a is the starting point of the interval. This allows mapping the standardized result within a certain interval. If not specified, the interval parameters can be ignored.

Zscore Normalization: Scales data to a distribution with a mean of 0 and a standard deviation of 1, making data distributed normally with 0 as the mean and 1 as the standard deviation. Specifically, Zscore normalization transforms the original data x into a new value range y using the formula y = (x  μ) / σ, where μ and σ are the mean and standard deviation of the original data respectively.

Decimal Scaling: Scales data within a range of [1,1] or [0.5,0.5] by moving the decimal point’s position. Specifically, Decimal Scaling multiplies the original data x by 10 to the power of k to obtain a new value range y, using the formula y = x / 10^k, where k is an integer satisfying 10^(k1) ≤ x < 10^k. The number of positions the decimal moves is determined by the maximum value.

Logarithmic Standardization is a data preprocessing method that converts data into logs, commonly used to smooth nonnegative data for easier comparison and analysis. Specifically, logarithmic standardization transforms the original data x into a new value range y using the formula y = log(x).

Normalization: Scales data to a unit length or unit range. Common normalization methods include L1 norm normalization and L2 norm normalization. Specifically, L1 norm normalization scales the original data x into a new value range y using the formula y = x / (x1 + x2 + … + xn); L2 norm normalization scales the original data x into a new value range y using the formula y = x / sqrt(x1^2 + x2^2 + … + xn^2).
Chapter 5 Evaluation Techniques
Evaluating Categorical Output Models
True Positive  True Negative  

Predicted Positive  TP  FP 
Predicted Negative  FN  TN 
$$Accuracy=\frac{TP+TN}{ALL}$$ $$Error Rate=\frac{FN+FP}{ALL}$$ $$Precision=\frac{TP}{TP+FP}$$ $$Recall=\frac{TP}{TP+FN}$$
Evaluating Numerical Output Models
$$MAE=\frac{a_{1}c_{1}+a_{2}c_{2}+…+a_{n}c_{n}}{n}$$ $$MSE=\frac{(a_{1}c_{1})^2+(a_{2}c_{2})^2+…+(a_{n}c_{n})^2}{n}$$ $$RMS=\sqrt{MSE}=\sqrt{\frac{(a_{1}c_{1})^2+(a_{2}c_{2})^2+…+(a_{n}c_{n})^2}{n}}$$
Chapter 6 Neural Networks
Artificial Neuron Model
R inputs $p_i∈R$, that is, an Rdimensional input vector p n: net input, $n=wp+b$. R weights $wi∈R$, that is, an Rdimensional weight vector w Threshold b Output a=f(n), f: transfer function Common activation functions $$a=f(n)=hardlim(n)=\begin{cases} {1},n>=0 \ {0} ,n<0\ \end{cases}$$ $$a=f(n)=hardlim \quad s(n)=\begin{cases} {1},n>=0 \ {1} ,n<0\ \end{cases}$$ $$a=f(n)=n$$ $$a=f(n)=\frac{1}{1+e^{n}}$$
The most important feature of the Sigmoid function is that it is infinitely differentiable.
BP Neural Networks
BP Neural Network Structure, Forward Computation Process
Note that each neuron is fully connected with the next layer, but not with the same layer.
General Process of the BP Algorithm
The learning process of the BP algorithm is as follows:
 Select a set of training examples, each consisting of input information and the desired output result.
 Take one example from the training set and input the information into the network.
 Calculate the output of nodes in each layer after processing by neurons.
 Calculate the error between the network’s actual output and the expected output.
 Calculate backwards from the output layer to the first hidden layer, and adjust the connection weights of neurons in the network according to some principle that can reduce the error.
 Repeat steps (2) to (5) for each example in the training set until the error for the entire training set meets the requirements.
CNN
Convolution
The first feature is called local receptive fields. In an image, pixels that are closer to each other have tighter connections, while pixels that are farther apart have weaker relevancies. Therefore, each neuron doesn’t need to perceive the entire image; it only needs to sense locally and then synthesize local information at higher levels to obtain global information.
The second feature is weight sharing.
What does weight sharing mean? We can think of the convolution operation as a way to extract features, which is independent of position. This implies that the statistical properties of one part of the image are the same as those of other parts. This also means that the features learned in one part can be used in other parts; thus, for all positions in this image, we can use the same learned features.
What are the benefits of weight sharing? On one hand, repeating units can recognize features regardless of their position in the visual field. On the other hand, weight sharing allows more effective feature extraction because it significantly reduces the number of free variables that need to be learned. By controlling the size of the model, convolutional networks can have good generalization capabilities for visual problems.
The best way to explain the convolutional layer is to imagine a flashlight shining in the upper left corner of an image, assuming the area illuminated by the flashlight is a 5 x 5 range. Now imagine this flashlight sliding across different areas of the input image. In machine learning terminology, this flashlight is called a filter (sometimes also referred to as a neuron or convolutional kernel), and the area it illuminates is called the receptive field. This filter is also a series of data (these data are referred to as weights or parameters).
Pooling/Sampling
Max pooling selects the maximum value in the pooling window as the sampling value
Average pooling takes the average of all values in the pooling window as the sampling value. Pooling can reduce the amount of parameters, decrease the risk of overfitting.
Chapter 7 Statistical Techniques
Regression Analysis
Regression analysis is a statistical analysis method used to determine the quantitative relationship between two or more variables and to establish a mathematical equation to generalize a set of numerical data.
Least Squares Method
Construct an error function as the objective function (generally quadratic), take partial derivatives with respect to the independent variables.
Bayesian Analysis
A parameter estimation method that combines prior information about an unknown parameter with sample information (tabulated), calculates conditional probabilities according to Bayes’ theorem, obtains posterior information, and thus infers the unknown parameter;
$$P(HE)=\frac{P(EH)P(H)}{P(E)}$$ Problems:
 Probability equals 0 – add a small constant.
 Missing data – simply ignore, treat probability as 1.
Clustering Techniques
There are three main types of clustering: Partitioning clustering (KMeans), Hierarchical clustering (Agglomerative, Divisive), and Modelbased clustering (EM algorithm).
Agglomerative Clustering Algorithm
An unsupervised clustering algorithm.
Initially, each instance is in a different category, each forming a cluster;
Find the two most similar clusters, merge them until the requirements are met or merged into one category.
Cobweb Hierarchical Clustering An incremental hierarchical clustering algorithm. It classifies instances using a classification tree, and the construction of the classification tree is a process of conceptual hierarchy. The root of the tree represents the highest level of concepts and contains summary information for all domain instances; except for leaf nodes, other nodes are the base level of the tree, reflecting the hierarchical division of concepts. An evaluation function is used to measure the quality of the conceptual hierarchy. The objects to be classified must be of a categorical type.
Algorithm:
 Establish a class, using the first instance as its only member;
 For the remaining instances, at each tree level (concept hierarchy), use the evaluation function to decide one of the following actions:
 Place the new instance into an existing cluster;
 Create a new conceptual cluster with only this new instance;
Calculation of CU value:
$$CU=\frac{\sum\limits_{k=1}^{m}P(C_k)(\sum\limits_{i}\sum\limits_{j}P(A_{i}=V_{ij}C_{k})^{2}\sum\limits_{i}\sum\limits_{j}({A_{i}=V_{ij})^2})}{m}$$ $P(A_{i}=V_{ij}C_k)$ represents the probability that attribute $A_i$ takes the value $V_{ij}$ among all members of class $C_k$ $P(A_{i}=V_{ij})$ represents the probability that attribute $A_i$ takes the value $V_{ij}$ in the entire dataset $P(C_k)$ represents the probability of each class $C_k$
Advantages: It can automatically adjust the number of classes (clusters) and does not suffer from poor clustering results due to unreasonable random selection of the number of classifications. Limitations:
 Sensitive to the order of instances, requiring the introduction of two additional operations—merging and splitting—to reduce sensitivity.
 Assumes that the probability distribution of each attribute is independent of each other, which is not always the case in practice;
 The complexity of representing, updating, and storing the probability distribution of classes (clusters) depends on the number of values each attribute can take. When an attribute can take many values, the algorithm’s time and space complexity can become significant.
 Skewed instance data can cause an imbalance in the height of the conceptual hierarchy tree, also resulting in drastic changes in time and space complexity.