Data Mining Course Review Notes
Outline
1. Understanding Data Mining
- Definition of Data Mining
- Supervised and Unsupervised Learning
- Data Mining Process
2. Basic Data Mining Techniques
- Concept of Decision Trees and General Process of C4.5 Algorithm
- Key Technology of Decision Trees: Maximum Gain Ratio
- Decision Tree Rules: Decision Tree, Production Rules, Accuracy, and Coverage
- Basic Idea of Apriori Algorithm
- Association Rules and Their Confidence and Support
- Basic Idea of K-means Algorithm
- Example of K-means Cluster Analysis
3. Knowledge Discovery in Databases
- Definition of KDD (Understanding)
- Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing
5. Evaluation Techniques
- Evaluating Classification Output Models: Confusion Matrix, Classification Accuracy, Precision, and Recall
- Evaluating Numerical Output Models: Mean Absolute Error, Mean Squared Error, Root Mean Square Error
6. Neural Networks
- Artificial Neuron Model (Able to Draw and Calculate)
- Structure of BP Neural Network (Able to Draw, e.g., one input, one output, two hidden layers each containing three neurons), Forward Calculation Process
- General Process of BP Algorithm
- Basic Operations of Convolutional Neural Networks—Convolution and Pooling, Convolution Features (Local Receptive Field, Weight Sharing)
7. Statistical Techniques
- Univariate Linear Regression, Multivariate Linear Regression—Least Squares Method
- Bayesian Analysis: Bayesian Classifier
- General Steps of Agglomerative Clustering Algorithm, Example
- Cobweb Hierarchical Clustering Algorithm: Calculation of CU Value
Chapter 1 Understanding Data Mining
Definition of Data Mining
- Technical Perspective: a process of automatically analyzing and extracting information from data using computer technology; the goal is to discover potentially valuable information in the data; generally implemented with methods such as machine learning, statistics, and pattern recognition.
- Disciplinary Perspective: an interdisciplinary field involving artificial intelligence, statistics, visualization, parallel computing, and so on.
- Business Perspective: revealing hidden or unknown patterns, or verifying known ones.
Supervised and Unsupervised Learning
Basic methods in machine learning include: concept learning, inductive learning, supervised learning, and unsupervised learning.
Supervised Learning: by training on a large number of instances whose classification or output is known, the structure of the classification model is adjusted to build an accurate classification or prediction model; this inductive, concept-learning process is called supervised learning. Unsupervised Learning: the training instances have no predefined classes. Instances are grouped into clusters according to some similarity measure, and each cluster is then interpreted and understood to discover the meaning of the clustering. (K-means, agglomerative clustering, conceptual hierarchy Cobweb, EM algorithm)
Data Mining Process
Four Steps:
- Prepare data, including training data and test data;
- Select a data mining technique or algorithm and submit the data to mining software;
- Interpret and evaluate the model;
- Apply the model.

Data can be collected from databases, data warehouses, or flat files.

Considerations for selecting techniques and algorithms:
- Determine whether learning is supervised or unsupervised;
- How to divide training, testing, and validation data in the dataset;
- How to set parameters for the data mining algorithm;
Functions of Data Mining (Type Classification, Non-Technical)
- Supervised Learning
- Classification
- Estimation
- Prediction
- Unsupervised Clustering
- Association Analysis
Chapter 2 Basic Data Mining Techniques
Decision Trees
- Concept of Decision Trees and General Process of C4.5 Algorithm
- Key Technology of Decision Trees: Maximum Gain Ratio
- Decision Tree Rules: Decision Tree, Production Rules, Accuracy, and Coverage
Concept
A decision tree is a tree-structured model in which each internal node represents a test on an attribute of the object, each branch represents a possible value of that attribute, and each leaf node holds the output (classification) result of the decision tree.
C4.5 Algorithm
(1) Given a dataset T in “attribute-value” format, consisting of multiple instances with several input attributes and one output attribute. (2) Select the input attribute that best distinguishes the instances in T; C4.5 uses the gain ratio to choose this attribute. (3) Create a tree node for the selected attribute, with one branch for each of its possible values. (4) Use these branches to partition the instances into subclasses. (5) For each subclass, take its instances as the new T and repeat steps (2) through (4) on the remaining attributes, until one of the following two conditions is met, at which point splitting along that branch terminates and a leaf node is created whose value is the output class expressed along the branch.
- Instances in the subclass meet predefined criteria, such as all being classified into one output class, or instances classified into one output class reaching a certain proportion;
- No remaining attributes.
Key Technology
Method for Selecting the Most Distinguishing Attribute
Preferably select the attribute with the maximum gain ratio, so as to maximize the generalization of the data (minimizing the number of levels and nodes).

Information Entropy: entropy is a measure of uncertainty; the larger it is, the more information is contained and transmitted:
$$H(x)=-\sum_{i=1}^n p(x_{i})\log_{2}p(x_{i})$$

Information Gain: the greater the gain, the faster the entropy decreases, which is more conducive to classification.
$$Info(I)=-\sum_{i}\frac{\text{number of instances in class } i}{\text{total number of instances}}\log_{2}\frac{\text{number of instances in class } i}{\text{total number of instances}}$$
$Info(I)$ is the total amount of information expressed by all instances in the current dataset;
$$Info(I,A)=\sum\limits_{j=1}^{k}\frac{\text{number of instances with the } j\text{-th value of } A}{\text{total number of instances}}\,Info(\text{subset } j)$$
$Info(I,A)$ is the amount of information after partitioning $I$ according to the $k$ values of attribute $A$;
$$SplitsInfo(A)=-\sum\limits_{j=1}^{k}\frac{\text{number of instances with the } j\text{-th value of } A}{\text{total number of instances}}\log_{2}\frac{\text{number of instances with the } j\text{-th value of } A}{\text{total number of instances}}$$
$SplitsInfo(A)$ is the split information used to normalize the gain of attribute $A$;
$$Gain(A)=Info(I)-Info(I,A)$$
$$GainRatio(A)=\frac{Gain(A)}{SplitsInfo(A)}$$
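To make the formulas concrete, here is a minimal Python sketch (hypothetical toy dataset and attribute names, not the course's code) that computes $Info(I)$, $Info(I,A)$, $SplitsInfo(A)$, and the gain ratio for one categorical attribute:

```python
# Minimal gain-ratio sketch for one categorical attribute (hypothetical data).
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum p_i * log2(p_i) over the class distribution of `labels`."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attr, target):
    """rows: list of dicts; attr: input attribute name; target: output attribute name."""
    labels = [r[target] for r in rows]
    info_i = entropy(labels)                       # Info(I)
    partitions = {}                                # partition I by the values of attr
    for r in rows:
        partitions.setdefault(r[attr], []).append(r[target])
    n = len(rows)
    info_ia = sum(len(p) / n * entropy(p) for p in partitions.values())   # Info(I, A)
    split_info = -sum(len(p) / n * log2(len(p) / n) for p in partitions.values())
    gain = info_i - info_ia                        # Gain(A) = Info(I) - Info(I, A)
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: does "Outlook" help predict "Play"?
data = [
    {"Outlook": "Sunny", "Play": "No"},     {"Outlook": "Sunny", "Play": "No"},
    {"Outlook": "Overcast", "Play": "Yes"}, {"Outlook": "Rain", "Play": "Yes"},
    {"Outlook": "Rain", "Play": "No"},      {"Outlook": "Overcast", "Play": "Yes"},
]
print(gain_ratio(data, "Outlook", "Play"))
```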
Pruning Methods
Testing Methods
- Percentage Testing
- Cross-validation: k-fold cross-validation, hold-out, leave-one-out
Decision Tree Rules
Translate the decision tree into rules and calculate accuracy and coverage;
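A small worked example with hypothetical counts, assuming accuracy = (instances satisfying the whole rule) / (instances satisfying the antecedent) and coverage = (instances satisfying the whole rule) / (instances satisfying the consequent), paralleling the confidence and support definitions given for association rules below: for the rule “IF Income = High THEN Insurance = Yes”, if 10 instances satisfy the antecedent, 8 of them also satisfy the consequent, and 16 instances in total satisfy the consequent, then accuracy = 8/10 = 80% and coverage = 8/16 = 50%.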
Advantages and Disadvantages (Understanding)
Advantages (1) Easy to understand and interpret, and can be mapped to a set of more attractive production rules. (2) No need to make assumptions about the nature of the data in advance. (3) Can use datasets with numerical and categorical data to build models. Limitations (1) The output attribute must be categorical, and there must be only one output attribute. (2) Decision tree algorithms are unstable. (3) Trees created with numerical datasets are more complex because splits of numerical data attributes are usually binary splits.
Association Rules
Basic Idea of Apriori Algorithm
- Generate itemsets. Itemsets are combinations of “attribute-value” that meet certain support requirements. Combinations that do not meet support requirements are discarded, so the rule generation process can be completed in a reasonable time.
- Use the generated itemsets to create a set of association rules based on confidence.
Association Rules and Their Confidence and Support
Support: (instances satisfying the whole rule) / (instances satisfying the second half of the rule, i.e., the consequent). Confidence: (instances satisfying the whole rule) / (instances satisfying the first half of the rule, i.e., the antecedent).
In practical use, generate single-item, double-item, and triple-item itemsets in sequence, and create association rules based on the generated itemsets.
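A minimal, brute-force sketch of the level-wise idea on hypothetical transactions (real Apriori additionally prunes level-k candidates using the frequent (k-1)-itemsets, which is omitted here); itemset support is taken as the fraction of transactions containing the itemset:

```python
# Generate frequent itemsets level by level, then read off one rule's confidence.
from itertools import combinations

transactions = [
    {"milk", "bread"}, {"milk", "bread", "butter"},
    {"bread", "butter"}, {"milk", "butter"}, {"milk", "bread", "butter"},
]
min_support = 0.4   # minimum fraction of transactions containing the itemset

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Single items, then pairs, then triples; keep only itemsets meeting the threshold.
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in (1, 2, 3):
    for cand in combinations(items, k):
        s = support(set(cand))
        if s >= min_support:
            frequent[cand] = s
print(frequent)

# Confidence of the rule {milk} -> {bread}: support(whole) / support(antecedent)
conf = support({"milk", "bread"}) / support({"milk"})
print(round(conf, 2))
```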
Clustering Techniques
Basic Idea of K-means Algorithm
- Choose a value for K, the total number of clusters to be formed.
- Arbitrarily select K instances in the dataset as initial cluster centers.
- Compute the simple Euclidean distance between each of the K cluster centers and every remaining instance. Using this distance as the measure of similarity between instances, assign each instance to the cluster whose center it is most similar to (closest).
- Use instances in each cluster to calculate new cluster centers.
- If the new cluster centers obtained are equal to the cluster centers from the last iteration, terminate the algorithm. Otherwise, repeat steps 3-5 with the new cluster centers.
Example of K-means Cluster Analysis
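A minimal sketch on hypothetical 2-D points with K = 2, following the steps above:

```python
# K-means: pick initial centers, assign by Euclidean distance, recompute centers,
# repeat until the centers stop changing. Points and K are hypothetical.
import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
k = 2
rng = np.random.default_rng(0)
centers = points[rng.choice(len(points), size=k, replace=False)]

while True:
    # Assign each point to the nearest center (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the mean of its members (keep old center if empty).
    new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(labels, centers)
```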
Advantages and Disadvantages
Advantages: a very popular algorithm, easy to understand and simple to implement. Limitations: (1) Can only handle numerical data; if the dataset contains categorical attributes, either delete them or convert them to equivalent numerical data. (2) A value of K must be chosen before the algorithm starts as the initial number of clusters (the choice is somewhat arbitrary, and a poor choice affects the clustering result); usually several different K values are tried in repeated experiments in the hope of finding the best one. (3) K-means works best when the clusters are of approximately equal size. (4) Attributes that contribute little to the clustering, as well as outliers, can distort the result, so attribute selection is needed before clustering. (5) Clusters are difficult to interpret; supervised mining tools can be used to further explain the properties of the clusters formed by an unsupervised clustering algorithm.
Chapter 3 Knowledge Discovery in Databases
Definition of KDD (Understanding)
Knowledge Discovery in Databases: the non-trivial process of extracting credible (valid), novel, potentially valuable, and understandable information from datasets (most steps are executed automatically by the system).
KDD Process Model
Classic Model
CRISP-DM Business Model
Online Analytical Mining Model (OLAM)
Data Preprocessing: Histogram Reduction, Data Normalization, and Data Smoothing
Binning Method (noise processing -> data smoothing): sort the data, e.g., 3, 6, 12, 22, 24, 26, 27, 30, 30. Divide the values into three equal-height (equal-depth) bins, each containing the same number of values; with equal-width bins, each bin instead spans a value range of fixed length. Then smooth the values in each bin, e.g., by bin means, or by bin boundaries (retaining the extreme values of each bin as boundaries and replacing every value with the nearest boundary).
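A minimal sketch of equal-depth binning and smoothing (by bin means and by bin boundaries) on the example values above:

```python
# Equal-depth (equal-height) binning and two smoothing variants.
data = sorted([3, 6, 12, 22, 24, 26, 27, 30, 30])
n_bins = 3
depth = len(data) // n_bins          # 3 values per bin
bins = [data[i * depth:(i + 1) * depth] for i in range(n_bins)]
print(bins)                          # [[3, 6, 12], [22, 24, 26], [27, 30, 30]]

# Smoothing by bin means: replace every value with its bin's mean.
smoothed_by_means = [[sum(b) / len(b)] * len(b) for b in bins]
print(smoothed_by_means)

# Smoothing by bin boundaries: replace each value with the nearest boundary value.
smoothed_by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                      for b in bins]
print(smoothed_by_bounds)
```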
Histogram Reduction: divide the data into equal-width or equal-depth bins, count the number of values falling into each bin, and draw a histogram.
Data Normalization
- Min-Max Scaling: linearly transform the original data into a target range (by default [0, 1]). The original value $v_i$ is mapped to $v_{i}^{'}=\frac{v_{i}-min_A}{max_A-min_A}(b-a)+a$, where $[a, b]$ is the target interval ($b-a$ is its length and $a$ its starting point). If no interval is specified, the interval parameters can be omitted and the result falls in [0, 1]. (A small sketch follows this list.)
- Z-score Normalization: scale the data so that it has a mean of 0 and a standard deviation of 1 (i.e., follows a standardized distribution). The original value x is mapped to y = (x - μ) / σ, where μ and σ are the mean and standard deviation of the original data.
- Decimal Scaling: scale the data into a range such as [-1, 1] by moving the decimal point. The original value x is mapped to y = x / 10^k, where k is the smallest integer such that max(|y|) < 1 (equivalently, 10^(k-1) ≤ max|x| < 10^k); the number of decimal places to move is therefore determined by the maximum absolute value.
- Logarithmic Normalization: convert the data to logarithmic form, often used to compress positive data with a large dynamic range and make it easier to compare and analyze. The original value x is mapped to y = log(x).
- Norm Normalization: scale the data to unit length or unit range; common methods are L1-norm and L2-norm normalization. L1: y = x / (|x1| + |x2| + … + |xn|); L2: y = x / sqrt(x1^2 + x2^2 + … + xn^2).
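A minimal sketch (hypothetical values) of min-max scaling, z-score normalization, and decimal scaling:

```python
# Three normalization methods applied to a small, hypothetical value list.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max scaling to [a, b]: v' = (v - min) / (max - min) * (b - a) + a
a, b = 0.0, 1.0
print((v - v.min()) / (v.max() - v.min()) * (b - a) + a)

# Z-score: y = (x - mean) / std
print((v - v.mean()) / v.std())

# Decimal scaling: y = x / 10^k, k chosen so that max(|y|) < 1 (here k = 4)
k = len(str(int(np.abs(v).max())))   # number of digits of the largest absolute value
print(v / 10 ** k)
```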
Chapter 5 Evaluation Techniques
Evaluating Classification Output Models
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |

$$Accuracy=\frac{TP+TN}{TP+FP+FN+TN}$$
$$Error\ Rate=\frac{FP+FN}{TP+FP+FN+TN}$$
$$Precision=\frac{TP}{TP+FP}$$
$$Recall=\frac{TP}{TP+FN}$$
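A minimal sketch with hypothetical confusion-matrix counts:

```python
# Classification metrics from a hypothetical confusion matrix.
TP, FP, FN, TN = 40, 10, 5, 45
total = TP + FP + FN + TN

accuracy = (TP + TN) / total        # 0.85
error_rate = (FP + FN) / total      # 0.15
precision = TP / (TP + FP)          # 0.8
recall = TP / (TP + FN)             # about 0.889
print(accuracy, error_rate, precision, recall)
```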
Evaluating Numerical Output Models
$$MAE=\frac{|a_{1}-c_{1}|+|a_{2}-c_{2}|+\cdots+|a_{n}-c_{n}|}{n}$$ $$MSE=\frac{(a_{1}-c_{1})^2+(a_{2}-c_{2})^2+\cdots+(a_{n}-c_{n})^2}{n}$$ $$RMSE=\sqrt{MSE}=\sqrt{\frac{(a_{1}-c_{1})^2+(a_{2}-c_{2})^2+\cdots+(a_{n}-c_{n})^2}{n}}$$ where $a_i$ is the actual value and $c_i$ the computed (predicted) value for instance $i$.
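A minimal sketch with hypothetical actual and computed values:

```python
# MAE, MSE, and RMSE for hypothetical actual vs. computed outputs.
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])
computed = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(actual - computed))      # 0.75
mse = np.mean((actual - computed) ** 2)       # 0.875
rmse = np.sqrt(mse)
print(mae, mse, rmse)
```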
Chapter 6 Neural Networks
Artificial Neuron Model
R inputs $p_i \in \mathbb{R}$, i.e., an R-dimensional input vector $\mathbf{p}$; net input $n = \mathbf{w}\mathbf{p} + b$; R weights $w_i \in \mathbb{R}$, i.e., an R-dimensional weight vector $\mathbf{w}$; threshold (bias) $b$; output $a = f(n)$, where $f$ is the transfer function. Common transfer functions:
$$a=f(n)=hardlim(n)=\begin{cases} 1, & n \ge 0 \\ 0, & n < 0 \end{cases}$$
$$a=f(n)=hardlims(n)=\begin{cases} 1, & n \ge 0 \\ -1, & n < 0 \end{cases}$$
$$a=f(n)=n$$
$$a=f(n)=\frac{1}{1+e^{-n}}$$
The sigmoid function’s most important feature is that it is infinitely differentiable.
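A minimal sketch of the forward calculation for one neuron (hypothetical weights, inputs, and bias), using the sigmoid transfer function:

```python
# One artificial neuron: n = w . p + b, a = f(n).
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

p = np.array([0.5, -1.0, 2.0])   # R = 3 inputs
w = np.array([0.4, 0.3, -0.2])   # R = 3 weights
b = 0.1                          # threshold / bias

n = w @ p + b                    # net input
a = sigmoid(n)                   # output
print(n, a)
```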
BP Neural Network
Structure of BP Neural Network, Forward Calculation Process
Note that each neuron is fully connected to the next layer, and there are no connections within the same layer.
General Process of BP Algorithm
The learning process of the BP algorithm is as follows: (1) Select a set of training samples, each consisting of input information and the expected output. (2) Take one sample from the training set and present its input to the network. (3) Compute the output of each layer's nodes after processing by the neurons. (4) Compute the error between the network's actual output and the expected output. (5) Working backwards from the output layer to the first hidden layer, adjust the connection weights of the neurons in the direction that reduces the error (gradient descent). (6) Repeat steps (2) to (5) for every sample in the training set until the error over the whole training set meets the requirement.
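A minimal BP training sketch (hypothetical samples and learning rate, plain gradient descent on squared error) for the 1-3-3-1 structure mentioned above: forward pass, output error, backward propagation of deltas, weight update:

```python
# BP for a 1 - 3 - 3 - 1 sigmoid network on two hypothetical training pairs.
import numpy as np

rng = np.random.default_rng(0)
sizes = [1, 3, 3, 1]
W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(3)]
b = [np.zeros((sizes[i + 1], 1)) for i in range(3)]

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

samples = [(np.array([[0.1]]), np.array([[0.2]])),   # (input, expected output)
           (np.array([[0.9]]), np.array([[0.8]]))]
lr = 0.5

for epoch in range(2000):
    for x, t in samples:
        # Forward: compute each layer's output.
        activations = [x]
        for Wl, bl in zip(W, b):
            activations.append(sigmoid(Wl @ activations[-1] + bl))
        # Backward: output-layer delta, then propagate and update layer by layer.
        delta = (activations[-1] - t) * activations[-1] * (1 - activations[-1])
        for l in range(2, -1, -1):
            grad_W, grad_b = delta @ activations[l].T, delta
            if l > 0:  # delta for the previous layer uses the pre-update weights
                delta = (W[l].T @ delta) * activations[l] * (1 - activations[l])
            W[l] -= lr * grad_W
            b[l] -= lr * grad_b

for x, t in samples:                 # check: outputs should approach the targets
    out = x
    for Wl, bl in zip(W, b):
        out = sigmoid(Wl @ out + bl)
    print(t.item(), out.item())
```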
CNN
Convolution
The first key idea is the local receptive field. In an image, pixels that are close together are strongly related, while pixels that are far apart are only weakly correlated. Therefore each neuron does not need to perceive the whole image, only a local region; at higher levels the local information is combined to obtain global information.

The second key idea is weight sharing. The convolution operation can be viewed as a way of extracting features that is independent of position: the statistical properties of one part of the image are assumed to be the same as those of any other part, so a feature learned in one part can be applied at every position of the image. Weight sharing has two benefits: the same filter can recognize a feature regardless of where it appears in the visual field, and it greatly reduces the number of free parameters to learn, so by controlling the model size convolutional networks generalize well on visual problems.

A helpful way to picture the convolutional layer is a flashlight shining on the top-left of the image, illuminating, say, a 5 x 5 area, and then sliding across the different regions of the input image. In machine-learning terms the flashlight is a filter (sometimes called a neuron or convolution kernel), the area it illuminates is the receptive field, and the filter itself is a set of numbers (its weights or parameters).
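A minimal sketch (hypothetical image and randomly initialized filter, not a full CNN layer) of a 2-D “valid” convolution with one 5 x 5 filter, illustrating the local receptive field and weight sharing:

```python
# 2-D valid convolution: one 5 x 5 filter slides over the image; the same weights
# are reused at every position (weight sharing), and each output value depends
# only on a 5 x 5 local receptive field.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))          # hypothetical single-channel image
kernel = rng.normal(0, 0.1, (5, 5))   # hypothetical filter weights

out_h, out_w = image.shape[0] - 5 + 1, image.shape[1] - 5 + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 5, j:j + 5]          # the receptive field
        feature_map[i, j] = np.sum(patch * kernel)

print(feature_map.shape)   # (24, 24)
```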
Pooling/Sampling
Max pooling selects the maximum value in the pooling window as the sampling value. Average pooling adds all values in the pooling window and averages them, using the average value as the sampling value. Pooling can reduce the number of parameters and reduce the risk of overfitting.
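A minimal sketch (hypothetical 4 x 4 feature map) of 2 x 2 max pooling and average pooling with stride 2:

```python
# Max pooling keeps the largest value in each window; average pooling keeps the mean.
import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)   # hypothetical 4 x 4 feature map

def pool(x, size=2, mode="max"):
    h, w = x.shape[0] // size, x.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

print(pool(fmap, mode="max"))
print(pool(fmap, mode="avg"))
```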
Chapter 7 Statistical Techniques
Regression Analysis
Regression analysis is a statistical method for determining the quantitative dependency between two or more variables and establishing a mathematical equation that generalizes a set of numerical data. Least Squares Method: construct the error function as the objective (usually the sum of squared errors), take its partial derivatives with respect to the regression coefficients, and set them to zero to solve for the coefficients.
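A minimal sketch (hypothetical data points) of the closed-form least-squares solution for univariate linear regression $y \approx wx + b$:

```python
# Closed-form least squares: set the partial derivatives of the squared-error
# objective with respect to w and b to zero and solve the normal equations.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

w = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - w * x.mean()
print(w, b)   # slope close to 2, intercept close to 0
```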
Bayesian Analysis
A parameter estimation method that combines prior information about the unknown parameters with sample information, uses Bayes' theorem to compute conditional probabilities, derives the posterior information, and from it infers the unknown parameters: $$P(H|E)=\frac{P(E|H)P(H)}{P(E)}$$ Existing problems (a small classifier sketch follows the list below):
- Zero probabilities: add a small constant to every count (additive/Laplace smoothing).
- Missing data: simply ignore the missing attribute; its conditional-probability factor is treated as 1.
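A minimal naive Bayes classifier sketch on hypothetical categorical data (attribute names and counts are made up), with additive (Laplace) smoothing for zero probabilities and missing values simply skipped; the evidence term $P(E)$ is omitted because it is identical for every class when ranking them:

```python
# Naive Bayes with add-k smoothing on hypothetical categorical training data.
from collections import Counter, defaultdict

train = [  # (attribute dict, class)
    ({"Outlook": "Sunny", "Wind": "Weak"}, "No"),
    ({"Outlook": "Sunny", "Wind": "Strong"}, "No"),
    ({"Outlook": "Overcast", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Rain", "Wind": "Weak"}, "Yes"),
    ({"Outlook": "Rain", "Wind": "Strong"}, "No"),
]

class_counts = Counter(c for _, c in train)
value_counts = defaultdict(Counter)          # (class, attribute) -> value counts
for attrs, c in train:
    for a, v in attrs.items():
        value_counts[(c, a)][v] += 1

def posterior(attrs, c, k=1):
    """P(c) * prod P(v | c), with add-k smoothing; missing attributes skipped."""
    p = class_counts[c] / len(train)
    for a, v in attrs.items():
        if v is None:                        # missing value: factor of 1
            continue
        counts = value_counts[(c, a)]
        n_values = len({ex[0].get(a) for ex in train})
        p *= (counts[v] + k) / (class_counts[c] + k * n_values)
    return p

new = {"Outlook": "Sunny", "Wind": None}     # Wind is missing
scores = {c: posterior(new, c) for c in class_counts}
print(max(scores, key=scores.get), scores)
```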
Clustering Techniques
Clustering includes partitional clustering (K-means), hierarchical clustering (agglomerative, divisive), and model-based clustering (EM algorithm);
Agglomerative Clustering Algorithm: an unsupervised clustering algorithm. Initially every instance is in its own category, each forming a separate cluster; repeatedly find the two most similar clusters and merge them, until a stopping requirement is met or everything has been merged into a single cluster.
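A minimal sketch (hypothetical 2-D points) of agglomerative clustering, assuming single linkage as the similarity measure and merging until two clusters remain:

```python
# Agglomerative clustering with single linkage: start with one cluster per
# instance and repeatedly merge the two closest clusters.
import numpy as np

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
clusters = [[i] for i in range(len(points))]   # each instance is its own cluster
target = 2

def single_link(c1, c2):
    """Distance between clusters = distance between their closest pair of points."""
    return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

while len(clusters) > target:
    # Find and merge the pair of clusters with the smallest single-link distance.
    a, b = min(((a, b) for a in range(len(clusters))
                for b in range(a + 1, len(clusters))),
               key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]

print(clusters)   # lists of instance indices in the remaining clusters
```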
Cobweb Hierarchical Clustering: an incremental hierarchical clustering algorithm. It uses a classification tree to classify instances, and building the classification tree is a process of forming a concept hierarchy. The root of the tree is the most general concept and contains summary information about all domain instances; the non-leaf nodes are intermediate nodes that express the hierarchy of concept divisions. An evaluation function measures the quality of the concept hierarchy. The attributes of the classified objects must be categorical. Algorithm:
- Establish a class, using the first instance as its only member;
- For each remaining instance, at each level of the tree (concept hierarchy), use the evaluation function to decide which of the following actions to perform:
- Place the new instance into an existing cluster;
- Create a new concept cluster with only this new instance;
Calculation of CU Value: $$CU=\frac{\sum\limits_{k=1}^{m}P(C_k)\left(\sum\limits_{i}\sum\limits_{j}P(A_{i}=V_{ij}|C_{k})^{2}-\sum\limits_{i}\sum\limits_{j}P(A_{i}=V_{ij})^{2}\right)}{m}$$ $P(A_{i}=V_{ij}|C_k)$ is the probability that attribute $A_i$ takes the value $V_{ij}$ among the members of class $C_k$. $P(A_{i}=V_{ij})$ is the probability that attribute $A_i$ takes the value $V_{ij}$ in the entire dataset. $P(C_k)$ is the probability of each class $C_k$.
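A minimal sketch (hypothetical two-cluster partition of four categorical instances) of the CU computation above:

```python
# Category utility (CU) for a hypothetical partition of categorical instances.
from collections import Counter

clusters = [
    [{"color": "red", "shape": "round"},  {"color": "red", "shape": "round"}],
    [{"color": "blue", "shape": "square"}, {"color": "blue", "shape": "round"}],
]
all_instances = [inst for c in clusters for inst in c]
attributes = ["color", "shape"]
n = len(all_instances)

def sum_sq_probs(instances):
    """Sum over attributes i and values j of P(A_i = V_ij)^2 within `instances`."""
    total = 0.0
    for a in attributes:
        counts = Counter(inst[a] for inst in instances)
        total += sum((c / len(instances)) ** 2 for c in counts.values())
    return total

unconditional = sum_sq_probs(all_instances)       # uses the whole dataset
cu = sum(len(c) / n * (sum_sq_probs(c) - unconditional)
         for c in clusters) / len(clusters)       # divide by m = number of clusters
print(cu)
```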
Advantages: Can automatically adjust the number of classes (clusters), and will not cause unsatisfactory clustering results due to unreasonable random selection of the number of classifications. Limitations:
- Sensitive to the order of instances, requiring the introduction of two additional operations—merging and splitting to reduce sensitivity.
- Assumes that the probability distribution of each attribute is independent, which is not always true in practice;
- The complexity of representing, updating, and storing the probability distribution of classes (clusters) depends on the number of values each attribute can take. When there are many attribute values, the time and space complexity of the algorithm can be very high.
- Skewed instance data can cause the concept hierarchy tree to be highly unbalanced, which can also lead to drastic changes in time and space complexity.