Machine Learning Course Review Notes
1. Introduction
Concept of Machine Learning
Machines can acquire certain skills without being explicitly programmed with deterministic rules.
Performance P, Task T, Experience E: a program is said to learn if it uses experience E to improve its performance P on task T.
Classified by task type: regression, classification, clustering, dimensionality reduction.
Classified by learning method: supervised, unsupervised, reinforcement learning.
2. Model Evaluation and Selection
Key chapter: various evaluation metrics, including writing code and calling library functions.
Examples such as covariance matrix, precision, confusion matrix, ROC curve
Basic knowledge points such as overfitting problems and solutions
Empirical error: error on the training set (training error)
Generalization error: error on future samples
The smaller the generalization error, the better; a smaller empirical error is not necessarily better (it may indicate overfitting).
Evaluation Methods
Hold-out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
- Maintain consistent data distribution
- Multiple random splits, repeat experiments and take the average
- Appropriate test set ratio
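A minimal hold-out sketch with scikit-learn, assuming a feature matrix `X` and labels `y` are already loaded; `stratify=y` keeps the class distribution consistent, and repeating the split with different seeds and averaging follows the points above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# assume X (n_samples, n_features) and y (n_samples,) are already loaded
scores = []
for seed in range(10):                      # repeat the random split and average
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))
print(np.mean(scores))
```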
Cross-validation
First, divide the dataset into k equally sized mutually exclusive subsets, ensuring consistent data distribution in each subset, then select k-1 subsets as the training set, and the remaining one as the test set. This way, k sets of training/test sets can be obtained, allowing for k rounds of training and testing.
Specifically, if $k=m$, cross-validation becomes Leave One Out:
- Not affected by the random sample partitioning method, as there is only one partitioning method—one sample per subset;
- The training set contains as many samples as possible, making the model being evaluated very similar to the expected model;
- High training cost
- The estimated result may not be more accurate than other evaluation methods
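A sketch of k-fold cross-validation and leave-one-out with scikit-learn; the estimator choice and the placeholder data `X`, `y` are assumptions:

```python
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# k-fold: k rounds of training/testing, score averaged over the k folds
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=0))

# leave-one-out: k = m, one sample per test set (high training cost for large m)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```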
Bootstrap
Construct the training set by sampling with replacement; about $\frac{1}{e} \approx 0.368$ of the original samples are never drawn, and these out-of-bag samples serve as the test set.
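A small numpy check of the bootstrap argument: sampling $m$ indices with replacement leaves roughly $1/e \approx 0.368$ of the samples un-drawn, and these out-of-bag samples can serve as the test set:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
idx = rng.integers(0, m, size=m)          # sample m indices with replacement
oob = np.setdiff1d(np.arange(m), idx)     # indices never drawn (out-of-bag)
print(len(oob) / m)                       # close to 1/e ≈ 0.368
```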
Performance Metrics
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2$$
$$MAE = \frac{1}{m}\sum_{i=1}^{m}\vert f(x_i)-y_{i}\vert$$
The horizontal axis of the ROC curve is the false positive rate (FPR), the vertical axis is the true positive rate (TPR); the area under the curve is the AUC, and larger is better.
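A hedged sketch of computing these metrics with sklearn.metrics; `y_true`/`y_pred` (regression targets and predictions) and `y_test`/`y_label`/`y_score` (classification labels, predicted labels, and decision scores) are assumed to already exist:

```python
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             confusion_matrix, precision_score,
                             roc_curve, roc_auc_score)

mse = mean_squared_error(y_true, y_pred)        # regression: MSE
mae = mean_absolute_error(y_true, y_pred)       # regression: MAE

cm = confusion_matrix(y_test, y_label)          # classification: confusion matrix
prec = precision_score(y_test, y_label)         # precision = TP / (TP + FP)
fpr, tpr, thresholds = roc_curve(y_test, y_score)   # ROC: FPR vs TPR
auc = roc_auc_score(y_test, y_score)            # area under the ROC curve
```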
Comparison Test
Performance on the test set does not by itself reflect the true generalization performance of the model, so statistical hypothesis testing is needed to infer whether one learner is statistically superior to another. Reasons:
- Test performance is not equal to generalization performance
- Test performance varies with the test set
- Machine learning algorithms themselves have some randomness
Bias and Variance
For regression tasks, the generalization error can be decomposed into the sum of bias, variance, and noise.
$$E(f;D)= bias^{2}(x)+var(x)+ \epsilon^2 $$
As training proceeds, the generalization error generally shifts from being bias-dominated to variance-dominated.
3. Linear Models
Concepts of gradient and covariance: definition, calculation, and solving (by hand and in code)
Least squares: take the derivative of the loss function, set it to zero, and solve:
$$w= \frac{\sum\limits_{i=1}^{m}y_{i}(x_{i}-\bar x)}{\sum\limits_{i=1}^{m}x_{i}^{2}-\frac{1}{m}\left(\sum\limits_{i=1}^{m}x_{i}\right)^2}$$
$$b = \frac{1}{m}\sum\limits_{i=1}^{m}(y_{i}- wx_{i})$$
Logistic regression: use a linear model for classification via the logistic (sigmoid) function.
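A numpy sketch of the closed-form solution above for one-dimensional least squares (the toy data are arbitrary):

```python
import numpy as np

def least_squares_1d(x, y):
    """Closed-form w, b for y = w*x + b (simple linear regression)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
    b = np.mean(y - w * x)
    return w, b

w, b = least_squares_1d([0, 1, 2], [2, 1, 2])   # small toy data
print(w, b)
```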
Linear Discriminant Analysis (LDA): given a training set, project the samples onto a line so that points of the same class are as close as possible and points of different classes are as far apart as possible; using the within-class and between-class scatter (covariance) matrices, the optimization goal is to maximize the generalized Rayleigh quotient.
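A minimal scikit-learn call for LDA, projecting onto a single direction; `X`, `y` are placeholder data:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)   # project onto a single line
X_proj = lda.fit_transform(X, y)   # same-class points close, different classes far apart
y_pred = lda.predict(X)
```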
Multiclass learning: one-vs-one ($\frac{n(n-1)}{2}$ classifiers), one-vs-all ($n$ classifiers), many-vs-many (ECOC)
Gradient descent: $w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w}$
Sample covariance:
$$\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n} (x_{i}-\bar x)(y_{i} - \bar y)}{n-1}$$
$$\mathrm{Cov}(x,y) = \mathrm{Cov}(y,x), \qquad \mathrm{Cov}(x,x) = \mathrm{Var}(x)$$
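A quick numpy check of the sample covariance formula and the two identities above (random data used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

def cov(x, y):
    """Sample covariance with the 1/(n-1) normalization used above."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(np.isclose(cov(x, y), np.cov(x, y)[0, 1]))   # matches np.cov
print(np.isclose(cov(x, y), cov(y, x)))            # Cov(x, y) = Cov(y, x)
print(np.isclose(cov(x, x), np.var(x, ddof=1)))    # Cov(x, x) = Var(x)
```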
4. Neural Networks
Key Chapter
BP derivation, calculation, error calculation, algorithm
Perceptron
Original Form
Here $L(w,b)$ is the loss function, which is related to the total distance of all misclassified points to the separating hyperplane; training is the process of driving this loss to converge. The perceptron algorithm is convergent: when the data are linearly separable, the perceptron finds a separating hyperplane that completely separates the training data after a finite number of updates.
Dual Form
Neural Networks
Backpropagation
Algorithm steps: Initialization - Forward calculation of sample output - Calculate output layer gradient $g$ - Calculate hidden layer gradient $e$ - Update connection weights and thresholds
Forward calculation: thresholds (biases) are subtracted from the weighted inputs before applying the activation function
Backpropagation: Gradient of output layer nodes:
$$ g_{j} = \hat y_{j}(1- \hat y_{j})(y_{j}- \hat y_j) $$
where $\hat y_j$ is the computed output value and $y_{j}$ is the expected value (true label).
Gradient of hidden layer nodes:
$$e_{h} = b_{h}(1-b_{h})\sum_{j=1}^{l}w_{hj}g_{j}$$
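A compact numpy sketch of one BP update on a single sample for a single-hidden-layer sigmoid network, using the output-layer gradient $g_j$ and hidden-layer gradient $e_h$ above (array shapes, parameter names, and the learning rate are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, V, gamma, W, theta, eta=0.1):
    """One BP update for a single-hidden-layer sigmoid network.

    x: (d,) input, y: (l,) target
    V: (d, q) input-to-hidden weights, gamma: (q,) hidden thresholds
    W: (q, l) hidden-to-output weights, theta: (l,) output thresholds
    """
    # forward pass (thresholds are subtracted, as in the notes)
    b = sigmoid(x @ V - gamma)          # hidden-layer outputs b_h
    y_hat = sigmoid(b @ W - theta)      # network outputs ŷ_j

    # backward pass
    g = y_hat * (1 - y_hat) * (y - y_hat)        # output-layer gradient g_j
    e = b * (1 - b) * (W @ g)                    # hidden-layer gradient e_h

    # update connection weights and thresholds
    W += eta * np.outer(b, g)
    theta -= eta * g
    V += eta * np.outer(x, e)
    gamma -= eta * e
    return 0.5 * np.sum((y_hat - y) ** 2)        # squared error on this sample
```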
Overfitting
Coping strategies: early stopping and regularization
Early Stopping
- If the change in training error is small for several consecutive rounds, stop training
- Use a validation set: if the training error decreases but the validation error increases, stop training
Regularization
Add a term describing the complexity of the network to the error objective function, making the network prefer smaller connection weights and thresholds, and making the network output smoother.
Local minima
Simulated annealing, random perturbation, genetic algorithms, momentum, etc.
Activation Function
The role is to increase the nonlinear expression ability of the network. Requirements:
- Continuous and differentiable nonlinear function
- The function and its derivative should be as simple as possible, facilitating computation
- The range of the derivative should lie in a suitable interval, otherwise the efficiency and stability of training suffer.
$$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$$
5. Optimization Theory and Methods
Prove convex set, convex function, graphical method, dual problem
Introduction
Convex set: a set $S$ in n-dimensional Euclidean space is convex if the line segment between any two of its points lies entirely within the set, i.e. for any $x_{1},x_{2} \in S$ and $\lambda \in [0,1]$, $\lambda x_{1}+ (1-\lambda) x_{2} \in S$.
Convex function: $f$ is convex on its (convex) domain if for any $x_{1},x_{2}$ and $\lambda \in [0,1]$, $f(\lambda x_{1}+ (1-\lambda ) x_{2}) \leq \lambda f(x_{1}) + (1- \lambda)f(x_2)$. The negative of a convex function is concave, and the negative of a concave function is convex. If the Hessian matrix is positive semi-definite, the function is convex; if it is positive definite, the function is strictly convex.
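A small numpy sketch of the Hessian test for the quadratic form $ax_1^2+bx_2^2+cx_1x_2+dx_1+ex_2$ (the same form as exam question 3 below); the Hessian is constant, so its eigenvalues decide convexity:

```python
import numpy as np

def check_convexity(a, b, c):
    """Hessian of a*x1^2 + b*x2^2 + c*x1*x2 + d*x1 + e*x2 (linear terms do not matter)."""
    H = np.array([[2 * a, c],
                  [c, 2 * b]], dtype=float)
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "strictly convex (Hessian positive definite)"
    if np.all(eigvals >= 0):
        return "convex (Hessian positive semi-definite)"
    return "not convex"

print(check_convexity(1, 1, 1))   # x1^2 + x2^2 + x1*x2: strictly convex
print(check_convexity(1, 1, 3))   # indefinite Hessian: not convex
```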
Convex programming: minimizing a convex function over a convex set. A local minimum of a convex program is a global minimum, and the set of minimum points is a convex set. If the objective function is strictly convex and a minimum exists, the minimizer is unique.
Linear Programming
The standard form of linear programming is:
$$
\begin{align*}
\min \quad & cx \\
\text{s.t.} \quad & Ax = b, \\
& x \geq 0
\end{align*}
$$
- Graphical Method
- Basic Properties
- The feasible region of linear programming is a convex set
- If there is an optimal solution, it must be at an extreme point
- Basic feasible solution: in the standard form of linear programming, let the rank of $A$ be $m$ and write $A=[B,N]$, where $B$ is an invertible $m \times m$ matrix. Then
$$ x= \begin{bmatrix} x_{B} \\ x_{N} \end{bmatrix} = \begin{bmatrix} B^{-1}b \\ 0 \end{bmatrix} $$
is a basic solution; $B$ is the basis matrix and the components of $x_{B}$ are the basic variables. If a basic solution has no negative components, it is a basic feasible solution. Working through examples helps.
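For small problems the graphical method applies; numerically, a standard-form LP can be sketched with scipy.optimize.linprog (all coefficients below are made-up placeholders):

```python
import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  x >= 0   (standard form above; numbers are illustrative)
c = np.array([1.0, 2.0, 0.0])
A_eq = np.array([[1.0, 1.0, 1.0]])
b_eq = np.array([4.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x, res.fun)   # the optimum lies at an extreme point of the feasible region
```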
6. Support Vector Machines
Key chapter: kernel functions, functional margin, geometric margin, support vectors, the three forms of SVM and their dual problems, calling support vector machines in sklearn
Functional margin: $\hat \gamma_i = y_{i}(wx_i+b)$. The functional margin of the hyperplane with respect to the training set is the minimum functional margin over all sample points. Geometric margin: $\gamma_{i} = \frac{\hat \gamma_{i}}{\Vert w \Vert}$.
Linearly Separable SVM
Linear SVM
Nonlinear SVM - Kernel Trick
Support Vector
SVM Features
■Advantages
- Has good generalization ability, especially on small training sets, where it can perform much better than other algorithms. This is because its optimization objective is structural risk minimization rather than empirical risk minimization: through the margin it obtains a structured description of the data distribution, which reduces the requirements on data scale and data distribution.
- Has strong mathematical theoretical support, basically does not involve probability measures and laws of large numbers, etc.
- Introducing kernel functions can solve nonlinear classification problems.
■Disadvantages
- Inconvenient for solving multi-classification problems.
- There are hyperparameters that have a significant impact on the results, such as the penalty term C and the kernel parameter gamma when using the rbf kernel function, which can only be obtained through exhaustive search or empirical estimation.
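A minimal scikit-learn sketch illustrating the call mentioned in the chapter summary and the hyperparameters above (`C` and `gamma` for the rbf kernel); `X`, `y` are placeholders and the grid values are arbitrary:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# rbf-kernel SVM; the penalty C and gamma strongly affect the result
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best = search.best_estimator_
print(search.best_params_)
print(best.support_vectors_.shape)   # the support vectors found by the model
```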
7. Ensemble Learning
Adaboost Algorithm
Input: training set $D= \{(x_1,y_1),(x_2,y_2),\dots,(x_m,y_{m})\}$ with $y_{i} \in \{-1,+1\}$; output: the final classifier $H(x)$. The steps are as follows:
- Set the number of weak classifiers to $T$ and let $t=1$. Initialize the sample weight distribution uniformly: the $m$-dimensional vector $D_1$ holds the initial sample weights, $$D_{1} = \left(\frac{1}{m},\frac{1}{m},\dots,\frac{1}{m}\right)^T$$
- Use the training sample set $D$ with weight distribution $D_t$ to learn the $t$-th weak classifier $h_t$
- Calculate the misclassification rate $\epsilon_t$ of $h_t$ on the training sample set $D$, if $\epsilon_t>0.5$, discard it
- Determine the combination weight $\alpha_t$ of the weak learner $h_t$; the smaller the error rate of $h_t$, the larger $\alpha_t$ should be, with $$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_{t}}$$
- Update the sample weights using the weak classifier $h_t$ and its weight $\alpha_t$: $$D_{t+1,i}= \frac{D_{t,i}\exp(-\alpha_{t} y_{i}h_{t}(x_{i}))}{Z_{t}}$$ where $Z_{t}$ is a normalization factor
- If $t<T$, let $t= t+1$, and return to step 2, otherwise execute step 7
- Combine the $T$ classifiers $h_1,h_2,\dots,h_T$ according to their weights $\alpha_t$ to obtain the final classifier $$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\right)$$
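A hedged sketch using sklearn's AdaBoostClassifier, which broadly follows the procedure above (assumes sklearn >= 1.2 for the `estimator` argument; `X`, `y` are placeholder data); decision stumps serve as the weak classifiers $h_t$, and `estimator_weights_` corresponds to the $\alpha_t$:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# T weak classifiers (decision stumps), combined with weights alpha_t
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50)
clf.fit(X, y)
print(clf.estimator_weights_[:5])   # the learned alpha_t values
```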
Bagging Algorithm Steps
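A minimal sklearn sketch of Bagging (bootstrap-sample the training set, train one base learner per sample, combine predictions by voting), assuming sklearn >= 1.2 and placeholder data `X`, `y`; `oob_score=True` evaluates on the out-of-bag samples:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# each base learner is trained on a bootstrap sample; predictions are combined by voting
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=20,
    bootstrap=True,
    oob_score=True)      # evaluate on the out-of-bag samples
bag.fit(X, y)
print(bag.oob_score_)
```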
Methods to Enhance Diversity
- Data sample perturbation: usually based on sampling methods, e.g. Bagging uses bootstrap sampling and AdaBoost uses sequential sampling.
- Input attribute perturbation: random subspace method, i.e. extract several attribute subsets from the initial attribute set and train a base learner on each subset.
- Output representation perturbation: manipulating the output representation can enhance diversity. Flipping method: randomly change the labels of some training samples. Output modulation method: convert classification outputs to regression outputs and construct individual learners on the sub-tasks. ECOC method: use error-correcting output codes to decompose a multi-class task into a series of binary classification tasks for the base learners.
- Algorithm parameter perturbation: randomly set different parameters. Negative correlation method: explicitly force individual neural networks to use different parameters through regularization terms.
2020 Level Exam Questions
1. 10 points
Original question from the Chinese Academy of Sciences graduate exam, two questions totaling 10 points
2. 10 points
- What is Leave One Out?
- Given the following data, use the linear model $y=wx+b$, calculate the (overall) mean squared error using Leave One Out.
| x | y |
|---|---|
| 0 | 2 |
| 1 | 1 |
| 2 | 2 |
(In any case there were three data points; the exact values may differ.)
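A hedged numpy check for this kind of question, using the (possibly inexact) data above: fit $y=wx+b$ on the two remaining points each time, predict the held-out point, and average the squared errors:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 1.0, 2.0])
m = len(x)

errors = []
for i in range(m):
    mask = np.arange(m) != i                     # leave sample i out
    w, b = np.polyfit(x[mask], y[mask], deg=1)   # least-squares fit y = w*x + b
    errors.append((w * x[i] + b - y[i]) ** 2)

print(np.mean(errors))   # leave-one-out mean squared error
```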
3. Determine whether the following are convex functions 10 points
(1) $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
(2) $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
(Both were of this form, but I don't remember the specific coefficients.)
4. Prove Convex Set 10 points
Let $C$ be a convex set in $R^p$ and let $S = \{x \mid x=A\rho ,\ \rho \in C\}$, where $A$ is an $n \times p$ matrix. Prove that $S$ is a convex set.
5. Use graphical method to find the minimum value of linear programming. 10 points
6. Write the dual form of the minimization problem. 10 points
7. 15 points
- Describe the AdaBoost algorithm
- Describe the Bagging algorithm
- Briefly describe the methods to enhance diversity
8. 25 points Given a neural network, $d$-dimensional input $x = [x_{1},…,x_{d}]$, single hidden layer with $p$ hidden nodes, $l$ output nodes, output is $y = [\hat y_{1}^{k},…,\hat y_{l}^{k}]$
- Write the input of the hidden layer and the input of the output layer
- Write the mean square loss function $E_{k}$
- Given the learning rate $\eta$, find $\frac{\partial E_{k}}{\partial w_{hj}}$ (show the derivation process)
- Provide pseudocode for backpropagation
- Explain gradient descent and some common variants, and compare their differences


