Machine Learning Course Review Notes
1 1. Introduction
1.1 Concept of Machine Learning
The ability to equip machines with a skill without deterministic programming Performance Task Experience, the program has improved P over T using E, that is learning Divided by task types: regression, classification, clustering, dimensionality reduction Divided by learning methods: supervised, unsupervised, reinforcement learning
2 2. Model Evaluation and Selection
Key section Various evaluation metrics, including writing code, calling library functions Example problems like covariance matrix, precision, confusion matrix, ROC curve Basic concepts like overfitting problem and solutions
Empirical error: Error on the training set (training error) Generalization error: Error on future samples The smaller the generalization error, the better, but not the empirical error.
2.1 Evaluation Methods
2.1.1 Holdout method
X_train,X_test,y_train,y_test=train_test_split()
 Maintain consistent data distribution
 Multiple random splits, repeated experiments to get average value
 Appropriate test set ratio
2.1.2 Crossvalidation method
First, divide the dataset into k mutually exclusive subsets of the same size, ensuring the data distribution of each subset is consistent, then select k1 subsets as the training set, the remaining as the test set. This way, k groups of train/test sets can be obtained, allowing for k training and tests.


Especially, if $k=m$ then crossvalidation becomes Leave One Out:
 Not affected by random sample split, because there is only one way to split  one sample per subset;
 The training set contains as many samples as possible, making the actually evaluated model very similar to the expected model to evaluate;
 High training cost
 Estimation result may not be more accurate than other evaluation methods
2.1.3 Bootstrap method
Use replacement to repeatedly sample to construct the dataset, $\frac{1}{e} = 0.368$ of the original data will never be sampled, which then serves as the test set.
2.2 Performance Metrics
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(f(x_i)y_i)^2$$ $$MAE = \frac{1}{m}\sum_{i=1}^{m}\vert f(x_i)y_{i}\vert$$
The ROC curve’s horizontal axis is the false negative rate, and its vertical axis is the true positive rate, enclosed area is AOC, the larger, the better.
2.3 Comparison Test
The performance on the test set cannot reflect the true performance of the model, statistical hypothesis testing is needed to infer statistically whether it is superior. Reasons:
 Testing performance is not equal to generalization performance
 Testing performance varies with the changes in the test set
 Machine learning algorithms themselves have certain randomness Bias and Variance For regression tasks, the generalization error can be decomposed into bias, variance, and noise sum. $$E(f;D)= bias^{2}(x)+var(x)+ \epsilon^2 $$
In the process of training, it generally gradually transitions from bias dominance to variance dominance.


3 3. Linear Models
Concepts, calculations, and solutions (manual, code) of gradients (covariance)
Least squares method: Derive the loss function, set the derivative to 0, solve to obtain: $$w= \frac{\sum\limits_{i=1}^{m}y_{i}(x_{i}\bar x)}{\sum\limits_{i=1}^{m}x_{i}^{2}\frac{1}{m}(\sum\limits_{i=1}^{m}x_{i})^2}$$ $$b = \frac{1}{m}\sum\limits_{i=1}^{m}(y_{i} wx_{i})$$ Logistic regression: Using a linear model for classification tasks, logistic function (sigmoid) Linear Discriminant Analysis (LDA): Given a training set, aim to project the samples onto a line, such that points of the same class are as close as possible and points of different classes are as far apart as possible, optimizing the objective by maximizing the Rayleigh quotient. Multiclass learning: OnevsOne (classifier for each $\frac{n*(n1)}{2}$ pair), OnevsAll (n classifiers), ManyvsMany (ECOC)


Gradient Descent: $w^{t+1} = w^{t}  \eta \frac{\partial L}{\partial w}$ $$Conv(x,y) = \frac{\sum_{i=1}^{n} (x_{i}\bar x)(y_{i}  \bar y)}{n1}$$ $$Conv(x,y) = Conv(y,x)$$ $$Conv(x,x) = Var(x)$$


4 4. Neural Networks
Key sections BP derivation, computation, error calculation, algorithm
4.1 Perceptron
4.1.1 Original form
Where $L(w,b)$ represents the loss function, which is related to the distance of all misclassified points to the hyperplane. The training process is a convergence process of the loss function. The perceptron algorithm is convergent: when the data is linearly separable, the perceptron can find a separating hyperplane that perfectly separates the training data after a finite number of searches.
4.1.2 Dual form
4.2 Neural Networks
4.2.1 Backpropagation
Algorithm steps: InitializationForecast calculation of sample outputCalculate the gradient of the output layer $g$Calculate the gradient of the hidden layer $e$Update connection weights and thresholds Forward calculation: bias values are to be subtracted Backpropagation: The gradient of the output layer nodes: $$ g = \hat y_{j}(1 \hat y_{j})(y_{j} \hat y_j) $$ where $\hat y_j$ is the calculated output value, and $y_{j}$ is the expected value (real value, label) The gradient of the hidden layer nodes: $$e = b_n(1b_n)\sum_{j=1}^{l}w_{hj}g_{j}$$
4.2.2 Overfitting
Countermeasures: Early Stopping & Regularization Early Stopping
 Stop training if the training error changes little over several rounds
 Use a validation set: Stop training if training error decreases but validation error increases Regularization Add a term describing the network’s complexity to the error target function, making the network prefer smaller connection weights and thresholds, resulting in a smoother network output. Local Minima Simulated annealing, random perturbation, genetic algorithms, momentum mechanisms….
4.2.3 Activation Functions
Their purpose is to increase the network’s nonlinear expressive capability. Requirements:
 Continuous and differentiable nonlinear function
 The function and its derivative should be as simple as possible, facilitating computation
 The value range of the derivative should be in a suitable interval, otherwise, it will affect the training efficiency and stability $$sigmoid = \frac{1}{1+e^{x}}$$
5 5. Optimization Theory and Methods
Proof of convex sets, convex functions, graph methods, dual issues
5.1 Introduction
Convex set: In an ndimensional Euclidean space, if the line connecting any two points still lies within this space, it is a convex set. Formally described as $\lambda x_{1}+ (1\lambda) x_{2} \in S$
Convex function: In the domain, for any $x$ if $f(\lambda x_{1}+ (1\lambda ) x_{2}) <= \lambda f(x_{1}+ (1 \lambda)f(x_2))$, it is a convex function. Convex functions are concave, and concave functions are convex. If the Hessian matrix is semipositive definite, the function is convex, and if it’s positive definite, the function is strictly convex.
Convex programming: Seeking the minimal point of a convex function on a convex set. The local minimum point in convex programming is the global minimum point, and the set of minimal points is a
convex set. The objective function is a strictly convex function + the existence of a minimal point = the minimal point is unique.
5.2 Linear Programming
The standard form of linear programming is:
$$
\begin{align*}
min \quad cx \
s.t. \quad Ax = b, \
x \geq 0,
\end{align*}
$$
 Graphical Method
 Basic Properties
 The feasible domain of linear programming is a convex set
 If there exists an optimal solution, it is necessarily obtained at an extreme point
 Feasible Solution In the standard form of linear programming, suppose the rank of matrix A is m and suppose $A=[B,N]$ where B is an morder invertible matrix, then $$ x= \begin{bmatrix} x_{B} \ x_{N} \end{bmatrix} = \begin{bmatrix} B^{1}b \ 0 \end{bmatrix} $$ $$$$ is a basic solution. B is a basis matrix, whose components are basic variables. If the basic solution does not contain negative numbers, it is a basic feasible solution. Example problems can facilitate understanding.
6 6. Support Vector Machine
Key section Kernel function, functional margin, geometric margin, support vectors, three forms, and dual problem sklearn calling support vector machine
Functional margin: $\hat \gamma_i = y(wx_i+b)$ The functional margin of the hyperplane relative to all training sets is the minimum value among all sample points’ functional margins. Geometric margin: $\gamma = \frac{\hat \gamma_{i}}{\vert \vert w \vert \vert}$
6.1 Linearly Separable SVM
6.2 Linear SVM
6.3 Nonlinear SVM  Kernel Trick
6.4 Support Vectors
6.5 SVM Features
■Advantages
 Has good generalization capability, especially achieving much better results on small training datasets than other algorithms. This is because its optimization goal is the minimization of structural risk rather than empirical risk, thus through the margin, it can achieve a structured description of data distribution, thereby reducing the requirements on data scale and distribution.
 Has strong mathematical theoretical support, basically not involving probability measurements and laws of large numbers, etc.
 Introducing kernel functions can solve nonlinear classification problems. ■Disadvantages
 Not convenient for solving multiclassification problems.
 There exist hyperparameters that have a significant impact on the results, such as when using the RBF kernel function, penalty term C and kernel parameter gamma, can only be obtained through exhaustive search or empirical inference.


7 7. Ensemble Learning
7.1 Adaboost Algorithm
Given a training set $D= {(x_1,y_1),(x_2,y_2),…,(x_m,y_{m})} \quad y={1,1}$, output the final classifier $H(x)$, the specific steps are as follows:
 Set the number of weak classifiers to $T$, let $t=1$, initialize the training set sample weight distribution using a uniform distribution, and let $m$dimensional vector $D_1$ represent the sample weights to be updated for the first time, with initial value $$D_{t} = (\frac{1}{m},\frac{1}{m},…,\frac{1}{m})^T $$
 Learn the $t$th weak classifier $h_t$ using the training sample set $D$ with weight distribution $D_t$
 Calculate the error rate $\epsilon_t$ of $h_t$ on the training sample set $D$, if $\epsilon_t>0.5$, then discard it
 Determine weak learner $h_t$’s combination weight $\alpha_t$, the smaller the error rate $\epsilon_t$ of $h_t$, the greater its weight $\alpha_t$ should be, with $$\alpha_t = \frac{1}{2}ln\frac{1\epsilon_t}{\epsilon_{t}}$$
 Update the sample weight based on the classification error rate $\epsilon_t$ of weak classifier $h_t$: $$D_{t+1}= \frac{D_{t,i}exp(\alpha_{t} y_{i}h_{t}(x_{i}))}{Z_{t}}$$
 If $t<T$, let $t= t+1$, and return to step 2, otherwise go to step 7
 Combine the $T$ classifiers $h_1,h_2,…,h_T$ according to each weight $\alpha_t$, to obtain the final classifier $$H(x) = sign(\sum_{t=1}^{t}\alpha_{t}h_{t}(x))$$
7.2 Bagging Algorithm Steps
7.3 Ways to Enhance Diversity

Data Sample Disturbance Usually based on sampling methods, such as Bagging using bootstrapping, AdaBoost employs sequential sampling

Input Feature Disturbance Random subspace: Extract several attribute subsets from the initial attribute set, and train a base learner for each attribute subset

Output Representation Disturbance Manipulating output representation can enhance diversity. Flip method: Randomly change the labels of some training samples to transform the output representation Output modulation method: Convert classification output to regression output, and build individual learners to solve subtasks ECOC method: Use errorcorrecting output codes to decompose multiclass tasks into a series of binary classifications to train base learners

Algorithm Parameter Perturbation Randomly set different parameters. Negative correlation method: explicitly force individual neural networks to use different parameters through regularization terms.
8 References
Reference Note One: CodeOriented
9 20th Level Real Questions
I. 10 points
Original question of the Chinese Academy of Sciences graduate exam, two questions total 10 points
II. 10 points
 What is the leaveoneout method?
 Given the following data, using the linear model $y=wx+b$, calculate the (overall) mean squared error using the leaveoneout method.
x y 0 2 1 1 2 2
(Anyway, it’s three sets, but the specific data might not be this.)
III. Determine whether the following are convex functions 10 points
 $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
 $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$ (Both are in this form, but I don’t remember the specific numbers.)
IV. Prove convex set 10 points Given $p \in C$, $C$ is a convex set on $R^p$, now we have $S = {x  x=A\rho ,\rho \in C}$ where $A$ is an $n \times p$ matrix, prove that $S$ is a convex set.
V. Graphical Method to Find the Minimum of Linear Programming. 10 points
VI. Write down the dual form of the minimal problem. 10 points
VII. 15 points
 Describe the AdaBoost algorithm
 Describe the Bagging algorithm
 Briefly describe ways to enhance diversity
VIII. 25 points Given a neural network, $d$dimensional input $x = [x_{1},…,x_{d}]$, a single hidden layer with $p$ hidden layer nodes, output layer with $l$ nodes, output is $y = [\hat y_{1}^{k},…,\hat y_{l}^{k}]$
 Write the input of the hidden layer and the input of the output layer
 Write out the mean squared loss function $E_{k}$
 Given the learning rate $\eta$, find $\frac{\partial E_{k}}{w_{hj}}$ (Reflect the derivation process)
 Provide the pseudocode for backpropagation
 Explain gradient descent and some common variants, and compare their differences