Machine Learning Course Review Notes
1. Introduction
Concept of Machine Learning
Machines can acquire certain skills without being explicitly programmed with deterministic rules.
Performance P, Task T, Experience E: a program is said to learn if it uses experience E to improve its performance P on task T.
Classified by task type: regression, classification, clustering, dimensionality reduction.
Classified by learning method: supervised, unsupervised, reinforcement learning.
2. Model Evaluation and Selection
Key chapter: various evaluation metrics, including writing code and calling library functions.
Examples such as covariance matrix, precision, confusion matrix, ROC curve
Basic knowledge points such as overfitting problems and solutions
Empirical error: error on the training set (training error)
Generalization error: error on future samples
The smaller the generalization error, the better; a smaller empirical error is not necessarily better (it may indicate overfitting).
Evaluation Methods
Hold-out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
- Maintain consistent data distribution
- Multiple random splits, repeat experiments and take the average
- Appropriate test set ratio
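A minimal hold-out sketch with scikit-learn, assuming a feature matrix `X` and labels `y` are already loaded; `stratify=y` keeps the class distribution consistent, and repeating the split with different seeds and averaging follows the points above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# assume X (n_samples, n_features) and y (n_samples,) are already loaded
scores = []
for seed in range(10):                      # repeat the random split and average
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))
print(np.mean(scores))
```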
Cross-validation
First, divide the dataset into k equally sized mutually exclusive subsets, ensuring consistent data distribution in each subset, then select k-1 subsets as the training set, and the remaining one as the test set. This way, k sets of training/test sets can be obtained, allowing for k rounds of training and testing.
Specifically, if $k=m$, cross-validation becomes Leave One Out:
- Not affected by the random sample partitioning method, as there is only one partitioning method—one sample per subset;
- The training set contains as many samples as possible, making the model being evaluated very similar to the expected model;
- High training cost
- The estimated result may not be more accurate than other evaluation methods
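A sketch of k-fold cross-validation and leave-one-out with scikit-learn; the estimator choice and the placeholder data `X`, `y` are assumptions:

```python
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)

# k-fold: k rounds of training/testing, score averaged over the k folds
kfold_scores = cross_val_score(model, X, y,
                               cv=KFold(n_splits=10, shuffle=True, random_state=0))

# leave-one-out: k = m, one sample per test set (high training cost for large m)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(kfold_scores.mean(), loo_scores.mean())
```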
Bootstrap
Construct the training set by sampling with replacement; about $\frac{1}{e} \approx 0.368$ of the original samples are never drawn, and these out-of-bag samples serve as the test set.
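A small numpy check of the bootstrap argument: sampling $m$ indices with replacement leaves roughly $1/e \approx 0.368$ of the samples un-drawn, and these out-of-bag samples can serve as the test set:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
idx = rng.integers(0, m, size=m)          # sample m indices with replacement
oob = np.setdiff1d(np.arange(m), idx)     # indices never drawn (out-of-bag)
print(len(oob) / m)                       # close to 1/e ≈ 0.368
```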
Performance Metrics
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2$$
$$MAE = \frac{1}{m}\sum_{i=1}^{m}\vert f(x_i)-y_{i}\vert$$
The horizontal axis of the ROC curve is the false positive rate (FPR), the vertical axis is the true positive rate (TPR); the area under the curve is the AUC, and larger is better.
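A hedged sketch of computing these metrics with sklearn.metrics; `y_true`/`y_pred` (regression targets and predictions) and `y_test`/`y_label`/`y_score` (classification labels, predicted labels, and decision scores) are assumed to already exist:

```python
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             confusion_matrix, precision_score,
                             roc_curve, roc_auc_score)

mse = mean_squared_error(y_true, y_pred)        # regression: MSE
mae = mean_absolute_error(y_true, y_pred)       # regression: MAE

cm = confusion_matrix(y_test, y_label)          # classification: confusion matrix
prec = precision_score(y_test, y_label)         # precision = TP / (TP + FP)
fpr, tpr, thresholds = roc_curve(y_test, y_score)   # ROC: FPR vs TPR
auc = roc_auc_score(y_test, y_score)            # area under the ROC curve
```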
Comparison Test
Performance on the test set does not by itself reflect the true generalization performance of the model, so statistical hypothesis testing is needed to infer whether one learner is statistically superior to another. Reasons:
- Test performance is not equal to generalization performance
- Test performance varies with the test set
- Machine learning algorithms themselves have some randomness
Bias and Variance
For regression tasks, the generalization error can be decomposed into the sum of bias, variance, and noise.
$$E(f;D)= bias^{2}(x)+var(x)+ \epsilon^2 $$
As training proceeds, the generalization error generally shifts from being bias-dominated to variance-dominated.
3. Linear Models
Concepts of gradient and covariance: definition, calculation, and solving (by hand and in code)
Least squares: take the derivative of the loss function, set it to zero, and solve:
$$w= \frac{\sum\limits_{i=1}^{m}y_{i}(x_{i}-\bar x)}{\sum\limits_{i=1}^{m}x_{i}^{2}-\frac{1}{m}\left(\sum\limits_{i=1}^{m}x_{i}\right)^2}$$
$$b = \frac{1}{m}\sum\limits_{i=1}^{m}(y_{i}- wx_{i})$$
Logistic regression: use a linear model for classification via the logistic (sigmoid) function.
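A numpy sketch of the closed-form solution above for one-dimensional least squares (the toy data are arbitrary):

```python
import numpy as np

def least_squares_1d(x, y):
    """Closed-form w, b for y = w*x + b (simple linear regression)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
    b = np.mean(y - w * x)
    return w, b

w, b = least_squares_1d([0, 1, 2], [2, 1, 2])   # small toy data
print(w, b)
```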
Linear Discriminant Analysis (LDA): given a training set, project the samples onto a line so that points of the same class are as close as possible and points of different classes are as far apart as possible; using the within-class and between-class scatter (covariance) matrices, the optimization goal is to maximize the generalized Rayleigh quotient.
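A minimal scikit-learn call for LDA, projecting onto a single direction; `X`, `y` are placeholder data:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)   # project onto a single line
X_proj = lda.fit_transform(X, y)   # same-class points close, different classes far apart
y_pred = lda.predict(X)
```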
Multiclass learning: one-vs-one ($\frac{n(n-1)}{2}$ classifiers), one-vs-all ($n$ classifiers), many-vs-many (ECOC)
Gradient descent: $w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w}$
Sample covariance:
$$\mathrm{Cov}(x,y) = \frac{\sum_{i=1}^{n} (x_{i}-\bar x)(y_{i} - \bar y)}{n-1}$$
$$\mathrm{Cov}(x,y) = \mathrm{Cov}(y,x), \qquad \mathrm{Cov}(x,x) = \mathrm{Var}(x)$$
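A quick numpy check of the sample covariance formula and the two identities above (random data used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

def cov(x, y):
    """Sample covariance with the 1/(n-1) normalization used above."""
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

print(np.isclose(cov(x, y), np.cov(x, y)[0, 1]))   # matches np.cov
print(np.isclose(cov(x, y), cov(y, x)))            # Cov(x, y) = Cov(y, x)
print(np.isclose(cov(x, x), np.var(x, ddof=1)))    # Cov(x, x) = Var(x)
```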
4. Neural Networks
Key Chapter
BP derivation, calculation, error calculation, algorithm
Perceptron
Original Form
Here $L(w,b)$ is the loss function, which is related to the total distance of all misclassified points to the separating hyperplane; training is the process of driving this loss to converge. The perceptron algorithm is convergent: when the data are linearly separable, the perceptron finds a separating hyperplane that completely separates the training data after a finite number of updates.
Dual Form
Neural Networks
Backpropagation
Algorithm steps: Initialization - Forward calculation of sample output - Calculate output layer gradient $g$ - Calculate hidden layer gradient $e$ - Update connection weights and thresholds
Forward calculation: thresholds (biases) are subtracted from the weighted inputs before applying the activation function
Backpropagation: Gradient of output layer nodes:
$$ g_{j} = \hat y_{j}(1- \hat y_{j})(y_{j}- \hat y_j) $$
where $\hat y_j$ is the computed output value and $y_{j}$ is the expected value (true label).
Gradient of hidden layer nodes:
$$e_{h} = b_{h}(1-b_{h})\sum_{j=1}^{l}w_{hj}g_{j}$$
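A compact numpy sketch of one BP update on a single sample for a single-hidden-layer sigmoid network, using the output-layer gradient $g_j$ and hidden-layer gradient $e_h$ above (array shapes, parameter names, and the learning rate are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bp_step(x, y, V, gamma, W, theta, eta=0.1):
    """One BP update for a single-hidden-layer sigmoid network.

    x: (d,) input, y: (l,) target
    V: (d, q) input-to-hidden weights, gamma: (q,) hidden thresholds
    W: (q, l) hidden-to-output weights, theta: (l,) output thresholds
    """
    # forward pass (thresholds are subtracted, as in the notes)
    b = sigmoid(x @ V - gamma)          # hidden-layer outputs b_h
    y_hat = sigmoid(b @ W - theta)      # network outputs ŷ_j

    # backward pass
    g = y_hat * (1 - y_hat) * (y - y_hat)        # output-layer gradient g_j
    e = b * (1 - b) * (W @ g)                    # hidden-layer gradient e_h

    # update connection weights and thresholds
    W += eta * np.outer(b, g)
    theta -= eta * g
    V += eta * np.outer(x, e)
    gamma -= eta * e
    return 0.5 * np.sum((y_hat - y) ** 2)        # squared error on this sample
```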
Overfitting
Coping strategies: early stopping and regularization
Early Stopping
- If the change in training error is small for several consecutive rounds, stop training
- Use a validation set: if the training error decreases but the validation error increases, stop training
Regularization
Add a term describing the complexity of the network to the error objective function, making the network prefer smaller connection weights and thresholds, and making the network output smoother.
Local minima
Simulated annealing, random perturbation, genetic algorithms, momentum, etc.
Activation Function
The role is to increase the nonlinear expression ability of the network. Requirements:
- Continuous and differentiable nonlinear function
- The function and its derivative should be as simple as possible, facilitating computation
- The range of the derivative should lie in a suitable interval, otherwise the efficiency and stability of training suffer.
$$\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$$
5. Optimization Theory and Methods
Prove convex set, convex function, graphical method, dual problem
Introduction
Convex set: a set $S$ in n-dimensional Euclidean space is convex if the line segment between any two of its points lies entirely within the set, i.e. for any $x_{1},x_{2} \in S$ and $\lambda \in [0,1]$, $\lambda x_{1}+ (1-\lambda) x_{2} \in S$.
Convex function: $f$ is convex on its (convex) domain if for any $x_{1},x_{2}$ and $\lambda \in [0,1]$, $f(\lambda x_{1}+ (1-\lambda ) x_{2}) \leq \lambda f(x_{1}) + (1- \lambda)f(x_2)$. The negative of a convex function is concave, and the negative of a concave function is convex. If the Hessian matrix is positive semi-definite, the function is convex; if it is positive definite, the function is strictly convex.
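A small numpy sketch of the Hessian test for the quadratic form $ax_1^2+bx_2^2+cx_1x_2+dx_1+ex_2$ (the same form as exam question 3 below); the Hessian is constant, so its eigenvalues decide convexity:

```python
import numpy as np

def check_convexity(a, b, c):
    """Hessian of a*x1^2 + b*x2^2 + c*x1*x2 + d*x1 + e*x2 (linear terms do not matter)."""
    H = np.array([[2 * a, c],
                  [c, 2 * b]], dtype=float)
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "strictly convex (Hessian positive definite)"
    if np.all(eigvals >= 0):
        return "convex (Hessian positive semi-definite)"
    return "not convex"

print(check_convexity(1, 1, 1))   # x1^2 + x2^2 + x1*x2: strictly convex
print(check_convexity(1, 1, 3))   # indefinite Hessian: not convex
```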
Convex programming: minimizing a convex function over a convex set. A local minimum of a convex program is a global minimum, and the set of minimum points is a convex set. If the objective function is strictly convex and a minimum exists, the minimizer is unique.
Linear Programming
The standard form of linear programming is:
$$
\begin{align*}
\min \quad & cx \\
\text{s.t.} \quad & Ax = b, \\
& x \geq 0
\end{align*}
$$
- Graphical Method
- Basic Properties
- The feasible region of linear programming is a convex set
- If there is an optimal solution, it must be at an extreme point
- Basic feasible solution: in the standard form of linear programming, let the rank of $A$ be $m$ and write $A=[B,N]$, where $B$ is an invertible $m \times m$ matrix. Then
$$ x= \begin{bmatrix} x_{B} \\ x_{N} \end{bmatrix} = \begin{bmatrix} B^{-1}b \\ 0 \end{bmatrix} $$
is a basic solution; $B$ is the basis matrix and the components of $x_{B}$ are the basic variables. If a basic solution has no negative components, it is a basic feasible solution. Working through examples helps.
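For small problems the graphical method applies; numerically, a standard-form LP can be sketched with scipy.optimize.linprog (all coefficients below are made-up placeholders):

```python
import numpy as np
from scipy.optimize import linprog

# min c^T x  s.t.  A x = b,  x >= 0   (standard form above; numbers are illustrative)
c = np.array([1.0, 2.0, 0.0])
A_eq = np.array([[1.0, 1.0, 1.0]])
b_eq = np.array([4.0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 3)
print(res.x, res.fun)   # the optimum lies at an extreme point of the feasible region
```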
6. Support Vector Machines
Key chapter: kernel functions, functional margin, geometric margin, support vectors, the three forms of SVM and their dual problems, calling support vector machines in sklearn
Functional margin: $\hat \gamma_i = y_{i}(wx_i+b)$. The functional margin of the hyperplane with respect to the training set is the minimum functional margin over all sample points. Geometric margin: $\gamma_{i} = \frac{\hat \gamma_{i}}{\Vert w \Vert}$.
Linearly Separable SVM
Linear SVM
Nonlinear SVM - Kernel Trick
Support Vector
SVM Features
■Advantages
- Has good generalization ability, especially on small training sets, where it can perform much better than other algorithms. This is because its optimization objective is structural risk minimization rather than empirical risk minimization: through the margin it obtains a structured description of the data distribution, which reduces the requirements on data scale and data distribution.
- Has strong mathematical theoretical support, basically does not involve probability measures and laws of large numbers, etc.
- Introducing kernel functions can solve nonlinear classification problems.
■Disadvantages
- Inconvenient for solving multi-classification problems.
- There are hyperparameters that have a significant impact on the results, such as the penalty term C and the kernel parameter gamma when using the rbf kernel function, which can only be obtained through exhaustive search or empirical estimation.
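A minimal scikit-learn sketch illustrating the call mentioned in the chapter summary and the hyperparameters above (`C` and `gamma` for the rbf kernel); `X`, `y` are placeholders and the grid values are arbitrary:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# rbf-kernel SVM; the penalty C and gamma strongly affect the result
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

best = search.best_estimator_
print(search.best_params_)
print(best.support_vectors_.shape)   # the support vectors found by the model
```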
7. Ensemble Learning
Adaboost Algorithm
Input: training set $D= \{(x_1,y_1),(x_2,y_2),\dots,(x_m,y_{m})\}$ with $y_{i} \in \{-1,+1\}$; output: the final classifier $H(x)$. The steps are as follows:
- Set the number of weak classifiers to $T$ and let $t=1$. Initialize the sample weight distribution uniformly: the $m$-dimensional vector $D_1$ holds the initial sample weights, $$D_{1} = \left(\frac{1}{m},\frac{1}{m},\dots,\frac{1}{m}\right)^T$$
- Use the training sample set $D$ with weight distribution $D_t$ to learn the $t$-th weak classifier $h_t$
- Calculate the misclassification rate $\epsilon_t$ of $h_t$ on the training sample set $D$, if $\epsilon_t>0.5$, discard it
- Determine the combination weight $\alpha_t$ of the weak learner $h_t$; the smaller the error rate of $h_t$, the larger $\alpha_t$ should be, with $$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_{t}}$$
- Update the sample weights using the weak classifier $h_t$ and its weight $\alpha_t$: $$D_{t+1,i}= \frac{D_{t,i}\exp(-\alpha_{t} y_{i}h_{t}(x_{i}))}{Z_{t}}$$ where $Z_{t}$ is a normalization factor
- If $t<T$, let $t= t+1$, and return to step 2, otherwise execute step 7
- Combine the $T$ classifiers $h_1,h_2,\dots,h_T$ according to their weights $\alpha_t$ to obtain the final classifier $$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\right)$$
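A hedged sketch using sklearn's AdaBoostClassifier, which broadly follows the procedure above (assumes sklearn >= 1.2 for the `estimator` argument; `X`, `y` are placeholder data); decision stumps serve as the weak classifiers $h_t$, and `estimator_weights_` corresponds to the $\alpha_t$:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# T weak classifiers (decision stumps), combined with weights alpha_t
clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50)
clf.fit(X, y)
print(clf.estimator_weights_[:5])   # the learned alpha_t values
```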
Bagging Algorithm Steps
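A minimal sklearn sketch of Bagging (bootstrap-sample the training set, train one base learner per sample, combine predictions by voting), assuming sklearn >= 1.2 and placeholder data `X`, `y`; `oob_score=True` evaluates on the out-of-bag samples:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# each base learner is trained on a bootstrap sample; predictions are combined by voting
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=20,
    bootstrap=True,
    oob_score=True)      # evaluate on the out-of-bag samples
bag.fit(X, y)
print(bag.oob_score_)
```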
Methods to Enhance Diversity
- Data sample perturbation: usually based on sampling methods, e.g. Bagging uses bootstrap sampling and AdaBoost uses sequential sampling.
- Input attribute perturbation: random subspace method, i.e. extract several attribute subsets from the initial attribute set and train a base learner on each subset.
- Output representation perturbation: manipulating the output representation can enhance diversity. Flipping method: randomly change the labels of some training samples. Output modulation method: convert classification outputs to regression outputs and construct individual learners on the sub-tasks. ECOC method: use error-correcting output codes to decompose a multi-class task into a series of binary classification tasks for the base learners.
- Algorithm parameter perturbation: randomly set different parameters. Negative correlation method: explicitly force individual neural networks to use different parameters through regularization terms.
2020 Level Exam Questions
1. 10 points
Original question from the Chinese Academy of Sciences graduate exam, two questions totaling 10 points
2. 10 points
- What is Leave One Out?
- Given the following data, use the linear model $y=wx+b$, calculate the (overall) mean squared error using Leave One Out.
| x | y |
|---|---|
| 0 | 2 |
| 1 | 1 |
| 2 | 2 |
(In any case there were three data points; the exact values may differ.)
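A hedged numpy check for this kind of question, using the (possibly inexact) data above: fit $y=wx+b$ on the two remaining points each time, predict the held-out point, and average the squared errors:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
y = np.array([2.0, 1.0, 2.0])
m = len(x)

errors = []
for i in range(m):
    mask = np.arange(m) != i                     # leave sample i out
    w, b = np.polyfit(x[mask], y[mask], deg=1)   # least-squares fit y = w*x + b
    errors.append((w * x[i] + b - y[i]) ** 2)

print(np.mean(errors))   # leave-one-out mean squared error
```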
3. Determine whether the following are convex functions 10 points
(1) $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
(2) $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
(Both were of this form, but I don't remember the specific coefficients.)
4. Prove Convex Set 10 points
Let $C$ be a convex set in $R^p$ and let $S = \{x \mid x=A\rho ,\ \rho \in C\}$, where $A$ is an $n \times p$ matrix. Prove that $S$ is a convex set.
5. Use graphical method to find the minimum value of linear programming. 10 points
6. Write the dual form of the minimization problem. 10 points
7. 15 points
- Describe the AdaBoost algorithm
- Describe the Bagging algorithm
- Briefly describe the methods to enhance diversity
8. 25 points Given a neural network, $d$-dimensional input $x = [x_{1},…,x_{d}]$, single hidden layer with $p$ hidden nodes, $l$ output nodes, output is $y = [\hat y_{1}^{k},…,\hat y_{l}^{k}]$
- Write the input of the hidden layer and the input of the output layer
- Write the mean square loss function $E_{k}$
- Given the learning rate $\eta$, find $\frac{\partial E_{k}}{\partial w_{hj}}$ (show the derivation process)
- Provide pseudocode for backpropagation
- Explain gradient descent and some common variants, and compare their differences


