Machine Learning Course Review Notes

1. Introduction

Concept of Machine Learning

The ability to equip machines with a skill without deterministic programming Performance Task Experience, the program has improved P over T using E, that is learning Divided by task types: regression, classification, clustering, dimensionality reduction Divided by learning methods: supervised, unsupervised, reinforcement learning

2. Model Evaluation and Selection

Key section Various evaluation metrics, including writing code, calling library functions Example problems like covariance matrix, precision, confusion matrix, ROC curve Basic concepts like overfitting problem and solutions

Empirical error: Error on the training set (training error) Generalization error: Error on future samples The smaller the generalization error, the better, but not the empirical error.

Evaluation Methods

Hold-out method


  • Maintain consistent data distribution
  • Multiple random splits, repeated experiments to get average value
  • Appropriate test set ratio

Cross-validation method

First, divide the dataset into k mutually exclusive subsets of the same size, ensuring the data distribution of each subset is consistent, then select k-1 subsets as the training set, the remaining as the test set. This way, k groups of train/test sets can be obtained, allowing for k training and tests.


model = linear_model.Lasso()
cv_results = sklearn.model_selection.cross_validate(model,X,y,cv=3)
# Second method
kf = KFold(3)
for train_idx,test_idx in kf.split(X):
    X_train,X_test = X[train_idx],X[test_idx]
    y_train,y_test = y[train_idx],y[test_idx]

Especially, if $k=m$ then cross-validation becomes Leave One Out:

  • Not affected by random sample split, because there is only one way to split - one sample per subset;
  • The training set contains as many samples as possible, making the actually evaluated model very similar to the expected model to evaluate;
  • High training cost
  • Estimation result may not be more accurate than other evaluation methods

Bootstrap method

Use replacement to repeatedly sample to construct the dataset, $\frac{1}{e} = 0.368$ of the original data will never be sampled, which then serves as the test set.

Performance Metrics

$$MSE = \frac{1}{m}\sum_{i=1}^{m}(f(x_i)-y_i)^2$$ $$MAE = \frac{1}{m}\sum_{i=1}^{m}\vert f(x_i)-y_{i}\vert$$ image.png image.png image.png image.png

The ROC curve’s horizontal axis is the false negative rate, and its vertical axis is the true positive rate, enclosed area is AOC, the larger, the better.

Comparison Test

The performance on the test set cannot reflect the true performance of the model, statistical hypothesis testing is needed to infer statistically whether it is superior. Reasons:

  • Testing performance is not equal to generalization performance
  • Testing performance varies with the changes in the test set
  • Machine learning algorithms themselves have certain randomness Bias and Variance For regression tasks, the generalization error can be decomposed into bias, variance, and noise sum. $$E(f;D)= bias^{2}(x)+var(x)+ \epsilon^2 $$ image.png

In the process of training, it generally gradually transitions from bias dominance to variance dominance.

// Leave-One-Out method
// k-fold cross-validation
// Repeated p-times k-fold cross-validation
// Stratified k-fold cross-validation
// Leave-One-Out method
// Performance metrics
// Confusion Matrix
// Plot P-R curve
precision, recall, thresholds = precision_recall_curve(y_true, y _pred)
plt.plot(recall, precision)

3. Linear Models

Concepts, calculations, and solutions (manual, code) of gradients (covariance)

Least squares method: Derive the loss function, set the derivative to 0, solve to obtain: $$w= \frac{\sum\limits_{i=1}^{m}y_{i}(x_{i}-\bar x)}{\sum\limits_{i=1}^{m}x_{i}^{2}-\frac{1}{m}(\sum\limits_{i=1}^{m}x_{i})^2}$$ $$b = \frac{1}{m}\sum\limits_{i=1}^{m}(y_{i}- wx_{i})$$ Logistic regression: Using a linear model for classification tasks, logistic function (sigmoid) Linear Discriminant Analysis (LDA): Given a training set, aim to project the samples onto a line, such that points of the same class are as close as possible and points of different classes are as far apart as possible, optimizing the objective by maximizing the Rayleigh quotient. Multiclass learning: One-vs-One (classifier for each $\frac{n*(n-1)}{2}$ pair), One-vs-All (n classifiers), Many-vs-Many (ECOC)

def train(epoches = 1 ,l r = 0.01):
	for epoch in range(epoches):
		# Forward propagation
		output = # Compute output y'=wx
		loss = (target - output).pow(2).sum()
		loss.backward() -=  lr* w.grad

Gradient Descent: $w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w}$ $$Conv(x,y) = \frac{\sum_{i=1}^{n} (x_{i}-\bar x)(y_{i} - \bar y)}{n-1}$$ $$Conv(x,y) = Conv(y,x)$$ $$Conv(x,x) = Var(x)$$

a = np.array([1,3,4,5])
b = np.array([2,6,2,2])
z = np.stack((a,b))

4. Neural Networks

Key sections BP derivation, computation, error calculation, algorithm


Original form


Where $L(w,b)$ represents the loss function, which is related to the distance of all misclassified points to the hyperplane. The training process is a convergence process of the loss function. The perceptron algorithm is convergent: when the data is linearly separable, the perceptron can find a separating hyperplane that perfectly separates the training data after a finite number of searches.

Dual form


Neural Networks


Algorithm steps: Initialization-Forecast calculation of sample output-Calculate the gradient of the output layer $g$-Calculate the gradient of the hidden layer $e$-Update connection weights and thresholds Forward calculation: bias values are to be subtracted Backpropagation: The gradient of the output layer nodes: $$ g = \hat y_{j}(1- \hat y_{j})(y_{j}- \hat y_j) $$ where $\hat y_j$ is the calculated output value, and $y_{j}$ is the expected value (real value, label) The gradient of the hidden layer nodes: $$e = b_n(1-b_n)\sum_{j=1}^{l}w_{hj}g_{j}$$ image.png image.png


Countermeasures: Early Stopping & Regularization Early Stopping

  • Stop training if the training error changes little over several rounds
  • Use a validation set: Stop training if training error decreases but validation error increases Regularization Add a term describing the network’s complexity to the error target function, making the network prefer smaller connection weights and thresholds, resulting in a smoother network output. Local Minima Simulated annealing, random perturbation, genetic algorithms, momentum mechanisms….

Activation Functions

Their purpose is to increase the network’s non-linear expressive capability. Requirements:

  • Continuous and differentiable non-linear function
  • The function and its derivative should be as simple as possible, facilitating computation
  • The value range of the derivative should be in a suitable interval, otherwise, it will affect the training efficiency and stability $$sigmoid = \frac{1}{1+e^{-x}}$$

5. Optimization Theory and Methods

Proof of convex sets, convex functions, graph methods, dual issues


Convex set: In an n-dimensional Euclidean space, if the line connecting any two points still lies within this space, it is a convex set. Formally described as $\lambda x_{1}+ (1-\lambda) x_{2} \in S$

Convex function: In the domain, for any $x$ if $f(\lambda x_{1}+ (1-\lambda ) x_{2}) <= \lambda f(x_{1}+ (1- \lambda)f(x_2))$, it is a convex function. Convex functions are concave, and concave functions are convex. If the Hessian matrix is semi-positive definite, the function is convex, and if it’s positive definite, the function is strictly convex.

Convex programming: Seeking the minimal point of a convex function on a convex set. The local minimum point in convex programming is the global minimum point, and the set of minimal points is a

convex set. The objective function is a strictly convex function + the existence of a minimal point = the minimal point is unique.

Linear Programming

The standard form of linear programming is: $$
\begin{align*} min \quad cx \ s.t. \quad Ax = b, \ x \geq 0, \end{align*} $$

  1. Graphical Method
  2. Basic Properties
  • The feasible domain of linear programming is a convex set
  • If there exists an optimal solution, it is necessarily obtained at an extreme point
  1. Feasible Solution In the standard form of linear programming, suppose the rank of matrix A is m and suppose $A=[B,N]$ where B is an m-order invertible matrix, then $$ x= \begin{bmatrix} x_{B} \ x_{N} \end{bmatrix} = \begin{bmatrix} B^{-1}b \ 0 \end{bmatrix} $$ $$$$ is a basic solution. B is a basis matrix, whose components are basic variables. If the basic solution does not contain negative numbers, it is a basic feasible solution. Example problems can facilitate understanding.

6. Support Vector Machine

Key section Kernel function, functional margin, geometric margin, support vectors, three forms, and dual problem sklearn calling support vector machine

Functional margin: $\hat \gamma_i = y(wx_i+b)$ The functional margin of the hyperplane relative to all training sets is the minimum value among all sample points’ functional margins. Geometric margin: $\gamma = \frac{\hat \gamma_{i}}{\vert \vert w \vert \vert}$

Linearly Separable SVM

image.png image.png

Linear SVM

image.png image.png

Nonlinear SVM - Kernel Trick


Support Vectors


SVM Features


  • Has good generalization capability, especially achieving much better results on small training datasets than other algorithms. This is because its optimization goal is the minimization of structural risk rather than empirical risk, thus through the margin, it can achieve a structured description of data distribution, thereby reducing the requirements on data scale and distribution.
  • Has strong mathematical theoretical support, basically not involving probability measurements and laws of large numbers, etc.
  • Introducing kernel functions can solve nonlinear classification problems. ■Disadvantages
  • Not convenient for solving multi-classification problems.
  • There exist hyperparameters that have a significant impact on the results, such as when using the RBF kernel function, penalty term C and kernel parameter gamma, can only be obtained through exhaustive search or empirical inference.
import numpy as np
from sklearn.svm import SVC

x = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])  # Independent variables
y = np.array([1, 1, 2, 2])  # Dependent variables
model = SVC()  # Define the model, y)  # Train the model with data
pred = model.predict([[-0.8, -1]])  # Predict the result for x=[-0.8,-1]
print(pred)  # Output y=[1]

7. Ensemble Learning

Adaboost Algorithm

Given a training set $D= {(x_1,y_1),(x_2,y_2),…,(x_m,y_{m})} \quad y={-1,1}$, output the final classifier $H(x)$, the specific steps are as follows:

  1. Set the number of weak classifiers to $T$, let $t=1$, initialize the training set sample weight distribution using a uniform distribution, and let $m$-dimensional vector $D_1$ represent the sample weights to be updated for the first time, with initial value $$D_{t} = (\frac{1}{m},\frac{1}{m},…,\frac{1}{m})^T $$
  2. Learn the $t$-th weak classifier $h_t$ using the training sample set $D$ with weight distribution $D_t$
  3. Calculate the error rate $\epsilon_t$ of $h_t$ on the training sample set $D$, if $\epsilon_t>0.5$, then discard it
  4. Determine weak learner $h_t$’s combination weight $\alpha_t$, the smaller the error rate $\epsilon_t$ of $h_t$, the greater its weight $\alpha_t$ should be, with $$\alpha_t = \frac{1}{2}ln\frac{1-\epsilon_t}{\epsilon_{t}}$$
  5. Update the sample weight based on the classification error rate $\epsilon_t$ of weak classifier $h_t$: $$D_{t+1}= \frac{D_{t,i}exp(-\alpha_{t} y_{i}h_{t}(x_{i}))}{Z_{t}}$$
  6. If $t<T$, let $t= t+1$, and return to step 2, otherwise go to step 7
  7. Combine the $T$ classifiers $h_1,h_2,…,h_T$ according to each weight $\alpha_t$, to obtain the final classifier $$H(x) = sign(\sum_{t=1}^{t}\alpha_{t}h_{t}(x))$$

Bagging Algorithm Steps


Ways to Enhance Diversity

  • Data Sample Disturbance Usually based on sampling methods, such as Bagging using bootstrapping, AdaBoost employs sequential sampling

  • Input Feature Disturbance Random subspace: Extract several attribute subsets from the initial attribute set, and train a base learner for each attribute subset

  • Output Representation Disturbance Manipulating output representation can enhance diversity. Flip method: Randomly change the labels of some training samples to transform the output representation Output modulation method: Convert classification output to regression output, and build individual learners to solve subtasks ECOC method: Use error-correcting output codes to decompose multi-class tasks into a series of binary classifications to train base learners

  • Algorithm Parameter Perturbation Randomly set different parameters. Negative correlation method: explicitly force individual neural networks to use different parameters through regularization terms.


Reference Note One: Code-Oriented

20th Level Real Questions

I. 10 points

Image display requires the network environment of chemistry

Original question of the Chinese Academy of Sciences graduate exam, two questions total 10 points

II. 10 points

  1. What is the leave-one-out method?
  2. Given the following data, using the linear model $y=wx+b$, calculate the (overall) mean squared error using the leave-one-out method.
    x y
    0 2
    1 1
    2 2

(Anyway, it’s three sets, but the specific data might not be this.)

III. Determine whether the following are convex functions 10 points

  1. $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$
  2. $ax_{1}^{2} + bx_{2}^{2} + cx_{1}x_{2} +dx_{1} +ex_2$ (Both are in this form, but I don’t remember the specific numbers.)

IV. Prove convex set 10 points Given $p \in C$, $C$ is a convex set on $R^p$, now we have $S = {x | x=A\rho ,\rho \in C}$ where $A$ is an $n \times p$ matrix, prove that $S$ is a convex set.

V. Graphical Method to Find the Minimum of Linear Programming. 10 points

VI. Write down the dual form of the minimal problem. 10 points

VII. 15 points

  1. Describe the AdaBoost algorithm
  2. Describe the Bagging algorithm
  3. Briefly describe ways to enhance diversity

VIII. 25 points Given a neural network, $d$-dimensional input $x = [x_{1},…,x_{d}]$, a single hidden layer with $p$ hidden layer nodes, output layer with $l$ nodes, output is $y = [\hat y_{1}^{k},…,\hat y_{l}^{k}]$

  1. Write the input of the hidden layer and the input of the output layer
  2. Write out the mean squared loss function $E_{k}$
  3. Given the learning rate $\eta$, find $\frac{\partial E_{k}}{w_{hj}}$ (Reflect the derivation process)
  4. Provide the pseudocode for backpropagation
  5. Explain gradient descent and some common variants, and compare their differences
Buy me a coffee~
Tim AlipayAlipay
Tim PayPalPayPal
Tim WeChat PayWeChat Pay