CSPCC5002. Notes - Machine learning

These notes provide a concise overview of the Machine Learning topics listed in the syllabus.

Table of contents

IMPORTANT
UNIT 1. Introduction
UNIT 2. Supervised Learning Methods
UNIT 3. Unsupervised Learning Methods
UNIT 4. Integrate Learning Methods
UNIT 5. Reinforcement Learning

IMPORTANT

We recommend reading the books listed in the syllabus. The following notes were prepared with the help of an LLM and edited by me.

`UNIT 1`. Introduction

TODO

`UNIT 2`. Supervised Learning Methods

Classification and Regression Trees (CART), Regression, Support Vector Machines (SVM), Kernel Functions

2.1. Classification and Regression Trees (CART)

Split data hierarchically for both classification and regression tasks

Purpose: Used for both classification and regression tasks. The model splits the dataset into smaller subsets (binary splits) based on feature values, forming a tree structure.
Key Concepts:
- Root node: Starting point of the tree (contains the entire dataset).
- Decision nodes: Internal nodes that represent decisions based on a feature.
- Leaf nodes: End nodes that represent a final prediction (class or value).
Steps:
1. Choose the best feature to split the data (based on a criterion).
2. Recursively split the data until stopping criteria are met (e.g., max depth, min samples per leaf).
Criteria for Splitting:
- Gini Index: Measures impurity for classification tasks.
- Entropy/Information Gain: Measures reduction in entropy (uncertainty) after a split.
- Mean Squared Error (MSE): Used in regression tasks to minimize error between predicted and actual values.
Example: Predicting if a patient has a disease (classification) or predicting housing prices (regression).
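
To make the splitting criterion concrete, here is a minimal sketch of Gini impurity and a single-feature split search. The function names (`gini`, `best_split`) and the patient data are made up for illustration. A real CART implementation also handles multiple features, recursion, and stopping criteria.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Try every midpoint between sorted unique values of one feature and
    return (threshold, weighted_gini) for the purest binary split."""
    best = (None, float("inf"))
    points = sorted(set(values))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child's impurity by its share of the samples.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: patient age and whether the patient has the disease.
ages = [22, 25, 47, 52, 46, 56]
sick = [0, 0, 1, 1, 1, 1]
threshold, impurity = best_split(ages, sick)
print(threshold, impurity)
```

For a regression tree, the same search would score each split by the MSE of the two children instead of their Gini impurity.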

2.2. Regression

Models the relationship between variables

Purpose: To predict continuous values (real numbers) based on input features.

a. Linear Regression:

Formula: $$ y = \beta_0 + \beta_1x + \epsilon $$ where $y$ is the predicted value, $\beta_0$ is the intercept, $\beta_1$ is the coefficient (slope), $x$ is the input feature, and $\epsilon$ is the error term.
Assumptions:
- Linearity between input feature(s) and output.
- Homoscedasticity (constant variance of errors).
- No multicollinearity (for multiple features).
- Errors are normally distributed.
Loss Function: Minimize Mean Squared Error (MSE).
Use Case: Predicting a person’s salary based on years of experience.
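
As a small worked example, the MSE-minimizing line for one feature has a closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$. The sketch below assumes invented salary figures (in thousands), and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = beta0 + beta1 * x (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                # slope
    beta0 = mean_y - beta1 * mean_x  # intercept
    return beta0, beta1

# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [35, 40, 45, 50, 55]
beta0, beta1 = fit_line(years, salary)
print(beta0, beta1)
```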

b. Multiple Linear Regression:

Extends linear regression to multiple input features.

Formula: $$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon $$ where $x_1, x_2, \dots, x_n$ are multiple features.
Key Differences:
- Models the relationship between a dependent variable and multiple independent variables.
Use Case: Predicting house prices based on multiple factors like area, number of rooms, location, etc.
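
With several features the coefficients can be found by gradient descent on the MSE. This is one possible sketch, not the only fitting method (the normal equations are another). The tiny dataset, learning rate, and step count are assumptions chosen so that the loop converges.

```python
def predict(beta, x):
    """beta[0] is the intercept; beta[1:] are the feature coefficients."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def fit_gd(X, y, lr=0.1, steps=5000):
    """Minimize mean squared error with plain batch gradient descent."""
    n, d = len(X), len(X[0])
    beta = [0.0] * (d + 1)
    for _ in range(steps):
        errors = [predict(beta, x) - t for x, t in zip(X, y)]
        # Gradient of MSE with respect to the intercept, then each coefficient.
        grad = [2 * sum(errors) / n]
        for j in range(d):
            grad.append(2 * sum(e * x[j] for e, x in zip(errors, X)) / n)
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1, 3, 4, 6]
beta = fit_gd(X, y)
print([round(b, 4) for b in beta])
```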

c. Logistic Regression:

Purpose: Used for binary classification problems.
Formula: $$ P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}} $$ where $P(y=1|x)$ is the probability of the positive class.
Key Points:
- Uses the Sigmoid function to map predicted values to probabilities (between 0 and 1).
- The output is a probability score, typically thresholded at 0.5 to classify as 0 or 1.
Loss Function: Log Loss (or binary cross-entropy).
Use Case: Predicting if an email is spam or not (1 = spam, 0 = not spam).
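
The formula, the 0.5 threshold, and the log loss can be sketched in a few lines. The coefficients below are invented for illustration. In practice they are learned by minimizing the log loss over the training set.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """P(y = 1 | x) for a single feature x."""
    return sigmoid(beta0 + beta1 * x)

def log_loss(y_true, p):
    """Binary cross-entropy for one example with predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

beta0, beta1 = -4.0, 2.0              # hypothetical, already-fitted coefficients
p = predict_proba(3.0, beta0, beta1)  # linear score z = -4 + 2 * 3 = 2
label = 1 if p >= 0.5 else 0          # threshold at 0.5
print(p, label, log_loss(1, p))
```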

2.3. Support Vector Machines (SVM)

Classify data by maximizing the margin with a hyperplane

Purpose: Supervised learning method used for both classification and regression.
Concept: Finds the hyperplane that maximizes the margin between different classes.

a. Linear SVM:

Key Idea: A linear decision boundary (hyperplane) separates the data points of different classes.
Formulation:
- The hyperplane is defined as: $$ w^T x + b = 0 $$ where $w$ is the weight vector, $x$ is the input feature vector, and $b$ is the bias term.
Maximizing Margin: The objective is to maximize the distance between the closest data points (support vectors) from both classes and the hyperplane.
Loss Function: Hinge Loss.
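
A minimal sketch of the decision function and hinge loss, assuming labels are encoded as +1 and -1 and using hand-picked values for $w$ and $b$ (not a trained model). With the usual scaling, where support vectors satisfy $|w^T x + b| = 1$, the margin width is $2 / \|w\|$.

```python
import math

def decision(w, x, b):
    """Signed score w^T x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """Zero when the point is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * decision(w, x, b))

def margin_width(w):
    """Distance between the two margin boundaries: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

w, b = [1.0, 1.0], -3.0                  # hypothetical hyperplane x1 + x2 - 3 = 0
print(hinge_loss(w, b, [3.0, 2.0], +1))  # outside the margin: no loss
print(hinge_loss(w, b, [2.0, 1.5], +1))  # inside the margin: small loss
print(hinge_loss(w, b, [1.0, 1.0], +1))  # wrong side: larger loss
print(margin_width(w))
```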

b. Non-Linear SVM:

Concept: For non-linearly separable data, transforms the feature space using a kernel function.
Kernel Trick: Instead of explicitly transforming data into higher dimensions, the kernel function computes the inner product of two points in the higher-dimensional space directly from their original coordinates.
Use Case: Classifying data where a simple linear boundary doesn’t work (e.g., spiral or circular patterns).
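
The kernel trick can be checked numerically on a small case. For 2-D inputs, the degree-2 polynomial kernel $(x^T z)^2$ equals the dot product of the explicit feature maps $\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$. The sketch below, with arbitrary example points, compares the two.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (no constant term) on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map into 3-D space for the kernel above."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, z)  # computed in the original 2-D space
via_map = dot(phi(x), phi(z))    # computed in the mapped 3-D space
print(via_kernel, via_map)
```

Both routes give the same value, but the kernel never builds the higher-dimensional vectors.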

2.4. Kernel Functions

Allow non-linear separation in higher-dimensional spaces

Purpose: Allow SVM to solve non-linear problems by mapping data into a higher-dimensional space.
Types:
- Linear Kernel: Simple dot product, used when data is linearly separable.
- Polynomial Kernel: Computes the polynomial of the input features (degree of the polynomial can be specified). $$ K(x_i, x_j) = (x_i^T x_j + c)^d $$
- Radial Basis Function (RBF) Kernel: Measures similarity based on the Euclidean distance between points. $$ K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) $$ where $\gamma$ controls the width of the Gaussian function (a larger $\gamma$ gives a narrower Gaussian).
- Sigmoid Kernel: Similar to neural networks’ activation function. $$ K(x_i, x_j) = \tanh(\alpha x_i^T x_j + c) $$
Selection: The choice of kernel depends on the problem’s nature; RBF is commonly used for general-purpose non-linear data.
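
The four kernels above can be written directly from their formulas. This is a plain-Python sketch; the default values for `c`, `d`, `gamma`, and `alpha` are arbitrary choices for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, c=1.0, d=2):
    return (dot(xi, xj) + c) ** d

def rbf_kernel(xi, xj, gamma=0.5):
    # Squared Euclidean distance between the two points.
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(xi, xj, alpha=0.1, c=0.0):
    return math.tanh(alpha * dot(xi, xj) + c)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v))
print(rbf_kernel(u, u), rbf_kernel([0.0, 0.0], [1.0, 1.0]))
print(sigmoid_kernel(u, v))
```

The RBF kernel returns 1 for identical points and decays toward 0 as the points move apart.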

`UNIT 3`. Unsupervised Learning Methods

Mixture Models, Expectation-Maximization (EM) Algorithm, Reinforcement Learning (RL), Generative Models

3.1. Mixture Models

Purpose: Probabilistic models that assume data is generated from a mixture of several distributions (e.g., Gaussian).
Key Concept: Each component in the mixture represents a distribution, and the overall model is a weighted sum of these components.
Common Example: Gaussian Mixture Model (GMM).
- Assumes data comes from a mixture of several Gaussian distributions.
- Useful for unsupervised learning tasks like clustering.
Formula: $$ p(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k) $$ where $\pi_k$ is the mixing coefficient (with $\pi_k \ge 0$ and $\sum_{k} \pi_k = 1$), $\mathcal{N}(x | \mu_k, \Sigma_k)$ is the Gaussian distribution with mean $\mu_k$ and covariance $\Sigma_k$, and $K$ is the number of components.
Use Case: Clustering, anomaly detection, speech recognition.
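
The mixture density is easiest to see in one dimension, where each covariance $\Sigma_k$ reduces to a variance $\sigma_k^2$. A small sketch with made-up component parameters:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, weights, mus, sigmas):
    """Weighted sum of component densities; weights must sum to 1."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# Hypothetical two-component mixture centred at -1 and +1.
weights, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
print(gmm_pdf(0.0, weights, mus, sigmas))
print(gmm_pdf(1.0, weights, mus, sigmas))
```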

3.2. Expectation-Maximization (EM) Algorithm

Purpose: Iterative algorithm to find maximum likelihood estimates of parameters in models with latent variables (e.g., Mixture Models).
Steps:
1. Expectation (E-step): Compute the expected value of the latent variables given the observed data and current parameter estimates.
2. Maximization (M-step): Maximize the likelihood function with respect to the parameters using the expectations from the E-step.
Convergence: Repeat until convergence (i.e., when parameters stabilize).
Use Case: Fitting Gaussian Mixture Models (GMMs), missing data problems.
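
The two steps can be sketched for a 1-D Gaussian mixture. Here the "expected value of the latent variables" is each component's responsibility for each point. The data, the initial guesses, the fixed iteration count, and the small variance floor are all assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, mus, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with a fixed number of EM iterations."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[j] = nj / len(data)
    return mus, variances, weights

# Hypothetical data with two well-separated clusters around -2 and +2.
data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
mus, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
print(mus, weights)
```

A fuller version would stop when the log-likelihood or the parameters change by less than a tolerance, rather than after a fixed number of iterations.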

3.3. Reinforcement Learning (RL)

Involves agents learning optimal strategies in dynamic environments based on rewards.

Purpose: A learning paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.
Key Concepts:
- Agent: Learner making decisions.
- Environment: The external system the agent interacts with.
- State (s): The current situation of the agent in the environment.
- Action (a): The decision or move the agent takes.
- Reward (r): Feedback the agent receives from the environment after taking an action.
- Policy ($\pi$): The strategy the agent follows to take actions based on the current state.
- Value Function (V): The expected cumulative reward from a given state.
- Q-Function (Q(s,a)): The expected cumulative reward for taking action $a$ in state $s$ and following the policy thereafter.
Algorithm Types:
- Q-learning: Off-policy RL method that learns the optimal Q-values for state-action pairs.
- Policy Gradient: Optimizes the policy directly by adjusting its parameters using gradients.
Use Case: Game playing (e.g., AlphaGo), robotics, autonomous driving.
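
For tabular Q-learning, the standard update moves $Q(s,a)$ toward the target $r + \gamma \max_{a'} Q(s', a')$. The sketch below applies one such update. The state and action names, the learning rate `alpha`, and the discount factor `gamma` are assumptions for illustration.

```python
def q_update(Q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value."""
    best_next = max(Q[next_state].values())  # greedy value of the next state
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q[state][action]

# Hypothetical Q-table: two states, two actions each.
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 2.0},
}
new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1")
print(new_value)
```

The update uses the best action in the next state regardless of which action the agent actually takes next, which is what makes Q-learning off-policy.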

3.4. Generative Models

Create new data by learning the underlying distribution, with applications in image generation and unsupervised learning.

Purpose: Models that can generate new data points by learning the underlying distribution of the data.
Types:
- Explicit Density Models: Learn a model of the data distribution explicitly (e.g., Gaussian Mixture Models, Variational Autoencoders).
- Implicit Density Models: Do not assume a specific form for the data distribution but generate new data implicitly (e.g., Generative Adversarial Networks (GANs)).
Common Generative Models:
- Variational Autoencoder (VAE): Learns a probabilistic mapping from data to a latent space and back to generate new samples.
- Generative Adversarial Network (GAN): Uses two networks (generator and discriminator) in a competitive process where the generator creates data and the discriminator tries to distinguish between real and fake data.
Use Case: Image generation, data augmentation, unsupervised learning.

`UNIT 4`. Integrate Learning Methods

Ensemble Learning, Model Combination Schemes, Voting, Bagging (Bootstrap Aggregating), Boosting

4.1. Ensemble Learning

Purpose: Combine multiple base models (weak learners) to create a stronger model for better performance, robustness, and accuracy.
Key Idea: By aggregating predictions from several models, ensemble methods reduce variance (overfitting) and/or bias (underfitting).
Types:
1. Parallel Methods: Models are built independently and combined at the end (e.g., Bagging, Random Forest).
2. Sequential Methods: Models are built sequentially, where each subsequent model tries to correct errors from previous ones (e.g., Boosting).

4.2. Model Combination Schemes

Purpose: Techniques to combine multiple models’ predictions into a single prediction.
Popular Schemes:
- Averaging: For regression tasks, average the predictions from multiple models.
- Majority Voting: For classification, select the class predicted by the majority of models.
- Weighted Voting: Assign higher weights to more accurate models in voting or averaging.
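
A short sketch of averaging and weighted voting. The predictions and weights are invented; in practice the weights might come from each model's validation accuracy.

```python
from collections import defaultdict

def average(predictions):
    """Combine regression outputs by taking their mean."""
    return sum(predictions) / len(predictions)

def weighted_vote(labels, weights):
    """Return the class whose supporters have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

print(average([10.0, 12.0, 14.0]))
# One strong model outweighs two weaker ones that disagree with it.
print(weighted_vote(["spam", "ham", "ham"], [0.6, 0.2, 0.1]))
```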

4.3. Voting

Purpose: Combine predictions from different models (classifiers) for classification tasks.
Types:
- Hard Voting: Each model predicts a class, and the final output is the class with the most votes.
- Soft Voting: Uses predicted probabilities from each model, and the class with the highest average probability is selected.
Use Case: Multiple classifiers (e.g., SVM, Decision Tree, Logistic Regression) voting on whether an email is spam.
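
Hard and soft voting can disagree, as the sketch below shows. The three sets of class probabilities are made up to stand in for the outputs of three classifiers.

```python
from collections import Counter

def hard_vote(labels):
    """Return the most frequently predicted class."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """probabilities: one dict per model mapping class -> predicted probability."""
    classes = probabilities[0].keys()
    avg = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(avg, key=avg.get)

# Hypothetical outputs of three classifiers for one email.
probs = [
    {"spam": 0.45, "ham": 0.55},
    {"spam": 0.45, "ham": 0.55},
    {"spam": 0.90, "ham": 0.10},
]
labels = [max(p, key=p.get) for p in probs]  # each model's own top class
print(hard_vote(labels))  # two of three models say "ham"
print(soft_vote(probs))   # average spam probability is 0.6
```

Soft voting lets one very confident model outweigh two barely confident ones, so it only makes sense when the models produce reasonably calibrated probabilities.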

4.4. Bagging (Bootstrap Aggregating)

Purpose: Reduce model variance by training multiple models on different subsets of the data (created by bootstrapping) and averaging their predictions.
Key Idea: Each model is trained on a random sample with replacement, helping to reduce overfitting by averaging over many models.
Steps:
1. Bootstrap sampling: Create multiple random datasets by sampling with replacement from the training set.
2. Train a base model (e.g., Decision Tree) on each sampled dataset.
3. Aggregate the predictions (e.g., average for regression or majority voting for classification).
Example: Random Forest.
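
The three steps can be sketched with a deliberately trivial base model that just predicts the mean of its bootstrap sample. That stand-in, the seed, and the data are assumptions for illustration; a real bagging ensemble would train something like a decision tree on each sample.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models=25, seed=0):
    """Train n_models 'mean predictors' on bootstrap samples and average them."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)           # step 1: bootstrap
        predictions.append(sum(sample) / len(sample))  # step 2: fit base model
    return sum(predictions) / len(predictions)         # step 3: aggregate

data = [3.0, 5.0, 4.0, 6.0, 100.0]  # hypothetical targets with one outlier
sample = bootstrap_sample(data, random.Random(1))
print(sample)
print(bagged_mean(data))
```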

a. Random Forest:

Purpose: A bagging-based ensemble of decision trees that reduces overfitting by averaging predictions from many decision trees.
Key Features:
- Each tree is trained on a different bootstrap sample.
- At each split in the tree, only a random subset of features is considered, ensuring diversity between trees.
- For classification: final prediction is based on majority voting; for regression: average prediction.
Advantages:
- Reduces overfitting compared to a single decision tree.
- Can handle high-dimensional data well.
Use Case: Predicting customer churn, classifying images, etc.

4.5. Boosting

Purpose: Sequentially trains weak learners (often decision trees) where each learner focuses on the mistakes made by the previous one, improving model performance.
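Key Idea: Each new learner concentrates on the examples the current ensemble handles poorly, for instance by increasing the weights of misclassified points (AdaBoost) or by fitting the remaining errors (Gradient Boosting).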
Comparison with Bagging:
- Bagging reduces variance by averaging models trained independently.
- Boosting reduces bias by correcting mistakes iteratively.
Popular Methods:
- AdaBoost
- Gradient Boosting

a. AdaBoost (Adaptive Boosting):

Purpose: Boosting technique where subsequent models focus on the misclassified instances from previous models.
Steps:
1. Initialize weights for all training samples equally.
2. Train a weak learner (e.g., a small decision tree).
3. Increase the weights of misclassified samples so that the next weak learner focuses more on these samples.
4. Repeat the process, aggregating weak learners to make a final strong model.
Final Prediction: Weighted sum of the predictions from each weak learner, where more accurate learners receive larger weights.
Use Case: Face detection, text classification.
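
One round of the reweighting in step 3 can be sketched as follows, using the classic binary AdaBoost formulas. It assumes labels are encoded as +1 and -1 and that the weighted error is strictly between 0 and 1. The labels and predictions are invented.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one weak learner's predictions."""
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learner weight: larger when the weak learner is more accurate.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p = -1) grow, correct ones (t * p = +1) shrink.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # hypothetical weak learner: last sample is wrong
weights = [0.25] * 4    # step 1: equal initial weights
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. The final model predicts the sign of the alpha-weighted sum of the weak learners' outputs.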

`UNIT 5`. Reinforcement Learning