Why should a machine learn? | 1. We cannot anticipate all possible future situations
2. Sometimes it is not clear how to program a solution |
Supervised Learning:
Features and labels are variables that can be of two types: | Numerical (Continuous, discrete)
Categorical (Nominal, ordinal) |
Definition | When the label is numerical, the learning problem is
called regression
When the label is categorical, the learning problem is
called classification |
There are various techniques to evaluate a model
on data it was not trained on | Train/test split (e.g. 80%/20%)
K-fold cross validation
Leave-one-out |
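A minimal sketch of setting up these splitting strategies with scikit-learn; the iris dataset is used here only as a stand-in for any dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)

# Train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross validation (here K = 5): each fold is used once as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    pass  # train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]

# Leave-one-out: one fold per sample
loo = LeaveOneOut()
```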
What are the typical performance indicators? | Mean squared error for regression
Accuracy for classification (correctly
classified/total samples). |
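A quick sketch of computing both indicators with scikit-learn's metric functions; the true/predicted values below are made up purely for illustration:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean squared error between true and predicted values
print(mean_squared_error([3.0, 2.5, 4.1], [2.8, 2.9, 4.0]))

# Classification: accuracy = correctly classified / total samples
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 out of 4 correct -> 0.75
```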
K-fold Cross validation | - K determines the number of partitions (and iterations)
- Complete evaluation, because it uses all available data to evaluate the
model.
- More costly than a train/test split, because K models need to be trained
- Leave-one-out is an extreme case of cross validation where K equals
the number of samples |
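A possible sketch of K-fold and leave-one-out evaluation using scikit-learn's cross_val_score; the decision tree model and the iris dataset are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# K = 5: five models are trained, each evaluated on the fold it did not see
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Leave-one-out: K equals the number of samples, so one model per sample
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(loo_scores.mean())
```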
Bias vs Variance | If a model performs poorly on the training data, we say that the model
underfits the data.
If a model performs well on the training data, but poorly on the test data,
the model overfits the data.
Simple models (e.g. linear) tend to have high bias - this leads
to underfitting (poor performance on training data).
Too complex (or overtrained) models tend to have low bias but high
variance - this leads to overfitting (poor performance on test data). |
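A rough illustration, assuming scikit-learn and its breast cancer toy dataset: a very shallow tree scores poorly even on the training data (underfitting), while an unbounded tree scores almost perfectly on the training data but noticeably worse on the test data (overfitting):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in (1, None):  # depth 1: high bias; unbounded depth: high variance
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))
```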
What is a decision tree? | A decision tree is a non-parametric supervised learning algorithm, which is
utilized for both classification and regression tasks.
It has a hierarchical tree structure, which consists of a root node,
branches, internal nodes and leaf nodes. |
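A minimal example of fitting a decision tree with scikit-learn and printing its hierarchical structure; the iris dataset is just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Text view of the learned hierarchy: root node, internal nodes, branches, leaves
print(export_text(tree, feature_names=data.feature_names))
```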
Decision trees – when to stop? | In the example we stopped because all the leaves were
pure, but in general other criteria can be used:
• Max depth
• Min samples in split
• Min samples in leaf
• Min impurity decrease |
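These criteria correspond directly to scikit-learn's decision tree hyperparameters; the values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Each stopping criterion maps to a hyperparameter (example values only)
tree = DecisionTreeClassifier(
    max_depth=4,                 # max depth of the tree
    min_samples_split=10,        # min samples required to split an internal node
    min_samples_leaf=5,          # min samples required in each leaf
    min_impurity_decrease=0.01,  # min impurity decrease required to make a split
)
```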
Advantages vs Disadvantages of Decision trees? | Advantages:
• White-box method, totally transparent
• Little data preparation needed (can deal with categorical
and numerical data, no scaling needed...)
• Robust against missing data
• Non-linearity
Disadvantages:
• High-variance algorithm: small changes in the input data
can lead to very different trees. Prone to overfitting
• Can have problems with unbalanced data
• Poorly suited for regression
• Greedy algorithm (each split is only locally optimal) |
Random forest? | It uses bagging (bootstrap aggregating) to create different trees from the same dataset, and then aggregates their results.
Random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy. |
Random forest - Training | The first step is creating a bootstrapped dataset from the original data.
A bootstrapped dataset is a new dataset of the same size as the original, where the samples are randomly selected from the original dataset.
IMPORTANT: Repetitions in the random selection are allowed (sampling with replacement). |
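A small sketch of drawing a bootstrapped dataset with NumPy; the toy data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # a toy "original dataset" of 10 samples

# Same size as the original, sampled with replacement: repetitions are allowed,
# and some original samples will not appear at all
bootstrapped = rng.choice(original, size=len(original), replace=True)
print(bootstrapped)
```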
Random forest - Training | We then build a tree
1) Using the bootstrapped dataset
2) Only considering a random subset of variables at each step |
Random forest - Evaluation | Repeat the process until the desired number of trees is reached.
Bootstrapping and random feature selection ensure variation in the created trees.
To evaluate the forest, we run each sample through each generated tree and look at the results. The sample is classified based on the class selected by the majority of the trees. |
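A minimal random forest sketch with scikit-learn; note that scikit-learn's implementation averages the trees' class probabilities rather than taking a strict majority vote, but the idea is the same:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators: desired number of trees; bootstrap=True: each tree gets its own
# bootstrapped dataset; max_features="sqrt": random subset of variables per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)

# predict() aggregates the individual trees' predictions for each sample
print(forest.predict(X[:5]))
```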
Advantages vs Disadvantages of Random forests | Advantages:
• They are generally more accurate than single trees
• The process can be easily parallelized
• They also provide additional information, like a ranking of the most influential features (useful for feature selection)
• No data preprocessing needed (like trees)
Disadvantages:
• They sacrifice readability
• Can be computationally intensive |
K-Nearest Neighbors | The K-Nearest Neighbors (KNN) algorithm is another non-parametric method. It relies on the idea that similar data points tend to have similar labels or values.
During the training phase, the KNN algorithm simply stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as the Euclidean distance.
The algorithm assigns the most common class label among the K nearest neighbors as the predicted label for the input data point (for regression, a weighted average of the neighbors' values is used instead). |
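A minimal KNN sketch with scikit-learn; features are scaled first, since the Euclidean distance is sensitive to feature ranges:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale features so no single feature dominates the distance computation
scaler = StandardScaler().fit(X_train)

# "Training" just stores the (scaled) training set; prediction is a majority
# vote among the K nearest stored points
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))
```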
Advantages vs Disadvantages of K-Nearest Neighbors | Advantages:
• White-box model
• No training needed (only parameter tuning)
• Non-linearity
• Only one main parameter to tune (K)
Disadvantages:
• Can be slow with large datasets
• Scaling is necessary for the distance measure to work properly
• Sensitive to outliers
• Curse of dimensionality |