Why should a machine learn? | 1. We cannot anticipate all possible future situations
2. Sometimes it is not clear how to program a solution |
Supervised Learning:
Features and labels are variables that can be of two types: | Numerical (Continuous, discrete)
Categorical (Nominal, ordinal) |
Definition | When the label is numerical, the learning problem is
called regression
When the label is categorical, the learning problem is
called classification |
There are various techniques to evaluate a model
on data it was not trained on | Train/test split (e.g. 80%/20%)
K-fold cross validation
Leave-one-out |
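A minimal sketch of setting up these splitting strategies with scikit-learn; the iris dataset is used here only as a stand-in for any dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)

# Train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold cross validation (here K = 5): each fold is used once as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    pass  # train on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]

# Leave-one-out: one fold per sample
loo = LeaveOneOut()
```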
What are the typical performance indicators? | Mean squared error for regression
Accuracy for classification (correctly
classified/total samples). |
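A quick sketch of computing both indicators with scikit-learn's metric functions; the true/predicted values below are made up purely for illustration:

```python
from sklearn.metrics import accuracy_score, mean_squared_error

# Regression: mean squared error between true and predicted values
print(mean_squared_error([3.0, 2.5, 4.1], [2.8, 2.9, 4.0]))

# Classification: accuracy = correctly classified / total samples
print(accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))  # 3 out of 4 correct -> 0.75
```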
K-fold Cross validation | - K determines the number of partitions (and iterations)
- Complete evaluation, because it uses all available data to evaluate the
model.
- More costly than a train/test split, because K models need to be trained
- Leave-one-out is an extreme case of cross validation where K equals
the number of samples |
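A possible sketch of K-fold and leave-one-out evaluation using scikit-learn's cross_val_score; the decision tree model and the iris dataset are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# K = 5: five models are trained, each evaluated on the fold it did not see
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Leave-one-out: K equals the number of samples, so one model per sample
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(loo_scores.mean())
```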
Bias vs Variance | If a model performs poorly on the training data, we say that the model
underfits the data.
If a model performs well on the training data, but poorly on the test data,
the model overfits the data.
Simple models (e.g. linear) tend to have high bias - this leads
to underfitting (poor performance on training data).
Too complex (or overtrained) models tend to have low bias but high
variance - this leads to overfitting (poor performance on test data). |
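A rough illustration, assuming scikit-learn and its breast cancer toy dataset: a very shallow tree scores poorly even on the training data (underfitting), while an unbounded tree scores almost perfectly on the training data but noticeably worse on the test data (overfitting):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for depth in (1, None):  # depth 1: high bias; unbounded depth: high variance
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(depth, model.score(X_train, y_train), model.score(X_test, y_test))
```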
What is a decision tree? | A decision tree is a non-parametric supervised learning algorithm, which is
utilized for both classification and regression tasks.
It has a hierarchical tree structure, which consists of a root node,
branches, internal nodes and leaf nodes. |
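A minimal example of fitting a decision tree with scikit-learn and printing its hierarchical structure; the iris dataset is just a placeholder:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Text view of the learned hierarchy: root node, internal nodes, branches, leaves
print(export_text(tree, feature_names=data.feature_names))
```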
Decision trees – when to stop? | In the example we stopped because all the leaves were
pure, but in general other criteria can be used:
• Max depth
• Min samples in split
• Min samples in leaf
• Min impurity decrease |
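These criteria correspond directly to scikit-learn's decision tree hyperparameters; the values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Each stopping criterion maps to a hyperparameter (example values only)
tree = DecisionTreeClassifier(
    max_depth=4,                 # max depth of the tree
    min_samples_split=10,        # min samples required to split an internal node
    min_samples_leaf=5,          # min samples required in each leaf
    min_impurity_decrease=0.01,  # min impurity decrease required to make a split
)
```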
Advantages vs Disadvantages of Decision trees? | Advantages:
• White-box method, totally transparent
• Little data preparation needed (can deal with categorical
and numerical data, no scaling needed...)
• Robust against missing data
• Non-linearity
Disadvantages:
• High-variance algorithm: small changes in the input data
can lead to very different trees. Prone to overfitting
• Can have problems with unbalanced data
• Poorly suited for regression
• Greedy algorithm (each split is only locally optimal) |
Random forest? | It uses bagging (bootstrap aggregating) to create different trees from the same dataset, and then aggregates their results.
Random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy. |
Random forest - Training | The first step is creating a bootstrapped dataset from the original data.
A bootstrapped dataset is a new dataset of the same size as the original, where the samples are randomly selected from the original dataset.
IMPORTANT: Repetitions in the random selection are allowed (sampling with replacement). |
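A small sketch of drawing a bootstrapped dataset with NumPy; the toy data is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
original = np.arange(10)  # a toy "original dataset" of 10 samples

# Same size as the original, sampled with replacement: repetitions are allowed,
# and some original samples will not appear at all
bootstrapped = rng.choice(original, size=len(original), replace=True)
print(bootstrapped)
```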
Random forest - Training | We then build a tree
1) Using the bootstrapped dataset
2) Only considering a random subset of variables at each step |
Random forest - Evaluation | Repeat the process until the desired number of trees is reached.
Bootstrapping and random feature selection ensure variation in the created trees.
To evaluate the forest, we run each sample through each generated tree and look at the results. The sample is classified based on the class selected by the majority of the trees. |
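A minimal random forest sketch with scikit-learn; note that scikit-learn's implementation averages the trees' class probabilities rather than taking a strict majority vote, but the idea is the same:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators: desired number of trees; bootstrap=True: each tree gets its own
# bootstrapped dataset; max_features="sqrt": random subset of variables per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)

# predict() aggregates the individual trees' predictions for each sample
print(forest.predict(X[:5]))
```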
Advantages vs Disadvantages of Random forests | Advantages:
• They are generally more accurate than single trees
• The process can be easily parallelized
• They also provide additional information, like a ranking of the most influential features (useful for feature selection)
• No data preprocessing needed (like trees)
Disadvantages:
• They sacrifice readability
• Can be computationally intensive |
K-Nearest Neighbors | The K-Nearest Neighbors (KNN) algorithm is another non-parametric method. It relies on the idea that similar data points tend to have similar labels or values.
During the training phase, the KNN algorithm simply stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as the Euclidean distance.
The algorithm assigns the most common class label among the K nearest neighbors as the predicted label for the input data point (for regression, a weighted average of the neighbors' values is used instead). |
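A minimal KNN sketch with scikit-learn; features are scaled first, since the Euclidean distance is sensitive to feature ranges:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale features so no single feature dominates the distance computation
scaler = StandardScaler().fit(X_train)

# "Training" just stores the (scaled) training set; prediction is a majority
# vote among the K nearest stored points
knn = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)
print(knn.score(scaler.transform(X_test), y_test))
```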
Advantages vs Disadvantages of K-Nearest Neighbors | Advantages:
• White-box model
• No training needed (only parameter tuning)
• Non-linearity
• Only one main parameter to tune (K)
Disadvantages:
• Can be slow with large datasets
• Scaling is necessary for the distance measure to work properly
• Sensitive to outliers
• Curse of dimensionality |