This document provides an overview of regression analysis and linear regression. Regression analysis estimates relationships among variables to predict continuous outcomes; linear regression finds the best-fitting line by minimizing error. The document covers modeling with multiple features, representing data in vector and matrix form, and using gradient descent to learn the weights through iterative updates, minimizing a cost function that measures the error between predictions and true values.
14. Optimization (convex)
[Figure: a convex function; the derivative is negative to the left of the minimum, zero at the minimum, and positive to the right.]
"The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction." (Wikipedia)
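As a minimal sketch of the idea above: for a convex function the derivative is zero at the minimum, and gradient descent walks toward that point by repeatedly stepping against the gradient. The function f(x) = (x − 3)², the learning rate, and the step count below are illustrative choices, not values from the slides.

```python
# Minimal gradient descent sketch on the convex function f(x) = (x - 3)^2.
# Its derivative f'(x) = 2 * (x - 3) is zero at the minimum x = 3.

def gradient_descent(start, eta=0.1, steps=100):
    x = start
    for _ in range(steps):
        grad = 2 * (x - 3)   # derivative of (x - 3)^2 at the current x
        x -= eta * grad      # step in the direction opposite to the gradient
    return x

x_min = gradient_descent(start=0.0)  # converges toward 3.0
```

With a learning rate that is too large the iterates can overshoot and diverge, which is why η must be chosen carefully.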
21. Linear Regression: Multiple features
• Example: for house pricing, in addition to size in square meters, we
can use city, location, number of rooms, number of bathrooms, etc.
• The model/hypothesis becomes
ŷ = w_0 + w_1 x_1 + w_2 x_2 + … + w_n x_n, where n = number of features
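The multi-feature hypothesis above is just a bias term plus a weighted sum of the features. A small sketch, with made-up house-pricing weights purely for illustration:

```python
def predict(w, x):
    """y_hat = w0 + w1*x1 + ... + wn*xn.

    w has n+1 entries (bias w0 first), x has n feature values.
    """
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

# Hypothetical weights: base price 50, 3 per square meter, 10 per room.
w = [50.0, 3.0, 10.0]
price = predict(w, [100.0, 2.0])  # 100 m^2, 2 rooms
```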
25. Analytical vs. Gradient Descent
• Gradient descent: must select the learning-rate parameter 𝜂
• Analytical solution: no parameter selection
• Gradient descent: requires many iterations
• Analytical solution: no iterations needed
• Gradient descent: works well with a large number of features
• Analytical solution: slow with a large number of features
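To make the contrast concrete, here is the analytical solution for the single-feature case: the least-squares line has a closed form, so there is no learning rate and no iteration. This is a sketch of the textbook formula, not code from the slides.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = w0 + w1*x:
    w1 = cov(x, y) / var(x),  w0 = mean(y) - w1 * mean(x).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w1 = cov / var
    w0 = my - w1 * mx
    return w0, w1

w0, w1 = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])  # exact line y = 1 + 2x
```

In the general multi-feature case the analytical solution requires inverting an n×n matrix, which is what makes it slow for many features.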
30. Logistic Regression
• How to turn a regression problem into a classification one?
• y = 0 or 1
• Map values to the [0, 1] range using the sigmoid/logistic function:
g(x) = 1 / (1 + e^(−x))
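A direct sketch of the sigmoid above: it squashes any real input into (0, 1), so its output can be read as a class probability.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real x into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-x))

# sigmoid(0) is exactly 0.5; large positive inputs approach 1,
# large negative inputs approach 0.
```

A typical decision rule is to predict class 1 when sigmoid(x) ≥ 0.5, i.e. when x ≥ 0.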
57. K-Means Algorithm
• Select number of clusters: 𝐾𝐾 (number of centroids 𝜇𝜇1, … , 𝜇𝜇𝐾𝐾)
• Given dataset of size N
• for number of iterations t:
• for i = 1 to N:
• c_i := index of the centroid with the smallest Euclidean distance to sample x_i
• for k = 1 to K:
• μ_k := mean of the points assigned to cluster k
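The two alternating steps above (assign each point to its nearest centroid, then move each centroid to the mean of its points) can be sketched as follows. Initial centroids are passed in explicitly here; in practice they would be chosen randomly.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Sketch of Lloyd's K-means for 2-D points with given initial centroids."""
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
            clusters[k].append(p)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids
```

Usage: with two well-separated groups, the centroids converge to the group means after a single pass and stay there.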
65. Model Evaluation
• Splitting data (training, validation, testing) IMPORTANT
• No hard rule: usually 60%-20%-20% will be fine
• k-fold cross-validation
• If dataset is very small
• Leave-one-out
• Fine-tuning hyper-parameters
• Automated hyper-parameter selection
• Using validation set
[Figure: dataset split into Training | Validation | Testing portions]
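The 60%-20%-20% split mentioned above can be sketched in a few lines; the ratios and the fixed seed are illustrative choices, and shuffling first avoids ordering bias in the original data.

```python
import random

def split_data(data, train=0.6, val=0.2, seed=0):
    """Shuffle and split into train/validation/test (60/20/20 by default)."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_train = int(n * train)
    n_val = int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = split_data(range(100))
```

The validation set is used for hyper-parameter tuning, and the test set is touched only once, for the final evaluation.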
66. Performance Measures
• Depending on the problem
• Some well-known measures are:
• Classification measures
• Accuracy
• Confusion matrix and related measures
• Regression
• Mean Squared Error
• R2 metric
• Clustering performance measures are not straightforward and will not
be discussed here
67. Performance Measures: Accuracy
• If we have 100 persons, one of them having cancer: what is the accuracy if we
classify all of them as having no cancer? (99%)
• Accuracy is not good for heavily biased class distribution
Accuracy = correct predictions / all predictions
68. Performance Measures: Confusion matrix
                      Actual Positive         Actual Negative
Predicted Positive    True Positives (TP)     False Positives (FP)
Predicted Negative    False Negatives (FN)    True Negatives (TN)
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 · (Precision · Recall) / (Precision + Recall)
• Precision: a measure of result relevancy
• Recall: a measure of how many truly relevant results are returned
• F-measure: the harmonic mean of precision and recall
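The four measures above can be computed directly from the confusion-matrix counts. A minimal sketch, using the earlier cancer example (1 positive among 100) to show why accuracy alone is misleading on imbalanced data:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F-measure from binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Predicting "no cancer" for everyone: 99% accuracy, but recall is 0.
acc, prec, rec, f1 = classification_metrics([1] + [0] * 99, [0] * 100)
```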
69. Performance Measures: Mean Squared Error (MSE)
• Defined as
• Gives an idea of how wrong the predictions were
• Only gives an idea of the magnitude of the error, not the direction (e.g., over-
or under-predicting)
• Root Mean Squared Error (RMSE) is the square root of MSE, which has the same
units as the data
MSE = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²
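A direct sketch of MSE and RMSE as defined above:

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target values."""
    return math.sqrt(mse(y_true, y_pred))

# Each prediction is off by 2, so MSE is 4 and RMSE is 2.
err = rmse([1.0, 2.0, 3.0], [3.0, 4.0, 5.0])
```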
70. Performance Measures: R2
• Is a statistical measure of how close the data are to the fitted
regression line
• Also known as the coefficient of determination
• Has a value between 0 (no fit) and 1 (perfect fit); it can even be negative for a model that fits worse than simply predicting the mean
SS_res = Σ_i (y_i − ŷ_i)²
SS_tot = Σ_i (y_i − ȳ)², where ȳ is the mean of the observed data
R² = 1 − SS_res / SS_tot
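The R² definition above translates directly to code. A sketch; note that a perfect fit gives 1.0 and predicting the mean gives exactly 0.0:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # 1.0
baseline = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # 0.0
```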
72. Curse of dimensionality
Image from http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/
When the dimensionality increases, the volume of the space increases so fast that the available data become sparse.
74. Features compression/projection
• Project to lower dimensional
space while preserving as
much information as possible
• Principal Component Analysis (PCA)
• Unsupervised method
Image from http://compbio.pbworks.com/w/page/16252905/Microarray%20Dimension%20Reduction
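To make the PCA idea concrete, here is a sketch that finds the single direction of maximum variance for 2-D data via power iteration on the covariance matrix. This is an illustrative reimplementation of the first principal component, not the full PCA algorithm (which extracts all components, typically via eigendecomposition or SVD).

```python
def first_principal_component(points, iterations=100):
    """Direction of maximum variance of 2-D points, via power iteration."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]].
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    # Power iteration: repeatedly multiply by the covariance matrix
    # and renormalize; converges to the dominant eigenvector.
    v = (1.0, 0.0)
    for _ in range(iterations):
        v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (v[0] ** 2 + v[1] ** 2) ** 0.5
        v = (v[0] / norm, v[1] / norm)
    return v
```

Projecting each centered point onto this unit vector compresses the data from two dimensions to one while preserving as much variance as possible.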
78. Training vs. Testing Errors
• Accuracy on the training set is not representative of model performance
• We need to calculate the accuracy on the test set (new, unseen examples)
• The goal is a model that generalizes to unseen data
79. Bias and variance trade-off
[Figure: training and testing error vs. model complexity. Low complexity gives high bias (both errors high); high complexity gives high variance (low training error, high testing error). The optimum is low bias and low variance.]
80. Learning Curves
[Figure: two learning-curve panels, error vs. number of training samples. High bias: training and validation errors converge to a high value. High variance: a large gap between low training error and higher validation error.]
81. Regularization
• To prevent overfitting
• Decrease the complexity of the model
• Example of regularized regression model (Ridge
Regression)
• 𝜆𝜆 is a very important hyper-parameter
RSS(w) = Σ_{i=1}^{N} (ŷ_i − y_i)² + λ Σ_{j=1}^{k} w_j², where k = number of weights
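The regularized cost above can be sketched directly for a single-feature model. One assumption to flag: the bias weight is excluded from the penalty here, which is a common convention, while the slide's sum runs over all k weights.

```python
def ridge_cost(w, xs, ys, lam):
    """Ridge cost: squared errors plus lambda times squared weights.

    w = [w0, w1] for the model y_hat = w0 + w1*x; the bias w0 is
    excluded from the penalty (a common convention).
    """
    preds = [w[0] + w[1] * x for x in xs]
    rss = sum((p - y) ** 2 for p, y in zip(preds, ys))
    penalty = lam * sum(wj ** 2 for wj in w[1:])
    return rss + penalty

# A perfect fit (zero error) still pays the penalty when lambda > 0,
# which is how regularization discourages large weights.
cost = ridge_cost([1.0, 2.0], [0.0, 1.0], [1.0, 3.0], lam=0.5)
```

Larger λ shrinks the weights more, lowering variance at the cost of some bias, which is why λ is such an important hyper-parameter.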
82. Debugging a Learning Algorithm
• From “Machine Learning” course on coursera.org, by Andrew Ng
• Get more training examples → fixes high variance
• Try smaller sets of features → fixes high variance
• Try getting additional features → fixes high bias
• Try adding polynomial features (e.g., x_1², x_2², x_1·x_2, etc.) → fixes high bias
• Try decreasing λ → fixes high bias
• Try increasing λ → fixes high variance
84. What is the best ML algorithm?
• "No free lunch" theorem: there is no single model that works best for every problem
• We need to try and compare different models and assumptions
• Machine learning is full of uncertainty