ML basics - MSE & MAE

Dataset

|   | Restaurant | Location | Cuisines | AverageCost | MinimumOrder | Rating | Votes | Reviews | DeliveryTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ID6321 | FTI College, Law College Road, Pune | Fast Food, Rolls, Burger, Salad, Wraps | 200 | 50 | 3.5 | 12 | 4 | 30 |
| 1 | ID2882 | Sector 3, Marathalli | Ice Cream, Desserts | 100 | 50 | 3.5 | 11 | 4 | 30 |
| 2 | ID1595 | Mumbai Central | Italian, Street Food, Fast Food | 150 | 50 | 3.6 | 99 | 30 | 65 |
| 3 | ID5929 | Sector 1, Noida | Mughlai, North Indian, Chinese | 250 | 99 | 3.7 | 176 | 95 | 30 |
| 4 | ID6123 | Rmz Centennial, I Gate, Whitefield | Cafe, Beverages | 200 | 99 | 3.2 | 521 | 235 | 65 |

Preprocessing

OHE

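import pandas as pd
# Split on ', ' (comma + space) so labels after the first do not keep a leading space
df = pd.concat([df, df['Cuisines'].str.get_dummies(sep=', ').add_prefix('Cuisines_')], axis=1)
df = pd.concat([df, df['Location'].str.get_dummies(sep=', ').add_prefix('Location_')], axis=1)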
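
To make the effect of the separator concrete, here is a minimal sketch on a toy frame built from two of the sample rows above. The name `toy` is illustrative only. It compares splitting on `','` with splitting on `', '`.

```python
import pandas as pd

# Two Cuisines values copied from the sample rows; `toy` is an illustrative name.
toy = pd.DataFrame({
    "Cuisines": [
        "Italian, Street Food, Fast Food",
        "Fast Food, Rolls, Burger, Salad, Wraps",
    ]
})

# Splitting on ',' keeps the space after each comma inside the label.
raw = toy["Cuisines"].str.get_dummies(sep=",")
print(raw.columns.tolist())

# Splitting on ', ' yields clean labels, so 'Fast Food' becomes a single column.
clean = toy["Cuisines"].str.get_dummies(sep=", ").add_prefix("Cuisines_")
print(clean.columns.tolist())
print(clean["Cuisines_Fast Food"].tolist())
```

With `sep=','` the same cuisine can end up in two columns (`'Fast Food'` and `' Fast Food'`), depending on whether it appears first in the list.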

Count Encoding

# Count encoding: replace each restaurant ID with the number of rows it appears in
ce = df['Restaurant'].value_counts()
df['Restaurant_Freq'] = df['Restaurant'].map(ce)
df.reset_index(inplace=True, drop=True)
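
As a quick illustration of count encoding, the sketch below applies the same two steps to a toy column. The restaurant IDs are made up for the example.

```python
import pandas as pd

# Hypothetical restaurant IDs, not taken from the dataset.
toy = pd.DataFrame({"Restaurant": ["ID1", "ID2", "ID1", "ID3", "ID1", "ID2"]})

# value_counts() gives one count per ID; map() writes that count onto every row.
counts = toy["Restaurant"].value_counts()
toy["Restaurant_Freq"] = toy["Restaurant"].map(counts)
print(toy)
```

Each row now carries how many times its restaurant occurs, which gives the model a numeric signal for an otherwise high-cardinality ID column.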

Categorization

cate_cols = ['Restaurant','Rating','Location','Cuisines','AverageCost']
for c in cate_cols:
    df[c] = df[c].astype('category')

Remove white space and special characters from column names

import re
# Keep only letters, digits and underscores in each column name
df = df.rename(columns=lambda x: re.sub(r'[^A-Za-z0-9_]+', '', x))
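
The pattern strips every character that is not a letter, digit or underscore, not only spaces. This is presumably needed because LightGBM rejects feature names containing certain special characters. A small sketch with hypothetical column names shows the result, and also a side effect worth checking for: two different names can collapse into one.

```python
import re
import pandas as pd

# Hypothetical column names, similar to what the one-hot step can produce.
toy = pd.DataFrame(columns=["Cuisines_Fast Food", "Cuisines_ Fast Food", "Location_Sector 3"])

cleaned = toy.rename(columns=lambda name: re.sub(r"[^A-Za-z0-9_]+", "", name))
print(cleaned.columns.tolist())

# True if cleaning made two columns share the same name.
has_collision = bool(cleaned.columns.duplicated().any())
print(has_collision)
```

If the collision check is true, the colliding columns should be merged or renamed before training.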

Split Train & Test

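Stratified holdout with a 20% test split (stratifying on DeliveryTime requires the target to take a small set of discrete values, each occurring at least twice):

from sklearn.model_selection import train_test_split
train, test, _, _ = train_test_split(df, df['DeliveryTime'], test_size=0.2, shuffle=True, stratify=df['DeliveryTime'])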
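
To see what stratification buys, here is a minimal sketch on a made-up target with three delivery-time values. The counts and `random_state=0` are assumptions for the example, not settings from the original run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical target: 60% of rows at 30 minutes, 30% at 45, 10% at 65.
toy = pd.DataFrame({"DeliveryTime": [30] * 60 + [45] * 30 + [65] * 10})

train, test = train_test_split(
    toy, test_size=0.2, shuffle=True, stratify=toy["DeliveryTime"], random_state=0
)

# Both splits keep the 60/30/10 proportions of the full data.
print(train["DeliveryTime"].value_counts(normalize=True).sort_index())
print(test["DeliveryTime"].value_counts(normalize=True).sort_index())
```

Without `stratify`, a rare delivery time could be over- or under-represented in the test split purely by chance.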

Run Model

MAE

metric = 'mae'
params = {
    'objective': metric,      # loss to optimize
    'max_depth': -1,
    'learning_rate': 0.01,
    'boosting': 'gbdt',
    'metric': metric,         # evaluation metric
    'verbosity': -1,
    'nthread': -1,
    'random_state': 42,
    'n_estimators': 20000,    # alias of num_iterations
}

# LGB_STRATAKFOLD_REG, X_tr, X_te and train_Y are defined elsewhere and are not shown in this post
preds, oof, cv_mean, cv_std, best_feats = LGB_STRATAKFOLD_REG(5, X_tr, X_te, metric, train_Y, params)

result = preds < test['DeliveryTime'].values
UnderPrediction = sum(result)/len(result)

print("\n"+'#'*40,"\n## SKF5 - {}:\t\t\t {:0.4f}".format(metric.upper(),cv_mean))
print("## SKF5 - UnderPrediction:\t {:0.4f}".format(UnderPrediction)+"\n"+'#'*40)
                              Feature  importance  Feature Rank
172                             Votes    25147.60           1.0
171                           Reviews    18196.20           2.0
170                   Restaurant_Freq     9977.80           3.0
168                            Rating     8791.40           4.0
0                         AverageCost     4370.40           5.0
######################################## 
## SKF5 - MAE:			    5.3994
## SKF5 - UnderPrediction:	    0.4403
########################################
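
The UnderPrediction figure printed above is the share of test rows whose predicted delivery time is below the actual one. A minimal, self-contained sketch of that calculation follows. The function name and the predictions are illustrative and not part of the original code.

```python
import numpy as np

def under_prediction_rate(y_true, y_pred):
    """Fraction of samples whose prediction is strictly below the true value."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(y_pred < y_true))

y_true = [30, 30, 65, 30, 65]  # DeliveryTime of the five sample rows
y_pred = [28, 35, 50, 30, 70]  # hypothetical predictions

rate = under_prediction_rate(y_true, y_pred)
print(rate)
```

For a delivery service, an underprediction means the order arrives later than promised, which is why this rate is tracked next to the error metric.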

MSE

metric = 'mse'
params = {
    'objective': metric,      # loss to optimize
    'max_depth': -1,
    'learning_rate': 0.01,
    'boosting': 'gbdt',
    'metric': metric,         # evaluation metric
    'verbosity': -1,
    'nthread': -1,
    'random_state': 42,
    'n_estimators': 20000,    # alias of num_iterations
}

# LGB_STRATAKFOLD_REG, X_tr, X_te and train_Y are defined elsewhere and are not shown in this post
preds, oof, cv_mean, cv_std, best_feats = LGB_STRATAKFOLD_REG(5, X_tr, X_te, metric, train_Y, params)

result = preds < test['DeliveryTime'].values
UnderPrediction = sum(result)/len(result)

print("\n"+'#'*40,"\n## SKF5 - {}:\t\t\t {:0.4f}".format(metric.upper(),cv_mean))
print("## SKF5 - UnderPrediction:\t {:0.4f}".format(UnderPrediction)+"\n"+'#'*40)
                  Feature  importance  Feature Rank
172                 Votes    29995.80           1.0
171               Reviews    26825.80           2.0
170       Restaurant_Freq     9766.80           3.0
168                Rating     7717.20           4.0
0             AverageCost     4006.80           5.0
######################################## 
## SKF5 - MSE:			97.0386
## SKF5 - UnderPrediction:	0.3335
########################################

MAE Vs. MSE and RMSE

\[ MAE = \frac{1}{N}\sum_{i=1}^{N}|y_{i}-\hat{y}_{i}| \]
\[ MSE = \frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2} \]
\[ RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2}} \]

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

|   | Prediction | GroundTruth | Diff | Abs Diff |
|---|---|---|---|---|
| 0 | 400,000 | 420,000 | 20,000 | 20,000 |
| 1 | 250,000 | 234,000 | -16,000 | 16,000 |
| 2 | 800,000 | 760,400 | -39,600 | 39,600 |
|   |   |   | MAE | 25,200 |

Interpretation:
Intuitively, this model's predictions are off by 25,200 on average.
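
The MAE in the table can be reproduced in a few lines. This sketch uses the three Prediction and GroundTruth values from the table, with Diff taken as GroundTruth minus Prediction, as the table does.

```python
import numpy as np

prediction = np.array([400_000, 250_000, 800_000])
ground_truth = np.array([420_000, 234_000, 760_400])

diff = ground_truth - prediction   # signed error per row
abs_diff = np.abs(diff)            # magnitude only
mae = abs_diff.mean()              # average magnitude

print(diff.tolist())
print(mae)
```

Because the signs are dropped before averaging, positive and negative errors cannot cancel each other out.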

|   | Prediction | GroundTruth | Diff | Abs Diff |
|---|---|---|---|---|
| 0 | 400,000 | 420,000 | 20,000 | 20,000 |
| 1 | 250,000 | 234,000 | -16,000 | 16,000 |
| 2 | 800,000 | 760,400 | -39,600 | 39,600 |
|   |   |   | MAE | 25,200 |
|   |   |   | RMSE | 27,228 |

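We cannot say that the model's predictions are off by 27,228 on average: RMSE squares each error before averaging and then takes the square root, so larger errors count for more than smaller ones.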
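
Written out for the three rows in the table, the RMSE calculation is:

\[ RMSE = \sqrt{\frac{20{,}000^{2} + 16{,}000^{2} + 39{,}600^{2}}{3}} = \sqrt{\frac{2{,}224{,}160{,}000}{3}} \approx 27{,}228 \]

The largest error (39,600) makes up about 52% of the sum of absolute errors (39,600 of 75,600) but about 71% of the sum of squared errors (1,568,160,000 of 2,224,160,000). That extra weight on the biggest miss is why RMSE comes out above MAE here.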

|   | Prediction | GroundTruth | Diff | Abs Diff |
|---|---|---|---|---|
| 0 | 400,000 | 420,000 | 20,000 | 20,000 |
| 1 | 250,000 | 234,000 | -16,000 | 16,000 |
| 2 | 800,000 | 760,400 | -39,600 | 39,600 |
| 3 | 29,342,000 | 34,234,400 | 4,892,400 | 4,892,400 |
|   |   |   | MAE (before outlier) | 25,200 |
|   |   |   | RMSE (before outlier) | 27,228 |
|   |   |   | MSE (before outlier) | 741,386,667 |
|   |   |   | MAE (after outlier) | 1,242,000 |
|   |   |   | RMSE (after outlier) | 2,446,314 |
|   |   |   | MSE (after outlier) | 5,984,450,480,000 |
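
The before/after comparison can be recomputed directly from the Prediction and GroundTruth columns. The sketch below is one way to do it. The helper name `report` is an assumption for this example.

```python
import numpy as np

def report(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long sequences."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    return mae, mse, np.sqrt(mse)

prediction = [400_000, 250_000, 800_000, 29_342_000]
ground_truth = [420_000, 234_000, 760_400, 34_234_400]

before = report(ground_truth[:3], prediction[:3])  # rows 0-2 only
after = report(ground_truth, prediction)           # with the outlier row 3

for name, b, a in zip(("MAE", "MSE", "RMSE"), before, after):
    print(f"{name}: before={b:,.0f} after={a:,.0f} ratio={a / b:,.1f}")
```

The single outlier inflates MAE by a factor of roughly 49 and RMSE by roughly 90, which is the sensitivity gap the sections below describe.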

MAE

MAE is more robust to data with outliers.

MSE & RMSE

MSE penalizes large prediction errors by squaring them.
This makes it more sensitive to outliers.

Because MSE squares the error \(e\), the penalty grows much faster than the error itself once \(|e| > 1\). A model trained with MSE loss therefore gives more weight to outliers than a model trained with MAE loss. A model with RMSE as its loss will be adjusted to reduce the error on outliers at the expense of the more common examples, which can reduce its overall performance.

MAE loss is useful if the training data is corrupted with outliers.

Intuitively, we can think about it like this: if we had to give a single prediction for all observations, the value that minimizes MSE is the mean of the target values, whereas the value that minimizes MAE is the median.
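
The mean/median claim is easy to check numerically. The sketch below scans a grid of constant predictions over a small, made-up, right-skewed target and picks the constant with the lowest MSE and the one with the lowest MAE.

```python
import numpy as np

# Hypothetical right-skewed targets (nine values, so the median is unique).
y = np.array([10, 20, 30, 30, 30, 45, 65, 80, 140], dtype=float)

# Candidate constant predictions on a 0.5 grid.
candidates = np.arange(0.0, 150.5, 0.5)

# Loss of each candidate against all targets (rows: candidates, columns: targets).
errors = y[None, :] - candidates[:, None]
mse = (errors ** 2).mean(axis=1)
mae = np.abs(errors).mean(axis=1)

best_for_mse = candidates[mse.argmin()]
best_for_mae = candidates[mae.argmin()]

print(best_for_mse, y.mean())
print(best_for_mae, np.median(y))
```

The MSE-optimal constant lands on the mean and the MAE-optimal constant on the median. With a long right tail the mean sits well above the median.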

MSE produces fewer underpredictions. Why?

#######################################
## SKF5 - MAE:			5.3994
## SKF5 - UnderPrediction:	0.4403

## SKF5 - MSE:			97.0386
## SKF5 - UnderPrediction:	0.3335
#######################################

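MSE fits toward the mean and MAE toward the median. Because squaring lets a few large errors dominate the loss, the MSE model shifts its predictions toward the largest targets. If the delivery times are right-skewed, with a few long deliveries (which these results suggest), the mean lies above the median, so the MSE model overpredicts more often than the MAE model and underpredicts less often (0.3335 vs. 0.4403 here).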
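
A constant-prediction analogy makes this visible. It does not reproduce the LightGBM runs above. It only compares predicting the median (what MAE favors) with predicting the mean (what MSE favors) on a made-up, right-skewed target.

```python
import numpy as np

# Hypothetical right-skewed delivery times.
y = np.array([10, 20, 30, 30, 30, 45, 45, 65, 80, 120], dtype=float)

median_pred = np.median(y)  # constant that minimizes MAE
mean_pred = y.mean()        # constant that minimizes MSE

# Share of targets above the prediction, i.e. underpredicted samples.
under_median = float(np.mean(median_pred < y))
under_mean = float(np.mean(mean_pred < y))

print(median_pred, under_median)
print(mean_pred, under_mean)
```

The mean-based prediction is higher, so fewer true values lie above it, and the underprediction rate drops.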

Summary

With few outliers, if you want the model to minimize UnderPrediction, use MSE as the loss function.
With many outliers, use MAE as the loss function.

  • UnderPrediction: predicted delivery time < true delivery time

  • MAE fits on the basis of the median.
  • MSE fits on the basis of the mean.
  • RMSE brings the error back to the units of the target, which MSE squares.

Appendix

pandas to md table

Install tabulate, which to_markdown depends on:

pip install tabulate

Then print the DataFrame as a Markdown table:

print(df.head().to_markdown())

Reference

df to md table: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_markdown.html
MSE MAE: https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e
MSE MAE: https://data101.oopy.io/mae-vs-rmse
MSE MAE: https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
MSE MAE: http://rishy.github.io/ml/2015/07/28/l1-vs-l2-loss/
MSE MAE: https://www.kaggle.com/c/home-data-for-ml-course/discussion/143364
