Machine Learning – Easy Reference

In this post, we have collected the must-know things for working with Machine Learning algorithms. Here is the list for your easy reference. Bookmark this page!

Classification metrics

In the context of binary classification, here are the main metrics to track in order to assess the performance of the model.

Confusion matrix: The confusion matrix gives a more complete picture when assessing the performance of a model. It is defined as follows:
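As a minimal sketch of the confusion matrix, the snippet below counts the four cells (true positives, false positives, false negatives, true negatives) for a binary problem. The function name `confusion_counts`, the 0/1 label encoding, and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels encoded as 0 (negative) and 1 (positive)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # positive correctly predicted as positive
        elif actual == 0 and predicted == 1:
            fp += 1  # negative wrongly predicted as positive
        elif actual == 1 and predicted == 0:
            fn += 1  # positive wrongly predicted as negative
        else:
            tn += 1  # negative correctly predicted as negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical toy labels, used only to exercise the function.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```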

Main metrics: The following metrics are commonly used to assess the performance of classification models:
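The sketch below derives a few widely used classification metrics from the four confusion-matrix counts. The choice of accuracy, precision, recall, specificity and F1 is an assumption about which metrics are meant here, and the function name and example counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity or TPR
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # equals 1 - FPR
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, used only to exercise the function.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall pull in different directions, which is why the F1 score (their harmonic mean) is often reported alongside accuracy on imbalanced data.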

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold. These metrics are summed up in the table below:

AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve, as shown in the following figure:
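To make the ROC and AUC definitions concrete, here is a small sketch that sweeps the threshold over the predicted scores, records one (FPR, TPR) point per threshold, and integrates the curve with the trapezoidal rule. The function names and the four-sample dataset are assumptions, and the code assumes the labels contain at least one positive and one negative.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    # Lowering the threshold past each distinct score marks more samples as positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as (x, y) points, via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and scores, used only to exercise the functions.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.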

Regression metrics:

Basic metrics: Given a regression model f, the following metrics are commonly used to assess the performance of the model:

Coefficient of determination: The coefficient of determination, often noted R^2 or r^2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:
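A minimal sketch of the coefficient of determination, using the standard definition R^2 = 1 - SS_res / SS_tot, where SS_tot is the total sum of squares around the mean and SS_res is the residual sum of squares. The function name and sample values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    return 1 - ss_res / ss_tot


# Hypothetical observations and predictions, used only to exercise the function.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
r2 = r_squared(y_true, y_pred)
print(round(r2, 4))  # 0.9486
```

A value of 1 means the predictions match the observations exactly, and a value of 0 means the model does no better than always predicting the mean.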

Main metrics: The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:
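One commonly cited metric of this kind is the adjusted R^2, which penalizes R^2 for the number of variables used. Whether it is among the metrics meant here is an assumption. The sketch uses the usual formula 1 - (1 - R^2)(N - 1) / (N - n - 1), with N the number of samples and n the number of variables. The function name and example numbers are hypothetical.

```python
def adjusted_r_squared(r2, n_samples, n_variables):
    """Adjusted R^2: shrinks R^2 as more variables are added for a fixed sample size."""
    if n_samples - n_variables - 1 <= 0:
        raise ValueError("need more samples than variables plus one")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_variables - 1)


# Hypothetical values: the same R^2 looks less impressive with more variables.
few_vars = adjusted_r_squared(r2=0.9, n_samples=50, n_variables=5)
many_vars = adjusted_r_squared(r2=0.9, n_samples=50, n_variables=20)
print(few_vars, many_vars)
```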

Model selection:

Vocabulary: When selecting a model, we distinguish three different parts of the data, as follows:

Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:

Cross-validation: Cross-validation, also noted CV, is a method used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:

The most commonly used method is k-fold cross-validation. It splits the training data into k folds, validates the model on one fold while training it on the k−1 other folds, and repeats this k times so that each fold is used once for validation. The error is then averaged over the k folds and is called the cross-validation error.
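The k-fold procedure can be sketched in plain Python as follows. The helper names, the contiguous (unshuffled) fold layout, the squared-error loss, and the toy "predict the training mean" model are all assumptions for illustration. In practice the data is usually shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_error(xs, ys, k, fit, predict):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held_out_set]
        # Train on the k-1 other folds, validate on the held-out fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_xs, train_ys):
    return sum(train_ys) / len(train_ys)


def predict_mean(model, x):
    return model


xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = cross_validation_error(xs, ys, k=3, fit=fit_mean, predict=predict_mean)
print(cv_error)  # 6.25
```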

Regularization: The regularization procedure aims to keep the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:
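As one illustration of regularization, the sketch below uses an L2 (ridge) penalty on a one-variable linear model without intercept. Picking ridge here is an assumption made for the example. Minimizing sum((y - w*x)^2) + lam * w^2 gives the closed form w = sum(x*y) / (sum(x^2) + lam), so a larger penalty `lam` shrinks the slope toward zero. The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope w minimizing sum((y - w*x)^2) + lam * w^2 for a no-intercept linear model."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


# Hypothetical data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
unregularized = ridge_slope(xs, ys, lam=0.0)
regularized = ridge_slope(xs, ys, lam=14.0)
print(unregularized, regularized)  # 2.0 1.0
```

With `lam=0.0` the fit recovers the true slope. With a positive `lam`, the model accepts a slightly worse fit on the training data in exchange for smaller weights, which is what reduces variance.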

Diagnostics:

Bias: The bias of a model is the difference between the expected prediction and the correct value that we try to predict for given data points.

Variance: The variance of a model is the variability of the model prediction for given data points.

Bias/variance tradeoff: The simpler the model, the higher the bias, and the more complex the model, the higher the variance.
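The tradeoff can be illustrated with a small simulation, sketched here under stated assumptions: a hypothetical true function f(x) = x^2, Gaussian noise, a very simple model (a constant equal to the mean of the training targets) and a very flexible one (a lookup table that memorizes each noisy training target). Squared bias and variance are estimated by refitting both models on many resampled datasets.

```python
import random


def true_f(x):
    return x * x  # hypothetical ground truth for the simulation


def bias_variance(runs, targets):
    """runs[d][i] is the prediction at point i from the model fitted on dataset d."""
    n_runs, n_points = len(runs), len(targets)
    bias_sq = variance = 0.0
    for i in range(n_points):
        column = [run[i] for run in runs]
        mean_pred = sum(column) / n_runs
        bias_sq += (mean_pred - targets[i]) ** 2
        variance += sum((p - mean_pred) ** 2 for p in column) / n_runs
    return bias_sq / n_points, variance / n_points


rng = random.Random(0)  # fixed seed so the simulation is repeatable
xs = [i / 10 for i in range(11)]
targets = [true_f(x) for x in xs]
simple_runs, complex_runs = [], []
for _ in range(500):
    noisy = [t + rng.gauss(0.0, 0.5) for t in targets]
    mean_y = sum(noisy) / len(noisy)
    simple_runs.append([mean_y] * len(xs))  # constant model: same prediction everywhere
    complex_runs.append(list(noisy))  # memorizing model: reproduces each noisy target

simple_bias, simple_var = bias_variance(simple_runs, targets)
complex_bias, complex_var = bias_variance(complex_runs, targets)
print(f"simple:  bias^2={simple_bias:.3f} variance={simple_var:.3f}")
print(f"complex: bias^2={complex_bias:.3f} variance={complex_var:.3f}")
```

The constant model barely changes between datasets but misses the shape of the curve, while the memorizing model is right on average but follows the noise of each dataset.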

Error analysis: Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.

Ablative analysis: Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.

I hope this helps you look up the important things easily.

Credits: Amidi brothers!

Life-cycle of a Data Science Project

Are you wondering what the life-cycle of a data science project looks like? Here you go.

Problem Identification:

Have you ever heard the phrase “Here’s the data, can you do some analysis and find some insights?” Often, management approaches data scientists with vague or even undefined goals. Understanding the goal is important and sets up the rest of the project for success.

This step takes up about 10% of the time in the project life-cycle.

Data Preparation:

By far everybody’s least favorite stage, but possibly the most important one. Data can come from different sources, be in an ugly format, and have errors and a myriad of other problems. A single error in this stage can render the rest of the analysis useless.

That’s why typically, up to 70% of the time is spent here.

Analyse the data:

This stage covers creating models, performing data mining, setting up simulations, and so on. It is the most exciting part, and if the previous stages were done correctly, analyzing the data and getting insights will go smoothly.

Time needed here would be 10%

Visualization of the insights:

Visualizing goes hand in hand with analyzing. It is a powerful technique, as looking at the data in various forms and shapes can help reveal insights that are otherwise not evident. Also, several kinds of projects, such as BI dashboards, don’t need much analysis but rely on visualization instead.

Time needed here would be 10%

Presentation of the findings:

We’ve reached 100%, so the project is over! Actually, no. Presenting findings is a whole separate “additional” stage. You need to not only convey the insights in your audience’s language but also get buy-in from them to take action based on those insights. This is an art.

Time needed: extra 80% 🙂

Hope you benefited! Enjoy learning!