
Two Massively Misused Metrics in Data Science: Correlation & Accuracy

Noah Sultan, PhD
3 min read · May 13, 2024


The office reference

A common mistake in data science is to use correlation to check whether two variables depend on each other, even though correlation does not measure dependence. This misunderstanding leads to correlation being frequently misused and misinterpreted. Another common mistake is to use accuracy to judge the quality of a predictive classification model.

Below are the common misuses of these metrics and the corresponding remedies.

Correlation

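Correlation doesn’t reflect dependence. It reflects a pattern: the usual (Pearson) coefficient only measures how closely two variables follow a straight line. As a result, two variables can be perfectly dependent while their correlation is zero, for example when one is the square of the other over a range symmetric around zero. In the same manner, two variables can have a very weak dependence while their correlation is high.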
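To make the first case concrete, here is a minimal sketch assuming NumPy is available. The data is hypothetical: `y` is computed directly from `x`, so the dependence is perfect, yet the Pearson correlation comes out as essentially zero.

```python
import numpy as np

# Hypothetical data: x is symmetric around zero and y is fully determined by x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]

# Knowing x gives y exactly, but the linear correlation is (numerically) zero.
print(f"Pearson correlation between x and x**2: {r:.6f}")
```

The symmetry of `x` around zero is what makes the positive and negative halves cancel out; on a range such as 0 to 1 the same relationship would show a high correlation.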

Drawing conclusions from correlation alone is problematic, and often not reproducible, for the following reasons:

  1. Correlation is a random variable. It is estimated from a sample of data, and when the sample is small or not representative, the estimate will not be accurate.
  2. In many cases, a high correlation does not indicate much of a relationship. Here is an example of correlations above 0.8; look at the actual underlying graphs:
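Both points can be sketched in a few lines, assuming NumPy is available. All data below is made up for illustration and is not the data behind the graphs referenced above. The first part draws many small samples from two independent variables and looks at how much the sample correlation moves around; the second part shows one extreme point turning an almost uncorrelated cloud into a correlation above 0.8.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Point 1: the sample correlation is itself a random quantity.
# x and y are drawn independently, so the true correlation is 0,
# yet small samples can show sizeable correlations by chance.
n_samples, sample_size = 1000, 10
sample_rs = np.array([
    np.corrcoef(rng.normal(size=sample_size),
                rng.normal(size=sample_size))[0, 1]
    for _ in range(n_samples)
])
print(f"spread (std) of sample correlations: {sample_rs.std():.2f}")
print(f"largest |correlation| seen by chance: {np.abs(sample_rs).max():.2f}")

# Point 2: a single outlier can create a high correlation.
# Ten hypothetical points with no visible trend...
x_base = np.arange(1, 11, dtype=float)
y_base = np.array([5, 3, 6, 2, 7, 4, 6, 3, 5, 4], dtype=float)
r_without = np.corrcoef(x_base, y_base)[0, 1]

# ...plus one extreme point far away from the rest.
x_out = np.append(x_base, 100.0)
y_out = np.append(y_base, 100.0)
r_with = np.corrcoef(x_out, y_out)[0, 1]

print(f"correlation without the outlier: {r_without:.2f}")
print(f"correlation with the outlier:    {r_with:.2f}")
```

In the second part, a scatter plot would show a shapeless cloud plus one distant point, which is why plotting the data is a safer first step than reading the coefficient alone.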

