Fooled by Correlation. Why High Correlation is Not Enough to Predict the Future

Noah Sultan, PhD
2 min read · Apr 25, 2021


Four sets of data with the same correlation of 0.816

For most people, a high correlation between two variables is reason enough to predict the future, make a business decision, or even draw a scientific conclusion. Recently, Nassim Taleb showed in a YouTube video that correlation measures should not be used in the presence of nonlinearities, which are present most of the time.

For example, when two variables are associated only half the time (as shown in the next figure), the correlation coefficient is not 0.5 but roughly 0.9.

A plot of two variables that are associated only half the time. The correlation is not 50%, but ~90%.

You can check this yourself with a few lines of Python:

import numpy as np
import matplotlib.pyplot as plt

# y follows x while x < 5, then stays flat at 5
x = np.linspace(0, 10, 11)
y = np.piecewise(x, [x < 5, x >= 5], [lambda x: x, lambda x: 5])

print("R = " + str(np.corrcoef(x, y)[0][1]))  # Pearson correlation coefficient
plt.plot(x, y)
plt.show()
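To make the "associated only half the time" point explicit, here is a minimal sketch that computes the correlation separately on each half of the same data. The helper name pearson_or_none and the choice to return None for a constant series are assumptions made for this illustration; they are not part of the original snippet.

```python
import numpy as np

def pearson_or_none(x, y):
    """Return Pearson's r, or None when either series has zero variance
    (the coefficient is undefined in that case)."""
    if np.std(x) == 0 or np.std(y) == 0:
        return None
    return float(np.corrcoef(x, y)[0, 1])

# Same data as above: y follows x while x < 5, then stays flat at 5
x = np.linspace(0, 10, 11)
y = np.where(x < 5, x, 5.0)

rising = x < 5  # boolean mask selecting the half where y tracks x

r_all = pearson_or_none(x, y)
r_rising = pearson_or_none(x[rising], y[rising])
r_flat = pearson_or_none(x[~rising], y[~rising])

print("whole range :", r_all)     # about 0.89
print("rising half :", r_rising)  # perfect linear association
print("flat half   :", r_flat)    # None: y is constant, so r is undefined
```

On the rising half the association is perfect, on the flat half there is none, and the single number for the whole range hides that split.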

How to avoid falling into the correlation trap?

Visual examination is one fast solution. The image below, known as Anscombe's quartet, was created specifically to show the importance of visualisation and that numerical calculations alone are not enough.

Four sets of data with the same correlation of 0.816
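The four datasets can be reproduced with a short sketch. It assumes the image shows Anscombe's quartet and uses the commonly published values for it. The names quartet, correlations and plot_quartet are chosen for this example only.

```python
import numpy as np

# Commonly published values of Anscombe's quartet (assumed to match the image)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Pearson correlation for each dataset: the numbers are nearly identical
correlations = {
    name: float(np.corrcoef(x, y)[0, 1]) for name, (x, y) in quartet.items()
}
for name, r in correlations.items():
    print(f"Dataset {name}: r = {r:.3f}")

def plot_quartet():
    """Scatter-plot the four datasets side by side (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
        ax.scatter(x, y)
        ax.set_title(f"{name}: r = {correlations[name]:.3f}")
    fig.tight_layout()
    plt.show()
```

Calling plot_quartet() draws the four scatter plots. The printed coefficients agree to three decimals, while the plots show a roughly linear cloud, a curve, a line with one outlier, and a vertical cluster with one extreme point.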

Nowadays, when many predictive models are far from accurate and many scientific papers contradict each other, we need to familiarise ourselves with these kinds of biases.

Next time, before using the correlation value to make a decision, visualise your data. Then, decide if you want to change your mind or not.
