Two Massively Misused Metrics in Data Science: Correlation & Accuracy
A common mistake in data science is to use correlation as a test of dependence between two variables, even though correlation doesn’t measure dependence. As a result, correlation is often misused and misinterpreted. Another common mistake is relying on accuracy to judge the quality of a predictive classification model.
Below are the common misuses of these metrics and the corresponding remedies.
Correlation
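Correlation doesn’t reflect dependence. It reflects a pattern: the Pearson coefficient, for example, captures only how closely two variables follow a straight line. As a result, two variables can be perfectly dependent while their correlation is zero (y = x² with x symmetric around zero is one such case). In the same manner, two variables can be only weakly dependent while their correlation is high.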
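Here is a minimal sketch of both situations, assuming NumPy and the Pearson coefficient. The data are made up for illustration: the quadratic relationship, the seed, and the single outlier at (20, 20) are my assumptions, not taken from a real dataset.

```python
import numpy as np

# Case 1: perfect dependence, near-zero correlation.
# y is fully determined by x, but the relationship is not linear.
x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero (assumed for the demo)
y = x ** 2
r_quadratic = np.corrcoef(x, y)[0, 1]

# Case 2: two independent noise variables, then one extreme point added.
rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
a = rng.normal(size=50)
b = rng.normal(size=50)  # drawn independently of a
r_noise = np.corrcoef(a, b)[0, 1]

# A single outlier (hypothetical value) is enough to inflate the coefficient.
a_out = np.append(a, 20.0)
b_out = np.append(b, 20.0)
r_outlier = np.corrcoef(a_out, b_out)[0, 1]

print(f"y = x^2:              r = {r_quadratic:.3f}")
print(f"independent noise:    r = {r_noise:.3f}")
print(f"noise + one outlier:  r = {r_outlier:.3f}")
```

The first coefficient comes out at essentially zero even though y is a deterministic function of x, while the third is large even though 50 of the 51 points are pure noise. Plotting the data before trusting the number is what catches both cases.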
Drawing conclusions from correlation alone is problematic, and the conclusions are often not reproducible, for the following reasons:
- The sample correlation is a random variable. It is computed from a sample of data, so it changes from one sample to the next, and when the sample is not representative, the measured correlation will not be accurate.
- In many cases, a high correlation says little about the actual relationship. Here is an example with correlations above 0.8; look at the actual underlying graphs: