SQL Hidden Mistakes Made by Experienced Data Analysts

Noah Sultan, PhD
3 min readJul 15, 2023

--

Writing wrong SQL syntax is not that bad because you would immediately know, correct what you missed, et voila, as nothing has happened. What is really too bad is when the code run and it gives results but it is not correct results (not crazy wrong either, otherwise you might notice).

To demonstrate that, imagine you have two tables, the first contains customers of your website and the second contains orders made on the website.

customers table:
+-------------+---------------+-------------+
| customer_id | customer_name | visit_count |
+-------------+---------------+-------------+
| 1 | John Doe | 5 |
| 2 | Jane Smith | 8 |
| 3 | Mike Johnson | 3 |
+-------------+---------------+-------------+
orders table:
+----------+-------------+------------+--------------+
| order_id | customer_id | order_date | total_amount |
+----------+-------------+------------+--------------+
| 1 | 1 | 2023-07-10 | 50.00 |
| 2 | 1 | 2023-07-12 | 75.00 |
| 3 | 2 | 2023-07-11 | 120.50 |
| 4 | 3 | 2023-07-15 | 25.75 |
+----------+-------------+------------+--------------+

Most of the time, we do not work with tables on their own, but rather after joining them. The above mentioned tables can be joined together on the customer_id column.

Now if we want compute something as simple as total visits on the website or total visits by user, if we are not very careful, we will get wrong results without noticing.

SELECT SUM(c.visit_count) AS total_visits
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id

The results will be 21 visits, which is not correct.

And even when we try to sum the total visits per user :

SELECT c.customer_id, c.customer_name, SUM(c.visit_count) AS total_visits
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

We get wrong values for one of the customer (the customer that has more than one order):

customers table:
+-------------+---------------+--------------+
| customer_id | customer_name | total_visits |
+-------------+---------------+--------------+
| 1 | John Doe | 10 |
| 2 | Jane Smith | 8 |
| 3 | Mike Johnson | 3 |
+-------------+---------------+--------------+

We can see what went wrong by taking a look on the table resulted from the join :

+-------------+---------------+--------------+----------+------------+--------------+
| customer_id | customer_name | visit_count | order_id | order_date | total_amount |
+-------------+---------------+--------------+----------+------------+--------------+
| 1 | John Doe | 5 | 1 | 2023-07-10 | 50.00 |
| 1 | John Doe | 5 | 2 | 2023-07-12 | 75.00 |
| 2 | Jane Smith | 8 | 3 | 2023-07-11 | 120.50 |
| 3 | Mike Johnson | 3 | 4 | 2023-07-15 | 25.75 |
+-------------+---------------+--------------+----------+------------+--------------+

This phenomena is called fanout, and it is hard to notice because the results are partially correct. For instance, if we are aggregating on one of the other dimensions (ex. computing visits per customers), the results will be be wrong only for clients that made more than 1 order.

Here is the whole SQL code to test it your self on a platform like sqlite :

CREATE TABLE customers (
customer_id integer,
customer_name varchar(255),
visit_count integer
);

CREATE TABLE orders (
order_id integer,
customer_id integer,
order_date date,
total_amount decimal(10, 2)
);

INSERT INTO customers (customer_id, customer_name, visit_count) VALUES
(1, 'John Doe', 5),
(2, 'Jane Smith', 8),
(3, 'Mike Johnson', 3);

INSERT INTO orders (order_id, customer_id, order_date, total_amount) VALUES
(1, 1, '2023-07-10', 50.00),
(2, 1, '2023-07-12', 75.00),
(3, 2, '2023-07-11', 120.50),
(4, 3, '2023-07-15', 25.75);

SELECT c.customer_id, c.customer_name, SUM(c.visit_count) AS total_visits
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;

It is not easy to avoid these types of errors specially they are hidden when they happen. But it is enough to know they exist, and double check the results, or apply the aggregation on the original table when the join is not needed for the computation.

Some BI tools like Google Looker take care of this errors for you by applying what is called symmetric aggregates, but otherwise you will have to take of it yourself.

--

--

Noah Sultan, PhD
Noah Sultan, PhD

Written by Noah Sultan, PhD

LinkedIn Top Data Voice | Data Scientist | Creating AI apps, 1 per weekend | PhD in Machine Learning | 📍 Paris | linkedin.com/in/eisultan

No responses yet