SQL tips and tricks for data analysis

Published March 4, 2021 |

Along with R and Python, SQL is one of the pillars of data analysis programming. To derive the most meaningful data, analysts need to learn the most efficient ways to use SQL.

Everything from query performance to data grouping can be enhanced by following a few simple tips.

1. Use Descriptive Names

Use simple and easily understood names for columns and tables.

If a table is named “users,” don’t label its columns “user_name”, “user_birthday” and so on. Labeling them “name” and “birthday” is more descriptive.

If you find yourself combining data to retrieve columns regularly, consider adding a new column to the schema with this data. Your queries will be simple and won’t need any data manipulation.

2. Format Your Code

Formatting is a visual characteristic of your code and it doesn’t affect results. However, it makes your code easier to debug and analyze in case you encounter errors. Have a look at the following lines of code and ask yourself which instance is easier to read:

SELECT * FROM countries WHERE name = ‘PH’;

versus

SELECT

FROM

countries

WHERE

name = ‘PH’;

Everyone has a preferred formatting style. Some might find the second example over formatted. Whatever your choice is, make sure you’re consistent and that you use line breaks before major clauses at the very least.

3. Use Uppercase and Lowercase

SQL clauses aren’t case sensitive, but the generally accepted practice is to write them in all caps.

When combined with proper formatting, it makes reading your queries easy. Lowercase is used for tables, columns, and field values except when the names of these elements include uppercase letters.

4. Follow the Order of Execution

The SQL order of execution refers to the way programming clauses are listed. Analysts often create problems for themselves by changing the ideal order of execution. Much like baking a cake, your query needs to follow the right steps to achieve the best result.

Here’s the correct order:

from – Define which tables you’ll source data from
where – Apply filters to your base data
group by – Aggregate your data
having – Filter the aggregated data
select – Display the final data
order by – Sort data for easy viewing
limit – Restrict the number of results

Follow the order of execution, and your queries will run smoothly without needing many performance boosting tricks.

5. Avoid Over-Normalization

Avoid organizing your database into tables that contain many one- or two-column tables. For example, data such as dates and zip codes don’t need their own tables with foreign keys.

You’ll reduce the amount of duplicate SQL and will stress the database to a lesser degree.

6. Go Tall, Not Wide, With Tables

You don’t want to over-normalize your data, but you shouldn’t be creating large, wide tables either.

If your tables have more than a dozen columns and contain sequential data, such as “question1′,” “question2”, and so on, your queries are going to stress the database immensely. Store such data in a different table. This makes it easier to retrieve data and run analytics on it.

JSON columns are great for production, but they aren’t the best for analysis environments. Transform these data into simpler types, and your analysis will be much smoother.

7. Be Consistent With Keys

There are many ways to name primary and foreign keys. The easiest is to label the primary key with a name (for example, “id”) and to label the foreign key as “tablename.id.” A popular method of naming primary and foreign keys is to label the foreign keys with the same name as the primary key. If you follow such a pattern, you’ll need to avoid abbreviations.

Be consistent with your naming patterns and clearly name columns that are foreign keys to other tables. For example, if the column “owner_id” is the foreign key to a table named “users,” then name it “user_owner_id” or something that makes it obvious that it’s a foreign key to a designated table.

8. Master ‘Group By’

The clause “group by” is one of the most frequently used in SQL. You can use it to aggregate data, and there are many ways to get it to work in powerful ways for you.

For instance, if you find yourself writing lengthy “group by” clauses, consider grouping data by the column number instead of the name. This practice is called ordinal notation, and its use is debated since it reduces legibility for some users. However, it makes aggregating large datasets easy.

Ordinal notation is also useful if you choose aliases that match column names. You can use ordinal notation for dev work and reserve using column names for production.

9. Ensure Data Consistency

Make sure your data has one source of truth. Label views and rollups accordingly so that it’s clear that the source data exists elsewhere.

Discard legacy columns and tables to prevent confusion during data maintenance checks.

10. Store datetimes Correctly

Some teams store Unix strings as dates. Instead, convert them to datetimes. Don’t store days, months, and years in different columns. SQL’s date functions are hard enough to code, so don’t make it even harder on yourself by having to concatenate this data.

Store everything in the UTC timezone, and you’ll avoid running into time series analysis issues.

Conclusion

These are just some of the many tips you can apply to get more out of your data warehouse and queries. SQL isn’t a perfect tool for data analysis, but it is a powerful one nonetheless.

Ensure your schemas follow these best practices and build your infrastructure in such a way that it supports your analytics efforts.

You can refer to the SQL style guide for more helpful tips and clarifications.