Essential SQL Commands for Data Science: A Comprehensive Guide

Essential SQL Commands for Data Science: A Comprehensive Guide

What is SQL?

Structured Query Language, abbreviated as SQL, is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS) or stream processing in a relational data stream management system (RDSMS) as per Wikipedia.

It is a language used to write queries against Relational Databases.

Types of SQL Commands

There are five types of SQL commands that you will use to interact with the databases. They are based on functionalities.

  1. Data Definition Language (DDL)

These are commands that are used to create, modify and delete the database objects. They are used to define and manage the database structure. They are used rarely in data science as they are common with developers and database managers.

These includes:

CREATE   --- used to create database objects i.e table or view

ALTER    --- used to modify existing daatabase objects

DROP     --- used to delete database object or temporary database objects

TRUNCATE --- used for deleting data from table but retains its structure.

2. Data Manipulation Language(DML)

These are commands used to manipulate data stored in a database as the name suggests. They usually allow the users to retrieve and modify the data. The most common tasks are filtering data, sorting data, joining tables, grouping data, and calculations.

The most common DML commands are:

SELECT  --- used for retrieving data from tables in a database
INSERT  --- used for adding data to a table in a database
UPDATE  --- used for modifying data in a table
DELETE  --- used for data removal from a table

3. Data Query Language(DQL)

These are commands used to retrieve data from tables in a database. They focus on querying or selecting data from a database. These include:

SELECT   --- used for retrieving data from table(s), specifies columns to be retrieved and conditions
WHERE    --- used for filtering data based on a condition
ORDER BY --- used for sorting the retrieved data
GROUP BY --- used for grouping data based on columns. Used with aggregating functions
JOIN    --- used to combine data from different tables in a database.

4. Transaction Control Language(TCL)

These are commands used to control the changes that are made to a database and to ensure data quality and consistency are maintained. The most common TCL commands include:

COMMIT          --- used for saving changes made to the database
ROLLBACK        --- used to undo changes made to a database since last commit
SAVEPOINT       --- used to create a restore point
SET TRANSACTION --- used in defining transaction characteristics

5. Data Control Language(DCL)

These commands are rarely used in data science as they involve controlling access to a database. They grant or revoke privileges to users of a database.

These commands include:

GRANT  --- gives users permission for perfominng certain actions
REVOKE --- removes permissions granted to users
DENY   --- prevents user(s) from accessing certain database objects.

Data Science SQL Commands

Having an understanding of the different types of commands it's easy to apply them in your data science projects. Let's dive into their usage in data science with practical examples.

Data Cleaning with SQL

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

A database can be large and thus prone to data integrity and consistency issues. To assist in data-driven decisions you must use accurate and consistent data, thus a need for cleaning data.

a) Missing Data

The database can have missing data in columns in a table(s) or incomplete data. SQL has commands that aid in detecting and replacing missing data with desired values. The commands include:-

IS NULL      --- checks for null variables
IS NOT NULL  --- checks for non null variables
COALESCE     --- returns non null value in a list of expression.

Examples:

We query the Person table for a person with a null middle name.

SELECT *
FROM Person.Person
WHERE MiddleName IS NULL

The output is as follows:

IS NOT NULL is the opposite of the IS NULL function and returns the rows without null data points.

The COALESCE command is used to return the non-null values, we use it to replace null or missing values with a default value.

Example:

Here we will query the table and change the null MiddleName column values to blank.

SELECT  NameStyle, Title, FirstName, COALESCE(MiddleName, '') as MiddleName
FROM Person.Person

The output is as follows:

b) Duplicates

These are recurring items in our database. When conducting data science projects we try to use distinct values to minimize errors during machine learning or model building.

We use the following commands to achieve this:

SELECT  DISTINCT(FirstName) --- the output are the distinct names in the FirstName column
FROM Person.Person

SELECT  FirstName, EmailPromotion
FROM Person.Person
GROUP BY  FirstName, EmailPromotion  --- the columns should be in the select statement
HAVING  condition   --- used to filter the groups based on the condition.

These are common commands used in detecting and dealing with duplicates.

Data Manipulation with SQL

Data manipulation is the process of organizing or arranging data in order to make it easier to interpret.

SQL can be used to get insights pertaining to a given data set. As you may have earlier noted SELECT statement is always used to query the database for data. We will have a look at it together with other commands used in data manipulation with real examples.

a) Joins

Databases are structured into tables, and unlike a flat relational database, you have to query several tables to retrieve data pertaining to a certain person, department, or item. There are several types of joins which include INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL OUTER JOIN. They differentiate in which rows from which table is returned.

The key thing is identifying the primary key before joining the tables. I’ll illustrate the use of inner join which returns rows with matching values in both tables.

SELECT a.Name, c.JobTitle, c.NationalIDNumber, d.FirstName, d.LastName
FROM HumanResources.Department a
INNER JOIN HumanResources.EmployeeDepartmentHistory b
ON a.DepartmentID = b.DepartmentID
INNER JOIN HumanResources.Employee c
ON b.BusinessEntityID = c.BusinessEntityID
INNER JOIN Person.Person d
ON b.BusinessEntityID = d.BusinessEntityID

In the above snippet, we join three tables to get the name, job title, id number, and names. There are two primary keys that make it possible to join them(DepartmentID and BusinessEntityID). The resulting table is as follows, which can be exported for other analytics.

b) Aggregate Functions

These are commands used to perform calculations on columns in a database. The most common are Count, Sum, AVG, Min, Max, STDDEV, Variance, and Group_concat. They are mostly used in the Select statements to return aggregated values from the columns.

Examples depicting their usage can be:

SELECT sum(linetotal) as TotalSalesAmount, ProductID
FROM Sales.SalesOrderDetail
GROUP BY ProductID
HAVING sum(LineTotal)>=1000000
ORDER BY TotalSalesAmount DESC

The query will retrieve the sum of the Linetotal column after grouping with the product and return only the total above the given amount. We use the order function for sorting in descending order. The given results are as follows:

Additional Aspects

Different data science project requires different statements/commands depending on the data you want to retrieve from a database using SQL. I would suggest as you dive deeper into SQL language learn about the additional basic commands that assist in day-to-day SQL for data science.

These include:

LIMIT 10 --- assists when the database hase many rows to display a certain number of rows
TOP      --- works similarly with limit but used together with the select statement.

*/  OPERATORS /*
=      --- equals
>,<    --- less and greater than
>=,<=  --- less or equal and greater or equal than
<>     --- not equal to

*/ Text Operations /*

% , -, [], ^, -   ---- used in returning strings with certain characters or range of character

*/  Main  /*

*    --- used in returning all the rows in a ceratin table or tables

SQL allows set, conditional and logical-mathematical valuations which include UNION and INTERSECTION, BETWEEN, IS NULL, LIKE and IN, AND and OR respectively.

These come in handy in data science projects using SQL.

Conclusion

SQL is a powerful tool for data scientists, and a solid understanding of the five types of SQL commands can help you to effectively manage data and gain valuable insights. While some of the commands are not commonly used by data scientists, many of them are essential for data manipulation and data cleaning and can make a significant difference in the quality of your data-driven decisions.

SQL can help streamline your workflow, improve data accuracy and consistency, and make faster, more informed decisions.

Happy Learning and feel free to reach out!


You can reach out through LinkedIn, Twitter, and Gmail. Thank you!