- Explain the importance of grouping and sorting data stored in a database
- Group and sort data with the
GROUP BY()
andORDER BY()
keywords - Craft advanced queries using aggregator functions along with sorting keywords and other conditional clauses
SQL isn't picky about how it returns data to you, based on your queries. It will simply return the relevant table rows in the order in which they exist in the table. This is often insufficient for the purposes of data analysis and organization.
How common is it to order a list of items alphabetically? Or numerically from least to greatest?
We can tell our SQL queries and aggregate functions to group and sort our data using several clauses:
ORDER BY()
LIMIT
GROUP BY()
HAVING
andWHERE
ASC
/DESC
Let's take a closer look at how we use these keywords to narrow our search criteria as well as to order and group the results.
Some cats are very famous, and accordingly very wealthy. Our Pets Database will
have a cats
table in which each cat has a name, age, breed, and net worth. Our
database will also have an owners
table and cats_owners
join table so that a
cat can have many owners and an owner can have many cats.
Creating the Database:
In your terminal, create the database and start sqlite3
with the following:
sqlite3 pets_database.db
Creating the tables:
Create the tables by entering the commands below at the sqlite>
prompt:
cats
table:
CREATE TABLE cats (
id INTEGER PRIMARY KEY,
name TEXT,
age INTEGER,
breed TEXT,
net_worth INTEGER
);
owners
Table:
CREATE TABLE owners (id INTEGER PRIMARY KEY, name TEXT);
cats_owners
Table:
CREATE TABLE cats_owners (
cat_id INTEGER,
owner_id INTEGER
);
Inserting the values:
Finally, to insert the values, enter the following:
cats
:
INSERT INTO cats (id, name, age, breed, net_worth) VALUES (1, "Maru", 3, "Scottish Fold", 1000000);
INSERT INTO cats (id, name, age, breed, net_worth) VALUES (2, "Hana", 1, "Tabby", 21800);
INSERT INTO cats (id, name, age, breed, net_worth) VALUES (3, "Grumpy Cat", 4, "Persian", 181600);
INSERT INTO cats (id, name, age, breed, net_worth) VALUES (4, "Lil\' Bub", 2, "Tortoiseshell", 2000000);
owners
:
INSERT INTO owners (name) VALUES ("mugumogu");
INSERT INTO owners (name) VALUES ("Sophie");
INSERT INTO owners (name) VALUES ("Penny");
cats_owners
:
INSERT INTO cats_owners (cat_id, owner_id) VALUES (2, 2);
INSERT INTO cats_owners (cat_id, owner_id) VALUES (4, 3);
INSERT INTO cats_owners (cat_id, owner_id) VALUES (1, 2);
The general syntax for selecting values and sorting them is:
SELECT column_name, column_name
FROM table_name
ORDER BY column_name ASC, column_name DESC;
Note that ORDER BY()
will automatically sort the returned values in
ascending order so the use of the ASC
keyword is optional. If we want
to sort in descending order instead, we need to use the DESC
keyword.
Imagine you're working for an important investment firm in Manhattan. The
investors are interested in investing in a lucrative and popular cat. They need
your help to decide which cat that will be. They want a list of famous and
wealthy cats. We can do that by running a basic SELECT
statement at the
sqlite>
prompt:
SELECT * FROM cats WHERE net_worth > 0;
This will return:
id name age breed net_worth
----------- --------------- ---------- ------------- ----------
1 Maru 3 Scottish Fold 1000000
2 Hana 1 Tabby 21800
3 Grumpy Cat 4 Persian 181600
4 Lil\' Bub 2 Tortoiseshell 2000000
Our investors are busy people though. They don't have time to manually sort through this list of cats for the best candidate. They want you to return the list to them with the cats sorted by net worth, from greatest to least.
We can do so with the following line:
SELECT * FROM cats ORDER BY(net_worth) DESC;
This will return:
id name age breed net_worth
----------- --------------- ---------- ------------- ----------
4 Lil\' Bub 2 Tortoiseshell 2000000
1 Maru 3 Scottish Fold 1000000
3 Grumpy Cat 4 Persian 181600
2 Hana 1 Tabby 21800
Turns out our investors are very impatient. They don't want to review the list
themselves, they just want you to return to them the wealthiest cat. We can
accomplish this by using the LIMIT
keyword with the above query:
SELECT * FROM cats ORDER BY(net_worth) DESC LIMIT 1;
Which will return:
name age breed net_worth
--------------- ---------- ------------- ----------
Lil\' Bub 2 Tortoiseshell 2000000
The LIMIT
keyword specifies how many of the records resulting from the query
you'd like to actually return. In this case, because we've sorted the records
in descending order by net worth and set the LIMIT
to 1, the wealthiest cat
is returned.
The GROUP BY()
keyword is very similar to ORDER BY()
. The main difference is
that ORDER BY()
sorts sets of data returned by basic queries while GROUP BY()
sorts sets of data returned by aggregate functions.
SELECT column_name, aggregate_function(column_name)
FROM table_name
WHERE column_name operator value
GROUP BY column_name;
Let's calculate the sum of the net worth of all of the cats, grouped by owner name:
SELECT owners.name, SUM(cats.net_worth)
FROM owners
INNER JOIN cats_owners
ON owners.id = cats_owners.owner_id
JOIN cats ON cats_owners.cat_id = cats.id
GROUP BY owners.name;
This should return:
owners.name SUM(cats.net_worth)
--------------- -------------------
Penny 2000000
Sophie 1021800
Note: If you have headers turned on, the headers you see in your terminal may differ from the ones displayed here and below.
In the above query, we've implemented two joins. First, we're joining owners
and cat_owners
on owners.id = cats_owners.owner_id
. This first joined table
would look like the following if we were to query it:
owners.id owners.name cat_owners.cat_id cat_owners.owner_id
--------- ----------- ----------------- -------------------
2 Sophie 2 2
3 Penny 4 3
2 Sophie 1 2
With this table, we then implement a second join with cats
on
cats_owners.cat_id = cats.id
. To better understand this, try running the
provided query, but select everything rather than just the owner's name and
the sum of their cats' net worth, and remove the GROUP BY
line. You'll be able
to see all three tables have been joined.
In our example query above, we need to use the SUM(cats.net_worth)
aggregator
in conjunction with GROUP BY
to get the information we want. Without GROUP BY
,
only the first owner in the table would be returned, along with the sum of the
net worth of all the cats:
owners.name SUM(cats.net_worth)
--------------- -------------------
Sophie 3021800
If, on the other hand, we forget to use SUM
and just get cats.net_worth
,
the results will be grouped by owner, but only the net worth of the first
cat belonging to each owner will be returned, not the aggregate:
owners.name cats.net_worth
--------------- ---------------
Penny 2000000
Sophie 21800
When we use SUM
and GROUP BY
together, SUM
looks at all of the values in
the net_worth
column of the cats
table (or whichever column you specify in
parentheses) and takes the sum of those values, but only after those cats have
been grouped by owner:
owners.name SUM(cats.net_worth)
--------------- -------------------
Penny 2000000
Sophie 1021800
In our original data, Penny is the owner of Lil' Bub (2000000) while Sophie is the owner of Maru and Hana (1000000 + 21800).
Suppose we have a table called employee_bonus
as shown below. Note that the
table has multiple entries for employees Abigail and Matthew.
employee_bonus
:
Employee | Bonus |
---|---|
Matthew | 1000 |
Abigail | 2000 |
Matthew | 500 |
Tom | 700 |
Abigail | 1250 |
To calculate the total bonus that each employee received, we would write a SQL statement like this:
SELECT employee, SUM(bonus) FROM employee_bonus GROUP BY employee;
This should return:
employee SUM(bonus)
--------------- -------------------
Abigail 3250
Matthew 1500
Tom 700
Now, suppose we wanted to find the employees who received more than $1,000 in bonuses. You might think that we could write a query like this:
BAD SQL:
SELECT employee, SUM(bonus) FROM employee_bonus
GROUP BY employee WHERE SUM(bonus) > 1000;
Unfortunately, the above will not work because the WHERE
clause can't be used
with aggregates (SUM
, AVG
, MAX
, etc). What we need to use is the HAVING
clause. The HAVING
clause was added to SQL so that we could compare aggregates
in the same way that the WHERE
clause can be used for comparing non-aggregates.
Now, the correct SQL will look like this:
GOOD SQL:
SELECT employee, SUM(bonus) FROM employee_bonus
GROUP BY employee HAVING SUM(bonus) > 1000;
The difference between the HAVING
and WHERE
clauses in SQL is that the
WHERE
clause cannot be used with aggregates while the HAVING
clause can.
HAVING
filters out groups of rows created by GROUP BY
, and WHERE
filters
out individual rows. Note that there is nothing to stop you from using both
of them in the same query.
Finally, it is important to remember that order matters here: WHERE
must
come before GROUP BY
and HAVING
must come after it, as shown
below; changing the order will produce a syntax error.
SELECT
FROM
JOIN
ON
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT