Intro To Relational Databases

Here's just a sampling of the many data types that SQL supports. We won't be using most of these types in this course, though.

The exact list of types differs from one database to another. For a full list of types, check the manual for your database, such as this one for PostgreSQL.

Text and string types

text — a string of any length, like Python str or unicode types.

char(n) — a string of exactly n characters.

varchar(n) — a string of up to n characters.

Numeric types

integer — an integer value, like Python int.

real — a floating-point value, like Python float. Accurate up to six decimal places.

double precision — a higher-precision floating-point value. Accurate up to 15 decimal places.

decimal — an exact decimal value.

Date and time types

date — a calendar date; including year, month, and day.

time — a time of day.

timestamp — a date and time together.

Select statement

The most basic form of the select statement is to select a single scalar value:

select 2 + 2 ;

More usefully, we can select one or more columns from a table. With no restrictions, this will return all rows in the table:

select name, species from animals ;

Columns are separated by commas; use * to select all columns from the tables.

Quite often, we don't want all the data from a table. We can restrict the rows using a variety of select clauses, listed below. There are also a wide variety of functions that can apply to columns; including aggregation functions that operate on values from several rows, such as max and count.

where

The where clause expresses restrictions — filtering a table for rows that follow a particular rule. where supports equalities, inequalities, and boolean operators (among other things):

limit / offset

The limit clause sets a limit on how many rows to return in the result table. The optional offset clause says how far to skip ahead into the results. So limit 10 offset 100 will return 10 results starting with the 101st.

order by

The order by clause tells the database how to sort the results — usually according to one or more columns. So order by species, name says to sort results first by the species column, then by name within each species.

Ordering happens before limit/offset, so you can use them together to extract pages of alphabetized results. (Think of the pages of a dictionary.)

The optional desc modifier tells the database to order results in descending order — for instance from large numbers to small ones, or from Z to A.

group by

The group by clause is only used with aggregations, such as max or sum. Without a group by clause, a select statement with an aggregation will aggregate over the whole selected table(s), returning only one row. With a group by clause, it will return one row for each distinct value of the column or expression in the group by clause.

having

The having clause works like the where clause, but it applies after group by aggregations take place. The syntax is like this:

select columns from tables group by column having condition ;

Usually, at least one of the columns will be an aggregate function such as count, max, or sum on one of the tables' columns. In order to apply having to an aggregated column, you'll want to give it a name using as. For instance, if you had a table of items sold in a store, and you wanted to find all the items that have sold more than five units, you could use:

select name, count(*) as num from sales having num > 5;

You can have a select statement that uses only where, or only group by, or group by and having, or where and group by, or all three of them!

But it doesn't usually make sense to use having without group by.

If you use both where and having, the where condition will filter the rows that are going into the aggregation, and the having condition will filter the rows that come out of it.

You can read more about having here:

http://www.postgresql.org/docs/9.4/static/sql-select.html#SQL-HAVING

By default, SQL statements are terminated with semicolons. You use a semicolon to terminate statements unless you've (rarely) set a new statement terminator.

The big difference between comparison operators in SQL and in Python is that we use = for equality in SQL, whereas Python uses ==.

select name from animals where species != 'gorilla' and name != 'Max';
Find all the llamas born between January 1, 1995 and December 31, 1998.

select 
    name 
from 
    animals 
where 
    species = 'llama' and 
    birthdate >= '1995-01-01' and 
    birthdate <= '1998-12-31'

All the tables in the zoo database

animals

This table lists individual animals in the zoo. Each animal has only one row. There may be multiple animals with the same name, or even multiple animals with the same name and species.

diet

This table matches up species with the foods they eat. Every species in the zoo eats at least one sort of food, and many eat more than one. If a species eats more than one food, there will be more than one row for that species.

taxonomy

This table gives the (partial) biological taxonomic names for each species in the zoo. It can be used to find which species are more closely related to each other evolutionarily.

If you've never heard of this classification, don't worry about it; the details won't be necessary for this course. But if you're curious, Wikipedia articles Taxonomy and Biological classification may help.

ordernames

This table gives the common names for each of the taxonomic orders in the taxonomy table.

The SQL for it

And here are the SQL commands that were used to create those tables. We won't cover the create table command until lesson 4, but it may be interesting to look at:

create table animals (  
       name text,
       species text,
       birthdate date);

create table diet (
       species text,
       food text);  

create table taxonomy (
       name text,
       species text,
       genus text,
       family text,
       t_order text); 

create table ordernames (
       t_order text,
       name text);

Remember: In SQL, we always put string and date values inside single quotes.

Sample Queries

select max(name) from animals;
+-----------+
| max(name) |
+===========+
|       Zoe |
+-----------+

select * from animals limit 10;
+----------+------------+------------+
|     name |    species |  birthdate |
+==========+============+============+
|   Andrea |     alpaca | 2001-01-16 |
|    Bruno |     alpaca | 2004-09-23 |
|  Charlie |     alpaca | 2004-09-23 |
|    Della |     alpaca | 2006-01-09 |
|     Emma |     alpaca | 2013-03-16 |
|     Fred | brown bear | 1993-05-02 |
|   George | brown bear | 1997-06-24 |
|    Molly | brown bear | 1981-10-17 |
|  Eliezer |      camel | 1971-03-08 |
| Giuseppe |      camel | 1979-12-25 |
+----------+------------+------------+

select * from animals where species = 'orangutan' order by birthdate;
+---------+-----------+------------+
|    name |   species |  birthdate |
+=========+===========+============+
|    Raja | orangutan | 1975-04-09 |
| Kambing | orangutan | 1988-11-12 |
|    Ratu | orangutan | 1989-09-15 |
|  Putera | orangutan | 1993-06-29 |
|   Gajah | orangutan | 2011-05-26 |
|   Singa | orangutan | 2012-11-03 |
+---------+-----------+------------+

select name from animals where species = 'orangutan' order by birthdate desc;
+---------+
|    name |
+=========+
|   Singa |
|   Gajah |
|  Putera |
|    Ratu |
| Kambing |
|    Raja |
+---------+

select name, birthdate from animals order by name limit 10 offset 20;
+---------+------------+
|    name |  birthdate |
+=========+============+
|   Duran | 2009-09-20 |
| Ekwensu | 1993-10-31 |
|     Eli | 2002-02-22 |
| Eliezer | 1971-03-08 |
|    Emma | 2013-03-16 |
|    Emmy | 2008-04-15 |
|    Fido | 1999-08-04 |
| Francis | 1996-04-27 |
|    Fred | 1993-05-02 |
|   Gajah | 2011-05-26 |
+---------+------------+

select species, min(birthdate) from animals group by species;
+------------+----------------+
|    species | min(birthdate) |
+============+================+
|     alpaca |     2001-01-16 |
| brown bear |     1981-10-17 |
|      camel |     1971-03-08 |
|      dingo |     1999-08-04 |
|    echidna |     2003-01-31 |
|     ferret |     2006-09-18 |
|    gorilla |     1979-07-04 |
|      hyena |     2007-06-08 |
|     iguana |     1999-09-09 |
|     jackal |     2003-12-21 |
|      llama |     1984-01-05 |
|   mongoose |     2006-02-28 |
|      moose |     1990-03-27 |
|    narwhal |     1975-02-07 |
|  orangutan |     1975-04-09 |
|   platypus |     2000-04-13 |
|    raccoon |     2009-08-11 |
|   sea lion |     1987-04-29 |
|    unicorn |     1752-05-20 |
|    warthog |     2007-05-20 |
|        yak |     1996-09-20 |
|      zebra |     1989-03-15 |
+------------+----------------+

select name, count(*) as num from animals
group by name
order by num desc
limit 5;
+---------+-----+
|    name | num |
+=========+=====+
|  George |   4 |
|  Andrea |   2 |
| Charlie |   2 |
|    John |   2 |
|     Max |   2 |
+---------+-----+

select species, count(*) as count from animals
group by species
order by count desc
limit 1;
+---------+-------+
| species | count |
+=========+=======+
| gorilla |     9 |
+---------+-------+

insert

The basic syntax for the insert statement:

insert into table ( column1, column2, ... ) values ( val1, val2, ... );

If the values are in the same order as the table's columns (starting with the first column), you don't have to specify the columns in the insert statement:

insert into table values ( val1, val2, ... );

insert into animals values ('Bitey', 'opossum', '2015-04-08');

For instance, if a table has three columns (a, b, c) and you want to insert into a and b, you can leave off the column names from the insert statement. But if you want to insert into b and c, or a and c, you have to specify the columns.

A single insert statement can only insert into a single table. (Contrast this with the select statement, which can pull data from several tables using a join.)

join

select animals.name
from animals join diet
on animals.species = diet.species
where food = 'fish'
+---------+
|    name |
+=========+
|    Fred |
|  George |
|   Molly |
|   Bacon |
| Francis |
|   Inara |
|     Mal |
|   River |
|   Simon |
|     Zoe |
+---------+
select 
    name 
from 
    animals, diet
where 
    animals.species = diet.species
    and
    diet.food = 'fish'

columns that are used in join doesn't have to be in select

After Aggregating

Find the one food that is eaten by only one animal.

select 
    diet.food, count(animals.name) as num
from 
    diet, animals
where
    diet.species = animals.species
group by
    diet.food
having
    num = 1
+--------+-----+
|   food | num |
+========+=====+
| snakes |   1 |
+--------+-----+

More Join Practice

select 
    ordernames.name, count(animals.name) as num
from
    ordernames, taxonomy, animals
where
    ordernames.t_order = taxonomy.t_order
    and
    taxonomy.name = animals.species
group by
    ordernames.name
order by
    num desc
+---------------------+-----+
|                name | num |
+=====================+=====+
| even-toed ungulates |  29 |
|          carnivores |  25 |
|            primates |  15 |
|  odd-toed ungulates |   7 |
|  lizards and snakes |   4 |
|          monotremes |   4 |
|          marsupials |   3 |
| whales and dolphins |   2 |
+---------------------+-----+

Here's one possible solution:

select ordernames.name, count(*) as num
  from animals, taxonomy, ordernames
  where animals.species = taxonomy.name
    and taxonomy.t_order = ordernames.t_order
  group by ordernames.name
  order by num desc

And here's another, this time using the explicit join style:

select ordernames.name, count(*) as num
  from (animals join taxonomy 
                on animals.species = taxonomy.name)
                as ani_tax
        join ordernames
             on ani_tax.t_order = ordernames.t_order
  group by ordernames.name
  order by num desc

I think the upper version is much more readable than the lower one, because in the explicit join style you have to explicitly tell the database what order to join the tables in — ((a join b) join c) — instead of just letting the database worry about that.

If you're using a more barebones database (like SQLite) there can be a performance benefit to the explicit join style. But in PostgreSQL, the more server-oriented database system we'll be using next lesson, the query planner should optimize away any difference.