Applied Database Systems - Tutorial 2

I put an example ER schema and DDL translation for the exercise in the previous tutorial online. Todays tutorial is about inserting data into a databases, as well as reading, modifying, and deleting existing data in a database. This tutorial consists of three parts:

 Basic SQL

This tutorial is adapted in minor ways from the PostgreSQL tutorial. This section of the tutorial consists of 3 parts:

In order to create and examine the example tables:

  • Download this file (right click and "Save as...")
  • Start the PostgreSQL interactive terminal
  • Run the script with \i db1.sql (remembering that if you do not start pgsql from the folder in which you saved the file, then you need to qualify the name with the directory as well).
    Note: The script will drop tables weather and cities if they exist in your database. In case that those tables are not present you will see an error message that can be ignored.
  • \d will "describe" all the tables that were created
  • \d weather will describe that particular table
  • \d cities will describe the other table

A number of people have asked about how to to get more detailed information about the various SQL commands we're using, since the help that you get with \h in the PostgreSQL interactive terminal (psql) is rather terse, in essentially just describing the syntax of the commands. Here are some places to try:

Querying a Table

To retrieve data from a table, the table is queried. A SELECT statement is used to do this. The statement is divided into a select list (the part that lists the columns to be returned), a table list (the part that lists the tables from which to retrieve the data), and an optional qualification (the part that specifies any restrictions). For example, to retrieve all the rows of table weather, type:

 SELECT * FROM weather; 

(here * means "all columns") and the output should be:

     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 Hayward       |      37 |      54 |      | 1994-11-29
 London        |      -5 |      25 |  0.3 | 2004-01-05
 Chicago       |      15 |      35 |  0.1 | 2003-05-06
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 San Francisco |      43 |      57 |    0 | 1994-11-29
 Hayward       |      37 |      54 |      | 1994-11-29
(6 rows)

You may specify any arbitrary expressions in the target list. For example, you can do:

SELECT city, (temp_hi+temp_lo)/2 AS temp_avg, date FROM weather;

This should give:

     city      | temp_avg |    date
---------------+----------+------------
 Hayward       |       45 | 1994-11-29
 London        |       10 | 2004-01-05
 Chicago       |       25 | 2003-05-06
 San Francisco |       48 | 1994-11-27
 San Francisco |       50 | 1994-11-29
 Hayward       |       45 | 1994-11-29
(6 rows)

Notice how the AS clause is used to relabel the output column. (It is optional.)

Arbitrary Boolean operators (AND, OR, and NOT) are allowed in the qualification of a query. For example, the following retrieves the weather of San Francisco on rainy days:

SELECT * FROM weather
    WHERE city = 'San Francisco'
    AND prcp > 0.0;
Result:

     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
(1 row)

As a final note, you can request that the results of a select can be returned in sorted order or with duplicate rows removed:

SELECT DISTINCT city
    FROM weather
    ORDER BY city;
     city
---------------
 Chicago
 Hayward
 London
 San Francisco
(4 rows)

DISTINCT and ORDER BY can be used separately, of course.

Joining Tables

Thus far, our queries have only accessed one table at a time. Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables at one time is called a join query. As an example, say you wish to list all the weather records together with the location of the associated city. To do that, we need to compare the city column of each row of the weather table with the name column of all rows in the cities table, and select the pairs of rows where these values match.

Note:This is only a conceptual model. The actual join may be performed in a more efficient manner, but this is invisible to the user.

This would be accomplished by the following query:

SELECT *
    FROM weather, cities
    WHERE city = name;

     city      | temp_lo | temp_hi | prcp |    date    |     name      |  location
---------------+---------+---------+------+------------+---------------+------------
 London        |      -5 |      25 |  0.3 | 2004-01-05 | London        | (153,65.3)
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(3 rows)

Observe two things about the result set:

  • There is no result row for the city of Hayward. This is because there is no matching entry in the cities table for Hayward, so the join ignores the unmatched rows in the weather table. We will see shortly how this can be fixed.
  • There are two columns containing the city name. This is correct because the lists of columns of the weather and the cities table are concatenated. In practice this is undesirable, though, so you will probably want to list the output columns explicitly rather than using *:
SELECT city, temp_lo, temp_hi, prcp, date, location
    FROM weather, cities
    WHERE city = name;

Exercise: Attempt to find out the semantics of this query when the WHERE clause is omitted.

Since the columns all had different names, the parser automatically found out which table they belong to, but it is good style to fully qualify column names in join queries:

SELECT weather.city, weather.temp_lo, weather.temp_hi,
       weather.prcp, weather.date, cities.location
    FROM weather, cities
    WHERE cities.name = weather.city;

Join queries of the kind seen thus far can also be written in this alternative form:

SELECT *
    FROM weather INNER JOIN cities ON (weather.city = cities.name);

This syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.

Now we will figure out how we can get the Hayward records back in. What we want the query to do is to scan the weather table and for each row to find the matching cities row. If no matching row is found we want some "empty values" to be substituted for the cities table's columns. This kind of query is called an outer join. (The joins we have seen so far are inner joins.) The command looks like this:

SELECT * FROM weather LEFT OUTER JOIN cities ON (weather.city = cities.name);
     city      | temp_lo | temp_hi | prcp |    date    |     name      |  location
---------------+---------+---------+------+------------+---------------+------------
 Chicago       |      15 |      35 |  0.1 | 2003-05-06 |               |
 Hayward       |      37 |      54 |      | 1994-11-29 |               |
 Hayward       |      37 |      54 |      | 1994-11-29 |               |
 London        |      -5 |      25 |  0.3 | 2004-01-05 | London        | (153,65.3)
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(6 rows)

This query is called a left outer join because the table mentioned on the left of the join operator will have each of its rows in the output at least once, whereas the table on the right will only have those rows output that match some row of the left table. When outputting a left-table row for which there is no right-table match, empty (null) values are substituted for the right-table columns.

Exercise: There are also right outer joins and full outer joins. Try to find out what those do.

We can also join a table against itself. This is called a self join. As an example, suppose we wish to find all the weather records that are in the temperature range of other weather records. So we need to compare the temp_lo and temp_hi columns of each weather row to the temp_lo and temp_hi columns of all other weather rows. We can do this with the following query:

SELECT W1.city, W1.temp_lo AS low, W1.temp_hi AS high,
     W2.city, W2.temp_lo AS low, W2.temp_hi AS high
     FROM weather W1, weather W2
     WHERE W1.temp_lo < W2.temp_lo
     AND W1.temp_hi > W2.temp_hi;
     city      | low | high |     city      | low | high
---------------+-----+------+---------------+-----+------
 Hayward       |  37 |   54 | San Francisco |  46 |   50
 San Francisco |  43 |   57 | San Francisco |  46 |   50
 Hayward       |  37 |   54 | San Francisco |  46 |   50
(3 rows)

Here we have relabeled the weather table as W1 and W2 to be able to distinguish the left and right side of the join. You can also use these kinds of aliases in other queries to save some typing, e.g.:

SELECT *
     FROM weather w, cities c
     WHERE w.city = c.name;

You will encounter this style of abbreviating quite frequently.

Aggregate Functions

Like most other relational database products, PostgreSQL supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the count, sum, avg (average), max (maximum) and min (minimum) over a set of rows.

As an example, we can find the highest low-temperature reading anywhere with:

SELECT max(temp_lo) FROM weather;
 max
 -----
   46
 (1 row)

If we wanted to know what city (or cities) that reading occurred in, we might try:

  SELECT city FROM weather WHERE temp_lo = max(temp_lo);     WRONG

but this will not work since the aggregate max cannot be used in the WHERE clause. (This restriction exists because the WHERE clause determines the rows that will go into the aggregation stage; so it has to be evaluated before aggregate functions are computed.) However, as is often the case the query can be restated to accomplish the intended result, here by using a subquery:

SELECT city FROM weather
     WHERE temp_lo = (SELECT max(temp_lo) FROM weather);
     city
 ---------------
  San Francisco
 (1 row)

This is OK because the subquery is an independent computation that computes its own aggregate separately from what is happening in the outer query.

Aggregates are also very useful in combination with GROUP BY clauses. For example, we can get the maximum low temperature observed in each city with:

SELECT city, max(temp_lo)
     FROM weather
     GROUP BY city;

     city      | max
---------------+-----
 Chicago       |  15
 Hayward       |  37
 London        |  -5
 San Francisco |  46
(4 rows)

which gives us one output row per city. Each aggregate result is computed over the table rows matching that city. We can filter these grouped rows using HAVING:

SELECT city, max(temp_lo)
     FROM weather
     GROUP BY city
     HAVING max(temp_lo) < 40;
  city   | max
---------+-----
 Chicago |  15
 Hayward |  37
 London  |  -5
(3 rows)

which gives us the same results for only the cities that have all temp_lo values below 40. Finally, if we only care about cities whose names begin with "S", we might do

SELECT city, max(temp_lo)
     FROM weather
     WHERE city LIKE 'S%'
     GROUP BY city
     HAVING max(temp_lo) < 40;

The LIKE operator does pattern matching and is explained in the PostgreSQL User's Guide.

It is important to understand the interaction between aggregates and SQL's WHERE and HAVING clauses. The fundamental difference between WHERE and HAVING is this: WHERE selects input rows before groups and aggregates are computed (thus, it controls which rows go into the aggregate computation), whereas HAVING selects group rows after groups and aggregates are computed. Thus, the WHERE clause must not contain aggregate functions; it makes no sense to try to use an aggregate to determine which rows will be inputs to the aggregates. On the other hand, HAVING clauses always contain aggregate functions. (Strictly speaking, you are allowed to write a HAVING clause that doesn't use aggregates, but it's wasteful: The same condition could be used more efficiently at the WHERE stage.)

Observe that we can apply the city name restriction in WHERE, since it needs no aggregate. This is more efficient than adding the restriction to HAVING, because we avoid doing the grouping and aggregate calculations for all rows that fail the WHERE check.

 Inserting, Deleting, and Updating Data

This section of the tutorial consists of 5 parts:

 Inserting Data Into A Table

The INSERT command is used to populate a table with rows. If you have looked at the example data file we have used, you will have seen statements such as:

INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');

INSERT INTO weather VALUES( 'London', -5, 25, 0.3, '2004-01-05' );

INSERT INTO weather VALUES( 'Chicago', 15, 35, 0.1, '2003-05-06' );

Note carefully which input formats are used for strings and numbers. Constants that are not simple numeric values usually must be surrounded by single quotes ('), as in the example. The date type is actually quite flexible in what it accepts, but for this tutorial we will stick to the unambiguous format shown here (one of the forms in the very useful ISO 8601 standard).

The point type requires a coordinate pair as input, as shown here:

INSERT INTO cities  VALUES ('San Francisco', '(-194.0, 53.0)');

INSERT INTO cities VALUES( 'London', '(153.0,65.3)' );

or you can use the point( ... , ... ) syntax:

INSERT INTO cities VALUES( 'Canberra', point(683.0,575.3) );

The syntax used so far requires you to remember the order of the columns. An alternative syntax allows you to list the columns explicitly:

INSERT INTO weather (city, temp_lo, temp_hi, prcp, date)
         VALUES ('San Francisco', 43, 57, 0.0, '1994-11-29');

INSERT INTO weather VALUES('Hayward',37,54,NULL,'1994-11-29');

In that example we enter the special value "NULL" into the prcp columns, to indicate that the there is no data available. You can disallow NULL values in a column by specifying NOT NULL after the type name in your CREATE TABLE statement.

You can list the columns in a different order if you wish or even omit some columns. For example, if the precipitation is unknown one could enter a row like:

INSERT INTO weather (date, city, temp_hi, temp_lo) VALUES ('1994-11-29', 'Hayward',54,37);

Many developers consider explicitly listing the columns better style than relying on the order implicitly.

If you already have values within one table (or across several other tables) that you wish to insert into a separate table, this can also be achieved with the INSERT INTO command. The following syntax is used for this technique:

INSERT INTO table_name
         [ ( column_name [, ...] ) ]
         query

Similar to the syntax of INSERT INTO, you may optionally specify which columns you wish to insert into, and in what order the query returns their values. However, with this form of INSERT INTO, you provide a complete SQL SELECT statement in the place of the VALUES keyword.

 Changing Data In A Table

You can update existing rows using the UPDATE command. Suppose you discover the temperature readings are all off by 2 degrees as of November 29th. You may update the data as follows:

UPDATE weather
     SET temp_hi = temp_hi - 2,  temp_lo = temp_lo - 2
     WHERE date >= '1994-11-29';

You should see that this affects 5 rows. Now look at the new state of the data:

sXXXXXXX=> SELECT * from weather;
     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 San Francisco |      41 |      55 |    0 | 1994-11-29
 Hayward       |      35 |      52 |      | 1994-11-29
 Hayward       |      35 |      52 |      | 1994-11-29
 London        |      -7 |      23 |  0.3 | 2004-01-05
 Chicago       |      13 |      33 |  0.1 | 2003-05-06
(6 rows)
 Deleting Rows From a Table

In the example above we entered duplicate data for the city of Hayward, and with NULL in the prcp column. If we wanted to delete those rows, we can use the DELETE command, (which may remind you of the SELECT command, but without column names):

DELETE FROM weather WHERE city = 'Hayward';

All weather records belonging to Hayward are now removed:

SELECT * FROM weather;

     city      | temp_lo | temp_hi | prcp |    date
---------------+---------+---------+------+------------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27
 London        |      -7 |      23 |  0.3 | 2004-01-05
 Chicago       |      13 |      33 |  0.1 | 2003-05-06
 San Francisco |      41 |      55 |    0 | 1994-11-29
(4 rows)

Note that without the "WHERE" clause, as in:

DELETE FROM <table-name>;

... the DELETE will remove all rows from the given table, leaving it empty. The system will not request confirmation before doing this!

 Copy Data between the Database and the File System

The INSERT command is one way of inserting data into a table. An alternative way of inserting data into a table is the \copy command. This specific (i.e. non-standard) command loads (large amounts of) data from flat-text files. It also allows to write the data in a database into a file on your file system.

The data read or written is a text file with one line per table row. Columns in a row are separated by a delimiter character. The column values themselves are strings generated by the output function, or acceptable to the input function, of each attribute's data type. A specified null string is used in place of columns that are null.

Instead of having several INSERT commands, we may also represent the weather data in a text file having the following format:

San Francisco|46|50|0.25|1994-11-27
London|-5|25|0.3|2004-01-05
Chicago|15|35|0.1|2003-05-06
San Francisco|43|57|0.0|1994-11-29
Hayward|37|54|null|1994-11-29

The following command reads the data from the file into the database:

\copy weather(city, temp_lo, temp_hi, prcp, date) FROM './weather.load'
         WITH DELIMITER '|' NULL 'null'

If a list of columns is specified, \copy will only copy the data in the specified columns to or from the file. If there are any columns in the table that are not in the column list, \copy ... FROM will insert the default values for those columns.

The following command writes the data from table weather to a file on the file system:

\copy weather(city, temp_lo, temp_hi, prcp, date) TO './weather.out'
         WITH DELIMITER '|' NULL 'null'

You can read more about the COPY command in the PostgreSQL Reference Manual.

Note: COPY is restricted to the super-user while \copy can be executed by everyone.

 Sequence Objects

Sequence objects (also called sequence generators or just sequences) are special single-row tables created with CREATE SEQUENCE. A sequence object is usually used to generate unique identifiers for rows of a table. The listed sequence functions provide simple methods for obtaining successive sequence values from sequence objects.

Function                     | Return Type | Description
-----------------------------+-------------+-------------------------------------------------
nextval(text)                | bigint      | Advance sequence and return new value
currval(text)                | bigint      | Return value most recently obtained with nextval
setval(text, bigint)         | bigint      | Set sequence's current value
setval(text, bigint, boolean)| bigint      | Set sequence's current value and is_called flag

The following list of commands creates a table containing the names of cities in table weather. Each city is assigned a unique identifier in ascending order of their name

CREATE TABLE city (
         id int NOT NULL,
         name varchar(80) NOT NULL,
         PRIMARY KEY (id)
);

CREATE SEQUENCE city_id_sequence;

ALTER TABLE city
         ALTER COLUMN id
         SET DEFAULT nextval('city_id_sequence');

INSERT INTO city(name) SELECT city FROM weather GROUP BY city ORDER BY city;

As a short-cut you could also the implicit sequence generator use SERIAL when creating table city:

CREATE TABLE city (
         id serial,
         name varchar(80) NOT NULL,
         PRIMARY KEY (id)
);

INSERT INTO city(name) SELECT city FROM weather GROUP BY city ORDER BY city;
 Exercise

The Web-Site Munro Magic provides a list of mountains in Scotlan, England, and Wales. In the following we want to use this data to create our own list of Munros and Corbetts in Scotland.

  • Create the given table for the Munro Magic data. Every mountain has a name, and a height (in metres). The Web-Site distinguishes three types of mountains: M(unro), C(orbett), and EW (indicating that the mountain is in England/ Wales). For each mountain the Web-Site also lists the grid reference in the Ordnance Survey system.
  • You are given a text file in which the current Munro Magic data is stored. Columns in the text file are separated by |. The order of columns is height, name, type, and grid reference. Load the data into table munro_magic.
  • Delete all the mountains that are in England/Wales from munro_magic.
  • Create a table mountains that has the following columns:
    • id (int): a unique identifier for each mountain,
    • name: the name of the mountain,
    • type: type of the mountain, i.e., M or C,
    • gridref: grid reference in the Ordnance Survey system,
    • heightm (int): the mountain height in metres,
    • heightf (int): the mountain height in feet, and
    • rating (int): a rating for the mountain.
  • Populate table mountains with the data from munro_magic.
    • Number the mountains starting from 1, ordered by their name and height, i.e., '' is assigned id 1, and '' is assigned id xxx.
    • One foot is approximately 0.3048 metres. See the PostgreSQL Documentation on how to round and cast values.
    • The rating for mountains is assigned as follows:
      • 1 : heightm >= 1300.
      • 2 : 1300 > heightm >= 1200.
      • 3 : 1200 > heightm >= 1100.
      • 4 : 1100 > heightm >= 1000.
      • 5 : 1000 > heightm >= 900.
      • 6 : 900 > heightm.
  • How many munros are there? How many Corbetts are there?
  • What is the height (in metres and feet) of the highest munro / corbett?
End of Tutorial 1. Please send any corrections or suggestions to Heiko Müller.