Applied Database Systems - Tutorial 6

This weeks tutorial is another chance for everyone to catch up with the previous tutorials or continue their work on the assignments. The tutorial itself will be on creating index structures. While it is designed as stand alone, it can very well be completed together with next weeks tutorial on evaluating PostgreSQL query plans.

This tutorial consists of two parts:

 Indexes

This section describes the usage of indexes in PostgreSQL. It is based on an on-line tutorial on the Web. The section is divided into the follwoing parts:

 Creating and Dropping Indexes

Given the munros relation:

CREATE TABLE munros(
         mid int NOT NULL,
         mname varchar(80) NOT NULL,
         gridref char(10) NOT NULL,
         heightm int NOT NULL,
         PRIMARY KEY(mid)
);

Suppose that your application requires a lot of queries of the form:

SELECT * FROM munros WHERE gridref = constant

With no advance preparation, the system would have to scan the entire munros relation, row by row, to find all Munros in the specific loaction. However, only a few Munros will match a particular location (you should be able to write a query that returns the maximum number of Munros in any of the grid locations). If there are only a few Munros returned by the query (in our example it will only be zero or one), then scanning the entire relation is clearly an inefficient method (especially for large tables, e.g., assume a relation containing all the mountains in the world). If the system has been instructed to maintain an index on the gridref column, then it can use a more efficient method for locating matching Munros. For instance, it might only have to walk a few levels deep into a search tree.

A similar approach is used in most books of non-fiction: terms and concepts that are frequently looked up by readers are collected in an alphabetic index at the end of the book. The interested reader can scan the index relatively quickly and flip to the appropriate page(s), rather than having to read the entire book to find the material of interest. Just as it is the task of the author to anticipate the items that the readers are likely to look up, it is the task of the database programmer to foresee which indexes will be of advantage.

The following command would be used to create the index on the gridref column, as discussed (check the full syntax of CREATE INDEX in the on-line documentation):

CREATE INDEX idx_munros_gridref ON munros(gridref);

The name of an index can be chosen freely, e.g., in this case it is idx_munros_gridref. The indexes that are defined on a relation are listed by the psql command \d relation:

sXXXXXX=> \d munros
                                 Table "public.munros"
 Column  |         Type          |                      Modifiers

---------+-----------------------+-----------------------------------------------------
 mid     | integer               | not null default nextval('seq_munros_id'::reg class)
 mname   | character varying(80) | not null
 gridref | character(10)         | not null
 heightm | integer               | not null
Indexes:
    "munros_pkey" PRIMARY KEY, btree (mid)
    "idx_munros_gridref" btree (gridref)

Many of you have noticed already that the system automatically creates an index for the primary key of a relation. Since the primary key is unique it does make sense to use such an index when searching for rows with a particular primary key value.

To remove an index, use the DROP INDEX command. Indexes can be added to and removed from tables at any time.

DROP INDEX idx_munros_gridref;

Question: What happens when you try to drop the index on the primary key?

Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will use the index in queries when it thinks this would be more efficient than a sequential table scan.

 Index Types

PostgreSQL provides several index types, most notable B-tree, and Hash. Each index type uses a different algorithm that is best suited to different types of queries. By default, the CREATE INDEX command will create a B-tree index, which fits the most common situations.

B-trees can handle equality and range queries on data that can be sorted into some ordering. Hash indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the = operator. The following command is used to create a hash index:

CREATE INDEX name ON table USING hash (column);
 Multicolumn Indexes

An index can be defined on more than one column of a table. Assuming that the name and the height uniquely identifies a Munro you may have frequent queries like this:

SELECT * FROM munros WHERE mname = constant1 AND heightm = constant2;

In order to speed up these queries you can define the following index on mname and heightm:

CREATE INDEX idx_munros_nh ON munros(mname, heightm);

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns. The exact rule is that equality constraints on leading columns, plus any inequality constraints on the first column that does not have an equality constraint, will be used to limit the portion of the index that is scanned. Constraints on columns to the right of these columns are checked in the index, so they save visits to the table proper, but they do not reduce the portion of the index that has to be scanned. For example, given an index on (a, b, c) and a query condition WHERE a = 5 AND b >= 42 AND c < 77, the index would have to be scanned from the first entry with a = 5 and b = 42 up through the last entry with a = 5. Index entries with c >= 77 would be skipped, but they'd still have to be scanned through. This index could in principle be used for queries that have constraints on b and/or c with no constraint on a but the entire index would have to be scanned, so in most cases the planner would prefer a sequential table scan over using the index.

 Indexes on Expressions

An index column need not be just a column of the underlying table, but can be a function or scalar expression computed from one or more columns of the table. Assume that you want to find Munros based on a combination of their name and height, e.g., Ben Vorlich, 985 m. You can do so using the following query:

SELECT * FROM munros
WHERE (mname || ', ' || heightm || ' m') = 'Ben Vorlich, 985 m';

To support such a query you can define an index on the expression mname || ', ' || heightm || ' m':

CREATE INDEX idx_munros_mkey ON munros ((mname || ', ' || heightm || ' m'));

The syntax of the CREATE INDEX command normally requires writing parentheses around index expressions, as shown in the example. The parentheses may be omitted when the expression is just a function call.

 Clustered Indexes

The CLUSTER statement instructs PostgreSQL to cluster the table specified by tablename based on the index specified by indexname. The index must already have been defined on tablename.

CLUSTER indexname ON tablename

When a table is clustered, it is physically reordered based on the index information. In cases where you are accessing single rows randomly within a table, the actual order of the data in the table is unimportant. However, if you tend to access some data more than others, and there is an index that groups them together, you will benefit from using CLUSTER. If you are requesting a range of indexed values from a table, or a single indexed value that has multiple rows that match, CLUSTER may help save disk accesses and speed up the query. However, since you can have only one physical order of a table, you can have only one clustered index per table and should carefully pick which index you will use to cluster on or if you even want to cluster at all.

Because the planner records statistics about the ordering of tables, it is advisable to run ANALYZE on the newly clustered table. Otherwise, the planner may make poor choices of query plans.

 Analyze

The ANALYZE command collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

ANALYZE tablename

It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the planner to choose the most appropriate query plan, and thereby improve the speed of query processing.

 Exercise
 Question 1

You are given the following relation that stores information about the weather conditions in the U.K.:

CREATE TABLE ukweather(
         date char(8) NOT NULL,
         time char(4) NOT NULL,
         region varchar(30) NOT NULL,
         location varchar(30) NOT NULL,
         conditions varchar(20),
         temperature real,
         windspeed int,
         winddirection char(3)
);

You are further given the following load files (in gzip-format) that contain records on weather conditions in the U.K.:

The order of columns in the load files are the same as in the CREATE TABLE statement. Columns are separated by | and null values are represented by string null. Your first task is to create a copy of the relation in your database. Choose one of the files.

Note: The performance gain will be higher on the large data set. Creating indexes, however, will also require more time on the large data set.

You are further given the following five queries that represent examples of frequent queries your database will have to answer. For each of the queries propose one or more indexes to speed up query execution. Considering the whole set of queries, which index/indexes should be created?

Note: After creating an index you can check whether the query optimizer uses the index using the EXPLAIN command, i.e., EXPLAIN query statement. We will talk about the details of EXPLAIN in next weeks tutorial.

Note: You can cancel execution of a SELECT statement (or CREATE INDEX statement) using Ctrl - C.

 Query 1

This query returns those locations that had a temperature above 30 degrees after 3 p.m. under sunny conditions

SELECT date, region, location
FROM ukweather
WHERE temperature > 30 AND time > '1500' AND conditions = 'Sunny';
 Query 2

The query returns the names of the regions that have a location named 'Glen Ogle'.

SELECT DISTINCT region FROM ukweather WHERE location = 'Glen Ogle';
 Query 3

The query returns the locations that had the maximum temperature in August 2009 with wind speeds was above 10 mph.

SELECT date, time, region, location, temperature
FROM ukweather
WHERE  date >= '20090801' AND date <= '20090831' AND
temperature = (SELECT MAX(temperature) FROM ukweather
WHERE date >= '20090801' AND date <= '20090831' AND windspeed > 10);
 Query 4

Location 'Glen Ogle' has the lowest average day-time temperature in May 2009. This query returns the names of locations that had a lower temperature at some point in time than the temperature measured in 'Glen Ogle'.

SELECT DISTINCT u1.location
FROM ukweather u1, ukweather u2
WHERE u1.date = u2.date AND u1.time = u2.time
AND u1.date >= '20090501' AND u1.date <= '20090531'
AND u1.time >= '0800' AND u1.time <= '2000'
AND u2.location = 'Glen Ogle'
AND u2.temperature > u1.temperature
 Query 5

This query lists the locations in the Highlands that had during day-time an increase in temperature of at least 5 dgrees and an increase in wind speed of at least 10 mph.

SELECT u1.date, u1.location
FROM ukweather u1, ukweather u2
WHERE u1.date = u2.date and u1.time > u2.time
AND u2.time > '0800' AND u2.time < '2000'
AND u1.date >= '20090801' AND u1.date <= '20090831'
AND u1.region = 'Highland and Eilean Siar' AND u2.region = 'Highland and Eilean Siar'
AND u1.location = u2.location
AND u1.conditions = 'Sunny' AND u2.conditions = 'Sunny'
AND u1.temperature > u2.temperature + 5 AND u1.windspeed > u2.windspeed + 10
GROUP BY u1.date, u1.region, u1.location;
End of Tutorial 6. Please send any corrections or suggestions to Heiko Müller.