Defining datatypes

The type mechanism cannot be used to produce a fresh type: only to re-name an existing type. A Standard ML programmer can introduce a new type, distinct from all the others, through the use of datatypes.

datatype colour = red | blue | green;

This introduces a new type, colour, and three constructors for that type, red, blue and green. Equality is defined for this type with the expected behaviour that, for example, red = red and red <> green. No significance is attached to the order in which the constructors were listed in the type definition and no ordering is defined for the type. Constructors differ from values because constructors may be used to form the patterns which appear in the definition of a function by pattern matching, as in the following function: (fn red => 1 | blue => 2 | green => 3). The pre-defined type bool behaves as if defined thus.

datatype bool = true | false;

In Standard ML it is illegal to rebind the constructors of built-in datatypes such as bool. The motivation for this is to prevent confusion about the interaction between the derived-forms translation and runt datatypes such as this--datatype bool = true--intended to replace the built-in booleans. Thus the constructors of built-in datatypes have an importance which places them somewhere between the constructors of programmer-defined datatypes and the reserved words of the language.

In constrast to the reverence accorded to the built-in constructors, programmer-defined constructors can be re-defined and these new definitions hide the ones which can before. So imagine that after elaborating the definition of the colour datatype we elaborate the definition of traffic_light shown below.

datatype traffic_light = red | green | amber;

Now we have available four constructors of two different types.

amber: traffic_light
blue: colour
green: traffic_light
red: traffic_light
The name blue is still in scope but the two other names of colours are not.

Another distinctive difference between datatype definitions and type abbreviations is that the type abbreviation mechanism cannot be used to describe recursive data structures; the type name is not in scope on the right-hand side of the definition. This is the real reason why we need another keyword, ``datatype'', to mark our type definitions as being [potentially] recursive just as we needed a new keyword, rec, to mark our function definitions as being recursive. One recursive datatype we might wish to define is the datatype of binary trees. If we wished to store integers in the tree we could use the following definition.

datatype inttree = empty | node of int * inttree * inttree;

Note the use of yet another keyword, ``of''. The declaration introduces the empty binary tree and a constructor function which, when given an integer n and two integer trees, t1 and t2, builds a tree with n at the root and with t1 and t2 as left and right sub-trees. This tree is simply node (n, t1, t2). The reason for the use of the term ``constructor'' now becomes clearer, larger trees are really being constructed from smaller ones by the use of these functions. The constructors of the colour datatype are a degenerate form of constructors since they are nullary constructors.

The mechanism for destructing a constructed value into its component parts is to match it against a pattern which uses the constructor and, in so doing, bind the value identifiers which occur in the pattern. This convenient facility removes the need to implement `destructors' for every new type and thereby reduces the amount of code which must be produced, enabling more effort to be expended on the more taxing parts of software development.

Since Standard ML type abbreviations may define families of types, it would seem natural that the datatypes of the language should be able to define families of datatypes. A datatype definition with a type parameter may be used to build objects of different types. The following tree definition generalises the integer trees given above.

datatype 'a tree = empty | node of 'a * 'a tree * 'a tree;

We see that the node constructor is of type ('a * ('a tree) * ('a tree)) -> ('a tree). There is a peculiar consequence of allowing datatypes to be defined in this way since we might make the type increase every time it is passed back to the type constructor thus making a so-called ``stuttering'' datatype.

datatype 'a ttree = 
  | node of 'a * ('a * 'a) ttree * ('a * 'a) ttree;

Standard ML functions cannot be used within their own definitions on values of different types so there is no way to write a recursive Standard ML function which can process these trees, say to count the number of values stored in the tree or even to calculate its depth. We could comment that allowing the parameter in a datatype definition to be inflated in this way when it is passed back has created a slight imbalance in the language because it is possible to define recursive datatypes which cannot be processed recursively. This is not anything more serious than an imbalance; it is not a serious flaw in the language.

A built-in parameterised datatype of the language is the type of lists. These are ordered collections of elements of the same type. The pre-defined Standard ML type constructor list is a parameterised datatype for representing lists. The parameter is the type of elements which will appear in the list. Thus, int list describes a list of integers, char list describes a list of characters and so on.

The list which contains no elements is called nil and if h is of type 'a and t is of type 'a list then h :: t--pronounced ``h cons t''--is also of type 'a list and represents the list with first element h and following elements the elements of t in the order that they appear in t. Thus 1 :: nil is a one-element integer list; 2 :: 1 :: nil is a two-element integer list and so on. Evidently to be correctly typed an expression with multiple uses of cons associates to the right. Thus the datatype definition for 'a list is as shown below. It declares the cons symbol to be used infix with right associativity and priority five. The keywords infix and infixr specify left and right associativity respectively.

infixr 5 ::
datatype 'a list = nil | :: of 'a * 'a list;

Lists come with derived forms. The notation [2, 1] is the derived form for 2 :: 1 :: nil; and similarly. For consistency, [] is the derived form for nil. As with the bool datatype we cannot re-define :: or nil although, rather curiously, we can tinker with their fixity status and associativity--perhaps a small oversight by the language designers.

All of the parameterised datatypes which we have declared so far have been parameterised by a single type variable but they can be parameterised by a tuple of type variables. We can define lookup tables to be lists of pairs as shown below.

type ('a, 'b) lookup = ('a * 'b) list;


This amusing puzzle is due to Bruce Duba of Rice University. At first it does not seem that it is possible at all. [Hint: you will need a tuple of type variables.]

Define the constructors, Nil and Cons, such that the following code type checks.

fun length (Nil) = 0
  | length (Cons (_, x)) = 1 + length (x);
val heterogeneous = Cons (1, Cons (true, Cons (fn x => x, Nil)))


It is possible to introduce two values at once by introducing a pair with the values as the elements, e.g. val (x, y) = (6, 7) defines x to be six and y to be seven. Why is it not possible to get around the need to use the keyword and by defining functions in pairs as shown below?

val (odd, even) = (fn 0 => false | n => even (n - 1),
                   fn 0 => true  | n => odd (n - 1)  );

Datatype definitions can also be mutually recursive. An example of an application where this arises is in defining a programming language with integer expressions where operations such as addition, subtraction, multiplication and division can be used together with parentheses. A Standard ML datatype for integer expressions is shown here.

datatype int_exp = plus of int_term * int_term
                 | minus of int_term * int_term
and     int_term = times of int_factor * int_factor
                 | divide of int_factor * int_factor
                 | modulo of int_factor * int_factor
and   int_factor = int_const of int
                 | paren of int_exp;


Define the following functions.

eval_int_exp: int_exp -> int
eval_int_term: int_term -> int
eval_int_factor: int_factor -> int