How to Write an Informatics Paper
I am grateful to
Vicky Rotarova for providing a translation of this page into French and to
Adrian Pantilimonu and All Science Magazine for providing a translation into Romanian.
This short guide is intended to help Informatics researchers write
scientific papers, whether these are for conferences, journals, dissertations
or some other purpose. Although it is not itself a scientific paper, it
is based on a hypothesis:
The key to successful paper writing is an explicit
statement of both a scientific hypothesis and the evidence to
support (or refute) it.
We will show how this key idea underpins the overall structure of the paper
and determines the story it tells.
The Importance of Hypotheses
Informatics is an engineering science. Like other branches of
both engineering and science it contributes to the advancement of
knowledge by formulating hypotheses and evaluating them. It is
not enough merely to describe some new technique or system; some
claim about it must be first stated and then evaluated. This
claim has the status of a scientific hypothesis; the evaluation
provides the evidence that will support or refute it.
Of course, the whole story may be spread across several papers
and several authors. For instance, the initial paper about a new
idea may not contain all the evidence needed to support a
hypothesis; further evidence may be provided in later
papers.
In experimental research, hypotheses typically take one
of the following two forms:
1. Technique/system X automates task Y for the first
time;
2. Technique/system X automates task Y better, along some
dimension, than each of its rivals;
where the dimensions are typically:
- Behaviour: X has a higher success rate than its rivals or produces better quality
outputs, e.g. shorter, easier to understand, more similar to human outputs,
etc.
- Coverage: X is applicable to a wider range of examples than its rivals.
- Efficiency: X is faster or uses less space than its rivals.
- Dependability: X is more reliable, safe or secure
than each of its rivals.
- Maintainability: Developers find X easier to adapt and
extend than its rivals.
- Useability: Users find X easier to use than its rivals.
A paper may contain one or more hypotheses. However, it is a
mistake to try to cover too many hypotheses in a single paper: it
leads to confusion.
Explicit hypotheses are rarely stated in Informatics
papers. This is a Bad Thing. It makes it hard for the reader to
understand and assess the contribution of the paper. If the
reader misidentifies the hypothesis then s/he is bound to find
the evidence for it unconvincing. If the reader is also a referee
or examiner s/he may reject the paper. Worse still, the lack of an
explicit hypothesis may indicate that the author is unclear about the
contribution of the paper. In the absence of a clear hypothesis
it is impossible to know what evidence would support or refute
it.
The symptoms of this malaise are commonplace: papers whose
contribution is vague, ambiguous or absent; papers with a
confused mixture of multiple, implicit hypotheses; evaluations
that are inconclusive or non-existent; referee reports that
appear harsh or inconsistent. One of the main purposes of this
guide is to help to reverse this unhappy situation.
Theoretical papers are welcome exceptions: they usually
contain both hypotheses and convincing evidence to support them.
The hypotheses are the statements of theorems and the supporting
evidence is their proofs. This may account for the relatively
healthy state of theoretical research in Informatics compared
with experimental research.
The Structure of Informatics Papers
There is a default structure for writing an experimental
Informatics paper, whether this be for a conference or journal or
as a dissertation. When reporting experimental work, you should
use this structure unless you have a good reason not to. Theoretical
papers have a different default structure, which I hope to
include at a later date.
The main parts of an experimental Informatics paper should be
as follows. Each part could be a section of a paper or chapter of
a dissertation. To reduce clutter I will refer to sections and
papers below. Some parts may need to be spread over several
sections/chapters, e.g. if there is a lot of material to be
covered or it naturally falls into disjoint subparts. Some parts,
especially adjacent parts, may be merged into a single
section/chapter, e.g., where there is not much to be said or two
or more topics are interlinked. Parts marked with * are optional,
but you should think hard before deciding to omit them; if there
is something that should be said you should say it.
- Title:
- Ideally, the title should
summarise the hypothesis of the paper. The reader should be able
to work out what the paper is about from the title alone. Cute,
cryptic titles are fun, but unhelpful.
- Abstract:
- Similarly, the abstract should
state the hypothesis and summarise the evidence that supports or
refutes it. Remember, most readers will not read beyond the
abstract, so be sure to include the key points you want casual
readers to take with them. It should be more than a summary: it
should also mention the key contributions of the paper.
- Introduction:
- The main purpose of the introduction is
to motivate the contribution of the paper and to place it in
context. It should also restate the hypothesis and summarise the
evidence. It traditionally ends with a short summary of the rest
of the paper.
- Literature Survey*:
- The literature survey
is a broad and shallow account of the field, which helps to place
the contribution of the paper in context. It is part of the
motivation of the paper, because it helps to identify
the gap that this work is trying to fill, and explain why it is
important to fill this gap. Rather than a list of disconnected
accounts of other people's work, you should try to organise it
into a story: What are the rival approaches? What are the
drawbacks of each? How has the battle between different
approaches progressed? What are the major outstanding problems?
(This is where you come in.)
- Background*:
- The background allows
previous work to be stated in more technical detail. Don't try to
write an introductory textbook on the field, but only include
what the reader needs to know for a proper understanding of the
contribution of your paper.
- Theory*:
- This part develops the underlying theory of the technique(s)
or system. Where appropriate, a mathematical style of
definitions, lemmas, theorems, corollaries and remarks may be
used.
- Specification*:
- The techniques that
underlie the implementation are formally specified. The
requirements of the implementation are given.
- Implementation*:
- Only the final state of the
implementation should be described, not a blow-by-blow history of
its development. However, each of the major design decisions
should be identified and reasons given for the choices made. You
should abstract away from the code and outline the overall
structure of the system and the key algorithms in abstract form,
e.g. using diagrams or formalised English. You can point to bits
of actual code in the appendices, if necessary. A worked example
is often helpful.
- Evaluation:
- Evaluation is not testing. Testing is the process of
debugging that ensures that the implementation meets the
specification. This debugging process is not usually considered
worthy of discussion unless either the bug or the debugging
process is especially remarkable. The evaluation, on the other
hand, is the gathering of evidence to support or refute the
hypothesis. If the hypothesis is of type 1 then system X must be
applied to a sufficient range and diversity of examples of task Y
to convince the reader that it constitutes a general solution to
this task. Descriptions of its behaviour, coverage and efficiency
should be presented and, where appropriate, a description of dependability, maintainability or useability. If the hypothesis is of type 2
then, in addition to this evidence, there must also be a
comparison with rival systems along the chosen dimensions. There
should also be a brief comparison along the unchosen
dimensions, even if this is a negative result for system X;
honesty in science is essential and negative results are also
important.
A thorough evaluation usually requires large-scale
experimentation, with system X being applied to many examples of
task Y. To aid the reader's understanding, the results of these
experiments are best presented graphically. To verify the
hypothesis, the results must usually be statistically
processed. Cohen's book "Empirical methods for artificial
intelligence", MIT Press, 1995, is a good guide to statistical
methods for Informatics researchers, and Toby Walsh has also collected
some useful resources on empirical methods in Informatics. It can aid the
reader's understanding of the processing of system X to give one
or two worked examples before the results are presented in
detail. Further details of the results can be presented in an
appendix.
I make no apology for the length of this discussion of
evaluation. Evaluation is the most important part of the
paper as it provides the evidence for the hypothesis. It is also
one of the most neglected parts, being absent altogether from many
papers. If this guide succeeds in raising the profile of
evaluation then half my battle will be won.
- Related Work*:
- A narrow but deep
comparison is made between system X and its main rivals at their
critical points of difference. It is logically part of the
evaluation, since it establishes the originality of your
contribution. You should explain the differences
in behaviour, coverage, efficiency, etc. that were
identified in the evaluation. Note that related work is different in
purpose, position, breadth and depth from the literature survey;
both are needed.
- Further Work*:
- Some unexplored
avenues of the research and new directions that have been
suggested by the research are identified and briefly developed.
In particular, any research that would improve the evidence for/against
the hypothesis or increase its strength or scope should be
highlighted.
- Conclusion:
- The conclusion should both summarise the
research and discuss its significance. This includes a brief
restatement of the hypothesis and the evidence for and against
it. It should then recapitulate the original motivation and
reassess the state of the field in the light of this new
contribution.
- Appendices*:
- The appendices should
provide any information which would detract from the flow of the
main body of the paper, but whose inclusion could assist the
reader in understanding or assessing the research. Appendices
might include any of the following: a glossary of technical
terms, some technical background that only some readers may
require, examples of program code, a trace of the program on one
or more examples, more details of the examples evaluated and the
experimental results, the full versions of proofs, an
index. Alternatively, this information might be posted on the
web, with a link provided to its location. If you use this
alternative route, however, then you should take steps to ensure
that the web site will continue to exist well into the future,
especially if it contains detailed experimental evidence. As a
scientist you have an obligation to make available to posterity
the full evidence on which any scientific claims are based.
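As a rough illustration of the statistical processing described under Evaluation, the sketch below compares system X with one rival on the same set of examples using a two-sided exact sign test. The choice of test is an assumption made purely for illustration; the guide does not prescribe a particular method, and Cohen's book discusses when other tests are more appropriate. The function name and the run times are hypothetical and were invented for this example.

```python
from math import comb


def sign_test_p_value(x_scores, rival_scores):
    """Two-sided exact sign test for paired scores on the same examples.

    Returns the probability, under the null hypothesis that neither system
    is more likely to score higher on an example, of a split at least as
    lopsided as the one observed. Tied examples are discarded.
    """
    if len(x_scores) != len(rival_scores):
        raise ValueError("scores must be paired, one pair per example")
    higher = sum(x > r for x, r in zip(x_scores, rival_scores))
    lower = sum(x < r for x, r in zip(x_scores, rival_scores))
    n = higher + lower
    if n == 0:
        return 1.0  # every example tied: no evidence either way
    k = min(higher, lower)
    # P(at most k heads in n fair coin flips), doubled for the two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical run times in seconds on ten shared examples (made-up numbers).
x_times = [1.2, 0.9, 1.1, 1.2, 0.8, 1.0, 1.3, 0.7, 1.1, 0.9]
rival_times = [1.5, 1.1, 1.0, 1.6, 1.2, 1.4, 1.5, 1.0, 1.3, 1.2]

p = sign_test_p_value(x_times, rival_times)
print(f"two-sided sign test p-value: {p:.4f}")
```

A small p-value suggests that the observed difference on this dimension is unlikely to be due to chance alone. It says nothing about the size of the difference, which is one reason the results should also be presented graphically.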
Conclusion
This guide has both a descriptive and a normative role:
descriptive of best practice in the presentation of Informatics
research and normative in highlighting the importance of explicit
hypotheses in such presentations. In particular, I have tried to
show how the conventional structure of papers describing
experimental research should be used to emphasise these
hypotheses and their supporting (or refuting) evidence.
I believe that the neglect of explicit hypotheses has caused
methodological problems in Informatics. At worst, it leads to
work that fails to advance the state of the field. Systems are
built with no clear idea of the contribution they will make to
the advancement of knowledge. Such systems are described without
convincing evaluation, since it is unclear what the purpose of
evaluation would be. Readers of the research may not explicitly
notice the absence of hypotheses, but will feel some vague unease
about the contribution of the research, sometimes summarised in
the comment "so what?". It is tragic to witness such talent,
energy and opportunity being wasted in this way. I hope this
guide can make some small contribution to preventing such waste
in the future.
This guide is under development. I would be grateful for any
comments on ways to correct, improve or extend it. More stuff
on methodology can be found in my
Informatics Research Methodologies course.