Calculating with distributions

Conrad Hughes, School of Informatics, University of Edinburgh.

I present Agrajag, a software toolkit which performs numerical operations on probability distributions, e.g. given two variables with distributions X and Y respectively, what is the distribution of their sum? Their maximum? The motivation for the toolkit originated in work investigating Quality of Service (QoS) in computational workflows — in particular, how the performance of the whole may be inferred from that of its parts (Dependable Infrastructure for Grid Services, DIGS). Existing work on this topic uses one-dimensional variables, such as expected value, to model workflow characteristics. These models are of little use to someone making decisions about workflow structure and componentry: it's impossible to decide whether a workflow will satisfy a Service Level Agreement (SLA) requiring completion within a time limit in 90% of cases if all that is known is its average runtime.

I also argue that a general-purpose library for performing this kind of calculation has many potential applications, ranging from service design to decision-making in autonomous robots, and that my work makes a good start on such a toolkit. It models distributions approximately, but to an arbitrary degree of accuracy, although in the current implementation high accuracy comes at a substantial cost in performance.

Source code (Perl and C) may be found here.

Introduction

There are many planning- and workflow-related contexts in which more information than the simple expected performance is useful or even necessary; decisions about the structures, methods and components to use in a process cannot sensibly be made on the basis of such a measurement alone. A mean time to completion of seven hours with a standard deviation of twenty minutes makes completion within a normal working day highly likely; the same mean with a standard deviation of two hours is much less certain.
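
To make that concrete: if completion times were normally distributed (an assumption made purely for this illustration; real distributions, as we will see, are rarely so tidy), an eight-hour day gives a completion probability of Φ(3) ≈ 99.9% in the first case but only Φ(0.5) ≈ 69% in the second. A few lines of Perl confirm this by simulation (the sub name gaussian is mine, invented for the sketch):

    # Monte Carlo check of the working-day example above; completion times
    # are *assumed* normally distributed, purely for illustration.
    use strict;
    use warnings;
    use constant PI => 4 * atan2(1, 1);

    # One standard normal sample via the Box-Muller transform.
    sub gaussian { sqrt(-2 * log(1 - rand())) * cos(2 * PI * rand()) }

    for my $sd (1 / 3, 2) {    # standard deviations: 20 minutes, 2 hours
        my ($n, $hits) = (100_000, 0);
        for (1 .. $n) {
            my $t = 7 + $sd * gaussian();    # mean 7 hours
            $hits++ if $t <= 8;              # finished within an 8-hour day?
        }
        printf "sd = %.2f h: P(done in 8 h) ~ %.3f\n", $sd, $hits / $n;
    }
    # Prints roughly 0.999 and 0.691 respectively.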

To support this, I take a more thorough approach to this kind of computation than alternatives such as minimum-maximum calculators, and work with arbitrarily precise approximations of variables' distributions. The goal is to get an accurate picture of the range and likelihood of the different values resulting from the computation, and to support more realistic input data than would be possible if (say) all variables were assumed to be exponentially distributed (a common assumption when modelling with Markov chains, for example).

Here, for example, are the distributions of two real world variables: the first, the response time of a well-known Internet search engine, as measured over a period of two months in late 2005; the second, the ping time between two fixed points on the Internet, measured over the same period:

[Figure: a log-normal-like distribution of search response times (left), and a flat, spiky distribution of ping times (right).]

While the first is close to log-normal with an extra bump (possibly internal retry failover), the second exhibits a hint of bimodality, with a long, evenly distributed bed surrounding the two peaks. Neither distribution would be particularly well served by an exponential assumption. (In both graphs, the small red peak at the right edge is an artifact of the cutoff applied to all longer times.)

The operations currently applicable to the variables I model include:

- binary arithmetic operations, including linear combinations of variables;
- highest or lowest N out of M variables (maximum being the highest 1 of 2 variables, for example);
- partitioning (splitting into ≤ x and > x, for branching in workflows);
- merging of variables (for parallel behaviour in workflows);
- voting.

The variable distributions themselves may be open at either end, and may include delta-function spikes. Discrete distributions taking a range of symbolic values (such as “succeeded” or “failed”) with varying probabilities are also supported, but these have only one operator: a transition matrix (e.g. “succeeded” × “succeeded” ⇒ “succeeded”, “succeeded” × “failed” ⇒ “failed”, etc.).

Example operations are sketched below.
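
The following self-contained Perl sketch computes two of these operations, the sum and the maximum of a pair of independent variables. It uses a plain discretized grid rather than Agrajag's piecewise-uniform representation, and all names in it (dist_sum, dist_max, mean) are mine, not the toolkit's:

    use strict;
    use warnings;

    my $h = 0.1;    # grid step: each distribution is a mass vector over 0, h, 2h, ...
    my @x = (0.1) x 10;                   # X: uniform-ish on [0, 1)
    my @y = map { $_ / 190 } 0 .. 19;     # Y: triangular on [0, 2); masses sum to 1

    # Distribution of X + Y: the discrete convolution of the two mass vectors.
    sub dist_sum {
        my ($p, $q) = @_;
        my @r = (0) x (@$p + @$q - 1);
        for my $i (0 .. $#$p) {
            $r[$i + $_] += $p->[$i] * $q->[$_] for 0 .. $#$q;
        }
        return \@r;
    }

    # Distribution of max(X, Y): P(max <= z) = P(X <= z) * P(Y <= z), so
    # multiply the two CDFs pointwise and difference the result.
    sub dist_max {
        my ($p, $q) = @_;
        my $len = @$p > @$q ? scalar @$p : scalar @$q;
        my ($cp, $cq, $prev, @r) = (0, 0, 0);
        for my $k (0 .. $len - 1) {
            $cp += $p->[$k] // 0;
            $cq += $q->[$k] // 0;
            push @r, $cp * $cq - $prev;
            $prev = $cp * $cq;
        }
        return \@r;
    }

    sub mean { my $p = shift; my $m = 0; $m += $_ * $h * $p->[$_] for 0 .. $#$p; $m }

    printf "mean of X + Y    = %.3f\n", mean(dist_sum(\@x, \@y));    # ~ 0.45 + 1.30 = 1.75
    printf "mean of max(X,Y) = %.3f\n", mean(dist_max(\@x, \@y));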

Literature survey

TBD

Background

TBD

Theory

On paper, the mathematical theory behind Agrajag is fairly straightforward: mainly it performs numerical integration over its operands. In practice things are slightly messier, since almost none of the functions involved integrate cleanly; this is why a numerical-methods approach has been taken. Each distribution is approximated by an arbitrary number of uniform distributions which together cover its entire range, a “piecewise uniform” approximation. Accuracy is improved by increasing the number of pieces in the approximation. Calculations are then performed over these collections of uniform distributions, a much simpler task.
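
For instance, the density of a sum of independent variables is the convolution (f ∗ g)(z) = ∫ f(x) g(z − x) dx, which is rarely tractable in closed form for realistic inputs. As a toy illustration of the refinement idea (my own sketch, not Agrajag's code), the following approximates a unit-rate exponential density on [0, 10] by N uniform pieces, giving each piece the exact probability mass of its sub-interval, and shows the estimate of the distribution's mean converging as N grows:

    use strict;
    use warnings;

    # Piecewise-uniform approximation of an Exp(1) density on [0, 10]: each of
    # $n pieces carries the exact mass of its sub-interval, exp(-a) - exp(-b),
    # and the mean of a uniform piece is simply its midpoint.
    for my $n (4, 16, 64, 256) {
        my $w    = 10 / $n;    # width of each piece
        my $mean = 0;
        for my $i (0 .. $n - 1) {
            my ($a, $b) = ($i * $w, ($i + 1) * $w);
            $mean += (exp(-$a) - exp(-$b)) * ($a + $b) / 2;
        }
        printf "%4d pieces: mean ~ %.6f\n", $n, $mean;
    }
    # Converges towards 1 - 11 * exp(-10) ~ 0.999501, the exact first moment
    # of the exponential density over [0, 10].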

Specification

The intention is that Agrajag take the form of a language extension, so that the programmer's experience is much as if they were operating on ordinary scalar variables. Under the surface, however, the variables are of course distributions, and the end result of a calculation is an approximation, usually a good one, of the distribution of the calculation's result, given the distributions of its inputs. The programmer can then conveniently graph the outputs or make queries against them (“How much of the time will the result exceed x?”).
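
Here is a sketch of that intended feel, using Perl's overload pragma. The representation is a throwaway bag of Monte Carlo samples rather than Agrajag's piecewise-uniform one, and every name here (RandVar, uniform, prob_above) is invented for the example:

    use strict;
    use warnings;

    # A distribution-valued type whose '+' is overloaded, so that client code
    # reads like ordinary scalar arithmetic.
    package RandVar;
    use overload '+' => \&add;

    sub uniform {    # a variable uniformly distributed on [$lo, $hi)
        my ($class, $lo, $hi, $n) = @_;
        $n //= 10_000;
        return bless { s => [ map { $lo + rand() * ($hi - $lo) } 1 .. $n ] }, $class;
    }

    sub add {        # sum of two independent variables, sample by sample
        my ($x, $y) = @_;
        return bless { s => [ map { $x->{s}[$_] + $y->{s}[$_] } 0 .. $#{ $x->{s} } ] }, ref $x;
    }

    sub prob_above {    # the query "how often does the result exceed $t?"
        my ($x, $t) = @_;
        return (grep { $_ > $t } @{ $x->{s} }) / @{ $x->{s} };
    }

    package main;

    my $x = RandVar->uniform(0, 1);
    my $y = RandVar->uniform(0, 2);
    my $z = $x + $y;                                  # reads like scalar code
    printf "P(z > 2) ~ %.3f\n", $z->prob_above(2);    # exact answer: 0.25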

It is also intended that the same type system implement several alternative approaches, such as min-max, mean value and, hopefully, even Monte Carlo simulation, in order to facilitate comparison with these other approaches and evaluation of the trade-offs among them.

Implementation

The initial implementation was entirely in Perl; for performance reasons, some of the core numerical algorithms have since been reimplemented in C. These integrate seamlessly with the Perl wrapper classes via the Inline::C module.
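
The integration pattern looks roughly like the fragment below (a minimal sketch, not code from Agrajag itself): the C source is handed to Inline::C as a here-document, compiled on first run, and any C functions with recognisable signatures are bound into the calling Perl package as ordinary subs.

    use strict;
    use warnings;

    # Minimal Inline::C sketch (illustrative only; not an Agrajag routine):
    # the C source below is compiled on first run, and overlap() becomes an
    # ordinary Perl sub.
    use Inline C => <<'END_C';
    /* Length of the overlap of two intervals: the kind of small, hot
       numerical primitive worth pushing down into C. */
    double overlap(double lo1, double hi1, double lo2, double hi2) {
        double lo = lo1 > lo2 ? lo1 : lo2;
        double hi = hi1 < hi2 ? hi1 : hi2;
        return hi > lo ? hi - lo : 0.0;
    }
    END_C

    print overlap(0.0, 2.0, 1.0, 3.0), "\n";    # prints 1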

Evaluation

TBD

Related work

TBD

Further work

There are many possibilities for future development.

Conclusion

The appearance of these same methods in various other projects, and Agrajag's own immediate usefulness in two pieces of research here at Edinburgh, support the hypothesis that this functionality would be valuable as a general-purpose toolkit for use in other work.

The author should find the time to complete some of the further work and publish Agrajag for general use: while it is already publicly available, such niceties as a README and a programmers' manual do not yet exist, so users must read the source for documentation.