# Calculating with distributions

Conrad Hughes, School of Informatics, University of Edinburgh.

I present Agrajag, a software toolkit which performs numerical operations on probability distributions, e.g. given two variables with distributions X and Y respectively, what is the distribution of their sum? Their maximum? The motivation for the toolkit originated in work investigating Quality of Service (QoS) in computational workflows — in particular, how the performance of the whole may be inferred from that of its parts (Dependable Infrastructure for Grid Services, DIGS). Existing work on this topic uses one-dimensional variables, such as expected value, to model workflow characteristics. These models are of little use to someone making decisions about workflow structure and componentry: it's impossible to decide whether a workflow will satisfy a Service Level Agreement (SLA) requiring completion within a time limit in 90% of cases if all that is known is its average runtime.

I also make the argument that a general purpose library for performing this kind of calculation has many potential applications ranging from service design to decision-making in autonomous robots, and that my work makes a good start on such a toolkit. It models distributions approximately, but to an arbitrary degree of accuracy — although in the current implementation, high accuracy comes at a substantial cost in performance.

Source code (Perl and C) may be found here.

## Introduction

There are many planning- and workflow-related contexts in which more information than simple expected performance is useful or even necessary; decision-making about structures, methods and components to use in a process cannot sensibly be made on the basis of such a measurement: a mean time to completion of seven hours with a standard deviation of twenty minutes makes for high likelihood of completion during a normal working day; the same mean with a standard deviation of two hours is much less certain.

To support this, I take a more thorough approach to this kind of computation than alternatives such as minimum-maximum calculators, and work with arbitrarily precise approximations of variables' distributions. The goal is to get a very accurate feel for the range and likelihood of different values resulting from the computation, and to support more realistic input data than would be possible if (say) all variables were assumed to be exponentially distributed (a common assumption when modelling using Markov chains for example).

Here, for example, are the distributions of two real world variables: the first, the response time of a well-known Internet search engine, as measured over a period of two months in late 2005; the second, the ping time between two fixed points on the Internet, measured over the same period:

While the first is close to log-normal with an extra bump (possibly internal retry failover), the second graph exhibits a hint of bimodality, with a long evenly distributed bed surrounding the two peaks. Neither distribution would be particularly well served by an exponential assumption. (In both graphs, the little red peak at the right edge is due to a cutoff of all longer times)

The operations applicable to the variables I model currently include binary arithmetic operations including linear combinations of variables, highest or lowest N out of M variables (with maximum being highest 1 of 2 variables for example), partitioning (split into <= x and > x, for branching in workflows), merging of variables (for parallel behaviour in workflows) and voting. The variable distributions themselves may be open at either end, and may include delta function spikes. Discrete distributions taking a range of symbolic values (such as “succeeded” or “failed”) with varying probabilities are also supported, but these have only one operator: a transition matrix (e.g. “succeeded” × “succeeded” ⇒ “succeeded”, “succeeded” × “failed” ⇒ “failed”, etc.).

Example operations:

• Parallel(0.75 * , 0.25 * ) = .

The red curves show Probability Density Functions (PDF), while the green curves show Cumulative Distribution Functions (CDF). All graphs have the same X range (0…6).

In a workflow context, this represents a situation where two paths might be taken, one 75% of the time, the other 25%.

• Max(, ) = .

The distribution of the maximum of two variables — e.g. the network bandwidth required by two tasks running in series.

• Min(, ) = .

The distribution of the minimum of two variables (e.g. time to complete of two tasks run in parallel for redundancy, the first to complete being used).

• Sum(, ) = .

The distribution of the sum of two variables (e.g. cost of two tasks run in parallel for redundancy) — note that the distribution in the middle takes only two values, one of them 90% of the time.

TBD

TBD

## Theory

On paper, the mathematical theory behind Agrajag is pretty straightforward: mainly it's performing numerical integration over its operands. In practice things are fractionally less straightforward since almost none of the functions involved integrate cleanly. This is why a numerical methods approach has been taken. Each distribution is approximated using an arbitrary number of uniform distributions which together cover its entire range — a “piecewise uniform” approximation. Accuracy is improved by increasing the number of pieces in the approximation. Calculations are then performed over these collections of uniform distributions, a much simpler task.

## Specification

The intention is that Agrajag take the form of a language extension, so that the programmer experience is much as if they were operating on ordinary scalar variables. Under the surface however the variables are of course distributions, and the end result of a calculation will give a usually good approximation of the distribution of the calculation's result, given the distributions of its inputs. The programmer can then conveniently graph the outputs or make queries against them (“How much of the time will the result exceed x?”).

It is also intended that the same type system will implement several alternative approaches, such as min-max, mean value, and hopefully even Monte Carlo simulation, in order to facilitate comparisons with these other approaches and evaluate the trade-offs among them.

## Implementation

Initial implementation was entirely in Perl; for performance reasons some of the core numerical algorithms have now been implemented in C. These seamlessly integrate into the Perl wrapper classes using the `Inline::C` module.

TBD

TBD

## Further work

There are many possibilities for future development:

• Pushing all of the core implementation down into a C library would allow construction of shim layers to expose Agrajag's functionality to Java, GNU R, Perl, Python, C++, Ruby etc. This would facilitate its acceptance as a useful general purpose toolkit.

• Evaluation has been somewhat thin on the ground: once the code started giving correct test results, a blanket assumption that it was more useful than the less informative alternatives resulted in little evaluation being performed against them. Completion of implementation of (say) mean value and Monte Carlo alternatives using the same API would allow a better appraisal of Agrajag's usefulness.

• The choice of piecewise uniform distribution as the underlying representation for calculations was made on the basis of ease of implementation. Application of better numerical integration algorithms or an alternative representation would probably improve performance at a given level of accuracy, or improve accuracy at a given level of performance.

## Conclusion

The appearance of these same methods in various other projects and Agrajag's own immediate usefulness in two pieces of research here at Edinburgh support the hypothesis that this functionality is useful as a general purpose toolkit for use in other work.

The author should find the time to complete some of the further work and publish Agrajag for general use (while it's already publicly available, such niceties as a “README” and programmers' manual do not exist: users must read the source for documentation).