Starting with Bayes' theorem, we can extend the joint probability equation to three or more events using the chain rule:

P(A, B, C) = P(A | B, C) P(B | C) P(C)

For n events X_1, ..., X_n, with probabilities computed assuming a particular interpretation of the data Y (for example, a model):

P(X_1, ..., X_n | Y) = P(X_1 | X_2, ..., X_n, Y) P(X_2 | X_3, ..., X_n, Y) ... P(X_n | Y)
Maximum likelihood statistics involves the identification of the event Y which maximizes such a probability. By Bayes' theorem,

P(Y | X_1, ..., X_n) is proportional to P(X_1, ..., X_n | Y) P(Y)

In the absence of any other information the prior probability P(Y) is assumed to be constant for all Y, so maximizing the posterior is equivalent to maximizing the likelihood P(X_1, ..., X_n | Y).
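As a minimal sketch of this idea, the following selects the candidate model maximizing the likelihood of some observations; the Bernoulli model, the hypothetical binary data, and the grid of candidate parameters are all assumptions for illustration, not part of the original text.

```python
import math

def log_likelihood(data, p):
    """Log-likelihood of binary observations under a Bernoulli model
    with success probability p (logs avoid floating-point underflow)."""
    return sum(math.log(p) if x else math.log(1 - p) for x in data)

data = [1, 1, 0, 1, 0, 1, 1, 1]            # hypothetical observations
candidates = [i / 100 for i in range(1, 100)]  # flat prior over models

# With a constant prior, the maximum a posteriori model is simply the
# maximum likelihood model.
best = max(candidates, key=lambda p: log_likelihood(data, p))
```

Here the grid search stands in for whatever optimizer a real application would use; with six successes in eight trials the maximizer is the sample proportion 0.75.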
For large numbers of variables this is an impractical method for probability estimation. Even if the events were simple binary variables, there is clearly an exponential number of possible values for even the first term P(X_1 | X_2, ..., X_n, Y): conditioning on n - 1 binary variables requires a table of 2^(n-1) entries, a prohibitive amount of data storage.
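The storage argument above can be made concrete with a small calculation (the function name is purely illustrative):

```python
# Size of the conditional probability table needed to store
# P(X1 | X2, ..., Xn, Y) when each Xi is binary: one entry per
# assignment of the n - 1 conditioning variables.
def conditional_table_size(n):
    return 2 ** (n - 1)

# Even 30 binary variables already require over half a billion entries.
sizes = {n: conditional_table_size(n) for n in (10, 20, 30)}
```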
In the case where each observed event is independent of all others we can write:

P(X_1, ..., X_n | Y) = P(X_1 | Y) P(X_2 | Y) ... P(X_n | Y)
Clearly this is a much more practical definition of joint probability, but the requirement of independence is quite a severe restriction. However, in some cases data can be analysed to remove these correlations, in particular by the use of an appropriate data model (such as in least-squares fitting) and by processes for data orthogonalisation (including principal component analysis). For these reasons all common forms of maximum likelihood definitions assume data independence.
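Under the independence assumption, comparing models reduces to summing per-event log-probabilities. The per-event likelihood values and model names below are hypothetical, chosen only to show the mechanics:

```python
import math

# Hypothetical per-event likelihoods P(X_i | Y) for two candidate
# interpretations Y of the same four observations.
likelihoods = {
    "model_a": [0.9, 0.8, 0.7, 0.9],
    "model_b": [0.5, 0.6, 0.4, 0.5],
}

def joint_log_likelihood(per_event):
    """Log of the product of independent per-event probabilities."""
    return sum(math.log(p) for p in per_event)

best = max(likelihoods, key=lambda m: joint_log_likelihood(likelihoods[m]))
```

Working in log space is the standard practical choice: products of many probabilities underflow quickly, while their log-sum does not.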
Probability independence is such an important concept that it is worth defining carefully. If knowledge of the probability of one variable A allows us to gain knowledge about another event B, then these variables are not independent. Put in a way which is easily visualised: if the distribution of P(B | A) over all possible values of B is the same for all A, then the two variables are independent, i.e. P(B | A) = P(B). Assumptions of independence of data can be tested graphically by plotting P(B | A) against A, or A against B if the variables are directly monotonically related to their probabilities.
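The same check can be done numerically by estimating P(B | A) for each value of A and comparing the resulting distributions; if they coincide, the data are consistent with independence. The paired observations below are a constructed example, independent by design:

```python
from collections import Counter

# Hypothetical paired observations of two binary variables (A, B),
# independent by construction: every combination is equally frequent.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

def conditional(pairs, a_value):
    """Empirical distribution P(B | A = a_value) from paired samples."""
    counts = Counter(b for a, b in pairs if a == a_value)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

p_b_given_a0 = conditional(pairs, 0)
p_b_given_a1 = conditional(pairs, 1)
# Independence holds when these conditional distributions coincide.
```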
Unfortunately, under some circumstances, observable evidence towards a hypothesis cannot be combined into a single measure. Such problems are called multi-objective problems. In these circumstances it may be impossible to define a unique solution to a problem, and a whole set of candidate solutions exists, each representing a different trade-off between the objectives. This is best seen in the area of genetic algorithms, where the set of optimal solutions lies along what is known as the Pareto front. Further discussion of these techniques is beyond the scope of this tutorial, except to say that multi-objective optimisation should be avoided if a probabilistic combination approach is applicable.
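To make the Pareto front concrete, here is a minimal sketch of identifying it for a set of two-objective points (assuming both objectives are to be minimised; the points themselves are hypothetical):

```python
# Candidate solutions with two objective values each (lower is better).
points = [(1, 9), (2, 7), (3, 8), (4, 4), (6, 5), (8, 2), (9, 3)]

def dominates(p, q):
    """p dominates q if p is no worse in every objective and strictly
    better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

# The Pareto front is the set of non-dominated points: no other point
# improves one objective without worsening the other.
pareto = [p for p in points if not any(dominates(q, p) for q in points)]
```

Each point on the front represents a different trade-off; no single point among them can be preferred without additional information about how the objectives should be weighted.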