The best way, in my mind, to understand what is going on in this puzzle is to tackle it from a Bayesian perspective. First, note that there are various possible sums of money that could be put in the envelopes. The lower value we denote by X, and the higher value is fully determined by the lower value, being twice its value. We define a prior probability distribution P(X) for the choice the host makes for the lower value in the envelopes. Then there is the issue of which envelope you choose to open: the one that contains the lower value (denoted C=lower), or the one that contains the higher (denoted C=higher). As the amounts are hidden from you, you choose entirely at random, with equal probabilities for the two options. Hence the probability P(C) is 0.5 in both cases.
The paradox deals with the case where the content of the chosen envelope (denoted Y) is revealed to be a certain value, N say. We wish to ascertain, given this information, the posterior probability that the chosen envelope contains the higher or the lower value. This is P(C|Y=N), the probability that we chose the envelope containing the lower/higher value given that we now know what Y is. Writing this out using Bayes' rule, we have

P(C|Y=N) = P(Y=N|C) P(C) / P(Y=N).   (1)
Taking the first of the two cases, we have P(Y=N|C=lower)=P(X=N) as, if we have chosen the lower value, Y=X. The second of the two cases is P(Y=N|C=higher)=P(X=N/2), as in this case the chosen envelope contains double the lower value. Substituting these, together with P(C)=0.5, into (1) gives

P(C=lower|Y=N) = 0.5 P(X=N) / P(Y=N),   (2)

P(C=higher|Y=N) = 0.5 P(X=N/2) / P(Y=N).   (3)
Now P(Y=N) must normalise the distribution, so that (2) and (3) sum to one, and so

P(Y=N) = 0.5 [P(X=N) + P(X=N/2)].   (4)
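To make the posterior concrete, here is a minimal sketch that evaluates P(C=lower|Y=N) for a discrete prior. The function name posterior_lower and the three-point example prior are my own illustrative assumptions, not part of the original puzzle; exact fractions are used so the results can be checked by hand.

```python
from fractions import Fraction


def posterior_lower(prior, n):
    """Return P(C=lower | Y=n) for a discrete prior {x: P(X=x)} on the lower amount."""
    p_low = prior.get(n, Fraction(0))                 # P(Y=n | C=lower)  = P(X=n)
    p_high = prior.get(Fraction(n) / 2, Fraction(0))  # P(Y=n | C=higher) = P(X=n/2)
    evidence = (p_low + p_high) / 2                   # normalising term P(Y=n)
    if evidence == 0:
        raise ValueError("the amount n cannot occur under this prior")
    return (p_low / 2) / evidence


# Hypothetical prior (an assumption for illustration only):
# the lower amount is 10, 20 or 40 with probabilities 1/2, 1/3 and 1/6.
prior = {
    Fraction(10): Fraction(1, 2),
    Fraction(20): Fraction(1, 3),
    Fraction(40): Fraction(1, 6),
}

for amount in (10, 20, 40, 80):
    print(amount, posterior_lower(prior, amount))
```

For this prior the posterior is 1 when you see 10 (it can only be the lower amount), 2/5 when you see 20, 1/3 when you see 40, and 0 when you see 80. None of these equals the prior value of 0.5.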
The point is that, from (2), P(C=lower|Y=N) is only the same as P(C=lower) if P(X=N) is the same as P(Y=N). If it is not then in our calculations of expected gain we have used the incorrect probability. We have used the prior probability, P(C=lower)=0.5, of choosing the lower value, rather than the posterior probability P(C=lower|Y). The same argument follows for P(C=higher|Y).
Under what circumstances could we have P(X=N)=P(Y=N)? Well, from (4) we would need P(X=N/2)=P(X=N) for all values of N. This is an infinite uniform "distribution", but no such distribution exists (it cannot be normalised). Hence it is impossible to have any prior distribution for which P(X=N)=P(X=N/2) is satisfied for all N. As a result there is no posterior distribution for which we could have P(C=lower|Y)=0.5 for every possible value of Y. We have made the wrong calculation.
What is the correct expected gain calculation? It is (2N) P(C=lower|Y=N)+(N/2) P(C=higher|Y=N)-N, and the result you will get depends on the prior distribution for the amounts in the envelope. For example, if the prior distribution P(X) were uniform between 0 and 30000 pounds, and you found 40000 pounds in the envelope, then you would be certain to lose by swapping, whereas if the prior were uniform between 30000 and 100000 pounds, you would be certain to gain. More importantly, if the value of Y is not known then it needs to be integrated out. Some simple algebra shows that the result of doing this is a zero expected gain from swapping, independent of the form of P(X).
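The two uniform examples, and the claim that the gain vanishes once Y is averaged out, can be checked numerically. The sketch below is only an illustration under stated assumptions: the uniform priors are discretised to whole thousands of pounds (my choice, to keep the sums exact), and the names expected_gain and average_gain are hypothetical.

```python
from fractions import Fraction


def expected_gain(prior, n):
    """Expected gain from swapping after seeing amount n, using posterior probabilities."""
    p_low = prior.get(n, Fraction(0))                 # P(X=n): we hold the lower amount
    p_high = prior.get(Fraction(n) / 2, Fraction(0))  # P(X=n/2): we hold the higher amount
    total = p_low + p_high
    if total == 0:
        raise ValueError("the amount n cannot occur under this prior")
    post_low, post_high = p_low / total, p_high / total
    return 2 * n * post_low + Fraction(n) / 2 * post_high - n


def average_gain(prior):
    """Expected gain from swapping when Y is unknown: sum over all possible Y."""
    amounts = set(prior) | {2 * x for x in prior}
    gain = Fraction(0)
    for n in amounts:
        p_y = (prior.get(n, Fraction(0)) + prior.get(Fraction(n) / 2, Fraction(0))) / 2
        gain += p_y * expected_gain(prior, n)
    return gain


# Assumed discretisation: uniform over whole thousands of pounds.
narrow = {1000 * k: Fraction(1, 30) for k in range(1, 31)}    # up to 30000
wide = {1000 * k: Fraction(1, 71) for k in range(30, 101)}    # 30000 to 100000

print(expected_gain(narrow, 40000))  # a certain loss: 40000 must be the higher amount
print(expected_gain(wide, 40000))    # a certain gain: 40000 must be the lower amount
print(average_gain(narrow), average_gain(wide))
```

With the narrower prior, seeing 40000 means a certain loss of 20000 on swapping; with the wider prior it means a certain gain of 40000. Averaged over all possible values of Y, the gain is exactly zero for both priors.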
There is one further question that we might ask, though. Is there some prior distribution for which actually knowing the amount in the envelope provides no additional information as to whether a swap should be made; is there some uninformative distribution? For this to be the case the expected gain from swapping would be zero for every N, and so

(2N) P(C=lower|Y=N) + (N/2) P(C=higher|Y=N) - N = 0.   (5)
Substituting in (2) and (3) gives 2P(X=N)=P(X=N/2). This is satisfied by P(X)=A/X. However, again, if this were defined for all values from zero to infinity it could not be normalised. Why is it not possible to have such a distribution? Well, think about it. Let's suppose there were a distribution of this form. If, instead of putting twice as much (2X) in the second envelope, the host put X^2+X, then, were a distribution such as this possible, we could go through the whole argument again and arrive at the same absurd conclusion: it's always better to swap, regardless of the amount in the envelope.
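A rough numerical illustration of why P(X)=A/X fails: the sketch below restricts X to powers of two between 2^k_min and 2^k_max with weights proportional to 1/X. The grid of powers of two, the truncation limits and the function names are all my assumptions for illustration. On this grid the condition 2P(X=N)=P(X=N/2) holds in the interior, so seeing the amount tells you nothing there, but it necessarily breaks at the two ends of the range, and the normalising sum grows without bound as the lower limit is pushed towards zero.

```python
from fractions import Fraction


def truncated_prior(k_min, k_max):
    """Prior with P(X=2^k) proportional to 1/2^k for k_min <= k <= k_max.

    Returns the normalised prior and the unnormalised total weight.
    """
    weights = {Fraction(2) ** k: Fraction(1, 2) ** k for k in range(k_min, k_max + 1)}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}, total


def expected_gain(prior, n):
    """Expected gain from swapping after seeing amount n, using posterior probabilities."""
    p_low = prior.get(n, Fraction(0))       # P(X=n)
    p_high = prior.get(n / 2, Fraction(0))  # P(X=n/2)
    total = p_low + p_high
    return 2 * n * p_low / total + n / 2 * p_high / total - n


prior, _ = truncated_prior(0, 5)                      # X in {1, 2, 4, 8, 16, 32}
seen = sorted(set(prior) | {2 * x for x in prior})    # possible values of Y
gains = {n: expected_gain(prior, n) for n in seen}
for n, g in gains.items():
    print(n, g)

# The total weight needed for normalisation as the lower cut-off moves towards zero.
for k_min in (0, -5, -10, -20):
    print(k_min, float(truncated_prior(k_min, 5)[1]))
```

For this truncated prior the expected gain is zero for every interior amount from 2 to 32, positive at the smallest amount (1) and negative at the largest (64). Removing the end effects would require extending the range indefinitely, at which point the weights can no longer be normalised.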
The problem, then, is one of using the wrong probabilities. We were using the prior probabilities to calculate our expected gain rather than the posterior probabilities. Furthermore, it is not possible to choose a prior distribution which results in a posterior distribution for which the original argument holds; there are no circumstances in which it could be a valid argument to always use probabilities of 0.5. Also of interest, there is no prior distribution for which the value of the cheque in the opened envelope is guaranteed to give us no information. Finally, more generally, a simple extension of the argument can be used to show that any prior for which f(N) P(C=lower|Y=N)+f^(-1)(N)P(C=higher|Y=N)-N is well defined and positive for all N must be improper. That is left to the reader...
At the Newton Institute workshop at which I first came across this problem, I remember an incident in which a speaker was talking about a problem for which he used various improper (i.e. unnormalisable) priors. Again, if my memory serves me correctly, John Skilling, who was among the participants, couldn't help himself but offer a highly vocal "Uuuuurrrghhh" to the suggestion that improper priors should ever be used. The poor speaker was somewhat put off! However, this example is a great lesson in the pitfalls of using improper priors without proper and careful use of limit arguments, as well as a warning that using prior probabilities when posterior probabilities should have been used is a bad idea.