2.2 Bayesian learning

The previous unit introduced the frequentist and Bayesian approaches to probability. In this unit, we will look at how the Bayesian approach can be used to model reality and where it excels compared to frequentist methods. Let us return to the straightforward example of observing tosses of a coin. We would naturally expect the coin to be fair, i.e., to show heads or tails with equal probability: before observing its behaviour, the appropriate assumption seems to be that most coins are more or less fair. To test this hypothesis, we toss the coin 100 times.

Even if the coin is indeed fair, it is very unlikely that the tosses will reflect this fairness exactly, in the sense that we will almost certainly not observe precisely 50 heads and 50 tails. In fact, over 100 tosses we should expect the difference between the number of heads and the number of tails to be about 8 on average.
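As a quick sanity check, the short simulation below estimates this expected difference empirically. This is a minimal sketch: the toss and trial counts are arbitrary illustration choices, not part of the unit's material.

import random

# Estimate the expected absolute difference between the number of heads
# and the number of tails over many repetitions of 100 fair coin tosses.
def expected_difference(n_tosses, n_trials=100_000):
    total = 0
    for _ in range(n_trials):
        heads = sum(random.random() < 0.5 for _ in range(n_tosses))
        total += abs(2 * heads - n_tosses)  # |heads - tails|
    return total / n_trials

print(expected_difference(100))  # typically prints a value close to 8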

Individual activity: Probability 1

Question: Assume that the expected difference between the number of heads and tails observed after $n$ coin tosses is $C\sqrt{n}$ for some constant $C$. Given that this value is about 8 for $n = 100$, roughly how large will it be if we were to toss the coin 10,000 times?

Answer: $C\sqrt{100} = 10C \approx 8$, so $C\sqrt{10000} = 100C \approx 80$.

If you are not sure how to calculate the answer, post your thoughts in the Bayesian learning forum to discuss with your classmates and tutor.
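To see the $\sqrt{n}$ scaling empirically, the sketch below runs the same kind of simulation for both toss counts: the ratio of the estimated difference to $\sqrt{n}$ should stay roughly constant (around 0.8, consistent with the closed form $\sqrt{2/\pi} \approx 0.798$). The trial count is kept small here so the script runs quickly.

import math
import random

def expected_difference(n_tosses, n_trials=2_000):
    total = 0
    for _ in range(n_trials):
        heads = sum(random.random() < 0.5 for _ in range(n_tosses))
        total += abs(2 * heads - n_tosses)
    return total / n_trials

for n in (100, 10_000):
    diff = expected_difference(n)
    # diff / sqrt(n) estimates the constant C from the activity.
    print(n, round(diff, 1), round(diff / math.sqrt(n), 3))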

Suppose we observe 54 heads and 46 tails. While this is a perfectly plausible outcome for a fair coin tossed 100 times, it is not clear how this observation could be interpreted as evidence of the coin's fairness. After all, it seems intuitive that we would be more likely to observe this outcome if the coin showed heads with probability 54%. The question is: how should we update our prior belief, the assumed fairness of the coin, based on our observations?

A reasonable approach would be to find the probability distribution most likely to generate values like the dataset we observed. To put it formally, we could aim to find the distribution $\theta$ which maximises the expression $P(\mathrm{Data} \mid \theta)$, i.e., the probability of observing the dataset if it were generated according to the distribution $\theta$. In this simple case, choosing $\theta$ so that the probability of heads equals exactly the observed frequency of heads maximises this likelihood (the proof requires some calculus that we will not go into here). Essentially, we would draw the same conclusion as a frequentist: we would define the probability of heads to be exactly its frequency in the dataset. However, this estimate would inherently be somewhat off due to the variance present in the data.
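As an illustration, the sketch below evaluates the likelihood $P(\mathrm{Data} \mid \theta)$ for our 54-heads observation over a grid of candidate biases $\theta$ and confirms that the maximum sits at the observed frequency. The grid resolution is an arbitrary choice.

import math

heads, n = 54, 100

# Binomial likelihood of observing `heads` heads in `n` tosses of a coin
# that shows heads with probability theta.
def likelihood(theta):
    return math.comb(n, heads) * theta**heads * (1 - theta)**(n - heads)

grid = [i / 1000 for i in range(1, 1000)]
print(max(grid, key=likelihood))  # 0.54, the observed frequency of heads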

Stirling's approximation

Stirling's approximation is a mathematical technique used to estimate the value of large factorials (the factorial of a number is the product of all positive integers up to that number). In simpler terms, it provides an easier way to handle and calculate very large numbers, especially when factorials are involved.

For example, calculating the factorial of a large number like 100! (which means 1 × 2 × 3 × ... × 100) can be cumbersome. Stirling's approximation makes this task more manageable by providing a close estimate rather than the exact value. The approximation is particularly useful in areas like probability, statistics, and combinatorics, where factorials appear frequently.
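A two-line comparison shows how close the estimate is; this sketch simply pits Stirling's formula against Python's exact factorial.

import math

k = 100
stirling = math.sqrt(2 * math.pi * k) * (k / math.e) ** k
exact = math.factorial(k)
print(stirling / exact)  # about 0.99917: under 0.1% relative error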

Individual activity: Probability 2

Question: The probability of observing exactly $n$ heads and $n$ tails from $2n$ tosses is $\binom{2n}{n}\frac{1}{2^{2n}} = \frac{(2n)!}{(n!)^2}\cdot\frac{1}{2^{2n}}$.

Using Stirling's approximation, $k! \approx \sqrt{2\pi k}\left(\frac{k}{e}\right)^k$, estimate this probability.

Answer: $\frac{(2n)!}{(n!)^2\,2^{2n}} \approx \frac{\sqrt{4\pi n}\left(\frac{2n}{e}\right)^{2n}}{2\pi n \left(\frac{n}{e}\right)^{2n} 2^{2n}} = \frac{1}{\sqrt{\pi n}}$.

If you are not sure how to calculate the answer, post your thoughts in the Bayesian learning forum to discuss with your classmates and tutor.
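The sketch below compares the exact probability $\binom{2n}{n}/2^{2n}$ with the estimate $1/\sqrt{\pi n}$ for a few values of $n$; the two agree more and more closely as $n$ grows.

import math

# Exact probability of exactly n heads in 2n fair tosses versus the
# Stirling-based estimate 1 / sqrt(pi * n).
for n in (10, 100, 1000):
    exact = math.comb(2 * n, n) / 2 ** (2 * n)
    approx = 1 / math.sqrt(math.pi * n)
    print(n, exact, approx)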

It could be argued that if we looked at an increasingly large number of tosses, the frequentist approach would provide an increasingly accurate model. In practice, however, we may not have access to more data points, yet we may be able to make reasonable assumptions about the source of the data before observing the data itself. We would like some way to incorporate the information from these prior assumptions into our model.

To achieve this, let us take a Bayesian approach. Recall from the previous unit that Bayesian probability does not focus on a 'true' underlying distribution that could have generated the sample. Instead, it treats the observed data as the fixed ground truth against which possible hypotheses are validated or disproved. To formalise this, a Bayesian approach attempts to calculate $P(\theta \mid \mathrm{Data})$, the probability of a hypothesis $\theta$ holding given the observed data.
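As a preview of how such a calculation might look, the sketch below combines a prior that favours near-fair coins with the likelihood of our 54-heads observation, evaluated on a grid of biases. The choice of a Beta(50, 50) prior here is an assumption made purely for illustration, not part of the unit's material.

heads, n = 54, 100
grid = [i / 1000 for i in range(1, 1000)]

# Prior density (up to a constant): Beta(50, 50), concentrated near 0.5.
# This encodes the belief that most coins are more or less fair.
def prior(theta):
    return theta**49 * (1 - theta)**49

# Likelihood P(Data | theta) up to a constant: 54 heads in 100 tosses.
def likelihood(theta):
    return theta**heads * (1 - theta)**(n - heads)

# Posterior P(theta | Data) on the grid, via Bayes' rule and normalisation.
unnorm = [prior(t) * likelihood(t) for t in grid]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# The posterior mode lies between the prior belief (0.5) and the observed
# frequency (0.54): the data has pulled our belief, but only partway.
mode = max(grid, key=lambda t: prior(t) * likelihood(t))
print(mode)  # about 0.52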