A Brief Introduction to Bayesian Statistics
Note: This note borrows from Bayesian Data Analysis, Andrew Gelman et al.[1], and Probability Theory: The Logic of Science, E. T. Jaynes[2]. These texts are recommended.
Empiricism through the lens of "inverse problems"
Science is about "inverse problems".
Forward problems are typically what we learn from class. Given a model (eg. classical EM), and a set of known data, what should we measure? For example, for electrostatic forces, we would get
If we know all the parameters, , then we can compute the force. This is how we learn science. However, this is not how we do science. When we do science, often times we have many observed variables, which are either controlled or measured, and latent variables, which cannot be directly observed but that we are interested in. In the context of the electrostatic force, we might be interested in the charge of an object, denoted . We can indirectly measure this by setting up a known charge some known distance away from our unknown charge, measure , and infer . This is known as an inverse problem [3].
What is the difficulty of inverse problems?
There is often no simple one-to-one mapping between sets of observed variables and sets of latent variables. For example, if we are trying to localise a gravitational wave signal in the night sky, a single observation in both LIGO hanford and LIGO Livingston might be able to localise the event to a long band in the night sky, and there might be multiple disconnected regions that the two detectors alone cannot distinguish.
We often do not know the observed quantities perfectly well.
Quick introduction to probability and statistics
Notation, definitions, conventions
For my lectures, we will be using a lot of probability notation. Here is a quick primer.
The probability of event happening is denoted .
The probability of or happening is denoted .
The probability of both and happening is denoted . This is also called a joint probability, and can be denoted .
The probability that occurs, given that occurs (eg. because has already occured or you want to know what would happen if were to occur) is called the conditional probability of given , denoted
Typically, capital would denote a discrete probability, whereas small would denote a probability density.
What is probability?
Frequentist: Probabilities represent long-run frequencies.
This is attractive as a way of codifying probability, as it forms an easily formalised definition.
Bayesian: Probabilities represent belief. This is vague, but intuitive. For example, we might say "It will likely rain tomorrow" to represent a belief that there is a chance that it will rain. A frequentist would instead, and less intuitively, say that there is either a chance or a chance, but we do not know which is correct!
Bayesian probability is an extension to logic
In logic, we can have statements with truth values, and the truth values of statements can depend on other statements. For example:
: It rained yesterday.
: The roads were wet yesterday.
In this case, we can say that the truth of is conditional on being true. However, about statements for which the truth values are not perfectly predetermined? Consider the following two statements:
: It will rain tomorrow morning.
: The roads will be wet tomorrow evening.
As we are now talking about events in the future, the truth value of the events are not known. In fact, we might wish to represent them as probabilities! After running extensive weather simulations on a supercomputer, I might conclude that . Given evaporation patterns and previous rainfall, we might conclude that , and . Bayesian statistics gives us a way to reason about statements with truth values that we are not sure about. If we are reasoning about statements with truth values that we are not sure about, we might also want to have rules for updating our confidence in the truth of a statement in light of new evidence; if we see storm cloud brewing, we might want to update !
Bayes' theorem
Bayes' theorem is the central rule for updating degrees of belief:
This theorem can be derived from the definition of the conditional probabilities and . We can work out a concrete example (inspired by example from Wikipedia). Suppose we are interested in testing for a certain condition in a broad population. Note that I am being deliberately vague here as the numbers are made up! Let us further suppose that this condition or disease is quite rare, and that only pf the population has it; in other words:
We have a very good test for this condition; someone who has this vague and mysterious illness has a chance of testing positive, whereas someone who does not only has a chance of testing positive:
If you get a positive test, should you freak out? In other words, what is ?
In other words, if there is nothing to increase the prior probability (for example, if you had matching symptoms, that might be a reason to increase the prior belief of having this condition), even if you tested positive on a test, it is still more likely for you to not have the condition!
Statistical modelling and probability functions
In physics, we often are not just dealing with discrete scenarios as above. Let us try out a classical particle physics scenario: a spectral fit with signal and background: 
In this scenario, we expect background events, distributed in energy as . Similarly, we expect signal events, distributed in energy as . denotes the parameters of this model, such as the signal strength, detector parameters, background rate, etc. Now, suppose we observe data points with energies .
Given such a model and data vector we would write Bayes' theorem as:
Here, is the posterior distribution we want, is the likelihood function, and is the prior. We can think of the denominator, , as a normalisation constant; after all, probability distributions like our posterior have to integrate to , and our prior multiplied by our likelihood is not guaranteed to do so.
What is the explicit form of our likelihood? We can think about how we would write down explicitly. We would read this as the probability density of our data points, , given a particular set of model parameters, . This tells us all we need to know!
Firstly, the total number of events should be Poisson distributed. We earlier stated that there are signal and background means and ; these summed together should be the Poisson rate parameter . In addition, we know the PDF as a function of energy for both the signal and the background components as and . (In a practical modelling problem, these will all be specified explicitly! See the colab notebook.)
We can thus write the likelihood as such:
References
| [1] | Gelman, Andrew, et al. Bayesian data analysis. Chapman and Hall/CRC, 1995. |
| [2] | Jaynes, Edwin T. Probability theory: The logic of science. Cambridge university press, 2003. |
| [3] | Cranmer, Kyle, Johann Brehmer, and Gilles Louppe. "The frontier of simulation-based inference." Proceedings of the National Academy of Sciences 117.48 (2020): 30055-30062. |