A Brief Introduction to Bayesian Statistics

Note: This note borrows from Bayesian Data Analysis, Andrew Gelman et al.[1], and Probability Theory: The Logic of Science, E. T. Jaynes[2]. These texts are recommended.

Empiricism through the lens of "inverse problems"

Science is about "inverse problems".

Forward problems are typically what we learn from class. Given a model (eg. classical EM), and a set of known data, what should we measure? For example, for electrostatic forces, we would get

F=14πε0q1q2r2r^\mathbf{F} = \frac{1}{4\pi\varepsilon_0}\frac{q_1 q_2}{|r|^2}\hat{r}

If we know all the parameters, q1,q2,rq_1, q_2, r, then we can compute the force. This is how we learn science. However, this is not how we do science. When we do science, often times we have many observed variables, which are either controlled or measured, and latent variables, which cannot be directly observed but that we are interested in. In the context of the electrostatic force, we might be interested in the charge of an object, denoted q2q_2. We can indirectly measure this by setting up a known charge q1q_1 some known distance rr away from our unknown charge, measure F\mathbf{F}, and infer q2q_2. This is known as an inverse problem [3].

What is the difficulty of inverse problems?

  1. There is often no simple one-to-one mapping between sets of observed variables and sets of latent variables. For example, if we are trying to localise a gravitational wave signal in the night sky, a single observation in both LIGO hanford and LIGO Livingston might be able to localise the event to a long band in the night sky, and there might be multiple disconnected regions that the two detectors alone cannot distinguish.

  2. We often do not know the observed quantities perfectly well.

Quick introduction to probability and statistics

Notation, definitions, conventions

For my lectures, we will be using a lot of probability notation. Here is a quick primer.

  • The probability of event AA happening is denoted P(A)P(A).

  • The probability of AA or BB happening is denoted P(AB)P(A \cup B).

  • The probability of both AA and BB happening is denoted P(AB)P(A \cap B). This is also called a joint probability, and can be denoted P(A,B)P(A, B).

  • The probability that AA occurs, given that BB occurs (eg. because BB has already occured or you want to know what would happen if BB were to occur) is called the conditional probability of AA given BB, denoted P(AB)P(AB)P(B)P(A|B)\equiv\frac{P(A \cap B)}{P(B)}

  • Typically, capital PP would denote a discrete probability, whereas small pp would denote a probability density.

What is probability?

Frequentist: Probabilities represent long-run frequencies.

P(A)=limNtotNANtotP(A) = \lim_{N_\mathrm{tot} \to \infty} \frac{N_\mathrm{A}}{N_\mathrm{tot}}

This is attractive as a way of codifying probability, as it forms an easily formalised definition.

Bayesian: Probabilities represent belief. This is vague, but intuitive. For example, we might say "It will likely rain tomorrow" to represent a belief that there is a >50%>50\% chance that it will rain. A frequentist would instead, and less intuitively, say that there is either a 0%0\% chance or a 1000%1000\% chance, but we do not know which is correct!

Bayesian probability is an extension to logic

In logic, we can have statements with truth values, and the truth values of statements can depend on other statements. For example:

  • AA: It rained yesterday.

  • BB: The roads were wet yesterday.

In this case, we can say that the truth of BB is conditional on AA being true. However, about statements for which the truth values are not perfectly predetermined? Consider the following two statements:

  • AA: It will rain tomorrow morning.

  • BB: The roads will be wet tomorrow evening.

As we are now talking about events in the future, the truth value of the events are not known. In fact, we might wish to represent them as probabilities! After running extensive weather simulations on a supercomputer, I might conclude that P(A)=30%P(A) = 30\%. Given evaporation patterns and previous rainfall, we might conclude that P(BA)=30%P(B|A) = 30\%, and P(B¬A)=1%P(B|\neg A) = 1\%. Bayesian statistics gives us a way to reason about statements with truth values that we are not sure about. If we are reasoning about statements with truth values that we are not sure about, we might also want to have rules for updating our confidence in the truth of a statement in light of new evidence; if we see storm cloud brewing, we might want to update P(A)P(A)!

Bayes' theorem

Bayes' theorem is the central rule for updating degrees of belief:

P(AB)=P(BA)P(A)P(B) P(A\vert B)={\frac {P(B\vert A)P(A)}{P(B)}}

This theorem can be derived from the definition of the conditional probabilities P(AB)P(A\vert B) and P(BA)P(B\vert A). We can work out a concrete example (inspired by example from Wikipedia). Suppose we are interested in testing for a certain condition in a broad population. Note that I am being deliberately vague here as the numbers are made up! Let us further suppose that this condition or disease is quite rare, and that only 0.1%0.1\% pf the population has it; in other words:

P(Sick)=103.P(\mathrm{Sick}) = 10^{-3}.

We have a very good test for this condition; someone who has this vague and mysterious illness has a 95%95\% chance of testing positive, whereas someone who does not only has a 2%2\% chance of testing positive:

P(PositiveSick)=0.95,P(Positive¬Sick)=0.02.P(\mathrm{Positive}|\mathrm{Sick}) = 0.95, P(\mathrm{Positive}|\neg\mathrm{Sick}) = 0.02.

If you get a positive test, should you freak out? In other words, what is P(SickPositive)P(\mathrm{Sick}|\mathrm{Positive})?

P(SickPositive)=P(PositiveSick)P(Sick)P(Positive)=P(PositiveSick)P(Sick)P(PositiveSick)P(Sick)+P(Positive¬Sick)P(¬Sick)=0.95×1030.95×103+0.02×(1103)0.045\begin{split} P(\mathrm{Sick}|\mathrm{Positive}) &= \frac{P(\mathrm{Positive}|\mathrm{Sick})P(\mathrm{Sick})}{P(\mathrm{Positive})}\\ &=\frac{P(\mathrm{Positive}|\mathrm{Sick})P(\mathrm{Sick})}{P(\mathrm{Positive}|\mathrm{Sick})P(\mathrm{Sick}) + P(\mathrm{Positive}|\neg\mathrm{Sick})P(\neg\mathrm{Sick})}\\ &= \frac{0.95 \times 10^{-3}}{0.95\times10^{-3} + 0.02 \times (1-10^{-3})}\\ &\approx 0.045 \end{split}

In other words, if there is nothing to increase the prior probability (for example, if you had matching symptoms, that might be a reason to increase the prior belief of having this condition), even if you tested positive on a test, it is still more likely for you to not have the condition!

Statistical modelling and probability functions

In physics, we often are not just dealing with discrete scenarios as above. Let us try out a classical particle physics scenario: a spectral fit with signal and background: Spectrum with signal and background

In this scenario, we expect μbg(θ)\mu_\mathrm{bg}\left(\vec{\theta}\right) background events, distributed in energy as pbg(Eθ)p_{\mathrm{bg}}\left(E|\vec{\theta}\right). Similarly, we expect μsig(θ)\mu_\mathrm{sig}\left(\vec{\theta}\right) signal events, distributed in energy as psig(Eθ)p_{\mathrm{sig}}\left(E|\vec{\theta}\right). θ\vec{\theta} denotes the parameters of this model, such as the signal strength, detector parameters, background rate, etc. Now, suppose we observe NobsN_\mathrm{obs} data points with energies X={Ei}\vec{X} = \{E_i\}.

Given such a model and data vector we would write Bayes' theorem as:

p(θX)=p(Xθ)p(θ)p(X).\begin{split} p\left(\vec{\theta}|\vec{X}\right) = \frac{p\left(\vec{X}|\vec{\theta}\right)p\left(\vec{\theta}\right)}{p\left(\vec{X}\right)}. \end{split}

Here, p(θX)p\left(\vec{\theta}|\vec{X}\right) is the posterior distribution we want, p(Xθ)p\left(\vec{X}|\vec{\theta}\right) is the likelihood function, and p(θ)p\left(\vec{\theta}\right) is the prior. We can think of the denominator, p(X)p\left(\vec{X}\right), as a normalisation constant; after all, probability distributions like our posterior have to integrate to 11, and our prior multiplied by our likelihood is not guaranteed to do so.

What is the explicit form of our likelihood? We can think about how we would write down p(Xθ)p\left(\vec{X}|\vec{\theta}\right) explicitly. We would read this as the probability density of our data points, X\vec{X}, given a particular set of model parameters, θ\vec{\theta}. This tells us all we need to know!

Firstly, the total number of events should be Poisson distributed. We earlier stated that there are signal and background means μsig(θ)\mu_\mathrm{sig}\left(\vec{\theta}\right) and μbg(θ)\mu_\mathrm{bg}\left(\vec{\theta}\right); these summed together should be the Poisson rate parameter λ\lambda. In addition, we know the PDF as a function of energy for both the signal and the background components as psig(Eθ)p_{\mathrm{sig}}\left(E|\vec{\theta}\right) and pbg(Eθ)p_{\mathrm{bg}}\left(E|\vec{\theta}\right). (In a practical modelling problem, these will all be specified explicitly! See the colab notebook.)

We can thus write the likelihood as such:

p(Xθ)=(μsig(θ)+μbg(θ))Ne(μsig(θ)+μbg(θ))N!Poisson term for total number of events×1μsig(θ)+μbg(θ)iN(μsig(θ)psig(Eiθ)+μbg(θ)pbg(Eiθ))Likelihood for event energies .\begin{split} p\left(\vec{X}|\vec{\theta}\right) =& \underbrace{\frac{\left(\mu_\mathrm{sig}\left(\vec{\theta}\right) + \mu_\mathrm{bg}\left(\vec{\theta}\right)\right)^{N} e^{-\left(\mu_\mathrm{sig}\left(\vec{\theta}\right) + \mu_\mathrm{bg}\left(\vec{\theta}\right)\right)}}{N!}}_\text{Poisson term for total number of events} \\ &\underbrace{\times\frac{1}{\mu_\mathrm{sig}\left(\vec{\theta}\right) + \mu_\mathrm{bg}\left(\vec{\theta}\right)} \prod^N_i\left(\mu_\mathrm{sig}\left(\vec{\theta}\right)p_{\mathrm{sig}}\left(E_i|\vec{\theta}\right) + \mu_\mathrm{bg}\left(\vec{\theta}\right)p_{\mathrm{bg}}\left(E_i|\vec{\theta}\right)\right)}_{\text{Likelihood for event energies }}. \end{split}

References

[1] Gelman, Andrew, et al. Bayesian data analysis. Chapman and Hall/CRC, 1995.
[2] Jaynes, Edwin T. Probability theory: The logic of science. Cambridge university press, 2003.
[3] Cranmer, Kyle, Johann Brehmer, and Gilles Louppe. "The frontier of simulation-based inference." Proceedings of the National Academy of Sciences 117.48 (2020): 30055-30062.

CC BY 4.0 Qin Juehang. Last modified: March 25, 2025. Website built with Franklin.jl and the Julia programming language.