Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." W. A. Wallis
In the next few lectures we will introduce some of the fundamental concepts behind statistical inference.
We can't cover everything, but hopefully we can show why statistics is important, and debunk the myth that it is fundamentally difficult.
The truth is beyond our grasp, but knowledge grows through the gathering of evidence.
Everything varies, but we can't measure everything... so we take samples.
Sampled measurements contain noise and signal.
It is up to us to decide what information we want to extract from the data, and how to quantify it adequately.
We do this by defining a statistical model of the system.
This is a mathematical representation of reality that can be used to improve our understanding of the system.
"All models are wrong, but some are useful" George E. P. Box
"... we need first to abstract the essence of the data-producing mechanism into a form that is amenable to mathematical and statistical treatment...
This is the statistical model of the system. Fitting a model to a given set of data will then provide a framework for extrapolating the results to a wider context or for predicting future outcomes, and can often lead to an explanation of the system."
Wojtek J. Krzanowski (1998)
The fundamental concept underpinning statistics is that measurable quantities are random, in the sense that they will vary within a population of individuals.
Paraphrasing David Spiegelhalter*,
"Measurements are unpredictable, but in a predictable way."
*because his fantastic programme, "Tails you win: the science of chance" is currently unavailable on BBC iPlayer...
We refer to these quantities as random variables, of which there are three main types: continuous, discrete (e.g. counts) and categorical.
The most basic fundamental model is called a probability distribution.
We assume that in a population, the probability of observing any viable value for a random variable is given by an associated probability distribution.
For example, consider a normal* distribution.
In this case the distribution is symmetric around the mean.
*sometimes called Gaussian distribution, after Carl Friedrich Gauss (1777–1855)
The probability that a new sample \(x'\) takes a value between \(x_1\) and \(x_2\) is given by the area under the curve:
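As a quick illustration (not part of the original notes), this area can be computed numerically. The sketch below is in Python and assumes SciPy's normal CDF; the mean, standard deviation, \(x_1\) and \(x_2\) are arbitrary illustrative values.

```python
from scipy.stats import norm

# Hypothetical normal population (illustrative values, not from the notes)
mu, sigma = 10, 2

# P(x1 < X < x2) is the area under the density curve between x1 and x2,
# i.e. the difference between two values of the cumulative distribution function
x1, x2 = 8, 12
prob = norm.cdf(x2, loc=mu, scale=sigma) - norm.cdf(x1, loc=mu, scale=sigma)
print(f"P({x1} < x' < {x2}) = {prob:.3f}")  # ~0.683, i.e. within one sd of the mean
```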
Problem: we don't observe the population
This module is concerned entirely with the following question:
How do we make inferences about the underlying population, using information obtained from a sample?
Begging rate of nestlings in relation to total mass of the brood of reed warbler chicks (solid circles, dashed fitted line) or cuckoo chick (open circles, solid fitted line).
A snapshot of the data (\(n = 58\)):
Take your measurement:

- Split it into bins (usually of equal sizes)
- Allocate each datum into a bin
- Count how many data in each bin (frequency)
- Plot measurement on \(x\)-axis and frequency on \(y\)-axis (see the sketch below)

Features we might want to estimate:

- Central tendency
- Spread
- Shape
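The histogram construction steps above can be sketched in a few lines; Python and NumPy are used purely for illustration (the measurements are simulated, not the course data):

```python
import numpy as np

rng = np.random.default_rng(1)
measurements = rng.normal(loc=0, scale=np.sqrt(10), size=58)  # simulated measurements

# Split the range into bins, allocate each datum to a bin, and count the frequencies
frequencies, bin_edges = np.histogram(measurements, bins=8)

# Crude text version of the plot: measurement (bin) against frequency
for lo, hi, count in zip(bin_edges[:-1], bin_edges[1:], frequencies):
    print(f"[{lo:6.2f}, {hi:6.2f}): {'*' * int(count)}")
```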
Mean: add up all the measurements and divide by the sample size.
Median: rank the data smallest to largest and split the sample in half.
Mode: which measurement, or bin, is the most frequent?
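A small illustrative sketch of these three summaries in Python (the data are made up, not from the lecture):

```python
import statistics

sample = [3, 5, 5, 6, 7, 8, 9, 12, 15]  # made-up data

print(statistics.mean(sample))    # add up all the measurements, divide by the sample size
print(statistics.median(sample))  # rank smallest to largest and take the middle value
print(statistics.mode(sample))    # the most frequent value (here 5)
```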
Which to Choose?
Mode: useful if you actually want to say which outcome is most common. Not actually a measure of central tendency. Some histograms have multiple modes.
Median: great for asymmetrical distributions, and the basis of many non-parametric tests.
Mean: is the basis for most mathematical statistics and works well for symmetrical distributions.
Range: biggest value minus smallest value---i.e. how wide is the distribution?
Interquartile range: rank the data smallest to largest. The lower quartile (LQ) is at 25%, the upper quartile (UQ) at 75%. The IQ range is UQ minus LQ.
Variance: the average squared distance from each data point to the mean.
Standard deviation: the square root of the variance.
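These measures of spread can be computed in the same way; an illustrative Python sketch using the same made-up data as above (note `ddof=1` gives the usual sample variance, dividing by \(n - 1\)):

```python
import numpy as np

sample = np.array([3, 5, 5, 6, 7, 8, 9, 12, 15])  # same made-up data as above

data_range = sample.max() - sample.min()      # biggest value minus smallest value
lq, uq = np.percentile(sample, [25, 75])      # lower and upper quartiles
iqr = uq - lq                                 # interquartile range
variance = sample.var(ddof=1)                 # sample variance (divides by n - 1)
sd = np.sqrt(variance)                        # standard deviation, same scale as the data

print(data_range, iqr, variance, sd)
```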
Which to choose?
Range: gets bigger as sample size increases. Not good.
IQ range: useful for describing asymmetrical distributions.
Standard deviation: the basis for lots of mathematical statistics. Measures spread on the same scale as the data. Is: "the square root of the average squared distance from the data to the mean".
What about variance? It measures spread on the squared scale of the data, so the standard deviation is usually easier to interpret directly.
So we can generate point estimates for certain features of a population from a sample.
A key question to ask is then:
How confident are we in our estimate?
We can use the idea of sampling distributions to give us some measure of uncertainty associated with an estimated quantity.
These are known as confidence intervals.
Because measurements are random, and samples are of finite size, there will always be some variation in what's observed.
For example, let's draw a random sample of size 5 from a normal distribution with mean 0 and variance 10:
| 1 | 2 | 3 | 4 | 5 | mean |
|---|---|---|---|---|------|
| 0.06 | -0.58 | -4.34 | -1.89 | 0.93 | -1.16 |
| 1.23 | -3.82 | -1.15 | -5.14 | -0.81 | -1.94 |
| 3.48 | 2.39 | -0.75 | 3.12 | 2.34 | 2.12 |
| 0.28 | -3.02 | -0.62 | 2.93 | 1.53 | 0.22 |
| -1.89 | -6.91 | -2.13 | -6.7 | -4 | -4.33 |

Carrying this on ad infinitum...
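This "carrying on ad infinitum" step can be mimicked by simulation; a Python sketch for illustration (different random draws, so the numbers will not match the table above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples of size n = 5 from N(0, 10) and record each sample mean;
# the collection of means approximates the sampling distribution of the sample mean
n, n_repeats = 5, 10_000
samples = rng.normal(loc=0, scale=np.sqrt(10), size=(n_repeats, n))
sample_means = samples.mean(axis=1)

print(sample_means[:5])          # a few simulated sample means
print(sample_means.std(ddof=1))  # close to sqrt(10 / 5), i.e. about 1.41
```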
This distribution is known as the sampling distribution of the sample mean.
Notice that most of the time the sample mean is around zero, but occasionally there are large and small values, just by random chance.
Hence it is possible to see these extreme values, but the likelihood is very small.
So the sample mean itself is a random variable, which in this case follows a normal distribution with mean \(\mu\) (the population mean).
Notice that as \(n\) increases, the variance of the sampling distribution decreases.
In fact, \(~\bar{x} \sim N\left( \mu, \frac{\sigma^2}{n}\right)~\) and so \(~\bar{x} \to \mu~\) as \(~n \to \infty~\).
Hence the sample mean is an unbiased and consistent estimator of \(\mu\).
In reality we often do not know \(\sigma^2\), so we estimate it using the sample variance: \(s^2\).
In this case \(~\frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \sim t_{n - 1}~\), where \(~t_{n - 1}~\) denotes a t-distribution on \(~n - 1~\) degrees-of-freedom.
The quantity \(~\frac{s}{\sqrt{n}}~\) is known as the standard error.
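For instance, the standard error can be computed directly from a single sample; a short Python sketch using the first row of the table above:

```python
import numpy as np

sample = np.array([0.06, -0.58, -4.34, -1.89, 0.93])  # first row of the table above

n = len(sample)
s = sample.std(ddof=1)           # sample standard deviation (estimates sigma)
standard_error = s / np.sqrt(n)  # estimated standard deviation of the sample mean

print(sample.mean(), standard_error)
```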
What about sample variance?
This is also a random variable, with mean (central tendency) equal to the population variance (phew)!
It is also based on a sample and is subject to error.
In this case the scaled ratio of the sample variance to the population variance, \(~(n - 1) s^2 / \sigma^2~\), follows a chi-squared (\(~\chi^2~\)) distribution on \(~n - 1~\) degrees-of-freedom.
A confidence interval can be derived from the quantiles of the sampling distribution e.g. a 95% CI is given by the 2.5% and 97.5% quantiles.
The \(~100(1 - \alpha)\%~\) CI for the population mean (from normally distributed samples) is:
\[ \bar{x} \pm \frac{s}{\sqrt{n}} t_{1 - \alpha / 2, n - 1} \]
where \(~t_{1 - \alpha / 2, n - 1}~\) is the \(~(1 - \alpha / 2)^{\text{th}}~\) quantile of the t-distribution on \(~n - 1~\) degrees-of-freedom.
The length of the interval decreases as \(n\) increases (and as \(s\) decreases), so the estimate becomes more precise as \(n \to \infty\).
(This is the form of the CI based on the sample mean, but other forms exist for other statistics, such as the sample variance. They depend on the relevant sampling distribution.)
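As an illustration, a 95% CI of this form can be computed as follows; the Python sketch assumes SciPy for the t quantile and uses made-up data:

```python
import numpy as np
from scipy.stats import t

sample = np.array([4.1, 5.3, 2.8, 6.0, 4.7, 5.5, 3.9, 4.4])  # made-up data
n = len(sample)
alpha = 0.05

xbar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)         # standard error s / sqrt(n)
t_quantile = t.ppf(1 - alpha / 2, df=n - 1)  # (1 - alpha/2) quantile of t on n - 1 df

lower, upper = xbar - t_quantile * se, xbar + t_quantile * se
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```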
A 95% CI does NOT mean that there is a 95% probability that the true value lies between the upper and lower bounds.
It means:
there is 95% probability that the interval contains the true value
i.e. if we were to repeat the experiment ad infinitum, then 95% of the time the CI would contain the true value.
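This repeated-sampling interpretation can be checked by simulation: draw many samples from a known population, compute a 95% CI from each, and count how often the interval contains the true mean. An illustrative Python sketch (assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)
mu, sigma2, n, n_repeats = 0, 10, 5, 10_000
t_q = t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(n_repeats):
    sample = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lower, upper = sample.mean() - t_q * se, sample.mean() + t_q * se
    covered += int(lower <= mu <= upper)

print(covered / n_repeats)  # should be close to 0.95
```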
Bayesian vs. frequentist - see later lecture.
From our earlier example, we obtain the following 95% CIs:
| 1 | 2 | 3 | 4 | 5 | mean | 2.5% | 97.5% |
|---|---|---|---|---|------|------|-------|
| 0.06 | -0.58 | -4.34 | -1.89 | 0.93 | -1.16 | -3.09 | 0.76 |
| 1.23 | -3.82 | -1.15 | -5.14 | -0.81 | -1.94 | -4.32 | 0.44 |
| 3.48 | 2.39 | -0.75 | 3.12 | 2.34 | 2.12 | 0.54 | 3.69 |
| 0.28 | -3.02 | -0.62 | 2.93 | 1.53 | 0.22 | -1.89 | 2.33 |
| -1.89 | -6.91 | -2.13 | -6.7 | -4 | -4.33 | -6.59 | -2.07 |

These theoretical sampling distributions rely on the assumption of normality. What if the data are non-normal?
Continuous measurement, skewed to the right, constrained below by zero.
Integer-valued (it's a count), skewed to the right, constrained below by zero.
Counts of random events can have a Poisson distribution
Number of successes out of \(~n~\) trials has a binomial distribution
Number of successes out of 1 trial has a Bernoulli distribution
What if we wish to calculate a confidence interval for a quantity when the data are non-normal?
What if we wish to estimate a quantity like the median, for which a theoretical sampling distribution can't be derived?
One possibility: bootstrapping...
The basic idea is to re-sample WITH replacement from the sample itself.
If this is done a large number of times, then an empirical sampling distribution can be generated.
The idea is that if the sample is representative of the population, then we can approximate the sampling distribution by performing inference treating the sample as our 'population'.
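A minimal sketch of this re-sampling procedure for the sample mean, in Python with made-up data (purely illustrative; the original course materials may use different tools):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=np.sqrt(10), size=5)  # one observed sample (illustrative)

# Re-sample WITH replacement from the sample itself, many times,
# recording the statistic of interest (here the sample mean) each time
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(n_boot)
])

# The empirical sampling distribution gives a percentile bootstrap 95% CI
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrap 95% CI: ({lower:.2f}, {upper:.2f})")
```

The same recipe works for statistics without a tractable sampling distribution, such as the median: replace `.mean()` with `np.median(...)` in the loop.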
e.g.
| 2.5% | 97.5% | Boot 2.5% | Boot 97.5% |
|------|-------|-----------|------------|
| -3.09 | 0.76 | -2.79 | 0.28 |
| -4.32 | 0.44 | -3.87 | -0.06 |
| 0.54 | 3.69 | 0.64 | 3.18 |
| -1.89 | 2.33 | -1.63 | 1.94 |
| -6.59 | -2.07 | -6.2 | -2.41 |

Be careful if the distribution is heavy-tailed when bootstrapping the mean.
Median: 13, 95% CI: (7, 25.5)
What is a potential issue with the interpretation?
Statistical inference aims to disentangle signal from noise.
Make assumptions about the data, guided by:
Choose what features to measure:
Understand uncertainty!