On GitHub: tjmckinley / statisticalinference

Lecture 1

Statistical inference

TJ McKinley / t.j.mckinley@exeter.ac.uk

Statistics

Statistics may be defined as "a body of methods for making wise decisions in the face of uncertainty." W. A. Wallis

In the next few lectures we will introduce some of the fundamental concepts behind statistical inference.

We can't cover everything, but hopefully we can show why statistics is important, and debunk the myth that it is fundamentally difficult.

Why 'do' Statistics?

The truth is beyond our grasp, but knowledge grows through the gathering of evidence.

Everything varies, but we can't measure everything... so we take samples.

Samples

Sampled measurements contain noise and signal.

It is up to us to decide what information we are interested in extracting from the data, and how to quantify it adequately.

Statistical models

We do this by defining a statistical model of the system.

This is a mathematical representation of reality that can be used to improve our understanding of the system.

"All models are wrong, but some are useful" George E. P. Box

Statistical model

"... we need first to abstract the essence of the data-producing mechanism into a form that is amenable to mathematical and statistical treatment...

This is the statistical model of the system. Fitting a model to a given set of data will then provide a framework for extrapolating the results to a wider context or for predicting future outcomes, and can often lead to an explanation of the system."

Wojtek J. Krzanowski (1998)

Random variables

The fundamental concept underpinning statistics is that measurable quantities are random, in the sense that they will vary within a population of individuals.

Paraphrasing David Spiegelhalter*,

"Measurements are unpredictable, but in a predictable way."

*because his fantastic programme, "Tails you win: the science of chance" is currently unavailable on BBC iPlayer...

Common types of random variables

We denote these quantities random variables, of which there are three main types:

  • continuous (e.g. weight, height),
  • discrete (e.g. age, counts),
  • categorical (e.g. gender).

The most fundamental model is called a probability distribution.

Probability distributions

We assume that in a population, the probability of observing any viable value for a random variable is given by an associated probability distribution.

For example, consider a normal* distribution.

In this case the population is symmetric around the mean.

The probability that a new sample \(x'\) takes a value between \(x_1\) and \(x_2\) is given by the area under the curve between those two points.

*sometimes called Gaussian distribution, after Carl Friedrich Gauss (1777–1855)
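For example, this area can be computed from the normal cumulative distribution function; a minimal Python sketch (scipy assumed, with illustrative values \(\mu = 0\), \(\sigma = 1\), \(x_1 = -1\), \(x_2 = 1\)):

```python
from scipy.stats import norm

# P(x1 < X < x2) is the area under the normal density between x1 and x2,
# i.e. a difference of the cumulative distribution function (CDF)
mu, sigma = 0.0, 1.0          # illustrative population parameters (assumption)
x1, x2 = -1.0, 1.0            # illustrative bounds (assumption)
prob = norm.cdf(x2, mu, sigma) - norm.cdf(x1, mu, sigma)
print(prob)                   # about 0.683: one sd either side of the mean
```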

Problem

Problem: we don't observe the population

This module is concerned entirely with the following question:

How do we make inferences about the underlying population, using information obtained from a sample?

Exploratory analysis

Example: Cuckoos

Begging rate of nestlings in relation to total mass of the brood of reed warbler chicks (solid circles, dashed fitted line) or cuckoo chick (open circles, solid fitted line).

Example: Cuckoos

A snapshot of the data (\(n = 58\)):

Mass   Beg   Species        Mass   Beg   Species
 9.6     0   Cuckoo         29.9    44   Warbler
10.2     0   Cuckoo         38.3    22   Warbler
13.1     0   Cuckoo         38.3    29   Warbler
15.2     0   Cuckoo         38.3    37   Warbler
16.2     0   Cuckoo         36.5    46   Warbler
20.1     0   Cuckoo         39.6    53   Warbler

Histogram

Take your measurement:

  • Split it into bins (usually of equal sizes)
  • Allocate each datum into a bin
  • Count how many data fall in each bin (the frequency)
  • Plot the measurement on the \(x\)-axis and the frequency on the \(y\)-axis.
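For instance, a minimal sketch of these steps in Python (numpy and matplotlib assumed; the masses are the snapshot values shown earlier):

```python
import numpy as np
import matplotlib.pyplot as plt

# Masses from the data snapshot above (a subset of the full n = 58)
mass = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1,
                 29.9, 38.3, 38.3, 38.3, 36.5, 39.6])

# Split the range into equally sized bins and count the data in each (frequency)
counts, bin_edges = np.histogram(mass, bins=5)

# Measurement on the x-axis, frequency on the y-axis
plt.hist(mass, bins=bin_edges, edgecolor="black")
plt.xlabel("Nestling mass")
plt.ylabel("Frequency")
plt.show()
```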

Histogram of nestling mass

Point estimation

Point estimates

Features we might want to estimate:

Central Tendency:

  • mean
  • median
  • mode

Spread:

  • variance / standard deviation
  • range / interquartile range

Shape:

  • skewness
  • kurtosis

Central Tendency

Mean: add up all the measurements and divide by the sample size.

Median: rank the data smallest to largest and split the sample in half.

Mode: which measurement, or bin, is the most frequent?
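As a rough sketch of these three summaries (Python assumed; the masses are the illustrative snapshot values from earlier):

```python
import numpy as np

mass = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1,
                 29.9, 38.3, 38.3, 38.3, 36.5, 39.6])

mean = mass.sum() / len(mass)            # add up all measurements, divide by n
median = np.median(mass)                 # middle of the ranked data
values, counts = np.unique(mass, return_counts=True)
mode = values[counts.argmax()]           # most frequent measurement (38.3 here)
print(mean, median, mode)
```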

Example: nestling mass

Central Tendency

Which to Choose?

Mode: useful if you actually want to say which outcome is most common. Not actually a measure of central tendency. Some histograms have multiple modes.

Median: great for asymmetrical distributions, and the basis of many non-parametric tests.

Mean: is the basis for most mathematical statistics and works well for symmetrical distributions.

Spread

Range: biggest value minus smallest value---i.e. how wide is the distribution?

Interquartile range: rank the data smallest to largest. The lower quartile (LQ) is at 25%, the upper quartile (UQ) at 75%. The IQ range is UQ minus LQ.

Variance: the average squared distance from each data point to the mean.

Standard deviation: the square root of the variance.
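A minimal Python sketch of these four measures (numpy assumed; using the same illustrative masses as above):

```python
import numpy as np

mass = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1,
                 29.9, 38.3, 38.3, 38.3, 36.5, 39.6])

data_range = mass.max() - mass.min()       # biggest value minus smallest value
lq, uq = np.percentile(mass, [25, 75])     # lower and upper quartiles
iqr = uq - lq                              # interquartile range
var = mass.var(ddof=1)                     # sample variance (n - 1 denominator)
sd = np.sqrt(var)                          # standard deviation
print(data_range, iqr, var, sd)
```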

Example: nestling mass

  • range = (5, 63)
  • IQR = 21.1
  • sd = 13.8
  • var = 190.6


Spread

Which to choose?

Range: gets bigger as sample size increases. Not good.

IQ range: useful for describing asymmetrical distributions.

Standard deviation: the basis for lots of mathematical statistics. Measures spread on the same scale as the data. Is: "the square root of the average squared distance from the data to the mean".

  • Why square and then square root?

Spread

What about variance?

  • Often looks very little like the spread of the data.
  • Not measured on the same scale as the data, e.g. if the data are in cm, the variance is in cm\(^2\).
  • Exists because of its links to mathematical statistics (and because \(\text{sd} = \sqrt{\text{var}}\)).

Interval estimation

So we can generate point estimates for certain features of a population from a sample.

A key question to ask is then:

How confident are we in our estimate?

We can use the idea of sampling distributions to give us some measure of uncertainty associated with an estimated quantity.

These are known as confidence intervals.

Sampling distributions

Sampling distributions

Because measurements are random, and samples are of finite size, there will always be some variation in what's observed.

For example, let's draw a random sample of size 5 from a normal distribution with mean 0 and variance 10:

    1      2      3      4      5    mean
 0.06  -0.58  -4.34  -1.89   0.93   -1.16
 1.23  -3.82  -1.15  -5.14  -0.81   -1.94
 3.48   2.39  -0.75   3.12   2.34    2.12
 0.28  -3.02  -0.62   2.93   1.53    0.22
-1.89  -6.91  -2.13  -6.70  -4.00   -4.33
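A sketch of how such a table can be simulated (numpy assumed; the seed and values are illustrative, so the numbers will differ from those shown):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw repeated samples of size 5 from a normal with mean 0 and variance 10
# (numpy is parameterised by the standard deviation, hence sqrt(10))
samples = rng.normal(loc=0, scale=np.sqrt(10), size=(5, 5))
means = samples.mean(axis=1)    # one sample mean per row
print(np.round(samples, 2))
print(np.round(means, 2))
```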

Sampling distributions

Carrying this on ad infinitum...

This distribution is known as the sampling distribution of the sample mean.

Sampling distributions

Notice that most of the time the sample mean is around zero, but occasionally there are large and small values, just by random chance.

Hence it is possible to see these extreme values, but the likelihood is very small.

Sampling distributions and sample size

So the sample mean itself is a random variable, which in this case follows a normal distribution with mean \(\mu\) (the population mean).

Sampling distributions and sample size

Notice that as \(n\) increases, the variance of the sampling distribution decreases.

In fact, \(~\bar{x} \sim N\left( \mu, \frac{\sigma^2}{n}\right)~\) and so \(~\bar{x} \to \mu~\) as \(~n \to \infty~\).

Hence the sample mean is an unbiased and consistent estimator of \(\mu\).
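To illustrate this empirically (a sketch, not part of the slides): the standard deviation of many simulated sample means should be close to \(\sigma / \sqrt{n}\), shrinking as \(n\) grows.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, np.sqrt(10)    # population mean and sd (variance 10)

for n in (5, 50, 500):
    # 10,000 sample means for each sample size n
    means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
    # empirical sd of the sample means vs. the theoretical sigma / sqrt(n)
    print(n, means.std(), sigma / np.sqrt(n))
```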

t-distribution

In reality we often do not know \(\sigma^2\), so we estimate it using the sample variance: \(s^2\).

In this case \(~\frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \sim t_{n - 1}~\), where \(~t_{n - 1}~\) denotes a t-distribution on \(~n - 1~\) degrees-of-freedom.

The quantity \(~\frac{s}{\sqrt{n}}~\) is known as the standard error.
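For example (a sketch assuming scipy; the sample is illustrative), the standard error and the \(t_{n - 1}\) quantile can be computed as:

```python
import numpy as np
from scipy import stats

x = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1])    # illustrative sample
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)                      # standard error s / sqrt(n)

# The t quantile is wider than the normal's, reflecting the extra
# uncertainty from estimating sigma^2 by s^2
print(se, stats.t.ppf(0.975, df=n - 1), stats.norm.ppf(0.975))
```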

Aside: Central Limit Theorem

Sample variance

What about sample variance?

This is also a random variable, whose mean (central tendency) is the population variance (phew)!

It is also based on a sample and is subject to error.

In this case the scaled ratio of sample to population variance follows a chi-squared distribution: \(~\frac{(n - 1) s^2}{\sigma^2} \sim \chi^2_{n - 1}~\).
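A small simulation sketch of this result (numpy/scipy assumed, illustrative parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma2 = 10, 10.0

# Many sample variances from N(0, sigma2), scaled by (n - 1) / sigma^2
s2 = rng.normal(0, np.sqrt(sigma2), size=(20_000, n)).var(axis=1, ddof=1)
scaled = (n - 1) * s2 / sigma2

# The mean of the scaled ratios should be close to the chi-squared mean, n - 1
print(scaled.mean(), stats.chi2.mean(df=n - 1))
```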

Confidence intervals

Confidence intervals

A confidence interval can be derived from the quantiles of the sampling distribution e.g. a 95% CI is given by the 2.5% and 97.5% quantiles.

The \(~100(1 - \alpha)\%~\) CI for the sample mean (from normally distributed samples):

\[ \bar{x} \pm \frac{s}{\sqrt{n}} t_{1 - \alpha / 2, n - 1} \]

where \(~t_{1 - \alpha / 2, n - 1}~\) is the \(~(1 - \alpha / 2)^{\text{th}}~\) quantile of the t-distribution on \(~n - 1~\) degrees-of-freedom.
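A minimal sketch of this formula in Python (scipy assumed; the data are illustrative):

```python
import numpy as np
from scipy import stats

x = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1])    # illustrative sample
n, alpha = len(x), 0.05

xbar = x.mean()
se = x.std(ddof=1) / np.sqrt(n)                      # s / sqrt(n)
tq = stats.t.ppf(1 - alpha / 2, df=n - 1)            # (1 - alpha/2) quantile of t_{n-1}

print(xbar - tq * se, xbar + tq * se)                # the 95% CI for the mean
```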

Confidence intervals

\[ \bar{x} \pm \frac{s}{\sqrt{n}} t_{1 - \alpha / 2, n - 1} \]

The length decreases as:

  • \(s~\) decreases,
  • \(n~\) increases

So the estimate gets more precise as \(n \to \infty~\).

(This is the form of a CI for the sample mean, but other forms exist for other statistics, such as the sample variance. They depend on the relevant sampling distribution.)

Aside: interpretation

A 95% CI does NOT mean that there is a 95% probability that the true value lies between the upper and lower bounds.

It means:

there is a 95% probability that the (random) interval contains the true value

i.e. if we were to repeat the experiment ad infinitum, then 95% of the time the CI would contain the true value.
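A small simulation sketch of this repeated-sampling interpretation (numpy/scipy assumed): build a 95% CI from many independent samples and count how often it contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma, n, reps = 0.0, np.sqrt(10), 5, 10_000

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sigma, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    tq = stats.t.ppf(0.975, df=n - 1)
    covered += (x.mean() - tq * se <= mu <= x.mean() + tq * se)

print(covered / reps)    # should be close to 0.95
```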

Bayesian vs. frequentist - see later lecture.

Aside: interpretation

From our earlier example, we obtain the following 95% CIs:

    1      2      3      4      5    mean   2.5%  97.5%
 0.06  -0.58  -4.34  -1.89   0.93   -1.16  -3.09   0.76
 1.23  -3.82  -1.15  -5.14  -0.81   -1.94  -4.32   0.44
 3.48   2.39  -0.75   3.12   2.34    2.12   0.54   3.69
 0.28  -3.02  -0.62   2.93   1.53    0.22  -1.89   2.33
-1.89  -6.91  -2.13  -6.70  -4.00   -4.33  -6.59  -2.07

Example: cuckoos

Non-normal data

These theoretical sampling distributions rely on the assumption of normality. What if the data are non-normal?

Mass: continuous measurement, skewed to the right, constrained below by zero.

Begging rate: integer-valued (it's a count), skewed to the right, constrained below by zero.

Lots of ecological data is non-normal

Counts of random events can have a Poisson distribution

Number of successes out of \(~n~\) trials has a binomial distribution

Number of successes out of 1 trial has a Bernoulli distribution

Dealing with non-normality

  • Think carefully about what features you wish to estimate.
  • Choose an appropriate test (think about e.g. the central limit theorem).
  • Use non-parametric methods that do not rely on the mathematics of the assumed distribution (e.g. bootstrapping).
  • Transform the data to make it 'look' normal.
  • Exploit links between your 'known' distribution and the normal distribution.

Bootstrapping

What if we wish to calculate a confidence interval for a quantity when the data are non-normal?

What if we wish to estimate a quantity like the median, for which a theoretical sampling distribution can't easily be derived?

One possibility: bootstrapping...

Bootstrapping

Basic idea is to re-sample WITH replacement from the sample itself.

If this is done a large number of times, then an empirical sampling distribution can be generated.

The idea is that if the sample is representative of the population, then we can approximate the sampling distribution by performing inference treating the sample as our 'population'.
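A minimal bootstrap sketch (numpy assumed; the sample is illustrative): re-sample with replacement many times and take quantiles of the resulting statistic, here the median.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([9.6, 10.2, 13.1, 15.2, 16.2, 20.1,
              29.9, 38.3, 38.3, 38.3, 36.5, 39.6])   # illustrative sample

# Re-sample WITH replacement from the sample itself, many times
boot = rng.choice(x, size=(10_000, len(x)), replace=True)

# Empirical sampling distribution of the median, and a 95% bootstrap CI
boot_medians = np.median(boot, axis=1)
print(np.percentile(boot_medians, [2.5, 97.5]))
```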

Bootstrapping

e.g.

  2.5%  97.5%   Boot 2.5%   Boot 97.5%
 -3.09   0.76       -2.79         0.28
 -4.32   0.44       -3.87        -0.06
  0.54   3.69        0.64         3.18
 -1.89   2.33       -1.63         1.94
 -6.59  -2.07       -6.20        -2.41

Be careful when bootstrapping the mean if the distribution is heavy-tailed.

Example: cuckoos

Median: 13, 95% CI: (7, 25.5)

What is a potential issue with the interpretation?

Example: cuckoos

Summary

Statistical inference aims to disentangle signal from noise.

Make assumptions about the data, guided by:

  • exploratory data analysis

Choose what features to measure:

  • Point estimates
  • Interval estimates

Understand uncertainty!