Statistics Foundations

A notebook covering basic statistical terms, normal distributions, sampling, confidence intervals, and hypothesis testing.

A reference notebook for core statistical concepts, from the basics of central tendency through to hypothesis testing.


Basic Statistical Terms

Mean is also commonly referred to as the average of the dataset. To find the mean, add up all the data points, then divide that sum by the number of data points.

\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]

Median is the middle data point in a data set where the numbers are listed from smallest to largest. Organize the data points from smallest to largest. The middle data point in this arrangement is the median. If there is an even number of data points, the median is the average of the two middle values.

Mode is the data point that is most common in the dataset.

Range is the difference between the biggest and smallest numbers in the dataset.

\[\text{Range} = x_{\max} - x_{\min}\]

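Standard deviation is a measure of how far the data points typically fall from the mean. Most statistical tools can calculate it for you. When the data covers an entire population, divide by n:

\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}\]

When the data is a sample, divide by n − 1 instead. The examples in this section use this sample version:

\[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}\]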

Examples

Imagine you are given nine test scores: 40, 50, 60, 70, 70, 70, 80, 90, 100

| Statistic | Value | How |
|---|---|---|
| Mean | 70 | Sum = 630, divide by 9 |
| Median | 70 | 5th value of 9 (middle) |
| Mode | 70 | Appears 3 times |
| Range | 60 | 100 − 40 |
| Std Dev | 18.7 | Sample formula (divide by n − 1 = 8) |

Compare: the dataset 65, 66, 67 has a sample standard deviation of 1, while 20, 50, 90 has a sample standard deviation of about 35.1. Higher spread → larger standard deviation.
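The numbers above can be reproduced with Python's built-in `statistics` module. This is only a minimal sketch; the variable names are my own. Note that `stdev` divides by n − 1 (sample) while `pstdev` divides by n (population), which is why the two results differ.

```python
import statistics

scores = [40, 50, 60, 70, 70, 70, 80, 90, 100]

mean = statistics.mean(scores)            # sum of the values / number of values
median = statistics.median(scores)        # middle value of the sorted data
mode = statistics.mode(scores)            # most frequent value
value_range = max(scores) - min(scores)   # largest minus smallest

# stdev divides by n - 1 (sample); pstdev divides by n (population).
sample_sd = statistics.stdev(scores)
population_sd = statistics.pstdev(scores)

print(mean, median, mode, value_range)               # 70 70 70 60
print(round(sample_sd, 1), round(population_sd, 1))  # 18.7 17.6
```

The 18.7 in the table matches the sample version; the population version gives about 17.6 for the same scores.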


🔔 Normally Distributed Datasets

Datasets often take the shape of the famous bell curve, also called the normal curve:

  • Lots of data points near the center (top of the bell)
  • Small numbers of data points at the right and left ends

When datasets look like the bell curve, we say that the data is normally distributed.

A graph of a normal curve.

Empirical Rule (68-95-99.7 Rule)

The empirical rule states that in a normally distributed data set we would expect:

| Range | % of Data |
|---|---|
| Within 1σ of mean | 68% |
| Within 2σ of mean | 95% |
| Within 3σ of mean | 99.7% |

Outliers are commonly defined as data points more than three standard deviations from the mean. In a normal distribution, only about 0.3% of data falls outside this range, so if you see more, investigate your data.
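As a rough check of the empirical rule, the sketch below uses `statistics.NormalDist` from the standard library to compute the share of a normal distribution within 1, 2, and 3 standard deviations (about 68.3%, 95.4%, and 99.7%). The `flag_outliers` helper and the `readings` data are hypothetical, made up here only to illustrate the three-standard-deviation rule.

```python
from statistics import NormalDist, mean, stdev

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))

def flag_outliers(data, threshold=3.0):
    """Return the points more than `threshold` sample standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > threshold * s]

# Hypothetical readings: twenty values near 50 plus one suspicious value.
readings = [48, 49, 50, 51, 52] * 4 + [95]
print(flag_outliers(readings))
```

One caveat with this simple approach: an extreme value inflates the mean and standard deviation it is being compared against, so in very small datasets no point can ever sit three standard deviations out.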


🎲 Using Data Sets

Samples and Sample Size

Sample: A small group taken from a larger group. This small group will help provide clues about the entire population. For example, suppose you are interested in knowing the average shoe size for 15-year-old boys in a certain city. Instead of measuring the shoe size of every boy in the city, a researcher could measure the shoe size of a group of 25 boys. This group of 25 boys is the sample.

Random sample: The best type of sample for statistical research. A random sample is gathered in a fair and unbiased way: every member of the population has an equal opportunity to be chosen.

Sample size: The number of objects chosen for your random sample.

Central limit theorem: If you draw many random samples of size n and take the average of each one, the distribution of those averages approaches a normal (bell-shaped) distribution as n grows, regardless of the shape of the original distribution (as long as it has a finite variance).

\[\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \text{ as } n \to \infty\]
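A small simulation can make the theorem more concrete. The sketch below is an illustration under my own assumptions, not a proof: it draws from an exponential distribution with mean 1 and standard deviation 1 (strongly skewed, so clearly not bell-shaped), and the seed, sample size, and number of samples are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Average of n draws from an exponential distribution with mean 1."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

n = 50  # size of each sample
means = [sample_mean(n) for _ in range(2000)]

center = statistics.mean(means)   # expected to land near mu = 1
spread = statistics.stdev(means)  # expected to land near sigma / sqrt(n), about 0.141

# For a bell-shaped distribution, roughly 68% of values fall within one
# standard deviation of the center.
share_within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(round(center, 2), round(spread, 3), round(share_within_one_sd, 2))
```

Even though individual draws are skewed, the sample averages cluster symmetrically around 1 with a spread close to σ/√n, which is what the formula above predicts.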

Standard error: If you collect a sample with average 100, the standard error measures how far your sample average is likely to be from the actual population average:

\[SE = \frac{\sigma}{\sqrt{n}}\]

The larger your sample size, the smaller the standard error, and the more trustworthy your estimate.
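To see how the standard error shrinks with sample size, here is a minimal sketch. The population standard deviation of 15 is a made-up value for illustration only.

```python
import math

def standard_error(sigma, n):
    """SE = sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 15  # hypothetical population standard deviation
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
# 25 3.0
# 100 1.5
# 400 0.75
```

Because of the square root, quadrupling the sample size only halves the standard error.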


📊 Confidence Intervals

Suppose you ask five weather forecasters what the high temperature will be tomorrow. The guesses are 28, 29, 28, 31, 34. You can say you're fairly confident the high temperature will be between 28 and 34. Informally, that range is your confidence interval.

A statistical confidence interval is developed using:

  • A random sample
  • The average of the sample
  • The standard deviation
  • The sample size
  • The z-score for the chosen confidence level (for example, about 1.96 for 95%)

The confidence interval has both a lower limit and an upper limit:

\[CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}\]
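The formula can be sketched in a few lines of Python. The function name and the inputs (sample average 100, standard deviation 15, sample size 25) are illustrative assumptions, and the sketch assumes the population standard deviation is known; when it has to be estimated from a small sample, a t-based interval is the usual alternative.

```python
import math
from statistics import NormalDist

def confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    # z-score that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = confidence_interval(sample_mean=100, sigma=15, n=25)
print(round(low, 2), round(high, 2))  # 94.12 105.88
```

Raising the confidence level widens the interval, while a larger sample size narrows it.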

🧪 Hypothesis Tests

A hypothesis test is a statistical or scientific investigation. A researcher will have a guess, then perform a hypothesis test to see if there is sufficient evidence to support that guess.

Example: Suppose you have a bag of 50 marbles, each either red or blue. The general belief is that half are red and half are blue. At random, you pull four marbles and all four are blue. You guess that more than half the marbles must be blue. A hypothesis test uses the result of your random sample as evidence: it asks how likely a result like yours would be if 50% of the marbles really were red.

In a hypothesis test:

  • Null hypothesis (H₀): the status quo (e.g., 50% blue)
  • Alternative hypothesis (H₁): your claim (e.g., more than 50% blue)
  • The p-value tells you how surprising your sample result would be if H₀ were true
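For the marble example, the p-value can be worked out exactly. The sketch below assumes that the null hypothesis means exactly 25 of the 50 marbles are blue, that the four marbles are drawn without replacement, and that the test is one-sided (so "at least as extreme" means all four are blue). The variable names are my own.

```python
from math import comb

total_marbles = 50
blue_under_h0 = 25   # null hypothesis: half of the marbles are blue
draws = 4

# Ways to pick 4 blue marbles divided by ways to pick any 4 marbles.
p_value = comb(blue_under_h0, draws) / comb(total_marbles, draws)
print(round(p_value, 4))  # 0.0549
```

So if the bag really were half blue, four blue marbles in a row would turn up about 5.5% of the time. Under the commonly used 0.05 cutoff (a convention, not something set by this example), that is suggestive but just short of enough evidence to reject the null hypothesis.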

Part of my data science foundations series. Next: regression analysis and model evaluation metrics.

This post is licensed under CC BY 4.0 by the author.