Statistics Foundations
A notebook covering basic statistical terms, normal distributions, sampling, confidence intervals, and hypothesis testing.
A reference notebook for core statistical concepts, from the basics of central tendency through to hypothesis testing.
Basic Statistical Terms
Mean is also commonly referred to as the average of the dataset. To find the mean, add up all the data points, then divide that sum by the number of data points.
\[\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\]
Median is the middle data point in a data set where the numbers are listed from smallest to largest. Organize the data points from smallest to largest; the middle data point in this arrangement is the median. If the number of data points is even, the median is the average of the two middle values.
Mode is the data point that is most common in the dataset.
Range is the difference between the biggest and smallest numbers in the dataset.
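\[\text{Range} = x_{\max} - x_{\min}\]
Standard deviation is a measure of how far the data points typically fall from the mean. Most statistical tools can calculate it for you. For a full population of n values:
\[\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}\]
When the data is a sample drawn from a larger population, the sum is divided by n − 1 instead of n, giving the sample standard deviation s. The standard deviations quoted in the examples below are sample standard deviations.
\[s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}\]
Examples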
Imagine you are given nine test scores: 40, 50, 60, 70, 70, 70, 80, 90, 100
| Statistic | Value | How |
|---|---|---|
| Mean | 70 | Sum = 630, divide by 9 |
| Median | 70 | 5th value of 9 (middle) |
| Mode | 70 | Appears 3 times |
| Range | 60 | 100 − 40 |
| Std Dev | 18.7 | Sample formula (divide by n − 1); dividing by n instead gives 17.6 |
Compare: the dataset 65, 66, 67 has a sample standard deviation of 1, while 20, 50, 90 has a sample standard deviation of about 35.1. Higher spread means a larger standard deviation.
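The figures in the table above can be reproduced with Python's standard `statistics` module. This is a minimal sketch: the variable names are illustrative, and it prints both the sample standard deviation (divide by n − 1) and the population standard deviation (divide by n) so the two conventions can be compared.

```python
import statistics

# The nine test scores from the example above.
scores = [40, 50, 60, 70, 70, 70, 80, 90, 100]

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
value_range = max(scores) - min(scores)
sample_sd = statistics.stdev(scores)       # divides by n - 1
population_sd = statistics.pstdev(scores)  # divides by n

print(f"mean={mean}, median={median}, mode={mode}, range={value_range}")
print(f"sample sd={sample_sd:.1f}, population sd={population_sd:.1f}")
```

The sample version matches the 18.7 in the table; the population version comes out slightly smaller because it divides by 9 rather than 8.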
Normally Distributed Datasets
Many datasets take the shape of the familiar bell curve, also called the normal curve:
- Lots of data points near the center (top of the bell)
- Small numbers of data points at the right and left ends
When datasets look like the bell curve, we say that the data is normally distributed.
Empirical Rule (68-95-99.7 Rule)
The empirical rule states that in a normally distributed data set we would expect:
| Range | % of Data |
|---|---|
| Within 1σ of mean | 68% |
| Within 2σ of mean | 95% |
| Within 3σ of mean | 99.7% |
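The percentages in the table can be checked against the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The exact values are slightly different from the rounded 68, 95, and 99.7 figures, which is why the rule is described as a rule of thumb.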
A common rule of thumb treats data points more than three standard deviations from the mean as outliers. In a normal distribution, only about 0.3% of data falls outside this range, so if you see more, investigate your data.
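One way to apply the three-standard-deviation rule is sketched below. The function name, the threshold parameter, and the `readings` data are all invented for this example; the sample standard deviation is used, which is an assumption rather than something the rule itself prescribes.

```python
import statistics

def find_outliers(data, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (assumption)
    return [x for x in data if abs(x - mean) / sd > threshold]

# Hypothetical data: 21 readings clustered around 50, plus one extreme value.
readings = [47, 48, 49, 50, 51, 52, 53] * 3 + [95]
outliers = find_outliers(readings)
print(outliers)
```

Note that the rule only works with enough data: in a very small dataset a single extreme value inflates the standard deviation so much that it can never sit three standard deviations from the mean.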
Using Data Sets
Samples and Sample Size
Sample: A small group taken from a larger group. This small group will help provide clues about the entire population. For example, suppose you are interested in knowing the average shoe size for 15-year-old boys in a certain city. Instead of measuring the shoe size of every boy in the city, a researcher could measure the shoe size of a group of 25 boys. This group of 25 boys is the sample.
Random sample: The best type of sample for statistical research. A random sample is gathered in a fair and unbiased way, so that every member of the population has an equal opportunity to be chosen.
Sample size: The number of objects chosen for your random sample.
Central limit theorem: If you repeatedly draw samples of size n and record the average of each one, the distribution of those averages approaches a normal (bell-shaped) distribution as n grows, regardless of the shape of the original distribution (provided its variance is finite).
\[\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right) \text{ approximately, for large } n\]
Standard error: If you collect a sample with average 100, the standard error estimates how far your sample average is likely to be from the actual population average:
\[SE = \frac{\sigma}{\sqrt{n}}\]
The larger your sample size, the smaller the standard error, and the more trustworthy your estimate.
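A small simulation can make the central limit theorem and the standard error formula more concrete. The sketch below is illustrative only: it assumes a uniform distribution on [0, 1) as the "original" (clearly non-normal) distribution, and the function names, sample size, and seed are arbitrary choices.

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def simulate_sample_means(sample_size, num_samples, seed=0):
    """Draw many samples from a uniform [0, 1) distribution and return each sample's mean."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

sample_size = 30
means = simulate_sample_means(sample_size, num_samples=2000)

# A uniform [0, 1) distribution has mean 0.5 and standard deviation sqrt(1/12).
uniform_sigma = math.sqrt(1 / 12)
predicted_se = standard_error(uniform_sigma, sample_size)
observed_spread = statistics.stdev(means)

print(f"average of sample means: {statistics.fmean(means):.3f}")
print(f"predicted standard error: {predicted_se:.4f}")
print(f"observed spread of sample means: {observed_spread:.4f}")
```

The sample means cluster around 0.5, and their spread should land close to the value the standard error formula predicts, even though individual draws are uniformly (not normally) distributed.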
Confidence Intervals
Suppose you ask five weather forecasters what the high temperature will be tomorrow. The guesses are 28, 29, 28, 31, 34. You can say you're fairly confident the high temperature will be between 28 and 34. That range is an informal version of a confidence interval.
A statistical confidence interval is developed using:
- A random sample
- The average of the sample
- The sample size
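- The standard deviation (known for the population, or estimated from the sample)
- The z-score for the chosen confidence level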
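As a rough sketch of how these ingredients combine, the snippet below computes a z-based interval as the sample average plus or minus z times σ/√n. The inputs (sample average 100, σ = 15, n = 25) are made-up values for illustration, and the function name is hypothetical; the z-score is looked up with `statistics.NormalDist.inv_cdf`.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    # Two-sided interval: put half of the leftover probability in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical inputs: sample average 100, standard deviation 15, sample size 25.
lower, upper = z_confidence_interval(sample_mean=100, sigma=15, n=25)
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

For small samples where the standard deviation is estimated from the data, a t-based interval is usually preferred; the z-based version is kept here because it is the approach this notebook describes.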
The confidence interval has both a lower limit and an upper limit:
\[CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}\]
Hypothesis Tests
A hypothesis test is a statistical procedure for weighing evidence. A researcher starts with a guess, then performs a hypothesis test to see if there is sufficient evidence to support that guess.
Example: Suppose you have a bag of 50 marbles, each either red or blue. The general belief is that half are red and half are blue. At random, you pull four marbles and all four are blue. You guess that more than half the marbles must be blue. A hypothesis test uses the result of your random sample as evidence: it asks how likely it would be to draw four blue marbles in a row if exactly half of the marbles really were blue.
In a hypothesis test:
- Null hypothesis (H₀): the status quo (e.g., 50% blue)
- Alternative hypothesis (H₁): your claim (e.g., more than 50% blue)
- The p-value tells you how surprising your sample result would be if H₀ were true
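For the marble example, the p-value can be worked out directly. The sketch below assumes the four marbles are drawn without replacement from a bag that, under the null hypothesis, holds exactly 25 blue and 25 red marbles. Because four blue out of four is the most extreme result possible, the one-sided p-value is simply the probability of that outcome. The function name and the 0.05 significance level are assumptions made for illustration.

```python
from math import comb

def prob_all_blue(total, blue, drawn):
    """Probability that every marble drawn (without replacement) is blue."""
    return comb(blue, drawn) / comb(total, drawn)

# Under the null hypothesis, 25 of the 50 marbles are blue.
p_value = prob_all_blue(total=50, blue=25, drawn=4)

alpha = 0.05  # conventional significance level (assumption)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}")
print("reject the null hypothesis" if reject_null else "not enough evidence to reject the null hypothesis")
```

Four blue marbles in a row is somewhat surprising under the null hypothesis, but with only four draws the evidence falls just short of the 0.05 threshold.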
Part of my data science foundations series. Next: regression analysis and model evaluation metrics.