Describing and Comparing Distributions Revision Notes for HSC SSCE Mathematics Standard

Describing and Comparing Distributions

Understanding distribution shapes

When working with data, we need to describe the overall shape and pattern of how values are distributed. The general shape of a distribution can be described using three main characteristics: smoothness, symmetry, and the number of modes.

A smooth distribution is one where data changes gradually without sudden breaks or irregular jumps. The pattern flows continuously from one value to the next.

Symmetry refers to whether the distribution is balanced evenly on both sides of a central vertical line. If you could fold the distribution along this centre line, both halves would match up like a mirror image.

Note

The mode is the value that appears most frequently in the dataset (the peak with the highest frequency). Distributions can have different numbers of peaks or modes, which tells us about the pattern of the data.

Types of distribution shapes

Distributions can be classified based on how many peaks or modes they contain:

Unimodal distributions have a single peak or mode. Most of the data clusters around one central value, creating one high point in the distribution.

Bimodal distributions have two distinct peaks or modes. This often happens when you have two different groups mixed together in your dataset, each with its own typical value.

Multimodal distributions have three or more peaks or modes. This suggests the data comes from several different groups or sources, each contributing its own cluster of values.

Symmetry and skewness

Data is symmetric when it forms a mirror image of itself around a central vertical line. If you were to fold the distribution in half at the middle, both sides would match up perfectly. In symmetric distributions, the data is evenly balanced on both sides.

However, many real-world datasets are not symmetric. When data has more values concentrated on one side, we say it is skewed. The direction of the skew is determined by where the long tail of the distribution points.

Types of skewness

There are three main categories of skewness:

No skew (symmetric): The data is evenly balanced on both sides of a vertical centre line. The left and right halves are mirror images of each other.

Positively skewed: There is more data concentrated on the left side of the distribution, with a long tail extending toward the right (positive) side. The tail points in the positive direction along the number line. This often occurs with data like income or house prices, where most values are lower but a few are extremely high.

Negatively skewed: There is more data concentrated on the right side of the distribution, with a long tail extending toward the left (negative) side. The tail points in the negative direction along the number line. This might occur with test scores where most students score highly but a few score very low.

Important

Remember that the skew is named after the direction of the tail, not where most of the data is located. Positive skew = tail to the right; negative skew = tail to the left.
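As a rough illustration of tail direction, the sketch below draws a sideways text plot of a small dataset. The data and the helper name `text_histogram` are made up for this example and are not part of the notes. Each row is one value and each star is one occurrence, so a tail toward larger values shows up as short rows at the bottom.

```python
from collections import Counter

def text_histogram(values):
    """Return one row per whole-number value, with one * per occurrence."""
    counts = Counter(values)
    rows = []
    for value in range(min(values), max(values) + 1):
        # Counter returns 0 for values that never occur, giving an empty row.
        rows.append(f"{value:>2} | " + "*" * counts[value])
    return "\n".join(rows)

# Hypothetical data: most values are low, a few are much higher.
positively_skewed = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 6, 9]
print(text_histogram(positively_skewed))
```

Most stars sit at the low values while a thin tail stretches toward 9, which is the pattern described above as positive skew.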

Measures of location

Measures of location (also called measures of central tendency) tell us about the typical or central value in a dataset. The three main measures are the mean, median, and mode. Each has its own strengths and weaknesses, making them suitable for different situations.

Mean

The mean is calculated by adding up all the values and dividing by the number of values:

$\bar{x} = \frac{\text{sum of all scores}}{\text{number of scores}}$

Advantages:

Simple to understand and calculate
Takes into account every single value in the dataset
Most consistent across different samples from the same population

Disadvantages:

Strongly influenced by extreme values (outliers)
Cannot be used with categorical data (data in categories rather than numbers)
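To see how strongly one extreme value can move the mean, here is a minimal sketch using Python's standard `statistics` module. The scores are hypothetical and chosen only for illustration.

```python
from statistics import mean

scores = [4, 5, 5, 6, 7]          # hypothetical scores
with_outlier = scores + [45]      # the same scores plus one extreme value

mean_plain = mean(scores)           # 27 / 5
mean_outlier = mean(with_outlier)   # 72 / 6

print(mean_plain, mean_outlier)
```

In this made-up case a single outlier more than doubles the mean, even though five of the six values are unchanged.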

Median

The median is the middle value when all data values are arranged in order from smallest to largest.

Advantages:

Simple to understand
Not affected by outliers or extreme values

Disadvantages:

May not represent the centre well in some distributions
Data must be sorted before calculating
Varies more than the mean when taking different samples from the same population

Note

The median is particularly useful when your data contains outliers or is skewed, as it provides a more reliable measure of the centre than the mean in these cases.
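The procedure for finding the median can be sketched as follows. The function below is an illustrative helper, not a prescribed method, and it uses the usual convention of averaging the two middle values when there is an even number of scores. The data is hypothetical.

```python
def median(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average of middle pair

scores = [4, 5, 5, 6, 7]                  # hypothetical scores
median_plain = median(scores)
median_outlier = median(scores + [45])    # same scores plus one extreme value

print(median_plain, median_outlier)
```

Adding the extreme value 45 moves the median only from 5 to 5.5, which is why the median is preferred when outliers are present.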

Mode

The mode is the value that occurs most frequently in the dataset.

Advantages:

Very simple to identify
Not affected by outliers
Can be used with categorical data (like favourite colours or types of pets)

Disadvantages:

There may be no mode at all, or multiple modes
May not represent the centre of the data well
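The points above can be explored with `statistics.multimode` from Python's standard library (available from Python 3.8), which returns every value that ties for the highest frequency. The example data is hypothetical.

```python
from statistics import multimode

pets = ["dog", "cat", "dog", "fish", "cat", "dog"]   # categorical data
one_mode = multimode(pets)

scores = [4, 4, 5, 5, 8]          # two values tie for most frequent
two_modes = multimode(scores)

no_repeats = [1, 2, 3]            # every value occurs exactly once
all_tied = multimode(no_repeats)

print(one_mode, two_modes, all_tied)
```

Note that when no value repeats, `multimode` returns every value because they all tie. In the language of these notes, that situation is usually described as having no mode.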

Measures of spread

Measures of spread tell us how much the data values vary or are dispersed around the centre. The three main measures are the range, interquartile range, and standard deviation.

Range

The range is the difference between the highest and lowest values:

$\text{Range} = \text{highest score} - \text{lowest score}$

Advantages:

Very simple to understand
Quick and easy to calculate

Disadvantages:

Only depends on the two extreme values (smallest and largest)
Can be heavily distorted by a single outlier

Interquartile range (IQR)

The interquartile range is the difference between the upper quartile ( $Q_3$ ) and the lower quartile ( $Q_1$ ):

$\text{IQR} = Q_3 - Q_1$

This represents the spread of the middle 50% of the data.

Advantages:

Straightforward to calculate for small datasets
Easy to understand (focuses on middle 50% of data)
Not affected by outliers

Disadvantages:

More difficult to calculate for large datasets
Only depends on the two quartile values
Data must be sorted before calculating

Note

The IQR is a robust measure of spread because it focuses on the central portion of the data, effectively ignoring the influence of extreme values on either end of the distribution.
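A minimal sketch of an IQR calculation is shown below. It assumes the "median of each half" convention, leaving out the middle value when the number of scores is odd. Other quartile conventions exist and can give slightly different answers. The helper name `quartiles` and the data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2      # the middle value is excluded when the count is odd
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return q1, q3

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]          # hypothetical scores
q1, q3 = quartiles(data)
iqr = q3 - q1

# Replace the largest score with an extreme value and recompute.
extreme = [3, 7, 8, 5, 12, 14, 210, 13, 18]
q1_extreme, q3_extreme = quartiles(extreme)
iqr_extreme = q3_extreme - q1_extreme

print(iqr, iqr_extreme)
```

Both datasets give the same IQR, because changing the largest score does not move either quartile.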

Standard deviation

The standard deviation measures how far values typically deviate from the mean. It takes into account every value in the dataset.

Advantages:

Uses every single data value
Less affected by a single outlier than the range, although outliers still influence it

Disadvantages:

Complex to calculate without a calculator
More difficult concept to understand
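These notes do not state a formula, so the sketch below assumes the population form of the standard deviation, $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$, which is what a calculator's $\sigma_x$ key typically reports. The function name `population_sd` and the data are hypothetical.

```python
from math import sqrt
from statistics import pstdev

def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sqrt(sum(squared_deviations) / n)

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical scores with mean 5
sd_by_hand = population_sd(scores)
sd_library = pstdev(scores)         # standard-library population standard deviation

print(sd_by_hand, sd_library)
```

Here the squared deviations sum to 32, so the result is $\sqrt{32 \div 8} = 2$, and the standard-library function agrees with the hand calculation.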

Comparison table

Measure	Advantages	Disadvantages
Mean	Easy to understand and calculate; uses every score; most consistent across samples	Distorted by outliers; not suitable for categorical data
Median	Easy to understand; not affected by outliers	May not be central; data needs sorting; varies more than the mean across samples
Mode	Easy to determine; not affected by outliers; suitable for categorical data	May be no mode or multiple modes; may not be central
Range	Easy to understand; easy to calculate	Depends only on extreme values; distorted by outliers
IQR	Easy for small datasets; easy to understand; not affected by outliers	Difficult for large datasets; depends on quartiles only; data needs sorting
Standard deviation	Uses every score; less affected by outliers than the range	Difficult to calculate without a calculator; difficult to understand

Important

When choosing which measures to use, consider whether your data has outliers. If outliers are present, use median and IQR rather than mean and range. If the data is symmetric with no outliers, the mean and standard deviation are usually preferred.

Worked example: Comparing statistics for two sets of data

Example

Worked Example: Analysing Fitness Class Attendance

Let's look at a practical example comparing the number of participants in fitness classes for two instructors, Bec and Rita, across one week:

Day	M	T	W	T	F	S	S
Bec	8	5	4	8	8	4	5
Rita	10	9	12	14	8	10	1

Part a: Find the mean and median for each dataset

For Bec:

Calculate the mean:

$\bar{x} = \frac{8 + 5 + 4 + 8 + 8 + 4 + 5}{7}$

$\bar{x} = \frac{42}{7}$

$\bar{x} = 6$

To find the median, arrange the values in order:

$4, 4, 5, 5, 8, 8, 8$

Since there are 7 values (odd number), the median is the middle value (4th value):

Median $= 5$

For Rita:

Calculate the mean:

$\bar{x} = \frac{10 + 9 + 12 + 14 + 8 + 10 + 1}{7}$

$\bar{x} = \frac{64}{7}$

$\bar{x} \approx 9.1$

Arrange the values in order:

$1, 8, 9, 10, 10, 12, 14$

The median is the middle value (4th value):

Median $= 10$

Part b: Find the range and interquartile range for each dataset

For Bec:

Range $= 8 - 4 = 4$

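For the IQR, we need the lower quartile ( $Q_1$ ) and the upper quartile ( $Q_3$ ). In the ordered data $4, 4, 5, 5, 8, 8, 8$, the lower half is $4, 4, 5$, so $Q_1 = 4$, and the upper half is $8, 8, 8$, so $Q_3 = 8$:

IQR $= 8 - 4 = 4$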

For Rita:

Range $= 14 - 1 = 13$

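In the ordered data $1, 8, 9, 10, 10, 12, 14$, the lower half is $1, 8, 9$, so $Q_1 = 8$, and the upper half is $10, 12, 14$, so $Q_3 = 12$:

IQR $= 12 - 8 = 4$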
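The answers to parts a and b can be checked with a short script. This is only a verification sketch: the `iqr` helper is a hypothetical function that takes the median of each half of the ordered data, leaving out the middle value, which matches the quartiles used here.

```python
from statistics import mean, median

def iqr(values):
    """Q3 - Q1, using the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2      # middle value excluded when the count is odd
    return median(ordered[-half:]) - median(ordered[:half])

bec = [8, 5, 4, 8, 8, 4, 5]
rita = [10, 9, 12, 14, 8, 10, 1]

summary = {}
for name, data in (("Bec", bec), ("Rita", rita)):
    summary[name] = {
        "mean": round(mean(data), 1),     # rounded to one decimal place
        "median": median(data),
        "range": max(data) - min(data),
        "iqr": iqr(data),
    }
    print(name, summary[name])
```

The printed values agree with the hand calculations: Bec has mean 6, median 5, range 4 and IQR 4, while Rita has mean 9.1, median 10, range 13 and IQR 4.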

Part c: Examine the summary statistics and outline any concerns

Looking at Rita's data, there is a clear outlier: the value $1$ on Sunday. This extremely low value is separated from the rest of the data.

The presence of this outlier has significantly affected some of Rita's statistics:

The mean has been pulled down (from 10.5 without the outlier to about 9.1)
The range has been greatly increased (from 6 without the outlier to 13)
The median is unchanged (10 with or without the outlier) and the IQR shifts only slightly (from 3 to 4), so they remain more reliable

This example shows why it's important to identify outliers and choose appropriate measures. For Rita's data, the median and IQR give a better representation of the typical class size than the mean and range.
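The effect of the outlier in part c can be checked by recomputing Rita's statistics with Sunday's value removed. This is an illustrative check only, and the helper name `describe` is hypothetical.

```python
from statistics import mean, median

rita = [10, 9, 12, 14, 8, 10, 1]
rita_no_outlier = [x for x in rita if x != 1]   # drop Sunday's value of 1

def describe(data):
    """Return (mean to one decimal place, median, range)."""
    return round(mean(data), 1), median(data), max(data) - min(data)

with_outlier = describe(rita)
without_outlier = describe(rita_no_outlier)

print(with_outlier)
print(without_outlier)
```

Removing the single low value moves the mean from about 9.1 to 10.5 and the range from 13 to 6, while the median stays at 10.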

Summary

Key Points to Remember:

Distribution shapes can be described by smoothness (gradual changes), symmetry (balanced vs skewed), and number of modes (unimodal, bimodal, multimodal)
Skewness is named after the tail direction: positive skew has a right tail, negative skew has a left tail
Mean uses all values but is affected by outliers; median resists outliers but varies more between samples
Range is simple but heavily influenced by extreme values; IQR focuses on the middle 50% and resists outliers
When data contains outliers, prefer median and IQR over mean and range for more reliable summaries
