Organise and Graph Datasets Revision Notes for HSC SSCE Mathematics Advanced

Organise and Graph Datasets

In statistics, we need effective ways to organise and display data so we can identify patterns and extract useful information. This note covers how to create frequency tables and construct different types of graphs including histograms and cumulative frequency polygons.

Note

Understanding how to organise and visualise data is fundamental to statistical analysis. The techniques in this note will help you see patterns that aren't obvious in raw data and make meaningful comparisons between datasets.

Organise datasets in tables

When working with data from a discrete random variable (a variable that can only take specific, countable values), we can organise the information in a structured table. This table displays values alongside various frequency measures.

What is frequency?

Frequency tells us how many times a particular value appears in a dataset. For individual values, it's simply a count. When data is grouped into class intervals, the frequency represents how many observations fall within that interval.

For example, if you roll a die $20$ times and get the number $6$ on $3$ occasions, the frequency of $6$ is $3$.

What is relative frequency?

Relative frequency expresses frequency as a proportion of the total dataset. It's calculated using the formula:

$\text{Relative frequency} = \frac{f}{n}$

where $f$ is the frequency of a particular value or group, and $n$ is the total number of observations in the dataset.
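As a quick illustration, take the die example above, where $6$ appears $3$ times in $20$ rolls:

$\text{Relative frequency of } 6 = \frac{3}{20} = 0.15$

So $6$ came up on $15\%$ of the rolls.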

Note

Relative frequency is particularly useful when comparing datasets of different sizes, as it standardises the frequencies to proportions. For example, comparing survey results from 100 people versus 1000 people becomes much easier when using relative frequencies.

What is cumulative frequency?

Cumulative frequency is the running total of frequencies as you move through an ordered dataset. For each value or class, you add up all the frequencies up to and including that point.

The cumulative frequency at the end of the dataset will always equal the total number of observations ($n$).

Similarly, cumulative relative frequency is the running total of relative frequencies, which will always reach $1$ (or $100\%$) at the end.
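The running-total idea can be sketched in a few lines of Python. This is only an illustration: the frequencies below are hypothetical counts for $20$ rolls of a die, and only the count of $3$ for the face $6$ comes from the earlier example.

```python
from itertools import accumulate

# Hypothetical frequencies for faces 1-6 over 20 rolls of a die.
# Only the count for face 6 (3) comes from the notes; the rest are made up.
faces = [1, 2, 3, 4, 5, 6]
frequencies = [4, 3, 2, 5, 3, 3]

n = sum(frequencies)                               # total number of observations
cumulative = list(accumulate(frequencies))         # running totals F
cumulative_relative = [F / n for F in cumulative]  # running totals as proportions

for face, f, F, crf in zip(faces, frequencies, cumulative, cumulative_relative):
    print(face, f, F, crf)
```

The last cumulative frequency equals $n = 20$ and the last cumulative relative frequency equals $1$, as described above.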

Example

Worked Example 1: Pet Ownership Data

A survey records the number of pets owned by $20$ households. The data collected is:

$0, 1, 2, 1, 3, 0, 2, 1, 4, 1, 0, 2, 3, 1, 2, 0, 1, 2, 1, 0$

Let's organise this data in a complete frequency table.

Strategy:

Count how many times each value appears (frequency)
Calculate the proportion for each value (relative frequency = frequency ÷ 20)
Build running totals for cumulative frequencies

Working:

First, count the occurrences:

$0$ appears $5$ times
$1$ appears $7$ times
$2$ appears $5$ times
$3$ appears $2$ times
$4$ appears $1$ time

Total: $5 + 7 + 5 + 2 + 1 = 20$ ✓

Now construct the complete table:

$x$	$f$	$\frac{f}{20}$	$F$	$\frac{F}{20}$
$0$	$5$	$0.25$	$5$	$0.25$
$1$	$7$	$0.35$	$12$	$0.6$
$2$	$5$	$0.25$	$17$	$0.85$
$3$	$2$	$0.1$	$19$	$0.95$
$4$	$1$	$0.05$	$20$	$1$

where:

$x$ = the value (number of pets)
$f$ = frequency
$\frac{f}{20}$ = relative frequency
$F$ = cumulative frequency
$\frac{F}{20}$ = cumulative relative frequency

Check: The total frequency is $20$ and the cumulative relative frequency reaches $1$, confirming our calculations are correct.
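For readers who want to confirm the table by computer, here is a minimal Python sketch. It assumes the survey data is stored in a plain list; the variable names are illustrative only.

```python
from collections import Counter

pets = [0, 1, 2, 1, 3, 0, 2, 1, 4, 1, 0, 2, 3, 1, 2, 0, 1, 2, 1, 0]
n = len(pets)
counts = Counter(pets)          # frequency of each value

table = []                      # rows of (x, f, f/n, F, F/n)
running_total = 0
for x in sorted(counts):
    f = counts[x]
    running_total += f          # cumulative frequency so far
    table.append((x, f, f / n, running_total, running_total / n))

for row in table:
    print(*row, sep="\t")

# The same checks as in the worked example
assert sum(f for _, f, _, _, _ in table) == n   # frequencies sum to n
assert table[-1][3] == n                        # final cumulative frequency is n
assert table[-1][4] == 1                        # final cumulative relative frequency is 1
```

Each printed row corresponds to one row of the table above.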

Important

Always verify your frequency table calculations:

The sum of all frequencies should equal $n$ (the total number of observations)
The final cumulative frequency should equal $n$
The final cumulative relative frequency should equal $1$ (or $100\%$)

Visualise datasets with histograms

A histogram is a visual representation of a frequency distribution using bars. The bars are placed adjacent to each other with no gaps (for continuous or consecutive discrete data), showing how the data is distributed.

Types of histograms

There are three main types of histograms:

1. Frequency histogram

Bar heights represent the frequency of each value or class interval
The y-axis shows frequency (count)
Used to see which values occur most often

2. Relative frequency histogram

Bar heights represent relative frequency (proportion)
The y-axis shows relative frequency as decimals or percentages
Useful for comparing datasets of different sizes

3. Cumulative frequency histogram

Bar heights represent cumulative frequency (running total)
The y-axis shows cumulative frequency
Useful for finding the median and understanding the distribution pattern

Note

For grouped data, the width of each bar represents the class interval. The bars should be adjacent with no gaps between them, emphasising the continuous nature of the distribution.

Finding the mode and median

The mode is the value or class interval with the highest frequency. On a frequency histogram, this is the tallest bar.

The median is the middle value when data is arranged in order. It divides the dataset into two equal parts. For a dataset with $n$ observations, the median is located at position $\frac{n}{2}$ on the cumulative frequency axis.

Important

Finding the median from a cumulative frequency histogram:

Find $\frac{n}{2}$ on the y-axis
Draw a horizontal line across to the histogram
Read down to find the corresponding value on the x-axis

This histogram shows the distribution of running times for a $10$ km race with $72$ runners. The mode is in the $60$ - $65$ minute interval, as this is the tallest bar. To find the median, we would need to use a cumulative frequency histogram.

Example

Worked Example 2: Creating a Frequency Histogram

A dataset records times (in minutes) for $20$ runners to complete a race:

$35, 38, 40, 42, 45, 46, 48, 50, 52, 55, 56, 58, 60, 62, 65, 68, 70, 72, 75, 80$

Create a frequency histogram with class intervals of width $10$ minutes starting at $30$. Identify the mode.

Strategy:

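Group the data into class intervals: $30$ - $40$, $40$ - $50$, $50$ - $60$, $60$ - $70$, $70$ - $80$, $80$ - $90$, where each interval includes its lower boundary but not its upper boundary (so $40$ is counted in $40$ - $50$)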
Count the frequency for each interval
Draw bars with heights matching the frequencies
Find the tallest bar to identify the mode

Working:

Class intervals and frequencies:

$30$ - $40$: $2$ values ($35, 38$)
$40$ - $50$: $5$ values ($40, 42, 45, 46, 48$)
$50$ - $60$: $5$ values ($50, 52, 55, 56, 58$)
$60$ - $70$: $4$ values ($60, 62, 65, 68$)
$70$ - $80$: $3$ values ($70, 72, 75$)
$80$ - $90$: $1$ value ($80$)

Mode: The highest frequency is $5$, which occurs in both the $40$ - $50$ and $50$ - $60$ intervals. Therefore, the distribution is bimodal: it has two modal classes.
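The grouping and the search for the tallest bar can also be sketched in Python. This sketch assumes each class includes its lower boundary but not its upper boundary, which matches the counts in the working; the variable names are illustrative.

```python
times = [35, 38, 40, 42, 45, 46, 48, 50, 52, 55,
         56, 58, 60, 62, 65, 68, 70, 72, 75, 80]

start, width, num_classes = 30, 10, 6

# Assumed convention: a class includes its lower boundary but not its
# upper boundary, so 40 is counted in 40-50 rather than 30-40.
frequencies = [0] * num_classes
for t in times:
    frequencies[(t - start) // width] += 1

classes = [(start + i * width, start + (i + 1) * width) for i in range(num_classes)]
highest = max(frequencies)
modal_classes = [c for c, f in zip(classes, frequencies) if f == highest]

print(frequencies)     # [2, 5, 5, 4, 3, 1]
print(modal_classes)   # [(40, 50), (50, 60)]
```

Collecting every class that reaches the highest frequency, rather than stopping at the first, is what reveals that two classes share the top count.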

Visualise datasets with cumulative frequency polygons

A cumulative frequency polygon, also called an ogive, is a line graph that displays cumulative frequencies. It's particularly useful for finding the median and analysing the accumulation of data.

How to construct a cumulative frequency polygon

To create a cumulative frequency polygon:

Plot points at the upper boundary of each class interval (or at each value for ungrouped data)
The y-coordinate of each point is the cumulative frequency up to that boundary
Start the graph at $(x_0, 0)$, where $x_0$ is the lower boundary of the first class
Connect the points with straight lines
The graph ends when the cumulative frequency reaches $n$ (the total frequency)

The resulting shape typically shows an increasing curve that becomes steeper in regions where data is concentrated.
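As an illustration of the construction steps, the following Python sketch builds the points of a cumulative frequency polygon for the race times in Worked Example 2. It assumes the class boundaries and frequencies found there; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from Worked Example 2 (race times).
boundaries = [30, 40, 50, 60, 70, 80, 90]
frequencies = [2, 5, 5, 4, 3, 1]

# Start at (lower boundary of the first class, 0),
# then add one point at each upper class boundary.
cumulative = [0] + list(accumulate(frequencies))
points = list(zip(boundaries, cumulative))

print(points)
# [(30, 0), (40, 2), (50, 7), (60, 12), (70, 16), (80, 19), (90, 20)]
```

Joining these points with straight lines gives the polygon. The steepest segments, from $40$ to $60$ minutes, are where the data is most concentrated.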

This cumulative frequency polygon shows pet ownership data. The median can be found by locating $\frac{n}{2}$ on the y-axis (in this case, $25$, which corresponds to $n = 50$) and reading across to the curve, then down to the x-axis.

Finding the median from a cumulative frequency polygon

To find the median:

Calculate $\frac{n}{2}$ where $n$ is the total frequency
Locate this value on the y-axis (cumulative frequency axis)
Draw a horizontal line from this point until it meets the polygon
Draw a vertical line down to the x-axis
Read the x-value where this vertical line meets the axis

This x-value is the median, representing the middle value of the ordered dataset.
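Reading across and down on the graph amounts to linear interpolation along one straight segment of the polygon. Below is a minimal Python sketch of that idea. The function name `median_from_polygon` is hypothetical, and the sample points are the polygon points for the race times in Worked Example 2 (cumulative frequencies $2, 7, 12, 16, 19, 20$ at the upper boundaries $40$ to $90$).

```python
def median_from_polygon(points):
    """Estimate the median by linear interpolation along a cumulative frequency polygon.

    points: (x, cumulative frequency) pairs in increasing order, starting at (x0, 0).
    """
    target = points[-1][1] / 2          # n / 2 on the cumulative frequency axis
    for (x1, F1), (x2, F2) in zip(points, points[1:]):
        if F1 <= target <= F2 and F2 > F1:
            # Fraction of the way along this segment where the line reaches n / 2
            return x1 + (target - F1) / (F2 - F1) * (x2 - x1)
    raise ValueError("points do not form a valid cumulative frequency polygon")


race_points = [(30, 0), (40, 2), (50, 7), (60, 12), (70, 16), (80, 19), (90, 20)]
print(median_from_polygon(race_points))   # about 56 minutes
```

The estimate of about $56$ minutes is close to the median of the raw times, $55.5$ (the average of the $10$th and $11$th values, $55$ and $56$). A graph-based median is an estimate because grouping hides the individual values.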

Important

Key Difference: The mode cannot be found directly from a cumulative frequency polygon. You need to refer back to the frequency table or frequency histogram, as the mode is the value with the highest individual frequency (not cumulative).

Example

Worked Example 3: Books Read by Students

A dataset records the number of books read by $20$ students in a month:

$2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 15$

Part a) Create a cumulative frequency polygon with class intervals of width $5$ books starting at $0$.

Strategy:

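Group data into intervals: $0$ - $5$, $5$ - $10$, $10$ - $15$, $15$ - $20$, where each interval includes its lower boundary but not its upper boundary (so $5$ is counted in $5$ - $10$)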
Count frequencies and calculate cumulative frequencies
Plot points at upper class boundaries
Connect the points starting from $(0, 0)$

Working:

Class Interval	Frequency	Cumulative Frequency
$0$ - $5$	$5$	$5$
$5$ - $10$	$11$	$16$
$10$ - $15$	$3$	$19$
$15$ - $20$	$1$	$20$

Plot points at: $(0, 0)$, $(5, 5)$, $(10, 16)$, $(15, 19)$, $(20, 20)$

Part b) Find the mode.

Strategy: Look at the frequency table to identify which class interval has the highest frequency.

Answer: The mode is in the $5$ - $10$ interval, as this has the highest frequency of $11$.

Part c) Find the median.

Strategy: Locate $\frac{20}{2} = 10$ on the cumulative frequency axis, read across to the polygon, then down to find the corresponding number of books.

Working:

From the graph, when the cumulative frequency equals $10$ , the corresponding value on the x-axis is approximately $7$ books.

Answer: The median is approximately $7$ books. This means about half the students read $7$ or fewer books, and about half read $7$ or more books.
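As a rough check on the graph reading, note that a cumulative frequency of $10$ falls on the straight segment joining $(5, 5)$ and $(10, 16)$. Linear interpolation along that segment gives:

$\text{Median} \approx 5 + \frac{10 - 5}{16 - 5} \times (10 - 5) = 5 + \frac{25}{11} \approx 7.3$

This agrees with the reading of about $7$ books from the graph.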

Summary

Key Points to Remember:

Frequency tables organise data with columns for values, frequency ($f$), relative frequency ($\frac{f}{n}$), cumulative frequency ($F$), and cumulative relative frequency ($\frac{F}{n}$).
Histograms use adjacent bars to display frequency distributions. The tallest bar indicates the mode (most common value or interval).
Cumulative frequency histograms show running totals and help estimate the median by locating $\frac{n}{2}$ on the y-axis.
Cumulative frequency polygons (ogives) connect points at upper class boundaries and are particularly useful for finding the median by reading from $\frac{n}{2}$ on the y-axis.
Always check your work: total frequencies should sum to $n$, and cumulative relative frequencies should reach $1$ (or $100\%$).
