Grouped Data and Histograms Revision Notes for HSC SSCE Mathematics Advanced

Introduction

When working with large datasets, organising data into tables and graphs helps us see the overall patterns. However, when a table has too many rows or a graph has too much detail, it becomes difficult to get a clear overview. The solution is to group the data, which reduces complexity while still showing important patterns.

Grouping data is particularly useful when:

A frequency table would have too many rows
Individual data points would create a cluttered graph
We want to see the "big picture" of the data distribution

Grouping data

What is grouped data?

Grouped data organises individual values into intervals (also called bins). All intervals have the same width, and each is represented by its class centre, the midpoint of the interval.

Creating a grouped frequency table

Consider the heights (in centimetres) of $100$ !Kung people of the Kalahari Desert:

To group this data, we organise it into intervals of $10$ cm:

Key features of this table:

Interval: The range of values (e.g., $80$–$90$ means $80 \leq \text{height} < 90$)
Class centre: The midpoint of each interval (e.g., $85$ for the $80$–$90$ interval)
Frequency: How many data points fall in each interval
Cumulative frequency: The running total of frequencies
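
The columns described above can be built mechanically. The following is a minimal Python sketch, assuming equal-width intervals that include their lower bound and exclude their upper bound. The function name `grouped_frequency_table` and the short list of heights are made up for illustration; they are not the data set from these notes.

```python
from collections import Counter

def grouped_frequency_table(values, start, width, n_bins):
    """Return rows of (lower, upper, class_centre, frequency, cumulative_frequency).

    Intervals are half-open: lower <= value < upper.
    """
    counts = Counter()
    for v in values:
        index = int((v - start) // width)  # which interval the value falls in
        if not 0 <= index < n_bins:
            raise ValueError(f"{v} lies outside the table")
        counts[index] += 1

    rows = []
    running_total = 0
    for i in range(n_bins):
        lower = start + i * width
        upper = lower + width
        centre = (lower + upper) / 2       # midpoint of the interval
        running_total += counts[i]         # cumulative frequency so far
        rows.append((lower, upper, centre, counts[i], running_total))
    return rows

# Hypothetical heights in cm (illustrative only)
heights = [83.2, 97.5, 104.1, 108.9, 110.0, 121.4, 125.0, 129.9]
table = grouped_frequency_table(heights, start=80, width=10, n_bins=5)
for row in table:
    print(row)
```

Each row reads (lower bound, upper bound, class centre, frequency, cumulative frequency). The last cumulative frequency equals the number of values.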

Why use class centres?

When data is grouped, we lose information about the exact values within each interval. We use the class centre as a representative value for all data in that interval. This simplification allows calculations while keeping the data manageable.

Effects of grouping

Grouping is a form of rounding that helps us see patterns, but it involves ignoring some information:

Advantages:

Clearer overview of data distribution
Easier to create meaningful graphs
Reduces clutter in large datasets

Disadvantages:

Loss of precision in calculations
Summary statistics (mean, median, range) will be approximations

Never discard the original data. Keep it for more accurate calculations if needed.

Example: Impact of Grouping on Accuracy

For the height data above:

Median from grouped data: $145$ cm
Median from raw data: $147.955$ cm
Range from grouped data: $90$ cm
Range from raw data: $92.71$ cm

The grouped data provides useful approximations while being easier to work with.

Handling boundary values

When the underlying variable is continuous, some data points may fall exactly on the boundary between intervals. You should:

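Choose one approach: place every boundary value in the lower interval, or every boundary value in the upper interval (a convention such as $80 \leq \text{height} < 90$ places them in the upper interval)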
Make a note of your decision
Be consistent throughout your analysis
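
As an illustration only, the two conventions can be written as small Python functions. The names `index_lower_inclusive` and `index_upper_inclusive` are hypothetical, and both assume equal-width intervals that start at `start`.

```python
import math

def index_lower_inclusive(value, start, width):
    """Interval number (0-based) when intervals are lower <= value < upper."""
    return math.floor((value - start) / width)

def index_upper_inclusive(value, start, width):
    """Interval number (0-based) when intervals are lower < value <= upper."""
    return math.ceil((value - start) / width) - 1

# A height of exactly 90 cm with intervals 80-90, 90-100, ...
print(index_lower_inclusive(90, 80, 10))  # 1, the 90-100 interval
print(index_upper_inclusive(90, 80, 10))  # 0, the 80-90 interval

# A non-boundary value lands in the same interval either way
print(index_lower_inclusive(85, 80, 10), index_upper_inclusive(85, 80, 10))  # 0 0
```

Only values that sit exactly on a boundary are affected by the choice, which is why recording the convention once is enough.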

Frequency histograms and frequency polygons

Frequency histograms

A frequency histogram is a bar graph where:

Each rectangle represents a class interval
The height shows the frequency
Rectangles join with no gaps between them

Guidelines for drawing frequency histograms

For ungrouped data:

Each rectangle is centred on the value
Rectangles join up with no gaps

For grouped data:

Each rectangle is centred on the class centre
All rectangles have equal width
Rectangles join up with no gaps
The horizontal axis intervals are called bins

Practical tip: If you have too many columns making the histogram difficult to interpret, use coarser grouping (wider intervals).

Frequency polygons

A frequency polygon is a line graph that can be drawn alongside or instead of a histogram.

Guidelines for drawing frequency polygons

Key steps for frequency polygons:

Plot points at the centre of the top of each histogram rectangle
Join these points with straight line segments
On the left: Start the polygon on the horizontal axis at the previous value or class centre
On the right: End the polygon on the horizontal axis at the next value or class centre
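
The steps above can be sketched in Python as a list of vertices. This is a minimal sketch assuming equally spaced class centres; the function name `frequency_polygon_points` and the frequencies are hypothetical.

```python
def frequency_polygon_points(centres, frequencies):
    """Vertices of a frequency polygon for equally spaced class centres."""
    if len(centres) != len(frequencies) or len(centres) < 2:
        raise ValueError("need at least two matching centres and frequencies")
    width = centres[1] - centres[0]
    start = (centres[0] - width, 0)   # previous class centre, on the axis
    end = (centres[-1] + width, 0)    # next class centre, on the axis
    # Middle points sit at the centre of the top of each rectangle
    return [start, *zip(centres, frequencies), end]

# Hypothetical class centres and frequencies
points = frequency_polygon_points([85, 95, 105, 115], [2, 5, 7, 3])
print(points)
```

Joining consecutive points with straight segments gives the polygon; the first and last points lie on the horizontal axis.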

Histograms with discrete data

Histograms are designed for continuous variables, but can be used for discrete data. Remember:

Rectangles still have width
Rectangles still join up
They are centred on values (or class centres for grouped data)
This may involve numbers like half-integers that aren't possible values of the variable
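
To see where the half-integers come from, here is a small Python sketch that computes the left and right edges of rectangles centred on given values. The function name `rectangle_edges` and the marks used are illustrative assumptions.

```python
def rectangle_edges(centres, width):
    """Return (left, right) edges for rectangles centred on the given values."""
    half = width / 2
    return [(c - half, c + half) for c in centres]

# Ungrouped whole-number marks 0, 1, 2: each rectangle has width 1
edges = rectangle_edges([0, 1, 2], 1)
print(edges)  # [(-0.5, 0.5), (0.5, 1.5), (1.5, 2.5)]

# Adjacent rectangles share an edge, so there are no gaps
no_gaps = all(edges[i][1] == edges[i + 1][0] for i in range(len(edges) - 1))
print(no_gaps)  # True
```

The edges fall at values such as $0.5$ and $1.5$ even though no student can score them.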

Worked Example: Spelling Test Marks

Consider these Year $7$ spelling test marks:

Table

Grouping the data:

Table

Histograms and polygons:

Analysis: The grouped data histogram makes it clearer that a significant group of students have difficulties with spelling or tests, shown by the higher frequency in the middle-to-upper mark ranges.

Cumulative frequency histograms and polygons (ogives)

Cumulative frequency histograms

A cumulative frequency histogram shows the accumulation of frequencies:

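Rectangles are "piled on top of each other": the height of each rectangle is the running total of the frequencies up to and including that interval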
The height of the last rectangle equals the total sample size

Cumulative frequency polygons (ogives)

An ogive (pronounced "oh-jive") is the cumulative frequency version of the frequency polygon.

Guidelines for drawing cumulative frequency histograms

Key steps:

Stack the frequency histogram rectangles on top of each other
The final rectangle height equals the total number of observations ( $n$ )

Guidelines for drawing ogives

Key steps for drawing ogives:

Start at zero at the bottom left corner of the first rectangle (no scores yet accumulated)
Pass through the top right-hand corner of each rectangle
Each of these points shows how many scores are at or below the upper bound of that interval
Finish at the top right corner of the last rectangle
Final height equals the total sample size
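
The ogive steps above amount to pairing each interval's upper bound with the running total of frequencies. A minimal Python sketch, assuming equal-width intervals; the function name `ogive_points` and the frequencies are hypothetical.

```python
from itertools import accumulate

def ogive_points(lower_bound, width, frequencies):
    """Vertices of an ogive for equal-width intervals starting at lower_bound."""
    cumulative = list(accumulate(frequencies))
    points = [(lower_bound, 0)]  # bottom left corner: nothing accumulated yet
    for i, total in enumerate(cumulative, start=1):
        # top right corner of the i-th cumulative rectangle
        points.append((lower_bound + i * width, total))
    return points

# Hypothetical intervals 80-90, 90-100, 100-110, 110-120
points = ogive_points(80, 10, [2, 5, 7, 3])
print(points)
```

The final point has height $17$ here, the total of the hypothetical frequencies.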

Worked Example: Cumulative Frequencies for Spelling Test

Table

Grouped cumulative frequency table:

Table

Cumulative histograms and ogives:

Finding the median:

Original data: The $20$ th and $21$ st scores are both $6$ , so median $= 6$
Grouped data: The class centres of the $20$ th and $21$ st scores are both $5.5$ , so median $= 5.5$

Such discrepancies are normal when using grouped data.
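
The method of counting along to the middle positions can be sketched in Python. This is only an illustration: the function name `median_from_table` is made up, and the frequencies below are hypothetical (they are not the spelling test table), chosen so that $n = 40$.

```python
def median_from_table(scores, frequencies):
    """Median of data given as scores (in increasing order) with frequencies."""
    n = sum(frequencies)
    # 1-based positions of the middle score(s); they coincide when n is odd
    middle_positions = [(n + 1) // 2, n // 2 + 1]
    middle_scores = []
    for position in middle_positions:
        running_total = 0
        for score, f in zip(scores, frequencies):
            running_total += f          # cumulative frequency
            if running_total >= position:
                middle_scores.append(score)
                break
    return sum(middle_scores) / 2

# Hypothetical class centres and frequencies with n = 40
centres = [1.5, 3.5, 5.5, 7.5, 9.5]
freqs = [3, 6, 14, 12, 5]
print(median_from_table(centres, freqs))  # 5.5
```

With these frequencies the cumulative totals are $3, 9, 23, 35, 40$, so the $20$th and $21$st scores both fall in the interval with class centre $5.5$.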

Calculating the mean for grouped data

Formula for the mean

For data organised in a frequency table:

$\bar{x} = \frac{\sum xf}{n}$

where:

$\bar{x}$ is the mean
$x$ is the score (or class centre for grouped data)
$f$ is the frequency
$n$ is the total number of observations
$\sum xf$ means "sum all the products of $x \times f$ "
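
As a quick illustration of the formula, the following Python sketch computes $\sum xf$ and divides by $n$. The function name `mean_from_table` and the frequency table are hypothetical.

```python
def mean_from_table(xs, fs):
    """Mean of data given as scores (or class centres) xs with frequencies fs."""
    n = sum(fs)                                  # total number of observations
    sum_xf = sum(x * f for x, f in zip(xs, fs))  # sum of the products x * f
    return sum_xf / n

# Hypothetical class centres and frequencies
centres = [85, 95, 105, 115]
freqs = [2, 5, 8, 5]
print(mean_from_table(centres, freqs))  # 2060 / 20 = 103.0
```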

Worked Example: Heights Data

Table

Calculation:

$\bar{x} = \frac{\sum xf}{n} = \frac{14470}{100} = 144.70 \text{ cm}$

Note: The mean calculated from the raw (ungrouped) data was $145.352$ cm. Grouping reduced the mean by about $6.5$ mm.

Understanding the formula

The formula $\bar{x} = \frac{\sum xf}{n}$ is equivalent to the weighted mean formula:

$\bar{x} = \sum xf_r$

where $f_r = \frac{f}{n}$ is the relative frequency.

Each score is weighted by how often it occurs in the dataset.
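
The equivalence follows in one line, because $n$ is the same for every term and can be moved inside the sum:

```latex
\bar{x} = \frac{\sum xf}{n} = \sum \left( x \cdot \frac{f}{n} \right) = \sum x f_r
```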

Calculating variance and standard deviation

What are variance and standard deviation?

Variance ( $s^2$ ) and standard deviation ( $s$ ) measure how spread out the data are from the mean:

Larger values indicate data are more spread out
Smaller values indicate data are clustered near the mean
Standard deviation is the square root of variance
Standard deviation has the same units as the original data

Formulas for variance

Recommended formula for calculation:

$s^2 = \frac{\sum x^2f}{n} - \bar{x}^2$

Alternative formula (shows meaning more clearly):

$s^2 = \frac{\sum(x - \bar{x})^2f}{n}$

The alternative formula shows that variance is the average of the squared deviations from the mean.
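
The two formulas always give the same value. The following Python sketch checks this on a small hypothetical frequency table; the function names `variance_shortcut` and `variance_deviations` are made up for this illustration.

```python
import math

def variance_shortcut(xs, fs):
    """Variance as (sum of x^2 f) / n minus the square of the mean."""
    n = sum(fs)
    mean = sum(x * f for x, f in zip(xs, fs)) / n
    return sum(x * x * f for x, f in zip(xs, fs)) / n - mean ** 2

def variance_deviations(xs, fs):
    """Variance as the average squared deviation from the mean."""
    n = sum(fs)
    mean = sum(x * f for x, f in zip(xs, fs)) / n
    return sum((x - mean) ** 2 * f for x, f in zip(xs, fs)) / n

# Hypothetical class centres and frequencies (mean is 103)
centres = [85, 95, 105, 115]
freqs = [2, 5, 8, 5]
v1 = variance_shortcut(centres, freqs)
v2 = variance_deviations(centres, freqs)
print(v1, v2, math.sqrt(v1))
```

Here both functions give a variance of $86$: the shortcut computes $\frac{213900}{20} - 103^2$ and the deviations form computes $\frac{1720}{20}$.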

Formula for standard deviation

$s = \sqrt{s^2}$

Important notation

For samples (data):

Mean: $\bar{x}$
Standard deviation: $s$
Variance: $s^2$

For populations (theoretical):

Mean: $\mu$ or $E(X)$
Standard deviation: $\sigma$
Variance: $\sigma^2$ or $\text{Var}(X)$

Worked Example: Heights Data

Calculating the mean:

$\bar{x} = \frac{\sum xf}{n} = \frac{14470}{100} = 144.7 \text{ cm}$

Calculating the variance:

$s^2 = \frac{\sum x^2f}{n} - \bar{x}^2$

$s^2 = \frac{2127700}{100} - 144.7^2$

$s^2 = 21277 - 20938.09$

$s^2 = 338.91$

Calculating the standard deviation:

$s = \sqrt{338.91} \approx 18.41 \text{ cm}$

Comparison with raw data:

Raw data: $s^2 = 325.5407$ , $s \approx 18.04$ cm
Grouped data: $s^2 = 338.91$ , $s \approx 18.41$ cm, so grouping increased the standard deviation by about $0.37$ cm
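
The arithmetic for the grouped heights can be checked directly from the totals quoted above ( $\sum xf = 14470$ , $\sum x^2f = 2127700$ , $n = 100$ ). This is a small Python sketch; the function name `summary_from_totals` is an assumption made for the illustration.

```python
import math

def summary_from_totals(sum_xf, sum_x2f, n):
    """Mean, variance and standard deviation from the column totals of a table."""
    mean = sum_xf / n
    variance = sum_x2f / n - mean ** 2
    return mean, variance, math.sqrt(variance)

# Totals for the grouped heights data, as given in the worked example
mean, variance, sd = summary_from_totals(14470, 2127700, 100)
print(round(mean, 2), round(variance, 2), round(sd, 2))  # 144.7 338.91 18.41
```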

Worked Example: Spelling Test Marks

For the original (ungrouped) data:

$\bar{x} = \frac{233}{40} = 5.825$

$s^2 = \frac{1579}{40} - 5.825^2 = 5.544375$

$s \approx 2.355$

For the grouped data:

Table

$\bar{x} = \frac{232}{40} = 5.8$

$s^2 = \frac{1566}{40} - 5.8^2 = 5.51$

$s \approx 2.347$

The differences arise from the loss of information when grouping.

Key formulas summary

Essential Formulas for Grouped Data:

Mean: $\bar{x} = \frac{\sum xf}{n}$

Variance: $s^2 = \frac{\sum x^2f}{n} - \bar{x}^2$

Standard deviation: $s = \sqrt{s^2}$

For grouped data: Use class centres in place of $x$ values.

Exam tips

When grouping data:

Choose an appropriate number of intervals (typically $5$ to $10$ )
Use equal interval widths
Record class centres clearly
Note how you handle boundary values

When drawing histograms:

Centre rectangles on values or class centres
Ensure no gaps between rectangles
Label axes clearly with units
Include a frequency scale

When drawing frequency polygons:

Join points with straight lines
Extend to the horizontal axis at both ends, at the previous and next value or class centre

When drawing ogives:

Start at zero (bottom left)
Pass through top right corners
End at total sample size

When calculating statistics:

Use the recommended formula $s^2 = \frac{\sum x^2f}{n} - \bar{x}^2$ as it's usually easier
Show your working in a clear table format
For grouped data, use class centres
Remember that grouped data gives approximations

Remember!

Key Points to Remember:

Grouping data helps us see the big picture but involves some loss of precision. The trade-off is worth it for clarity when dealing with large datasets.
Class centres represent all values in an interval. They are used for calculations with grouped data.
Histogram rectangles join up with no gaps, whether for ungrouped or grouped data. They are centred on values or class centres.
Frequency polygons pass through the centres of rectangle tops and extend to the horizontal axis at both ends.
Ogives start at zero (bottom left corner) and pass through the top right corners of cumulative histogram rectangles, ending at the sample size.
Mean and variance formulas use class centres for grouped data: $\bar{x} = \frac{\sum xf}{n}$ and $s^2 = \frac{\sum x^2f}{n} - \bar{x}^2$ .
