Displaying Data Revision Notes for HSC SSCE Mathematics Advanced

Displaying Data

Introduction to data analysis

When you collect data, whether small amounts or huge datasets, you usually end up with unsorted lists of numbers or categories. Just scanning through this raw data rarely gives you useful insights. The field of statistics provides tools to help analyse and make sense of this information.

Note

There are three main stages to analysing raw data:

Display the data using various tables and graphs (or charts) to gain an overview and initial insights into what's happening
Calculate summary statistics to measure key features. For single variable data, you'll use measures of location (like mean and median) and measures of spread (like variance, standard deviation and interquartile range). For two variable data, you'll also use correlation and lines of best fit
Investigate patterns and make predictions by using probability theory to calculate theoretical probability distributions, then test how well your data fits these distributions

After this analysis, you might suggest further experiments to collect more data and test your findings.

Review of random variables

Before working with data displays, let's review some fundamental concepts about random variables and experiments.

Experiments and random variables

An experiment in statistics can be either deterministic or random:

A deterministic experiment has only one possible outcome
A random experiment has more than one possible outcome

A random variable is the outcome when you run a random experiment. We usually denote it with an uppercase letter like $X$. The various possible outcomes of the experiment are called the values of the random variable.

Scores and frequency

When you run an experiment many times, the outcomes are called scores. The complete (finite) list of all the scores is called a sample.

The frequency of an outcome or value is the number of times it occurs in your sample.

Types of random variables

Random variables can be classified in different ways:

A random variable may be numeric (if its values are numbers) or categorical (if its values are categories or labels)
A numeric random variable is called discrete if its values can be listed in a sequence like $x_1, x_2, x_3, ...$
A numeric random variable is called continuous if its values cannot be listed (like measuring height as a real number rather than a rounded measurement)

Example

Examples of Random Variables:

Recording a person's country of birth is a categorical variable
Recording the number of overseas countries a person has visited is a numeric discrete variable (values $0, 1, 2, ...$ can be listed)
Recording a person's exact height is a numeric continuous variable (we cannot list all possible real number values)

Frequency tables and cumulative frequency tables

What is a frequency table?

The most basic tool for organising raw data is a frequency table. You can create one using a spreadsheet or database, or by hand using tally marks.

When working with numeric data, you can extend this to a cumulative frequency table, which shows the number of scores less than or equal to each given value.

Example

Worked Example: Spelling Test Data

At the start of Year $7$, Cedar Heights High School gave $40$ students a spelling test marked out of $10$. Here are the raw results:

$4, 7, 2, 8, 7, 6, 3, 2, 8, 2, 9, 5, 8, 5, 8, 3, 6, 7, 5, 2,$

$10, 6, 7, 5, 6, 6, 9, 1, 5, 7, 8, 1, 6, 5, 7, 10, 6, 7, 8, 6$

We can organise this data using tallies and frequencies:

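| Mark | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 2 | 4 | 2 | 1 | 6 | 8 | 7 | 6 | 2 | 2 |
| Cumulative frequency | 2 | 6 | 8 | 9 | 15 | 23 | 30 | 36 | 38 | 40 |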

The frequency row shows how many students achieved each mark. For example, 6 students scored 5 marks, and 8 students scored 6 marks.

The cumulative frequency row shows the running total. For example, the cumulative frequency at mark $6$ is $23$, meaning 23 students scored 6 marks or less.

Insight: Looking at the cumulative frequencies, we can see that $8$ students scored $3$ or less and $9$ students scored $4$ or less. These students appear to have poor spelling, or perhaps they had limited experience with tests in earlier years.
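The tallying in this worked example can also be done in code. The following is a minimal Python sketch, not part of the original notes: the function name `frequency_table` is an illustrative choice, and the list `scores` is the raw data copied from above.

```python
from collections import Counter
from itertools import accumulate

# The 40 spelling test marks from the worked example.
scores = [
    4, 7, 2, 8, 7, 6, 3, 2, 8, 2, 9, 5, 8, 5, 8, 3, 6, 7, 5, 2,
    10, 6, 7, 5, 6, 6, 9, 1, 5, 7, 8, 1, 6, 5, 7, 10, 6, 7, 8, 6,
]

def frequency_table(data):
    """Return (values, frequencies, cumulative frequencies), values ascending."""
    counts = Counter(data)                       # tally each score
    values = sorted(counts)
    frequencies = [counts[v] for v in values]
    cumulative = list(accumulate(frequencies))   # running totals
    return values, frequencies, cumulative

values, frequencies, cumulative = frequency_table(scores)
for v, f, cf in zip(values, frequencies, cumulative):
    print(f"mark {v:>2}: frequency {f}, cumulative frequency {cf}")
```

The last cumulative frequency should always equal the number of scores (here $40$), which is a quick check that nothing was miscounted.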

Cumulative frequency for categorical data

If the values in a categorical dataset have been sorted into a meaningful order, you can also create a cumulative frequency table for them (this is particularly useful for Pareto charts, which we'll discuss later).

Understanding cumulative frequency

For numeric data, the cumulative frequency tells you the number of scores that are less than or equal to a given score.

Note

You can extend a frequency distribution table to a cumulative frequency distribution table by taking the accumulating sums of the frequencies.

Finding the median from cumulative frequencies

When analysing single variable data, two main questions guide our investigation:

Measures of location: Where is the centre of the distribution?
Measures of spread: How spread out are the data?

The median is an important measure of location.

What is the median?

The median (symbol $Q_2$, meaning second quartile) is the middle score when you arrange all scores in ascending order. More specifically:

For an odd number of scores, the median is the single middle score
For an even number of scores, the median is the average of the two middle scores

A cumulative frequency table makes it easy to identify the median.

Example

Example with odd number of scores:

$4, 7, 8, 10, 10, 11, 13, 15, 21$ ($9$ scores)

The median is the $5$th score (the middle one), which is $10$.

Example with even number of scores:

$4, 7, 8, 10, 10, 11, 11, 13, 13, 21$ ($10$ scores)

There are two middle scores: the $5$th score is $10$ and the $6$th score is $11$.

The median is their average: $\frac{10 + 11}{2} = 10.5$.
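The odd and even cases can be captured in one short function. This is an illustrative Python sketch (the function name `median` is our own choice, and Python's standard library offers `statistics.median` for the same job):

```python
def median(scores):
    """Median of a non-empty list of numbers."""
    ordered = sorted(scores)          # arrange scores in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle score
    # even n: average of the two middle scores
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([4, 7, 8, 10, 10, 11, 13, 15, 21])
even_example = median([4, 7, 8, 10, 10, 11, 11, 13, 13, 21])
print(odd_example, even_example)
```

Note that list positions in Python start at $0$, so the $5$th score is `ordered[4]`.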

Using cumulative frequency to find the median

Let's use the spelling test data from earlier to find the median.

There are $40$ scores, so the median is the average of the 20th and 21st scores.

Note

From the cumulative frequency table:

The $15$th score is $5$
All scores from the $16$th up to the $23$rd are $6$

Therefore, both the $20$th and $21$st scores are $6$, so the median is 6.

This tells us that a student with a score of $6$ is in the middle of the class.
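The same lookup can be sketched in Python. The helper names `kth_score` and `median_from_table` are hypothetical, and the frequencies are those tallied from the $40$ raw spelling marks. The idea is that the $k$th score is the first value whose cumulative frequency reaches $k$.

```python
from itertools import accumulate

def kth_score(values, frequencies, k):
    """Return the k-th score (counting from 1) when scores are in ascending order.

    `values` must be listed in ascending order.
    """
    for value, cf in zip(values, accumulate(frequencies)):
        if k <= cf:                   # first cumulative frequency that reaches k
            return value
    raise ValueError("k exceeds the number of scores")

def median_from_table(values, frequencies):
    n = sum(frequencies)
    if n % 2 == 1:
        return kth_score(values, frequencies, (n + 1) // 2)
    lower = kth_score(values, frequencies, n // 2)
    upper = kth_score(values, frequencies, n // 2 + 1)
    return (lower + upper) / 2

marks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [2, 4, 2, 1, 6, 8, 7, 6, 2, 2]   # tallied from the raw spelling data
spelling_median = median_from_table(marks, freqs)
print(spelling_median)
```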

Displaying categorical data

There are many different ways to display data using tables and graphs. Good displays should be easy to read, even for people who aren't statistics experts. They should help you gain insights and communicate findings clearly.

Important

You need to be careful when reading tables and graphs. They can sometimes distort information, either intentionally or unintentionally. Always examine displays critically to spot potential misleading features.

Pareto charts

A Pareto chart is a powerful tool for displaying categorical or discrete data. While it can be used for any categorical dataset, its main purpose is to identify which problems in a business are most urgent and need attention first. It's classified as one of the 'seven basic tools of quality' in business management.

Example

Example: Business Problems at Secure Roofs

Secure Roofs is a company that arranges roof repairs. Sometimes scheduled repairs don't take place on the expected day, causing loss of income while expenses continue. The manager analysed the last $200$ such failures and organised them into six categories:

Table

Constructing a Pareto chart

To create a Pareto chart, follow these steps:

Step 1: Arrange the categories in descending order of frequency (highest first)

Step 2: Calculate the cumulative frequency for this new order

Step 3: Draw two graphs together:

A frequency histogram with columns in descending order
A cumulative frequency polygon (line graph)

The chart typically has two vertical axes:

Left axis: actual frequencies
Right axis: percentage frequencies

Here's the reorganised table and the Pareto chart:
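Steps 1 and 2 of the construction can be sketched in Python. This is an illustration under stated assumptions, not the original table. The notes give only three of the six counts (rain $88$, owner not home $64$, truck breakdown $22$), so the remaining three categories are merged here into a single 'Other' entry of $200 - 88 - 64 - 22 = 26$. In the full six-category chart these would be three separate, smaller columns to the right of 'Truck breakdown'.

```python
from itertools import accumulate

# Counts stated in the notes. "Other" merges illness, absence and blackout,
# whose individual counts are not given: 200 - 88 - 64 - 22 = 26.
failures = {
    "Truck breakdown": 22,
    "Rain": 88,
    "Other (illness, absence, blackout)": 26,
    "Owner not home": 64,
}

def pareto_table(counts):
    """Rows of (category, frequency, cumulative frequency, cumulative %)."""
    # Step 1: descending order of frequency
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(counts.values())
    # Step 2: cumulative frequency in the new order
    running = accumulate(freq for _, freq in ordered)
    return [(cat, freq, cum, 100 * cum / total)
            for (cat, freq), cum in zip(ordered, running)]

rows = pareto_table(failures)
for cat, freq, cum, pct in rows:
    print(f"{cat:<36} {freq:>4} {cum:>4} {pct:>6.1f}%")
```

The frequency column gives the heights of the columns (left axis), and the cumulative percentage column gives the points of the cumulative frequency polygon (right axis).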

Interpreting the Pareto chart

With this chart, the manager can work through issues from left to right, tackling the most serious problems first:

Rain ($88$ occurrences): The manager might decide to only schedule external roof repairs three days ahead when weather forecasts are more reliable
Owner not home ($64$ occurrences): The manager could personally ring each owner two days ahead with a friendly reminder, followed by an SMS the evening before
Truck breakdown ($22$ occurrences): Perhaps a new truck has been budgeted for next year
Other issues (illness, absence, blackout): The manager may have little control over these

Note

Key feature of Pareto charts:

The cumulative frequency polygon in a Pareto chart is always concave down (its slope never increases from left to right) because the categories are arranged in descending order of frequency, so each step up is no larger than the one before it. This means each chord connecting two points on the curve lies below or on the curve.
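A short justification, written as a sketch: the symbols $f_j$ (frequency of the $j$th category), $C_j$ (cumulative frequency), $k$ (number of categories) and $w$ (common column width) are introduced here for illustration and do not appear in the original notes.

```latex
\[
C_0 = 0, \qquad C_j = f_1 + f_2 + \dots + f_j, \qquad
\text{slope of segment } j = \frac{C_j - C_{j-1}}{w} = \frac{f_j}{w}
\]
\[
f_1 \ge f_2 \ge \dots \ge f_k
\quad\Longrightarrow\quad
\frac{f_1}{w} \ge \frac{f_2}{w} \ge \dots \ge \frac{f_k}{w}
\]
```

Because the slopes of successive segments never increase, the polygon is concave down.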

Conventions for Pareto charts

There are no universal standards for drawing Pareto charts, but common features include:

The histogram rectangles join up with each other
The cumulative frequency polygon starts at the left-hand bottom corner (because the initial sum is zero)
Each point is plotted at the right-hand top corner of its corresponding rectangle

Two-way tables (contingency tables)

A two-way table (also called a contingency table) combines two or more related frequency tables. It allows you to investigate whether two variables are related.

Important

Even in its simplest form with just four numbers, a two-way table can be surprisingly complex to interpret. This topic connects to bivariate data analysis and conditional probability.

Example

Example: Phone Cover Preferences

A survey asked $200$ adults what colour phone cover they preferred. Responses were recorded as either Dark (black-brown) or Colour (coloured), and the gender of each person was also recorded:

| | Dark | Colour |
|---|---|---|
| Men | 38 | 12 |
| Women | 56 | 94 |

At first glance, looking at the numbers $38$ and $56$ under 'Dark', you might think women prefer dark colours more than men. However, this would be misleading.

The importance of marginal frequencies

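To analyse this properly, we need to add row and column totals (called marginal frequencies):

| | Dark | Colour | Total |
|---|---|---|---|
| Men | 38 | 12 | 50 |
| Women | 56 | 94 | 150 |
| Total | 94 | 106 | 200 |

Important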

Now we can see that the survey included $150$ women but only $50$ men. The sample was heavily biased towards women, which we must account for in our analysis.

Analysing proportions

To answer the question "Do men prefer dark colours more than women do?", we need to calculate proportions:

For men:

Proportion preferring dark covers: $\frac{38}{50} = 76\%$
Proportion preferring coloured covers: $\frac{12}{50} = 24\%$

For women:

Proportion preferring dark covers: $\frac{56}{150} \approx 37\%$
Proportion preferring coloured covers: $\frac{94}{150} \approx 63\%$

The analysis clearly shows that men prefer dark covers much more than women do ($76\%$ versus $37\%$).
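These row proportions can be computed for any two-way table. The following Python sketch uses the survey counts; the dictionary layout and the function name `row_percentages` are illustrative assumptions.

```python
# Two-way table from the survey: rows are gender, columns are cover preference.
table = {
    "Men":   {"Dark": 38, "Colour": 12},
    "Women": {"Dark": 56, "Colour": 94},
}

def row_percentages(two_way):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row, counts in two_way.items():
        row_total = sum(counts.values())   # marginal frequency for this row
        result[row] = {col: 100 * n / row_total for col, n in counts.items()}
    return result

percentages = row_percentages(table)
for row, pcts in percentages.items():
    print(row, {col: round(p) for col, p in pcts.items()})
```

Dividing by the row total rather than the grand total is what removes the effect of the unequal numbers of men and women.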

Note

The overall proportion preferring dark covers is $\frac{94}{200} = 47\%$, which is NOT the average of $76\%$ and $37\%$ because the sample was biased towards women.
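One way to see this is to write the overall proportion as a weighted average, with each group weighted by its share of the sample. This is a worked check using only the survey figures:

```latex
\[
\frac{94}{200}
= \frac{50}{200}\times\frac{38}{50} + \frac{150}{200}\times\frac{56}{150}
= 0.25 \times 0.76 + 0.75 \times 0.3733\ldots
= 0.19 + 0.28 = 0.47
\]
```

The unweighted average would be $\frac{0.76 + 0.3733\ldots}{2} \approx 0.57$, which overstates the preference for dark covers because it treats the $50$ men and the $150$ women as equally large groups.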

Understanding the table structure

Each row and each column in the extended table is actually a frequency table:

The last row and last column are called marginal distributions
The inner rows and columns are called conditional distributions

This terminology relates to conditional probability concepts.

Conditional probability in two-way tables

The proportions we calculated are actually probabilities. If we choose a person from the survey at random:

$P(\text{person prefers dark covers}) = \frac{94}{200} = 0.47$

We can also calculate conditional probabilities. These tell us the probability of one event given that another event has occurred.

To find the probability that a person prefers dark covers given that they are a man (or woman), we use a reduced sample space:

$P(\text{prefers dark} \mid \text{man}) = \frac{38}{50} = 0.76$

$P(\text{prefers dark} \mid \text{woman}) = \frac{56}{150} \approx 0.37$

We can also work in the opposite direction:

$P(\text{person is a man}) = \frac{50}{200} = 0.25$

$P(\text{man} \mid \text{prefers dark}) = \frac{38}{94} \approx 0.40$

$P(\text{man} \mid \text{prefers coloured}) = \frac{12}{106} \approx 0.11$

These conditional probabilities give us deeper insights into the relationships between variables.
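The reduced sample space idea translates directly into code. In this Python sketch the function name `conditional` and the way events are written as small test functions are illustrative choices. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Survey counts keyed by (gender, preference).
counts = {
    ("man", "dark"): 38, ("man", "coloured"): 12,
    ("woman", "dark"): 56, ("woman", "coloured"): 94,
}

def conditional(counts, event, given):
    """P(event | given): restrict to the outcomes where `given` holds,
    then find what fraction of those also satisfy `event`."""
    reduced = {k: n for k, n in counts.items() if given(k)}   # reduced sample space
    favourable = sum(n for k, n in reduced.items() if event(k))
    return Fraction(favourable, sum(reduced.values()))

def is_man(key):
    return key[0] == "man"

def prefers_dark(key):
    return key[1] == "dark"

p_dark_given_man = conditional(counts, prefers_dark, is_man)   # equals 38/50
p_man_given_dark = conditional(counts, is_man, prefers_dark)   # equals 38/94
print(float(p_dark_given_man), float(p_man_given_dark))
```

Swapping the roles of `event` and `given` changes the denominator from the row total ($50$) to the column total ($94$), which is why the two conditional probabilities differ.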

The mode and the range

Besides the median, there are two other simple but useful statistics: the mode and the range.

The mode

The mode is the most popular score in a dataset. It's the score with the greatest frequency.

Note

The word 'mode' comes from 'fashion', referring to what's most popular. The mode is the simplest measure of location to identify because it's immediately obvious from a frequency table. It's even easier to spot from a histogram.

Example

Examples of the Mode:

In the Secure Roofs frequency table, the mode is 'Rain' with a frequency of $88$
In the spelling test scores, the mode is 6 (which happens to equal the median, though this isn't always the case)

Multiple modes

Some frequency tables have two or more scores with the same maximum frequency. These are called:

Bimodal: two scores with equal highest frequency
Trimodal: three scores with equal highest frequency
Multimodal: several scores with equal highest frequency
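A function that finds the mode should therefore return a list, since there may be more than one. Below is a minimal Python sketch: the function name `modes` and the small datasets are made up for illustration. Python 3.8 and later also provide `statistics.multimode`, which does the same job.

```python
from collections import Counter

def modes(data):
    """Return every value that attains the highest frequency."""
    counts = Counter(data)                 # frequency of each value
    highest = max(counts.values())
    return [value for value, freq in counts.items() if freq == highest]

# Hypothetical datasets, not taken from the notes.
print(modes([1, 2, 2, 3, 3, 4]))           # two modes: bimodal
print(modes(["Rain", "Rain", "Illness"]))  # works for categorical data too
```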

The range

The range is only defined for numeric data. It measures how spread out the data are.

Definition: The range is the maximum score minus the minimum score.

Example

Example: Range of Spelling Test Scores

For the $40$ spelling test scores:

Minimum = $1$
Maximum = $10$
Range = $10 - 1 = 9$

The range is the simplest measure of spread for a dataset.

Important

This statistical meaning of 'range' is different from its use in function notation, where it means the set of all output values of a function.

Summary: measures of location and spread

The mode is a measure of location (tells you where the centre or most common value is)
The range is a measure of spread (tells you how dispersed the data are)

Summary

Key Points to Remember:

Three stages of data analysis: Display data using tables and graphs, calculate summary statistics, then investigate patterns and make predictions
Frequency tables organise raw data by counting how many times each value occurs. Cumulative frequency tables show running totals of frequencies
The median is the middle score when data are arranged in order. Use cumulative frequency tables to find it quickly: for an even number $n$ of scores, find the average of the $\frac{n}{2}$th and $\left(\frac{n}{2} + 1\right)$th scores; for an odd number $n$ of scores, take the $\frac{n + 1}{2}$th score
Pareto charts display categorical data in descending order of frequency, combining a histogram and cumulative frequency polygon. They help identify the most important issues to address first
Two-way tables (contingency tables) show relationships between two variables. Always calculate proportions within rows or columns, not just raw frequencies, to avoid being misled by unequal sample sizes
The mode is the most frequent score (measure of location) and the range is the difference between maximum and minimum scores (measure of spread)
