Classifying and Displaying Categorical Data Revision Notes for VCE SSCE General Mathematics

Classifying and Displaying Categorical Data

Introduction to variables and data

When we collect information through surveys or experiments, we are gathering data about different variables. A variable is any characteristic that can change or vary from one person to another.

Consider a student survey that asks about:

Gender (male or female)
Preferred coffee size (small, medium, or large)
Number of brothers
Hand span in centimetres

Each of these questions represents a different variable because the answers will vary from student to student. The actual values we collect are called data.

These variables, and the data they produce, can be sorted into two main types: categorical or numerical.

Understanding categorical data

Categorical data sorts people or items into groups or categories. When you collect categorical data, you are essentially placing each person into one of several possible groups.

However, not all categorical data works the same way. There are two subtypes:

Nominal data

Nominal data uses values that are simply names or labels. These labels identify which category something belongs to, but they don't tell us anything more than that.

Example: Gender recorded as 'M' or 'F'

The letters M and F identify whether someone is male or female, but they don't imply any ordering or ranking. We can't say that one is "more" or "less" than the other in any meaningful way. The values are just names.

Other examples of nominal data:

Favourite colour (red, blue, green, yellow)
Type of pet (dog, cat, bird, other)
Country of birth

Note

Memory Aid: Nominal = Names

Think of "nominal" as coming from the word "name." Nominal data simply names or labels categories without any ranking or ordering.

Ordinal data

Ordinal data does two things: it names categories AND allows us to order or rank them.

Example: Coffee size recorded as 'small', 'medium', or 'large'

These values not only tell us the size category, but we can also arrange them in order from smallest to largest. There is a natural ranking to the categories.

Other examples of ordinal data:

Level of agreement (strongly disagree, disagree, neutral, agree, strongly agree)
Education level (primary, secondary, tertiary)
Satisfaction rating (very unsatisfied, unsatisfied, neutral, satisfied, very satisfied)

Note

Memory Aid: Ordinal = Order

The word "ordinal" relates to "order." Ordinal data has categories that can be arranged in a meaningful sequence or ranking.

The key difference is that ordinal data has a meaningful order, while nominal data does not.
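The difference shows up as soon as you try to sort the data. The short Python sketch below uses made-up coffee-size responses (the list `responses` and the name `SIZE_ORDER` are illustrative assumptions, not part of the survey above). Sorting alphabetically treats the values as mere labels, while sorting by rank uses the natural order that makes the data ordinal.

```python
# Hypothetical survey responses for preferred coffee size (ordinal data).
responses = ["medium", "large", "small", "medium", "small", "large", "medium"]

# The natural ranking of the categories, smallest first.
SIZE_ORDER = ["small", "medium", "large"]

# Alphabetical sorting ignores the ranking: 'large' lands before 'small'.
alphabetical = sorted(responses)

# Sorting by position in SIZE_ORDER respects the ranking.
ranked = sorted(responses, key=SIZE_ORDER.index)

print(alphabetical)
print(ranked)
```

For nominal data such as favourite colour there is no equivalent of `SIZE_ORDER`, so any ordering of the categories is arbitrary.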

Understanding numerical data

Numerical data consists of values that represent quantities. These are numbers that you can perform arithmetic operations on, such as adding, subtracting, or finding an average. Like categorical data, numerical data has two subtypes:

Discrete data

Discrete data can only take specific, separate values. These values are typically whole numbers that come from counting.

Example: Number of brothers ( $0$ , $1$ , $2$ , $3$ , ...)

You can have $0$ , $1$ , or $2$ brothers, but you cannot have $1.5$ brothers. The values are distinct and separate from each other.

Other examples of discrete data:

Number of students in a class
Number of cars in a car park
Number of goals scored in a match

Continuous data

Continuous data can take any value within a range. These values typically come from measurement and are limited only by the precision of your measuring instrument.

Example: Hand span measured in centimetres

If a student records their hand span as $24.1$ cm (rounded to the nearest $0.1$ cm), their actual hand span could be anywhere between $24.05$ cm and $24.15$ cm. The variable could theoretically take any value within a certain range.
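The rounding idea can be sketched in code: a value recorded to the nearest $0.1$ cm stands for every true value within half a step either side of it. The helper name `measurement_bounds` is hypothetical, and `Decimal` is used only to avoid floating-point noise in the printed limits.

```python
from decimal import Decimal

def measurement_bounds(recorded: str, precision: str) -> tuple:
    """Return the (lower, upper) limits of true values that round to `recorded`."""
    half_step = Decimal(precision) / 2
    value = Decimal(recorded)
    return value - half_step, value + half_step

# A hand span recorded as 24.1 cm, measured to the nearest 0.1 cm.
lower, upper = measurement_bounds("24.1", "0.1")
print(lower, upper)  # 24.05 24.15
```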

Other examples of continuous data:

Height in centimetres
Weight in kilograms
Time in seconds
Temperature in degrees Celsius

Note

Memory Aid: Discrete vs Continuous

Discrete = Distinct countable values (think "separate" or "distinct")
Continuous = Can be anywhere in a range (think "continuous line")

Alternative: "CATS count, DOGS measure" - CATegorical data you COUNT in frequency tables, Continuous data often comes from MEASUREment

A note on interval and ratio scales

There is another way to classify numerical data, though for most statistical purposes this distinction is not critical:

Interval scales have equal differences between successive values, but zero does not mean the complete absence of what is being measured.

Example: Temperature in degrees Celsius. The difference between $13^\circ\text{C}$ and $14^\circ\text{C}$ is the same as between $14^\circ\text{C}$ and $15^\circ\text{C}$ . However, $0^\circ\text{C}$ does not mean a complete absence of heat. Also, we cannot say that $40^\circ\text{C}$ is twice as hot as $20^\circ\text{C}$ .

Ratio scales have all the properties of interval scales, plus zero means complete absence, and we can make meaningful ratio statements.

Example: Number of brothers or hand span in centimetres. Zero brothers means no brothers at all. Someone with $4$ brothers has exactly twice as many brothers as someone with $2$ brothers.

For statistical analysis in this course, we treat both interval and ratio data the same way, so you don't need to distinguish between them.
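As an optional illustration of why ratios fail on an interval scale, the sketch below converts the two Celsius temperatures to kelvin. The kelvin scale is not covered in these notes; it is brought in here as an assumption because its zero is a true zero. The function name `celsius_to_kelvin` is illustrative.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to kelvin (a scale with a true zero)."""
    return celsius + 273.15

# The ratio of the Celsius readings suggests "twice as hot"...
ratio_celsius = 40 / 20

# ...but the same two temperatures on a true-zero scale are only about 7% apart.
ratio_kelvin = celsius_to_kelvin(40) / celsius_to_kelvin(20)

print(ratio_celsius)           # 2.0
print(round(ratio_kelvin, 3))  # 1.068
```

The ratio changes when the zero point moves, which is why "twice as hot" is not a meaningful statement for degrees Celsius.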

Types of variables summary

Variables that produce categorical data are called categorical variables. These can be further classified as:

Nominal variables (e.g., gender, hair colour)
Ordinal variables (e.g., size preferences, satisfaction levels)

Variables that produce numerical data are called numerical variables. These can be further classified as:

Discrete variables (e.g., number of siblings, number of students)
Continuous variables (e.g., height, weight, time)

The following diagram shows how different variable types relate to each other:

Summary

Key Classification Framework:

All variables fall into two main categories:

Categorical variables → Nominal (names only) or Ordinal (names + order)
Numerical variables → Discrete (countable, specific values) or Continuous (measurable, any value in range)
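One way to remember the framework is as two yes/no questions asked in sequence. The sketch below is only an illustration of that decision process; the function `classify_variable` and its parameter names are hypothetical, and you still have to answer the questions yourself for each variable.

```python
def classify_variable(is_quantity: bool, ordered: bool = False, counted: bool = False) -> str:
    """Classify a variable using the two-level framework.

    is_quantity: do the values represent amounts you can do arithmetic on?
    ordered:     (categorical only) can the categories be ranked?
    counted:     (numerical only) do the values come from counting?
    """
    if is_quantity:
        return "discrete" if counted else "continuous"
    return "ordinal" if ordered else "nominal"

# The four variables from the student survey at the start of these notes.
print(classify_variable(is_quantity=False))                # gender -> nominal
print(classify_variable(is_quantity=False, ordered=True))  # coffee size -> ordinal
print(classify_variable(is_quantity=True, counted=True))   # number of brothers -> discrete
print(classify_variable(is_quantity=True))                 # hand span -> continuous
```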

Classifying variables: worked examples

Let's practise identifying different types of variables with some examples.

Example

Example 1: Categorical or numerical?

Question: Classify the following variables as categorical or numerical.

a) Students choose their favourite pet from 'dog', 'cat', 'bird', or 'other'.

b) The time, in seconds, taken to solve a puzzle is recorded.

Solution:

a) Categorical - The values are categories of pets. Students are being sorted into groups based on their preference.

b) Numerical - The data consists of numbers representing the amount of time taken. We can perform calculations like finding the average time.

Example

Example 2: Nominal or ordinal?

Question: Classify the following categorical variables as nominal or ordinal.

a) A group of people record their level of happiness as 'very happy', 'happy', 'not too happy', or 'very unhappy'.

b) Students select their favourite country to visit.

Solution:

a) Ordinal - The categories represent different levels of happiness, and there is a clear order from very unhappy to very happy. We can rank the responses.

b) Nominal - The data values are simply names of countries. There is no natural ordering to countries.

Example

Example 3: Discrete or continuous?

Question: Classify the following numerical variables as discrete or continuous.

a) The number of children in the family is recorded for all students in a school.

b) The birth weight of babies, measured in grams, is recorded at a hospital.

Solution:

a) Discrete - The number of children can only be whole numbers ( $0$ , $1$ , $2$ , $3$ , etc.). This data comes from counting.

b) Continuous - Weight can take any value within a range. A baby could weigh $3250$ grams or $3251.5$ grams, depending on the accuracy of the scale. This data comes from measurement.

Example

Example 4: Classifying mixed variables

Question: Classify the following variables as nominal, ordinal, discrete, or continuous.

a) The number of students in each of $10$ classes is counted.

b) The time taken for $20$ mice to each complete a maze is recorded in seconds.

c) Diners at a restaurant were asked to rate their meal: $1$ = Very satisfied, $2$ = Satisfied, $3$ = Indifferent, $4$ = Dissatisfied, $5$ = Very dissatisfied.

d) Students choose a colour from a list: $1$ = Blue, $2$ = Green, $3$ = Red, $4$ = Yellow.

e) Students' heights were classified as 'less than $160$ cm', ' $160$ cm - $180$ cm', or 'more than $180$ cm'.

Solution:

a) Discrete - The number of students can only be whole numbers.

b) Continuous - Time can take any value, limited only by the precision of measurement.

c) Ordinal - Although numbers are used, they represent categories of satisfaction that can be ordered from very dissatisfied to very satisfied.

d) Nominal - The numbers are just codes for colour names. The numbers don't represent quantities, and there's no meaningful order.

e) Ordinal - Heights are grouped into three ordered categories (short, medium, tall).

Important

Watch out for numbers that represent categories!

Just because something is recorded as a number doesn't automatically make it numerical data. In part (d) above, the numbers $1$ , $2$ , $3$ , $4$ are simply codes for colour names - they don't represent quantities. This is a common exam trap!
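A quick way to see the trap is to try arithmetic on the codes. The sketch below uses the colour codes from part (d) with a made-up list of responses (`responses` is hypothetical data). The mean of the codes can be computed, but it does not correspond to any colour, whereas counting each category gives a meaningful summary.

```python
from collections import Counter
from statistics import mean

# Codes from part (d): the numbers are labels, not quantities.
COLOUR_CODES = {1: "Blue", 2: "Green", 3: "Red", 4: "Yellow"}

# Hypothetical coded responses from eight students.
responses = [1, 3, 3, 4, 2, 3, 1, 4]

# Python will happily average the codes, but 2.625 is not a colour.
print(mean(responses))  # 2.625

# Counting how often each category occurs is the appropriate summary.
counts = Counter(COLOUR_CODES[code] for code in responses)
print(counts)
```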

Frequency tables

Once we have collected categorical data, we need to organize it in a way that makes it easy to understand and analyze. Frequency tables are the most common way to do this.

What is a frequency table?

A frequency table lists the possible values of a variable along with how often each value occurs in the data set.

Frequency can be recorded in two ways:

Frequency: The actual count of how many times each value appears.

Percentage frequency: The percentage of times each value appears, calculated using:

$\text{Percentage frequency} = \frac{\text{count}}{\text{total}} \times 100$

A frequency distribution is simply a complete listing of all values and their frequencies.

Important

Rounding percentages

When calculating percentage frequencies, the total may come to $99.9\%$ or $100.1\%$ instead of exactly $100\%$ . This is normal and occurs because we round each individual percentage. Don't worry if your percentages don't sum to exactly $100\%$ - this is expected!

Example

Example 5: Constructing a frequency table

Question: Thirty children chose a sandwich, a salad, or a pie for lunch:

sandwich, salad, salad, pie, sandwich, sandwich, salad, salad, pie, pie, pie, salad, pie, sandwich, salad, pie, salad, pie, sandwich, sandwich, pie, salad, salad, pie, pie, pie, salad, pie, sandwich, pie

Construct a table showing both frequency and percentage frequency.

Solution:

Step 1: Set up a table with the variable name (Lunch choice) and its three possible values (Sandwich, Salad, Pie).

Step 2: Count how many children chose each option:

Sandwich: $7$ children
Salad: $10$ children
Pie: $13$ children

Step 3: Add the frequencies to find the total: $7 + 10 + 13 = 30$

Step 4: Calculate percentage frequencies:

Sandwich: $\frac{7}{30} \times 100 = 23.3\%$
Salad: $\frac{10}{30} \times 100 = 33.3\%$
Pie: $\frac{13}{30} \times 100 = 43.3\%$

Step 5: Complete the table:

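| Lunch choice | Frequency | Percentage (%) |
|---|---|---|
| Sandwich | 7 | 23.3 |
| Salad | 10 | 33.3 |
| Pie | 13 | 43.3 |
| Total | 30 | 99.9 |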

Notice that the percentages total $99.9\%$ rather than $100\%$ due to rounding.
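If you want to check the counts by computer, the steps of Example 5 can be sketched in Python. This is one possible approach using the standard library's `Counter`; the variable names are illustrative and the data is the list of thirty lunch choices from the question.

```python
from collections import Counter

# The thirty lunch choices from Example 5.
lunches = (
    "sandwich, salad, salad, pie, sandwich, sandwich, salad, salad, pie, pie, "
    "pie, salad, pie, sandwich, salad, pie, salad, pie, sandwich, sandwich, "
    "pie, salad, salad, pie, pie, pie, salad, pie, sandwich, pie"
).split(", ")

# Step 2: count how many children chose each option.
counts = Counter(lunches)

# Step 3: add the frequencies to find the total.
total = sum(counts.values())

# Step 4: percentage frequency = count / total * 100, rounded to 1 decimal place.
table = {choice: (n, round(n / total * 100, 1)) for choice, n in counts.items()}

for choice, (n, pct) in table.items():
    print(f"{choice:<10}{n:>4}{pct:>7}")

# The rounded percentages sum to 99.9 rather than 100.
percent_total = round(sum(pct for _, pct in table.values()), 1)
print(f"{'total':<10}{total:>4}{percent_total:>7}")
```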

Bar charts

While frequency tables summarize data effectively, visual displays often make patterns easier to spot. For categorical data, the appropriate visual display is a bar chart.

Features of bar charts

A bar chart has these key features:

The vertical axis shows frequency or percentage frequency
The horizontal axis shows the categories of the variable
The height of each bar represents the frequency (or percentage) for that category
Gaps between bars indicate that each value is a separate category
There is one bar for each category

Example

Example 6: Constructing bar charts

Question: Use the frequency table for Lunch choice from Example 5 to construct:

a) A bar chart

b) A percentage bar chart

Solution:

Part a: Frequency bar chart

Step 1: Label the horizontal axis "Lunch choice" and mark it with the three categories (Pie, Salad, Sandwich).

Step 2: Label the vertical axis "Frequency" and create a scale from $0$ to $15$ (slightly above the maximum frequency of $13$ ). Mark the scale in intervals of $5$ .

Step 3: Draw a bar for each category. Make each bar narrower than the space allocated to show the categories are separate. The height of each bar matches the frequency:

Pie: height of $13$
Salad: height of $10$
Sandwich: height of $7$

Part b: Percentage bar chart

Step 1: Follow the same process as Part a, but label the vertical axis "Percentage" instead.

Step 2: Create a scale from $0$ to $45\%$ (slightly above the maximum percentage of $43.3\%$ ). Mark the scale in intervals of $5\%$ .

Step 3: Draw bars with heights matching the percentage frequencies:

Pie: height of $43.3\%$
Salad: height of $33.3\%$
Sandwich: height of $23.3\%$

Note

Arranging categories in bar charts

For nominal variables, it's common (though not required) to arrange categories in order of decreasing frequency. This makes it easier to identify patterns and compare categories. This is why the example shows Pie first (highest frequency), then Salad, then Sandwich.
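As a rough illustration, the sketch below prints a text version of the frequency bar chart, with categories arranged in order of decreasing frequency. It is an assumption-laden stand-in for a drawn chart: the bars run horizontally because that is easier to print, and the function name `text_bar_chart` is hypothetical.

```python
# Frequencies from the Lunch choice table in Example 5.
frequencies = {"Sandwich": 7, "Salad": 10, "Pie": 13}

def text_bar_chart(freqs: dict) -> list:
    """Return one line per category, longest bar first."""
    # Arrange categories in order of decreasing frequency.
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    # Pad labels to the same width so the bars line up.
    width = max(len(name) for name in freqs)
    # One '#' per unit of frequency; each category gets its own separate bar.
    return [f"{name:<{width}} | {'#' * count} {count}" for name, count in ordered]

chart = text_bar_chart(frequencies)
for line in chart:
    print(line)
```

Swapping the counts for the percentage frequencies (rounded to whole numbers for the bar lengths) would give the corresponding percentage bar chart.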

The mode or modal category

One important feature that bar charts help us identify quickly is the mode or modal category.

The mode is the category that occurs most frequently - the most common or popular value. In a bar chart, this is the category with the tallest bar.

In Example 5, the modal category is 'pie' because it was chosen by more children ( $13$ ) than any other lunch option.
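Finding the mode from a frequency table amounts to picking the category with the largest count. A minimal sketch, using the Example 5 frequencies and illustrative variable names, is shown below. It also collects every category that ties for the highest count, since a data set can have more than one mode.

```python
from collections import Counter

# Frequencies from Example 5.
frequencies = Counter({"sandwich": 7, "salad": 10, "pie": 13})

# most_common(1) returns a list holding the single (category, count) pair
# with the highest count.
modal_category, modal_count = frequencies.most_common(1)[0]
print(modal_category, modal_count)  # pie 13

# If two or more categories share the highest count, list them all.
highest = max(frequencies.values())
modes = [category for category, count in frequencies.items() if count == highest]
print(modes)  # ['pie']
```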

When is the mode useful?

The mode is most useful when:

A single category clearly occurs more often than others
You want to identify the most popular or common choice
You're conducting popularity polls or market research

Examples of questions answered by the mode:

"Which TV station is most frequently watched between 6 p.m. and 8 p.m.?"
"When is a supermarket busiest?"
"What is the most common shoe size sold?"

The mode tells us what is typical or most common in categorical data, helping us understand what most people choose or prefer.

Summary

Remember! Key Points

Data can be categorical or numerical. Categorical data sorts things into groups; numerical data consists of quantities you can calculate with.
Categorical data is either nominal or ordinal. Nominal data uses names or labels with no order (e.g., colours, countries). Ordinal data has categories that can be ranked (e.g., small/medium/large, satisfaction levels).
Numerical data is either discrete or continuous. Discrete data comes from counting and takes specific values (e.g., number of siblings). Continuous data comes from measuring and can take any value in a range (e.g., height, time).
Frequency tables organize categorical data by showing how often each category occurs, either as counts or percentages. Use the formula: $\text{Percentage frequency} = \frac{\text{count}}{\text{total}} \times 100$
Bar charts visualize categorical data with gaps between bars to show separate categories. The height of each bar shows the frequency or percentage, and the tallest bar identifies the mode (most common category).
