Comparing data sets (Edexcel GCSE Statistics): Revision Notes
Understanding the basics of comparing data sets
When you need to compare two or more data sets, you can't just look at the raw numbers and hope to spot patterns easily. Instead, you need to use statistical measures to make meaningful comparisons. This process involves calculating specific values that help you understand both the central tendency (where the data clusters) and the spread (how scattered the data is) of each data set.
The key to successful comparison lies in being systematic and thorough. You should always calculate the same measures for each data set you're comparing, then use these to draw conclusions about similarities and differences between the groups.
Statistical comparison is essential because raw data can be misleading. Two data sets might have the same total but completely different distributions, making proper statistical analysis crucial for accurate conclusions.
Key measures for comparison
Measures of average (central tendency)
These tell you where the middle or typical value of your data lies:
- Mean: Add all values and divide by the number of values
- Median: The middle value when data is arranged in order
- Mode: The most frequently occurring value
Measures of spread (dispersion)
These tell you how scattered or concentrated your data is:
- Range: Highest value minus lowest value
- Interquartile range (IQR): The range of the middle 50% of the data
- Standard deviation: A measure of how far values typically spread from the mean
Understanding both central tendency and spread is crucial because two data sets can have the same average but completely different levels of variability. For example, test scores of 85, 85, 85 and scores of 60, 85, 110 both have a mean of 85, but the first set has a range of 0 while the second has a range of 50.
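The example above can be checked with a short Python sketch. It uses the standard library's `statistics` module; the variable names are illustrative only, and the standard deviation here is the population form (dividing by n), which is assumed to match the formula used in your course.

```python
from statistics import mean, pstdev

# The two sets of test scores from the example above
set_a = [85, 85, 85]
set_b = [60, 85, 110]

for name, scores in (("A", set_a), ("B", set_b)):
    data_range = max(scores) - min(scores)   # highest minus lowest
    sd = pstdev(scores)                      # population standard deviation (assumption)
    print(f"Set {name}: mean={mean(scores)}, range={data_range}, sd={sd:.1f}")
```

Both sets report the same mean, while the range and standard deviation are zero for set A and clearly larger for set B.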
Golden rules for comparing data
When comparing any data sets, follow these essential steps:
- Always calculate a measure of average and make a comment about which group has the higher/lower typical value
- Always calculate a measure of spread and make a comment about which group shows more/less variation
The specific measures you choose depend on what you've calculated:
- If you use the mode, pair it with the range
- If you use the median, pair it with the range or interquartile range
- If you use the mean, pair it with the range or standard deviation
You can also compare the skew (how asymmetric each distribution is) to add more depth to your analysis.
Step-by-step worked examples
Worked Example 1: Comparing boys' and girls' marks
Let's say you have test marks for boys and girls:
- Boys: 6, 7, 12, 13, 16, 16, 18, 20
- Girls: 8, 8, 9, 13, 14, 15, 20
Step 1: Calculate the medians
For boys: There are 8 values, so the median is the average of the 4th and 5th values
- 4th value = 13, 5th value = 16
- Boys' median = $\frac{13 + 16}{2} = 14.5$
For girls: There are 7 values, so the median is the middle (4th) value
- Girls' median = 13
Step 2: Calculate the ranges
- Boys' range = $20 - 6 = 14$
- Girls' range = $20 - 8 = 12$
Step 3: Make comparisons
- The boys generally achieved higher marks because they had a higher median (14.5 compared to 13)
- The boys' marks showed greater spread because they had a larger range (14 compared to 12)
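As a quick check of Worked Example 1, here is a minimal Python sketch that computes the median and range for each group. The helper name `summarise` is made up for this illustration; `median` comes from Python's standard `statistics` module.

```python
from statistics import median

boys = [6, 7, 12, 13, 16, 16, 18, 20]
girls = [8, 8, 9, 13, 14, 15, 20]

def summarise(marks):
    """Return (median, range) for a list of marks."""
    return median(marks), max(marks) - min(marks)

boys_median, boys_range = summarise(boys)
girls_median, girls_range = summarise(girls)

print(f"Boys:  median={boys_median}, range={boys_range}")
print(f"Girls: median={girls_median}, range={girls_range}")
```

The same helper is applied to both groups, which mirrors the rule of calculating the same measures for every data set being compared.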
Worked Example 2: Cars arriving at crossroads
Given this data for cars arriving per minute: 2, 2, 3, 5, 5, 6, 6, 6, 7, 7, 7, 8, 9, 9, 11
Finding the median:
- There are 15 values
- Position of median = $\frac{15 + 1}{2} = 8$th value
- The 8th value in the ordered list is 6
- Therefore, median = 6
Finding the lower quartile:
- Position of lower quartile = $\frac{15 + 1}{4} = 4$th value
- The 4th value in the ordered list is 5
- Therefore, lower quartile = 5
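The position method used in Worked Example 2 can be sketched in Python as follows. The helper `value_at_position` is hypothetical, written for this illustration. It also interpolates when a position is not a whole number, which is an assumption: check the convention your course uses for fractional positions. The upper quartile is included for completeness, using the 3(n + 1)/4 position.

```python
def value_at_position(ordered, position):
    """Return the value at a 1-based position in an ordered list.

    If the position is fractional, interpolate between the two
    neighbouring values (an assumed convention, not the only one).
    """
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

cars = sorted([2, 2, 3, 5, 5, 6, 6, 6, 7, 7, 7, 8, 9, 9, 11])
n = len(cars)

median_value = value_at_position(cars, (n + 1) / 2)        # 8th value
lower_quartile = value_at_position(cars, (n + 1) / 4)      # 4th value
upper_quartile = value_at_position(cars, 3 * (n + 1) / 4)  # 12th value

print(median_value, lower_quartile, upper_quartile)
```

With 15 values, all three positions are whole numbers, so no interpolation is needed for this data.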
Using box plots for comparison
Box plots provide a visual way to compare data sets. They show five key values that make up what's called the five-number summary.
The Five Key Values in Box Plots:
- Minimum value (end of the left whisker)
- Lower quartile (left edge of box)
- Median (line inside box)
- Upper quartile (right edge of box)
- Maximum value (end of the right whisker)
When comparing box plots, look at:
- Which has the higher median (indicates higher typical values)
- Which has the larger interquartile range (indicates more spread in the middle 50% of data)
- The overall range from minimum to maximum
- The symmetry of the distribution
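To see what a box plot would be drawn from, the sketch below computes a five-number summary for the boys' and girls' marks from Worked Example 1. It relies on `statistics.quantiles` with `method="exclusive"`, which places the quartiles at the (n + 1)/4 style positions and interpolates when a position is fractional. That interpolation is an assumption and may not match every mark scheme, and `five_number_summary` is a name invented for this illustration.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

boys = [6, 7, 12, 13, 16, 16, 18, 20]
girls = [8, 8, 9, 13, 14, 15, 20]

for label, marks in (("Boys", boys), ("Girls", girls)):
    low, q1, med, q3, high = five_number_summary(marks)
    print(f"{label}: min={low}, LQ={q1}, median={med}, "
          f"UQ={q3}, max={high}, IQR={q3 - q1}")
```

Comparing the two printed lines side by side is the numerical equivalent of comparing two box plots drawn on the same scale.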
Comparing summary statistics
Sometimes you'll be given summary statistics in a table rather than raw data. This approach allows you to make comparisons without having to work with individual data points.
Worked Example: Comparing Car Lengths
| Group | Mean length (cm) | Standard deviation (cm) |
|---|---|---|
| American cars | 498 | 50.8 |
| European cars | 425 | 34.3 |
From this you can conclude:
- American cars are typically longer (higher mean: 498 cm vs 425 cm)
- American cars show more variation in length (higher standard deviation: 50.8 cm vs 34.3 cm)
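When only summary statistics are available, the comparison can still be made systematic. The sketch below stores the table values in a dictionary and picks out the group with the larger mean and the larger standard deviation; the dictionary layout and the helper `group_with_largest` are illustrative choices, not part of any specification.

```python
# Summary statistics from the car lengths table (all values in cm)
summary = {
    "American cars": {"mean": 498, "sd": 50.8},
    "European cars": {"mean": 425, "sd": 34.3},
}

def group_with_largest(stats, measure):
    """Return the name of the group with the largest value of a measure."""
    return max(stats, key=lambda group: stats[group][measure])

longer = group_with_largest(summary, "mean")
more_varied = group_with_largest(summary, "sd")
mean_gap = summary["American cars"]["mean"] - summary["European cars"]["mean"]

print(f"{longer} are typically longer (means differ by {mean_gap} cm).")
print(f"{more_varied} show more variation in length (larger standard deviation).")
```

Each printed sentence names the measure behind the conclusion, which is the habit the comparison rules ask for.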
Common exam tips and traps
Essential Exam Techniques:
- Always state which measure you're using (e.g., "Using the median..." or "Comparing the means...")
- Make sure you calculate the same measures for each data set
- Always give a reason for your conclusion (e.g., "because the median is higher")
- When finding quartiles, use the $\frac{n + 1}{4}$ and $\frac{3(n + 1)}{4}$ positions
Common Mistakes to Avoid:
- Don't just state which number is bigger - explain what this means in context
- Don't mix mismatched measures (for example, the mean with the interquartile range) - stick to appropriate pairings
- Don't forget to order your data before finding the median
- Don't round too early in multi-step calculations
Key Points to Remember:
- Always calculate both an average AND a measure of spread when comparing data sets
- Use appropriate pairings: median with range/IQR, mean with range/standard deviation
- Make your comparisons meaningful by explaining what the numbers tell you about the real situation
- Show all working clearly - especially when finding medians and quartiles from ordered lists
- Box plots are excellent visual tools for comparing the five-number summary of different data sets