Problems with collected data (Edexcel GCSE Statistics): Revision Notes
Problems with collected data
When you collect data for any statistical investigation, things don't always go perfectly. Understanding the problems that can occur with collected data and knowing how to deal with them is crucial for ensuring your results are reliable and accurate.
Why checking and cleaning data matters
Before you can analyse your data and draw conclusions, you need to make sure the data you've collected is fit for purpose.
Critical Point: If you don't check and clean your data first, your results could be completely wrong. That makes your entire investigation meaningless, regardless of how sophisticated your analysis methods are.
Data problems can arise from several sources:
- Human error when recording measurements
- Equipment malfunctions
- Participants providing incorrect information
- Inconsistent measurement methods
Checking data
The first step after collecting data is to thoroughly check it for any obvious problems. This involves looking for patterns and inconsistencies that might indicate errors.
Identifying outliers and anomalous values
Outliers are data values that don't fit the expected pattern of your dataset. These might be genuine extreme values, or they could indicate errors in data collection or recording.
When you spot an outlier, you need to investigate why it occurred:
- Measurement errors: Wrong units used, decimal point in wrong place, misread instruments
- Recording errors: Copying numbers incorrectly, typing mistakes
- Genuine extreme values: Sometimes unusual but legitimate results occur
Key point: Don't automatically remove outliers without investigating the cause. If they're due to errors, they should be corrected or removed. If they're genuine, they might be the most interesting part of your data!
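A numerical rule can help you pick out values worth investigating. The sketch below assumes the common convention of flagging anything more than 1.5 × IQR beyond the quartiles. These notes don't prescribe a particular rule, and the function name `flag_outliers` and the sample heights are made up for illustration.

```python
import statistics

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical heights in cm; the last value is deliberately suspicious
heights_cm = [158, 162, 165, 166, 170, 172, 1660]
print(flag_outliers(heights_cm))
```

A flagged value is only a candidate. You still need to find out why it occurred before deciding whether to correct it, remove it, or keep it.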
Considering your collection method
Think carefully about how your data was collected, as this can affect reliability.
Factors Affecting Data Reliability
Consider these aspects of your data collection method:
- Self-reported data: When people provide their own measurements (like height and weight), they might not be accurate
- Sample size: Very small samples make it harder to draw reliable conclusions
- Consistency: Were all measurements taken the same way by the same person?
Cleaning data
Once you've identified problems, you need to clean your data. This means correcting errors where possible and deciding what to do about values that can't be fixed.
Steps for cleaning data
1. Check and standardise units
Make sure all measurements use the same units. If someone recorded their height in feet and inches while others used centimetres, you'll need to convert everything to the same unit before analysis.
2. Deal with missing data
Decide how to handle gaps in your dataset:
- Remove the entire record if too much information is missing
- Estimate missing values if you have enough other information
- Collect the missing data if possible
3. Correct or remove inaccurate values
- Fix obvious errors (like a person's weight recorded as 10kg instead of 100kg)
- Remove values that are clearly impossible (like a person being 16.6m tall)
- Record values consistently (remove extra symbols or words)
4. Document your decisions
Keep track of what changes you made and why, so others can understand your process.
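Steps 2 to 4 can be sketched in a few lines of code. This is a minimal illustration, not a prescribed method: the function name `clean_weights`, the sample weights, and the 30 to 300 kg plausibility range are all assumptions chosen for the example.

```python
def clean_weights(weights_kg, low=30, high=300):
    """Blank out implausible weights and log every decision.

    None marks a missing value. The default range is an assumed
    plausibility range for adults, not a figure from these notes.
    """
    cleaned, log = [], []
    for person, weight in enumerate(weights_kg, start=1):
        if weight is None:
            log.append(f"Person {person}: weight missing, left blank")
        elif not low <= weight <= high:
            log.append(f"Person {person}: {weight} kg is outside "
                       f"{low}-{high} kg, removed")
            weight = None
        cleaned.append(weight)
    return cleaned, log

# Hypothetical weights in kg, already converted to one unit
cleaned, log = clean_weights([110, None, 10, 105])
for entry in log:
    print(entry)
```

The sketch blanks a suspicious value rather than guessing a correction. Changing 10 kg to 100 kg would only be justified if you could check it against the original record. The log is the documentation from step 4.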
Worked Example: Medical Study Data Problems
Let's look at a medical study where adults recorded their height and weight:
| Person | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Height | 162cm | 5ft 7in | 1.72m | 166m | 6ft 1 inch | 158cm | 1.43m | 1.5m | 5ft |
| Weight | 110kg | 16 stone | 10kg | - | 130kg | 116kg | 10 stone 3 pounds | 105kg | 8 stone |
Problems identified:
Different units being used: The data contains a mixture of metric and imperial measurements, making comparison impossible without conversion.
Missing data: Person 4 has no weight recorded, creating a gap in the dataset.
Impossible values:
- Person 4's height of "166m" is clearly wrong - no human is 166 metres tall! This should probably be "166cm" or "1.66m"
- Person 3's weight of "10kg" is too low for a healthy adult - this might be a recording error
Inconsistent formatting: The same unit is written in different ways (for example "in" and "inch"), and metric heights appear in both centimetres and metres.
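As an illustration, the heights from the table can be standardised to centimetres and then checked for impossible values. This is a sketch under stated assumptions: the function name `height_to_cm`, the accepted text formats, and the 100 to 250 cm plausibility range are choices made for this example. The conversion 1 inch = 2.54 cm is exact.

```python
import re

CM_PER_INCH = 2.54  # exact by definition

def height_to_cm(text):
    """Convert strings like '162cm', '1.72m' or '5ft 7in' to centimetres."""
    text = text.strip().lower()
    imperial = re.fullmatch(
        r"(\d+)\s*ft(?:\s*(\d+)\s*(?:in|inch|inches))?", text)
    if imperial:
        feet = int(imperial.group(1))
        inches = int(imperial.group(2) or 0)
        return round((feet * 12 + inches) * CM_PER_INCH, 1)
    metric = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(cm|m)", text)
    if metric:
        value = float(metric.group(1))
        return round(value * 100, 1) if metric.group(2) == "m" else value
    return None  # unrecognised format: leave for manual checking

# Heights exactly as recorded in the table above
heights = ["162cm", "5ft 7in", "1.72m", "166m", "6ft 1 inch",
           "158cm", "1.43m", "1.5m", "5ft"]
heights_cm = [height_to_cm(h) for h in heights]

# Assumed plausibility range for adult height, in cm
flagged = [(person, cm) for person, cm in enumerate(heights_cm, start=1)
           if cm is None or not 100 <= cm <= 250]
print(heights_cm)
print(flagged)
```

Only Person 4 is flagged, because "166m" converts to 16,600 cm. The code cannot tell you whether the intended value was 166cm or 1.66m. That still needs a human decision, and a note of what you decided.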
Impact on reliability:
This collection method creates several reliability issues:
- Small sample size: A sample of only 9 people makes it difficult to draw broad conclusions
- Self-reported data: People might not measure themselves accurately or might provide estimates rather than exact measurements
- Inconsistent data quality: The mix of units and obvious errors suggests the data collection wasn't well controlled
Dealing with outliers: Memory test example
Consider this memory study where students took a test after different amounts of sleep. Understanding how to handle outliers is essential for accurate analysis.
Worked Example: Memory Test Outlier Analysis
Results for 75 students:
- Day 1: 74 students remembered more than 15 words
- Day 2: 70 students remembered at least 2 more words than on Day 1
- Outlier: One student remembered 14 words on Day 1 but only 8 words on Day 2
Should we include the outlier?
Arguments for including it:
- It's a genuine result from the experiment
- Removing data without good reason can introduce bias
- This student might represent people who don't respond well to extra sleep
Arguments for excluding it:
- It goes against the trend shown by 70 other students
- There might have been external factors affecting this student's performance
- It could be due to illness, stress, or other confounding variables
Best approach: Analyse the data both with and without the outlier to see how much it affects your conclusions. If the overall trend remains the same, the outlier probably doesn't invalidate your findings.
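The comparison can be sketched as follows. The full results are not given here, so the list of score changes is hypothetical. Only the final value, 8 - 14 = -6, comes from the outlier described above. The function name `mean_with_and_without` is also just for illustration.

```python
from statistics import mean

def mean_with_and_without(values, outlier_index):
    """Return (mean of all values, mean with one value left out)."""
    kept = values[:outlier_index] + values[outlier_index + 1:]
    return mean(values), mean(kept)

# Hypothetical changes in words remembered (Day 2 minus Day 1).
# Only the last value, -6, is taken from the example.
changes = [2, 3, 2, 4, 3, 2, 5, 3, 2, 4, -6]

with_outlier, without_outlier = mean_with_and_without(
    changes, len(changes) - 1)
print(f"Mean change with outlier:    {with_outlier:.2f}")
print(f"Mean change without outlier: {without_outlier:.2f}")
```

In this made-up sample the mean change is positive either way, so the conclusion that scores improved would not depend on the outlier. If the two results pointed in different directions, you would need to report both and explain the difference.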
Professional Tip: When dealing with outliers, transparency is key. Always report what outliers you found, what you did with them, and why you made those decisions. This allows others to evaluate the validity of your conclusions.
Key Points to Remember:
- Always check your data before analysing it - look for obvious errors, inconsistent units, and missing values
- Clean your data systematically - standardise units, deal with missing data, and correct or remove clearly wrong values
- Don't automatically remove outliers - investigate why they occurred first, as they might be genuine and important results
- Consider your collection method - self-reported data and small samples can reduce reliability
- Document your decisions - keep track of what changes you made during data cleaning so others can understand your process