Problems with collected data (AQA GCSE Statistics): Revision Notes
Problems with collected data
When conducting research or analysis, the data you collect may not always be perfect. Understanding how to identify and address these problems is crucial for obtaining reliable results.
Data quality issues are extremely common in real-world research. Even professional studies often encounter problems that need to be addressed before analysis can begin.
Why we need to check our data
Before analysing any dataset, it's essential to examine your data carefully to ensure it's reliable and accurate. Raw data often contains various issues that can significantly affect your conclusions if left unaddressed.
Your dataset might include unusual or extreme values that don't follow the expected pattern of the rest of your data. These are called outliers or anomalous values. However, it's important to understand why these values exist before deciding what to do with them. Sometimes outliers occur due to measurement errors or recording mistakes, in which case they should be corrected or removed. Other times, they represent genuine but unusual observations that should be kept in your analysis.
Never automatically remove outliers without investigation. Some outliers represent genuine extreme cases that are important to your research, while others indicate data collection problems that need correction.
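One way to flag values for investigation is sketched below. The notes above do not name a specific rule, so the 1.5 × IQR threshold, the function name flag_outliers and the sample heights are all assumptions made for illustration. The function only lists suspicious values. It does not delete them, because each one still needs to be investigated.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    # quantiles() sorts the data itself; n=4 gives Q1, the median and Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical heights in cm; 1660 looks like a recording mistake.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 175, 1660]
print(flag_outliers(heights_cm))  # [1660]
```

A flagged value is a prompt to go back to the source, not a verdict. A genuine extreme observation would be flagged in the same way as a typing error.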
The way you collected your data (your collection method) can also impact how reliable your results will be. You need to consider whether your collection approach might have introduced any bias or limitations that could affect your conclusions.

Steps for cleaning your data
When you discover problems in your dataset, you'll need to clean it systematically. Data cleaning involves several key steps that help ensure your analysis will be based on accurate information.
Identify and correct inaccurate values: Look for data points that seem impossible or highly unlikely. For example, if you're measuring adult weights and find an entry of 10 kg, this is clearly incorrect. Check it against the original record if you can, then correct or remove it.
Check unit consistency: Ensure all measurements use the same units throughout your dataset. Having some heights in centimetres, others in metres, and some in feet means the values cannot be compared until they are converted to a single unit.
Record values properly: Remove any unnecessary symbols or text from numerical data, and ensure missing values are clearly identified rather than left as blank spaces.
Handle missing data: Decide how to deal with gaps in your dataset. You might choose to exclude incomplete records, estimate missing values, or collect additional data to fill the gaps.
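The cleaning steps above can be sketched as a small checking routine. This is a minimal illustration rather than a standard method. The function name check_weight, the list of missing-value markers and the plausibility limits of 30 kg and 300 kg are hypothetical choices, not values from these notes.

```python
import re

# Hypothetical plausibility limits for an adult weight in kg (assumed).
MIN_ADULT_KG = 30.0
MAX_ADULT_KG = 300.0

# Hypothetical markers people might type instead of a value (assumed).
MISSING_MARKERS = {"", "-", "n/a", "na", "unknown"}

def check_weight(raw):
    """Classify one raw weight entry and return (status, value_in_kg)."""
    text = (raw or "").strip().lower()
    if text in MISSING_MARKERS:
        return ("missing", None)
    # Accept a plain number with an optional "kg" suffix, e.g. "72" or "72 kg".
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(?:kg)?", text)
    if match is None:
        return ("unreadable", None)  # wrong unit or stray text
    value = float(match.group(1))
    if not MIN_ADULT_KG <= value <= MAX_ADULT_KG:
        return ("implausible", value)
    return ("ok", value)

for entry in ["72 kg", "", "10kg", "11 stone"]:
    print(repr(entry), check_weight(entry))
```

Labelling each entry, instead of silently fixing or dropping it, keeps a record of what was wrong. That record makes it easier to document the cleaning steps later.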
Worked example: medical study data problems
Let's examine a practical example of data problems and how to identify them.

In this medical study examining height and weight data, we can identify several serious problems:
Inconsistent units: The height measurements mix different units (centimetres, feet/inches, metres), whilst weights combine kilogrammes, stone, and stone/pounds. This makes comparison and analysis impossible without conversion.
Missing values: Person 4 has no weight recorded, creating a gap in the dataset that needs addressing.
Impossible values: Some entries are clearly incorrect - for instance, 166 m for height (which would make someone taller than a skyscraper) should probably be 166 cm, and 10 kg for an adult's weight is medically impossible.
Small sample size: With only 9 people, this dataset may be too small to draw reliable conclusions about the broader population.
Collection method reliability: The data was self-reported by participants, which can introduce errors due to people estimating incorrectly, using different measurement methods, or providing inaccurate information.
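Fixing the inconsistent units means converting every entry to one unit for height and one for weight. The sketch below assumes centimetres and kilograms as the targets. The function names and the example inputs are hypothetical, not values from the study. The conversion factors are the standard definitions: 1 lb = 0.45359237 kg, 14 lb = 1 stone, 1 inch = 2.54 cm.

```python
LB_TO_KG = 0.45359237    # exact, by definition of the pound
LB_PER_STONE = 14
INCH_TO_CM = 2.54        # exact, by definition of the inch
INCHES_PER_FOOT = 12

def stone_pounds_to_kg(stone, pounds=0):
    """Convert a weight given in stone and pounds to kilograms."""
    return (stone * LB_PER_STONE + pounds) * LB_TO_KG

def feet_inches_to_cm(feet, inches=0):
    """Convert a height given in feet and inches to centimetres."""
    return (feet * INCHES_PER_FOOT + inches) * INCH_TO_CM

def metres_to_cm(metres):
    """Convert a height given in metres to centimetres."""
    return metres * 100

# Hypothetical entries, rounded to 1 decimal place for reporting.
print(round(stone_pounds_to_kg(10, 0), 1))  # 63.5
print(round(feet_inches_to_cm(5, 6), 1))    # 167.6
```

Conversion only deals with the unit problem. An entry such as 166 m still needs a separate decision about whether it was meant to be 166 cm.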
Impact on reliability
These data problems don't just make analysis difficult - they directly affect how much you can trust your conclusions. A small sample size limits how well your results represent the wider population you're studying. Self-reported data may be less accurate than professionally measured data. Inconsistent units and impossible values suggest poor data quality control during collection.
When presenting your findings, it's essential to acknowledge these limitations and explain how they might affect your conclusions. This demonstrates good statistical practice and helps others evaluate the strength of your evidence.
Key Points to Remember:
- Always examine your data carefully before beginning any analysis - look for outliers, missing values, and inconsistencies
- Clean your data systematically by checking units, correcting obvious errors, and deciding how to handle missing information
- Consider how your data collection method might affect the reliability of your results
- Small sample sizes and self-reported data can limit how confidently you can generalise your findings
- Document any data problems and cleaning steps you've taken - this transparency helps others evaluate your work