Outliers & Cleaning Data Revision Notes for AQA A-Level Mathematics

2.3.1 Outliers & Cleaning Data

Outliers are data points that are significantly different from the rest of the data. They can either be much higher or much lower than the other values in the data set. Outliers can affect statistical analyses by skewing results, so it's important to identify and handle them appropriately.

Identifying Outliers

Box Plots:

One of the simplest ways to identify outliers is by using a box plot. In a box plot:

Any data point outside of the whiskers ( $1.5$ times the interquartile range above $Q3$ or below $Q1$ ) is considered an outlier.

Z-Scores:

The Z-score measures how many standard deviations a data point is from the mean. Data points with a Z-score greater than $3$ or less than $-3$ are often considered outliers.
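As a minimal sketch of this check, the Python below computes a Z-score for each value and flags any beyond $\pm 3$. The function name `z_scores`, the use of the population standard deviation, and the sample data are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def z_scores(data):
    """Return the Z-score of each value: (x - mean) / standard deviation."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation (an assumption here)
    return [(x - mu) / sigma for x in data]

# Hypothetical data: twenty values of 10 and a single value of 100
data = [10] * 20 + [100]
scores = z_scores(data)

# Keep the values whose Z-score is greater than 3 or less than -3
outliers = [x for x, z in zip(data, scores) if abs(z) > 3]
print(outliers)  # [100]
```

Note that in a very small data set no value can reach a Z-score of $3$, which is one reason the sketch uses $21$ values.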

Interquartile Range (IQR) Method:

Calculate the IQR: $\text{IQR} = Q3 - Q1$
Find the lower bound: $Q1 - 1.5 \times \text{IQR}$
Find the upper bound: $Q3 + 1.5 \times \text{IQR}$
Any data point outside this range is considered an outlier.
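The steps above can be sketched in Python as follows. The function names `iqr_bounds` and `find_outliers` are hypothetical, and the quartiles are passed in by hand (for example, as read from a calculator) rather than computed, because different tools use slightly different quartile conventions.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Return the values lying outside the IQR boundaries."""
    lower, upper = iqr_bounds(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Hypothetical data whose quartiles are Q1 = 20 and Q3 = 30
data = [1, 19, 21, 24, 26, 29, 31, 60]
print(iqr_bounds(20, 30))           # (5.0, 45.0)
print(find_outliers(data, 20, 30))  # [1, 60]
```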


Example: Consider the data set:

$2, 5, 7, 8, 10, 12, 15, 18, 20, 100$

Q1 (Lower Quartile) = $7$ (the median of the lower half $2, 5, 7, 8, 10$)
Q3 (Upper Quartile) = $18$ (the median of the upper half $12, 15, 18, 20, 100$)
IQR = $18 - 7 = 11$
Lower Bound = $7 - (1.5 \times 11) = -9.5$ (No outliers on the lower side)
Upper Bound = $18 + (1.5 \times 11) = 34.5$
Here, $100$ is an outlier because it is greater than $34.5$.

Cleaning Data

Cleaning data involves addressing issues in the data, such as outliers, missing values, or errors, to ensure accurate analysis.

Steps for Cleaning Data:

Identify and Handle Outliers
Handle Missing Data
Correct Errors
Standardise and Normalise Data

Identify and Handle Outliers:

Remove Outliers: If an outlier is due to an error or doesn't belong to the dataset context, it may be removed.
Transform Data: Sometimes, transforming the data (e.g., using logarithms) can reduce the impact of outliers.
Use Robust Statistics: Instead of the mean, use the median, which is less sensitive to outliers.
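To illustrate why the median is described as robust, the short sketch below compares the mean and median of the data set from the earlier example, with and without the value $100$. The variable names are illustrative only.

```python
from statistics import mean, median

data = [2, 5, 7, 8, 10, 12, 15, 18, 20, 100]
trimmed = [x for x in data if x != 100]  # same data with the outlier removed

mean_all, median_all = mean(data), median(data)
mean_trimmed, median_trimmed = mean(trimmed), median(trimmed)

print(mean_all, median_all)                    # 19.7 11.0
print(round(mean_trimmed, 2), median_trimmed)  # 10.78 10
```

Removing the single outlier shifts the mean by almost $9$, while the median moves by only $1$.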

Handle Missing Data:

Remove Missing Data: If the data set is large, you can remove rows or columns with missing values.
Impute Missing Data: Replace missing data with a reasonable estimate, such as the mean, median, or mode of the remaining data.
Use Algorithms: Advanced methods like k-nearest neighbours or regression models can predict and fill in missing values.
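A minimal sketch of simple imputation is given below. It assumes missing entries are recorded as `None`; the function name `impute` and the scores are hypothetical, and only the mean and median strategies are shown.

```python
from statistics import mean, median

def impute(values, strategy=mean):
    """Replace None entries with a summary (mean by default) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Hypothetical scores with one missing entry marked as None
scores = [80, 90, None, 95, 100]
print(impute(scores))                   # [80, 90, 91.25, 95, 100]
print(impute(scores, strategy=median))  # [80, 90, 92.5, 95, 100]
```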

Correct Errors:

Typographical Errors: Correct any data entry errors, like typos or incorrect values.
Consistency Checks: Ensure data is consistent across the data set. For instance, if a person's age is entered as 200, it's likely an error.

Standardise and Normalise Data:

Standardisation: Adjust data to have a mean of 0 and a standard deviation of 1 by subtracting the mean and dividing by the standard deviation. This is useful when variables are measured on different scales.
Normalisation: Scale data to a range, usually between 0 and 1, which is useful when comparing different data sets.
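Both rescalings can be sketched in a few lines of Python. The function names `standardise` and `normalise` are hypothetical, the population standard deviation is assumed, and the data must contain at least two different values so that neither divisor is zero.

```python
from statistics import mean, pstdev

def standardise(data):
    """Rescale so the result has mean 0 and standard deviation 1."""
    mu, sigma = mean(data), pstdev(data)
    return [(x - mu) / sigma for x in data]

def normalise(data):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

data = [2, 4, 6, 8, 10]  # hypothetical values
z = standardise(data)
print([round(v, 3) for v in z])
print(normalise(data))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```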


Example: Cleaning Data

Suppose you have the following data set of test scores:

$85, 90, 95, 100, 110, 700$

Step 1: Identify Outliers:

The score $700$ is an outlier.

Step 2: Handle the Outlier:

Investigate the cause. If it's a data entry error, correct or remove it.

Step 3: Handle Missing Data:

If you had a missing value in the test scores, you might replace it with the mean score or use another method.

Step 4: Check for Consistency:

Ensure all scores are within the expected range ( $0$ to $100$ ). Here, $110$ also lies outside this range, so it should be investigated in the same way as $700$.

Cleaning your data ensures that your analysis is based on accurate and reliable data, leading to more trustworthy results.
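The consistency check in this example can be sketched as a simple range filter. The function name `out_of_range` is hypothetical, and dropping the flagged scores is only one option: in practice each flagged value should be investigated and corrected where possible.

```python
def out_of_range(values, low=0, high=100):
    """Return the values that fall outside the expected range [low, high]."""
    return [v for v in values if not low <= v <= high]

scores = [85, 90, 95, 100, 110, 700]

suspect = out_of_range(scores)
print(suspect)  # [110, 700]

# One possible treatment: keep only the scores that pass the check
cleaned = [v for v in scores if v not in suspect]
print(cleaned)  # [85, 90, 95, 100]
```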

An outlier is an item of data that lies:

more than $2$ standard deviations from the mean, or
more than $1.5$ interquartile ranges below $Q_1$ or above $Q_3$.


Example: Cleaning Data with Standard Deviation

Let's go through a detailed example to understand how to formally identify outliers using standard deviation.

Question: Using standard deviation, formally identify any outliers in the following set: $1.3, 2.4, 6.7, 2.8, 3.9, 0.1$

Step 1: Calculate the Mean and Standard Deviation

Using a calculator, we can find the mean and standard deviation of the data set:

Mean ( $\bar{x}$ ) = $2.867$
Standard deviation ( $\sigma$ ) = $2.085$

Step 2: Determine the Outlier Boundaries

Outliers are defined as data points that lie outside two standard deviations from the mean.

We calculate the boundaries for the outliers using the formula: $\bar{x} \pm 2\sigma$

Substitute the values of the mean ( $\bar{x}$ ) and standard deviation ( $\sigma$ ):

$\bar{x} \pm 2\sigma = 2.867 \pm 2 \times 2.085$

$= 2.867 \pm 4.17$

Thus, the boundaries for outliers are: $-1.303$ and $7.037$

Step 3: Identify the Outliers

Any data points that fall outside the range $[−1.303,7.037]$ are considered outliers.

Checking the data set:

$1.3, 2.4, 6.7, 2.8, 3.9, 0.1$

All of these values lie within the range $[−1.303,7.037]$ , so there are no outliers in this data set.

Explanation:

Since no data points lie outside the boundaries of $[−1.303,7.037]$ , we conclude that this data set has no outliers.
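The calculation can be checked with a short Python sketch. It assumes the calculator reports the population standard deviation, which is what `statistics.pstdev` computes; the variable names are illustrative.

```python
from statistics import mean, pstdev

data = [1.3, 2.4, 6.7, 2.8, 3.9, 0.1]

mu = mean(data)
sigma = pstdev(data)  # population standard deviation

# Boundaries two standard deviations either side of the mean
lower, upper = mu - 2 * sigma, mu + 2 * sigma
outliers = [x for x in data if x < lower or x > upper]

print(round(mu, 3), round(sigma, 3))  # 2.867 2.085
print(round(lower, 3), round(upper, 3))
print(outliers)  # []
```

Because the code keeps full precision rather than the rounded values $2.867$ and $2.085$, the printed boundaries may differ from the hand calculation in the last decimal place.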


Example: Identifying Outliers using the IQR

Let's go through a detailed example to understand how to identify outliers using the Interquartile Range (IQR).

Question: Using the same data set as before, identify any outliers using the IQR method: $1.3, 2.4, 6.7, 2.8, 3.9, 0.1$

Step 1: Calculate the Median and IQR

The median and quartiles can be found using a calculator:

Median ( $\text{Med}$ ) = $2.6$
Lower quartile ( $Q_1$ ) = $1.3$
Upper quartile ( $Q_3$ ) = $3.9$

Thus, the IQR (Interquartile Range) is:

$\text{IQR} = Q_3 - Q_1 = 3.9 - 1.3 = 2.6$
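The quartile values above can be reproduced with the sketch below. It assumes the "median of each half" convention (the middle value is left out of both halves when the number of values is odd), which gives the values quoted here; other software may use a different quartile convention and give slightly different answers. The function name `quartiles` is hypothetical, and at least two data values are assumed.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    s = sorted(data)
    half = len(s) // 2
    lower_half = s[:half]            # values below the middle
    upper_half = s[len(s) - half:]   # values above the middle
    return median(lower_half), median(s), median(upper_half)

q1, med, q3 = quartiles([1.3, 2.4, 6.7, 2.8, 3.9, 0.1])

# Rounded to 1 d.p. to hide floating-point noise in the averages
print(q1, round(med, 1), q3, round(q3 - q1, 1))  # 1.3 2.6 3.9 2.6
```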

Step 2: Determine the Outlier Boundaries

Outliers are defined as any data points that lie 1.5 times the IQR above $Q_3$ or below $Q_1$ .

We calculate the boundaries for outliers using the formulas:

$Q_1 - 1.5 \times \text{IQR}$ and $Q_3 + 1.5 \times \text{IQR}$

Substitute the values:

Lower boundary $= 1.3 - 1.5 \times 2.6 = 1.3 - 3.9 = -2.6$

Upper boundary $= 3.9 + 1.5 \times 2.6 = 3.9 + 3.9 = 7.8$

Thus, the outlier boundaries are:

$[-2.6, 7.8]$

Step 3: Identify the Outliers

Any data points that fall outside the range $[-2.6, 7.8]$ are considered outliers.

Checking the data set:

$1.3, 2.4, 6.7, 2.8, 3.9, 0.1$

All of these values lie within the range $[-2.6, 7.8]$, so there are no outliers in this data set.

Explanation:

The largest value, $6.7$, is below the upper boundary of $7.8$, and the smallest value, $0.1$, is above the lower boundary of $-2.6$. The IQR method therefore agrees with the standard deviation method: this data set has no outliers.
