Inferential Statistics and Tests (Edexcel A-Level Psychology): Revision Notes
What are inferential statistics?
Descriptive statistics (such as summary tables and graphs) display data but do not reveal whether observed differences between conditions reflect genuine effects or simply chance variation. Inferential statistics allow researchers to determine whether the independent variable has produced a real effect on the dependent variable.
When data differs between conditions, researchers must establish whether this difference represents a true effect or random variation. An inferential test of significance provides this answer by testing the likelihood that results occurred by chance. The test indicates whether to retain or reject the null hypothesis (which assumes no real effect) in favour of the alternative hypothesis (which predicts a real effect).
The fundamental purpose of inferential statistics is to distinguish between genuine experimental effects and random chance variation. Without these tests, researchers cannot be confident that their results reflect real psychological phenomena rather than coincidental patterns in the data.
Understanding probability in inferential testing
Inferential tests operate on the principle of probability – the likelihood of an event occurring. For instance, the probability of obtaining heads when tossing a fair coin equals 0.5 (50% or one in two). When testing data, researchers assess the probability that differences between conditions arose from random chance rather than the independent variable.
The test evaluates whether to retain the null hypothesis (no real effect exists) or support the alternative hypothesis (a real effect exists). This decision rests on whether results were likely produced by chance factors or something systematic.
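To make the idea of chance concrete, the following minimal Python sketch computes exact probabilities for the fair-coin example using the binomial formula. The function name `prob_at_least` and the choice of ten tosses are illustrative assumptions, not part of the original notes.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One toss of a fair coin: probability of heads
single = prob_at_least(1, 1)

# Nine or more heads in ten tosses of a fair coin (11 outcomes out of 1024)
nine_plus = prob_at_least(9, 10)

print(single, round(nine_plus, 4))  # 0.5 0.0107
```

Nine or more heads would turn up by chance only about 1% of the time with a fair coin. An inferential test formalises this kind of judgement about how unlikely a result is under chance alone.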
Significance levels in psychology
The 0.05 criterion
Psychology generally adopts p ≤ 0.05 (often written p < 0.05) as the threshold for determining statistical significance. This value means researchers accept a probability of up to 1 in 20 (5%) that the results occurred by chance.
When conducting an inferential test, the outcome reveals whether this 0.05 probability threshold has been met.
If the probability of results occurring by chance equals or falls below 0.05, researchers support the alternative hypothesis. However, if the probability of results occurring by chance exceeds 0.05, the null hypothesis must be retained.
When an inferential test proves significant, researchers can be 95% confident that the result reflects a real effect, with only a 5% (or smaller) likelihood that it occurred by chance. Conversely, a non-significant result means researchers have less than 95% confidence in the prediction, because the likelihood that the results occurred by chance exceeds 5%.
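The decision rule can be sketched in a few lines of Python. The function name `decide` and the example probabilities are hypothetical. The only rule taken from the notes is that a probability equal to or below the significance level supports the alternative hypothesis.

```python
def decide(p_value, alpha=0.05):
    """Return the decision implied by a probability value at a given significance level."""
    if p_value <= alpha:
        return "support alternative hypothesis"
    return "retain null hypothesis"

print(decide(0.03))  # support alternative hypothesis
print(decide(0.05))  # support alternative hypothesis (equal to the threshold counts)
print(decide(0.20))  # retain null hypothesis
```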
Other significance levels
Whilst 0.05 represents the accepted standard, researchers may encounter or use alternative significance levels:
- p < 0.1 (10% or 1 in 10 probability): Results may still be reported at this level, though they require further investigation
- p < 0.01 (1% or 1 in 100 probability): Indicates higher confidence in results
Statistical tests do not 'prove' results are true. Researchers can only claim the test reasonably supports the alternative hypothesis, or that insufficient confidence exists so the null hypothesis must be retained.
Types of error in hypothesis testing
Two types of error can occur when interpreting inferential test results, both arising from the choice of significance level:
Type 1 error
A Type 1 error occurs when the null hypothesis is rejected and the alternative hypothesis accepted, despite no real effect existing. This is more likely when the significance level is too lenient (for example, accepting p < 0.1 rather than p < 0.05): an insufficiently stringent significance level increases the risk of rejecting the null hypothesis when no genuine effect exists.
Think of Type 1 errors as "false positives" – claiming an effect exists when it doesn't. This is like a false alarm: you think you've found something important, but it was actually just chance.
Type 2 error
A Type 2 error occurs when the alternative hypothesis is rejected and the null hypothesis retained, even though a real effect does exist. This happens when the significance level is too stringent (for example, requiring p < 0.01 rather than accepting p < 0.05). Retaining the null hypothesis when a real effect existed constitutes a Type 2 error, arising because the significance level was set too strictly.
Think of Type 2 errors as "false negatives" – missing a real effect that actually exists. This is like failing to detect something important that was really there. There's an inherent trade-off: being too cautious (strict significance levels) risks missing real effects, whilst being too lenient risks finding effects that aren't real.
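The trade-off can be illustrated with a small exact calculation. This is a sketch under stated assumptions, not part of the original notes: it imagines a one-tailed test of whether a coin is biased towards heads, using 20 tosses and a hypothetical biased coin that lands heads 70% of the time. The names `tail_prob` and `error_rates` are illustrative.

```python
from math import comb

def tail_prob(k, n, p):
    """P(X >= k) for X following a binomial distribution with n trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def error_rates(alpha, n=20, true_p=0.7):
    """Type 1 and Type 2 error rates for a one-tailed 'biased to heads' test.

    The null hypothesis (fair coin) is rejected when the number of heads
    reaches the smallest cut-off whose chance probability is <= alpha.
    """
    cutoff = next(k for k in range(n + 2) if tail_prob(k, n, 0.5) <= alpha)
    type1 = tail_prob(cutoff, n, 0.5)         # false positive rate if the coin is fair
    type2 = 1 - tail_prob(cutoff, n, true_p)  # false negative rate if the coin is biased
    return type1, type2

lenient = error_rates(alpha=0.10)
strict = error_rates(alpha=0.01)
print("lenient:", [round(x, 3) for x in lenient])
print("strict: ", [round(x, 3) for x in strict])
```

Under these assumptions the lenient level gives the larger Type 1 rate and the strict level gives the larger Type 2 rate, which is the trade-off described above.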
Levels of measurement in data
Different inferential tests suit different types of data. Understanding data levels helps determine which test to apply.
Nominal level data
Nominal data represents the most basic data form. This categorical or grouped data involves calculating the total number of values in each category. The data provides limited information because researchers only know category totals, not individual differences within categories.
Example: Pet Ownership Survey
A class survey on pet ownership would generate data on how many students own pets versus those who do not, or the types of pets students own. The class would indicate whether they own a pet, and researchers would calculate the frequency of pet owners versus non-owners. Researchers know very little about individual differences between pet or non-pet owners – only the totals for each category.
Similarly, dividing a class into students under 1.85 metres tall versus over 1.85 metres tall creates nominal data by calculating frequency in each category. However, this reveals nothing about actual individual heights or how heights relate to one another.
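As a minimal sketch, nominal data can be represented as category labels that are simply counted. The survey responses and heights below are made up for illustration. Assigning a height of exactly 1.85 m to the lower category is also an assumption, since the notes do not say where it belongs.

```python
from collections import Counter

# Hypothetical survey responses: one category label per student
responses = ["pet", "no pet", "pet", "pet", "no pet", "pet"]
pet_counts = Counter(responses)
print(pet_counts["pet"], pet_counts["no pet"])  # 4 2

# Hypothetical heights in metres, reduced to two categories
heights_m = [1.62, 1.90, 1.78, 1.86, 1.70]
height_counts = Counter(
    "over 1.85 m" if h > 1.85 else "1.85 m or under" for h in heights_m
)
print(height_counts["over 1.85 m"], height_counts["1.85 m or under"])  # 2 3
```

Once the heights are reduced to category totals, the individual values can no longer be recovered, which is why nominal data carries so little information.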
Ordinal level data
Ordinal data provides more information by ranking values into an order or position. For instance, schools may collect house points and award prizes to the house with the greatest points at year end. Each house receives a rank order of first, second, third, fourth based on their position. This data reveals the position of each house but not how many points were achieved or the difference between ranks.
The house in first place might have only ten points more than second place, whilst the house in third place may lag far behind second place with 100 fewer points. This illustrates the key limitation of ordinal data: equal intervals between ranks do not represent equal differences in the measured quality.
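The house points example can be sketched in Python as follows. The house names and point totals are hypothetical, chosen so that the gaps match the ten-point and 100-point differences described above.

```python
# Hypothetical end-of-year house points
points = {"Red": 410, "Blue": 400, "Green": 300, "Yellow": 120}

# Rank 1 = most points; ranks keep the order but discard the size of each gap
ordered = sorted(points, key=points.get, reverse=True)
ranks = {house: position for position, house in enumerate(ordered, start=1)}
print(ranks)  # {'Red': 1, 'Blue': 2, 'Green': 3, 'Yellow': 4}

gap_first_second = points[ordered[0]] - points[ordered[1]]
gap_second_third = points[ordered[1]] - points[ordered[2]]
print(gap_first_second, gap_second_third)  # 10 100
```

The ranks differ by exactly one each time, yet the underlying gaps are 10 and 100 points. That lost information is the limitation of ordinal data.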
Ordinal data commonly derives from arbitrary scales, such as test grades or attractiveness ratings from 1-10. The scales are arbitrary because intervals between each value are not equal in reality. The difference between grade A and grade B does not equal the difference between grade C and grade D, nor will someone rated 5 on an attractiveness scale be half as attractive as someone rated 10.
Interval and ratio level data
With interval and ratio data, researchers understand the differences between each value because measurements use a scale where intervals between values are equal. Typically, interval and ratio data come from recognised scales or tested psychological instruments.
The only difference between interval and ratio data is that ratio data has an absolute zero. Measurements such as height in centimetres, time in seconds, and distance in kilometres represent ratio data because they start at zero on the scale, whereas temperature in degrees Celsius is interval data because 0 °C does not mean an absence of temperature.
For selecting inferential tests, researchers only need to distinguish between nominal data and 'at least' ordinal level data – interval and ratio data can be treated as ordinal when choosing which test to use.
Selecting the appropriate inferential test
Researchers must consider several factors when choosing which inferential test to use:
- Are you investigating a difference or relationship between variables?
- Are you using a related or unrelated design?
- What type of data are you analysing?
Decision tree for test selection
Test Selection Guide
The following structure helps determine which test to apply:
Looking for a difference?
- Nominal data → Chi-squared test
- Ordinal data → depends on design:
  - Repeated measures or matched pairs → Wilcoxon Signed Ranks test
  - Independent groups → Mann-Whitney U test
Looking for a relationship?
- Nominal data → Chi-squared test
- Ordinal data → Spearman's rho test
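The selection guide can be expressed as a small function. This is a sketch only: the function name `select_test` and the string labels are hypothetical. Interval and ratio data are treated as ordinal, as the notes advise for test selection.

```python
def select_test(purpose, data_level, design=None):
    """Pick an inferential test following the selection guide.

    purpose:    "difference" or "relationship"
    data_level: "nominal", "ordinal", "interval" or "ratio"
    design:     "repeated measures", "matched pairs" or "independent groups"
                (only needed for a difference with at-least-ordinal data)
    """
    if data_level == "nominal":
        return "Chi-squared test"
    if purpose == "relationship":
        return "Spearman's rho test"
    if purpose != "difference":
        raise ValueError(f"unknown purpose: {purpose!r}")
    if design in ("repeated measures", "matched pairs"):
        return "Wilcoxon Signed Ranks test"
    if design == "independent groups":
        return "Mann-Whitney U test"
    raise ValueError(f"unknown design: {design!r}")

print(select_test("difference", "ordinal", "matched pairs"))    # Wilcoxon Signed Ranks test
print(select_test("difference", "ratio", "independent groups")) # Mann-Whitney U test
print(select_test("relationship", "ordinal"))                   # Spearman's rho test
```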
Key design terms
Repeated measures design: All participants complete all conditions of the experiment.
Matched pairs design: Different participants are allocated to only one experimental condition but are matched on important characteristics (for example, IQ). For statistical test selection purposes, matched pairs designs should be treated as repeated measures designs.
Independent groups design: Participants are split into separate groups, and each group completes only one condition of the experiment.
Memory aid: "Mr Wilcoxon is cross (×) - does both conditions" (repeated measures), whilst "Mr Mann and Mr Whitney are happy - one condition each" (independent groups).
The Wilcoxon Signed Ranks test
When to use the Wilcoxon test
The Wilcoxon Signed Ranks test is employed when:
- Data is at ordinal level or above
- The experimental design uses repeated measures or matched pairs
- Researchers are testing for a difference between two conditions
Calculation procedure
The Wilcoxon test follows these steps:
- Calculate the difference between pairs of scores achieved by each participant on both tests (for example, subtracting column A score from column B score)
- Rank the score differences, ignoring any plus or minus signs (this is called ranking the absolute differences)
- Calculate the sum total of ranks for positive differences and the sum total of ranks for negative differences
- The smaller of these scores becomes the T value (the test statistic or calculated value)
When assigning ranks, scores with equal values share the rank positions they would occupy: the positions are averaged between them. For instance, if two equal scores would occupy rank positions 2 and 3, each receives a rank of 2.5 (just as two equal scores in positions 7 and 8 would each receive 7.5).
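The calculation procedure, including shared ranks for tied values, can be sketched in Python. The helper names and the eight pairs of scores are hypothetical. Zero differences are discarded before ranking, in line with the definition of N used for the critical value tables.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, sharing ranks between ties."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1        # first position this value occupies
        count = ordered.count(v)            # number of tied scores sharing it
        ranks[v] = first + (count - 1) / 2  # mean of the positions occupied
    return [ranks[v] for v in values]

def wilcoxon_t(condition_a, condition_b):
    """Return (T, N) for paired scores."""
    diffs = [b - a for a, b in zip(condition_a, condition_b)]
    diffs = [d for d in diffs if d != 0]            # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])  # rank the absolute differences
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(positive, negative), len(diffs)

# Hypothetical scores for eight participants in two conditions
a = [10, 12, 9, 14, 11, 8, 13, 10]
b = [13, 11, 12, 18, 11, 10, 15, 9]
t_value, n = wilcoxon_t(a, b)
print(t_value, n)  # 3.0 7
```

Here one participant scored the same in both conditions, so N is 7 rather than 8. The positive ranks sum to 25 and the negative ranks to 3, giving T = 3.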
Interpreting the Wilcoxon test
To determine whether the calculated T value shows significance (indicating a real difference), compare the calculated value to critical values in a Wilcoxon Signed Ranks test table.
Critical Rule for Wilcoxon Test
For the Wilcoxon test, the calculated value of T must be equal to or less than the table (critical) value for significance at the specified level.
Critical value tables account for:
- The number of participants (N, which equals the number of scores after removing those with zero difference)
- The significance level (typically 0.05)
- Whether the test is one-tailed or two-tailed
One-tailed and two-tailed tests
One-tailed tests
A one-tailed test is used when the direction of difference can be predicted, meaning a directional hypothesis has been stated. For example, predicting that more words will be recalled from a categorised list than a non-categorised list represents a directional hypothesis. The accepted significance level in psychology is 0.05.
Two-tailed tests
A two-tailed test is used when the direction of difference cannot be predicted, meaning a non-directional hypothesis has been stated. Researchers simply predict a difference exists but cannot specify which condition will produce higher or lower scores.
Choosing Between One-tailed and Two-tailed Tests
- One-tailed: Use when you can predict the direction (e.g., "Condition A will score higher than Condition B")
- Two-tailed: Use when you only predict a difference exists but not which direction (e.g., "There will be a difference between Condition A and Condition B")
The Mann-Whitney U test
When to use the Mann-Whitney test
The Mann-Whitney U test is employed when:
- Data is at ordinal level or above
- The experimental design uses independent groups
- Researchers are testing for a difference between two conditions
Calculation procedure
The Mann-Whitney test follows these steps:
- Use the Mann-Whitney U test formulae:
$$U_a = n_a n_b + \frac{n_a(n_a + 1)}{2} - \sum R_a$$
$$U_b = n_a n_b + \frac{n_b(n_b + 1)}{2} - \sum R_b$$
Where:
- $n_a$ and $n_b$ = number of participants in each group
- $\sum R_a$ and $\sum R_b$ = sum total of ranks for each group
- $U$ is the smaller of $U_a$ and $U_b$
- Rank all scores as a whole group (combining both groups together), assigning position 1, position 2, and so forth
- Find the sum (total) of ranks for each group, after dividing the ranks back into their original sets
- Calculate $U_a$ and $U_b$ using the formulae
- The lower value of $U_a$ or $U_b$ is taken as the U value (the test statistic)
Calculation Steps Summary
Step 1: Combine all scores from both groups and rank them together as one dataset
Step 2: Separate the ranks back into their original groups (Group A and Group B)
Step 3: Sum the ranks for each group separately to find $\sum R_a$ and $\sum R_b$
Step 4: Apply the formulae to calculate both $U_a$ and $U_b$
Step 5: Select the smaller U value as your test statistic
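The five steps can be sketched in Python. Because the formula itself did not survive in these notes, the sketch assumes the standard form U_a = n_a × n_b + n_a(n_a + 1)/2 − ΣR_a (and likewise for U_b). The helper names and the two groups of scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, sharing ranks between ties."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1
        count = ordered.count(v)
        ranks[v] = first + (count - 1) / 2
    return [ranks[v] for v in values]

def mann_whitney_u(group_a, group_b):
    """Return (U, U_a, U_b) for two independent groups of scores."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))  # Step 1: rank together
    sum_r_a = sum(ranks[:n_a])                            # Steps 2-3: rank sums
    sum_r_b = sum(ranks[n_a:])
    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - sum_r_a       # Step 4: apply formulae
    u_b = n_a * n_b + n_b * (n_b + 1) / 2 - sum_r_b
    return min(u_a, u_b), u_a, u_b                        # Step 5: smaller value

# Hypothetical scores for two groups of five participants
group_a = [7, 9, 12, 15, 8]
group_b = [14, 16, 18, 11, 20]
u, u_a, u_b = mann_whitney_u(group_a, group_b)
print(u, u_a, u_b)  # 3.0 22.0 3.0
```

A useful check on the arithmetic is that U_a + U_b should equal n_a × n_b (here 22 + 3 = 25).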
Interpreting the Mann-Whitney test
To determine significance, compare the calculated U value to critical values in a Mann-Whitney U test table.
Critical Rule for Mann-Whitney Test
For the Mann-Whitney test, the observed value of U is significant at the given level if it is equal to or less than the table (critical) value.
The Mann-Whitney test has several different critical value tables. First, decide which table to use by referring to the title of each table. Tables exist for different significance levels:
- One-tailed test at 0.005; two-tailed test at 0.01
- One-tailed test at 0.01; two-tailed test at 0.02
- One-tailed test at 0.025; two-tailed test at 0.05
- One-tailed test at 0.05; two-tailed test at 0.1
Researchers must:
- Decide whether the hypothesis is directional (one-tailed) or non-directional (two-tailed)
- Determine the significance level to be used (typically 0.05, though 0.1 or 0.01 may be requested)
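Choosing the right table can be sketched as a simple lookup over the four table headings listed above. The names `TABLES` and `find_table` are hypothetical, and no critical values are included because they must be read from the published tables.

```python
# Table headings as listed above: (one-tailed level, two-tailed level)
TABLES = [(0.005, 0.01), (0.01, 0.02), (0.025, 0.05), (0.05, 0.1)]

def find_table(alpha, tails):
    """Return the (one-tailed, two-tailed) table heading matching the test."""
    column = {1: 0, 2: 1}[tails]  # one-tailed levels come first in each heading
    for table in TABLES:
        if table[column] == alpha:
            return table
    raise ValueError(f"no table listed for a {tails}-tailed test at {alpha}")

print(find_table(0.05, tails=1))  # (0.05, 0.1)
print(find_table(0.05, tails=2))  # (0.025, 0.05)
```

In every heading the two-tailed level is double the one-tailed level. A non-directional hypothesis tested at 0.05 therefore uses the same table as a directional hypothesis tested at 0.025.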
Examination Tip
When displaying calculations in examinations, show all workings step by step, using the formulae as demonstrated in worked examples. This allows you to gain credit for your method even if your final answer is incorrect.
Remember!
Key Points to Remember:
- Inferential statistics determine whether observed differences reflect real effects or chance variation; the standard significance level in psychology is p < 0.05 (1 in 20 or 5% probability)
- Type 1 errors occur when the null hypothesis is incorrectly rejected (false positive), whilst Type 2 errors occur when the null hypothesis is incorrectly retained (false negative)
- Different data types require different tests: nominal data is categorical, ordinal data is ranked, and interval/ratio data uses equal interval scales
- The Wilcoxon Signed Ranks test is used for repeated measures or matched pairs designs with ordinal data (T value must be equal to or less than critical value)
- The Mann-Whitney U test is used for independent groups designs with ordinal data (U value must be equal to or less than critical value)