Terminology (Leaving Cert Mathematics): Revision Notes
Statistics Terminology
Understanding populations and samples
In statistics, we need to distinguish between the group we want to learn about and the group we actually study.
Population refers to the complete group of individuals or items that we want to investigate. This could be all Leaving Cert students in Ireland, all cars manufactured in 2023, or all trees in a particular forest. The population represents everyone or everything we're interested in understanding.
Sample means a smaller subset taken from the population for practical study. Since we usually cannot examine every member of a population (imagine trying to survey every Leaving Cert student!), we select a manageable group to represent the whole.
For example, if we want to understand the study habits of all 60,000 Leaving Cert students in Ireland (the population), we might survey 200 randomly chosen students (the sample) to draw conclusions about the entire group.
A representative sample accurately reflects the characteristics of the entire population. This means the different groups should appear in the sample in roughly the same proportions as they do in the population.
Bias occurs when certain groups are over-represented or under-represented in our sample, leading to inaccurate conclusions about the population. Bias is one of the biggest threats to valid statistical conclusions.
Types of data
Understanding how to classify data helps us choose appropriate analysis methods. Data falls into two main categories, each with important subcategories.
Categorical data consists of information sorted into distinct categories or groups.
- Nominal data has no natural ordering between categories. Examples include colours, gender, or favourite subjects. You cannot rank these meaningfully.
- Ordinal data has categories with a natural order or ranking. Examples include satisfaction ratings (poor, fair, good, excellent) or finishing positions in a race.
Numerical data represents measured or counted quantities.
- Discrete data consists of countable values, often whole numbers. Examples include number of students in a class, goals scored in a match, or books owned.
- Continuous data can take any value within a range and is typically measured rather than counted. Examples include height, weight, time, or temperature.
Data Classification Hierarchy: The distinction between these data types determines which statistical methods and graphs are appropriate for analysis. Always identify your data type before choosing analysis techniques.
Sampling methods and errors
Different methods exist for selecting samples, including simple random, systematic, stratified, and quota sampling. Each method has specific advantages and applications depending on the research context.
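To make three of these methods concrete, here is a minimal Python sketch. The population of 1,000 student IDs, the two year groups, and the function names are all invented for illustration; quota sampling is left out because it relies on the interviewer's choice rather than random selection.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 student IDs, each tagged with a year group.
population = [(student_id, "5th year" if student_id % 2 else "6th year")
              for student_id in range(1, 1001)]

def simple_random_sample(items, n):
    """Every member has an equal chance of being chosen."""
    return rng.sample(items, n)

def systematic_sample(items, n):
    """Pick a random starting point, then take every k-th member."""
    k = len(items) // n
    start = rng.randrange(k)
    return items[start::k][:n]

def stratified_sample(items, n, key):
    """Sample each group in proportion to its share of the population."""
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        share = round(n * len(members) / len(items))
        sample.extend(rng.sample(members, share))
    return sample

by_year = stratified_sample(population, 100, key=lambda person: person[1])
print(len(by_year))
```

Because each year group makes up half of this made-up population, the stratified sample takes half of its members from each group.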
Sampling error represents the natural difference between what we find in our sample and the true population value. This occurs simply because we're studying part of the population, not all of it.
Non-sampling error includes mistakes unrelated to sample selection, such as measurement errors, recording mistakes, or response bias.
Even with perfect sampling methods, sampling error will always exist. The key is to reduce it by using a sufficiently large sample and sound selection techniques, while keeping non-sampling errors to a minimum through careful data collection procedures.
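A short simulation can show why sample size matters for sampling error. This is only a sketch: the population of 10,000 heights is randomly generated (not real data), and the helper name typical_sampling_error is an assumption made for this example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 invented heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)

def typical_sampling_error(sample_size, trials=200):
    """Average gap between a sample mean and the population mean."""
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(gaps)

small = typical_sampling_error(10)
large = typical_sampling_error(400)
print(f"n = 10:  typical error {small:.2f} cm")
print(f"n = 400: typical error {large:.2f} cm")
```

The larger samples should land closer to the population mean on average, but the gap never disappears entirely.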
Measures of central tendency
These statistics help us identify the "typical" or "average" value in our data. Each measure has specific uses and limitations.
Mean calculates the arithmetic average by adding all values and dividing by the number of observations. The mean can be significantly affected by extreme values (outliers), which may make it less representative of typical values.
Worked Example: Calculating the Mean
Test scores: 65, 70, 72, 75, 78, 82, 95
Mean $= \frac{65 + 70 + 72 + 75 + 78 + 82 + 95}{7} = \frac{537}{7} \approx 76.7$
Median identifies the middle value when data is arranged in order. With an even number of values, we take the average of the two middle numbers. The median remains stable even when extreme values are present.
Worked Example: Finding the Median
Same test scores arranged in order: 65, 70, 72, 75, 78, 82, 95
With 7 values, the median is the 4th value: 75
Mode represents the most frequently occurring value in the dataset. A dataset might have no mode, one mode, or multiple modes.
When to Use Each Measure:
- Use mean for symmetrical data without outliers
- Use median when data is skewed or contains outliers
- Use mode for categorical data or when identifying the most common value
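The three measures can be checked with Python's built-in statistics module. The sketch below reuses the test scores from the worked examples; the shoe sizes and the extra score of 5 are made-up values added only to illustrate the mode and the effect of an outlier.

```python
import statistics

scores = [65, 70, 72, 75, 78, 82, 95]  # test scores from the worked examples

mean_score = statistics.mean(scores)      # 537 / 7
median_score = statistics.median(scores)  # 4th of the 7 ordered values

# Hypothetical data set, since no test score is repeated.
shoe_sizes = [5, 6, 6, 7, 8, 8, 9]
modes = statistics.multimode(shoe_sizes)  # every value tied for most frequent

# Hypothetical outlier: one very low score added to the list.
with_outlier = scores + [5]
mean_shift = mean_score - statistics.mean(with_outlier)
median_shift = median_score - statistics.median(with_outlier)

print(round(mean_score, 1), median_score, modes)
print(round(mean_shift, 1), median_shift)
```

Comparing the two shifts shows the mean moving much further than the median when the extreme value is added, which is why the median is preferred for skewed data.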
Measures of spread
These statistics describe how much the data varies around the central value, giving us insight into data consistency and reliability.
Range provides the simplest measure of spread by calculating the difference between the maximum and minimum values: $\text{Range} = x_{\max} - x_{\min}$.
Interquartile range (IQR) measures the spread of the middle 50% of data by finding the difference between the third quartile and first quartile: $\text{IQR} = Q_3 - Q_1$.
Variance calculates the average of the squared deviations from the mean, providing a measure of overall variability: $\sigma^2 = \frac{\sum (x - \mu)^2}{n}$, where $\mu$ is the mean and $n$ is the number of values.
Standard deviation represents the typical distance of data points from the mean. It's calculated as the square root of variance and uses the same units as the original data: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$.
Outliers are data points that fall far from most other observations and can significantly impact our analysis. They're typically defined as values more than 1.5 × IQR below Q₁ or above Q₃.
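The measures of spread can be worked through in Python using the same test scores as before. Treat this as a sketch: quartile conventions vary between textbooks and calculators, and the default method of statistics.quantiles is assumed here, so quartiles for other data sets may differ slightly from a hand calculation.

```python
import statistics

scores = [65, 70, 72, 75, 78, 82, 95]  # test scores from the worked examples

data_range = max(scores) - min(scores)

# Quartiles using the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

variance = statistics.pvariance(scores)  # divides by n, not n - 1
std_dev = statistics.pstdev(scores)      # square root of the variance

# Outlier rule: more than 1.5 x IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(data_range, iqr, round(variance, 2), round(std_dev, 2), outliers)
```

Note that pvariance and pstdev divide by the number of values, matching the "average of the squared deviations" definition used in these notes.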
Generalisability and data presentation
Generalisability refers to our ability to apply sample findings to the broader population. This is only valid when our sample is representative and bias is minimised.
Common Data Presentation Methods: Different visualisation methods suit different purposes: frequency tables and histograms for numerical data, bar charts and pie charts for categorical data, box plots for showing spread and outliers, and scatter plots for relationships between variables.
Probability concepts
Basic probability terminology forms the foundation for understanding statistical inference and decision-making under uncertainty.
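An outcome is a single possible result of an experiment, and the sample space is the set of all possible outcomes. An event is a set of one or more outcomes that we're interested in, such as rolling an even number on a die.
Mutually exclusive events cannot occur at the same time, so P(A and B) = 0. Independent events are those where the occurrence of one does not change the probability of the other, so P(A and B) = P(A) × P(B).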
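These ideas can be tested on a small sample space. The sketch below assumes a fair coin tossed twice, so all four outcomes are equally likely; the event names and the probability helper are invented for this example.

```python
from fractions import Fraction
from itertools import product

# Sample space for tossing a fair coin twice: HH, HT, TH, TT.
sample_space = {"".join(tosses) for tosses in product("HT", repeat=2)}

def probability(event):
    """P(event) when every outcome is equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

# Events are sets of outcomes.
first_heads = {o for o in sample_space if o[0] == "H"}
first_tails = {o for o in sample_space if o[0] == "T"}
second_heads = {o for o in sample_space if o[1] == "H"}

# Mutually exclusive: the events share no outcomes.
mutually_exclusive = probability(first_heads & first_tails) == 0

# Independent: P(A and B) equals P(A) x P(B).
independent = (probability(first_heads & second_heads)
               == probability(first_heads) * probability(second_heads))

print(mutually_exclusive, independent)
```

The first toss cannot be both heads and tails, so those two events are mutually exclusive, while the result of the first toss has no effect on the second, so those events are independent.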
Understanding these probability concepts is essential for hypothesis testing and confidence intervals, which form the basis of statistical inference.
Correlation and regression
Correlation describes the relationship between two variables, which can be positive (both increase together), negative (one increases as the other decreases), or show no relationship.
The correlation coefficient (r) quantifies this relationship on a scale from -1 to +1, where $r = -1$ indicates perfect negative correlation, $r = 0$ indicates no linear relationship, and $r = +1$ indicates perfect positive correlation.
Line of best fit represents the straight line that best describes the relationship between two variables, minimising the sum of squared residuals.
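As an illustration, the correlation coefficient and the line of best fit can be computed step by step in Python. The hours and marks below are invented data for five imaginary students, and the variable names are assumptions for this sketch; in an exam you would normally get r from your calculator.

```python
import math

# Hypothetical data: hours studied (x) and test mark (y) for five students.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(marks) / n

# Sums of products and squares of the deviations from the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, marks))
s_xx = sum((x - mean_x) ** 2 for x in hours)
s_yy = sum((y - mean_y) ** 2 for y in marks)

r = s_xy / math.sqrt(s_xx * s_yy)    # correlation coefficient
slope = s_xy / s_xx                  # least-squares slope
intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)

print(f"r = {r:.3f}, line: y = {slope:.1f}x + {intercept:.1f}")
```

The least-squares line always passes through the point of means, which is why the intercept can be found from the slope and the two means.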
Remember that correlation does not imply causation. A strong correlation between two variables doesn't mean one causes the other - there may be other factors involved or the relationship may be coincidental.
Common statistical misuse
Be Aware of These Problems in Statistical Presentations:
- Biased samples that don't represent the population
- Misleading graphs with inappropriate scales or formats
- Using inappropriate averages for the data type
- Ignoring the importance of sample size in drawing conclusions
- Cherry-picking data that supports a predetermined conclusion
- Confusing correlation with causation
These misuses can lead to incorrect conclusions and poor decision-making.
Summary
Key Points to Remember:
- Population is everyone you want to study; sample is who you actually study
- Categorical data goes in groups; numerical data involves numbers you can calculate with
- Mean is affected by outliers; median is more resistant to extreme values
- IQR measures the spread of the middle 50% of your data
- A sample must be representative to make valid conclusions about the population
- Always consider the context and limitations of your data when drawing conclusions