Data Collection (HSC SSCE Mathematics Standard): Revision Notes
Introduction to data collection
Data collection is a fundamental process in statistics that involves three key steps: deciding what information you need, finding where that information exists, and actually gathering it. The reliability of any statistical analysis depends heavily on the quality of the data collected at this stage.
When collecting data, you need to obtain information from appropriate sources and ensure proper procedures are in place. This helps guarantee that your data is accurate, current, relevant and stored securely. Poor quality data from unreliable sources will lead to incorrect conclusions, no matter how sophisticated your analysis techniques might be.
The foundation of any statistical analysis is the quality of data collected. Even the most advanced analytical methods cannot compensate for poor quality data from unreliable sources.
Sources of data
Data can be obtained from two main types of sources, each with different characteristics and uses.
Primary sources
Primary sources involve collecting data directly yourself through first-hand methods. This includes interviewing people face-to-face, conducting questionnaires or surveys, or observing a system while it operates. When you use primary sources, you have direct control over what data is collected and how it is gathered.
Example of Primary Data Collection: If you wanted to know students' opinions about school canteen food, you could interview students directly or distribute a questionnaire asking for their feedback. This would be primary data collection because you are gathering the information yourself for your specific purpose.
Secondary sources
Secondary sources involve using data that has already been collected or created by someone else. This includes information from newspapers, books, websites, government reports, or existing databases. Secondary sources can save significant time and effort since the data already exists.
Example of Secondary Data: If you wanted to analyse population trends in Australia, you could use data from the Australian Bureau of Statistics rather than conducting your own nationwide survey. This existing data is a secondary source.
Ensuring data quality
Regardless of which source you use, it is essential that your data meets quality standards. Data should be:
- Accurate: Free from errors and correctly measured
- Up-to-date: Current and relevant to the time period being studied
- Relevant: Directly related to the question you are investigating
- Secure: Properly stored and protected, especially if it contains personal information
If your data fails to meet these standards, any conclusions drawn from it will be unreliable and potentially misleading.
Census and samples
When collecting data about a group of people or items, you can either survey everyone or just a portion. These approaches are called a census and a sample respectively.
Census
A census is a survey that collects data from every single member of a population. The population refers to the entire group you are interested in studying. For example, if you wanted to survey all students in your school, that entire student body would be your population, and surveying every single student would be a census.
The main advantage of a census is that it provides complete and accurate information about the entire population. However, censuses have significant drawbacks:
- They are very expensive to conduct
- They take a long time to complete
- They may be impractical or impossible if the population is very large
For these reasons, a census is usually conducted only when complete coverage is essential, such as the national Census of Population and Housing, which the Australian Bureau of Statistics conducts every five years.
Sample
A sample involves collecting data from only part of a population. For instance, if you surveyed just the students in your mathematics class about their study habits, this class would be a sample of the entire school population.
Using a sample has several advantages:
- Much cheaper than a census
- Quicker to complete
- More manageable for large populations
However, samples also have limitations:
- Not as accurate as a census
- Prone to bias if not selected carefully
- May not represent the population well if too small
When using a sample, you make estimates about the entire population based on the characteristics of your sample. For this to work effectively, your sample must be large enough to give a good representation of the population, but small enough to remain practical and manageable.
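The idea of estimating a population value from a sample can be sketched in a few lines of Python. This is only an illustration: the population of 600 students, the weekly study hours and the sample size of 60 are all made-up values, not figures from these notes.

```python
import random
import statistics

# Hypothetical population: weekly study hours for 600 students (made-up data).
rng = random.Random(1)  # fixed seed so the run is repeatable
population = [rng.randint(0, 20) for _ in range(600)]

# Census: use every member of the population.
census_mean = statistics.mean(population)

# Sample: use only 60 members, chosen without replacement.
sample = rng.sample(population, 60)
sample_mean = statistics.mean(sample)

print(f"Census mean: {census_mean:.2f} hours (600 students)")
print(f"Sample mean: {sample_mean:.2f} hours (60 students)")
```

The sample mean is an estimate of the census mean. It will usually be close but not identical, and a very small sample would tend to give a less dependable estimate.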
Types of sampling methods
There are several different methods for selecting a sample from a population. Each method has its own strengths and weaknesses, and the choice depends on your specific situation and research goals.
Random sample
In a random sample, every member of the population has an equal chance of being selected. This is similar to how lottery numbers are drawn - each number has exactly the same probability of being chosen.
For example, imagine selecting a group of students at random from your entire school population. If done properly, every student has the same likelihood of being one of those selected. Similarly, when lottery numbers are drawn from a pool of possible numbers, each number in the pool has an equal chance of being drawn.
Advantages:
- Simple to understand and implement
- Works well for small populations
- Unbiased selection process
Disadvantages:
- May miss representing certain groups in large populations
- Sometimes difficult to ensure truly equal chances
- May not capture important subgroups
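As a minimal sketch of random sampling, the Python standard library's `random.sample` picks members without replacement so that each one is equally likely to be chosen. The helper name `random_sample`, the roll of 200 students and the sample size of 10 are assumptions made for illustration.

```python
import random

def random_sample(population, size, seed=None):
    """Return `size` members, each member being equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(population, size)

# Hypothetical school roll of 200 students.
students = [f"Student {i}" for i in range(1, 201)]

chosen = random_sample(students, 10, seed=42)
print(chosen)
```

Because selection is without replacement, no student can appear twice in the sample.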
Stratified sample
A stratified sample involves first dividing the population into categories (called strata) and then randomly selecting members from within each category. This ensures that all important groups are represented in your sample.
For instance, if you wanted to survey students about school policies, you might divide students by year level and then randomly select one student from each year group. This guarantees that every year level has representation in your sample.
Common categories used in stratified sampling include:
- Age groups
- Gender
- Year level
- Religion
- Marital status
- Income brackets
Advantages:
- Ensures all important subgroups are represented
- Particularly useful when categories are clear and simple
- Reduces bias by guaranteeing diversity
Disadvantages:
- Requires careful thought about which categories to use
- Can introduce bias if categories are chosen poorly
- More complex than simple random sampling
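The two steps of stratified sampling (divide into strata, then select randomly within each stratum) might look like this in Python. The function `stratified_sample`, its `stratum_of` parameter, the year levels 7 to 12 and the 30 students per year are hypothetical choices for the sketch.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(members, stratum_of, per_stratum, seed=None):
    """Group members into strata, then randomly pick `per_stratum` from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in members:
        strata[stratum_of(member)].append(member)  # step 1: divide into strata
    sample = []
    for key in sorted(strata):
        sample.extend(rng.sample(strata[key], per_stratum))  # step 2: random pick
    return sample

# Hypothetical school: 30 students in each of the assumed year levels 7 to 12.
students = [(f"Y{year}-{i}", year) for year in range(7, 13) for i in range(1, 31)]

chosen = stratified_sample(students, stratum_of=lambda s: s[1], per_stratum=2, seed=0)
print(Counter(year for _, year in chosen))  # two students from every year level
```

Taking the same number from each stratum is the simplest choice. When strata differ greatly in size, it is common to select from each in proportion to its size instead.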
Systematic sample
Systematic sampling occurs when the population is organised in some way, and then members are selected at regular intervals following a structured pattern. This creates a gap of consistent size between each selection.
For example, you might arrange all students in a school alphabetically and then select students at a fixed interval down the list, so that the gap between consecutive selections is always the same number of students.
Practical Application: This method is commonly used in quality control. A manufacturer might test items at a fixed interval as they come off a production line, or check a machine's performance at fixed time intervals. These regular intervals help monitor consistency over time.
Advantages:
- Easy to implement once the system is established
- Spreads selections evenly across the population
- Useful for ongoing monitoring processes
Disadvantages:
- May introduce bias if there is a hidden pattern in the population
- Less random than other methods
- Requires the population to be organised in advance
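Systematic sampling reduces to stepping through an ordered list at a fixed interval, which a Python slice expresses directly. The function name `systematic_sample`, the roll of 40 students and the interval of 5 are assumptions chosen only to make the sketch concrete.

```python
def systematic_sample(ordered_population, interval):
    """Select every `interval`-th member of an already ordered population."""
    # Index interval-1 is the interval-th member; then step by `interval`.
    return ordered_population[interval - 1::interval]

# Hypothetical roll of 40 students, sorted alphabetically.
students = sorted(f"Student {i:03d}" for i in range(1, 41))

chosen = systematic_sample(students, interval=5)
print(chosen)  # the 5th, 10th, 15th, ... and 40th students on the list
```

If the ordering itself hides a repeating pattern with the same period as the interval, the selection will keep landing on the same kind of member, which is the source of the bias noted above.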
Self-selected sample
In a self-selected sample (also called a volunteer sample), members of the population choose to participate themselves rather than being selected by the researcher. Participants volunteer to be part of the survey or study.
For example, six students from a school might voluntarily offer to complete a questionnaire about school facilities. Nobody selected these students - they chose to participate on their own.
Self-selected samples are extremely common on the internet, where websites ask visitors to complete surveys or provide feedback. Only people who are motivated and interested choose to respond.
Advantages:
- Easy and inexpensive to collect
- Participants are often willing and cooperative
- Common and practical for online research
Disadvantages:
- High potential for bias
- Only captures views of motivated individuals
- May miss the opinions of people who are busy, uninterested, or have different viewpoints
- Results may not represent the broader population
Common Pitfall: Self-selected samples often suffer from significant bias because only people with strong opinions or particular motivations choose to participate. This means the sample may not accurately represent the views of the entire population.
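A small numerical sketch shows how volunteer bias can distort a result. All figures here are invented for illustration: a population of 100 people in which 70 are satisfied (score 4) and 30 are dissatisfied (score 1), with dissatisfied people assumed to be much more likely to volunteer.

```python
from statistics import mean

# Hypothetical population of 100 satisfaction scores (made-up data).
satisfied = [4] * 70
dissatisfied = [1] * 30
population = satisfied + dissatisfied

# Assumed response pattern: 7 of the 70 satisfied people volunteer,
# but 15 of the 30 dissatisfied people do.
volunteers = satisfied[:7] + dissatisfied[:15]

population_mean = mean(population)  # (70*4 + 30*1) / 100 = 3.1
volunteer_mean = mean(volunteers)   # (7*4 + 15*1) / 22, roughly 1.95

print(f"Whole population: {population_mean:.2f}")
print(f"Volunteers only:  {volunteer_mean:.2f}")
```

Under these assumptions the volunteers' average is well below the true average, so a survey that relied on them would make satisfaction look much lower than it really is.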
Comparison of sampling methods
Worked example: Distinguishing sample types
Worked Example: Identifying Sampling Methods in a Retirement Village
A retirement village has both women and men among its residents. Decide whether each of the following samples of residents would be random, stratified, systematic or self-selected.
a) Every seventh resident
b) Six of the women and three of the men
c) Nine names picked from a hat containing the names of the residents
d) Residents sorted into alphabetical order and each ninth resident selected
e) Residents are divided into four age groups and two residents are selected from each age group
Solution:
a) Systematic sample
Residents are selected at regular intervals: every seventh resident means the 7th, 14th, 21st, and so on up to the 63rd resident. This regular pattern is characteristic of systematic sampling.
b) Stratified sample
The population has been divided into two categories: women and men. Then members from each category are selected (six women and three men). Dividing into categories first, then selecting from each category, defines stratified sampling.
c) Random sample
The sample is taken completely at random. When names are picked from a hat, each resident has an equal chance of being selected. This is the defining characteristic of random sampling.
d) Systematic sample
The population is first organised into alphabetical order, creating a structure. Then a pattern is applied by selecting every ninth resident. This structured, regular selection process makes it systematic sampling.
e) Stratified sample
The population has been divided into four categories based on age groups. Two residents are then randomly selected from each of these age categories. This process of categorising first, then selecting from each category, makes it stratified sampling.
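The positions picked in parts (a) and (d) can be checked with a short Python sketch. It assumes, purely for illustration, that the village has 63 residents numbered 1 to 63 in their listed order.

```python
# Assumed for illustration: 63 residents, numbered 1 to 63 in list order.
residents = list(range(1, 64))

every_seventh = residents[6::7]  # part (a): 7th, 14th, 21st, ...
every_ninth = residents[8::9]    # part (d): 9th, 18th, 27th, ...

print(every_seventh)  # [7, 14, 21, 28, 35, 42, 49, 56, 63]
print(every_ninth)    # [9, 18, 27, 36, 45, 54, 63]
```

In both cases the gap between consecutive selections is constant, which is what identifies the sample as systematic.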
Remember!
Key Points to Remember:
- Data collection requires careful planning to ensure you gather accurate, current, relevant and secure information from reliable sources.
- Primary sources involve collecting data yourself directly, while secondary sources use existing data collected by others.
- A census surveys everyone in a population (accurate but expensive), while a sample surveys only part of a population (cheaper but less accurate).
- Random sampling gives everyone equal selection chance; stratified sampling divides into categories first; systematic sampling selects at regular intervals; self-selected sampling relies on volunteers.
- Choose your sampling method carefully based on your population size, available resources, and the importance of representing different subgroups.