Conducting a Regression Analysis Using Data Revision Notes for VCE SSCE General Mathematics

Introduction

A regression analysis allows us to investigate the association between two numerical variables and make predictions. In your statistical investigation project, you will need to conduct a complete regression analysis from start to finish. This note will guide you through the essential steps and help you understand what each part of the analysis tells us.

Important

A full regression analysis is a required component of statistical investigation projects. Mastering this process will enable you to draw meaningful conclusions from bivariate data.

Understanding variables

Before starting any regression analysis, you must identify two key variables:

Explanatory Variable (EV): This is the variable we use to make predictions. It is plotted on the horizontal axis (x-axis) of a scatterplot.

Response Variable (RV): This is the variable we are trying to predict. It is plotted on the vertical axis (y-axis) of a scatterplot.

The choice of which variable is explanatory and which is response depends on the research question. We are investigating whether the explanatory variable can be used to predict the response variable.

Note

Remember: The explanatory variable (EV) predicts the response variable (RV). Think of it as: "Does EV help us predict RV?"

The worked example

Throughout this note, we'll use a real example investigating the association between birth rate (births per 1000 people) and life expectancy (in years) across 10 countries. Let's identify our variables:

Explanatory Variable (EV): birth rate
Response Variable (RV): life expectancy

Here is the data we'll be working with:

Step-by-step regression analysis

Step 1: Identify your variables

Clearly state which variable is the explanatory variable and which is the response variable. This decision should be based on your research question.

For our example:

EV: birth (birth rate)
RV: life (life expectancy)

Step 2: Enter the data

Input your bivariate data into your calculator, using meaningful variable names. Each pair of values should be entered as a row, with the explanatory variable in one column and the response variable in another.

Note

Using clear variable names (like "birth" and "life") rather than generic names (like "x" and "y") makes it much easier to interpret your results later.

Step 3: Construct a scatterplot

Create a scatterplot with the explanatory variable on the horizontal axis and the response variable on the vertical axis. This visual representation helps you assess the nature of the relationship between the variables.

Step 4: Describe the association

When describing the association shown in a scatterplot, you must comment on four features. A useful way to remember these is the acronym DFSO:

D - Direction: Is the association positive (upward trend) or negative (downward trend)?
F - Form: Is the relationship linear (straight line pattern) or non-linear (curved pattern)?
S - Strength: How closely do the points follow the pattern? Strength can be described as weak, moderate, or strong.
O - Outliers: Are there any points that don't fit the general pattern?

Important

Always use the DFSO framework when describing scatterplots in your statistical investigation. A complete description must address all four components: Direction, Form, Strength, and Outliers.

Example

Worked Example: Describing the Scatterplot

For our birth rate and life expectancy data:

There is a strong, negative, linear relationship between life expectancy and birth rate. There are no obvious outliers.

Breaking it down:

Direction: Negative (as birth rate increases, life expectancy decreases)
Form: Linear (points follow a straight line pattern)
Strength: Strong (points are close to the line pattern)
Outliers: None (all points fit the general pattern)

Step 5: Find the regression equation

Calculate the least squares regression line. This is the straight line that best fits the data by minimising the sum of squared residuals. Your calculator will provide:

The regression equation in the form $y = a + b \times x$
The correlation coefficient ($r$)
The coefficient of determination ($r^2$)

Example

Worked Example: Regression Output

For our birth rate and life expectancy data:

Regression equation: $\text{life} = 105.4 - 1.445 \times \text{birth}$

Correlation coefficient: $r = -0.8069$

Coefficient of determination: $r^2 = 0.651$
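The calculator does this work for you, but it can help to see what it computes. The following is a minimal sketch, assuming plain Python and no statistics library. The function name `least_squares` and the four data points are made up for illustration and are not the birth rate data from this note.

```python
from math import sqrt


def least_squares(xs, ys):
    """Return (a, b, r) for the least squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations and of cross-products
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return a, b, r


# Made-up, perfectly linear values used only to exercise the function
xs = [1, 2, 3, 4]
ys = [10, 8, 6, 4]
a, b, r = least_squares(xs, ys)
print(f"y = {a:.1f} + {b:.1f} * x, r = {r:.3f}, r^2 = {r * r:.3f}")
```

Because the made-up points lie exactly on a falling straight line, the sketch should report a negative slope and a correlation coefficient at the negative end of its range.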

Step 6: Generate a residual plot

A residual plot helps us check whether a linear model is appropriate for our data. The residual for each data point is the difference between the actual $y$-value and the predicted $y$-value, that is, residual = actual value − predicted value.

To create a residual plot, plot the residuals on the vertical axis against the explanatory variable on the horizontal axis.

What to look for: If the linear model is appropriate, the residual plot should show a random scatter of points around zero, with no clear pattern.
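A single residual can be computed as sketched below, using the regression equation from Step 5. The function names and the sample country (birth rate 30, life expectancy 60) are hypothetical and are not part of the data set.

```python
def predicted_life(birth):
    """Predicted life expectancy (years) from the fitted line in this note."""
    return 105.4 - 1.445 * birth


def residual(actual_life, birth):
    """Residual = actual value - predicted value."""
    return actual_life - predicted_life(birth)


# Hypothetical country, invented for illustration only
res = residual(60, 30)
print(round(res, 2))
```

A negative residual means the actual point lies below the regression line, and a positive residual means it lies above.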

Example

Worked Example: Interpreting the Residual Plot

For our birth rate and life expectancy data:

The residuals are randomly scattered, which supports the assumption of linearity.

This means our linear model is appropriate for this data. The points are scattered randomly around zero with no obvious pattern or trend.

Step 7: Interpret the results

Once you have all the output from your regression analysis, you need to interpret what it means. The following sections explain how to interpret each component.

Interpreting regression output

The regression equation

The regression equation has the form:

$\text{response} = a + b \times \text{explanatory}$

Where:

$a$ is the intercept (the predicted response value when the explanatory variable equals zero)
$b$ is the slope (the average change in the response variable for each one-unit increase in the explanatory variable)
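For reference, the least squares slope and intercept can be written in terms of summary statistics. The symbols $\bar{x}$ and $\bar{y}$ (the means) and $s_x$ and $s_y$ (the standard deviations) of the explanatory and response variables are notation introduced here, not taken from the note above:

$$
b = r \times \frac{s_y}{s_x}, \qquad a = \bar{y} - b \times \bar{x}
$$

The sign of $b$ always matches the sign of $r$, because standard deviations are positive.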

Example

Worked Example: Interpreting the Regression Equation

For our example: $\text{life} = 105.4 - 1.445 \times \text{birth}$

Interpretation of the slope: For each additional birth per 1000 people, life expectancy decreases by approximately 1.445 years on average.

Interpretation of the intercept: When birth rate is zero, the predicted life expectancy would be 105.4 years.

Note

The intercept often doesn't make practical sense, because interpreting it usually means extrapolating beyond the range of the data. In this example, a birth rate of zero is not realistic for any country.

The correlation coefficient (r)

The correlation coefficient measures the strength and direction of the linear relationship between the two variables.

Values range from $-1$ to $+1$
The sign (positive or negative) indicates direction
The magnitude (size) indicates strength:
- $|r| \approx 0.8$ to $1.0$ suggests a strong correlation
- $|r| \approx 0.5$ to $0.8$ suggests a moderate correlation
- $|r| \approx 0$ to $0.5$ suggests a weak correlation
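The guidelines above can be sketched as a small helper. This is only an illustration: the function name `describe_correlation` is hypothetical, and because the listed ranges share their endpoints, the sketch assumes that a value of exactly 0.8 or 0.5 falls in the stronger category.

```python
def describe_correlation(r):
    """Describe r using the approximate cut-offs listed in this note."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no linear correlation"

    size = abs(r)
    if size >= 0.8:    # assumption: 0.8 itself counts as strong
        strength = "strong"
    elif size >= 0.5:  # assumption: 0.5 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"

    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


label = describe_correlation(-0.8069)
print(label)
```

The cut-offs are guidelines rather than strict rules, so values close to a boundary are still best judged alongside the scatterplot.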

Example

Worked Example: Interpreting the Correlation Coefficient

For our example: $r = -0.8069$

This indicates a strong negative correlation. The negative sign tells us that as birth rate increases, life expectancy tends to decrease. The magnitude of approximately 0.81 indicates a strong relationship.

The coefficient of determination (r²)

The coefficient of determination tells us the proportion of the variation in the response variable that can be explained by the variation in the explanatory variable.

Values range from $0$ to $1$
Often expressed as a percentage by multiplying by 100

Example

Worked Example: Interpreting the Coefficient of Determination

For our example: $r^2 = 0.651$

Interpretation: Approximately 65.1% of the variation in life expectancy can be explained by the variation in birth rate.

The remaining 34.9% of variation is due to other factors not included in our model.
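As a quick check, the coefficient of determination is the square of the correlation coefficient, so both values from the regression output should agree:

$$
\begin{aligned}
r^2 &= (-0.8069)^2 \approx 0.651 \\
0.651 \times 100 &= 65.1\% \\
100\% - 65.1\% &= 34.9\%
\end{aligned}
$$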

Note

To convert $r^2$ to a percentage, simply multiply by 100. For example: $0.651 \times 100 = 65.1\%$

Using the regression equation for predictions

Once you have a regression equation, you can use it to make predictions. Simply substitute a value for the explanatory variable and calculate the predicted response value.

Example

Worked Example: Making a Prediction

Using our equation $\text{life} = 105.4 - 1.445 \times \text{birth}$, what is the predicted life expectancy for a country with a birth rate of 35 per 1000?

$$
\begin{aligned}
\text{life} &= 105.4 - 1.445 \times 35 \\
&= 105.4 - 50.575 \\
&= 54.825 \text{ years}
\end{aligned}
$$

Answer: The predicted life expectancy is approximately 54.8 years.
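The same substitution can be sketched in code. The function name `predict_life` and its parameter names are hypothetical; the intercept and slope are the values from the regression equation above.

```python
def predict_life(birth_rate, intercept=105.4, slope=-1.445):
    """Predicted life expectancy (years) for a given birth rate per 1000."""
    return intercept + slope * birth_rate


prediction = predict_life(35)
print(f"{prediction:.1f} years")
```

Keeping the intercept and slope as parameters means the same function could be reused if the regression were refitted to different data.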

Important

Beware of extrapolation! Be cautious about making predictions outside the range of your data. The relationship may not continue in the same way beyond the data values you observed.
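One simple way to flag extrapolation is to compare the value being substituted with the smallest and largest observed values of the explanatory variable. The sketch below is illustrative only: the function name is hypothetical and the birth rates are invented, not the data used in this note.

```python
def is_extrapolation(value, observed_values):
    """Return True if value lies outside the range of the observed data."""
    return value < min(observed_values) or value > max(observed_values)


# Hypothetical birth rates, invented for illustration only
observed_birth_rates = [12, 18, 25, 31, 40]

within_range = is_extrapolation(35, observed_birth_rates)
beyond_range = is_extrapolation(0, observed_birth_rates)
print(within_range, beyond_range)
```

With these invented values, a birth rate of 35 falls inside the observed range, while a birth rate of 0 falls outside it and would be an extrapolation.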

Checking the linearity assumption

The residual plot is crucial for checking whether a linear model is appropriate:

Random scatter: Points scattered randomly around zero with no pattern → Linear model is appropriate
Curved pattern: Points form a curve → Non-linear relationship, linear model not appropriate
Fan shape: Spread of points increases or decreases across the plot → Variation is not constant, so a different approach may be needed

Important

Always examine your residual plot before relying on your regression equation for predictions. If the residual plot shows a pattern, a linear model may not be appropriate for your data.

Summary

Key Points to Remember:

Identify variables first: Clearly determine which is the explanatory variable (EV) and which is the response variable (RV) based on your research question.
Use DFSO to describe scatterplots: Always comment on Direction, Form, Strength, and Outliers when describing an association.
Interpret the slope carefully: The slope tells you the average change in the response variable for each one-unit increase in the explanatory variable.
Understand r²: The coefficient of determination ($r^2$) tells you what proportion of the variation in the response variable is explained by the variation in the explanatory variable. Multiply by 100 to convert it to a percentage.
Check linearity: Always generate and examine a residual plot to confirm that a linear model is appropriate before making predictions. Look for random scatter around zero.
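Watch for extrapolation: Predictions made within the range of your original data are more reliable than predictions made outside it, because the relationship may not continue in the same way beyond the observed values.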
