Fitting a Least Squares Regression Line to Numerical Data (VCE SSCE General Mathematics): Revision Notes
What is linear regression?
When we want to model the relationship between two numerical variables with a straight line, we use a process called linear regression. The line we create is known as the regression line.
Every straight line can be described using an equation in the form:
$$y = a + bx$$
In this equation:
- $a$ represents the y-intercept (where the line crosses the y-axis)
- $b$ represents the slope (how steep the line is)
To fit a line to our data, we need to find the values of $a$ and $b$ that make the line pass as close as possible to all our data points.
Methods for fitting a line
There are two main approaches to fitting a line to bivariate data:
Drawing by eye
The simplest method is to plot your data on a scatterplot and draw a line with a ruler that seems to follow the general pattern. However, this method has a major weakness: different people will draw different lines, making it unreliable and inconsistent.
The least squares method
The more rigorous approach is the least squares method. This mathematical technique finds the one line that best fits the data according to a specific criterion. The method assumes the variables have a linear relationship and works most effectively when there are no obvious outliers in your data.
Understanding residuals
Before we can explain the least squares method, we need to understand what a residual is.

A residual is the vertical difference between an actual data point and the regression line: $\text{residual} = \text{actual value} - \text{predicted value}$. In other words, it measures how far each point sits above or below the line.
Residuals can be:
- Positive: when the data point sits above the regression line
- Negative: when the data point sits below the regression line
- Zero: when the data point falls exactly on the regression line
Think of residuals as prediction errors. If we use the regression line to predict a value and the actual value is different, the residual tells us how much our prediction was off by.
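To make the idea of a residual concrete, here is a minimal Python sketch. The data points and the line $y = 1 + 2x$ are hypothetical values chosen only for illustration, and the function name `residuals` is our own. Each residual is the actual value minus the value the line predicts.

```python
# Hypothetical data and a hypothetical line y = a + b*x (illustration only).
xs = [1, 2, 3, 4]
ys = [3, 4, 8, 9]
a, b = 1.0, 2.0  # assumed intercept and slope


def residuals(xs, ys, a, b):
    """Return actual y minus predicted y for each data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


res = residuals(xs, ys, a, b)
for x, y, d in zip(xs, ys, res):
    # Positive: point above the line; negative: below; zero: on the line.
    position = "above" if d > 0 else "below" if d < 0 else "on"
    print(f"x={x}, y={y}, residual={d:+.1f} ({position} the line)")
```

With these values the four residuals are 0, -1, 1 and 0, so the sketch shows a point on the line, one below it and one above it.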
The least squares line
The least squares line is special because it minimises the sum of the squares of all the residuals. Mathematically, this means it makes this expression as small as possible:
$$d_1^2 + d_2^2 + \dots + d_n^2$$
where $d_1$, $d_2$, etc. are the residuals.
Why do we square the residuals?
You might wonder why we square the residuals instead of just adding them up. Here's why: some residuals are positive and some are negative, so in a plain sum they cancel each other out. For the least squares line, the residuals always sum to exactly zero, because the line balances the data in the same way a see-saw balances weight on both sides.
Squaring the residuals solves this problem by making every term zero or positive, so minimising the total genuinely keeps the points close to the line.
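The sketch below illustrates why the plain sum of residuals is not a useful measure of fit. The data set is hypothetical. For this particular data the least squares line works out to $y = 2.2 + 0.6x$, and we compare it with a flat line through the mean, $y = 4$, chosen only as a contrast. The helper name `summarise` is our own.

```python
# Hypothetical data set (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def summarise(a, b):
    """Return (sum of residuals, sum of squared residuals) for y = a + b*x."""
    res = [y - (a + b * x) for x, y in zip(xs, ys)]
    return sum(res), sum(d ** 2 for d in res)


# Least squares line for this data: y = 2.2 + 0.6x.
ls_sum, ls_ssr = summarise(2.2, 0.6)
# A flat comparison line through the mean of y: y = 4.
flat_sum, flat_ssr = summarise(4.0, 0.0)

print(f"least squares: sum = {ls_sum:.2f}, sum of squares = {ls_ssr:.2f}")
print(f"flat line:     sum = {flat_sum:.2f}, sum of squares = {flat_ssr:.2f}")
```

Both lines have residuals that sum to zero (up to floating-point rounding), so the plain sum cannot tell them apart. The sum of squares can: it is about 2.4 for the least squares line and 6 for the flat line.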
Assumptions for using the least squares method
Before fitting a least squares line, we must check that our data meets three important conditions:
Critical Assumptions for Least Squares Regression:
- The data must be numerical - both variables need to be measured on a numerical scale
- The association must be linear - the relationship between variables should form a roughly straight pattern
- There should be no clear outliers - extreme values can distort the regression line
These are the same assumptions we use when calculating the correlation coefficient.
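To see how a single outlier can distort the line, here is a rough Python sketch using hypothetical data. Five points lie exactly on $y = 2x$, and we then add one extreme point, (6, 0). The function `fit_line` is our own helper. It computes the slope from sums of deviations from the means, which is an assumed but equivalent way of writing the least squares slope.

```python
def fit_line(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

a_clean, b_clean = fit_line(xs, ys)
# Same data with one extreme point (6, 0) added.
a_out, b_out = fit_line(xs + [6], ys + [0])

print(f"without outlier: y = {a_clean:.2f} + {b_clean:.2f}x")
print(f"with outlier:    y = {a_out:.2f} + {b_out:.2f}x")
```

One extra point is enough to pull the slope from 2 down to roughly 0.29, which is why we check the scatterplot for outliers before fitting.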
Formulas for the least squares regression line
To find the equation of the least squares line, we need to calculate the slope ($b$) and intercept ($a$). While the mathematics behind these formulas involves calculus, you can use these rules:
Slope:
$$b = \frac{r s_y}{s_x}$$
Intercept:
$$a = \bar{y} - b\bar{x}$$
Where:
- $r$ is the correlation coefficient
- $s_x$ and $s_y$ are the standard deviations of $x$ and $y$
- $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$
- $y$ is the response variable (what we want to predict)
- $x$ is the explanatory variable (what we use to make predictions)
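The two formulas can be sketched in Python as follows. The function name `least_squares_from_summary` and the summary statistics passed to it are hypothetical, chosen only to show the order of the calculation: the slope first, then the intercept, which uses the slope.

```python
def least_squares_from_summary(r, s_x, s_y, mean_x, mean_y):
    """Return (intercept a, slope b) for the line y = a + b*x."""
    b = r * s_y / s_x          # slope: b = r * s_y / s_x
    a = mean_y - b * mean_x    # intercept: a = mean(y) - b * mean(x)
    return a, b


# Hypothetical summary statistics (illustration only).
a, b = least_squares_from_summary(r=0.8, s_x=2.0, s_y=5.0,
                                  mean_x=10.0, mean_y=50.0)
print(f"y = {a:.2f} + {b:.2f}x")
```

With these made-up values the slope is $0.8 \times 5 \div 2 = 2$ and the intercept is $50 - 2 \times 10 = 30$.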
Finding the correlation coefficient from the slope
If you know the slope of the regression line but need to find the correlation coefficient, you can rearrange the slope formula:
$$r = \frac{b s_x}{s_y}$$
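As a quick check of this rearrangement, the Python sketch below multiplies the slope by the standard deviation of $x$ and divides by the standard deviation of $y$. The function name `correlation_from_slope` and the input values are hypothetical, used only for illustration.

```python
def correlation_from_slope(b, s_x, s_y):
    """Rearranged slope formula: r = b * s_x / s_y."""
    return b * s_x / s_y


# Hypothetical values (illustration only).
b, s_x, s_y = 2.0, 2.0, 5.0
r = correlation_from_slope(b, s_x, s_y)

# A correlation coefficient must lie between -1 and 1. A value outside
# that range suggests s_x and s_y have been swapped.
assert -1 <= r <= 1
print(f"r = {r:.3f}")
```

The range check is a useful habit in hand calculations too: if your answer falls outside $-1$ to $1$, the two standard deviations are probably the wrong way around.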
Critical: Always Calculate Slope First!
You must correctly identify which variable is explanatory ($x$) and which is response ($y$) before you start your calculations. Getting this wrong will give you an incorrect regression line. The question will usually tell you which variable predicts which (for example, "predict weight from height" means height is the explanatory variable). Then calculate the slope before the intercept, because the intercept formula uses the slope.
Worked example 1: Finding the regression line using summary statistics
Worked Example: Calculating the Least Squares Regression Line
Question: The height and weight of 11 people have been recorded. The following statistics were calculated:

Use the formulas to find the equation of the least squares regression line that enables weight to be predicted from height. Round the slope and intercept to two decimal places.
Solution:
Step 1: Identify the variables
Since we want to predict weight from height:
- Explanatory variable (EV): height ($x$)
- Response variable (RV): weight ($y$)
Step 2: Write down the given information
Step 3: Calculate the slope
(rounded to two decimal places)
Step 4: Calculate the intercept
(rounded to two decimal places)
Step 5: Write the regression equation
Using the calculated values:
Or in terms of the actual variables:
Worked example 2: Finding the correlation coefficient from the slope
Worked Example: Finding the Correlation Coefficient
Question: Use the following information to find the correlation coefficient $r$, rounded to three decimal places.

Solution:
Step 1: Identify the variables
- Explanatory variable (EV): hours studied ($x$)
- Response variable (RV): exam score ($y$)
Step 2: Write down the given information
From the least squares equation, we can see that
Step 3: Calculate the correlation coefficient
Using the rearranged formula:
(rounded to three decimal places)
Using technology to find the regression line
Modern calculators like the TI-Nspire CAS and ClassPad can calculate the least squares regression line automatically. This is much faster than using formulas by hand, especially with large datasets.
Basic process using a calculator
Here's the general approach:
- Enter your data into two lists or columns (one for each variable)
- Create a scatterplot to visualise the relationship
- Identify which variable is explanatory and which is response
- Use the regression function to calculate the line equation
- Display the regression line on your scatterplot
The calculator will provide:
- The equation of the regression line
- The slope ($b$) and intercept ($a$)
- The coefficient of determination ($r^2$)
This saves significant time and reduces calculation errors.
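If you do not have a CAS calculator to hand, the same process can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the data lists are hypothetical, the function name `fit_least_squares` is our own, and the slope is computed from sums of deviations, which is equivalent to the $r$ and standard deviation form of the slope formula.

```python
from math import sqrt


def fit_least_squares(xs, ys):
    """Return (intercept a, slope b, r squared) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = mean_y - b * mean_x       # intercept
    r = sxy / sqrt(sxx * syy)     # correlation coefficient
    return a, b, r ** 2


# Step 1: enter the data as two lists (hypothetical values).
explanatory = [1, 2, 3, 4, 5]
response = [2, 4, 5, 4, 5]

# Steps 3 and 4: explanatory variable first, response second.
a, b, r_squared = fit_least_squares(explanatory, response)
print(f"y = {a:.2f} + {b:.2f}x, r^2 = {r_squared:.3f}")
```

As on the calculator, the order of the two lists matters: swapping them fits a different line, because it changes which variable is treated as explanatory.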




For example, with the height and weight data, the calculator produces:
with
This matches our manual calculation from Example 1.
Exam tips
Essential Tips for Success:
- Always start by identifying which variable is explanatory and which is response
- Calculate the slope first, then use it to find the intercept
- Round your final answers appropriately (check what the question asks for)
- Check that your regression line makes sense by looking at the scatterplot
- Remember: correlation doesn't equal causation, even with a strong regression line
- Write your final equation using the actual variable names, not just $x$ and $y$
Key takeaways
Key Points to Remember:
- Linear regression models the relationship between two numerical variables using a straight line
- Residuals are the vertical distances from data points to the regression line
- The least squares line minimises the sum of squared residuals, making it the best-fitting line
- Calculate slope using: $b = \dfrac{r s_y}{s_x}$
- Calculate intercept using: $a = \bar{y} - b\bar{x}$
- Always identify the explanatory variable ($x$) and response variable ($y$) before calculating
- The method assumes numerical data, linear association, and no clear outliers