**Linear modelling**

Contents:

A Correlation

B Pearson’s correlation coefficient

C Line of best fit

D The least squares regression line

E Interpolation and extrapolation

**OPENING PROBLEM**

At a junior tournament, a group of young athletes throw a discus. The age and distance thrown are recorded for each athlete.

Athlete | A | B | C | D | E | F | G | H | I | J | K | L |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Age (years) | 12 | 16 | 16 | 18 | 13 | 19 | 11 | 10 | 20 | 17 | 15 | 13 |
Distance thrown (m) | 20 | 35 | 23 | 38 | 27 | 47 | 18 | 15 | 50 | 33 | 22 | 20 |

**Things to think about**:

a) Do you think the distance an athlete can throw is related to the person’s age?

b) How can you graph the data so we can clearly see the relationship between the variables?

c) How can we measure the relationship between the variables?

d) How can we use this data to predict the distance a 14-year-old athlete can throw a discus?

Statisticians are often interested in how two variables are related.

For example, in the **Opening Problem**, we want to know how a change in the age of the athlete will affect the *distance* the athlete can throw.

We can observe the relationship between the variables by plotting the data on a **scatter plot**.

We place the **independent variable** *age* on the horizontal axis, and the **dependent variable** *distance* on the vertical axis.

We then plot each data value as a point on the scatter plot. For example, the red point represents athlete H, who is 10 years old and threw the discus 15 metres.

From the general shape formed by the dots, we can see that as the age increases, so does the distance thrown.

**A. CORRELATION**

**Correlation** refers to the relationship or association between two variables.

There are several characteristics we consider when describing the correlation between two variables: direction, linearity, strength, outliers, and causation.

**DIRECTION**

For a generally *upward* trend, we say that the correlation is **positive**. An increase in the independent variable means that the dependent variable generally increases.

For a generally *downward* trend, we say that the correlation is **negative**. An increase in the independent variable means that the dependent variable generally decreases.

For *randomly scattered* points, with no upward or downward trend, we say there is **no correlation**.

**LINEARITY**

We determine whether the points follow a **linear** trend, or in other words approximately form a straight line.

These points are roughly linear.

These points do not follow a linear trend.

**STRENGTH**

We want to know how closely the data follows a pattern or trend. The strength of correlation is usually described as either strong, moderate, or weak.

**OUTLIERS**

We observe and investigate any **outliers**, or isolated points which do not follow the trend formed by the main body of data.

If an outlier is the result of a recording or graphing error, it should be discarded. However, if the outlier proves to be a genuine piece of data, it should be kept.

For the scatter plot for the data in the **Opening Problem**, we can say that there is a strong positive correlation between age and *distance thrown*. The relationship appears to be linear, with no outliers.

**CAUSATION**

Correlation between two variables does not necessarily mean that one variable causes the other.

Consider the following:

1. The *arm length* and *running speed* of a sample of young children were measured, and a strong, positive correlation was found to exist between the variables.

Does this mean that short arms cause a reduction in running speed or that a high running speed causes your arms to grow long? This would clearly be nonsense.

Rather, the strong, positive correlation between the variables is attributed to the fact that both *arm length* and *running speed* are closely related to a third variable, *age*. Up to a certain age, both *arm length* and *running speed* increase with *age*.

2. The number of television sets sold in Ballarat and the number of stray dogs collected in Bendigo were recorded over several years, and a strong positive correlation was found between the variables. Obviously the number of television sets sold in Ballarat was not influencing the number of stray dogs collected in Bendigo. Both variables have simply been increasing over the period of time that their numbers were recorded.

If a change in one variable *causes* a change in the other variable then we say that a **causal relationship** exists between them.

For example, in the **Opening Problem** there is a causal relationship in which increasing the *age* of an athlete increases the *distance thrown*.

In cases where this is not apparent, there is no justification, based on high correlation alone, to conclude that changes in one variable cause the changes in the other.

**CASE STUDY**

Suppose we wish to examine the relationship between the *length* of a helical spring and the *mass* that is hung from the spring.

The force of gravity on the mass causes the spring to stretch.

The length of the spring depends on the force applied, so the dependent variable is the *length*.

The following experimental results are obtained when objects of varying mass are hung from the spring:

Mass *w* (grams) | 0 | 50 | 100 | 150 | 200 | 250 |
---|---|---|---|---|---|---|
Length *L* (cm) | 17.7 | 20.4 | 22.0 | 25.0 | 26.0 | 27.8 |

**MASS ON A SPRING**

For each addition of 50 grams in mass, the consecutive increases in length are roughly constant.

There appears to be a strong positive correlation between the mass of the object hung from the spring, and the length of the spring. The relationship appears to be linear, with no obvious outliers.
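The claim about consecutive increases can be checked directly from the table. The following is a minimal sketch; the variable names are illustrative only and are not part of the original experiment.

```python
# Spring data from the case study table.
masses = [0, 50, 100, 150, 200, 250]            # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]  # cm

# Increase in length for each extra 50 g, rounded to 1 decimal place
# to hide floating-point noise.
increases = [
    round(later - earlier, 1)
    for earlier, later in zip(lengths, lengths[1:])
]

# Average increase per 50 g step over the whole experiment.
average_increase = round((lengths[-1] - lengths[0]) / (len(lengths) - 1), 2)

print(increases)         # [2.7, 1.6, 3.0, 1.0, 1.8]
print(average_increase)  # 2.02
```

The individual increases range from 1.0 cm to 3.0 cm around an average of about 2 cm per 50 g, which is what "roughly constant" means for experimental data of this kind.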

**B. PEARSON’S CORRELATION COEFFICIENT**

In the previous section, we classified the strength of the correlation between two variables as either strong, moderate, or weak. We observed the points on a scatter plot, and made a judgement as to how clearly the points formed a linear relationship.

However, this method can be quite inaccurate, so it is important to get a more precise measure of the strength of linear correlation between two variables. We achieve this using **Pearson’s product-moment correlation coefficient** *r*.

For a set of *n* data given as ordered pairs $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_n, y_n)$, **Pearson’s correlation coefficient** is

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ are the means of the *x* and *y* data respectively, and $\sum$ means the sum over all the data values.

You are not required to learn this formula. Instead, we generally use technology to find the value of *r*.
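As one possible illustration of "using technology", the sketch below computes Pearson's correlation coefficient from its standard definition, using only the Python standard library. The function name `pearson_r` is a hypothetical helper chosen for this example; a calculator or statistics package would normally do this for you.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data xs, ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Data from the Opening Problem.
ages = [12, 16, 16, 18, 13, 19, 11, 10, 20, 17, 15, 13]
distances = [20, 35, 23, 38, 27, 47, 18, 15, 50, 33, 22, 20]

r = pearson_r(ages, distances)
print(round(r, 3))
```

For the Opening Problem data this gives a value of *r* a little above 0.9, which agrees with the strong positive correlation seen in the scatter plot.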
The values of *r* range from -1 to +1.

The **sign** of *r* indicates the **direction** of the correlation.

● A positive value for *r* indicates the variables are **positively correlated**. An increase in one of the variables is generally associated with an increase in the other.

● A negative value for *r* indicates the variables are **negatively correlated**. An increase in one of the variables is generally associated with a decrease in the other.

The size of *r* indicates the **strength** of the correlation.

● A value of *r* close to +1 or -1 indicates strong correlation between the variables.

● A value of *r* close to zero indicates weak correlation between the variables.

The following table is a guide for describing the strength of linear correlation using *r*.


**Example 1**

A chemical fertiliser company wishes to determine the extent of correlation between the *quantity of compound X* used and the *lawn growth* per day.

Lawn | Compound X (g) | Lawn growth (mm) |
---|---|---|
A | 1 | 3 |
B | 2 | 3 |
C | 4 | 6 |
D | 5 | 8 |

Find and interpret the correlation coefficient between the two variables.

answer:

Using technology, *r* ≈ 0.969.

There is a very strong positive correlation between the *quantity of compound X* used and *lawn growth*.

This suggests that the more of compound *X* used, the greater the lawn growth per day. However, care must be taken, as the small amount of data may provide a misleading result.
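For readers who want to check Example 1 by hand, the value of *r* can be worked out from the standard formula for Pearson's correlation coefficient. The layout below is only one way to set out the arithmetic, with *x* the quantity of compound X and *y* the lawn growth.

```latex
\begin{align*}
\bar{x} &= \frac{1 + 2 + 4 + 5}{4} = 3, \qquad
\bar{y} = \frac{3 + 3 + 6 + 8}{4} = 5 \\[4pt]
\sum (x - \bar{x})(y - \bar{y}) &= (-2)(-2) + (-1)(-2) + (1)(1) + (2)(3) = 13 \\[4pt]
\sum (x - \bar{x})^2 &= 4 + 1 + 1 + 4 = 10 \\[4pt]
\sum (y - \bar{y})^2 &= 4 + 4 + 1 + 9 = 18 \\[4pt]
r &= \frac{13}{\sqrt{10 \times 18}} = \frac{13}{\sqrt{180}} \approx 0.969
\end{align*}
```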

**C. LINE OF BEST FIT**

Consider again the data from the **Opening Problem**:

Athlete | A | B | C | D | E | F | G | H | I | J | K | L |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Age (years) | 12 | 16 | 16 | 18 | 13 | 19 | 11 | 10 | 20 | 17 | 15 | 13 |
Distance thrown (m) | 20 | 35 | 23 | 38 | 27 | 47 | 18 | 15 | 50 | 33 | 22 | 20 |

We have seen that there is a strong positive linear correlation between age and *distance thrown*.

We can therefore model the data using a **line of best fit**.

We draw a line of best fit connecting variables *X* and *Y* as follows:

Step 1: Calculate the mean of the *X* values x̄, and the mean of the *Y* values ȳ.

Step 2: Mark the **mean point** (x̄, ȳ) on the scatter plot.

Step 3: Draw a line through the mean point which fits the trend of the data, and so that about the same number of data points are above the line as below it.

The line formed is called a **line of best fit** by eye. This line will vary from person to person.

For the **Opening Problem**, the mean point is (15, 29). So, we draw our line of best fit through (15, 29).
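The mean point quoted for the Opening Problem can be confirmed with a few lines of code. This is a minimal sketch; the variable names are illustrative.

```python
# Data from the Opening Problem.
ages = [12, 16, 16, 18, 13, 19, 11, 10, 20, 17, 15, 13]
distances = [20, 35, 23, 38, 27, 47, 18, 15, 50, 33, 22, 20]

# Step 1 of the method: the mean of each variable.
mean_age = sum(ages) / len(ages)
mean_distance = sum(distances) / len(distances)

# Step 2: the mean point that the line of best fit should pass through.
mean_point = (mean_age, mean_distance)
print(mean_point)  # (15.0, 29.0)
```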

We can use the line of best fit to estimate the value of *y* for any given value of *x*, and vice versa.

**Example 3**

Consider the following data for a mass on a spring:

Mass *w* (grams) | 0 | 50 | 100 | 150 | 200 | 250 |
---|---|---|---|---|---|---|
Length *L* (cm) | 17.7 | 20.4 | 22.0 | 25.0 | 26.0 | 27.8 |

a) Draw a scatter plot for the data, and draw a line of best fit incorporating the mean point.

b) Find the equation of the line you have drawn.

answer:

a) The mean of the masses in the experiment is *w̄*=125 g.

The mean of the spring lengths is *L̄* = 23.15 cm. ∴ the mean point is (125, 23.15).

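b) The line of best fit above passes through (125, 23.15) and (200, 26).

The line has gradient $m = \dfrac{26 - 23.15}{200 - 125} \approx 0.04$.

Its equation is $y - 26 \approx 0.04(x - 200)$, which simplifies to $y \approx 0.04x + 18$,

or in this case $L \approx 0.04w + 18$.

**D. THE LEAST SQUARES REGRESSION LINE**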

The problem with drawing a line of best fit by eye is that the line drawn will vary from one person to another.

Instead, we use a method known as **linear regression** to find the equation of the line which best fits the data. The most common method is the method of ‘least squares’.

Consider the set of points alongside.

For any line we draw to model the linear relationship between the points, we can find the vertical distances $d_1, d_2, d_3, \dots$ between each point and the line.

We can then square each of these distances, and find their sum $d_1^2 + d_2^2 + d_3^2 + \dots$
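As a rough sketch of this calculation, the function below finds the sum of the squared vertical distances for any candidate line $y = mx + c$. It is applied to the spring data and the by-eye line $L \approx 0.04w + 18$ from Example 3, and to a second line that is purely hypothetical and chosen only for comparison. The function and variable names are illustrative.

```python
def sum_of_squared_distances(xs, ys, gradient, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum(
        (y - (gradient * x + intercept)) ** 2
        for x, y in zip(xs, ys)
    )


# Spring data from Example 3.
masses = [0, 50, 100, 150, 200, 250]            # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]  # cm

# The line drawn by eye in Example 3: L = 0.04w + 18.
by_eye = sum_of_squared_distances(masses, lengths, 0.04, 18)

# A hypothetical alternative line (an assumption for comparison): L = 0.05w + 17.
alternative = sum_of_squared_distances(masses, lengths, 0.05, 17)

print(round(by_eye, 2), round(alternative, 2))
```

The by-eye line gives the smaller sum of the two, so by this measure it follows the data more closely than the alternative.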


If the line is a good fit for the data, most of the distances will be small, and so will the sum of their squares.

The **least squares regression line** is the line which makes this sum as small as possible.

The demonstration alongside allows you to experiment with various data sets. Use trial and error to find the least squares regression line for each set.

In practice, rather than finding the regression line by experimentation, we use a **calculator** or **statistics package**.

**E. INTERPOLATION AND EXTRAPOLATION**

Suppose we have gathered data to investigate the association between two variables. We obtain the scatter plot shown below. The data with the lowest and highest values of *x* are called the **poles**.

We use the least squares regression line to estimate values of one variable given a value for the other.

If we use values of *x* **in between** the poles, we say we are **interpolating between** the poles.

If we use values of *x* **outside** the poles, we say we are **extrapolating outside** the poles.

The accuracy of an interpolation depends on how linear the original data was. This can be gauged by determining the correlation coefficient and ensuring that the data is randomly scattered around the linear regression line.

The accuracy of an extrapolation depends not only on how linear the original data was, but also on the assumption that the linear trend will continue past the poles. The validity of this assumption depends greatly on the situation we are looking at.

As a general rule, it is reasonable to interpolate between the poles, but unreliable to extrapolate outside the poles.
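The distinction can be expressed as a simple check against the poles. The sketch below uses the distances from Example 5; the function name `classify_estimate` is hypothetical, and treating a value exactly at a pole as an interpolation is an assumption made for this example.

```python
def classify_estimate(x, xs):
    """Say whether estimating at x interpolates or extrapolates the data xs."""
    lower_pole, upper_pole = min(xs), max(xs)
    # Assumption: a value exactly at a pole counts as interpolation.
    if lower_pole <= x <= upper_pole:
        return "interpolation"
    return "extrapolation"


# Distances from school (km) from Example 5.
distances_km = [7.2, 4.5, 13, 1.3, 9.9, 12.2, 19.6, 6.1, 23.1]

print(classify_estimate(15, distances_km))  # interpolation
print(classify_estimate(30, distances_km))  # extrapolation
```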

**Example 5**

The table below shows how far a group of students live from school, and how long it takes them to travel there each day.

Distance from school *x* (km) | 7.2 | 4.5 | 13 | 1.3 | 9.9 | 12.2 | 19.6 | 6.1 | 23.1 |
---|---|---|---|---|---|---|---|---|---|
Time to travel to school *y* (min) | 17 | 13 | 29 | 2 | 25 | 27 | 41 | 15 | 53 |

a) Draw a scatter plot of the data.

b) Use technology to find:

(i) *r*

(ii) the equation of the least squares regression line.

c) Pam lives 15 km from school.

(i) Estimate how long it takes Pam to travel to school.

(ii) Comment on the reliability of your estimate.

Solution:

a) [Scatter plot: distance from school *x* (km) on the horizontal axis, time to travel to school *y* (min) on the vertical axis]

b) (i) *r* ≈ 0.993

(ii) The least squares regression line is *y* ≈ 2.16*x* + 1.42.

c) (i) When *x* = 15, *y* ≈ 2.16(15) + 1.42 ≈ 33.8.

So, it will take Pam approximately 34 minutes to travel to school.

(ii) The estimate is an interpolation, and the correlation coefficient indicates a very strong correlation. This suggests that the estimate is reliable.
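The values in this solution can be reproduced without a calculator's statistics mode. The sketch below assumes the standard closed-form results for the least squares line, which are not derived in this chapter: gradient = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and intercept = ȳ − gradient × x̄. The function name `least_squares` is a hypothetical helper.

```python
import math


def least_squares(xs, ys):
    """Return (gradient, intercept, r) for the least squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    gradient = s_xy / s_xx
    intercept = y_bar - gradient * x_bar  # the line passes through the mean point
    r = s_xy / math.sqrt(s_xx * s_yy)
    return gradient, intercept, r


# Data from Example 5.
distance_km = [7.2, 4.5, 13, 1.3, 9.9, 12.2, 19.6, 6.1, 23.1]
time_min = [17, 13, 29, 2, 25, 27, 41, 15, 53]

gradient, intercept, r = least_squares(distance_km, time_min)
pam_estimate = gradient * 15 + intercept  # Pam lives 15 km from school

print(round(r, 3), round(gradient, 2), round(intercept, 2))  # 0.993 2.16 1.42
print(round(pam_estimate, 1))                                # 33.8
```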
