B Pearson’s correlation coefficient
C Line of best fit
D The least squares regression line
E Interpolation and extrapolation
At a junior tournament, a group of young athletes throw a discus. The age and distance thrown are recorded for each athlete.
|Distance thrown (m)||20||35||23||38||27||47||18||15||50||33||22||20|
Things to think about:
a) Do you think the distance an athlete can throw is related to the person’s age?
b) How can you graph the data so we can clearly see the relationship between the variables?
c) How can we measure the relationship between the variables?
d) How can we use this data to predict the distance a 14 year old athlete can throw a discus?
Statisticians are often interested in how two variables are related.
For example, in the Opening Problem, we want to know how a change in the age of the athlete will affect the distance the athlete can throw.
We can observe the relationship between the variables by plotting the data on a scatter plot.
We place the independent variable age on the horizontal axis, and the dependent variable distance on the vertical axis.
We then plot each data value as a point on the scatter plot. For example, the red point represents athlete H, who is 10 years old and threw the discus 15 metres.
From the general shape formed by the dots, we can see that as the age increases, so does the distance thrown.
Correlation refers to the relationship or association between two variables.
There are several characteristics we consider when describing the correlation between two variables: direction, linearity, strength, outliers, and causation.
For a generally upward trend, we say that the correlation is positive. An increase in the independent variable means that the dependent variable generally increases.
For a generally downward trend, we say that the correlation is negative. An increase in the independent variable means that the dependent variable generally decreases.
For randomly scattered points, with no upward or downward trend, we say there is no correlation.
We determine whether the points follow a linear trend, or in other words approximately form a straight line.
These points are roughly linear.
These points do not follow a linear trend.
We want to know how closely the data follows a pattern or trend. The strength of correlation is usually described as either strong, moderate, or weak.
We observe and investigate any outliers, or isolated points which do not follow the trend formed by the main body of data.
If an outlier is the result of a recording or graphing error, it should be discarded. However, if the outlier proves to be a genuine piece of data, it should be kept.
For the scatter plot for the data in the Opening Problem, we can say that there is a strong positive correlation between age and distance thrown. The relationship appears to be linear, with no outliers.
Correlation between two variables does not necessarily mean that one variable causes the other.
Consider the following:
1. The arm length and running speed of a sample of young children were measured, and a strong, positive correlation was found to exist between the variables.
Does this mean that short arms cause a reduction in running speed or that a high running speed causes your arms to grow long? This would clearly be nonsense.
Rather, the strong, positive correlation between the variables is attributed to the fact that both arm length and running speed are closely related to a third variable, age. Up to a certain age, both arm length and running speed increase with age.
2. The number of television sets sold in Ballarat and the number of stray dogs collected in Bendigo were recorded over several years and a strong positive correlation was found between the variables. Obviously the number of television sets sold in Ballarat was not influencing the number of stray dogs collected in Bendigo. Both variables have simply been increasing over the period of time that their numbers were recorded.
If a change in one variable causes a change in the other variable then we say that a causal relationship exists between them.
For example, in the Opening Problem there is a causal relationship in which increasing the age of an athlete increases the distance thrown.
In cases where this is not apparent, there is no justification, based on high correlation alone, to conclude that changes in one variable cause the changes in the other.
Suppose we wish to examine the relationship between the length of a helical spring and the mass that is hung from the spring.
The force of gravity on the mass causes the spring to stretch.
The length of the spring depends on the force applied, so the dependent variable is the length.
The following experimental results are obtained when objects of varying mass are hung from the spring:
|mass w (grams)||0||50||100||150||200||250|
|length L (cm)||17.7||20.4||22.0||25.0||26.0||27.8|
For each addition of 50 grams in mass, the consecutive increases in length are roughly constant.
There appears to be a strong positive correlation between the mass of the object hung from the spring, and the length of the spring. The relationship appears to be linear, with no obvious outliers.
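The claim about consecutive increases can be checked directly from the table. The following is a minimal sketch using only the tabulated values; the variable names are illustrative and not part of the original text.

```python
# Spring data copied from the table above.
masses = [0, 50, 100, 150, 200, 250]             # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]   # cm

# Increase in length for each additional 50 g, rounded to the
# one-decimal precision of the measurements.
increases = [round(b - a, 1) for a, b in zip(lengths, lengths[1:])]
print(increases)

# Average increase per 50 g step.
mean_increase = round(sum(increases) / len(increases), 2)
print(mean_increase)
```

The individual increases scatter around their average of about 2 cm per 50 g, which is what we would expect from experimental data that follows a linear trend with some measurement error.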
B. PEARSON’S CORRELATION COEFFICIENT
In the previous section, we classified the strength of the correlation between two variables as either strong, moderate, or weak. We observed the points on a scatter plot, and made a judgement as to how clearly the points formed a linear relationship.
However, this method can be quite inaccurate, so it is important to get a more precise measure of the strength of linear correlation between two variables. We achieve this using Pearson’s product-moment correlation coefficient r.
For a set of n data given as ordered pairs (x₁, y₁), (x₂, y₂), (x₃, y₃), ⋯, (xₙ, yₙ),
Pearson’s correlation coefficient is
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$
where x̄ and ȳ are the means of the x and y data respectively, and Σ means the sum over all the data values.
You are not required to learn this formula. Instead, we generally use technology to find the value of r.
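As an illustration of what the technology is doing, here is a minimal sketch that computes r in its usual form: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` is an assumption made for this example, and the spring data from the previous section is used as input.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)

    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Spring data from the previous section.
masses = [0, 50, 100, 150, 200, 250]             # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]   # cm

r = pearson_r(masses, lengths)
print(round(r, 3))
```

For the spring data the result is close to +1, which agrees with the strong positive linear correlation seen by eye.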
The values of r range from -1 to +1.
The sign of r indicates the direction of the correlation.
● A positive value for r indicates the variables are positively correlated. As one variable increases, the other generally increases.
● A negative value for r indicates the variables are negatively correlated. As one variable increases, the other generally decreases.
The size of r indicates the strength of the correlation.
● A value of r close to +1 or -1 indicates strong correlation between the variables.
● A value of r close to zero indicates weak correlation between the variables.
The following table is a guide for describing the strength of linear correlation using r.
A chemical fertiliser company wishes to determine the extent of correlation between the quantity of compound X used and the lawn growth per day.
|Lawn||Compound X (g)||Lawn growth (mm)|
Find and interpret the correlation coefficient between the two variables.
There is a very strong positive correlation between the quantity of compound X used and lawn growth.
This suggests that the more of compound X used, the greater the lawn growth per day. However, care must be taken, as the small amount of data may provide a misleading result.
C. LINE OF BEST FIT
Consider again the data from the Opening Problem:
|Distance thrown (m)||20||35||23||38||27||47||18||15||50||33||22||20|
We have seen that there is a strong positive linear correlation between age and distance thrown.
We can therefore model the data using a line of best fit.
We draw a line of best fit connecting variables X and Y as follows:
Step 1: Calculate the mean of the X values x̄, and the mean of the Y values ȳ.
Step 2: Mark the mean point (x̄, ȳ) on the scatter plot.
Step 3: Draw a line through the mean point which fits the trend of the data, and so that about the same number of data points are above the line as below it.
The line formed is called a line of best fit by eye. This line will vary from person to person.
For the Opening Problem, the mean point is (15, 29). So, we draw our line of best fit through (15, 29).
We can use the line of best fit to estimate the value of y for any given value of x, and vice versa.
Consider the following data for a mass on a spring:
|Mass w (grams)||0||50||100||150||200||250|
|Length L (cm)||17.7||20.4||22.0||25.0||26.0||27.8|
a) Draw a scatter plot for the data, and draw a line of best fit incorporating the mean point.
b) Find the equation of the line you have drawn.
a) The mean of the masses in the experiment is w̄=125 g.
The mean of the spring lengths is L̄=23.15 cm. ∴ the mean point is (125, 23.15).
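b) The line of best fit above passes through (125, 23.15) and (200, 26).
The line has gradient $m = \frac{26 - 23.15}{200 - 125} \approx 0.04$.
Its equation is $y - 23.15 \approx 0.04(x - 125)$, which simplifies to $y \approx 0.04x + 18$,
or in this case L≈0.04w+18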
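The arithmetic in this example can be reproduced with a short sketch. It assumes, as in the worked solution, that the line drawn by eye passes through the mean point and through (200, 26); a different person's line would give a different second point. The variable names are illustrative.

```python
# Spring data from the example.
masses = [0, 50, 100, 150, 200, 250]             # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]   # cm

# Mean point (w-bar, L-bar).
w_bar = sum(masses) / len(masses)
l_bar = sum(lengths) / len(lengths)

# Second point read off the line drawn by eye (taken from the worked solution).
second_point = (200, 26.0)

# Gradient and intercept of the line through the two points.
gradient = (second_point[1] - l_bar) / (second_point[0] - w_bar)
intercept = l_bar - gradient * w_bar

print(f"mean point: ({w_bar:.0f}, {l_bar:.2f})")
print(f"L = {gradient:.3f} w + {intercept:.1f}")
```

Without rounding, the gradient is 0.038 and the intercept is 18.4. Rounding the gradient to 0.04 before finding the intercept gives the equation L≈0.04w+18 quoted in the solution; the two lines differ by less than the variation we would expect between lines drawn by eye.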
D. THE LEAST SQUARES REGRESSION LINE
The problem with drawing a line of best fit by eye is that the line drawn will vary from one person to another.
Instead, we use a method known as linear regression to find the equation of the line which best fits the data. The most common method is the method of ‘least squares’.
Consider the set of points alongside.
For any line we draw to model the linear relationship between the points, we can find the vertical distances d₁, d₂, d₃, ⋯ between each point and the line.
We can then square each of these distances, and find their sum d₁² + d₂² + d₃² + ⋯
If the line is a good fit for the data, most of the distances will be small, and so will the sum of their squares.
The least squares regression line is the line which makes this sum as small as possible.
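To make the idea concrete, the sketch below computes the sum of squared vertical distances for two candidate lines through the spring data: the line of best fit by eye from the previous section, L≈0.04w+18, and the least squares regression line. The closed-form expressions used for the regression line (gradient equal to the sum of products of deviations divided by the sum of squared x deviations, with the line passing through the mean point) are the standard result and are not derived in this text. The function names are assumptions made for illustration.

```python
def sum_of_squares(xs, ys, gradient, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Gradient and intercept of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = y_bar - gradient * x_bar   # the line passes through the mean point
    return gradient, intercept


masses = [0, 50, 100, 150, 200, 250]             # grams
lengths = [17.7, 20.4, 22.0, 25.0, 26.0, 27.8]   # cm

# Line of best fit by eye from the earlier example.
sse_eye = sum_of_squares(masses, lengths, 0.04, 18)

# Least squares regression line.
gradient, intercept = least_squares_line(masses, lengths)
sse_ls = sum_of_squares(masses, lengths, gradient, intercept)

print(f"by eye:        sum of squares = {sse_eye:.3f}")
print(f"least squares: L = {gradient:.4f} w + {intercept:.2f}, "
      f"sum of squares = {sse_ls:.3f}")
```

The line drawn by eye is already a good fit, but the least squares line gives a slightly smaller sum of squares, as it must by definition.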
The demonstration alongside allows you to experiment with various data sets. Use trial and error to find the least squares regression line for each set.
In practice, rather than finding the regression line by experimentation, we use a calculator or statistics package.
E. INTERPOLATION AND EXTRAPOLATION
Suppose we have gathered data to investigate the association between two variables. We obtain the scatter plot shown below. The data with the lowest and highest values of x are called the poles.
We use the least squares regression line to estimate values of one variable given a value for the other.
If we use values of x in between the poles, we say we are interpolating between the poles.
If we use values of x outside the poles, we say we are extrapolating outside the poles.
The accuracy of an interpolation depends on how linear the original data was. This can be gauged by determining the correlation coefficient and ensuring that the data is randomly scattered around the linear regression line.
The accuracy of an extrapolation depends not only on how linear the original data was, but also on the assumption that the linear trend will continue past the poles. The validity of this assumption depends greatly on the situation we are looking at.
As a general rule, it is reasonable to interpolate between the poles, but unreliable to extrapolate outside the poles.
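Deciding whether an estimate is an interpolation or an extrapolation only requires comparing the x value with the poles. A minimal sketch, using a hypothetical helper `classify_estimate` and the masses from the spring experiment as the observed x values:

```python
def classify_estimate(x, observed_xs):
    """Say whether estimating at x interpolates or extrapolates the data."""
    lower_pole, upper_pole = min(observed_xs), max(observed_xs)
    if lower_pole <= x <= upper_pole:
        return "interpolation"
    return "extrapolation"


masses = [0, 50, 100, 150, 200, 250]   # grams, poles at 0 and 250

print(classify_estimate(120, masses))   # inside the poles
print(classify_estimate(400, masses))   # outside the poles
```

For the spring, an estimate at 400 g relies on the assumption that the spring keeps stretching linearly beyond the masses that were actually tested, which the data alone cannot confirm.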
The table below shows how far a group of students live from school, and how long it takes them to travel there each day.
|Distance from school x (km)||7.2||4.5||13||1.3||9.9||12.2||19.6||6.1||23.1|
|Time to travel to school y (min)||17||13||29||2||25||27||41||15||53|
a) Draw a scatter plot of the data.
b) Use technology to find:
(i) Pearson’s correlation coefficient r
(ii) the equation of the least squares regression line.
c) Pam lives 15 km from school.
(i) Estimate how long it takes Pam to travel to school.
(ii) Comment on the reliability of your estimate.
(ii) The least squares regression line is y≈2.16x+1.42.
c) (i) When x=15, y≈2.16(15)+1.42≈33.8.
So, it will take Pam approximately 34 minutes to travel to school.
(ii) The estimate is an interpolation, and the correlation coefficient indicates a very strong correlation. This suggests that the estimate is reliable.
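The "use technology" steps of this example can be sketched in a few lines. This is one possible way to do the calculation, not the calculator procedure the example has in mind; it uses the standard closed-form expressions for r and for the least squares line, and the variable names are illustrative.

```python
from math import sqrt

# Data from the table above.
distances = [7.2, 4.5, 13, 1.3, 9.9, 12.2, 19.6, 6.1, 23.1]   # x, km
times = [17, 13, 29, 2, 25, 27, 41, 15, 53]                   # y, min

n = len(distances)
x_bar = sum(distances) / n
y_bar = sum(times) / n

# Sums of products and squares of the deviations from the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(distances, times))
sxx = sum((x - x_bar) ** 2 for x in distances)
syy = sum((y - y_bar) ** 2 for y in times)

# Correlation coefficient and least squares regression line.
r = sxy / sqrt(sxx * syy)
gradient = sxy / sxx
intercept = y_bar - gradient * x_bar

# Estimate for Pam, who lives 15 km from school.
pam_distance = 15
pam_time = gradient * pam_distance + intercept
is_interpolation = min(distances) <= pam_distance <= max(distances)

print(f"r = {r:.3f}")
print(f"y = {gradient:.2f}x + {intercept:.2f}")
print(f"estimated time for Pam: {pam_time:.1f} min")
print(f"interpolation: {is_interpolation}")
```

The fitted line agrees with y≈2.16x+1.42 to two decimal places, the estimate for Pam rounds to 34 minutes, and 15 km lies between the poles at 1.3 km and 23.1 km, so the estimate is an interpolation.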