**Bivariate data and the scatter plot (scatter graph)**

A scatter plot is a graph that shows the relationship between two variables. We call this bivariate data, and we plot the data from the two sets as ordered pairs. For example, we could have mass on the horizontal axis (first variable) and height on the vertical axis (second variable), or we could have current on the horizontal axis and voltage on the vertical axis.

● A **scatter plot** is a graph using the x- and y-axes to represent **bivariate** data.

● Bivariate data means that each point on the graph represents **two variables** measured for the same item, for example the mass and the height of one person.

● In a scatter plot, we plot a point for each pair of coordinates and look at the overall pattern or **trend** in the data.

● The points in the data are compared to see if there is a **correlation** or some kind of pattern (or trend) in the data.

● When a point does not fit the trend of the other points, it is called an **outlier**.

● **Outliers** are easy to identify on a scatter plot or a box and whisker plot.

● We can sometimes represent the trend in the data with a line or curve of **best fit**. The line or curve can be represented by an equation that could be linear, quadratic, exponential, hyperbolic etc.

**Example 1**. Ohm’s Law is an important relationship in physics. It describes the relationship between current and voltage in a conductor, such as a piece of wire. When we measure the voltage (dependent variable) that results from a certain current (independent variable) in a wire, we get the data points shown in the table below.

Current | Voltage | Current | Voltage |
---|---|---|---|
0 | 0.4 | 2.4 | 1.4 |
0.2 | 0.3 | 2.6 | 1.6 |
0.4 | 0.6 | 2.8 | 1.9 |
0.6 | 0.6 | 3 | 1.9 |
0.8 | 0.4 | 3.2 | 2 |
1 | 1 | 3.4 | 1.9 |
1.2 | 0.9 | 3.6 | 2.1 |
1.4 | 0.7 | 3.8 | 2.1 |
1.6 | 1 | 4 | 2.4 |
1.8 | 1.1 | 4.2 | 2.4 |
2 | 1.3 | 4.4 | 2.5 |
2.2 | 1.1 | 4.6 | 2.5 |

When we plot this data as points, we get the scatter plot shown in Figure A.

If we look for a function that describes this data, a straight line fits it best.

**Extension: Ohm’s Law**

Ohm’s Law describes the relationship between current and voltage in a conductor. The gradient of the graph of voltage vs. current is known as the resistance of the conductor.
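As a rough check on this idea, the gradient can be estimated from the Example 1 table with the least-squares formula for the slope of a straight line. This is only a minimal sketch: the function name `slope` is made up for illustration, and the table does not state its units, so the result is a plain number rather than a resistance in ohms.

```python
# Readings from the Example 1 table: current (x) and voltage (y).
current = [0, 0.2, 0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6, 1.8, 2, 2.2,
           2.4, 2.6, 2.8, 3, 3.2, 3.4, 3.6, 3.8, 4, 4.2, 4.4, 4.6]
voltage = [0.4, 0.3, 0.6, 0.6, 0.4, 1, 0.9, 0.7, 1, 1.1, 1.3, 1.1,
           1.4, 1.6, 1.9, 1.9, 2, 1.9, 2.1, 2.1, 2.4, 2.4, 2.5, 2.5]


def slope(xs, ys):
    """Least-squares gradient of the straight line of best fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, divided by sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


gradient = slope(current, voltage)
print(round(gradient, 2))
```

For this table the gradient comes out at roughly 0.5, which is the slope of the straight line suggested by the scatter plot.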

**Example 2**. A science teacher compares the marks for the mid-year examination with the marks for the final examination achieved by 11 learners.

mid-year marks | 80 | 68 | 94 | 72 | 74 | 83 | 56 | 68 | 65 | 75 | 88 |
---|---|---|---|---|---|---|---|---|---|---|---|
final marks | 72 | 71 | 96 | 77 | 82 | 72 | 58 | 83 | 78 | 80 | 92 |

a) Draw a scatter graph of this data.

b) Describe the curve of best fit.

c) Use the scatter plot to estimate the final mark of a learner who had a mid-year mark of 75%.

**Solution**:

(a)

b) The ‘curve’ or line of best fit is a straight line. There should be about five dots above the line and five dots below the line.

c) A line from 75 on the x-axis to the trendline takes us to about 78 on the y-axis. So we can predict that a learner with a mid-year mark of 75% can expect to get about 78% in the final exam.

**Example 3**. The outdoor temperature (in °C) at noon is measured. It is compared with the number of units of electricity used to heat a house each day.

Temp in °C | 7 | 11 | 9 | 2 | 4 | 7 | 0 | 10 | 5 | 3 |
---|---|---|---|---|---|---|---|---|---|---|
Units of electricity used | 32 | 20 | 27 | 37 | 32 | 28 | 41 | 23 | 33 | 36 |

a) Draw in a line of best fit.

b) Use your line of best fit to estimate the noon temperature when 30 units of electricity are used.

**Solution**:

a) Line of best fit.

b) From the line of best fit, about 30 units of electricity are used when the noon temperature is roughly 6.25°C.

**The linear regression line (or the least squares regression line)**

The line of best fit for a set of bivariate numerical data is the linear regression line. So far, we have seen this trend line on a scatter graph. Now we use a scientific calculator to determine the equation for this line.

We know the straight line equation: *y=mx+c*

Statistics (and the statistics mode of a scientific calculator) uses *y=A+Bx*, where *B* is the gradient and *A* is the y-intercept of the straight line of best fit.

So the gradient is *B* instead of *m* and the y-intercept is *A* instead of *c*.

**The Correlation Coefficient ‘r’**

This is a statistical number that measures the strength of the correlation (relationship) between two sets of data.

● This number is calculated from two sets of data using a calculator.

● *r* always lies between -1 and +1.

● The closer *r* is to -1, the stronger the negative correlation.

● The closer *r* is to +1, the stronger the positive correlation.

● If *r* = 0, there is no correlation between the two sets of data.

The number line shows the *r* values and the strength of the correlation between bivariate data.

__Note__: We only study the *r* value of bivariate data when the line of best fit is a straight line.

A negative correlation means that as *x* increases, *y* decreases.

The more closely the points are clustered together around the line, the stronger the correlation.

A positive correlation means that as *x* increases, *y* also increases.

A correlation of zero means that there is no linear relationship between *x* and *y*.
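The sign of *r* can be seen in a small calculation. The sketch below computes *r* with the standard product-moment formula for three tiny data sets. The data sets and the function name `correlation` are invented for illustration and do not come from the examples in this post.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, chosen only to show the sign of r.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # y increases as x increases
falling = [10, 8, 6, 4, 2]   # y decreases as x increases
v_shape = [3, 1, 2, 1, 3]    # a pattern, but no straight-line trend

print(correlation(xs, rising))   # 1.0
print(correlation(xs, falling))  # -1.0
print(correlation(xs, v_shape))  # 0.0
```

The last data set is a reminder that *r* only measures how well a straight line fits: the points follow a clear V-shaped pattern, yet *r* is zero.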

**Example 4**. Pick ‘n Pay wants to survey how long, in seconds (*y*), it takes a teller to scan a number of items (*x*) at the till.

The table shows the results from 9 shoppers.

shopper | x (number of items) | y (time in seconds) |
---|---|---|
A | 5 | 3 |
B | 8 | 11 |
C | 12 | 9 |
D | 15 | 6 |
E | 15 | 15 |
F | 17 | 13 |
G | 20 | 25 |
H | 21 | 15 |
I | 25 | 13 |

a) Use your calculator to determine the equation of the line of best fit (the regression line or the least squares regression line) correct to two decimal places.

b) Calculate the value of *r*, the correlation coefficient for the data. What can you say about the correlation between *x* and *y*?

c) How long would the teller take to scan 21 items at the till?

d) How many items could a teller scan in 21.28 seconds?

**Solution**:

a) *A* = 2.68; *B* = 0.62

*y* = 2.68 + 0.62*x*

b) *r* = 0.62847… ≈ 0.63

This is a weak positive correlation.

c) *y* = 2.68 + 0.62(21) = 15.7 (about 16 seconds)

d) 21.28 = 2.68 + 0.62*x* → 21.28 - 2.68 = 0.62*x*

*x* = 18.6/0.62 = 30

30 items can be scanned in 21.28 seconds.
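The calculator values in this solution can be cross-checked with a few lines of code. This is a sketch of the usual least-squares formulas, not the calculator's own procedure. The function name `regression` is an assumption made for illustration.

```python
from math import sqrt

# Data from the Example 4 table.
items = [5, 8, 12, 15, 15, 17, 20, 21, 25]     # x
seconds = [3, 11, 9, 6, 15, 13, 25, 15, 13]    # y


def regression(xs, ys):
    """Return (A, B, r) for the least squares line y = A + Bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                # gradient
    a = mean_y - b * mean_x      # y-intercept
    r = sxy / sqrt(sxx * syy)    # correlation coefficient
    return a, b, r


A, B, r = regression(items, seconds)
print(round(A, 2), round(B, 2), round(r, 2))  # 2.68 0.62 0.63
```

The rounded output matches the values of *A*, *B* and *r* found with the calculator.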

**Example 5**. A restaurant wants to know the relationship between the number of customers and the number of chicken pies that are ordered.

number of customers (x) | 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
---|---|---|---|---|---|---|---|---|
number of chicken pies (y) | 3 | 5 | 10 | 10 | 15 | 20 | 20 | 24 |

a) Determine the equation of the regression line correct to two decimal places.

b) Determine the value of *r*, the correlation coefficient. Describe the type and strength of the correlation between the number of people and the number of chicken pies ordered.

c) Determine how many chicken pies 100 people would order.

d) If they only have 12 pies left, how many people can they serve?

**Solution**:

a) *A* = -0.39285…; *B* = 0.61190…

*y* = -0.39 + 0.61*x*

b) *r* = 0.9866…

This is a very strong positive correlation (*r* is close to +1).

c) *y* = -0.39 + 0.61(100) = 60.61

About 61 chicken pies are ordered by 100 people.

d) 12 = -0.39 + 0.61*x*

12 + 0.39 = 0.61*x*

*x* = 12.39/0.61 ≈ 20.3

About 20 people will order 12 pies.
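Part d) reads the regression line backwards: the *y*-value is known and *x* must be found. The sketch below does the same rearrangement in code. It keeps the unrounded coefficients, so its answer can differ slightly from a hand calculation with rounded values. The names `fit_line` and `x_for` are made up for illustration.

```python
# Data from the Example 5 table.
customers = [5, 10, 15, 20, 25, 30, 35, 40]   # x
pies = [3, 5, 10, 10, 15, 20, 20, 24]         # y


def fit_line(xs, ys):
    """Return (A, B) for the least squares line y = A + Bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def x_for(y, a, b):
    """Rearrange y = a + b*x to make x the subject."""
    return (y - a) / b


A, B = fit_line(customers, pies)
people = x_for(12, A, B)
print(round(people, 1))  # about 20.3
```

Since a fraction of a customer makes no sense, the result is rounded to a whole number of people when the answer is written down.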

**Example 6**. A recording company investigates the relationship between the number of times a CD is played by a national radio station and the national sales of the same CD in the following week. The data below was collected for a random sample of CDs. The sales figures are rounded to the nearest 50.

Number of times CD is played | Weekly sales of the CD |
---|---|
47 | 3950 |
34 | 2500 |
40 | 3700 |
34 | 2800 |
33 | 2900 |
50 | 3750 |
28 | 2300 |
53 | 4400 |
25 | 2200 |
45 | 3400 |

a) Identify the independent variable.

b) Draw a scatter plot of this data.

c) Determine the equation of the least squares regression line.

d) Calculate the correlation coefficient.

e) Predict, correct to the nearest 50, the weekly sales for a CD that was played 45 times by the station in the previous week.

f) Comment on the strength of the relationship between the variables.

**Solution**:

a) The number of times the CD is played.

b)

c) *A* = 264.326…; *B* = 75.21

*y* = 264.33 + 75.21*x*

d) *r* = 0.95

e) For *x* = 45:

*y* = 264.33 + 75.21(45) = 3648.78 ≈ 3650 (to the nearest 50)

f) There is a very strong positive relationship between the number of times that a CD was played and the sales of that CD in the following week.
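The prediction in part e) has two steps: substitute *x* = 45 into the regression line, then round to the nearest 50. A minimal sketch of both steps follows, using the rounded coefficients from part c). The helper name `round_to_nearest` is made up for illustration.

```python
def round_to_nearest(value, step):
    """Round value to the nearest multiple of step."""
    # Python's round() sends exact ties to the even neighbour;
    # that does not matter here because 3648.78 is not a tie.
    return step * round(value / step)


# Rounded coefficients from part c) of Example 6.
A, B = 264.33, 75.21

predicted = A + B * 45
print(round(predicted, 2))              # 3648.78
print(round_to_nearest(predicted, 50))  # 3650
```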
