**Statistics**

Suppose we want to learn about an object or target of research. To do so, we need to carry out research or observation. If the object has only a small number of members, we can observe all of them. But if it has a large number of members, or it is impossible to observe all of them, then only a portion of the object is taken to be observed.

In observation, we collect data. The data can be numerical (quantitative) or categorical (qualitative). The data that we collect can be processed using statistical rules so that it can be presented and, moreover, give meaningful information about the thing we observe. We can then go on to analyze the data so that we can predict what may happen to the object we observe.

In general, statistical methods deal with presenting data and with predicting what may happen to the object that we observe. Accordingly, statistical methods are divided into two large groups:

1. Descriptive Statistics

Descriptive statistics comprises the methods related to collecting and presenting data, so that we can get meaningful information about the object that we observe.

2. Statistical Inference

Statistical inference comprises the methods related to analyzing a portion of the data in order to predict and draw conclusions about the whole data set under observation.

**Population and Sample**

The collection of all objects that are observed or examined is called the *population*. The number of observations in a population can be finite or infinite. Values that describe characteristics of the population are called **parameters**. Parameters are generally denoted by Greek letters: for example, the population mean is denoted by *μ* (read: mu) and the population standard deviation by *σ* (read: sigma). The number of members of the population is denoted by *N*.

In research, it is sometimes impossible to observe every member of the object of study. For example, to find the average lifetime of the incandescent lamps produced by a factory, it is impossible to test every lamp the factory produces. In addition, observing a very large population tends to take a long time and cost a great deal.

For this reason, research on a population is generally carried out by observing some of its members that are considered able to represent the population. The set of members taken to be observed is called a **sample**, so a sample is a subset of the population. Values that describe characteristics of the sample are called **statistics**: for example, the sample mean is denoted by *x̄*, the sample standard deviation by *s*, and the number of data values by *n*. If all members of the population are observed, the observation is called a **census**.
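The distinction between parameters and statistics can be sketched in a few lines of Python. The population below is simulated and purely hypothetical (lamp lifetimes in hours), and the variable names are chosen here for illustration. Note that `statistics.stdev` divides by *n* − 1, which is one common convention for a sample.

```python
import random
import statistics

# Hypothetical population: lifetimes (in hours) of N = 5000 lamps.
# The values are simulated for illustration only.
rng = random.Random(0)
population = [rng.gauss(1000, 100) for _ in range(5000)]

N = len(population)                     # population size
mu = statistics.mean(population)        # parameter: population mean
sigma = statistics.pstdev(population)   # parameter: population standard deviation

# A random sample of n = 50 members is a subset of the population.
sample = rng.sample(population, 50)
n = len(sample)                         # sample size
x_bar = statistics.mean(sample)         # statistic: sample mean
s = statistics.stdev(sample)            # statistic: sample standard deviation (n - 1 divisor)

print(f"N={N}, mu={mu:.1f}, sigma={sigma:.1f}")
print(f"n={n}, x_bar={x_bar:.1f}, s={s:.1f}")
```

The parameters stay fixed for a given population, while the statistics change whenever a different sample is drawn.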

Example 1:

Past experience shows that the average daily production of the laborers in a factory is 600 boxes of cake, with a standard deviation of 200. If we count the production on 16 randomly chosen days, we get an average of 624 boxes of cake and a standard deviation of 196.

From the passage above we can get:

i. The population is the daily production of laborers in a factory.

ii. Average of population: *μ*=600.

iii. Standard deviation of population: *σ*=200.

iv. The sample size is 16, or *n*=16.

v. Average of sample: *x̄*=624.

vi. Standard deviation of sample: *s*=196.

Multiple Choice Example (Example 2):

Instruction: Choose the one correct answer for each question.

According to the past experience of an airline, the average number of empty seats per flight is 14, with a standard deviation of 3. To investigate this, the company takes a sample of 100 flights. In that sample, the average number of empty seats is 13 and the standard deviation is 4.

1. The population in the passage above is …

A. Flight of a company

B. The empty seat on an airline

C. Airline

D. Average empty seats

E. Flight

Key: B. B is more specific than D.

2. The number 14 in the passage above refers to …

A. *N* B. *s* C. *x̄* D. *σ* E. *μ*

Key: E.

3. In the passage above, the value of *s*=⋯

A. 110 B. 12 C. 14 D. 4 E. 3

Key: D.

4. The number 13 in the passage above refers to …

A. *N* B. *s* C. *x̄* D. *σ* E. *μ*

Key: C.

5. The number 3 in the passage above refers to …

A. *N* B. *s* C. *x̄* D. *σ* E. *μ*

Key: D.

**Standard Deviation and Variance**

The measures of central tendency (mean, median and mode) and measures of dispersion (quartiles, percentiles, ranges) provide information on the data values at the centre of the data set and provide information on the spread of the data. The information on the spread of the data is however based on data values at specific points in the data set, e.g. the end points for range and data points that divide the data set into 4 equal groups for the quartiles. The behaviour of the entire data set is therefore not examined.

A method of determining the spread of data is by calculating a measure of the possible distances between the data and the mean. The two important measures that are used are called the *variance* and the *standard deviation* of the data set.

(a). **Variance**

The variance of a data set is the average squared distance between the mean of the data set and each data value. An example of what this means is shown in __Figure A__. The graph represents the results of 100 tosses of a fair coin, which resulted in 45 heads and 55 tails. The mean of the results is 50. The squared distance between the heads value and the mean is (45-50)^{2}=25 and the squared distance between the tails value and the mean is (55-50)^{2}=25. The average of these two squared distances gives the variance, which is ½(25+25)=25.

(a.i). **Population Variance**

Let the population consist of n elements {*x*_{1}, *x*_{2}, …, *x*_{n}} with mean *x̄* (read as “x bar”). The variance of the population, denoted by *σ*^{2}, is the average of the square of the distance of each data value from the mean value.
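Written as a formula, using the symbols defined above (the summation index *i* is introduced here for illustration), this definition reads:

```latex
\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}
```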

__Figure A__: The graph shows the results of 100 tosses of a fair coin, with 45 heads and 55 tails. The mean value of the tosses is shown as a vertical dotted line. The difference between the mean value and each data value is shown.

Since the population variance is squared, it is not directly comparable with the mean and the data themselves.

(a.ii). **Sample Variance**

Let the sample consist of the *n* elements {*x*_{1}, *x*_{2}, …, *x*_{n}}, taken from the population, with mean *x̄*. The variance of the sample, denoted by *s*^{2}, is the average of the squared deviations from the sample mean.
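The formula itself is not shown at this point in the text. A common convention, assumed here rather than confirmed by the source, divides the sum of squared deviations by *n* − 1 (Bessel's correction) when the variance is computed from a sample. The *s*_{n} notation that appears later in this document is usually read as the version that divides by *n*. Both are given below for comparison:

```latex
% Assumed convention: divisor n - 1 (Bessel's correction)
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}

% Divisor-n version, the usual reading of the s_n notation
s_{n}^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}
```

For large *n* the two values are nearly identical.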

Since the sample variance is squared, it is also not directly comparable with the mean and the data themselves.

A common question at this point is “Why is the numerator squared?” One answer is: to get rid of the negative signs. Numbers are going to fall above and below the mean and, since the variance is looking for distance, it would be counterproductive if those distances factored each other out.

(a.iii). **Difference between Population Variance and Sample Variance**

As seen, a distinction is made between the variance, *σ*^{2}, of a whole population and the variance, *s*^{2}, of a sample extracted from the population.

When dealing with the complete population the (population) variance is a constant, a parameter which helps to describe the population. When dealing with a sample from the population the (sample) variance varies from sample to sample. Its value is only of interest as an estimate for the population variance.

(a.iv). **Properties of Variance**

If the variance is defined, we can conclude that it is never negative because the squares are positive or zero. The unit of variance is the square of the unit of observation. For example, the variance of a set of heights measured in centimeters will be given in square centimeters. This fact is inconvenient and has motivated many statisticians to instead use the square root of the variance, known as the standard deviation, as a summary of dispersion.

(b). **Standard Deviation**

Since the variance is a squared quantity, it cannot be directly compared to the data values or the mean value of a data set. It is therefore more useful to have a quantity which is the square root of the variance. This quantity is known as the standard deviation.

In statistics, the standard deviation is the most common measure of statistical dispersion. It measures how spread out the values in a data set are. More precisely, it is a measure of the typical distance between the values in the data set and their mean. If the data values are all similar, then the standard deviation will be low (closer to zero). If the data values are highly variable, then the standard deviation is high (further from zero).

The standard deviation is never negative (it is zero only when all the data values are equal) and is always measured in the same units as the original data. For example, if the data are distance measurements in metres, the standard deviation will also be measured in metres.

(b.i). **Population Standard Deviation**

Let the population consist of *n* elements {*x*_{1}, *x*_{2}, …, *x*_{n}} with mean *x̄*. The standard deviation of the population, denoted by *σ*, is the square root of the average of the square of the distance of each data value from the mean value.
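In symbols, with the summation index *i* introduced here for illustration, this definition can be written as:

```latex
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}
```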

(b.ii). **Sample Standard Deviation**

**Sampling From A Population**

Populations are often huge, and gathering data from every individual is often impossible due to time constraints and cost.

Consequently, a **random sample** is taken from the population with the hope that it will truly reflect the characteristics of the population. To ensure this, the sample must be sufficiently large, and be taken in such a way that the results are unbiased.

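To help distinguish between a sample and the whole population, we use different notation for the mean, variance, and standard deviation. This is shown in the table below.

| | Sample | Population |
|---|---|---|
| Mean | *x̄* | *μ* |
| Variance | *s*_{n}^{2} | *σ*^{2} |
| Standard deviation | *s*_{n} | *σ* |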

In general, the population mean *μ* and standard deviation *σ* will be unknown.

However, given statistics from a sample, we can make inferences about the population using the following results which are assumed without proof:

When a sample of size *n* is used to draw inference about a population:

● the mean *x̄* of the sample is an unbiased estimate of *μ*

● *s*_{n} is an estimate of the standard deviation *σ*.
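The first result can be illustrated, though not proved, by simulation. The sketch below assumes a hypothetical population of 2000 simulated weights, draws many random samples of size 30, and compares the average of the sample means with the population mean. All names and numbers are illustrative.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of 2000 weights in kg (simulated data).
population = [rng.gauss(50, 5) for _ in range(2000)]
mu = statistics.mean(population)

# Draw 2000 random samples of size 30 and record each sample mean.
sample_means = [
    statistics.mean(rng.sample(population, 30))
    for _ in range(2000)
]

# Individual sample means vary, but on average they land close to mu.
average_of_means = statistics.mean(sample_means)
print(f"mu = {mu:.2f}, average of sample means = {average_of_means:.2f}")
```

Any single sample mean may miss *μ*, but the sample means show no systematic tendency to fall above or below it.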

Example 3:

A random sample of 48 sheep was taken from a flock of over 2000 sheep. The sample mean of their weights was 48.6 kg with variance 17.5 kg^{2}.

a) Find the standard deviation of the sample. Hence estimate the standard deviation of the population from which the sample was taken.

b) Find an unbiased estimate of the mean weight of sheep in the flock.

Solution:

a) *s*_{n}=√variance=√17.5≈4.18 kg

*σ* is estimated by *s*_{n}, so we estimate the standard deviation to be 4.18 kg.

b) *μ* is estimated by *x̄*=48.6 kg.

**Difference between Population Standard Deviation and Sample Standard Deviation**

As with variance, there is a distinction between the standard deviation, *σ*, of a whole population and the standard deviation, *s*, of a sample extracted from the population.

When dealing with the complete population the (population) standard deviation is a constant, a parameter which helps to describe the population. When dealing with a sample from the population the (sample) standard deviation varies from sample to sample.

In other words, the standard deviation can be calculated as follows:

1. Calculate the mean value *x̄*.

2. For each data value *x*_{i} calculate the difference *x*_{i}–*x̄* between *x*_{i} and the mean value *x̄*.

3. Calculate the squares of these differences.

4. Find the average of the squared differences. This quantity is the variance, *σ*^{2}.

5. Take the square root of the variance to obtain the standard deviation, *σ*.
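The five steps above can be sketched as a short Python function. The function name and the example data set are chosen here for illustration. The divisor is the number of data values, as in step 4.

```python
import math

def population_std(data):
    """Follow the five steps literally; return (variance, standard deviation)."""
    n = len(data)
    mean = sum(data) / n                       # step 1: mean value
    differences = [x - mean for x in data]     # step 2: differences from the mean
    squares = [d ** 2 for d in differences]    # step 3: squared differences
    variance = sum(squares) / n                # step 4: average of the squares
    return variance, math.sqrt(variance)       # step 5: square root

variance, std = population_std([0, 6, 8, 14])
print(variance, std)  # prints 25.0 5.0
```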

Worked Example (Example 4): **Variance and Standard Deviation**

Question: What are the variance and standard deviation of the population of possibilities associated with rolling a fair die?

Answer

Step 1: Determine how many outcomes make up the population

When rolling a fair die, the population consists of 6 possible outcomes. The data set is therefore *x*={1,2,3,4,5,6} and *n*=6.

Step 2: Calculate the population mean

The population mean is calculated by:

*x̄*=⅙(1+2+3+4+5+6)=3.5

Step 3: Calculate the population variance

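The population variance is calculated by:

*σ*^{2}=⅙[(1−3.5)^{2}+(2−3.5)^{2}+(3−3.5)^{2}+(4−3.5)^{2}+(5−3.5)^{2}+(6−3.5)^{2}]=⅙(6.25+2.25+0.25+0.25+2.25+6.25)=⅙(17.5)≈2.917

Step 4: Alternatively, the population variance is calculated by:

*σ*^{2}=⅙(1^{2}+2^{2}+3^{2}+4^{2}+5^{2}+6^{2})−*x̄*^{2}=91/6−3.5^{2}≈15.167−12.25=2.917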

Step 5: Calculate the standard deviation

The (population) standard deviation is calculated by:

*σ*=√2.917≈1.708

Notice how this standard deviation lies between the smallest and largest absolute deviations from the mean (0.5 and 2.5).
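The worked example can be checked numerically. The sketch below computes the variance in two ways: from the definition, and from the shortcut "mean of the squares minus the square of the mean". It then compares both with `statistics.pvariance` from the Python standard library. The variable names are illustrative.

```python
import math
import statistics

outcomes = [1, 2, 3, 4, 5, 6]
n = len(outcomes)

mean = sum(outcomes) / n                                  # 3.5
var_def = sum((x - mean) ** 2 for x in outcomes) / n      # definition: 17.5 / 6
var_alt = sum(x * x for x in outcomes) / n - mean ** 2    # shortcut: 91/6 - 3.5^2
std = math.sqrt(var_def)

print(round(mean, 3), round(var_def, 3), round(var_alt, 3), round(std, 3))
# prints 3.5 2.917 2.917 1.708

# The standard library agrees with the hand calculation.
assert math.isclose(var_def, statistics.pvariance(outcomes))
```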

**Interpretation and Application**

A large standard deviation indicates that the data values are far from the mean and a small standard deviation indicates that they are clustered closely around the mean.

For example, each of the three samples (0, 0, 14, 14), (0, 6, 8, 14), and (6, 6, 8, 8) has a mean of 7. Their standard deviations are 7, 5 and 1, respectively. The third set has a much smaller standard deviation than the other two because its values are all close to 7. The value of the standard deviation can be considered ‘large’ or ‘small’ only in relation to the sample that is being measured. In this case, a standard deviation of 7 may be considered large. Given a different sample, a standard deviation of 7 might be considered small.
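The quoted values can be reproduced with the Python standard library. Note that 7, 5 and 1 are obtained with the divisor-*n* formula (`statistics.pstdev`). The *n* − 1 version (`statistics.stdev`) gives larger values for the same data. The dictionary keys below are labels chosen for illustration.

```python
import statistics

data_sets = {
    "first": [0, 0, 14, 14],
    "second": [0, 6, 8, 14],
    "third": [6, 6, 8, 8],
}

for label, values in data_sets.items():
    mean = statistics.mean(values)
    sd_n = statistics.pstdev(values)       # divisor n
    sd_n_minus_1 = statistics.stdev(values)  # divisor n - 1
    print(f"{label}: mean={mean}, pstdev={sd_n:.3f}, stdev={sd_n_minus_1:.3f}")
```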

Standard deviation may be thought of as a measure of uncertainty. In physical science for example, the reported standard deviation of a group of repeated measurements should give the precision of those measurements. When deciding whether measurements agree with a theoretical prediction, the standard deviation of those measurements is of crucial importance: if the mean of the measurements is too far away from the prediction (with the distance measured in standard deviations), then we consider the measurements as contradicting the prediction. This makes sense since they fall outside the range of values that could reasonably be expected to occur if the prediction were correct and the standard deviation appropriately quantified. See prediction interval.

**Relationship between Standard Deviation and the Mean**

The mean and the standard deviation of a set of data are usually reported together. In a certain sense, the standard deviation is a “natural” measure of statistical dispersion if the center of the data is measured about the mean. This is because the standard deviation from the mean is smaller than from any other point.
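The last claim can be checked with a short derivation. For any reference point *c* (a symbol introduced here for illustration), the mean squared distance of the data from *c* splits into the variance plus a non-negative term:

```latex
\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-c\right)^{2}
  = \frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2} + \left(\bar{x}-c\right)^{2}
  \;\ge\; \frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}
```

The cross term vanishes because the deviations from the mean sum to zero, so the root mean squared distance is smallest exactly when *c* = *x̄*.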