In this unit, based on the survey instrument and data used in the Census At School: New Zealand project, students learn about the notion of a distribution, and compare distributions of different variables. Several relevant Level 7 Achievement Objectives are addressed in this unit.

- Calculate means
- Choose appropriate ways of grouping continuous variables in order to represent them on histograms
- Compare two distributions
- Discuss the difference between the distribution of a variable and the distribution of the mean of a variable.

While we have labelled this unit as a Level 6 unit, we have also included several Level 7 Achievement Objectives which are applicable to the unit. In a complete statistical investigation, the student would be required to pose a question for investigation. This unit is intended to aid the development of the skills and techniques needed to complete an investigation once the question has been posed. For this reason, the unit suggests giving greater guidance to the student than they would receive in an NCEA assessment event at this level.

This unit is based on the results of a survey of school students across New Zealand that is being produced by the Census At School: New Zealand project. The survey and the data are on the website: http://www.censusatschool.org.nz/. This is an excellent source of interesting data for statistics lessons. In this unit students will explore the notion of distribution, and investigate how the distribution of the mean of a data set is different from the distribution of the original data set from which the mean was taken.

The terms used in this unit are described below for clarity:

A *dataset* is an array of information; it consists of a column for each of a number of variables. Each row represents a *record* and includes an entry for each variable. These entries, which may be numbers, letters, codes or words, are *values*. Each *variable* has a *distribution* which describes the way the values are spread.

In this unit students work with a one-variable dataset, which could be referred to as a dataset, a variable, or data.

Spreadsheet tutorial (see Related Resources if needed)

dataset, distribution, outliers, variable, mean of a variable, continuous variables, histogram

#### Session 1

In this session students will learn to think of their data as a variable with a distribution that will have interesting features.

- Provide the class with a brief overview of the structure of these sessions based on the ‘Specific Learning Outcomes’ above.
- Most statisticians represent their data graphically in order to see how it is distributed. Are some values more likely than others? Or are all values more or less equally likely? Are the smaller values more likely than the larger ones? Are the middle values the most likely? The answers to questions such as these can be found if we know how the variables in the dataset are distributed.
- Ask the students to suggest a situation where the data gathered is more or less uniform; that is, the values are equally likely. [Rolling a die is a good example. There is an equal chance of getting each of the values 1 to 6.] Ask them what students’ reaction time data would be like. Will it be uniform as in the case of the die? Ask them to explain why.
- Obtain a random sample of Left Hand Reaction Time data from the Census At School website. Go to http://www.censusatschool.org.nz/ and find the random sampler. Under variables click all levels or the level of your class, and in the second variable box select "all questions". Take a sample of 60 students.

  Transfer the variable to a spreadsheet and sort it into order. Place the class into groups of three or four. Their task is to represent the distribution of this variable as a histogram and so begin to explore the nature of the distribution.
- This task is far from straightforward, because it is not at all clear how the values ought to be grouped into categories. Ask each group of students to decide on the most appropriate way of grouping the values and thus drawing a histogram. Ask them to consider each of the following sets of intervals, and to comment on the pros and cons of each.

  [0; 0.25), [0.25; 0.5), [0.5; 0.75), and so on.

  [0; 0.1), [0.1; 0.2), [0.2; 0.3), and so on.

  [0; 0.05), [0.05; 0.1), [0.1; 0.15), and so on.

  [0; 0.02), [0.02; 0.04), [0.04; 0.06), and so on.

  Why, in particular, are the first and fourth probably not appropriate? Which of the second and third is more appropriate for their variable? Can they suggest one that is even better than either of these? [Of course, the categories need to be the same size.] There are no right answers here. The answer depends on the particular distribution. But, difficult as it is to answer, an answer is required if a distribution is to be drawn. This is one of those difficult decisions that statisticians are required to make. These sorts of decisions usually involve a bit of trial and error and some discussion.
- Discuss outliers. Sometimes the outliers are shown on these graphs. Sometimes they are omitted before the distributions are drawn. Make a decision on which of these to choose and say why.
- Ask each group to select a method of grouping, then draw a histogram that they believe gives a reasonable representation of the distribution. Get the class to compare the histograms that the groups have created, and to discuss the fitness for purpose (exploration, communication) of each.
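
For teachers who would like to preview the effect of the four groupings before the lesson, the following is a minimal Python sketch rather than part of the Census At School materials. The reaction times in `times` are invented for illustration and are not survey data, and the helper name `bin_counts` is hypothetical. It assumes the times are recorded to the nearest millisecond or coarser.

```python
from collections import Counter

# Made-up reaction times in seconds, for illustration only (not survey data).
times = [0.28, 0.31, 0.33, 0.34, 0.35, 0.36, 0.36, 0.38, 0.39, 0.40,
         0.41, 0.42, 0.44, 0.45, 0.47, 0.50, 0.52, 0.58, 0.66, 0.95]


def bin_counts(values, width):
    """Count values in equal-width intervals [k*width, (k+1)*width).

    Works in whole milliseconds to avoid floating-point trouble at the
    interval boundaries. Returns (lower_bound, count) pairs, including
    empty intervals between the first and last occupied ones.
    """
    width_ms = round(width * 1000)
    counts = Counter(round(v * 1000) // width_ms for v in values)
    first, last = min(counts), max(counts)
    return [(k * width_ms / 1000, counts[k]) for k in range(first, last + 1)]


# Print a rough text histogram for each of the four suggested groupings.
for width in (0.25, 0.1, 0.05, 0.02):
    print(f"\nInterval width {width} s")
    for lower, count in bin_counts(times, width):
        print(f"[{lower:.2f}; {lower + width:.2f})  {'*' * count}")
```

With the widest intervals almost every value falls into one bar, while the narrowest intervals leave many bars empty or with a single value, which is the trade-off the groups are asked to discuss.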

#### Session 2

In this session students will be asked to compare the distributions of two variables: ‘reaction time’, and ‘travel to school time’.

- Group the class in groups of three or four. Ask each group to predict what the distribution of the ‘travel to school time’ variable will look like. Get them to sketch the distribution and write down why they think the variable will have this distribution. Ask them if they think the distribution will be similar, or dissimilar, in __shape__ to that of the reaction time variable.

  Go to the website: http://www.censusatschool.org.nz/ and find the random sampler. Select a sample of 60, mixed, New Zealand, year 10 students. You will be provided with a table containing a __random selection__ of data for __60 Year 10 students__ from a large survey taken from New Zealand schools. You can choose the format for this information.
- Find the column headed ‘time travel’. This provides you with a list of the travel times, in minutes, for students travelling to school.
- Sort this data by the time travel variable. Using the sort function in the Data menu of Excel is an easy way to do this. Print the sorted time travel data for your students to use.
- Distribute a copy of this variable to each group and ask them to draw the histogram and hence the distribution of the variable. Ask them to describe the distribution in general terms. Ask each group to predict whether the ‘travel time’ distributions would be different for North Island and South Island students and, if so, how and why.
- Take two more samples from the website: (1) 100 South Island Students; and (2) 100 North Island Students of the same Year. Compare the two distributions using histograms and hence answer the question.
- Explain why it would be difficult to compare these two sets of travel times using box and whisker plots. [The data is only available in grouped format.]
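
As one possible way to set the two grouped samples side by side, here is a minimal Python sketch. The interval labels in `groups` and the counts in `north` and `south` are invented placeholders, not values from the random sampler, and the helper `proportions` is a hypothetical name. Converting counts to proportions keeps the comparison fair even if the two samples are not the same size.

```python
# Hypothetical grouped travel times (minutes); replace with the sampler output.
groups = ["0-9", "10-19", "20-29", "30-39", "40+"]
north = [22, 35, 24, 12, 7]   # made-up counts for 100 North Island students
south = [30, 38, 18, 9, 5]    # made-up counts for 100 South Island students


def proportions(counts):
    """Convert a list of group counts into proportions of the sample."""
    total = sum(counts)
    return [c / total for c in counts]


# Print the two distributions as side-by-side text bars on the same scale.
print(f"{'minutes':<8}{'North':<28}{'South'}")
for label, p_n, p_s in zip(groups, proportions(north), proportions(south)):
    bar_n = "#" * round(p_n * 50)
    bar_s = "#" * round(p_s * 50)
    print(f"{label:<8}{bar_n:<28}{bar_s}")
```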

#### Sessions 3, 4 and 5

In these sessions students will be introduced to the distribution of means.

- The problem that motivates these sessions is this: A class of 20 Year 10 (or the year level of your class) students want an estimate of the class __mean__ reaction time (without practising) for their right hands. How can they estimate a value, and how good/reliable will this value be? (Remind the class that the mean is found by adding up all the data values and dividing by the number of students.)
- First find the mean reaction time for right hands for the class.
- Go to the website http://www.censusatschool.org.nz/. Click on "see the questions" under the "survey" heading and go to the item about reaction time (question 13). Get each student in the class to find his or her reaction time, for the right hand, as directed on the survey. This is all done automatically. Students just need to press the mouse when directed to by the computer. The reaction time is calculated by the computer and displayed. Do not allow the students to practise. The reading must be their first attempt.
- Record the results for the class in a column on a spreadsheet such as Excel and calculate the mean reaction time. You can use the average function from the auto-sum icon on the toolbar to do this. Make the values of the variable available to students on paper or screen. Ask them, in groups, to graph the distribution (easiest as a dotplot), mark in the mean, and check whether it looks sensible. What features does the distribution have? Can they explain these features (e.g. in terms of fast, slow or distracted students)? With this distribution, is it OK to look for a centre? If so, is the mean a good measure of it?
- The next phase of this combined lesson is very important, but it also takes quite a bit of time. Basically, it is necessary to randomly sample a large number of groups of 20 Year 10 students and in each case to find the mean reaction time for the right hand. The goal is to obtain a large number of mean times for classes of 20, and to draw the distribution of these means using a histogram.
- The ideas behind this are very important and so it is worth taking a generous amount of time over it. Have each student contribute their own raw reaction time and one mean of 20 values to the class investigation. The mean values can be obtained as below.
- Have every student in your class take a sample of 20 students on the random sampler, as in the first step of Session 2. The reaction times are listed in the Speedster section. Choose the format you would like the data in; using an Excel spreadsheet will enable the means of the data to be easily calculated.
- It is tempting to do all this for the students before the lesson and provide them with the mean values, and thus save time during the lesson. Such a temptation should be resisted. If the teacher is seen to trust that the law of large numbers will work, and that the experiment is not contrived, the impact is much stronger. The lessons learned about the distribution of the means by doing this slowly and painstakingly are too valuable to be avoided. Once you have all the values of the means from the students in your class, put these on a spreadsheet, sort them into numerical order, and print out a copy for each group of students.
- Get each group to draw two histograms, one showing the distribution of the means and the second showing the distribution of the raw set of values for the class’s reaction times. Using the same scale on each histogram will enable comparisons to be easily made. Students will need to decide on the best way of grouping the data. Once this is done, make a big point of showing them that the distribution of __mean reaction times__ is very different from the distribution of __reaction times__. The two distributions have approximately the same central values, but the distribution of the means is very narrow by comparison with the distribution of the reaction times. That is, the spread of the means is much less than the spread of the reaction times. This point is very important.
- Now ask the groups to find the two main differences. They are: the distribution of the means has a smaller spread, and it has a more symmetrical shape. Ask them to explain why this happened here, and whether it will always happen.
- Now go back to the initial questions and ask the groups to decide how reliable or variable the mean from their replication is, and how it relates to the class’s own mean. They could give an informal confidence interval for the population (Census At School) mean: ‘It is very likely to be between … and …’ They could comment on how much faster or slower their mean is, compared with the Census At School mean and its interval.
- Students should finish by making a poster that explains how the distribution of a __variable__ and the distribution of the __mean of a variable__ refer to different things and have different distributions and features. Include the relevant graphs in this poster.
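
For teachers who want to rehearse the idea privately before the class gathers its own means, the sketch below simulates it in Python. It is an illustration under stated assumptions and not a substitute for the students' own sampling: the `population` is an artificial right-skewed stand-in for reaction times, not Census At School data, the class size of 30 is an assumption, and `sample_means` is a hypothetical helper name.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the sketch is repeatable

# Artificial right-skewed "reaction times" in seconds. This is an assumed
# stand-in for the survey population, NOT real Census At School data.
population = [0.25 + random.expovariate(8) for _ in range(5000)]


def sample_means(pop, sample_size, n_samples):
    """Draw n_samples random samples and return the mean of each one."""
    return [statistics.mean(random.sample(pop, sample_size))
            for _ in range(n_samples)]


# One mean of 20 values per student, for an assumed class of 30 students.
means = sample_means(population, 20, 30)

raw_sd = statistics.stdev(population)
means_sd = statistics.stdev(means)
print(f"Mean of raw values:   {statistics.mean(population):.3f} s")
print(f"Mean of sample means: {statistics.mean(means):.3f} s")
print(f"Spread (SD) of raw values:   {raw_sd:.3f} s")
print(f"Spread (SD) of sample means: {means_sd:.3f} s")

# Informal interval: the range left after setting aside the single lowest
# and single highest sample mean.
sorted_means = sorted(means)
low, high = sorted_means[1], sorted_means[-2]
print(f"The mean is very likely to be between {low:.3f} s and {high:.3f} s")
```

The two centres should come out close together while the spread of the means is much smaller than the spread of the raw values, which is the contrast the poster is meant to explain.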