This unit is based on the survey instrument and data used in the Census At School: New Zealand project. In this unit students will become familiar with the site, generate random samples of data, and compare data using box and whisker plots.

- identify questions that can be answered using statistical data
- discuss the notion of ‘outlier’
- draw box and whisker plots
- compare two such plots

This unit is based on the results of a survey of school students across New Zealand, published on the web. This is an excellent source of interesting data for statistics lessons. In this unit students will use a variety of statistical thinking skills to explore data and convey findings, using box and whisker plots to make comparisons.

The survey and the data are on the website: http://www.censusatschool.org.nz/.

Internet access and a spreadsheet (such as Excel or Google Docs) or a graphical calculator.

box and whisker plot, upper and lower quartile, random sample, random sampler, median, outliers, ambiguous values, decision criterion, cleaning a data set

#### Session 1

In this session students will be introduced to the Census At School: New Zealand Survey and Data, and they will create their own ‘reaction time’ data.

- Give the class a brief overview of the structure of these sessions based on the ‘Specific Learning Outcomes’ above. In particular emphasise that they are going to think of questions about hand reaction times that they can investigate using data from the website.
- Find an online reaction tester. Get each student in the class to find his or her reaction time. Do not allow the students to practise. The reading must be their first attempt.
- Make up a poster showing the dataset for the class. Use the three headings: __Name, Gender, Reaction Time__.
- Go to http://www.censusatschool.org.nz/ and have the students download data for a random sample of 20 Year 8 female students from the South Island. Ask the class why the selection of the data set is random and why random selection might be important.
- Find the information about reaction times in the random data set. Select and print this information.
- Now have the students repeat the whole process of taking a random selection for exactly the same group of students: South Island, Year 8 Females. Compare the two printouts. Note that they are different. The purpose of doing this a second time is to convince the students that a __different__ sample is taken each time a random selection is taken. If necessary take a third sample.
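
For teachers who want to demonstrate this point away from the website, the following is a minimal Python sketch of the idea. It is not the website's random sampler: the population of reaction times is made up for illustration, and `draw_sample` is a hypothetical helper name.

```python
import random

# Hypothetical population of 200 reaction times in seconds.
# These values are invented for illustration; they are NOT Census At School data.
population = [round(0.25 + 0.01 * i, 2) for i in range(200)]


def draw_sample(values, size, seed=None):
    """Return a simple random sample of `size` values, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(values, size)


# Two samples of 20 from exactly the same population.
# Different seeds stand in for "asking the sampler a second time".
sample_a = draw_sample(population, 20, seed=1)
sample_b = draw_sample(population, 20, seed=2)

print(sorted(sample_a))
print(sorted(sample_b))
print("Same sample?", sorted(sample_a) == sorted(sample_b))
```

Leaving out the `seed` argument gives a fresh sample on every run, which is closer to what students see when they download data twice.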

#### Session 2

In this session students will learn how to form questions about reaction times, and how to statistically process their own ‘reaction time’ data.

- Put students into groups of three or four. Ask each group to write down a list of ten to fifteen interesting questions that might be asked about reaction times. Once this is done, ask them to cross out any that they think could not be answered by using random samples from the survey data. Then get them to select the __three__ they are most interested in. [Students should come up with questions such as: Do female Year 8 students tend to have faster reaction times than males? Do South Island females generally have faster reaction times than North Island females? Do members of this class have reaction times that are roughly typical of students elsewhere in New Zealand?]
- In order to answer these sorts of questions students need to __process__ their data in various ways. Use the __class__ dataset for this processing.
- First get each group to write down the class reaction time data in __numerical order__ from smallest to largest. It is usually easier to make judgements about a dataset __after__ it has been ordered in this way.
- If you are lucky, and your class is typical, you will have data ranging from about 0.3 seconds up to fairly large values such as 2.9 or even 5.3. These somewhat peculiar values are of immediate interest, and need to be examined. What students need to decide is this: Do these large values result from people who have genuinely slow reaction times, or are they misleading, and really the people concerned were distracted or confused when they were doing the reaction time study? One sample of 20 from the New Zealand data, taken from the website, had a value of 17.373. The question that needs to be answered is this: Does this value of 17.373 result from a student with a slow reaction time or not? Of course we cannot know the answer for certain; we have to make decisions based on sound judgements.

  __Outliers__ are any extreme values that do not appear to belong with the rest of the distribution. They may be unusual values that need further investigation, or they may be "mistakes". If some results are judged to be not genuine measures of reaction time, i.e. mistakes, they are not used in making inferences based on the dataset. Graphs are a useful tool for exploring data sets and helping to identify outliers.

  Often it is not clear which outlying values are mistakes and which are simply values from people with slow reaction times. Is a 1.9 result, for instance, a mistake or from a person with a slow reaction time? What about 2.3? What about 3.4? What about 19.5? In some cases it is very hard to know. So there needs to be ample discussion about this, and arguments for and against need to be presented and considered. Perhaps the answer is to collect more evidence by watching students doing the reaction time measurement. Take a generous amount of class time to do all of this. These discussions are important, and these are exactly the issues professional statisticians need to address when they are dealing with data such as this. When in doubt, keep the value in.
- If your class did not have any ambiguous values of the sort just mentioned, it would be useful for you to have a random sample prepared in advance which does contain such values. In fact it would be useful to discuss another example regardless. It is recommended that you take a sample of 100 South Island Year 5 students. This should produce some large and ambiguous results for the class to discuss. In Excel you can quickly sort the data into numerical order using the sort function in the Data menu.
- Have each group identify a __decision criterion__ that could be used for judging whether or not a value is a mistake that needs to be dropped. This will add a greater focus to the discussion just held. For instance, students could come up with something like this: "If a value is above 3.0 and more than 0.5 beyond the next smallest value it will be dropped". There is no right answer here, and it is likely that the answer would vary depending on the age of the students in the sample. What is important is that students debate the issue and come up with some well-considered decision criterion. You should also consider whether there are any outliers at the fast end of the spectrum. This is unlikely, but it is still worth raising as a possibility.
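
To show what applying such a criterion looks like in practice, here is a minimal Python sketch of the example rule quoted above. It reads the rule literally: each value is compared with the value immediately below it in the ordered list. The class data, the function name `clean` and its parameter names are all invented for illustration, and a class may well settle on a different rule.

```python
def clean(times, threshold=3.0, gap=0.5):
    """Split reaction times into (kept, dropped) using the example criterion:
    drop a value if it is above `threshold` AND more than `gap` beyond the
    next smallest value in the ordered data."""
    ordered = sorted(times)
    kept, dropped = [], []
    for i, value in enumerate(ordered):
        # The "next smallest value" is the one just before it in the ordered list.
        previous = ordered[i - 1] if i > 0 else None
        is_mistake = (
            previous is not None
            and value > threshold
            and value - previous > gap
        )
        (dropped if is_mistake else kept).append(value)
    return kept, dropped


# Hypothetical class data in seconds (invented, not real survey results).
class_times = [0.45, 0.31, 0.52, 2.9, 0.38, 5.3, 0.61, 0.47, 0.4, 17.373]

kept, dropped = clean(class_times)
print("Kept:   ", kept)
print("Dropped:", dropped)
```

Read literally, the rule would keep a large value that sits just above another large value, for example 5.4 directly after 5.3. Whether that is what the group intended is a good point for discussion.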

#### Session 3

In this session students will learn a way of representing the data for the purposes of making comparisons using a box and whisker plot.

- Show the students how to find the median, and upper and lower quartiles of the class dataset for the left hand reaction times.

  First take the ordered data. The median is the ‘middle’ value of the ordered data. If there is an odd number of values (excluding outliers), take the middle value. If there is an even number, take the value halfway between the two middle values. The lower and upper quartiles are found in a similar way by finding values a quarter and three quarters along the ordered data.
- Next show the students how to draw a box and whisker plot of the dataset. Ask students what other sorts of graph they could use, what they prefer for certain tasks, and why.
- The students will no doubt be interested in how they compare with other students in New Zealand. Take a random sample from a comparable group of students from the web survey data. Use a sample size slightly different from the size of your own class in order that the students can see that comparisons can be made even when the number of data points is different. Get each group to find the outliers and draw a box and whisker plot beneath the one for the class.
- Get each group to compare the two sets of data represented by the two plots. Ask the groups to identify in writing how the sets of data are similar and how they are different. What can they say about the two sets of reaction times? Can they say that, overall, one group of students is faster? Why?
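
One way to check the hand calculations is sketched below in Python. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the count is odd. This is one common classroom convention; spreadsheets and calculators may use slightly different rules and so give slightly different quartiles. Both datasets and the function name `five_number_summary` are invented for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return the five values needed to draw a box and whisker plot.

    Assumption: quartiles are the medians of the lower and upper halves of
    the ordered data, excluding the middle value when the count is odd.
    Needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "min": ordered[0],
        "lower_quartile": median(lower_half),
        "median": median(ordered),
        "upper_quartile": median(upper_half),
        "max": ordered[-1],
    }


# Hypothetical cleaned data in seconds (invented, not real survey results).
# The two groups deliberately have different sizes.
class_times = [0.31, 0.38, 0.40, 0.45, 0.47, 0.52, 0.61, 0.75, 2.9]
sample_times = [0.29, 0.33, 0.36, 0.41, 0.44, 0.48, 0.50, 0.55, 0.63, 0.80, 1.2]

for label, data in [("Class", class_times), ("Sample", sample_times)]:
    print(label, five_number_summary(data))
```

Placing the two summaries side by side mirrors drawing one plot beneath the other: students can compare the medians and the widths of the boxes even though the groups differ in size.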

#### Sessions 4 and 5

In these sessions students will address two or three of the questions they identified in Session 2.

- In Session 2 each group identified three questions that particularly interested them. From these questions each group should choose one to investigate. In these sessions they should identify the variables that will help to answer their question, and then decide what sample or samples are needed.
- Take any random samples they need from the web data.
- Clean the dataset using the decision criterion from Session 2.
- Explore the distributions of the variables from the data set and subsets of it using appropriate graphs of their choice, and medians etc.
- Use the dataset and graphs to help answer their question.
- Repeat this process for the other two questions.
- When they have finished answering each question they should make a poster that outlines the central question, the data collected, the results of the statistical processing, and the conclusions drawn. They should include at least one graph in their poster that helps illustrate the conclusions that they have drawn.