**6**

This chapter shows how to compute simple statistical hypothesis tests and confidence intervals for means, for proportions, and for variances, along with simple nonparametric tests, a test of normality, and correlation tests. Many of these tests are typically taken up in a basic statistics class, and, in particular, tests and confidence intervals for means and proportions are often employed to introduce statistical inference.

**6.1 Tests on Means**

The R Commander *Statistics > Means* menu (see Figure A.4 on page 202) includes items for tests for a single mean, for the difference between means from independent samples, for the difference between means from matched (paired) samples, and for one-way and multi-way analysis of variance (ANOVA). The test for matched pairs compares the means of two variables in the active data set, presumably measured on the same scale (e.g., husband’s and wife’s annual income in dollars in a data set of married heterosexual couples).

**6.1.1 Independent Samples Difference-of-Means t Test**

The dialogs for tests on a single mean and for the difference between two means are similar, and so I’ll illustrate with an *independent-samples t test*, using the Guyer data set in the car package. I begin by reading the data via *Data > Data in packages > Read data set from an attached package* (as described in Section 4.2.4).

The Guyer data set is drawn from an experiment, reported by Fox and Guyer (1978), in which 20 four-person groups each played 30 trials of a “prisoners’ dilemma” game. On each trial of the experiment, the individuals in a group could make cooperative or competitive choices, and the response variable in the data set (cooperation) is the number of cooperative choices, out of the 120 choices made in each group during the course of the experiment.

Ten of the groups were randomly assigned to a treatment in which their choices were “public,” in the sense that each individual’s choice on a trial was visible to the other members of the group, while the other 10 groups made their choices anonymously, so that group members were aware only of the *numbers* of cooperative and competitive choices made on each trial. The treatment for each group is recorded in the factor condition as P (public choice) or A (anonymous choice).1

With Guyer as the active data set, selecting *Statistics > Means > Independent samples t-test* brings up the dialog box in Figure 6.1. The *Groups* list box in the *Data* tab displays the two-level factors in the data set; I select condition. The *Response Variable* list box includes numeric variables in the active data set; because there is just one numeric variable—cooperation—it is preselected. In this example, there are equal numbers of cases—10 each—in the two groups to be compared, but equal sample sizes aren’t necessary for a difference-of-means *t* test.

FIGURE 6.1: *Independent Samples t-Test* dialog box, showing the *Data* and *Options* tabs.

The levels of condition are in default alphabetic order, and so the *Options* tab shows that the difference in means will be computed as A − P (i.e., anonymous minus public choice, x̄A − x̄P). Because I expect higher cooperation in the public-choice condition, I select a directional alternative hypothesis—that the difference in “population” means between the anonymous and public-choice conditions is less than 0 (i.e., negative): *Ha*: *μA* − *μP* < 0; the default choice is a *Two-sided* alternative hypothesis: *Ha*: *μA* − *μP* ≠ 0. I leave the other options at their defaults: a 95% confidence interval, and no assumption of equal variances for the two groups.

Because the alternative hypothesis is directional, the t.test command invoked by the dialog also computes a *one-sided confidence interval* for the difference in means.2 Because equal group variances are not assumed, t.test computes the standard error of the difference in means using separate group standard deviations and approximates the degrees of freedom for the test by the *Welch–Satterthwaite* formula (see, e.g., Moore et al., 2013, p. 480).
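For readers following along in the *R Script* tab: the command that the dialog generates is, up to argument order, a t.test call like this sketch:

```r
# Welch two-sample t test of cooperation by condition in the Guyer data,
# with the directional alternative mu_A - mu_P < 0
t.test(cooperation ~ condition, alternative = "less",
       conf.level = 0.95, var.equal = FALSE, data = Guyer)
```

The formula cooperation ~ condition splits the response by the levels of the factor, with the difference taken as the first level (A) minus the second (P).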

FIGURE 6.2: One-sided independent-samples difference-of-means *t* test for cooperation by condition in the Guyer data set.

The resulting output, in Figure 6.2, shows that the mean number of cooperative choices is indeed higher in the public-choice group (x̄P = 55.7) than in the anonymous-choice group (x̄A = 30.9), and that this difference is statistically significant (*p* = 0.0088, one-sided). The t.test function doesn’t report the standard deviations in the two groups, but they are simple to compute via *Statistics > Summaries > Numerical summaries* (as described in Section 5.1) as *sP* = 14.85 and *sA* = 9.42.
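Equivalently, the within-group standard deviations can be obtained with a single line of base R:

```r
# standard deviation of cooperation within each level of condition
tapply(Guyer$cooperation, Guyer$condition, sd)
```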

The difference-of-means *t* test makes the assumption that the response variable is normally distributed within groups in the “population,” and, because the sample sizes are small in the Guyer data set, I should be concerned if the distributions are skewed or if there are outliers. To check, I draw back-to-back stem-and-leaf displays for cooperation by condition, selecting *Graphs > Stem-and-leaf display* from the R Commander menus (see Section 5.3.1), taking all of the defaults in the resulting dialog (except for plotting back-to-back), and producing the result shown in Figure 6.3. There are no obvious problems here, although it’s apparent that the values for the public-choice group are more variable than those for the anonymous-choice group. It would indeed have been more sensible to examine the data *before* performing the *t* test!

**6.1.2 One-Way Analysis of Variance**

*One-way analysis of variance* tests for differences among the means of several independent samples. To illustrate, I’ll use data from a memory experiment conducted by Friendly and Franklin (1980). Subjects in the experiment were presented with a list of 40 words that they were asked to memorize. After a distracting task, the subjects recalled as many words as they could on each of five trials. Thirty subjects were assigned randomly to one of three experimental conditions, 10 to each condition: In the *standard free recall* condition, the words were presented in random order on each trial; in the *before* condition, words that were recalled on the previous trial were presented first to the subject in the order in which they were recalled, followed by the forgotten words; in the *meshed* condition, recalled words were also presented in the order in which they were previously recalled, but were intermixed with the forgotten words. As in the preceding *t* test example, there are equal numbers of cases in the groups (here, three groups), but the one-way ANOVA procedure in the R Commander can also handle unequal sample sizes.

FIGURE 6.3: Back-to-back stem-and-leaf displays for cooperation by condition, with the A (anonymous-choice) group on the left and the P (public-choice) group on the right.

The data from Friendly and Franklin’s experiment are in the Friendly data set in the car package.3 The data set consists of the variable correct, giving the number of words (of the 40) correctly recalled on the final trial of the experiment, and the factor condition, with levels Before, Meshed, and SFR. I read the Friendly data set in the usual manner (Section 4.2.4), making it the active data set in the R Commander.

Dot plots of the data (Section 5.3.1), in the upper panel of Figure 6.4, suggest that the spreads in the three groups are quite different, and that data in the Before group are negatively skewed, at least partly due to a “ceiling effect,” with 6 of the 10 subjects recalling 39 or 40 words. This is potentially problematic, because ANOVA assumes normally distributed populations with equal variances.

Using the *Compute New Variable* dialog (described in Section 4.4.2), I add the variable logit.correct to the data set—the logit transformation of the proportion of words correctly recalled—computed as logit(correct/40).4 A dot plot of the transformed data, in the lower panel of Figure 6.4, shows that the spreads in the three groups have been approximately equalized.
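The dialog translates into an ordinary R assignment; a sketch of the generated command, using the logit function from the car package:

```r
library(car)  # provides logit()
# log-odds of the proportion of the 40 words recalled correctly;
# proportions of exactly 0 or 1 are pulled slightly inward, with a warning
Friendly$logit.correct <- logit(Friendly$correct/40)
```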

With this preliminary work accomplished, I select *Statistics > Means > One-way ANOVA* from the R Commander menus, producing the dialog box in Figure 6.5. The *Groups* variable, condition, is preselected because there is only one factor in the data set. I pick logit.correct as the *Response Variable*, and check the *Pairwise comparisons* box. Clicking the *OK* button results in the printed output in Figures 6.6 and 6.7, and the graph in Figure 6.8.5

As the dot plots in Figure 6.4 suggest, the mean logit of words recalled is greatest in the Before condition, nearly as high in the Meshed condition, and lowest in the SFR condition, while the standard deviations of the logits are similar in the three groups. The differences among the means are just barely statistically significant, with *p* = 0.0435 from the one-way ANOVA.

The pairwise comparisons among the means of the three groups are adjusted for simultaneous inference using Tukey’s “honest significant difference” (*HSD*) procedure (Tukey, 1949). The results are shown both as hypothesis tests and as confidence intervals for the pairwise mean differences, with the latter graphed in Figure 6.8. The only pairwise comparison that proves statistically significant is the one between SFR and Before, *p* = 0.047.6
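The same analysis can be reproduced at the command line. The R Commander computes the pairwise comparisons with the multcomp package, but base R’s TukeyHSD() yields equivalent Tukey HSD tests and intervals; a sketch:

```r
# one-way ANOVA of the logit-transformed response by condition
AnovaModel.1 <- aov(logit.correct ~ condition, data = Friendly)
summary(AnovaModel.1)         # one-way ANOVA table
TukeyHSD(AnovaModel.1)        # pairwise mean differences, adjusted p-values
plot(TukeyHSD(AnovaModel.1))  # simultaneous confidence intervals
```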

**6.1.3 Two-Way and Higher-Way Analysis of Variance**

In Section 5.1, I introduced data from an experiment conducted by Adler (1973) on experimenter effects in psychological research. To recapitulate briefly, ostensible “research assistants” (the actual subjects of the study) were assigned randomly to three experimental conditions, and were variously instructed to collect “good data,” “scientific data,” or given no such instruction. The research assistants showed photographs to respondents who were asked to rate the apparent “successfulness” of the individuals in the photos, and the assistants were also told, at random, to expect either low or high ratings. The data from the experiment are in the Adler data set in the car package, which I now read into the R Commander. The data set contains the experimentally manipulated factors instruction and expectation, and the numeric response variable rating, with the average rating obtained by each assistant.

FIGURE 6.4: Dot plots by experimental condition of number (top) and logit (bottom) of words recalled correctly in Friendly and Franklin’s memory experiment.

FIGURE 6.5: *One-Way Analysis of Variance* dialog.

FIGURE 6.6: One-way ANOVA output for Friendly and Franklin’s memory experiment, with the response variable as the logit of words correctly recalled: ANOVA table, means, standard deviations, and counts.

FIGURE 6.7: One-way ANOVA output for Friendly and Franklin’s memory experiment: Pairwise comparisons of group means.

FIGURE 6.8: Simultaneous confidence intervals for pairwise comparisons of group means from Friendly and Franklin’s memory experiment, with the response variable as the logit of words correctly recalled.

FIGURE 6.9: *Multi-Way Analysis of Variance* dialog box.

Numerical summaries of the average ratings by combinations of levels of the two factors instruction and expectation appear in Figure 5.5 (on page 85), and the means are graphed in Figure 5.24 (page 108). The pattern of means suggests an interaction between instruction and expectation: When assistants were asked to collect “good” data, those with a high expectation produced ratings higher in apparent successfulness than those with low expectations; this pattern was reversed for those told to collect “scientific” data or given no such instruction—as if these assistants leaned over backwards to avoid bias, consequently producing a bias in the reverse direction.

To compute a *two-way ANOVA* for the Adler data, I select *Statistics > Means > Multi-way ANOVA*, bringing up the dialog in Figure 6.9.7 The *Response Variable* rating is preselected; I *Ctrl*-click on the *Factors* expectation and instruction, and click *OK*, obtaining the two-way ANOVA output in Figure 6.10, with an ANOVA table,8 and cell means, standard deviations, and counts.9 The interaction between expectation and instruction proves to be highly statistically significant, *p* = 2.6 × 10⁻⁶.
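At the command line, the dialog’s work corresponds roughly to fitting a linear model with crossed factors and computing Type II tests with the car package’s Anova function (the model name here is illustrative):

```r
library(car)  # for Anova()
AnovaModel.2 <- lm(rating ~ instruction*expectation, data = Adler)
Anova(AnovaModel.2)  # Type II ANOVA table
# cell means by combination of factor levels:
with(Adler, tapply(rating, list(instruction, expectation), mean))
```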

FIGURE 6.10: Two-way ANOVA output for the Adler data: ANOVA table, and cell means, standard deviations, and counts.

**6.2 Tests on Proportions**

The *Statistics > Proportions* menu (see Figure A.4 on page 202) includes items for single-sample and two-sample tests. Both dialogs are simple and generally similar to the corresponding dialogs for means.

I’ll illustrate a single-sample test for a proportion using the Chile data set in the car package (introduced in Section 5.3.2). The data are from a poll of eligible voters in Chile, conducted about six months prior to the 1988 plebiscite to determine whether the country would continue under military rule or initiate a transition to electoral democracy. A “yes” vote represents support for the continuation of Augusto Pinochet’s military government, while a “no” vote represents support for a return to democracy. In the event, of course, the “no” side prevailed in the plebiscite, garnering 56% of the vote.

Of the 2700 voters who were interviewed for the poll, 889 said that they were planning to vote “no” (coded N in the Chile data set) and 868 said that they were planning to vote “yes” (coded Y); the remainder of the respondents said that they were undecided (U, 588), were planning to abstain (A, 187), or didn’t answer the question (NA, 168). After reading the Chile data set in the usual manner, I begin my analysis by recoding vote into a two-level factor, retaining the variable name vote, and employing the recode directives “Y” = “yes”, “N” = “no”, and else = NA (see Section 4.4.1).
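The recode directives map onto a call to the car package’s recode function; a sketch:

```r
library(car)  # for recode()
# collapse vote to a two-level factor; all other levels become NA
Chile$vote <- recode(Chile$vote, '"Y"="yes"; "N"="no"; else=NA')
```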

Choosing *Statistics > Proportions > Single-sample proportion test* produces the dialog in Figure 6.11. I select vote in the *Data* tab and leave the selections in the *Options* tab at their defaults. In particular, the default null hypothesis that the population proportion is 0.5 against the default two-sided alternative makes sense in the context of the pre-plebiscite poll.

Clicking *OK* generates a test, shown at the top of Figure 6.12, based on the normal approximation to the binomial distribution. For comparison, I’ve also shown (at the bottom of Figure 6.12) the output from the exact binomial test, obtained by selecting the corresponding radio button in the *Single-Sample Proportions Test* dialog. The proportion reported is for the no level of the two-level factor vote—the level that is alphabetically first. Because of the large sample size and the sample proportion close to 0.5, the normal approximation to the binomial is very accurate, even without using a continuity correction (which is also available in the dialog). With the sample proportion p̂ = 0.506, the null hypothesis of a dead heat can’t be rejected, *p*-value = 0.63 (by the exact test). Close polling results like this motivated the “no” campaign in the Chilean plebiscite to greater unity and effort.

The prop.test function reports a chi-square test statistic (labelled X-squared) on one degree of freedom for the test based on the normal approximation to the binomial. It is more common to use the normally distributed test statistic

z = (p̂ − p0) / √[p0(1 − p0)/n]

where p̂ is the sample proportion, *p*0 is the hypothesized population proportion (0.5 in the example), and *n* is the sample size. The relationship between the two test statistics is very simple, *X*² = *z*², and they produce identical *p*-values (because a chi-square random variable on one *df* is the square of a standard-normal random variable).
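A quick hand computation confirms the relationship for the plebiscite poll (889 “no” responses among the 1757 decided voters):

```r
p.hat <- 889/1757                            # sample proportion of "no"
z <- (p.hat - 0.5)/sqrt(0.5*(1 - 0.5)/1757)  # normal test statistic
z^2  # equals the X-squared statistic reported by prop.test()
prop.test(889, 1757, p = 0.5, correct = FALSE)  # no continuity correction
```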

FIGURE 6.11: *Single-Sample Proportions Test* dialog, with *Data* and *Options* tabs.

FIGURE 6.12: Single-sample proportion tests: normal approximation to the binomial (top) and exact binomial test (bottom).

**6.3 Tests on Variances**

Tests on variances are generally not recommended as preliminaries to tests on means but may be of interest in their own right. The *Statistics > Variances* menu (shown in Figure A.4 on page 202) includes the *F-test* for the difference between two variances, *Bartlett’s test* for differences among several variances, and *Levene’s test* for differences among several variances. Of these tests, Levene’s test is most robust with respect to departures from normality, and so I’ll use it to illustrate, returning to Friendly and Franklin’s memory data (introduced in Section 6.1.2 on one-way ANOVA): I select Friendly via the *Data set* button in the R Commander toolbar to make it the active data set.

The *Levene’s Test* dialog is depicted in Figure 6.13. Because there is only one factor in the Friendly data set, condition is preselected in the dialog. More generally, several factors can be selected, in which case groups are defined by combinations of levels of the factors. I select correct—the number of words correctly remembered—as the response; recall that I computed the logit transformation of the proportion correct partly to equalize the variances in the different conditions. I leave the *Center* specification at its default, to use the *median*, the more robust of the two choices. These selections lead to the output in Figure 6.14, showing that the differences in variation among the three conditions are not quite statistically significant (*p* = 0.078).
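The corresponding command is a call to the leveneTest function in the car package:

```r
library(car)  # for leveneTest()
# Levene's test centered at the group medians (the robust default)
leveneTest(correct ~ condition, data = Friendly, center = median)
```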

FIGURE 6.13: The *Levene’s Test* dialog box to test for differences among several variances.

FIGURE 6.14: Levene’s test for differences in variation of number of words recalled among the three conditions in Friendly and Franklin’s memory experiment.

**6.4 Nonparametric Tests**

The *Statistics > Nonparametric tests* menu (see Figure A.4 on page 202) contains items for several common nonparametric tests—that is, tests that don’t make distributional assumptions about the populations sampled. These tests include the single-sample and paired-samples *Wilcoxon signed-rank tests* and the two-sample *Wilcoxon rank-sum test* (also known as the *Mann–Whitney test*), which are nonparametric alternatives to single-sample, paired-samples, and two-sample *t* tests for means; the *Kruskal–Wallis test*, which is a nonparametric alternative to one-way ANOVA (and is a generalization of the two-sample Wilcoxon test); and the *Friedman rank-sum test*, which is an extension of a matched-pairs test to more than two matched values, often used for repeated measures on individuals.

All of the nonparametric test dialogs in the R Commander are very simple. I’ll use the Kruskal–Wallis test for Friendly and Franklin’s memory data as an example. A one-way ANOVA for the logit of the proportion of words correctly recalled by each subject in the experiment was performed in Section 6.1.2. Choosing *Statistics > Nonparametric tests > Kruskal-Wallis test* produces the dialog box in Figure 6.15. The factor condition is preselected to define *Groups*, and I pick correct as the response variable, obtaining the output in Figure 6.16. The Kruskal–Wallis test for differences among the conditions in the Friendly and Franklin memory experiment is not quite statistically significant (*p* = 0.075).
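The generated command is a call to the base-R kruskal.test function:

```r
kruskal.test(correct ~ condition, data = Friendly)
# the test statistic is unchanged under any monotone transformation of the
# response, e.g., kruskal.test(logit.correct ~ condition, data = Friendly)
```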

In contradistinction to the parametric one-way analysis of variance, I would get *exactly the same* Kruskal–Wallis test whether I use correct or logit.correct as the response variable (try it!), because the Kruskal–Wallis test is based on the *ranks* of the response values and not directly on the values themselves.10

FIGURE 6.15: *Kruskal-Wallis Rank Sum Test* dialog box.

FIGURE 6.16: Kruskal–Wallis test for differences among conditions in Friendly and Franklin’s memory experiment.

**6.5 Other Tests**

The *Statistics > Summaries* menu contains items for two simple statistical tests: the Shapiro–Wilk test of normality, and tests for Pearson product-moment and rank correlation coefficients. These tests are for numeric variables. I’ll illustrate using the Prestige data set, introduced in Section 4.2.3 and available in the car package, reading the data set as usual via *Data > Data in packages > Read data set from an attached package*.

The distributional displays in Figure 5.10 (on page 92) for education in the Prestige data set suggest that the variable isn’t normally distributed: The histogram and density plot for the data appear to have multiple modes, and the normal quantile-comparison plot shows shorter tails than a normal distribution. Selecting *Statistics > Summaries > Shapiro-Wilk test of normality* brings up the dialog box in Figure 6.17; selecting education and clicking *OK* produces the output in Figure 6.18. The departure from normality is highly statistically significant (*p* = 0.00068).
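The corresponding command is a one-liner:

```r
# Shapiro-Wilk test of normality for education in the Prestige data
with(Prestige, shapiro.test(education))
```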

The scatterplot of income versus education in the Prestige data set, shown in Figure 5.16 (page 99), suggests a monotone (strictly increasing) but nonlinear relationship between the two variables. Selecting *Statistics > Summaries > Correlation test* from the R Commander menus leads to the dialog box in Figure 6.19. I select education and income in the *Variables* list box, and, because the relationship between the two variables is apparently nonlinear, I’ll examine a rank-order correlation rather than the Pearson product-moment correlation between the variables. There are ties in the data, and so I select *Kendall’s tau* in preference to *Spearman’s rank-order* correlation. I anticipate a positive correlation between education and income, as reflected in the choice of *Alternative Hypothesis*. The output produced by clicking *OK*, shown in Figure 6.20, indicates that the positive ordinal relationship between the two variables is very highly statistically significant, with *p* = 5.5 × 10⁻¹⁰, effectively 0; the estimated Kendall correlation is τ̂ = 0.41.
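The generated command is a call to cor.test; a sketch:

```r
# one-sided Kendall rank-correlation test of education with income
with(Prestige, cor.test(education, income, alternative = "greater",
                        method = "kendall"))
```

Because of the ties in the data, cor.test computes the *p*-value from a normal approximation rather than the exact distribution of Kendall’s tau.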

FIGURE 6.17: *Shapiro-Wilk Test for Normality* dialog box.

FIGURE 6.18: Output for the Shapiro–Wilk normality test of education in the Prestige data set.

FIGURE 6.19: *Correlation Test* dialog box.

FIGURE 6.20: Test of the ordinal relationship between education and income in the Prestige data set.

1There is a third variable in the Guyer data set, sex, indicating the gender composition of each group, with half the groups composed of females (coded F) and the other half of males (coded M). This factor won’t figure in the *t* test that I report in this section, but I invite the reader to perform a two-way analysis of variance for the Guyer data, as described in Section 6.1.3.

2The lower bound of the one-sided confidence interval is necessarily –∞. If, as is likely, you’re unfamiliar with one-sided confidence intervals, you can simply disregard the reported confidence interval when you perform a one-sided test.

3I’m grateful to Michael Friendly of York University for making the data available.

4You may well be unfamiliar with the *logit transformation*, defined as the log of the odds: That is, if *p* is the proportion correct (i.e., the number correct divided by 40), the *odds* of being correct are *p*/(1 – *p*) (the proportion correct divided by the proportion incorrect), and the log-odds or logit is log[*p*/(1 – *p*)]. The logit tends to improve the behavior of proportions that get close to 0 or 1. When, as here, some values of *p* are *equal* to 0 or 1, the logit is undefined. The logit function (from the car package) used in computing logit.correct moves these extreme proportions slightly away from 0 or 1 prior to calculating the logit, as reflected in a warning printed in the R Commander *Messages* pane.

More generally, don’t be concerned about the details of the logit transformation. The object is simply to make the data conform more closely to the assumptions of one-way analysis of variance.

5The observant reader working along with the examples in this chapter will notice that, in addition to printed and graphical output, the *One-Way Analysis of Variance* dialog also produces a *statistical model*, named AnovaModel.1, which becomes the *active model* in the R Commander. This is true as well of the *Multi-Way Analysis of Variance* dialog discussed in the next section. The active model can then be manipulated via the *Models* menu. The treatment of statistical models in the R Commander is the subject of Chapter 7.

6Because Friendly and Franklin (1980) expected higher rates of recall in the Before and Meshed conditions than in the SFR condition, it would arguably be legitimate to halve the *p*-values for the corresponding pairwise comparisons.

7The same dialog can be employed for *higher-way analysis of variance* by selecting more than two factors.

8A technical note: Because the Adler data are *unbalanced*, with unequal cell counts, there is more than one way to perform the analysis of variance, and the *Multi-Way Analysis of Variance* dialog produces so-called “Type II” tests. Alternative tests are available via *Models > Hypothesis tests > ANOVA table*. For a discussion of this point, see Section 7.7.

9The tables of means and standard deviations duplicate those shown in Figure 5.5 on page 85.

10The group medians, also reported by the Kruskal–Wallis dialog, differ, however, depending on which response variable is used, although even here there is a fundamental invariance: The median of the logit-transformed data is the logit of the median (with slight slippage due to interpolation when—as in this example—a median is computed for an even number of values).