Output files from computer packages (e.g. MINITAB) typically have the extension *.lis, *.out, or *.log.

Links in this file take you directly to the specific data or data analysis example.

This file is cumulative; I'll add entries as we move to that material.

(Note: Additional examples not in electronic form will be introduced in lectures throughout the course. Additional data sets and output in electronic form for homework assignments, solutions, etc. will be described in those documents and posted to the course directory at the appropriate point in the course.)

NAME  DESCRIPTION

mlapair.dat  Paired pre-test post-test data example. Story from textbook (MM p. 517), EXAMPLE 8.3: "The National Endowment for the Humanities sponsors summer institutes to improve the skills of high school teachers of foreign languages. One such institute hosted 20 French teachers for 4 weeks. At the beginning of the period, the teachers were given the Modern Language Association's (MLA) listening test of understanding of spoken French. After 4 weeks of immersion in French in and out of class, the listening test was given again. (The actual spoken French in the two tests was different, so that taking the first test should not improve the score on the second test.) The maximum possible score on the test is 36."

mlapair.lis  Analysis of the paired pre-test post-test data example using Minitab.

mlasign.lis  Nonparametric analysis of the paired pre-test post-test data example via sign test procedures using Minitab.

smsg.dat  Used in Part I and in analysis of covariance. Data from a mathematics curriculum evaluation, circa 1961. The purpose of the large-scale study was to compare mathematics achievement in a traditional ninth-grade algebra course with that in an alternative course developed by the School Mathematics Study Group (SMSG). 43 teachers from schools across the US participated; by random assignment there were 21 SMSG (new math) classrooms and 22 traditional math classrooms. Columns c1 and c3 contain group indicator variables; c3 = 1 is an SMSG classroom and c3 = 0 is traditional. The post-instruction outcome measure (classroom average) on math achievement given at the end of the school year is in c2; this test was a traditional algebra test published by the Cooperative Test Division of Educational Testing Service. In c4 is a pre-instruction ("pre-test") measure of knowledge of number systems.

smsg.lis  Used in Part I review. Descriptive and inferential two-group comparisons for the outcome measure (c2) in smsg.dat.

drptwot.dat  Two group comparison example.
Story from textbook (MM p. 542), EXAMPLE 8.8: "An educator believes that new directed reading activities in the classroom will help elementary school pupils improve some aspects of their reading ability. She arranges for a third-grade class of 21 students to follow these activities for an 8-week period. A control classroom of 23 third graders follows the same curriculum without the activities. At the end of the 8 weeks, all students are given a Degree of Reading Power (DRP) test, which measures the aspects of reading ability that the treatment is designed to improve." Data are in unstacked form with treatment in C1 and control in C2.

drptwot.lis  Two-sample analysis of the two-group comparison data in drptwot.dat using Minitab.

alphatot.tab  Tabulation of total error rate probabilities for c inferences, each done at level alph: tot = 1 - (1 - alph)^c is solved for alph, giving alph = 1 - (1 - tot)^(1/c). Mathematica script appended.

harr.dat  Data obtained from the Hopkins & Glass textbook. Their description is: "Harrington (1968) experimented with the order of 'mental organizers' that structure the material for the learner. A group of 30 persons were randomly split into three groups of 10 each. Group I received organizing material before studying instructional material on mathematics; Group II received the 'organizer' after studying the mathematics; Group III received the math materials but no organizing materials. Scores are from a 10-item mathematics test on the instructional content." The data are in "unstacked" form in c1-c3.

harr.lis  One-way anova (MINITAB) on harr.dat.

harr1v.out  BMDP1V output for harr.dat using orthogonal contrasts.

hartukey.lis  Minitab implementation of Tukey post-hoc comparison procedures with the harr.dat data.

ibs.dat  Used in Part I.A.1. These are waiting-time data under three different protocols. Data are in stacked form.
The actual data are from Ott's text and are described as follows: "Irritable bowel syndrome (IBS) is a nonspecific intestinal disorder characterized by abdominal pain and irregular bowel habits. Each person in a random sample of 24 patients having periodic attacks of IBS was randomly assigned to one of three treatment groups. The number of hours of relief while on therapy is recorded for each patient." Outcome in c1, group indicator in c2.

ibsbmd7d.log  Part I.A.1. BMDP7D output for ibs.dat. Implements Levene's test. Implements two versions of one-way anova (Welch, Brown-Forsythe) that do not assume equal within-group variances.

ibslev.lis  Part I.A.1. Gives description of ibs.dat; implements in Minitab two forms of Levene's test for equal within-group variances.

ibstrans.log  Part I.A.1. Carries out (in MINITAB) natural log transformation of the ibs.dat outcome to stabilize variance. Compares anova on raw and transformed data.

clergy.lis  Part I.A.4. Illustration of Kruskal-Wallis test (in MINITAB), a non-parametric alternative to one-way anova. Comparison with standard anova on ranked data. Data taken from Ott text: "Three random samples of clergymen were drawn: one containing 10 Methodist ministers, the second containing 10 Catholic priests, the third containing 10 Pentecostal ministers. Each of the clergymen was examined with a test to measure his knowledge about causes of mental illness."

bakery.dat  3 x 2 fixed effects with 2 replications per cell. The Castle Bakery Company supplies wrapped Italian bread to a large number of supermarkets in a metropolitan area. An experimental study was made of the effects of height of the shelf display (factor A: bottom, middle, top in c2) and the width of the shelf display (factor B: regular, wide in c3) on sales of this bakery’s bread during the experimental period (c1, measured in cases). Twelve supermarkets, similar in terms of sales volume and clientele, were utilized in the study.
The six treatments were assigned at random to two stores each according to a completely randomized design, and the display of the bread in each store followed the treatment specifications for that store. Sales of the bread were recorded, and these results are presented in bakery.dat.

bakery.lis  Table of cell means and two-way fixed effects anova for bakery.dat.

integ.dat  2 x 2 fixed effects with 50 replications per cell. Data obtained from an early Minitab Handbook, which gives the following description: "A researcher at Columbia University was interested in the effect of school integration on racial attitudes. He gave an 'ethnocentrism' test to four groups of children: black children in a segregated school, white children in a segregated school, black children in an integrated school, and white children in an integrated school. 'Ethnocentrism' is defined as the tendency of children to prefer to associate with, and respect, other children of the same ethnic group to those of another ethnic group. Thus, students who score high on this test have a stronger preference for their own race." The data are in stacked form, with the test score in c1, school type in c2 (1 = integrated, 2 = segregated) and race in c3 (1 = black, 2 = caucasian).

integ.lis  Cell means and anova table (from MINITAB) for integ.dat.

scitest.dat  Data collected as part of a study designed to investigate the feasibility and technical quality of science performance assessments. Two tasks, called Radiation and Rate of Cooling, were developed from a common "task shell"; in other words, they were designed to be as parallel as possible in the science processes tested and in the format of stimulus materials and required response. They can be thought of as two sample tasks from a "universe" of similar, parallel tasks. The investigators treat task as a random factor because they could imagine creating additional tasks out of the task shell from which these two came.
This data set contains the scores of thirty students, assumed to be drawn at random from the population of students, each tested on both tasks. Three raters scored the responses; each paper was scored by two of the three raters. The students come from three different schools, ten from each. Scores are in C1, student ID is in C2, task (1 for Radiation, 2 for Rate of Cooling) is in C3, rater is in C4, and school is in C5.

scitest.lis  Minitab output from a 2-way random effects anova with the score on the science test as outcome and student and task as the two random factors. So the design is 30 x 2 with 2 replications per cell.

sunburn.dat  Two-way mixed example; taken from the Sunscreen example, Ott p. 770. A corporation is interested in comparing two different sunscreens (s1 and s2). A random sample of 10 females (ages 20-25 years) participated in the study. For each person two 1" x 1" squares were marked off on either side of the back, under the shoulder but above the small of the back. Sunscreen s1 was randomly assigned to the two squares on one side of the back, with s2 on the other two squares. Exposure to the sun was for a two-hour period. The outcome was change (postexposure minus preexposure) in a reading based on the color of skin in a square. So we have 10 levels of the random column factor, subjects, two levels of the fixed row factor, sunscreen, and two replications per cell. In file sunburn.dat we have the outcome measure in c1, the type of sunscreen (s1 = 1, s2 = 2) in c2, and the person (i.e. female tanning subject) in c3.

sunburn.lis  Minitab output for the mixed model analysis of the sunburn.dat data, a 2 x 10 design with 2 replications per cell.

unbalanc.dat  Data for a 2 x 3 fixed effects design, having between 1 and 3 replications per cell. The data are shown and described in Table 20.1 and Section 20.2 of NWK text.
The first part of this data file has the outcome measure (growth rate in response to therapy) in c4, the row factor (subject gender 1,2) in c1, the column factor (degree of depressed development; severe = 1, moderate = 2, mild = 3) in c2, and the replication indicator in c3. This data structure is set up for the GLM approach to the analysis of unbalanced designs. The second part of the data file is set up for the application of the approximate analysis based on cell means; cell means in c1, row factor in c2, column factor in c3.

unbalanc.log  Analyses of the data in unbalanc.dat. First is shown the GLM analysis (cf. MTB version 7 manual p. 8-27). Second, the approximate cell means analysis is constructed and then compared with GLM results.

stress.dat  Data are from a 2 x 2 x 2 fixed effects design with 3 replications per cell. Data are shown in Table 22.2 and described in Section 22.2 of NWK text. The outcome measure is exercise tolerance from a stress test in c1, with gender (male = 1, female = 2) in c2, body fat level (low = 1, high = 2) in c3 and smoking history (light = 1, heavy = 2) in c4.

stress.lis  Analysis of the 3-way design from stress.dat. Description using versions of MINITAB Table command along with Layout subcommand (cf. MTB version 7 manual pages 11-9, 11-12). Three-way analysis of variance using anova command.

*************************** PART II ********************************************
CORRELATION and REGRESSION

corr.dat  28 bivariate observations, test 1 in c1, test 2 in c2.

corr.out  Simple plotting, correlation, and straight-line regression analyses of corr.dat.

corrres.lis  Illustration of different types of residual scores using corr.dat data. See NWK text Chap 9 (esp Sec. 9.2).

predict.lis  Illustration of PREDICT subcommand (cf. MTB ver 7 manual 7-10,11) using corr.dat.

welfare.dat  Children's Welfare in California.
Data collected by the Oakland-based "Children Now" from government resources over the past four years to comprise a "year-in-the-life" composite index of children's welfare. Data are presented on a county-by-county basis.
c1: County ranking on Welfare index
c2: Median family income
c3: Median family income ranking

welfare.lis  Illustrates descriptive univariate analyses (stem-and-leaf etc.) and correlation and regression analyses and plots.

coleman.dat  Data from the Coleman report used to illustrate multiple regression. File coleman.dat contains data from a random sample of 20 schools (from the East) from the 1966 Coleman Report. The outcome measure C7 is the verbal mean test score for all sixth graders in the school. The predictor variables are: C2, staff salaries per pupil; C3, percent white collar fathers for the sixth graders; C4, an SES composite measure (deviation) for the sixth graders; C5, mean teacher's verbal test score; C6, 6th grade mean mother's educational level (1 unit = 2 school yrs).

bodyfat.dat  Data taken from NWK text, Table 8.1. Measurement data in which 3 relatively inexpensive methods of assessment are compared with the "gold standard" of accurate measurement. Description: "data for a study of the relation of the amount of body fat to several possible explanatory, independent variables, based on a sample of 20 healthy females 25-34 years old. The possible independent variables are triceps skinfold thickness, thigh circumference, and midarm circumference." c1 has triceps, c2 has thigh, c3 has midarm, c4 has amount of body fat.

bodyfat.out  Illustrates multiple regression procedures in NWK text Sec. xx, and residual diagnostics.

marks.dat  Used in Part II. Data from 17 students in a prior (many years ago) 2-qtr version of part of this course (i.e. Education 250A,B). c2 has the sum of the scores on the six graded homework assignments; c1 has the final exam for 250A, c3 has the midterm in 250B, and c4 has the outcome score, the final exam in 250B.
marks.log  Uses marks.dat to illustrate properties of multiple regression (and partial correlation) coefficients and diagnostics for same via the adjusted variables approach.

marksnew.log  Repeats and revises aspects of the marks.log analyses to match the partial regression slopes and plots approach in NWK Section 11.1.

nels.dat  Contains a subset of observations and variables from the public release data tape for the National Education Longitudinal Study of 1988 (NELS:88). The National Center for Education Statistics collected data from a representative sample of 8th-graders across the U.S. and followed these students through grades 10 and 12. At each grade, students took several achievement tests and completed surveys that included questions about their academic, family, and social lives. The nels.dat data set contains students' 10th-grade scores on the science achievement test, along with several variables that are hypothesized to be good predictors of 10th-grade science achievement. Student ID is in C1 and 10th-grade science score is in C2. Four achievement variables from 8th grade are included: science, reading, math knowledge, and math reasoning (C3-C6). The math knowledge and math reasoning scores are standardized (they have mean zero, variance one). Indicator variables are included for advanced "track" (i.e., high school program) and general track; each student receives a 1 on the variable if he or she is in that program and a 0 otherwise. Students in the academic track receive 0's on both variables. These are found in C7 and C8, respectively. In C9-C12 there are indicator variables for courses taken: biology or not in C9, chemistry or not in C10, earth science or not in C11, and general science or not in C12. C13 contains an indicator variable for gender: 1 for males, 0 for females. In C14-C16 are indicator variables for ethnicity: Asian or not in C14, African-American or not in C15, and Latino/Hispanic or not in C16.
Finally, C17 and C18 contain indicator variables for socio-economic status: lowest quartile or not in C17 and highest quartile or not in C18.

grow.dat  Data from the Berkeley Growth Study (Nancy Bayley). These data are for Child #8 in the BGS study with age in months in c2 (ranging from 1 to 60) and intellectual performance in C1.

grow.lis  Fitting a regression of score on age for grow.dat, using polynomial regression.

dummy.log  Single classification anova via regression with dummy (group membership) predictor variables. Uses smsg.dat and harr.dat.

ancova.log  Illustration of 2-group, pre-post analysis of covariance with data from smsg.dat. First the multiple regression approach is shown, followed by the MINITAB ancova routine for comparison.

ancvdrug.dat  Data taken from Ott's text to illustrate a 2-group, pre-post design. The description of these data is: "An investigator is interested in comparing two drug products (A and B) in overweight female volunteers. The experiment calls for 20 randomly selected subjects who are at least 25% overweight. Ten of these women are to be randomly assigned to product 1 and the remaining 10 to product 2. The response of interest is a score on a rating scale used to measure the mood of a subject. To obtain a score, a subject must complete a checklist indicating how each of 50 adjectives describes her mood at that time. On the study day, all 20 volunteers are required to complete the checklist at 8 AM. Then each subject is given the prescribed medication (product 1 or 2). Each subject is required to complete the checklist again at 10 AM." The 8 AM score is in c1, the 10 AM score in c2 and the group membership indicator (1 = product 1; 0 = product 2) in c3.

ancvdrug.lis  Description of 2-group pre-post data in ancvdrug.dat. Analysis of covariance is carried out with the multiple regression, dummy-variable approach and then compared with the MINITAB ancova command.

huitema.dat  Three groups, each of size 10, single outcome, 2 covariates.
Taken from the Huitema text with the description: "The investigator is concerned with the effects of three different types of study objectives on student achievement in freshman biology. The three types of objectives are: 1. General--students are told to know and understand everything in the text. 2. Specific--students are provided with a clear specification of the terms and concepts they are expected to master and of the testing format. 3. Specific with study time allocations--the amount of time that should be spent on each topic is provided in addition to specific objectives that describe the type of behavior expected on examinations. The dependent variable is the biology achievement test. A population of freshman students scheduled to enroll in biology is defined, and 30 students are randomly selected. The investigator obtains aptitude test scores and scores from an academic motivation test for all students before the investigator randomly assigns 10 students to each of the three treatments. Treatments are administered, and scores on the dependent variable are obtained for all students." In the data file, the dependent variable is in c1, aptitude test in c2, academic motivation in c3, and group membership variable (1,2,3) in c6. In c4-c5 are two 0,1 dummy variables that define the group membership in c6.

huitema.lis  Description of data in huitema.dat. Carries out ancova for the 3-group two-covariate design using MINITAB ancova and the multiple regression approach.

*************************** PART III ********************************************
BINARY and CATEGORICAL DATA

Binomial Distribution examples.

binchina.lis  You've just entered a class in ancient Chinese literature. You haven't even learned the alphabet yet but they've given you a pop quiz. You'll have to guess on every question. It's a multiple choice test, with each of the 20 questions having three possible answers. To pass, you must get at least 12 correct. What are the chances you'll pass?
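
The binchina.lis question is a binomial upper-tail probability. As a minimal sketch (in Python rather than the Minitab used for the course listings, and assuming the 20 guesses are independent, each with success probability 1/3), the chance of at least 12 correct can be summed directly. The function name binom_upper_tail is illustrative and does not come from the course files.

```python
from math import comb

def binom_upper_tail(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Assumed setup: 20 questions, 3 choices each, pure guessing, pass mark of 12.
p_pass = binom_upper_tail(n=20, p=1/3, k_min=12)
print(f"P(at least 12 correct) = {p_pass:.4f}")
```

Under these assumptions the probability is about 0.013, so passing by guessing alone is unlikely.
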
binfreet.lis  Rick is a basketball player who makes 75 percent of his free throws over the course of a season. In a key game Rick shoots 12 free throws and misses 5 of them. The fans think he failed because he was nervous. Is it unusual for Rick to perform this poorly?

binnorm.lis  Illustrations of normal approximations to the binomial.

binsign.lis  Sign test example from GH section 9.11; use of binomial probability.

Poisson Distribution examples.

poisson.lis  Illustration of Poisson distribution and binomial approximations for rare events.

draft.cnt  Draft lottery data from 1971. Rows are months Jan-Dec; columns give the number of days in each risk group: highest risk (numbers 1-122) in C1, numbers 123-244 in C2, and lowest risk (numbers 245-366) in C3.

draft.lis  Chi-square test for independence (fairness) for draft lottery data.

teacher1.dat  Part III. Source: U.S. Department of Education, National Center for Education Statistics, 1987-1988 Schools and Staffing Survey. Data: Willingness to become a teacher again for elementary and secondary school teachers. (Data + Output). This example illustrates cross-classified categorical data, a 2 x 5 table, and the chi-square test.

teacher2.dat  Part III. 1987-1988 Schools and Staffing Survey. Data: Gender distribution for teachers in elementary and secondary schools. (Data + Output). Illustrates a 2 x 2 table and the chi-square test.

Agresti Supplement  Tables from the Appendix of "An Introduction to Categorical Data Analysis," by Alan Agresti, published by John Wiley and Sons, Inc., January 1996. The tables show SAS code for the analyses conducted in that text, and contain the major data sets from that text.

Aspirin and MI  Data and SAS analysis for Aspirin Use and Myocardial Infarction, Agresti Section 2.2.2.

Lung Cancer  Data and SAS analysis for the Smoking and Lung Cancer example.

Tea Tasting  Data and SAS analysis for Fisher's Tea Tasting example; Fisher's Exact test, Agresti Section 2.6.1.

program.dat  Dichotomous outcome, single quantitative predictor.
From NWK supplement (or the NWK regression book), the description is: "A small-scale investigation was undertaken to study the effect of computer programming experience on ability to complete a complex programming task, including debugging, within a specified time. Twenty-five persons were selected for the study. They had varying amounts of programming experience (measured in months of experience). All persons were given the same programming task. The results are coded in binary fashion; if the task was completed successfully in the allotted time, it was scored 1, and if the task was not completed successfully, it was scored 0." Months of experience are in c1, and the binary outcome measure is in c2.

program.lis  Plots and description of program.dat. OLS and WLS fits of straight-line functional form. BMDPLR logistic regression fit (presented in class) compared with straight-line fit.

NEW! Minitab blog: binary logistic regression.

progsas.sas  Contains the SAS instructions to carry out a logistic regression for the data in program.dat.

progsas.lst  SAS output obtained from the command line statement "sas progsas" on an elaine. Contains the logistic regression parameter estimates and fits.

coupon.dat  Dichotomous outcome, single quantitative predictor (*with replication*). From NWK supplement (or the NWK regression book), the description is: "In a study of the effectiveness of coupons offering a price reduction on a given product, 1,000 homes were selected and a coupon and advertising material for the product were mailed to each. The coupons offered different price reductions (5, 10, 15, 20, and 30 cents), and 200 homes were assigned at random to each of the price reduction categories. The independent variable in this study is the amount of price reduction, and the dependent variable is a binary variable indicating whether or not the coupon was redeemed within a six-month period."
The price reduction is in c1, number of households (200) in c2, and number redeemed from the 200 households in c3.

coupon.lis  Logit transformation and OLS and WLS fits to coupon.dat.
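
The coupon.lis analysis rests on the logit transformation of the redemption proportions, followed by straight-line fits. A rough sketch of that idea follows, in Python rather than Minitab. The redeemed counts are made-up placeholders, not the values in coupon.dat, and the function names are hypothetical. The weights n*p*(1 - p) are the usual inverse-variance weights for an empirical logit; they are assumed here, not taken from the listing.

```python
from math import log

def empirical_logit(successes: int, n: int) -> float:
    """Logit of the observed proportion, ln(p / (1 - p)); undefined at p = 0 or 1."""
    p = successes / n
    return log(p / (1 - p))

def weighted_line(x, y, w):
    """Weighted least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Price reductions (cents) and group size come from the coupon.dat description.
reduction = [5, 10, 15, 20, 30]
n_homes = 200
# PLACEHOLDER counts for illustration only; the real ones are in c3 of coupon.dat.
redeemed = [20, 40, 60, 90, 130]

props = [r / n_homes for r in redeemed]
logits = [empirical_logit(r, n_homes) for r in redeemed]
wls_weights = [n_homes * p * (1 - p) for p in props]

ols = weighted_line(reduction, logits, [1.0] * len(reduction))  # equal weights = OLS
wls = weighted_line(reduction, logits, wls_weights)
print("OLS intercept, slope:", ols)
print("WLS intercept, slope:", wls)
```

With equal group sizes the weights differ only through p*(1 - p), so proportions near 0.5 count most in the WLS fit.
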