*.dat are ASCII data files. Output from computer packages (e.g. MINITAB) is typically *.lis, *.out, *.log. Links in this file take you directly to the specific data or data analysis example.
This file is cumulative; I'll add entries as we move to that material.

(Note: Additional examples not in electronic form will be introduced throughout the course during lectures. Also, additional data sets and output in electronic form for homework assignments, solutions, etcetera will be described in those documents and posted to the course directory at the appropriate point in the course.)

I. Design and Analysis of Comparative Studies (Experiments)

 NAME          DESCRIPTION

               Tabulation of total error rate
               probabilities for c inferences each done at level
               alph:  tot = 1 - (1 - alph)^c  is solved for
               alph. Mathematica script appended.
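
A minimal sketch of the tabulation described above, written in Python rather than the appended Mathematica script (which is not reproduced here). Solving tot = 1 - (1 - alph)^c for alph gives alph = 1 - (1 - tot)^(1/c); the formula treats the c inferences as independent. The function names are illustrative and do not come from the course materials.

```python
def total_error_rate(alph: float, c: int) -> float:
    """Total error rate for c independent inferences, each done at level alph."""
    return 1.0 - (1.0 - alph) ** c


def per_inference_level(tot: float, c: int) -> float:
    """Solve tot = 1 - (1 - alph)**c for alph."""
    if not 0.0 < tot < 1.0:
        raise ValueError("tot must lie strictly between 0 and 1")
    if c < 1:
        raise ValueError("c must be a positive integer")
    return 1.0 - (1.0 - tot) ** (1.0 / c)


if __name__ == "__main__":
    # Tabulate the per-inference level for a few values of c at tot = 0.05.
    for c in (1, 2, 3, 5, 10, 20):
        alph = per_inference_level(0.05, c)
        print(f"c = {c:2d}   alph = {alph:.5f}   check tot = {total_error_rate(alph, c):.5f}")
```

The per-inference level obtained this way is slightly larger than the simple division tot/c.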

counsel.dat    A 3 x 10 mixed model
               design with 2 replications per cell. Fixed factor
               in c1 has 3 levels, random factor in c2 has 10
               levels.  The fixed factor is 3 different
               methods/strategies of counseling, and the
               random factor represents 10 counselors sampled
               from a population of counselors.  Six clients
               from each counselor are divided amongst the 3
               counseling strategies.  The outcome measure in
               c3 is a self-report of neurotic symptoms.  The
               mixed model analysis (using MINITAB) is
               described in lecture.

harr.dat       Data obtained from the
               Hopkins&Glass textbook.  Their description is
               "Harrington (1968) experimented with the order
               of 'mental organizers' that structure the
               material for the learner.  A group of 30 persons
               were randomly split into three groups of 10 each.
               Group I received organizing material before
               studying instructional material on mathematics;
               Group II received the 'organizer' after studying
               the mathematics; Group III received the math
               materials but no organizing materials.  Scores
               are from a 10-item mathematics test on the
               instructional content."
               The data are in "unstacked" form in c1-c3.
harr.lis       One-way anova (MINITAB) on harr.dat.
harr1v.out     BMDP1V output for harr.dat using
               orthogonal contrasts.
hartukey.lis   Minitab implementation of Tukey post-hoc
               comparison procedures with the harr.dat data.
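
The entries in this file describe data as either "unstacked" (one column per group, as in harr.dat) or "stacked" (one outcome column plus indicator columns, as in integ.dat). A small sketch of the conversion between the two layouts follows. The scores are hypothetical placeholders, NOT the harr.dat values, and the function names are illustrative.

```python
def stack(columns):
    """Turn unstacked columns {group: [scores]} into stacked (score, group) rows."""
    rows = []
    for group, scores in columns.items():
        for score in scores:
            rows.append((score, group))
    return rows


def unstack(rows):
    """Turn stacked (score, group) rows back into {group: [scores]}."""
    columns = {}
    for score, group in rows:
        columns.setdefault(group, []).append(score)
    return columns


# Hypothetical scores for three groups (not the Harrington data).
unstacked = {1: [5, 7, 6], 2: [4, 6, 5], 3: [3, 2, 4]}
stacked = stack(unstacked)

if __name__ == "__main__":
    for score, group in stacked:
        print(score, group)
```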

integ.dat      2 x 2  fixed effects with 50
               replications per cell.  Data obtained from early
               Minitab Handbook which gives the following
               description: "A researcher at Columbia
               University was interested in the effect of
               school integration on racial attitudes.  He
               gave an 'ethnocentrism' test to four groups of
               children: black children in a segregated school,
               white children in a segregated school, black
               children in an integrated school, and white
               children in an integrated school. 'Ethnocentrism'
               is defined as the tendency of children to prefer
               to associate with, and respect, other children
               of the same ethnic group to those of another
               ethnic group.  Thus, students who score high on
               this test have a stronger preference for their
               own race."  The data are in stacked form,
               with the test score in c1, schooltype in c2
               (1 = integrated, 2 = segregated) and race in c3
               (1 = black, 2 = caucasian).
integ.lis      Cell means and anova table
               (from MINITAB) for integ.dat.

rand2way.dat   The data are from a 3 x 3
               design with 2 replications per cell.  A classic
               measurement study design, as in generalizability
               theory G-studies.  The actual data are from
               Ott's text, with the following (appetizing)
               description: "Consider an experiment to examine
               the effects of different analysts and subjects
               in chemical analyses for the DNA content of
               plaque.  Three female subjects (ages 18-20
               years) were chosen for the study.  Each subject
               was allowed to maintain her usual diet,
               supplemented with 30 mg of sucrose per day.  No
               toothbrushing or mouthwashing was allowed during
               the study.  At the end of the week, plaque was
               scraped from the entire dentition of each subject
               and was divided into six samples.
               Each of the analysts chosen at random made a DNA
               concentration determination on two samples for
               each subject.  Data are in units of 10
               micrograms."  The DNA concentrations are in c1,
               analysts in c2 (1,2,3), subjects in c3 (1,2,3).
rand2way.lis   Analysis of rand2way.dat using
               MINITAB.  Table of cell means, random effects
               anova including variance components estimation.
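
The variance components estimation mentioned for rand2way.lis can be sketched by equating the mean squares of a balanced two-way random effects design to their expectations and solving. The mean squares used below are hypothetical placeholders, NOT values from rand2way.lis, and the function name is illustrative.

```python
def variance_components(ms_a, ms_b, ms_ab, ms_e, a, b, n):
    """Method-of-moments estimates for a balanced a x b random effects design
    with n replications per cell.

    Expected mean squares assumed here:
      E(MSA)  = var_e + n*var_ab + n*b*var_a
      E(MSB)  = var_e + n*var_ab + n*a*var_b
      E(MSAB) = var_e + n*var_ab
      E(MSE)  = var_e
    Estimates can come out negative; they are returned unmodified.
    """
    return {
        "A": (ms_a - ms_ab) / (n * b),
        "B": (ms_b - ms_ab) / (n * a),
        "AB": (ms_ab - ms_e) / n,
        "error": ms_e,
    }


if __name__ == "__main__":
    # Hypothetical mean squares for a 3 x 3 design with 2 replications per cell.
    est = variance_components(ms_a=20.0, ms_b=14.0, ms_ab=2.0, ms_e=1.0, a=3, b=3, n=2)
    for name, value in est.items():
        print(f"{name:>5}: {value:.3f}")
```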

scitest.dat    Data collected as part of study designed
               to investigate the feasibility and technical
               quality of science performance assessments.  Two
               tasks, called Radiation and Rate of Cooling,
               were developed from a common "task shell";
               in other words, they were designed to be as
               parallel as possible in the science processes
               tested and in the format of stimulus materials
               and required response.  They can be thought of
               as two sample tasks from a "universe" of similar,
               parallel tasks.  The investigators treat task as
               a random factor because they could imagine
               creating additional tasks out of the task shell
               from which these two came.  This data set
               contains the scores of thirty students, assumed
               to be drawn at random from the population of
               students, each tested on both tasks.  Three
               raters scored the responses; each paper was
               scored by two of the three raters.  The students
               come from three different schools, ten from each.
               Scores are in C1, student ID is in C2, task (1
               for Radiation, 2 for Rate of Cooling) is in C3,
               rater is in C4, and school is in C5.
scitest.lis    Minitab output from a 2-way random effects anova
               with outcome the score on the science test, with
               the two random factors being student and task.
               So the design is 30x2 with 2 replications per
               cell.

smsg.dat       (Used in Part I review and analysis of covariance.)
               Data from a mathematics curriculum evaluation,
               circa 1961. Purpose of the large scale study was
               to compare mathematics achievement in a
               traditional ninth-grade algebra course with
               that in an alternative course developed by the
               School Mathematics Study Group (SMSG). 43
               teachers from schools across the US
               participated; by random assignment there were
               21 SMSG (new math) classrooms with 22 traditional
               math classrooms.
               Columns c1 and c3 contain group indicator
               variables; c3 = 1 is SMSG classroom and c3 = 0
               is traditional.
               The post-instruction outcome measure (classroom
               average) on math achievement given at the end of
               the school year is in c2; this test was a
               traditional algebra test published by the
               Cooperative Test Division of Educational Testing
               Service.
               In c4 is a pre-instruction ("pre-test") measure
               of knowledge of number systems.
smsg.lis       Used in Part I review.  Descriptive and
               inferential two-group comparisons for the outcome
               measure (c2) in smsg.dat.

sunburn.dat    Two-way mixed example; taken from Sunscreen ex.
               Ott p.770
               A corporation is interested in comparing two
               different sunscreens (s1 and s2).  A random
               sample of 10 females (ages 20-25 years)
               participated in the study.  For each person two
               1" x 1" squares were marked off on either side
               of the back, under the shoulder but above the
               small of the back.  Sunscreen s1 was randomly
               assigned to the two squares on one side of the
               back, with s2 on the other two squares. Exposure
               to the sun was for a two-hour period.
               The outcome was change (postexposure minus
               preexposure) in a reading based on the color of
               skin in a square.  So we have 10 levels of the
               random column factor subjects, two levels of the
               fixed row factor, sunscreen, and two replications
               per cell.  In file sunburn.dat we have the
               outcome measure in c1, the type of sunscreen
               (s1 = 1, s2 = 2) in c2, and the person (i.e. female
               tanning subject) in c3.
sunburn.lis    Minitab output for the
               mixed model analysis of the sunburn.dat data,
               a 2X10 design with 2 replications per cell.

unbalanc.dat   Data for a 2 x 3 fixed effects
               design, having between 1 and 3 replications per
               cell. The data are shown and described in Table
               20.1 and section 20.2 of our NWK text. The first
               part of this data file has the outcome measure
               (growth rate in response to therapy) in c4, the
               row factor (subject gender 1,2) in c1, the column
               factor (degree of depressed development;
               severe = 1, moderate = 2, mild = 3) in c2, and
               the replication indicator in c3.
               This data structure is set up for the GLM
               approach to the analysis of unbalanced designs.
               The second part of the data file is set up for
               the application of the approximate analysis based
               on cell means; cell means in c1, row factor in
               c2, column factor in c3.
               Analyses of the data in unbalanc.dat.
               First is shown the GLM analysis (cf. MTB version
               7 manual p. 8-27). Second the approximate cell
               means analysis is constructed and then compared
               with GLM results.

stress.dat     Data are from a 2x2x2 fixed
               effects design with 3 replications per cell.
               Data are shown in Table 22.2 and described in
               Section 22.2 of our NWK text.  The outcome
               measure is exercise tolerance from a stress
               test in c1, with gender (male = 1, female = 2)
               in c2, body fat level (low = 1, high = 2) in c3
               and smoking history (light = 1, heavy = 2) in c4.
stress.lis     Analysis of the 3-way design from
               stress.dat.  Description using versions of
               MINITAB Table command along with Layout
               subcommand (cf. MTB version 7 manual pages
               11-9,11-12).  Three-way analysis of variance
               using anova command.

***Randomized Blocks***

bhhtab71.dat   Data from a 5 x 4 randomized
               block design with 5 levels of the blocking
               variable and 4 levels of the (fixed) treatment
               variable.  One replication per cell.
               The data are from the Box, Hunter and Hunter
               text with the following description:
               "In this example a process for the manufacture of
               penicillin was being investigated, and the yield
               was the response of primary interest.  There were
               4 variants of the basic process to be studied.
               It was known that an important raw material, corn
               steep liquor, was quite variable.  Fortunately
               blends sufficient for four runs could be made,
               thus supplying the opportunity to run all 4
               treatments with each of the 5 blocks (blends of
               corn steep liquor).  The experiment was protected
               from extraneous unknown sources of bias by
               running the treatments in random order within
               each block."  The yield is in c1, block indicator
               in c2, and treatment indicator in c3.
bhhtab71.lis   Description and analysis of variance
               on randomized block design data in bhhtab71.dat.

dental.dat     Randomized block example, factorial treatment
               structure.  From NWK prob. DENTAL PAIN.
               The "learning statistics is like pulling teeth"
               analogy is irresistible.
               An anesthesiologist made a comparative study of
               the effects of acupuncture and codeine on
               postoperative dental pain in male subjects.  The
               four treatments were (1) placebo treatment--a
               sugar capsule and two inactive acupuncture
               points; (2) codeine treatment only--a codeine
               capsule and two inactive acupuncture points; (3)
               acupuncture only--a sugar capsule and two active
               acupuncture points; (4) both codeine and
               acupuncture. These 4 conditions have a 2x2
               factorial structure.
               Thirty-two subjects were grouped into 8 blocks
               of four according to an initial evaluation of
               their level of pain tolerance.  The subjects in
               each block were then randomly assigned to the 4
               treatments.  Pain relief scores were obtained 2
               hours after dental treatment.  Data were
               collected on a double-blind basis.  In file
               dental.dat c1 is pain relief score (higher
               means more pain relief), c2 is block, c3 is
               codeine, c4 is acupuncture; for c3 and c4, 1 = no.
dental.lis     Minitab analysis for randomized block design of

***Nested Designs***

training.dat   NESTED DESIGN,
               training school example, from NWK Chap 28.
               Description p.970:  A large manufacturing company
               operates 3 regional training schools for
               mechanics, one in each of its operating
               districts.  The schools have two instructors each
               who teach classes of about 15 mechanics in 3-week
               sessions.
               The company was concerned about the effect of
               School (factor A) and instructor (factor B) on
               the learning achieved.  To investigate these
               effects, classes in each district were formed in
               the usual way and then randomly assigned to one
               of the two instructors in the school [making
               class the "unit of analysis"].  This design was
               implemented for two 3-week sessions, and at the
               end of each session a suitable measure of
               learning for the class was obtained.
               Data are given in training.dat:  C1 has class
               learning score, C2 is School (1,2,3), C3 is
               instructor (1,2), and C4 is class (first or
               second 3-week period).
training.lis   Data analysis using Minitab for training.dat,
               including the nested design anova.

NWK 28.9 Cross-nested design ("three factor partially nested design")
Data for decision making example    
Minitab analysis for decision making example, NWK Fig 28.7

schoolcn.dat   Crossed and Nested factors--
               teaching methods, schools, teachers,
               students.  Can you work it out?

               This example is taken from a well-known
               educational statistics textbook: Hopkins&Glass.
               On the theme that you should rejoice that we use
               NWK instead, I found six (and counting) major errors
               in this text's exposition and solution for this single
               example.

               The example involves the comparison of 5 teaching
               methods.  Two Schools (considered to be sampled
               at random) each employ these five teaching
               methods--i.e. each of the 5 teaching methods
               appears with each of the two schools--5x2
               combinations.  Within *each* of the two schools,
               3 teachers are chosen at random, so we have three
               teachers chosen in School 1 and three different
               teachers chosen in School 2.  Each teacher employs
               each of the 5 teaching methods, and the outcome
               data are mastery scores (mastery or not) for
               three students for each teacher-method combination
               within each school. NWK calls this a
               partially nested or crossed-nested design:
               section 28.9 pp. 1149-1154 (minitab p1153)

               In file schoolcn.dat the columns are outcome; method;
               school; teacher(within school); student replication.

               Note that the outcome measure is 0/1 ; this text goes
               on to assert "Balanced anova designs have been shown
               to yield accurate results even with dichotomous
               dependent variables [refs]..." For the present we
               will take them at their word.

               In file schoolcn.sol we answer the following questions

               a.  Table the means for method crossed with school, and
               construct the corresponding profile plot.  Do there
               appear to be main effects or interactions?
               b.  Obtain means for each teacher(within school); do
               there appear to be teacher effects?
               c.  Construct an appropriate anova model and obtain
               the corresponding anova table for this design.
               d.  Carry out the series of statistical tests for the
               terms (effects) identified in your model in part c;
               state your results, being careful to control the
               overall Type I error rate.

schoolcn.sol   Analyses for
               cross-nested schoolcn example

***Repeated Measures***

drugrep.dat    Example from Winer Sec 4.3, Table 4.3-1
               A study of the effects of 4 drugs upon reaction time
               to a series of standardized tasks was undertaken
               with 5 subjects all of whom had been well-trained
               in these tasks.
               The 5 subjects are a random sample from a
               population of interest to the experimenter.  Each
               subject was observed under each of the drugs; the
               order in which the drugs were administered was
               randomized.  Time separation between doses was
               employed.  The outcomes (C1 in drugrep.dat) were
               mean reaction time on the series of standardized
               tasks; in drugrep.dat C2 (1,2,3,4,5) is the
               person and C3 (1,2,3,4) is the drug.
               The drug data comprise a oneway repeated measures
               classification with 4 levels representing the
               reaction times associated with 4 types of drug.
drugrep.lis    Minitab analysis for
               repeated measures design for drugrep.dat.

bloodflow.dat  bloodflow example  NWK sec 29.3

               Section 29.3, Two-Factor Experiments with Repeated Measures on Both Factors
               TABLE 29.7 Data for Blood Flow Example.

                                   Treatment
               Subject   A1B1   A1B2   A2B1   A2B2
                     1      2     10      9     25
                     2     -1      8      6     21
                     3      0     11      8     24
                   ...
                    10     -2     10     10     28
                    11      2      8     10     25
                    12     -1      8      6     23

               A clinician studied the effects of two drugs used either alone or
               together on the blood flow in human subjects. Twelve healthy
               middle-aged males participated in the study and they are viewed
               as a random sample from a relevant population of middle-aged
               males. The four treatments used in the study are defined as

               	A1B1	placebo (neither drug)
               	A1B2	drug B alone
               	A2B1	drug A alone
               	A2B2	both drugs A and B

               The 12 subjects received each of the four treatments in
               independently randomized orders. The response variable is the
               increase in blood flow from before to shortly after the
               administration of the treatment. The treatments were administered
               on successive days. This prevented any carryover effects because
               the effect of each drug is short-lived. The experiment was
               conducted in a double-blind fashion so that neither the physician
               nor the subject knew which treatment was administered when the
               change in blood flow was measured.

               Table 29.7 and bloodflow.dat contain the data for this study.
               A negative entry denotes a decrease in blood flow. Figure 29.5
               and bloodflow.lis contain the MINITAB output for the fit of
               repeated measures model (29.10). Included in the output are the
               expected mean squares for the specified ANOVA model. As explained
               in Chapter 28, each term in an expected mean square is
               represented in the MINITAB output by (1) the numeric code, in
               parentheses, for the variance of the model term, and (2) the
               preceding number which is the numerical multiple. When the model
               term is fixed, the letter Q is used in the printout

bloodflow.lis  Minitab analyses of bloodflow

Brogan Kutner Example     Pre-post Repeated Measures
Brogan Kutner Analyses    Minitab and SAS repeated measures analyses

shoes.dat      "It's gotta be the shoes"
               Athletic Shoe sales example from NWK Chap 29.4
               Between subjects factor, repeated measures on
               one-factor:  A national retail chain
               wanted to study the effects of two advertising
               campaigns (factor A) on the sales of athletic
               shoes over time (factor B).  Ten similar test
               markets (subjects S) were randomly chosen to
               participate in the study (each campaign used in
               5 of these markets). Sales data (c1 in shoes.dat)
               were collected for 3 two-week periods (two weeks
               prior to campaign, two-weeks during, two weeks
               after; coded 1,2,3 in c3 in shoes.dat).  In
               shoes.dat c2 indicates the ad campaign (1,2)
               and c4 indicates test market site (1,2,3,4,5).
shoes.lis      The minitab analysis replicates NWK
               'sales' is the outcome measure, 'ad' is type of
               advertising campaign; 'time' is the repeated
               measures factor; and 'subj' is test market site.


corr.dat       28 bivariate observations,
               test 1 in c1, test 2 in c2.
corr.out       Simple plotting,
               correlation, and straight-line regression
               analyses of corr.dat.
corrres.lis    Illustration of different types of
               residual scores using corr.dat data.
               See NWK text Chap 9 (esp Sec. 9.2).
predict.lis    Illustration of PREDICT subcommand
               (cf. MTB ver 7 manual 7-10,11) using corr.dat.

welfare.dat    Children's Welfare in California.
               Data collected by the Oakland-based
               "Children Now" from government resources over the
               past four years to comprise a "year-in-the-life"
               composite index of children's welfare.  Data are
               presented on a county-by-county basis.
               c1: County ranking on Welfare index
               c2: Median family income
               c3: Median family income ranking
welfare.lis    Illustrates descriptive univariate analyses
               (stem-and-leaf etc) and correlation and
               regression analyses and plots.

coleman.dat    Data from the Coleman report used
               to illustrate multiple regression.
               File coleman.dat contains data from a random
               sample of 20 schools (from the East) from the
               1966 Coleman Report.
               The outcome measure C7 is the verbal mean test
               score for all sixth graders in the school.  The
               predictor variables are:  C2, staff salaries
               per pupil; C3, percent white collar fathers for
               the sixth graders; C4, an SES composite measure
               (deviation) for the sixth graders; C5, mean
               teacher's verbal test score; C6, 6th grade mean
               mother's educational level (1 unit = 2 school yrs).

bodyfat.dat    Data taken from NWK text,
               Table 8.1.  Measurement data in which 3
               relatively inexpensive methods of assessment
               are compared with the "gold standard" of
               accurate measurement.
               Description: "data for a study of the relation of
               the amount of body fat to several possible
               explanatory, independent variables, based on a
               sample of 20 healthy females 25-34 years old.
               The possible independent variables are triceps
               skinfold thickness, thigh circumference, and
               midarm circumference."
               c1 has triceps, c2 has thigh, c3 has midarm, c4
               has amount of body fat.
bodyfat.out    Illustrates multiple regression
               procedures in NWK text Sec. xx, and residual

marks.log      Uses marks.dat to illustrate
               properties of multiple regression (and partial
               correlation) coefficients and diagnostics for
               same via adjusted variables approach.
marksnew.log   Repeats, revises aspects of the
               marks.log analyses to match partial regression
               slopes and plots approach in NWK Section 11.1.

nels.dat       Contains a subset of observations and variables
               from the public release data tape for National
               Educational Longitudinal Study of 1988 (NELS:88).
               The National Center for Education Statistics
               collected data from a representative sample of
               8th-graders across the U.S. and followed these
               students through grades 10 and 12.  At each
               grade, students took several achievement tests
               and completed surveys that included questions
               about their academic, family, and social lives.
               The nels.dat data set contains students'
               10th-grade scores on the science achievement
               test, along with several variables that are
               hypothesized to be good predictors of 10th-grade
               science achievement.
               Student ID is in C1 and 10th-grade science score
               is in C2.  Four achievement variables from 8th
               grade are included:  science, reading, math
               knowledge, and math reasoning (C3-C6).  The
               math knowledge and math reasoning scores are
               standardized (they have mean zero, variance
               one).  Indicator variables are included for
               advanced "track" (i.e., high school program) and
               general track; each student receives a 1 on the
               variable if he or she is in that program and a
               0 otherwise.  Students in the academic track
               receive 0's on both variables.  These are found
               in C7 and C8, respectively.  In C9-C12 there are
               indicator variables for courses taken - biology
               or not in C9, chemistry or not in C10, earth
               science or not in C11, and general science or
               not in C12.  C13 contains an indicator variable
               for gender:  1 for males, 0 for females.  In
               C14-C16 are indicator variables for ethnicity:
               Asian or not in C14, African-American or not
               in C15, and Latino/Hispanic or not in C16.
               Finally, C17 and C18 contain indicator variables
               for socio-economic status:  Lowest quartile or
               not in C17 and highest quartile or not in C18.

grow.dat       Data from the Berkeley Growth Study
               (Nancy Bayley).  These data are for Child
               #8 in the BGS study with age in months in c2
               (ranging from 1 to 60) and intellectual
               performance in C1.
grow.lis       Fitting a score on age regression
               for grow.dat, using polynomial regression.

                         SPRING QTR
dummy.log      Single classification anova via
               regression with dummy (group membership)
               predictor variables. Uses smsg.dat and harr.dat

dum2way.dat    The response data in c1 are obtained from the
               following 2x3 design.  An experiment was
               conducted to examine the effects of different
               levels of reinforcement and different levels of
               isolation on children's ability to recall. A
               single analyst was to work with a random sample
               of 36 children selected from a relatively
               homogeneous group of fourth-grade students. Two
               levels of reinforcement (none and verbal) and
               three levels of isolation (20, 40, and 60
               minutes) were to be used.
               Students were randomly assigned to the six
               treatment groups, with a total of six students
               being assigned to each group.  Each student was
               to spend a 30-minute session with the analyst.
               During this time the student was to memorize a
               specific passage, with reinforcement provided
               as dictated by the group to which the student
               was assigned.  Following the 30-minute session,
               the student was isolated for the time specified
               for his or her group and then tested for recall
               of the memorized passage.
               These data appear in the accompanying table.

                                    Time of Isolation (Minutes)
                  Level of
               Reinforcement       20          40          60

                                26  19     30   36      6   10
                   None         23  18     25   28     11   14
                                28  25     27   24     17   19

                                15  16     24   26     31   38
                 Verbal         24  22     29   27     29   34
                                25  21     23   21     35   30

               Clearly, both factors are fixed factors. In this
               data file the responses above are in c1 with row
               (1,2) in c2 and column (1,2,3) in c3.  In c10-c14
               are the dummy (0,1) codings for the regression
               version of a two-way anova.
dum2way.lis    Constructs the dummy variables in
               dum2way.dat. Carries out regression and GLM
               analyses of the 2x3 fixed effects design.
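
As an illustration of the regression version of the two-way anova, the sketch below stacks the 36 recall scores from the table above and builds five (0,1) dummy predictors: one for row, two for column, and two row-by-column products. The choice of reference cell (row 2, column 3) is an assumption made for this sketch and may differ from the coding stored in c10-c14 of dum2way.dat. Because the model with all five dummies is saturated, its least-squares fitted values are the six cell means, so the coefficients can be written directly as differences of cell means.

```python
# Recall scores from the table, keyed by (row, column):
# row 1 = no reinforcement, row 2 = verbal; column 1, 2, 3 = 20, 40, 60 minutes.
cells = {
    (1, 1): [26, 19, 23, 18, 28, 25],
    (1, 2): [30, 36, 25, 28, 27, 24],
    (1, 3): [6, 10, 11, 14, 17, 19],
    (2, 1): [15, 16, 24, 22, 25, 21],
    (2, 2): [24, 26, 29, 27, 23, 21],
    (2, 3): [31, 38, 29, 34, 35, 30],
}


def dummy_code(row, col):
    """Five 0/1 predictors; reference cell (row 2, column 3) is an assumed choice."""
    r1 = 1 if row == 1 else 0
    k1 = 1 if col == 1 else 0
    k2 = 1 if col == 2 else 0
    return [r1, k1, k2, r1 * k1, r1 * k2]


# Stacked layout: one record per child (response, row, column, dummies).
stacked = [(y, r, c, dummy_code(r, c)) for (r, c), ys in cells.items() for y in ys]

means = {rc: sum(ys) / len(ys) for rc, ys in cells.items()}

# Coefficients of the saturated dummy-variable model, as functions of cell means.
b0 = means[(2, 3)]                                                   # reference cell
coef = [
    means[(1, 3)] - means[(2, 3)],                                   # r1
    means[(2, 1)] - means[(2, 3)],                                   # k1
    means[(2, 2)] - means[(2, 3)],                                   # k2
    means[(1, 1)] - means[(1, 3)] - means[(2, 1)] + means[(2, 3)],   # r1*k1
    means[(1, 2)] - means[(1, 3)] - means[(2, 2)] + means[(2, 3)],   # r1*k2
]


def fitted(row, col):
    """Fitted value from the dummy-variable regression for a given cell."""
    return b0 + sum(b * x for b, x in zip(coef, dummy_code(row, col)))


if __name__ == "__main__":
    for rc in sorted(means):
        print(rc, dummy_code(*rc), round(means[rc], 3), round(fitted(*rc), 3))
```

The two product coefficients carry the interaction: if both were zero, the row effect would be the same at every isolation time.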

               Illustration of 2-group, pre-post analysis of
               covariance with data from smsg.dat.  First the
               multiple regression approach is shown,
               followed by the MINITAB ancova routine
               for comparison.

ancvdrug.dat    Data taken from Ott's text
               to illustrate a 2-group, pre-post design.  The
               description of these data is: "An investigator is
               interested in comparing two drug products (A and
               B) in overweight female volunteers.  The
               experiment calls for 20 randomly selected
               subjects who are at least 25% overweight.  Ten
               of these women are to be randomly assigned to
               product 1 and the remaining 10 to product 2.
               The response of interest is a score on a rating
               scale used to measure the mood of a subject.  To
               obtain a score, a subject must complete a
               checklist indicating how each of 50 adjectives
               describes her mood at that time.
               On the study day, all 20 volunteers are required
               to complete the checklist at 8 AM.  Then each
               subject is given the prescribed medication
               (product 1 or 2). Each subject is required to
               complete the checklist again at 10 AM."  The 8 AM
               score is in c1, the 10 AM score
               in c2 and the group membership indicator
               (1 = product 1; 0 = product 2) in c3.
               Description of 2-group pre-post data in
               ancvdrug.dat.  Analysis of covariance is carried
               out with the multiple regression (dummy-variable)
               approach and then compared with MINITAB ancova.

               Three groups, each of size 10,
               single outcome, 2 covariates.  Taken from the
               Huitema text with the description: "The
               investigator is concerned with the effects of
               three different types of study objectives on
               student achievement in freshman biology. The
               three types of objectives are:
               1. General--students are told to know and
               understand everything in the text.
               2. Specific--students are provided with a clear
               specification of the terms and concepts they are
               expected to master and of the testing format.
               3. Specific with study time allocations--the
               amount of time that should be spent on each
               topic is provided in addition to specific
               objectives that describe the type
               of behavior expected on examinations.
               The dependent variable is the biology
               achievement test.
               A population of freshman students scheduled to
               enroll in biology is defined, and 30 students
               are randomly selected.  The investigator obtains
               aptitude test scores and scores from an academic
               motivation test for all students before the
               investigator randomly assigns 10 students to each
               of the three treatments.  Treatments are
               administered, and scores on the dependent
               variable are obtained for all students."
               In the data file, the dependent variable is in
               c1, aptitude test in c2, academic motivation in
               c3, and group membership variable (1,2,3) in
               c6.  In c4-c5 are two 0,1 dummy variables that
               define the group membership in c6.
huitema.lis    Description of data in huitema.dat.
               Carries out ancova for the 3-group two-covariate
               design using MINITAB ancova and multiple
               regression approach.

               2-group data (10 cases per group)
               with single outcome and single covariate taken
               from Rogosa (1980).  Outcome in c1, covariate
               in c2, group membership (1,0) in c3.
cnrl.lis       Description of cnrl.dat.
               Carries out computations needed for Comparing
               Nonparallel Regression Lines procedures.

nwkt12p1.dat   Data from NWK text,
               now Chapter 8, formerly Table 12.1.
              "A hospital surgical unit was interested in
               predicting survival in patients undergoing a
               particular type of liver operation.  A random
               selection of 54 patients was available for
               analysis.  From each patient record, the
               following information was extracted from the
               preoperation evaluation: blood clotting score,
               prognostic index, enzyme function test, liver
               function test.  The dependent variable is
               survival time."
               Blood clotting score is in c1, prognostic index
               in c2, enzyme function test in c3, liver
               function test in c4, survival time in c5 and
               log10 of survival time in c6.
stepw.lis      Uses nwkt12p1.dat to illustrate
               stepwise regression variable selection procedures
               (forward stepwise, backward elimination).
               Reproduces results in NWK.
breg.lis       Uses nwkt12p1.dat to illustrate
               "best subsets" variable selection procedure
               (using breg in MINITAB).  Reproduces results in
               NWK.

               Data from 18 students in the
               prior (many years ago) 2-quarter version of part
               of this course (i.e. Education 250A,B). c1-c6
               are the scores on the six graded homework
               assignments; c7 has the final exam for 250A,
               c8 has the midterm in 250B, and c9 has the
               outcome score, the final exam in 250B.
pca257.lis     Uses composite construction
               and principal components (using MINITAB pca) to
               examine data reduction procedures for the
               predictors in pcamarks.lis.

Path Analysis  First path analysis example from 
               lecture: 4 variables (SES, IQ, nAch, GPA).
Path Analysis  Second path analysis example from 
               lecture: three longitudinal observations.

      ************** PART III  ANALYSIS OF CATEGORICAL DATA ****************

Agresti Supplement  Tables from the Appendix
               of "An Introduction to Categorical Data Analysis,"
               by Alan Agresti, published by John Wiley and Sons, Inc.,
               January 1996.  The tables show SAS code for the analyses
               conducted in that text, and contain the major data sets
               from that text.

Exact Confidence Interval for Proportion    SAS implementation in PROC FREQ
               of the exact confidence interval for a proportion; see
               Agresti Ch. 1 (also Mathematica handout).
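
               For orientation, a rough Python sketch of one common
               "exact" interval, the Clopper-Pearson interval obtained
               by inverting the two binomial tail probabilities.
               Whether this matches the linked SAS run option for
               option is an assumption; the function names and the
               bisection solver are illustrative only.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def root_increasing(g, lo=0.0, hi=1.0, iters=60):
    """Bisection root of an increasing function g with g(lo) < 0 < g(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson interval for a proportion with x successes in n trials."""
    if x == 0:
        lower = 0.0
    else:
        # Smallest p with P(X >= x | p) = alpha / 2.
        lower = root_increasing(lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    if x == n:
        upper = 1.0
    else:
        # Largest p with P(X <= x | p) = alpha / 2.
        upper = root_increasing(lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


print(exact_ci(0, 10))   # upper limit solves (1 - p)^10 = 0.025
print(exact_ci(5, 20))
```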

Generalized Linear Models: Logistic and Poisson Regression

coupon.dat     Dichotomous outcome, single
               quantitative predictor (*with replication*).
               From NWK supplement (or the NWK regression book),
               the description is:
              "In a study of the effectiveness of coupons
               offering a price reduction on a given product,
               1,000 homes were selected and a coupon and
               advertising material for the product were mailed
               to each.  The coupons offered different price
               reductions (5,10,15,20, and 30 cents), and 200
               homes were assigned at random to each of the
               price reduction categories.  The independent
               variable in this study is the amount of price
               reduction, and the dependent variable is a binary
               variable indicating whether or not the coupon
               was redeemed within a six-month period."
               The price reduction is in c1, number of
               households (200) in c2, and number redeemed from
               the 200 households in c3.
coupon.lis     Logit transformation and
               OLS and WLS fits to coupon.dat.
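
               A small Python sketch of the logit-then-regress idea
               for grouped binary data: transform each observed
               proportion to a logit, then fit a straight line by OLS
               or by WLS with weights n*p*(1-p).  The redeemed counts
               below are made up for illustration and are NOT the
               contents of coupon.dat; the function names are
               hypothetical.

```python
from math import log


def empirical_logit(successes, n):
    """Logit of the observed proportion successes / n."""
    p = successes / n
    return log(p / (1 - p))


def logit_weight(successes, n):
    """Approximate inverse variance of the empirical logit: n * p * (1 - p)."""
    p = successes / n
    return n * p * (1 - p)


def wls_line(x, y, w):
    """Weighted least squares intercept and slope (w all 1 gives OLS)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


price_cut = [5, 10, 15, 20, 30]          # cents, as in the description
households = [200] * 5                   # 200 homes per price reduction
redeemed = [28, 50, 74, 100, 140]        # HYPOTHETICAL counts, illustration only

logits = [empirical_logit(r, n) for r, n in zip(redeemed, households)]
weights = [logit_weight(r, n) for r, n in zip(redeemed, households)]

print("OLS:", wls_line(price_cut, logits, [1.0] * len(logits)))
print("WLS:", wls_line(price_cut, logits, weights))
```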

program.dat    Dichotomous outcome, single
               quantitative predictor.  From NWK supplement
               (or the NWK regression book), the description is:
              "A small-scale investigation was undertaken to
               study the effect of computer programming
               experience on ability to complete a complex
               programming task, including debugging, within
               a specified time.
               Twenty-five persons were selected for the study.
               They had varying amounts of programming
               experience (measured in months of experience).
               All persons were given the same programming task.
               The results are coded in binary fashion; if the
               task was completed successfully in the allotted
               time, it was scored 1, and if the task was not
               completed successfully, it was scored 0."
               Months of experience are in c1, and the binary
               outcome measure is in c2.
program.lis    Plots and description
               of program.dat. OLS and WLS fits of straight-line
               functional form.
               BMDPLR logistic regression fit (presented in
               class) compared with straight-line fit.
               Minitab blog: binary logistic regression.
               The SAS command file contains the SAS instructions to carry out
               a logistic regression for the data in program.dat.
progsas.lst    SAS output obtained from the command
               line statement: "sas progsas" on an elaine.
               Contains the logistic regression parameter
               estimates and fits.

disease.dat    Data set, 98 cases, shown
               in NWK Table 14.3 and App. C.3.  In disease.dat
               C1 is Age, C2 and C3 the SES indicators (see p. 582),
               C4 City Sector, C5 disease status.
diseaseselect.lis   Comparison of various logistic
               regression models for variable selection on the
               disease data, following NWK Sec. 14.5.

Poisson Regression    Construction of artificial data
               using Minitab, and SAS analysis using PROC GENMOD.

Contingency Tables and Log-linear Models

draft.cnt      Draft lottery data from 1971. Rows are
               months Jan-Dec; columns give the number of days
               in each month with highest-risk numbers (1-122)
               in C1, numbers 123-244 in C2, and lowest-risk
               numbers (245-366) in C3.
draft.lis      Chi-square test for independence
               (fairness) for draft lottery data.

Aspirin and MI    Data and SAS analysis for Aspirin Use
               and Myocardial Infarction, Agresti Section 2.2.2

Lung Cancer    Data and SAS analysis for Smoking and
               Lung Cancer example

Tea Tasting    Data and SAS analysis for Fisher's Tea Tasting
               example; Fisher's exact test, Agresti Section 2.6.1
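
               A short Python sketch of Fisher's exact test via the
               hypergeometric distribution.  The 2x2 table used (3 of
               4 cups guessed correctly in each row) is the classic
               tea-tasting layout and is assumed here, not read from
               the posted data file.

```python
from math import comb


def fisher_exact(table):
    """One-sided (greater) and two-sided exact p-values for a 2x2 table."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    k_min = max(0, row1 + col1 - n)
    k_max = min(row1, col1)
    # Hypergeometric probability of each possible count in cell (1,1),
    # conditional on the observed margins.
    probs = {
        k: comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
        for k in range(k_min, k_max + 1)
    }
    one_sided = sum(p for k, p in probs.items() if k >= a)
    # Two-sided: sum the tables no more probable than the one observed.
    two_sided = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    return one_sided, two_sided


tea = [[3, 1], [1, 3]]   # assumed classic layout: rows = guess, columns = truth
p_one, p_two = fisher_exact(tea)
print(round(p_one, 4), round(p_two, 4))
```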

Bayes Rule and Conditional Probability: At-risk Students example 

Matched Pairs, McNemar's test    Data and SAS analysis using PROC FREQ 
               of matched pairs data. Example is approval rating 
               (approve/disapprove) data from 1600 individuals at 
               two times.  See Agresti Ch. 9.
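
               A minimal Python sketch of McNemar's test, which uses
               only the two discordant counts of the matched-pairs
               table.  The counts 150 (approve, then disapprove) and
               86 (disapprove, then approve) are an assumption quoted
               from memory of Agresti's table; check them against the
               posted data before relying on the numbers.

```python
from math import erfc, sqrt


def mcnemar(n12, n21):
    """z statistic, chi-square (1 df), and two-sided normal p-value."""
    z = (n12 - n21) / sqrt(n12 + n21)
    chi2 = z * z
    p_value = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, chi2, p_value


# ASSUMED discordant counts (see the note above).
z, chi2, p_value = mcnemar(150, 86)
print(f"z = {z:.3f}, chi-square = {chi2:.3f}, p = {p_value:.2g}")
```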

CMH analysis Ex   Data and SAS file for the CMH
               (Cochran-Mantel-Haenszel) analysis of the Chinese
               smoking meta-analysis data in Agresti Table 3.3

CMH analysis for Migraine Ex   Data and SAS analysis:
               Cochran-Mantel-Haenszel statistics for Migraine Ex.
               2x2 factorial design--Gender by Treatment (Active, Placebo)
               with binary outcome Improve (Better, Same).

Belief in Afterlife Ex    Data and SAS GENMOD analysis of the loglinear model
               for the Agresti Ch. 2 2x2 example.

Death Penalty Ex    Cross-classification Tables for Death Penalty Data.
               Illustration of Simpson's Paradox. 2x2x2 Table: Death Penalty
               dp (yes/no); Defendant Race defr, Victim Race victr,
               (coded white=1, black=2)

PROC GENMOD code for Migraine Ex   SAS run file for all partial and saturated
               log-linear models for Migraine Ex.
               2x2 factorial design--Gender by Treatment (Active, Placebo)
               with binary outcome Improve (Better, Same).
PROC GENMOD output for Migraine Ex   Resulting SAS output for all partial and
               saturated log-linear models for Migraine Ex.
Selected models for Migraine Ex   SAS code and output for selected log-linear models
               for Migraine Ex. Subset of examples above.

Alcohol, Cigarette, and Marijuana Use Example
Agresti Table 6.3 A survey conducted in 1992 by the
Wright State University School of Medicine and the
United Health Services in Dayton, Ohio. Among other
things, the survey asked students in their final year
of high school in a nonurban area near Dayton, Ohio
whether they had ever used alcohol, cigarettes, or marijuana.
Denote the variables in this 2 X 2 X 2 table by A for alcohol use,
C for cigarette use, and M for marijuana use.
Table 6.3       Alcohol (A), Cigarette (C), and Marijuana (M) Use
                 for High School Seniors
                                  Marijuana Use
        Alcohol     Cigarette
        Use         Use           Yes   No
        Yes         Yes           911   538
                    No             44   456
        No          Yes             3    43
                    No              2   279

PROC GENMOD code for A C M Ex   SAS run file for all partial and saturated
               log-linear models for A C M Ex.
PROC GENMOD output for A C M Ex   Resulting SAS output for all partial and
               saturated log-linear models for A C M Ex.
Drugs AC AM CM    SAS GENMOD analysis for best loglinear model
               (AM, AC, CM) shown in Agresti Table 6.7
Drugs AM CM    SAS analysis for (poor-fitting) loglinear model
               (AM, CM) shown in Agresti Table 6.7
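
               As a cross-check on the GENMOD runs, a rough Python
               sketch that fits these two loglinear models to the
               Table 6.3 counts by iterative proportional fitting
               (IPF) and reports the likelihood-ratio statistic G^2.
               IPF is swapped in here for illustration; it is not the
               algorithm PROC GENMOD uses, and the function names are
               hypothetical.

```python
from math import log

# Table 6.3 counts indexed by (A, C, M), with 0 = yes and 1 = no.
counts = {
    (0, 0, 0): 911, (0, 0, 1): 538,
    (0, 1, 0): 44,  (0, 1, 1): 456,
    (1, 0, 0): 3,   (1, 0, 1): 43,
    (1, 1, 0): 2,   (1, 1, 1): 279,
}


def margin(table, axes):
    """Collapse a table onto the given axes (positions in the cell key)."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[i] for i in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def ipf(observed, margins, iters=100):
    """Fitted counts matching each observed margin listed in `margins`."""
    fit = {cell: 1.0 for cell in observed}
    for _ in range(iters):
        for axes in margins:
            obs_m = margin(observed, axes)
            fit_m = margin(fit, axes)
            for cell in fit:
                key = tuple(cell[i] for i in axes)
                fit[cell] *= obs_m[key] / fit_m[key]
    return fit


def g2(observed, fit):
    """Likelihood-ratio statistic 2 * sum(obs * log(obs / fit))."""
    return 2 * sum(o * log(o / fit[c]) for c, o in observed.items() if o > 0)


A, C, M = 0, 1, 2
fit_all_pairs = ipf(counts, [(A, C), (A, M), (C, M)])   # model (AC, AM, CM)
fit_am_cm = ipf(counts, [(A, M), (C, M)])               # model (AM, CM)

print("G2 (AC, AM, CM):", round(g2(counts, fit_all_pairs), 3))
print("G2 (AM, CM):    ", round(g2(counts, fit_am_cm), 1))
```

               The first model should show a very small G^2 and the
               second a very large one, in line with the "best" and
               "poor-fitting" labels above.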

Trend in 2xC tables    Agresti Section 2.5.2, Alcohol and Infant Malformation example.
               Table 2.7 refers to a prospective study of maternal drinking and
               congenital malformations. After the first three months of pregnancy, 
               the women in the sample completed a questionnaire about alcohol
               consumption. Following childbirth, observations were recorded 
               on presence or absence of congenital sex organ malformations. 
           Table 2.7 Infant Malformation and Mother's Alcohol Consumption

        Alcohol          Malformation          Percentage
        Consumption     Absent  Present Total   Present
        0               17,066  48      17,114  0.28
        <1              14,464  38      14,502  0.26
        1-2             788     5       793     0.63
        3-5             126     1       127     0.79
        >=6             37      1       38      2.63
           Source: B. I. Graubard and E. L. Korn, Biometrics 43:471-476 (1987).

               SAS output illustrates Cochran-Armitage test for trend.
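
               A minimal Python sketch of the linear trend statistic
               M^2 = (n - 1) r^2 for Table 2.7, where r is the
               correlation between the row score and the 0/1
               malformation indicator.  The row scores 0, 0.5, 1.5,
               4, 7 are assumed midpoint-style scores, not part of
               the table; the Cochran-Armitage statistic is the same
               calculation with n in place of n - 1, a negligible
               difference at this sample size.

```python
from math import erfc, sqrt

scores = [0.0, 0.5, 1.5, 4.0, 7.0]        # ASSUMED scores for the 5 rows
absent = [17066, 14464, 788, 126, 37]
present = [48, 38, 5, 1, 1]


def trend_test(scores, absent, present):
    """Correlation r, M^2 = (n - 1) r^2, and its chi-square (1 df) p-value."""
    totals = [a + p for a, p in zip(absent, present)]
    n = sum(totals)
    xbar = sum(s * t for s, t in zip(scores, totals)) / n
    ybar = sum(present) / n
    sxx = sum(t * (s - xbar) ** 2 for s, t in zip(scores, totals))
    syy = n * ybar * (1 - ybar)            # sum of squares of a 0/1 variable
    sxy = sum(p * (s - xbar) for s, p in zip(scores, present))
    r = sxy / sqrt(sxx * syy)
    m2 = (n - 1) * r * r
    p_value = erfc(sqrt(m2 / 2))           # upper tail of chi-square, 1 df
    return r, m2, p_value


r, m2, p_value = trend_test(scores, absent, present)
print(f"r = {r:.4f}, M^2 = {m2:.2f}, p = {p_value:.3f}")
```

               Other score choices (for example 1 through 5) give a
               different statistic, so the result should be read
               alongside the scores used in the SAS output.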

Linear Association Models for Ordinal Data    Agresti, Chapter 7.  Data from
               the 1991 General Social Survey, illustrates the 
               inadequacy of ordinary loglinear models for analyzing 
               ordinal data. Subjects were asked their opinion about a man 
               and woman having sex relations before marriage, with possible 
               responses “always wrong,” “almost always wrong,” “wrong only sometimes,” 
               and “not wrong at all.” They were also asked if they “strongly disagree,”
               “disagree,” “agree,” or “strongly agree” that methods of birth control
               should be made available to teenagers between the ages of 14 and 16. 
               Both classifications have ordered categories.
               SAS analysis using PROC GENMOD compares the independence model and
               the linear association model.