Education 351B  Autumn 2013
    Statistical issues in testing and assessment

David Rogosa Sequoia 224,   rag{AT}stat{DOT}stanford{DOT}edu
Course web page:
For Autumn 2012 complete materials go here

Registrar's information
  EDUC 351B: Statistical issues in testing and assessment
  Units: 2-3
  Room:  160-325
  Schedule: Monday 3:15-5:05pm
  Grading Basis: Letter-S/NC
  Course Description:
  The new book by Howard Wainer, "Uneducated Guesses: Using Evidence to Uncover 
  Misguided Education Policies" is the basis for this seminar. Also included will 
  be supporting research literature and data analysis activities for topics 
  such as college admissions, methods for missing data, assessment of 
  achievement gaps, and the use of value-added analysis.

Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies.   Howard Wainer (Author)
amazon page    available in paper and Kindle
Publisher: Princeton University Press (August 21, 2011), ISBN-10: 0691149283 ISBN-13: 978-0691149288
Wainer youtube video from PUPress--Howard Wainer critiques misguided education policies

Week 1  9/23.   Organization; meet and greet.
In the news, September 2013
1.   California Voters Strongly Support Student Testing According to Rossier/PACE Poll     New USC Poll: Public Approval for Testing and Evaluations
2.  College Admissions Officers Ready for Revamped SAT     Hints on the New SAT

Week 2  9/30.   SAT and College Admissions:
In the news
1. STAR (Calif) test cheating. See also Wainer Chap 8
   Schools lose academic ratings after claims of cheating     SD County teachers caught cheating on state tests    San Jose teacher helped second-graders cheat on STAR test
2. PACE Accountability Conference

Class content (also see "Unit 1" grouping of materials)
1. Wainer Ch1 (SAT optional per NACAC)
    Wainer posting of early form (Feb 2009) of Chap 1 materials
    Source Material: " Report of the Commission on the Use of Standardized Tests in Undergraduate Admission" September 2008, National Association for College Admission Counseling.
2. Empirical Research on SAT Use and Usefulness
Questions: Does SAT help (enough) in college admissions?    Would something less noxious (e.g., state tests, HS grades) do as well?    Isn't it all SES?
ETS reports: 2012 SAT Report (see displays pp. 6, 10, 22)

Week 3  10/7.   SAT and College Admissions, continued
In the news
Stagnant 2013 SAT Results are Call to Action for the College Board       ETS report, 2013 version. The 2013 SAT Report on College and Career Readiness

Class content (also see "Unit 1" grouping of materials)
1. Wainer Ch1 (SAT optional per NACAC) and Empirical Research on SAT Use and Usefulness
Questions: Does SAT help (enough) in college admissions?    Would something less noxious (e.g., state tests, HS grades) do as well?    Isn't it all SES?
One source to examine: UC and the SAT: Predictive Validity and Differential Impact of the SAT I and SAT II at the University of California Saul Geiser with Roger Studley University of California Office of the President October 29, 2001
Mantra of UC researchers:
 High-school grades provide a fairer, more equitable and ultimately more meaningful basis for
admissions decision-making and, despite their reputation for "unreliability," remain the
best available indicator with which to hazard predictions of student success in college. 
Background materials on regression: MT woes of regression coefficients slides
Coleman data: adjusted-variables multiple regression   data file, 20 schools     Adjusted variable plot
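The adjusted-variable idea can be sketched in a few lines. This is a minimal illustration, assuming made-up numbers for 8 schools (not the Coleman data file) and hypothetical helper names: the slope from regressing adjusted y on adjusted x1 (each residualized on x2) should match the x1 coefficient from the two-predictor multiple regression.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    a, b = simple_ols(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def partial_slope(x1, x2, y):
    """Coefficient of x1 in y ~ x1 + x2, from the centered normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Hypothetical values for 8 schools (NOT the Coleman data).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2, 15.0, 13.9]

e_y = residuals(x2, y)    # y adjusted for x2
e_x1 = residuals(x2, x1)  # x1 adjusted for x2
_, adjusted_slope = simple_ols(e_x1, e_y)
multiple_slope = partial_slope(x1, x2, y)
print(adjusted_slope, multiple_slope)  # the two slopes should agree
```

Plotting e_y against e_x1 gives the adjusted variable plot; its fitted slope is the multiple regression coefficient.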

Week 4  10/14
   In the news
Kids aren't our only failure.   American Adults Score Poorly on Global Test      U.S. adults lag behind counterparts overseas in skills

1. Wainer Chap 2/5. Use of achievement tests (and test equating). Chap 2/5 technical topics (Use of achievement tests instead of SAT).
Test Equating handout.
California K-12 testing comparability saga. Comparability calculation (CA AB265) relevant to Wainer Ch.2/5.
Test equating, from R MiscPsycho package, the SL function for the Stocking-Lord Equating Procedure.
IRT equating procedures: Weeks, J. P. (2009). plink: IRT separate calibration linking methods
Non-IRT alternatives:
   R-package equate vignette: Statistical Equating Methods, Anthony Albano, January 7, 2011
   Kernel and Traditional Equipercentile Equating With Degrees of Presmoothing, Paul Holland, April 2007
   EQUATING TEST SCORES (without IRT), Samuel A. Livingston, 2004
   A very good review of equating error in observed-score test equating: Evaluating Equating Error in Observed-Score Equating, Wim J. van der Linden, University of Twente, Enschede, The Netherlands. Law School Admission Council Computerized Testing Report 04-03, July 2006
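As a concrete anchor for the non-IRT methods, here is a minimal sketch of linear equating, the simplest observed-score method. It assumes a random (equivalent) groups design; the scores and the function name are hypothetical, and this is not the code of any of the R packages listed above.

```python
from statistics import mean, pstdev

def linear_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    mx, my = mean(form_x_scores), mean(form_y_scores)
    sx, sy = pstdev(form_x_scores), pstdev(form_y_scores)
    return my + (sy / sx) * (x - mx)

# Hypothetical raw scores from two randomly equivalent groups.
form_x = [10, 12, 15, 18, 20, 22, 25]
form_y = [14, 15, 19, 21, 24, 27, 30]

equated = [linear_equate(x, form_x, form_y) for x in form_x]
print(equated)
```

By construction the equated Form X scores have the same mean and standard deviation as the Form Y scores. Equipercentile equating matches the whole score distribution rather than only its first two moments.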

2. Wainer Chap 3. Cut scores, PSAT
Ch 3 technical topic: Cut scores and diagnostic testing. Stat141 handout false-positive, false negative Medical Diagnosis (SW text section ex3.17)
2011 in the news and PSAT background
 Washington Post: The problems with the PSAT and National Merit program
 NYU Exiting National Merit Scholarship Citing Test Process    
          Background: ETS PSAT page
Technical Resources:
Bayes Thm relevant for cut-scores and scholarships, Wainer Ch.3.
     Bayes Thm and diagnosis, Jim Berger, Objective Bayes Berger Talk 2005;    Univ Chicago lectures 2011     2006 publication
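The Bayes theorem calculation behind cut scores and diagnostic tests can be sketched as follows. The base rate, sensitivity, and specificity are hypothetical numbers chosen for illustration; they are not taken from Wainer or from the handouts.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """P(truly qualified | scored above the cut), by Bayes theorem."""
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Hypothetical: 1% of examinees truly merit the award; the cut score flags
# 90% of them, and correctly passes over 90% of everyone else.
ppv = positive_predictive_value(base_rate=0.01, sensitivity=0.9, specificity=0.9)
print(round(ppv, 4))
```

With a low base rate, most examinees above the cut are false positives even when the test looks accurate. The same arithmetic applies to medical screening.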

Unit 1 Materials Core and background
Wainer Chapters 1-3 (and Chap 5, continuation of Chap 2)-- Rebuttal to Report of the Commission on the Use of Standardized Tests in Undergraduate Admission September 2008
In the words of Wainer:
  a report, commissioned by the National Association for College Admission Counseling, 
  that was critical of the current college admission exams, the SAT and the ACT. 
  The commission was chaired by William R. Fitzsimmons, the dean of admissions and 
  financial aid at Harvard.
  The report was reasonably wide-ranging and drew many conclusions while offering alternatives. 
  Although well-meaning, many of the suggestions only make sense if you say them fast.
  Among their conclusions were:
      Schools should consider making their admissions "SAT optional," that is allowing 
      their applicants to submit their SAT/ACT scores if they wish, 
      but they should not be mandatory. The commission cites the success that pioneering 
      schools with this policy have had in the past as proof of concept.
      Schools should consider eliminating the SAT/ACT altogether and substituting instead 
      achievement tests. They cite the unfair effect of coaching as the 
      motivation for this -- they weren't naive enough to suggest that because there was no 
      coaching for achievement tests now that, if they became more high stakes 
      coaching for them would not be offered. Rather, they argued that such coaching would be 
      related to schooling and hence more beneficial to education than is 
      coaching that focuses on test-taking skills.
      That the use of the PSAT with a rigid qualification cut-score for such scholarship 
      programs  as the Merit Scholarships be immediately halted.
    Wainer posting of early form (Feb 2009) of Chap 1 materials
    comparing the incomparable early version (Dec 1999) of Chap 5
Source Material: "Report of the Commission on the Use of Standardized Tests in Undergraduate Admission" September 2008, National Association for College Admission Counseling.
More NACAC    Preparation for College Admission Exams National Association for College Admission Counseling    National Association for College Admission Counseling Foundations of Standardized Admission Testing
Some commentary on the NACAC report:
Dramatic Challenge to SAT and ACT
In Defense of the SAT, Columbia U
Standardized Tests: Fair or Unfair?

Dick Atkinson on College Admissions testing:
Reflections on a Century of College Admissions Tests  Educational Researcher, Vol. 38, No. 9, pp. 665-676  
      cited Univ of Calif report Validity Of High-School Grades In Predicting Student Success Beyond The Freshman Year: High-School Record vs. Standardized Tests as Indicators of Four-Year College Outcomes
The New SAT: A Test at War with Itself   invited presidential address at the annual meeting of the American Educational Research Association held in San Diego, California on April 15, 2009

A more substantial regression exercise with SAT and GPA: SAT Scores, High Schools, and Collegiate Performance Predictions Jesse Rothstein Princeton University
Most recent comprehensive item on SAT, SES etc
Psychological Science 2012 23: 1000 originally published online 2 August 2012. Paul R. Sackett, Nathan R. Kuncel, Adam S. Beatty, Jana L. Rigdon, Winny Shen and Thomas B. Kiger. The Role of Socioeconomic Status in SAT-Grade Relationships and in College Admissions Decisions
This has cites to the Rothstein paper and to earlier Geiser et al UC President's Office studies. NOTE: remarkably, all links at the UC President's Office are broken; they claim they reorganized their website. I have the Geiser and Studley report and may just post it; try a Google search: Geiser Studley "UC and SAT" for some proprietary postings.
12/7/11 for Chap 1,2. PACE report: State Standards, the SAT, and Admission to the University of California
For better or worse, these analyses are taken seriously by the University of California administration and Regents:   ADMISSIONS TESTS AND UC PRINCIPLES FOR ADMISSIONS TESTING: A Report from the Board of Admissions and Relations with Schools (BOARS)
The following seems to be the source data analysis document: Agronow, S., and Studley, R., 2007, Prediction of college GPA from new SAT test scores - a first look. Annual meeting of the California Association for Institutional Research (CAIR), Nov 16, 2007

Week 5  10/21
In the news
   Ed Haertel speaks on value-added.   Do student test scores provide solid basis to evaluate teachers?     Reliability And Validity Of Inferences About Teachers Based On Student Test Scores

AP news items: More schools opening Advanced Placement courses to all students      Schools Shutting Out Some Top Students from AP Courses

1. Wainer Chap 4.
Chapter 4 resources
Earlier version of Chap 4: Educational Psychology Review Volume 12, Number 2 (2000), 201-228, The Aptitude-Achievement Function: An Aid for Allocating Educational Resources, with an Advanced Placement Example William Lichten and Howard Wainer
Using the PSAT/NMSQT and Course Grades in Predicting Success in the Advanced Placement Program, Wayne Camara and Roger Millsap College Board Report No. 98-4
College Board AP report AP report     homepage, updates
Denise Pope: Should AP Be Plan A? (4/13)      Full report    Washington Post coverage
Changes in Advanced Placement Test Taking in California High Schools 1998-2003 Richard S. Brown 01-01-2005
2. Technical topics: IRT intro
see   ltm: An R Package for Latent Variable Modeling and Item Response Theory Analyses Dimitris Rizopoulos Journal of Statistical Software November 2006, Volume 17, Issue 5.     Riz ltm talk at useR! 2008    Manuals: ltm     mirt
LSAT basics data analysis handout
Revelle has a draft text that covers standard statistics plus specialized measurement topics. Ch 7 is test reliability and Ch 8 is IRT
Revelle also did the R-package, psych: psych package
The Psychometrics Task View provides an annotated listing of more than you really want for R-packages relevant to educational testing.
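For the IRT intro, a minimal sketch of the logistic item response function may help fix ideas. It assumes the plain logistic metric (no 1.7 scaling constant). The parameter names a (discrimination), b (difficulty), and c (guessing) follow common usage, the item values are hypothetical, and this is not the ltm or mirt code.

```python
import math

def irf_3pl(theta, a, b, c=0.0):
    """P(correct | ability theta) under the 3PL model; c=0 gives the 2PL."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def pattern_log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern, assuming local independence."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = irf_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1 - p)
    return total

# Hypothetical five-item test: (a, b, c) for each item.
items = [(1.0, -1.5, 0.0), (1.2, -0.5, 0.0), (0.8, 0.0, 0.0),
         (1.5, 0.5, 0.0), (1.0, 1.5, 0.0)]
responses = [1, 1, 1, 0, 0]

for theta in (-1.0, 0.0, 1.0):
    print(theta, pattern_log_likelihood(theta, items, responses))
```

Maximizing this log-likelihood over theta, with item parameters held fixed, gives the maximum-likelihood ability estimate for the pattern.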

Week 6  10/28
In the news
With no state tests, Palo Alto wants one anyway

a. Finish AP discussion: Stand and Deliver, Denise Pope (challenge, don't celebrate, success) reports, other empirical research.
b. Main event: Wainer content: Examinee Choice (Chap 6, 7)
c. Continue IRT examples

Chapter 6-7 resources
On Examinee Choice in Educational Testing. Howard Wainer. Educational Testing Service. David Thissen. University of North Carolina at Chapel Hill REVIEW OF EDUCATIONAL RESEARCH 1994 64: 159
Item Response Theory Models Applied to Data Allowing Examinee Choice   JOURNAL OF EDUCATIONAL AND BEHAVIORAL STATISTICS 1998 23: 236, Eric T. Bradlow and Neal Thomas
Problem Choice by Test Takers, RL Linn, 1998, CRESST Tech Report 485
Quick review of performance assessments May 2011, Performance-based Assessment: Some New Thoughts on an Old Idea
also from Laura Hamilton, SUSE Ph.D
An Investigation of Students' Affective Responses to Alternative Assessment Formats
Construct validity of constructed-response assessments: Male and female high school science performance
Laura S. Hamilton (1999): Detecting Gender-Based Differential Item Functioning on a Constructed-Response Science Test, Applied Measurement in Education, 12:3, 211-235
The Search for Value-Added: Assessing and Validating Selected Higher Education Outcomes

Week 7  11/4
In the news
U.S. threatens to take $3.52 billion from California schools in testing dispute      Our state's loss over school testing

a. Student choice followup
b. Technical topics carryover.
    Standardized regression coefficients.    Reliability vs Accuracy.  Followup to Discussion in class mtg of reliability vs accuracy
    Shoe Shopping Example (esp Bundy version),    Materials on Accuracy of Student Test Scores. See collection at How Accurate are the STAR Scores for Individual Students? An interpretive guide   see 1999 Accuracy Guide, and NY Times column for illustrations of accuracy vs traditional reliability.
c. Main event: Value-added Analysis, Wainer Chapter 9, Value-added analyses
Other versions of the Chap 9 materials Value-Added Models to Evaluate Teachers: A Cry For Help H Wainer, Chance, 2011.         Journal of Consumer Research Vol. 32, No. 2, Sept 2005
More Value-added analysis.
   Ed Haertel speaks on value-added.   Do student test scores provide solid basis to evaluate teachers?     Reliability And Validity Of Inferences About Teachers Based On Student Test Scores
Haertel and friends, EPI report: Problems with the use of student test scores to evaluate teachers
Helen Ladd papers (prob better ones). Teacher effects    NC talk
Journal of Educational and Behavioral Statistics Vol. 29, No. 1, Spring, 2004 Value-Added Assessment Special Issue
Value-Added Measures of Education Performance: Clearing Away the Smoke and Mirrors, PACE
LA Times Teacher Ratings, summer 2010        NEPC vs LATimes
J.R. Lockwood, Harold Doran, and Daniel F. McCaffrey. Using R for estimating longitudinal student achievement models. R News, 3(3):17-23, December 2003.
Fitting Value-Added Models in R  Harold C. Doran and J.R. Lockwood
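To make the arithmetic of value-added concrete, here is a deliberately crude sketch: regress current-year scores on prior-year scores for all students pooled, then average the residuals by teacher. This covariate-adjustment toy is an assumption made for illustration. It is not the longitudinal mixed-model approach of the Lockwood, Doran, and McCaffrey papers, and the records are invented.

```python
from collections import defaultdict
from statistics import mean

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns each teacher's mean residual from the pooled regression
    of current score on prior score.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    mp, mc = mean(prior), mean(current)
    slope = (sum((p - mp) * (c - mc) for p, c in zip(prior, current))
             / sum((p - mp) ** 2 for p in prior))
    intercept = mc - slope * mp
    by_teacher = defaultdict(list)
    for teacher, p, c in records:
        by_teacher[teacher].append(c - (intercept + slope * p))
    return {t: mean(res) for t, res in by_teacher.items()}

# Invented data: two teachers, three students each.
records = [
    ("A", 300, 325), ("A", 340, 360), ("A", 380, 405),
    ("B", 310, 318), ("B", 350, 352), ("B", 390, 399),
]
effects = value_added(records)
print(effects)
```

With three students per class, one unusual score can swing a teacher's estimate noticeably, which is one way to see the reliability concerns raised in the readings above.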
New York, New York
Value-added does New York City. New York schools release 'value added' teacher rankings     Formula uncovers the 'value added'    from the unions: THIS IS NO WAY TO RATE A TEACHER
More on Value-added: A better way to grade teachers By Linda Darling-Hammond and Edward Haertel. NY unions ad
Andrew Gelman on Value-added arithmetic: It's no fun being graded on a curve     more NY  Principals rebel against 'value-added' evaluation   (from Ben Shear) Some VAM results for NYC    Rogosa R-recreation
The Don't do VAM letter to NY from Stanford and friends
Missing Data and Chap 9 stories
A. Wald from the Boeing Math Group (good pictures, pp.20-24)
R packages and resources.  1. Missing data     Stat222 class handout, imputation and analysis using mice
R resources.
Multivariate Analysis Task View, Missing data section, esp packages mice and mi
van Buuren S and Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. see also multiple imputation online
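After multiple imputation, the analyses of the completed data sets are combined with Rubin's rules. The following is a minimal sketch of that pooling step for a single scalar estimate. The five estimates and variances are hypothetical, and the function name is made up; in R, pooling is handled by the mice package itself.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their squared standard errors.

    Returns (pooled estimate, total variance) following Rubin's rules.
    """
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical regression slope estimated on m = 5 imputed data sets.
estimates = [2.0, 2.2, 1.8, 2.1, 1.9]
variances = [0.04, 0.04, 0.04, 0.04, 0.04]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what a single imputation leaves out: it carries the extra uncertainty due to the missing data.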

Week 8  11/11
In the news
California students score at bottom of nation in reading, math    California students among worst performers on national assessment of reading and math

Wainer Chap 10.
Chapter 10 resources
Shopping for Colleges When What We Know Ain't Journal of Consumer Research Vol. 32, No. 2, Sept 2005
More rankings:
   From in the news October 2012 What is the Best University in America?  Comprehensive rundown of the various University rankings (c.f. Wainer Ch. 10)
   America's Top Colleges 2013, Forbes.     Best Values in Private Colleges 2014
Technical items:
Avery et al., A Revealed Preference Ranking of U.S. Colleges and Universities, NBER WP10803
Bradley-Terry methods for rankings in R-package BradleyTerry2
Background on forming composites, the classic by H Wainer-- Estimating Coefficients in Linear Models: It Don't Make No Nevermind
Firth, D. (2005) Bradley-Terry models in R. Journal of Statistical Software, 12(1), 1-12.
Turner, H. and Firth, D. (2012) Bradley-Terry models in R: The BradleyTerry2 package. Journal of Statistical Software, 48(9), 1-21.
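A minimal sketch of fitting a Bradley-Terry model to paired-comparison counts is given here, using the standard iterative (minorization-maximization) update. The win matrix for three colleges is hypothetical, and this pure-Python routine is an illustration, not the BradleyTerry2 implementation.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise wins.

    wins[i][j] = number of times item i was preferred to item j
    (diagonal entries must be 0; every item needs at least one win).
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        norm = sum(new_p)
        p = [v / norm for v in new_p]
    return p

# Hypothetical head-to-head choices among three colleges (10 per pair).
wins = [
    [0, 7, 8],   # college 0 chosen over 1 seven times, over 2 eight times
    [3, 0, 6],
    [2, 4, 0],
]
strengths = fit_bradley_terry(wins)
print(strengths)
```

Under the model, the probability that item i is preferred to item j is strengths[i] / (strengths[i] + strengths[j]), so the fitted strengths induce a ranking.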

Week 9  11/18
In the news More NAEP
1. NAEP road trip? Explaining California Students' Performance on NAEP Martin Carnoy
2. Michelle Rhee reconsidered.
John Merrow 
Subject: Michelle Rhee's Reform Strategy & DC's NAEP Results
Dear Friends and Readers,
Much has been made about the big jump in NAEP scores in Washington, DC.  Chancellor Kaya Henderson attributed the "breakthrough
gains" to better teachers, a stronger curriculum and a 'get tough on teachers' policy, the approach begun by her predecessor,
Michelle Rhee.
Many pundits, politicians and policymakers echoed Henderson's message: "Getting tough" works.

But does it?  I can interpret that same NAEP data to produce THREE very different stories, stories that might carry these

It's a variation on the old saw, "Lies, damn Lies and statistics," substituting "headline writers" for the last term.
So rather than rush to judgment, let's take a careful look at the numbers--and the spin:

A. Wainer Chapter 11.
Chapter 11 resources
Collection of resources at The International Association for Computerized and Adaptive Testing (IACAT)  esp linked Rudner tutorial    original David Weiss CAT site
Computerized Adaptive Testing: A Primer [Hardcover] Howard Wainer
Nontechnical primer. A Framework for the Development of Computerized Adaptive Tests Nathan A. Thompson, Assessment Systems Corporation David J. Weiss, University of Minnesota
R-resources   Google search: Computerized and Adaptive Testing
  Package 'catR', August 29, 2013. Title: Procedures to generate IRT adaptive tests (CAT). Author: David Magis (U Liege, Belgium), Gilles Raiche (UQAM, Canada). Description: The catR package allows the generation of response patterns under computerized adaptive testing (CAT) framework, with the choice of several starting rules, next item selection routines, stopping rules and ability estimators. Control methods for item exposure and content balancing are also included
Random Generation of Response Patterns under Computerized Adaptive Testing with the R Package catR David Magis, Gilles Raiche Vol. 48, Issue 8, May 2012
Package catIrt August 29, 2013 Title An R Package for Simulating IRT-Based Computerized Adaptive Tests Author Steven W. Nydick Description Functions designed to simulate data that conform to basic unidimensional IRT models (for now 3-parameter binary response models and graded response models) along with Post-Hoc CAT simulations of those models with various item selection methods, ability estimation methods, and termination criteria
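One common "next item selection routine" in CAT is maximum Fisher information at the current ability estimate. The sketch below assumes a 2PL model and a tiny hypothetical item bank. It shows only the selection step (no ability re-estimation, exposure control, or stopping rule) and is not the catR or catIrt code.

```python
import math

def prob_2pl(theta, a, b):
    """P(correct | theta) under the 2PL model."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob_2pl(theta, a, b)
    return a * a * p * (1 - p)

def next_item(theta, item_bank, administered):
    """Index of the not-yet-administered item with maximum information."""
    candidates = [i for i in range(len(item_bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *item_bank[i]))

# Hypothetical bank of (a, b) pairs: discrimination, difficulty.
item_bank = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.5, 2.0)]

first = next_item(0.1, item_bank, administered=set())
second = next_item(0.1, item_bank, administered={first})
print(first, second)
```

In a full CAT, the ability estimate is updated after each response and the selection step is repeated until a stopping rule is met.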

B. Wainer Chapter 8.
Chap 8 resources
online version: A Little Ignorance: How Statistics Rescued a Damsel in Distress
2012 in the news: Teachers also cheat (cf Chap 8). (11/26)  Feds: Teachers embroiled in test-taking fraud   Test cheating probe nets former educator   TN, other teachers embroiled in test-taking fraud, feds say
The 'MiscPsycho' package by Harold Doran has a number of functions useful for standard psychometrics and some of the topics in the Wainer text cheat. Vignette section 12
cheating example in package poLCA. French data in examCheating from package r2lh.

Week 10  12/2
Student presentations.