Current Grants:

Previous Grants:

Exploratory Study of the Relationship between Students' Mental Models of Climate Change and Their Environmentally Sustainable Decisions and Behavior
Sponsor Ref Number: 0746137
Period of Support: 08/07 - 09/09
Amount of Award: $188,312

Project summary: The purpose of this SGER is to develop measures of students' conceptions of climate change and examine the conjecture that environmentally sustainable decisions and behavior are related to these conceptions of the natural world. Specifically, the study asks: (1) Can mental models of climate change be measured reliably, validly and efficiently to provide a gauge on students' understanding? (2) If so, what mental models of climate change do middle school, high school and college students hold? (3) What is the correlation between students' mental models and their demographic characteristics and reported behavioral and policy preferences regarding climate change?

The project is timely because environmental issues are receiving wide media coverage and much public discussion. Improving our understanding of what kinds of education interventions alter students' behaviors around climate change is of national importance. The study would initiate an innovative approach to building the empirical knowledge base about how and why secondary school and college students alter their behaviors around the critical issue of climate change.

Understanding Academic Performance in Organic Chemistry: An Investigation Examining Underrepresented Groups
Sponsor Ref Number: 0814559
Period of Support: 08/08 - 08/10
Amount of Award: $804,594

Project summary: Thousands of students enroll in "Introduction to Organic Chemistry" (O-CHEM) each year. Successful completion of O-CHEM is a prerequisite for many graduate and professional STEM programs, yet the failure rate is notoriously high. O-CHEM has unique knowledge representation protocols that often challenge even students who initially master general chemistry. There are very few large-scale studies examining why some students succeed while others have difficulty in O-CHEM. Such issues are of particular importance when considering the impact on under-represented minority students and women. A large body of evidence indicates that these groups perform significantly worse in O-CHEM, contributing to the under-representation of these groups in STEM careers. Previous studies by the PI focusing on how students succeed in STEM courses used techniques such as concept-mapping to examine "knowledge organization." Recent studies show that both experts and high-achieving students demonstrate enhanced knowledge bases. No study has examined how knowledge organization mediates academic performance in O-CHEM. Furthermore, no study has examined the activities employed by students to organize their knowledge appropriately, although more generally, investigations into both expert performance and academic success have demonstrated that concentrated, goal-directed activities are correlated with superior performance. Previous studies by the PI have successfully identified the importance of such activities using techniques such as think-aloud protocol analysis, structured interviews, and diaries.

The goal of this study is to combine the insights provided by multiple measurement techniques such as concept-mapping, think-aloud protocol analysis, and diaries in order to examine factors contributing to both academic success and difficulties in O-CHEM, particularly among under-represented minority students and women. This is a "frontier research empirical study." It is being undertaken collaboratively between departments of education, psychology (cognitive science), and chemistry. It is examining equal numbers of minority and non-minority students, and equal numbers of males and females within each group. Groups are also composed of equal numbers of high-, average-, and low-achieving students.

This study has four specific objectives: (1) examine O-CHEM knowledge structures to identify major conceptual difficulties; (2) compare student-instructor O-CHEM knowledge structure correspondence and identify specific discrepancies; (3) compare O-CHEM problem solving success and knowledge structures; and (4) compare specific study activities and knowledge structures. It may lead to the design of randomized trial studies involving individualized and group tutoring in O-CHEM and, ultimately, the restructuring of O-CHEM syllabi and teaching methods in order to close potential gaps between student knowledge and instructor ideals.

Stanford Challenge Award: Marine Science Curriculum for K-12
Period of Support: 09/08-09/09
Amount of Award: $70,000

Project summary: Stanford University's Hopkins Marine Station, SEAL, and the Hilton Bialek Habitat (affiliated with Carmel Unified School District) are collaborating on a program in which Hopkins graduate students, post-docs and faculty teach K-12 students from underserved communities in Monterey County about the oceans, marine conservation, and sustainability using a place-based, outdoor learning approach. We call this the Hopkins Ocean Literacy Program. Learning objectives include demonstrated student understanding of the oceans and how they relate to all forms of life in the watershed, as well as an increased sense of environmental stewardship indicated by demonstrated changes in attitudes and behaviors. Activities will include ocean-based restoration projects, weekend ocean and watershed explorations, after-school programs, and trips to the Hopkins Marine Station. Through our curriculum, students in communities without outdoor programs will have the opportunity to learn from nature by participating in place-based learning activities that are both informative and fun. We will use the experience gained through this program to develop an on-line curriculum to help students, teachers and parents increase their understanding of marine science and the central role healthy oceans play in the global environment. These web-based learning tools will be hosted by Stanford and made available to teachers nationally.

National Center for Research on Evaluation, Standards and Student Learning (CRESST), Studies of Performance-Based Assessments-Measurement of Progress
Sponsor Ref Number: UCLA 0070 G CC908-06
Period of Support: 2/2001-2/2006
Amount of Award: $979,080

Project summary: Standards and accountability measures require concrete links to everyday practices of teachers and the learning opportunities of students if they are going to make a real difference in student learning. Needed are classroom assessments that are aligned with external accountability requirements, which teachers can use on a regular basis to assess their students' progress, provide feedback, and take action accordingly. Such measures also can communicate to students what is important to learn as well as provide essential feedback on how they are doing. In one of our projects we investigate strategies for creating such classroom assessments (embedding assessments in science curricula that systematically tap various dimensions of learning) and explore the integration of classroom-based assessments and measures of students' opportunities to learn. In the other, we conduct psychometric studies, both theoretical and empirical, to examine the reliability and validity of cognitive interpretations of different kinds of science achievement assessments.

Assessing Student Learning and Accounting for Their Achievement: The Quest to Hold Higher Education Accountable
Award Number: Atlantic Philanthropies (USA) Inc.
Period of Support: 9/2000-9/2005
Amount of Award: $720,863.75

Project summary: This four-year study evaluates alternative accountability systems and recommends principles for designing such systems with the goal of improving teaching and learning, while recognizing the critical importance that research plays in some of these institutions. The study takes the perspective of a decision-maker and looks at the intended and unintended consequences of alternative accountability system designs, evaluating the fit between input, process, and output indicators with valued outcomes. Inevitably, the project will have to conceptualize and empirically test ideas for matching output indicators, especially learning-assessment indicators, with valued outcomes; identify and develop alternative accountability models for tracking progress in improving student outcomes; and recommend elements of a comprehensive indicator system for higher education institutions.

List of publications & presentations acknowledging award:

  • Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. Windows into the Mind.  Paper (In Press).  International Journal of Higher Education.
  • Shavelson, R.J., & Huang, L. (2003). Responding responsibly to the frenzy to assess learning in higher education. Change, 35(1), 10-19.
  • Naughton, B.A., Suen, A.Y., & Shavelson, R.J. (April 2003). Accountability for what? Understanding the learning objectives in state higher education accountability programs. Annual meeting of the American Educational Research Association, Chicago.
  • Shavelson, R. "Assessment and Achievement: The Quest to Hold Higher Education Accountable."  Stanford University Invited Address, California Association for Institutional Research, Rohnert Park, CA, November 14, 2003.

Embedding Assessments in the FAST Curriculum:
On Beginning the Romance among Curriculum, Teaching and Assessment

Award Number: NSF ESI-0095520
Period of Support: 8/15/2001 - 7/31/2003
Amount of Award: $764,316

Project summary: Paul Black and Dylan Wiliam (1998), in a comprehensive review of the effects of formative evaluation on student performance, concluded that if this feedback were closely connected to instruction and provided information about how to improve performance, it would produce a large positive effect on student performance. They also noted that this kind of feedback rarely occurred in classrooms. The importance of this kind of evaluation was recognized in the National Science Education Standards where a link was forged between content, teaching and assessment standards. This feasibility study begins a "romance" among curriculum, teaching and assessment. It sets forth a framework for conceiving of science achievement with links to methods for assessing different aspects of achievement. This framework is then used to create a set of assessments, both formative (embedded within a unit) and summative (end-of-unit), for a unit from the "Foundational Approaches in Science Teaching" (FAST) middle school science program. The developmental process will be documented; the assessment-embedded-FAST unit will be evaluated in a small, randomized experiment focusing on both student learning and teacher implementation (especially formative feedback to students); and claims about the link between types of science achievement and assessment methods will be evaluated empirically. If the feasibility study demonstrates that the framework and methods have a salutary effect on teaching and learning, a full-scale research and development effort to link assessments with curricula will be proposed to NSF, using FAST and at least one other middle-school program.

List of publications & presentations acknowledging award:

  • On The Integration of Formative Assessment in Teaching and Learning: Implications for New Pathways in Teacher Education
    Richard J. Shavelson, Stanford University, USA, for the Stanford Education Assessment Laboratory and the University of Hawaii Curriculum Research and Development Group. Paper presented at the biennial meeting of the European Association for Research on Learning and Instruction, 2003, Padua, Italy.
  • On The Integration of Formative Assessment in Teaching and Learning: Implications for New Pathways in Teacher Education (In Press)
    Richard J. Shavelson, Stanford University, USA, for the Stanford Education Assessment Laboratory and the University of Hawaii Curriculum Research and Development Group.  Pergamon Publication in an edited book: Achtenhagen, F., & Oser, F. (Eds.) (In Press), New Pathways in the Field of Teacher Education.

Developing, Supporting, and Aligning Classroom and Large-Scale Assessment to Sustain Education Reform: Phase One

Award number: NSF REC-9909370
Period of support: 4/15/2000 - 3/31/2003
Amount of award: $1,505,125

Project summary: This project investigates an innovation that has the potential to raise student achievement in science education to the level and quality espoused in standards-based reform efforts: the development of models and practices to enhance teachers' formative assessment repertoires. Using a variety of research methods, this project investigates and develops assessment procedures that teachers can employ to promote the quality of science teaching and learning, and related areas of mathematics. While the focus initially is on formative assessment practices - paying particular attention to the impact of these types of classroom assessment on student learning, engagement, and sense of purpose - it then moves to teachers' summative judgments and their link to formative work.
An additional aspect of the project is to identify issues associated with broader implementation of programs for teachers that support these assessment practices, then develop and test model plans for high quality professional-development activities about classroom assessment procedures that comport with the research findings. The research also investigates the challenges associated with promulgating these models and practices to large numbers of teachers. Thus the study aims for wide dissemination of ways in which students respond to changes in science instruction and assessment that aim to enhance their roles in classroom assessment, including in peer- and self-assessment.
Toward the end of the initial three-year period, the study begins to probe how classroom assessment and large-scale assessment might be mutually reinforcing to raise educational quality. A second phase - to study and formulate alternative large-scale assessment systems that integrate classroom assessments with examination data for accountability and monitoring purposes - may follow if evidence warrants.

List of publications & presentations acknowledging award:

  • Coffey, J., Sato, M., & Schneider, B. (2001, April). Classroom Assessment Up Close-And Personal. Paper presented at the Annual Meeting of the American Educational Research Association, Seattle, WA.
  • Shavelson, R.J. (January 2003). Bridging the Gap between Formative and Summative Assessment. Paper presented at the National Research Council Workshop on Bridging the Gap between Large-scale and Classroom Assessment.  National Academies Building, Washington, DC.
  • On Linking Formative and Summative Functions in the Design of Large-Scale Assessment Systems (in publication)
    Richard J. Shavelson, Stanford University
    Paul J. Black, Dylan Wiliam, King's College London
    Janet Coffey, Stanford University

Center for Assessment and Evaluation of Student Learning
Award Number: WestEd ESI-0119790

Period of Support: 9/2001-8/2006
Amount of Award: $727,019

Project Summary: The aim of the proposed national Center for Assessment and Evaluation of Student Learning (hereafter CAESL or "the Center") is to address the need for increasing the assessment capacity within the K-12 science education system, thereby increasing student literacy in science. CAESL is a collaboration among The Concord Consortium, CRESST/University of California Los Angeles, Stanford University, Lawrence Hall of Science (LHS) and the Graduate School of Education at the University of California-Berkeley, and WestEd. CAESL will also partner with the Bay Area Schools for Excellence in Education (BASEE, an NSF-funded local systemic change project involving eight school districts in the San Francisco Bay Area), El Centro Unified School District, Fresno Unified School District, Garvey School District, Kings Canyon Unified School District, Pomona Unified School District, Sacramento City Unified School District, San Francisco Unified School District, and San Diego Unified School District, which will serve as test beds for Center activities and resources. In addition, the Center will work with San Jose State University to co-develop and integrate a variety of resources into its preservice and graduate programs. Apple Computer, Inc. will support the Center by providing significant cost share ($1.25 million) in the form of consulting, equipment and dissemination.

In order to achieve its goal, CAESL will initiate research and development projects to address all facets of assessment through the following:
1. Enhance the capacity of prospective assessment and evaluation professionals through a collaborative graduate program among the university partners.
2. Develop and field test models for enhancing the formative and summative assessment capabilities of practicing science teachers through professional development and developing teachers' capacity to lead assessment efforts in their districts.
3. Enhance the formative and summative assessment capabilities of preservice science teachers through the development of assessment course modules.
4. Conduct applied research to inform the Center itself, the field, and practitioners on (a) formative and summative assessment practices and (b) technology-intensive assessment environments, and use findings from this research to generate new products.
5. Enhance the capacity of parents, school administrators, policy makers, and the public to make decisions about, and support, the appropriate educational roles of different kinds of assessment and evaluation through outreach programs.

Inverness Research Associates will provide formative and summative evaluation services to assess Center outcomes and success.

List of publications & presentations acknowledging award:

Transferring New Assessment Technologies to Teachers and Other Educators
Award number: NSF ESI-9596080, A004
Period of support: 3/1/1995 - 7/31/2000
Amount of award: $854,738

Project summary: The current grant was a supplement to a grant awarded in 1990 (NSF TPE-905543; $1,038,718). The original grant was carried out in four phases: (1) compile and classify emerging concepts and procedures in new assessment from currently available sources, (2) create performance assessments and study the procedures for doing so, (3) design and evaluate student portfolios, and (4) evaluate the quality of the assessments. The supplement to this grant sought to disseminate the professional development parts of the original grant beyond research reports and professional development materials (e.g., Brown, J. & Shavelson, R.J. (1996). Assessing hands-on science: A teacher's guide to performance assessment. Thousand Oaks, CA: Corwin Press, Inc.), and extended what we had learned by designing a study to evaluate inquiry-based science education programs to meet Congressional demands for accountability. This latter extension occupied a significant amount of project time. It incorporated traditional and performance-based assessments into a multilevel-multifaceted model of evaluation to assess the learning impacts of science education reform programs. The multilevel aspect of the evaluation conceived of learning assessments varying in distance from the enacted curriculum in the classroom. Science journals provided immediate measures of curricular impact. Close measures took one step further away from classroom activities in pushing students to reason. Proximal measures focused on the same concepts but in a novel way. Distal measures were consistent with a state science framework but students had not been prepared directly in the curriculum for the assessments. The multifaceted aspect of the model recognized that science achievement involved more than extensive propositional knowledge as measured by multiple-choice tests and recognized the importance of procedural and strategic knowledge (see Figure 1).
We collected extensive evaluation data and provided evidence of the capacity of the model to pick up even weak "treatment" effects of inquiry curriculum. The present proposal draws heavily on the work of this prior grant.

List of publications & presentations acknowledging award:

  • Stecher, B.M., Klein, S.P., Solano-Flores, G., McCaffrey, D., Robyn, A., Shavelson, R.J., & Haertel, E. (2000). The effects of content, format, and inquiry level on performance on science performance assessment scores. Applied Measurement in Education, 13(2), 139-160.
  • Shavelson, R.J., & Ruiz-Primo, M.A. (1999). On the psychometrics of assessing science understanding. In J.J. Mintzes, J.H. Wandersee & J.D. Novak (Eds.), Assessing science understanding: A human constructivist view. New York: Academic Press.
  • Shavelson, R.J. (1999). On the role of assessment in self-directed learning. In W. Althof, F. Baeriswyl & K.H. Reich (Eds.), Autonomie und Entwicklung: Festschrift zum 60. Geburtstag von Fritz Oser, 65-93. University of Freiburg Press.
  • Shavelson, R.J., & Ruiz-Primo, M.A. (1999). Leistungsbewertung im naturwissenschaftlichen Unterricht (Evaluation in natural science instruction). Unterrichtswissenschaft, 27, 102-127.
  • Solano-Flores, G., Jovanovic, J., Shavelson, R.J., & Bachman, M. (1999). On the development and evaluation of a shell for generating science performance assessments. International Journal of Science Education, 21 (3), 293-315.
  • Shavelson, R.J., Ruiz-Primo, M.A., & Wiley, E.W. (1999). Note on sources of sampling variability in science performance assessments. Journal of Educational Measurement, 36(1), 61-71.
  • Shavelson, R.J., Solano-Flores, G., & Ruiz-Primo, M.A. (1998). Toward a science performance assessment technology. Evaluation and Program Planning, 21, 171-184.
  • Klein, S.P., Stecher, B.M., Shavelson, R.J., McCaffrey, D., Ormseth, T., Bell, R.M., Comfort, K., & Othman, A.R. (1998). Analytic versus holistic scoring of science performance tasks. Applied Measurement in Education, 11(2), 121-137.
  • Shavelson, R.J. (1997). On a science performance assessment technology: Implications for the future of the national assessment of educational progress. In National Academy of Education (Ed.), Assessment in transition: Monitoring the nation's educational progress, background studies. Stanford, CA: National Academy of Education.
  • Solano-Flores, G., & Shavelson, R.J. (1997). Development of performance assessments in science: Conceptual, practical, and logistical issues. Educational Measurement: Issues and Practice, 16(3), 16-24.
  • Klein, S.P., Jovanovic, J., Stecher, B.M., McCaffrey, D., Shavelson, R.J., Haertel, E., Solano-Flores, G., & Comfort, K. (1997). Gender and racial/ethnic differences on performance assessments in science. Educational Evaluation and Policy Analysis, 19(2), 83-97.
  • Shavelson, R.J. (1996). On school reform: Curriculum and instruction, yes... but don't forget assessment! Hong Kong Educational Research Journal, 11 (2), 147-156.
  • Ruiz-Primo, M.A., & Shavelson, R.J. (1996). Rhetoric and reality in science performance assessments: An update. Journal of Research in Science Teaching, 33(10), 1045-1063.
  • Ruiz-Primo, M.A., Shavelson, R.J., & Baxter, G.P. (1995). Evaluation of a prototype teacher enhancement program on performance assessment. In P. Kansanen (Ed.), Discussions of some education issues VI. (Research Report 145). Finland: University of Helsinki, Department of Teacher Education, 173-211. Reform in the United States. In D.K. Sharpes & A-L Leino (Eds.), The dynamic concept of curriculum: Invited papers to honour the memory of Paul Hellgren. (Research Bulletin 90). Finland: University of Helsinki, Department of Education, 57-76.
  • Shavelson, R.J., Gao, X., & Baxter, G.P. (1995). On the content validity of performance assessments: Centrality of domain specification. In M. Birenbaum & F. Dochy (Eds.), Alternatives in assessment of achievements, learning focuses and prior knowledge. Boston: Kluwer Academic Publishers.
  • Shavelson, R.J. (2000, May). Trends in science assessment: Linking methods to facets of achievement. Invited talk to the Department of Educational Measurement, Umeå University, Sweden.
  • Shavelson, R.J., & Ruiz-Primo, M.A. (1999, March). Assessing NSF Programming-Standards-Based Reform Assessment Technologies: San Francisco Unified School District. Briefing, Performance Evaluation Review, Washington, DC: NSF.
  • Ayala, C.C., & Shavelson, R.J. (2000, April). New dimensions for performance assessments. Paper presented at the American Educational Research Association (AERA) Annual Meeting, New Orleans.
  • Shavelson, R.J. (2000, March). On balancing accountability and learning goals in assessing science achievement. Invited talk for the Washington Educational Research Association (WERA) Research Conference, Seattle, WA.
  • Shavelson, R.J. (2000, March). Accountability issues from invited talk & generalizability of performance measurements. Invited breakout session for the Washington Educational Research Association (WERA) Research Conference, Seattle, WA.
  • Shavelson, R.J. (1999, May). On linking assessment to a cognitive model of science achievement. Invited talk to Berkeley Evaluation and Assessment Research (BEAR), Berkeley, CA. Boston , MA.
  • Shavelson, R.J. (1998, October). On assessment and evaluation in science education reform. Presentation to Bay Area Schools for Excellence in Education (BASEE), Palo Alto, CA.
  • Backman, J., Hardy, C., & Shavelson, R.J. (1998, July). Assessing student learning. Presentation to the California LASER K-8 Science Education Strategic Planning Institute.
  • Shavelson, R.J. (1998, June). On linking assessment of learning with sustainable development. Invited address at the Conference on Sustainable Development, Rantasalmi, Finland.

Models-Based Assessment Design: Individual and Group Problem Solving in Science
Award number: UCLA 0070-G-9H813
Period of support: 6/15/1996-6/30/2001
Amount of award: $875,776

Project summary: Assessment plays a central role in current education reform. Indeed, it is one of the major instruments of reform, if not the major instrument. The rationale behind this policy instrument goes something like: (a) if teachers teach to the test, and they do; and (b) if students tend to learn what they are taught, and they do; then (c) by using authentic, alternative, or performance assessments that tap into "higher-order" thinking and/or problem-solving in a subject-matter, the chances are that the way teachers teach and what students learn may be changed in a manner consistent with these assessments and the reform (Glaser, Raghavan, & Baxter, 1992; Shavelson, Carey & Webb, 1989; Shavelson, Baxter & Pine, 1991). This project addresses one aspect of this chain of reasoning: the claim that authentic, alternative or performance assessments tap into problem-solving in a subject-matter domain. Messick's work (1995) provides conceptual background for addressing the construct validity issues raised by such claims (see also Embretson, 1995).
More specifically, this project examines the problem-solving and cognitive-structure claims through a series of studies that address the: (a) conceptual underpinnings of the design of alternative, especially performance, assessments (e.g., Shavelson, in press); (b) exchangeability of alternative methods of measuring performance (e.g., notebooks based on hands-on investigations, computer simulations, pencil & paper objective tests), especially as they impact diverse populations such as language minority and handicapped students (e.g., Baxter & Shavelson, 1994; Dalton, Morocco, Tivnan, & Rawson, 1994; Solano-Flores et al., in preparation); and (c) structural representations of knowledge generated by concept maps (Ruiz-Primo & Shavelson, in press).
The team of researchers conducting studies in this project seeks, then, to understand how variability in students' problem-solving performances arises over different tasks, over different methods, over differences in students' backgrounds, and the like. Ambitiously stated, the project attempts to make progress toward building a framework or "theory" of subject-matter achievement (in large part, achievement in mathematics and science) that underlies claims that alternative assessments tap higher-order knowledge and can address policy needs (e.g., use of alternative assessment for different student populations).
Such a theory of learning and performance (Glaser & Bassok, 1989; Glaser & Silver, 1994) would, for example, illuminate the variability observed in students' performances from task to task, from occasion to occasion, and from one measurement method to another (e.g., Cronbach, Linn, Brennan, & Haertel, 1995), perhaps by explaining the transition of students' knowledge from little to partial to full to expert in a subject-matter domain. This theory would also address the propositional and procedural knowledge, problem-solving heuristics, and attitudes of mind that students are to acquire in the domain. Measurement methods vary as to what aspect of achievement they tap. The theory would link measurement methods to the components of the domain, indicating which methods are best suited to measure each type of knowledge, or the relation between knowledge types in applications or problem solving. For example, paper-and-pencil methods (e.g., multiple-choice) appear to be particularly apt for measuring facts and concepts (at least on face validity grounds); concept maps at measuring relations among concepts; performance assessments at measuring procedural knowledge. Equally important, the theory would also link the medium or symbolic representation used by a measurement method to the propositional or procedural knowledge tapped by that medium. Finally, the theory would address both demographics and content, paying particular attention to the learning and performance of language minority and disabled students (see Shavelson, in press). More realistically, the studies proposed in this project will shed light on some of the parameters of such a theory.
We now turn to descriptions of studies in each of the following areas: (a) conceptual underpinnings of the design of alternative, especially performance, assessments (Solano-Flores & Shavelson); (b) exchangeability of alternative methods of measuring performance, especially as they impact diverse populations (Solano-Flores, Ruiz-Primo, & Shavelson); and (c) concept-map representation of knowledge structure (Ruiz-Primo & Shavelson).

Study I: Validity of Conceptual Underpinnings For The Design Of Performance Assessments
We have developed a framework for conceptualizing science performance-assessment tasks and their corresponding scoring systems (Shavelson, in press). Four types of tasks that parallel investigations carried out by scientists have been identified so far: comparative (compare two or more objects along some dimension), component-identification (decompose a whole into its component parts), classification (classify a set of objects along a set of dimensions for a particular purpose), and observation (systematically observe and record data in time series) investigations (Shavelson, 1994; Solano-Flores et al., 1995). For each type of task, we have identified a corresponding scoring system. For example, procedure-based scoring is used with comparative investigations (Baxter et al., 1992) and evidence-based scoring is used with component-identification investigations (Shavelson et al., 1991). With this framework, we are learning how to structure performance assessment tasks and build scoring systems to go with them.
Background and Research Objectives. Although this framework holds promise for developing a performance-assessment technology, its cognitive foundations have not been systematically studied. What is needed, then, is a cognitive analysis linked to observed performance on the four types of performance tasks. Research conducted by Baxter and Glaser on cognitive analysis of science performance assessment shows that talk aloud techniques can be used fruitfully to validate cognitive interpretations of these assessments (e.g., Baxter & Elder, 1995; Baxter et al., 1994; Glaser et al., 1992). However, talk aloud techniques may potentially affect performance (e.g., Shavelson, Webb & Burstein, 1986).
The purpose of this study, then, is twofold: to examine the reasoning used by students as they take different types of science performance assessments and to see whether different methods of cognitive data collection provide consistent information on this reasoning.
Technical Approach. Fifth- to seventh-grade students with instruction in a specific domain of science (e.g., electricity) will be given one of four types of science performance assessments corresponding to that knowledge domain: for example, Incline Planes (comparative), Electric Mysteries (component identification), Sink and Float (classification), and Daytime Astronomy (observation). (For descriptions of these assessments, see Shavelson, in press; Shavelson, Baxter & Pine, 1991.) For each type of assessment, half the students (n=10) will perform the investigation individually, talking aloud as they complete it. Students' talk aloud will be tape recorded. The other half (n=20) will perform the assessment in pairs (npairs=10) and their verbal interaction will be tape recorded as they complete the investigation. The talk-aloud and verbal-interaction protocols will be analyzed to determine: (a) the kinds of reasoning used by the students across assessments (cf. Baxter et al., 1995), and (b) the kind of information on the students' reasoning obtained with each method of data collection.
Anticipated Impact. This study will provide evidence as to whether the four types of investigations defined by the conceptual framework effectively tap relevant, different higher-order scientific reasoning. It will also shed light on whether talk aloud and verbal interaction protocols provide comparable information about cognition. A better understanding of the cognitive processes underlying problem solving with hands-on science assessments will bear directly on the validity of the conceptual framework, enrich the process of developing and validating science performance assessments, and bring data to bear on proposed interpretations that performance assessments measure higher-order thinking.

Study II: Exchangeability of Alternative Methods of Measuring Performance
Previous research has found that science performance assessments are highly sensitive not only to the tasks sampled (Shavelson, Baxter & Gao, 1993), but also to the method used to measure performance: direct observation, notebooks, computer simulation, and paper-and-pencil (e.g., Baxter & Shavelson, 1994). Assuming observation of hands-on investigations is the benchmark, paper-and-pencil (i.e., multiple-choice and short-answer) methods are the least exchangeable for the benchmark (r < .30). In contrast, notebooks were found to be an adequate surrogate for the benchmark (r > .80). To our surprise, computer simulations fell in between notebooks and paper-and-pencil methods as surrogates (r = .45); we had expected them to be as good a surrogate for observation as notebooks. We have interpreted these findings to indicate that: (a) a fundamental difference between paper-and-pencil methods and the other methods is that paper-and-pencil assessments do not react to the actions taken by students (Shavelson, Baxter & Pine, 1992), and (b) students have incomplete knowledge and skills in the domain assessed, and the symbolic representations used to assess that performance may or may not tap that partial understanding. These results notwithstanding, many large-scale assessments rely on scores obtained using paper-and-pencil methods (e.g., short answer) not only for practical reasons, but also because the assessment developers assume that students' achievement does not depend on the method used to measure achievement. Finally, the exchangeability issue is also fundamental to the question of how to accommodate large-scale assessment to individual differences in students (e.g., language minority, special needs students): If different methods tap different aspects of understanding in a subject matter, how should assessments deal with accommodation (see Dalton et al., 1994)?
A. Exchangeability of Hands-on and Computer-Simulation Science Investigations
Background and Research Objectives. Unlike the paper-and-pencil methods, computer simulations have great potential as surrogates for direct observation in that they react to students' actions. Moreover, computer simulations are affordable and offer logistical advantages over direct observation and notebooks. Further exploration of this assessment method, then, seems warranted.
The moderate correlation between computer simulation and direct observation is intriguing, especially because students do not have problems interpreting the two-dimensional computer simulation as representing the three-dimensional real object (Ruiz-Primo, Solano-Flores, Brown, Druker, & Shavelson, 1994). At least two reasons, then, might account for the unexpectedly moderate correlations between direct observation and computer simulation, and between notebook and computer simulation: (a) students' performances are inherently unstable from one performance-assessment occasion to another, that is, unstable between the time the hands-on task is administered and the time the computer simulation is administered (see Ruiz-Primo, Baxter, & Shavelson, 1993); or (b) computer simulations pose different cognitive demands than hands-on methods of measurement.
The purpose of this study, then, is to test these competing explanations of performance differences: Are they due to instability of students' performances, to cognitive demands imposed by computer simulations, or to some combination?
Technical Approach. Two substudies are envisioned. In Study IIA, students will take the same assessment [either electric mysteries (EM) or bugs (B); see Shavelson et al., 1991] using two methods: hands-on and computer simulation. The assessments will be administered in two sequences (hands-on followed by computer simulation, and computer simulation followed by hands-on) and with two intervals between administrations (same day, two weeks). Variations between scores produced by the different versions and administration times will be examined statistically.
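The Study IIA design can be sketched as a small enumeration of its conditions. This is only an illustrative sketch: the identifier names are hypothetical, and it assumes that assessment (EM or B), administration sequence, and interval are fully crossed, which the description above does not state explicitly.

```python
from itertools import product

# Hypothetical labels; the text names these factors but not these identifiers.
ASSESSMENTS = ("EM", "B")
SEQUENCES = (
    ("hands-on", "computer simulation"),
    ("computer simulation", "hands-on"),
)
INTERVALS = ("same day", "two weeks")


def design_cells():
    """Return every assessment x sequence x interval combination.

    Assumes a fully crossed design, which is an assumption of this sketch.
    """
    return [
        {"assessment": a, "first": s[0], "second": s[1], "interval": i}
        for a, s, i in product(ASSESSMENTS, SEQUENCES, INTERVALS)
    ]


if __name__ == "__main__":
    for cell in design_cells():
        print(cell)
```

Under that assumption the design has eight cells; comparing scores across sequence separates method effects from order effects, and comparing across interval speaks to the stability of performance over occasions.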
In Study IIB, students will carry out the hands-on and computer simulation versions of EM and B while talking aloud as they conduct their investigations. Interview protocols and observations will be analyzed with respect to the solution strategy used in conducting the investigations and the reasoning underlying conclusions (e.g., systematic strategy or trial and error; cf. Baxter et al., 1995) and the cognitive demands imposed by each method (e.g., Do students understand the problem in the same way with both methods? Do explanations and reasoning change from one method to the other?).
Anticipated Impact. Results of these substudies will lead to a better understanding of the differences among the assessment methods and the implications of using specific methods in measuring students' achievement. Of particular importance are issues of cost-savings and accommodation of differences among students using alternative methods.
B. Searching for Accommodations
Background and Research Objectives. The concept of accommodation of individual differences in student performance assumes that alternative methods measure the same construct. The foregoing studies of exchangeability investigate part of this assumption. Even if measurement methods are not completely exchangeable (Baxter & Shavelson, 1994), they may measure important, overlapping aspects of the construct of interest. In that case, which method or combination of methods might be relied upon to evaluate performance for different individuals? Should the highest score be taken? Some combination of scores (e.g., discard the highest and lowest and take the average of the remainder)? How are individuals' performances affected by variation in exchangeability and scoring method? The appropriate design of testing policies relies heavily on (a) finding psychometrically justifiable score combinations, and (b) identifying patterns of performance associated with different students. This study explores these questions about accommodation.
Technical Approach. Data from 300 students from a previous investigation (e.g., Shavelson, Baxter, & Pine, 1992) will be reanalyzed. These data contain information about students' performance on different investigations ("Paper Towels", "Bugs", "Electric Mysteries"), across assessment methods (observation, notebook, computer simulation, multiple-choice and short-answer), as well as information on student characteristics (e.g., ethnicity, cognitive ability). The reanalyses will focus on: (a) patterns of performance across different measurement methods associated with individuals and groups (e.g., ethnicity) of students, (b) different combinations of scores reflecting students' performance levels in the context of accommodation, and (c) the psychometric properties of these new scores or profiles.
Anticipated Impact. Results of this study will provide a better understanding of the challenges posed by accommodation and possible ways to address these challenges.
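Two of the score-combination rules raised in this study (take the highest score, or discard the highest and lowest and average the remainder) can be sketched as follows. This is a minimal illustration, not the project's analysis: the function names are hypothetical, the scores are invented for the example, and it assumes all methods are scored on a common scale.

```python
from statistics import mean


def highest_score(scores):
    """Combination rule 1: take the best score across methods."""
    return max(scores)


def trimmed_average(scores):
    """Combination rule 2: drop one highest and one lowest score,
    then average the remainder."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to trim both ends")
    ordered = sorted(scores)
    return mean(ordered[1:-1])


# Invented scores for one hypothetical student, one per assessment method.
# Assumes every method is scored on the same scale.
student = {
    "observation": 6,
    "notebook": 5,
    "computer simulation": 4,
    "short-answer": 3,
    "multiple-choice": 2,
}

if __name__ == "__main__":
    values = list(student.values())
    print("highest:", highest_score(values))
    print("trimmed average:", trimmed_average(values))
```

The two rules can rank the same students differently, which is one reason the psychometric properties of each combined score need to be examined before a rule is adopted.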

Study III: Validity of Concept-Map Representations of Knowledge Structure
Alternative assessments are intended to provide evidence about what students know and can do in a subject matter. Performance assessments in science, for example, provide evidence especially about what students can do in carrying out investigations. Other assessment techniques, such as concept maps, are supposed to provide information on another aspect of science learning: development of knowledge structures, that is, representations of the interrelation of important science concepts in students' minds. Even though we know little about their psychometric properties, concept maps have been used in large-scale assessment for this purpose (e.g., Lomask et al., 1992).
Background and Research Objectives. Our review of the literature has revealed the use of a myriad of mapping techniques (e.g., Ruiz-Primo & Shavelson, in press), varying in the tasks presented to students, in the responses required from students, and in methods of scoring. We suspect that these variations may tap different aspects of students' cognitive structures even though they all are interpreted as representing the same structure. Before concept maps are formally used as assessment tools, we need to investigate whether they tap important aspects of students' knowledge in a subject domain. Preference for one or another technique should be based on the accuracy of their cognitive representations, psychometric qualities, and practicality.
The purpose of this study, then, is to investigate concept maps as tools for large-scale assessment of students' knowledge structures. Among the questions addressed are: Do concept maps actually provide evidence on students' propositional knowledge of a topic? Do maps provide reliable scores? Do different mapping techniques tap the same aspects of students' conceptual knowledge; that is, do concept maps varying in task and format produce exchangeable representations of structure?
Technical Approach. This study will examine the cognitive processes involved in and the exchangeability of mapping techniques. The selection of mapping techniques will be based on criteria developed in our review, including differences in the cognitive demands required by the task, the structure of content domain to be mapped and practicality of the methods used. Since the focus of the study is on large-scale assessment, techniques that require one-to-one interaction between student and tester will be excluded.
High school students will be asked to construct a map on, say, "atomic structure," after they have been taught that unit, under three different mapping conditions that vary in the flexibility afforded students in constructing the map: (a) construct a map on paper using circles as concepts and labeled lines connecting the concepts; (b) construct a map using "node" cards (i.e., cards with concept labels) so the student can physically move the concepts around until a satisfactory structure is reached and then draw the labeled lines between the "node" cards (cf. White & Gunstone, 1991); and (c) construct a map using a computer program that allows students to manipulate icons. A sample of students (n=30 in each condition) will be randomly assigned to each condition, and a random subsample (n=10) in each condition will be asked to talk aloud as they complete the task to reveal possible differences in the cognitive processes involved in constructing the map. The mapping techniques will be compared as to their psychometric (i.e., reliability and validity) and practical (e.g., training time, construction time, scoring time) properties.
Anticipated Impact. This study will provide information on the practical and psychometric characteristics of different mapping techniques. It will recommend promising concept-mapping techniques for large-scale assessment, if any.

List of publications & presentations acknowledging award:

  • Baxter, G.P., Elder, A.D., & Glaser, R. (1995). Cognitive analysis of a science performance assessment. CSE Technical Report 398. Los Angeles, CA: UCLA-CRESST.
  • Baxter, G.P., Glaser, R., & Raghavan, K. (1994). Analysis of cognitive demand in selected alternative science assessments. CSE Technical Report 382. Los Angeles, CA: UCLA-CRESST.
  • Baxter, G.P., & Shavelson, R.J. (1994). Science performance assessments: benchmarks and surrogates. International Journal of Educational Research, 21, 267-297.
  • Cronbach, L.J., Linn, R., Brennan, R., & Haertel, E. (1995 summer). Generalizability analysis for educational assessments. Evaluation Comment, 1-29.
  • Dalton, B., Morocco, C.C., Tivnan, T., & Rawson, P. (1994). Effect of format on learning disabled and non-learning disabled students' performance on a hands-on science assessment. International Journal of Educational Research, 21, 299-315.
  • Embretson, S.E. (1995). A measurement model for linking individual learning to processes and knowledge: Application to mathematical reasoning. Journal of Educational Measurement, 32, 277-294.
  • Glaser, R., & Bassok, M. (1989). Learning theory and the study of instruction. Annual Review of Psychology, 40, 631-666.
  • Glaser, R., Raghavan, K. & Baxter, G.P. (1992). Cognitive theory as the basis for design of innovative assessment: Design characteristics of science assessments. CSE Technical Report 349. Los Angeles, CA: UCLA-CRESST.
  • Glaser, R., & Silver, E. (1994). Assessment, testing, and instruction: Retrospect and prospect. CSE Technical Report 379. Los Angeles, CA: UCLA-CRESST.
  • Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.
  • Ruiz-Primo, M.A., & Shavelson, R.J. (in press). Concept maps as potential alternative assessments in science. Journal of Research in Science Teaching.
  • Ruiz-Primo, Solano-Flores, Brown, Druker, & Shavelson, 1994.
  • Shavelson, R.J., Baxter, G.P., & Gao, X. (1993). Sampling variability of performance assessment. Journal of Educational Measurement, 30, 215-232.
  • Shavelson, R.J., Baxter, G.P., & Pine, J. (1991). Performance assessment in science. Applied Measurement in Education, 4, 347-362.
  • Shavelson, R.J., Webb, N.M., & Burstein, L. (1986). The measurement of teaching. In M. Wittrock (Ed.), Handbook of Research on Teaching. New York: Macmillan.
  • Solano-Flores, G., Ruiz-Primo, M.A., Baxter, G.P., & Shavelson, R.J. (in preparation). Science performance assessments with language minority students. Stanford, CA: Stanford University School of Education.
  • White, R., & Gunstone, R. (1991). Probing understanding. New York: Falmer Press.

Construct Validity of Problem Solving Assessment

Award number: UCLA 0070-G-9H813
Period of support: 6/15/1996 - 6/30/2001
Amount of award: $318,987

Project summary: This project seeks to capitalize on our prior research showing that psychologically meaningful and useful subscores can be obtained from conventional achievement tests designed for large-scale educational surveys. The prior research analyzed the NELS:88 math and science tests at 8th, 10th, and 12th grade levels, using both statistical analyses of data from the national sample and small-scale interview studies of local high school students to obtain think-aloud protocols associated with task performance. The results show that these subscores: 1) represent important ability distinctions in high school mathematics and science achievement; 2) show significantly different patterns of relationships with instructional, course-taking, and educational program variables as well as gender, ethnicity, and other student background variables; and 3) are derivable from multiple-choice tests not designed for this purpose. This in turn suggests a new approach to test design and validation in which content x process distinctions in test specification tables are subjected to multivariate statistical and intensive cognitive analysis and redrawn to identify component ability constructs explicitly. For details on this work, see Kupermintz, Ennis, Hamilton, Talbert, & Snow (1995); Hamilton, Nussbaum, Kupermintz, Kerkhoven, & Snow (1995); Kupermintz & Snow (submitted); Nussbaum, Hamilton, & Snow (submitted); Hamilton, Kupermintz, & Snow (submitted); and also Hamilton, Nussbaum, & Snow (in press).
Background and Research Objectives. The next step in this research is to examine the construct validity of such distinctions in an expanded array of both multiple-choice and constructed response tasks. Of particular interest are performance assessment tasks explicitly designed to assess different kinds of complex knowledge and problem solving in large-scale survey tests such as those used in NAEP and various state assessments. For example, our previous studies distinguished reasoning and knowledge subscores in the NELS:88 math tests and knowledge and reasoning subscores in the NELS:88 science tests. The results suggest that these or other constructs might be distinguishable in tasks of the sorts now being developed for new performance assessments and that performance tasks might be explicitly designed to sharpen and improve such distinctions. Furthermore, this possibility might exist in other subject-matter fields, such as history and geography.
These prior results also show that instructional variables such as teacher emphasis on understanding and higher-order thinking relate more to math knowledge development than to math reasoning, and that the source of gender differences in science may be located in the spatial-mechanical reasoning dimension of science problem-solving tasks, rather than in other aspects of science achievement. But we do not yet understand the properties of math and science assessment tasks that underlie such correlations. Construct validation research must seek an understanding of such properties in the design and evaluation of new as well as old forms of performance assessment. Furthermore, such research may provide clues for targeted instructional improvements. For example, if some female students typically fail science assessment tasks requiring spatial-mechanical reasoning, understanding the properties of such tasks may suggest how instruction can be revised to improve the preparation of such students.
Other previous results indicate differential relations between cognitive achievement subscores and measures of affective and conative variables such as student motivation, anxiety, learning style, and self-regulation. Thus, a new multidimensional approach to achievement test validation should include affective and conative as well as cognitive reference constructs. Previous work in our project has provided a catalogue of such constructs relevant to research on instruction (see Snow & Jackson, 1994; Snow, Corno, & Jackson, in press). But so far there have been no systematic attempts to examine affective and conative aspects of new forms of educational assessment.
In brief, then, the primary objective of this study is to determine if knowledge and ability distinctions previously found important in high school math and science achievement tests occur also in other multiple-choice and constructed response assessments, particularly in those used in large-scale educational surveys. A second objective is to examine alternative assessment designs that would sharpen and elaborate such knowledge and ability distinctions in such fields as math, science, and history-geography, and suggest instructional intervention strategies related to them.
Technical Approach. Our procedure builds on the previous large-scale statistical analyses of multiple-choice and constructed response tasks from NELS:88. We would extend our local interview procedure to examine not only constructed response tasks chosen from NAEP, but also similar tasks specifically designed to afford use of different cognitive structures and processes to bring out particular knowledge and ability distinctions. We would also administer cognitive tests and conative-affective questionnaires to groups of local high school students, to select students who show different cognitive-conative-affective aptitude profiles, and thus to obtain interview think-aloud protocols on the selected tasks from students with known aptitude profiles. Our interview procedures are designed to obtain in-depth verbal descriptions of student thinking as they work through constructed response tasks. These procedures are described in Ennis, Kerkhoven, and Snow (1993) and Hamilton, Nussbaum, and Snow (in press). Small-scale instructional interventions will be designed as miniature experiments or case studies. Statistical procedures for large-scale analysis rely on routine correlational and regression methods as well as new methods for item factor analysis provided by Bock (see Bock, Gibbons, & Muraki, 1988).
Time Line. Year 1: June 1996-May 1997. Complete review and analysis of NELS:88 math, science and history survey data on knowledge and ability distinctions and their cognitive and affective correlates. Review existing NAEP and other large-scale assessment instruments to identify tasks in which these and other such distinctions might occur. Plan experimental redesign of chosen tasks. Year 2: June 1997-May 1998. Plan and conduct local high school study of assessment instruments along with test and questionnaire measures of cognitive and affective reference constructs. Select subsamples with different aptitude and achievement profiles and conduct interview study of knowledge and reasoning contrast. Year 3: June 1998-May 1999. Redesign tasks to sharpen contrasts. Replicate Year 2 study adding redesigned tasks and refinement of reference measures. Again conduct subsample interview study. Years 4-5: June 1999-May 2001. Summarize validation evidence and implications for review by local teachers. Conduct survey and interview study of local teachers to explore links between knowledge and ability distinctions in assessment tasks and instructional tasks and practices. Complete major monograph on multidimensional knowledge and ability components in achievement assessment, the redesign of assessment tasks to represent them validly in large-scale surveys, and their use in targeting and evaluating local instruction in high school math and science.
Anticipated Impact. This study is expected to influence the way achievement tests, especially large-scale assessments, are designed and validated. It provides the concepts and methodological tools that will better align interpretations of achievement test scores with the evidence of what they are measuring.

List of publications & presentations acknowledging award:

  • Bock, D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261-280.
  • Ennis, M., Kerkhoven, J.I.M., & Snow, R.E. (1993). Enhancing the validity and usefulness of large-scale educational assessments (Report No. P93-151). Stanford University, Center for Research on the Context of Secondary School Teaching.
  • Kupermintz, H., Ennis, M.N., Hamilton, L.S., Talbert, J.E., & Snow, R.E. (1995). Enhancing the validity and usefulness of large-scale educational assessments: I. NELS:88 mathematics achievement. American Educational Research Journal, 31, 525-554.
  • Hamilton, L.S., Nussbaum, E.M., Kupermintz, H., Kerkhoven, J.I.M., & Snow, R.E. (1995). Enhancing the validity and usefulness of large-scale educational assessments: II. NELS:88 science achievement. American Educational Research Journal, 31, 555-581.
  • Kupermintz, H., & Snow, R.E. (submitted). Enhancing the validity and usefulness of large-scale educational assessments: III. NELS:88 mathematics achievement to twelfth grade.
  • Nussbaum, E.M., Hamilton, L.S., & Snow, R.E. (submitted). Enhancing the validity and usefulness of large-scale educational assessments: IV. NELS:88 science achievement to twelfth grade.
  • Hamilton, L.S., Kupermintz, H., & Snow, R.E. (submitted). Enhancing the validity and usefulness of large-scale educational assessments: V. NELS:88 mathematics and science achievement and affective interrelationships.
  • Hamilton, L.S., Nussbaum, E.M., & Snow, R.E. (submitted). Interview procedures for validating science assessments.
  • Snow, R.E., & Jackson, D.N. III (1994). Individual differences in conation: Selected constructs and measures. In H.F. O'Neill, Jr. and M. Drillings (Eds.), Motivation: Research and Theory (pp. 72-99). Hillsdale, N.J.: Lawrence Erlbaum Associates.
  • Snow, R.E., Corno, L., & Jackson, D.N. III (1995). Individual differences in affective and conative functions. In D.C. Berliner & R.C. Calfee (Eds.), Handbook of Educational Psychology. New York: Macmillan.

On Enhancing Teachers' Formative Assessment Practices: The Case for Science Journals
Award number: NSF ESI-9910020
Period of support: 10/1/1999 - 9/30/2001
Amount of award: $100,000

Project summary: The purpose of this project is to explore the use of students' science journals as a staff development tool to improve formative assessment practices in science classrooms at the elementary level by: 1) creating a conceptual and practical framework that can guide effective use of journals by teachers and students; 2) creating a pilot teacher enhancement program (Ruiz-Primo, 1994) that, by using science journals as an example of a classroom assessment, helps teachers reflect on their formative assessment practices and provides them with a framework that can be used to improve this practice; and 3) exploring the impact of the implementation of the program on the teachers' formative assessment practices and on student performance.

Two products will result from this project: a preliminary conceptual framework on the use of journals as a formative assessment tool and a pilot teacher enhancement program (TEP) for improving teachers' formative assessment practices. The next step in this project will be the revision, implementation, evaluation, and dissemination of the TEP in a larger context, as well as the exploration of science journals as an assessment tool for self- and peer-evaluation.

List of publications & presentations acknowledging award:

  • Li, M., Ruiz-Primo, M.A., Ayala, C.C., & Shavelson, R.J. (2000, April). Study of the reliability and validity of inferring students' understanding from their science journals. Paper presented at the American Educational Research Association (AERA) Annual Meeting, New Orleans.
  • Ruiz-Primo, M.A., Li, M., Ayala, C.C., & Shavelson, R.J. (2000, April). Students' science journals as an assessment tool. Paper presented at the American Educational Research Association (AERA) Annual Meeting, New Orleans.


Multidimensional Student Assessments for High School Mathematics and Science
Award number: NSF REC-9628293
Period of support: 8/15/1996 - 7/31/2001
Amount of award: $786,433

Project summary: This project develops a multidimensional construct validation approach to the design and analysis of student achievement assessments in high school mathematics and science. The general aim is to elaborate the indicator measures of student achievement and thus to enhance the validity and usefulness of large-scale educational surveys. Specific objectives are to: 1) identify knowledge and reasoning components of the math and science tests, both multiple-choice and constructed response, used in the NELS:88 High School Effects Study (HSES) and relate these to knowledge and reasoning components previously distinguished in the NELS:88 regular national sample; 2) analyze the HSES constructed response tasks in relation to student gender and ethnic differences, as well as other student background and instructional variables; 3) extend the construct validation approach to selected items from the National Assessment of Educational Progress (NAEP); and 4) develop improved guidelines for other investigators using NELS:88 data to apply multidimensional scoring of student tests.

The results of this study should help develop improved indicators of student achievement in science and mathematics by distinguishing educationally and psychologically important components of total scores and showing their differential relation to student gender, ethnicity, other student background variables, and course-taking and instructional differences. They should also help develop the methodology of item construct validation for the improvement of achievement assessments in future national and international surveys.

List of publications & presentations acknowledging award:

  • Hamilton, L.S., Nussbaum, E.M., & Snow, R.E. (1997). Interview procedures for validating science assessments. Applied Measurement in Education, 10, 181-200.
  • Kupermintz, H., & Snow, R.E. (1997). Enhancing the validity and usefulness of large-scale educational assessments: III. NELS:88 mathematics achievement to 12th grade. American Educational Research Journal, 34, 124-150.
  • Nussbaum, M., Hamilton, L.S., & Snow, R.E. (1997). Enhancing the validity and usefulness of large-scale educational assessments: IV. NELS:88 science achievement to 12th grade. American Educational Research Journal, 34(1), 151-173.
  • Hamilton, L.S. (1997). Identifying differential item functioning on science achievement tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.