Educational Assessment and Evaluation – Formative and Summative Methods

Youssef Khoury
Definition and Core Concept
This article defines Educational Assessment as the systematic process of gathering, interpreting, and using information about student learning to make decisions about instruction, grading, progression, and programme effectiveness. Evaluation refers to the broader judgement of educational programmes, schools, or systems. Core distinctions: (1) formative assessment (ongoing, low-stakes, used to adjust teaching); (2) summative assessment (end-of-unit or end-of-year, high-stakes, measures achievement); (3) diagnostic assessment (pre-instruction, identifies prior knowledge); (4) norm-referenced interpretation (comparing a student with peers) versus criterion-referenced interpretation (comparing a student against fixed standards). The article addresses: stated objectives of educational assessment; key concepts including reliability, validity, fairness, and washback; core mechanisms such as test design, item analysis, and standard setting; international comparisons and debated issues (standardised testing pressure, grade inflation, authentic assessment); a summary and emerging trends (computerised adaptive testing, portfolio assessment, AI scoring); and a Q&A section.
1. Specific Aims of This Article
This article describes educational assessment without endorsing any particular testing regime. Commonly cited objectives include measuring student learning to inform instruction, certifying competence for progression or graduation, evaluating teacher and school effectiveness, and providing accountability data to stakeholders. Assessment practices vary widely and remain contested because of their impacts on student motivation, curriculum breadth, and equity.
2. Foundational Conceptual Explanations
Key terminology:
- Reliability: Consistency of measurement. A reliable test yields similar results across repeated administrations (test-retest), across different forms (parallel forms), or across raters (inter-rater). Reported as a correlation coefficient (0.00-1.00); reliability ≥0.80 is generally expected for high-stakes decisions (a computational sketch follows this list).
- Validity: Degree to which a test measures what it claims to measure. Types: content validity (covers appropriate domain), criterion validity (correlates with relevant outcomes), construct validity (measures theoretical construct). Validity is not a property of the test but of the interpretation.
- Formative assessment: Assessment for learning. Examples: exit tickets, classroom polling, feedback on drafts. Effect sizes of d=0.4-0.7 reported in reviews (Black & Wiliam, 1998).
- Summative assessment: Assessment of learning. Examples: final exams, standardised tests.
- Authentic assessment: Tasks that mirror real-world applications (e.g., science investigation, historical document analysis, project presentation).
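
To make the reliability coefficient concrete, here is a minimal sketch of a test-retest computation. All scores are hypothetical and numpy is assumed to be available; this illustrates the idea rather than any standard assessment package.

```python
import numpy as np

# Hypothetical scores for 8 students on two administrations of the same test.
first = np.array([72, 85, 90, 65, 78, 88, 70, 95])
second = np.array([74, 83, 92, 68, 75, 90, 72, 93])

# Test-retest reliability: Pearson correlation between the two administrations.
r = np.corrcoef(first, second)[0, 1]
print(f"test-retest reliability r = {r:.2f}")  # >= 0.80 is generally expected
                                               # for high-stakes decisions
```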
Historical context: Standardised testing expanded with compulsory schooling (early 20th century). SAT introduced 1926. No Child Left Behind Act (US, 2001) mandated annual testing. International large-scale assessments: PISA (2000), TIMSS (1995).
3. Core Mechanisms and In-Depth Elaboration
Test design principles:
- Blueprint (test specifications): Table aligning items with content topics and cognitive levels (Bloom’s taxonomy).
- Item types: Selected response (multiple choice, true/false, matching) is efficient to administer and score but tends to measure recall; constructed response (short answer, essay) can measure deeper thinking, but scoring is less reliable.
- Item analysis: Difficulty (proportion correct, ideally 0.3-0.8) and discrimination (item-total correlation, ideally >0.3).
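
As an illustration of item analysis, the sketch below computes difficulty and an item-rest discrimination index for a small hypothetical response matrix (the item-rest variant, which excludes the item from the total, is one common choice; all data are invented).

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = students, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
])
total = responses.sum(axis=1)

for j in range(responses.shape[1]):
    difficulty = responses[:, j].mean()  # proportion of students answering correctly
    # Discrimination: correlation of the item with the rest of the test,
    # excluding the item itself to avoid part-whole inflation.
    rest = total - responses[:, j]
    discrimination = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j + 1}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```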
Standard setting methods:
- Angoff method: Experts estimate probability that minimally competent candidate answers each item correctly.
- Bookmark method: Experts work through a booklet of items ordered by difficulty and place a bookmark at the point where a borderline student would stop succeeding.
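
A minimal worked example of the Angoff computation, with hypothetical expert estimates: each expert's item probabilities are summed, and the cut score is the average across experts.

```python
# Each row: one expert's estimated probabilities that a minimally competent
# candidate answers items 1-5 correctly. All numbers are hypothetical.
expert_estimates = [
    [0.7, 0.5, 0.8, 0.6, 0.4],   # expert 1
    [0.6, 0.5, 0.9, 0.5, 0.5],   # expert 2
    [0.8, 0.4, 0.7, 0.6, 0.4],   # expert 3
]

per_expert_cuts = [sum(est) for est in expert_estimates]      # sum over items
cut_score = sum(per_expert_cuts) / len(per_expert_cuts)      # average over experts
print(f"recommended cut score: {cut_score:.1f} of 5 items")
```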
Effectiveness evidence:
- Formative assessment: Research syntheses (Black & Wiliam, 1998; Wiliam, 2011) report effects on achievement of d=0.4-0.7, roughly equivalent to 2-4 months of additional learning.
- Standardised testing impact on learning: Mixed. Accountability testing shows small positive correlations with achievement gains (r=0.1-0.2), but curriculum narrowing is documented, and there is no causal evidence that testing alone improves outcomes.
4. Comprehensive Overview and Objective Discussion
Major testing programmes:
| Programme | Administering body | Age/grade | Purpose | Frequency |
|---|---|---|---|---|
| PISA | OECD | 15 years | Cross-national comparison | Every 3 years |
| TIMSS | IEA | Grade 4, 8 | Maths/science trends | Every 4 years |
| NAEP | US Dept of Education | Grades 4,8,12 | National assessment | Every 2-4 years |
| National College Entrance Exam (Gaokao) | Ministry of Education (China) | Grade 12 | University admission | Annual |
Debated issues:
- Standardised testing pressure: High-stakes tests are associated with teaching to the test, cheating incidents, and a narrowed curriculum (reduced arts, social studies, recess). Some studies report reduced student wellbeing, but the evidence is mixed.
- Grade inflation: Rising average grades without a corresponding rise in achievement. Mean US high school GPA increased from 2.9 (1990) to 3.2 (2019) while NAEP scores remained flat. Causes: pressure for college admission and teacher leniency.
- Authentic assessment challenges: Portfolios and projects have higher validity for complex skills but lower inter-rater reliability (r=0.6-0.7 vs 0.9 for multiple choice) and are costlier to score.
5. Summary and Future Trajectories
Summary: Educational assessment includes formative (ongoing, lower-stakes) and summative (end-of-period, higher-stakes) methods. Reliability and validity are the essential quality criteria. Standardised testing supports accountability but may narrow the curriculum. Authentic assessment measures deeper skills at the cost of reliability.
Emerging trends:
- Computerised adaptive testing (CAT): An algorithm selects items matched to the student's estimated ability level, yielding shorter tests and more precise scores. Used in the GRE and some state assessments (a minimal selection loop is sketched after this list).
- AI scoring of constructed responses: Natural language processing applied to essays. Agreement with human scorers is comparable (r=0.7-0.8), but bias concerns remain.
- Learning analytics: Real-time dashboards built on clickstream data from digital learning platforms; raises privacy concerns.
- Assessment of broader learning competencies (e.g., collaboration, creativity): Frameworks are emerging, but reliability remains low.
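
As referenced in the CAT item above, the sketch below shows one highly simplified adaptive loop under a Rasch model. The item bank, the fixed-step ability update, and the simulated responses are all illustrative assumptions; operational CATs use maximum-likelihood or Bayesian ability estimation plus item-exposure controls.

```python
import math
import random

# Minimal CAT sketch under a Rasch model; everything here is hypothetical.
random.seed(0)
item_bank = {"q1": -1.0, "q2": -0.3, "q3": 0.0, "q4": 0.6, "q5": 1.2}
theta = 0.0  # current ability estimate

for _ in range(3):
    # Select the unused item whose difficulty is closest to the ability estimate.
    item, b = min(item_bank.items(), key=lambda kv: abs(kv[1] - theta))
    p_correct = 1 / (1 + math.exp(-(theta - b)))  # Rasch success probability
    correct = random.random() < p_correct          # simulated student response
    theta += 0.5 if correct else -0.5              # crude fixed-step update
    del item_bank[item]
    print(f"administered {item} (b={b:+.1f}), correct={correct}, theta={theta:+.2f}")
```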
6. Question-and-Answer Session
Q1: Does frequent testing improve learning?
A: Yes. The testing effect (retrieval practice) improves long-term retention, and frequent low-stakes quizzes (e.g., weekly) outperform fewer high-stakes exams; effect sizes are around d=0.5.
Q2: What is the optimal class size for assessment reliability?
A: Class size is not directly relevant. For reliable grading of constructed responses, a single reader using a well-specified rubric can be sufficient; a second reader improves reliability (from roughly r=0.7 to 0.85) but increases cost.
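
The quoted gain from a second reader is consistent with the Spearman-Brown prophecy formula from classical test theory; a quick check using the figures in the answer above:

```python
# Spearman-Brown prophecy formula: reliability of a score averaged over k
# raters, given single-rater reliability r.
def spearman_brown(r: float, k: int) -> float:
    return k * r / (1 + (k - 1) * r)

print(f"{spearman_brown(0.7, 2):.2f}")  # ~0.82, in line with the cited 0.85
```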
Q3: Can parents opt children out of standardised tests?
A: In some US states, yes (parental opt-out provisions). Consequences vary: the school may still be penalised for low participation, and the student may receive no score. Most countries do not permit opt-out.
Q4: How are tests adapted for students with disabilities?
A: Accommodations include extended time, readers, scribes, a separate setting, and braille. Research indicates accommodations improve scores for students with disabilities (d=0.3-0.5) without artificially inflating scores for students without disabilities.
https://www.ets.org/ (Educational Testing Service)
https://www.nciea.org/ (National Center for the Improvement of Educational Assessment)
https://www.oecd.org/pisa/ (OECD PISA)
https://www.iaea.info/ (International Association for Educational Assessment)
https://www.edglossary.org/assessment/ (The Glossary of Education Reform: assessment)
