Student assessment


When done well, student assessments can be a meaningful and rigorous tool for aligning educational stakeholders toward the shared purpose of cultivating student learning. (For example, see Pritchett, 2013, Chapter 6, on the role of assessment information in performance-pressured starfish systems.)[1] Unfortunately, it is much easier to design a bad test than a good one.[2]

Bad tests and their negative effects

Bad tests lead to bad teaching and learning practices that emphasise showing over knowing. The high-stakes exit and other late-stage examinations that many countries prioritise are often bad tests: misaligned with the skills and knowledge that they aim to measure, focused on memorisation rather than practicable mastery, or irrelevant to wellbeing and success in adulthood.

Studies showing that badly designed tests skew the teaching and learning process include:

  • Aiyar et al. (forthcoming) on how a Teaching at the Right Level-inspired intervention in Delhi schools was hindered by stakeholder emphasis on overambitious exams
  • Burdett (2017) on examination materials in Uganda, Nigeria, India, and Pakistan that focus on recall of facts over higher-order skills[3]

Bad tests and cheating

When bad tests combine with accountability pressures, cheating is a common result.

This was documented rigorously in two RISE working papers, as well as a paper presented in the RISE Online Presentation Series. As summarised in a recent blog:[4]

  • A difference-in-differences analysis of a system-level, gradually phased-in shift from paper-based to computer-based testing in Indonesia found response patterns strongly indicating cheating in the paper-based tests (in a RISE working paper by Emilie Berkhout and co-authors).[5]
  • A comparison of student performance between an annual standardised test and an independent re-test in the Indian state of Madhya Pradesh found evidence of cheating by both students and teachers, which was confirmed in interviews and classroom observations (in a RISE working paper by Abhijeet Singh).[6]
  • An experimental evaluation of tablet-based testing versus paper-based testing in the Indian state of Andhra Pradesh, alongside a re-test in a subset of the schools, found significant indications of manipulation in the paper-based tests (also in Singh, 2020).[6]
  • A comparison of state-level average results from three countrywide assessments in India concluded that test scores from the official standardised achievement test were likely to be artificially high, and were less reliable than the sample-based tests that had been conducted by a civil society organisation and by household survey enumerators (in a RISE Online Presentation Series paper by Doug Johnson and Andrés Parrado).

Numerous other empirical analyses have documented test score manipulation. Besides the four studies listed above, statistical analyses of assessment data have found answering patterns indicating test score manipulation in Sweden (Diamond and Persson, 2016), the US (Dee et al., 2019; Jacob and Levitt, 2003), southern Italy (Angrist, Battistin, and Vuri, 2017), Mexico (Martinelli et al., 2018), and on a regional assessment of southern and eastern African countries (Gustafsson and Nuga Deliwe, 2017). Other research methods have documented further types of test score manipulation, such as concentrating instructional attention or resources on children whose test scores are near accountability thresholds (e.g. Booher-Jennings, 2005, on Texas), inflationary leniency in grading (e.g. Hinnerich and Vlachos, 2017, on Sweden), and outright cheating (e.g. Buckner and Hodges, 2016, on Jordan and Morocco; Patrick et al., 2018, on Atlanta).
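
To make the idea of such answer-pattern screens concrete, the minimal Python sketch below flags classrooms in which an implausibly large share of student pairs submitted identical answer strings on a multiple-choice test. It is a simplified, hypothetical illustration in the spirit of this literature (e.g. Jacob and Levitt, 2003), not the method of any study cited above; the data, function names, and the 10% threshold are all invented.

```python
import itertools
from collections import defaultdict

def pairwise_identical_share(answer_strings):
    """Share of student pairs in a classroom with identical answer strings."""
    pairs = list(itertools.combinations(answer_strings, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

def flag_suspicious_classrooms(records, threshold=0.10):
    """records: iterable of (classroom_id, answer_string) pairs.
    Returns classrooms where the share of identical answer-string pairs
    exceeds the (purely illustrative) threshold, as candidates for audit."""
    by_class = defaultdict(list)
    for classroom_id, answers in records:
        by_class[classroom_id].append(answers)
    return {
        cid: share
        for cid, strings in by_class.items()
        if (share := pairwise_identical_share(strings)) > threshold
    }

# Toy data: each string holds one student's answers to five multiple-choice items.
records = [
    ("class A", "ACBDA"), ("class A", "BCBDA"), ("class A", "ACBCA"),
    ("class B", "DDBCA"), ("class B", "DDBCA"), ("class B", "DDBCA"),
]
print(flag_suspicious_classrooms(records))  # {'class B': 1.0}
```

Real screens are considerably more sophisticated, conditioning on student ability, item difficulty, and chance agreement, and statistical flags are typically verified through independent re-tests, as in the Madhya Pradesh and Andhra Pradesh studies above.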

Designing good tests

Designing good tests is a complicated undertaking, especially for standardised, systemwide benchmarking of learning levels. An illustrative (and far from exhaustive) set of design considerations includes:

Good tests must be well-calibrated. This is particularly important in low-performing education systems where the grade-level curriculum (and, thus, the typical standardised test) is far above the average student's learning level. Under such circumstances, test results will show significant floor effects and will not give meaningful data on students' progress in developing crucial foundational skills. Besides being calibrated to students' learning levels, good tests also need to be calibrated to match national curricular standards. Such coherence between curricular standards and assessments cannot be taken for granted: significant mismatches between the content and skills emphasised in curriculum standards, examinations, and instruction have been documented in Uganda and Tanzania.[7]
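
To illustrate why floor effects make miscalibrated tests uninformative, the short Python simulation below draws student abilities well below the difficulty of a grade-level test and compares score distributions under a simple Rasch-style response model with a guessing floor. All parameters are invented for illustration; this is a sketch of the general phenomenon, not a model of any particular assessment.

```python
import math
import random

def p_correct(ability, difficulty, guess=0.25):
    """Probability of a correct answer under a Rasch-style model
    with a guessing floor (e.g. four-option multiple choice)."""
    return guess + (1 - guess) / (1 + math.exp(-(ability - difficulty)))

def simulate_scores(abilities, item_difficulties, seed=0):
    """Simulated raw scores (number of items correct) for each student."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < p_correct(a, d) for d in item_difficulties)
        for a in abilities
    ]

rng = random.Random(1)
abilities = [rng.gauss(-2.0, 1.0) for _ in range(1000)]  # low average learning levels

grade_level_test = [1.5] * 20                          # items far above most students
calibrated_test = [-3.5 + 0.2 * i for i in range(20)]  # items spanning students' range

for name, items in [("grade-level test", grade_level_test),
                    ("calibrated test", calibrated_test)]:
    scores = simulate_scores(abilities, items)
    near_floor = sum(s <= 8 for s in scores) / len(scores)  # chance is ~5/20
    print(f"{name}: mean score = {sum(scores) / len(scores):.1f}/20, "
          f"share near guessing floor = {near_floor:.0%}")
```

On the mismatched test, most students score close to the level of random guessing, so differences between scores mostly reflect luck rather than learning; the calibrated test spreads students out and can track their progress in foundational skills.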

Good tests should also be administered at appropriate points in a child's educational career. For example, given the great value of early mastery of foundational literacy and numeracy, which enables children to access subsequent learning with greater ease and growing independence,[8] Bruns (2017) and others have recommended an international assessment of students' foundational skills at around age 9, both to hold education systems accountable and to allow for catch-up instruction early rather than late.[9]

Another design consideration is the allocation of roles, responsibilities, and use cases across different levels of the education system. Gustafsson (2019) details the technical challenges that states face in developing reliable national assessments, emphasising that although assessments depend on state authority for rigorous implementation, these same assessments are necessary for other accountability relationships within the education system (e.g. for parents to hold schools accountable at the local level).[10] Additionally, while robust national assessment systems are pivotal, regional and international assessments can also play an important role in cross-country benchmarking and accountability.[11]

Note: add in a discussion of thin and thick indicators in testing, e.g. performance assessments/portfolios rather than typical tests (e.g. Darling-Hammond, Wilhoit & Pittenger (2014) on “assessment systems”), and the trade-offs in the quality of information vs resource intensity vs pedagogical value for children taking the test

Note: Once OxREP piece is out, add in the discussion of student assessments needing to be valid, reliable, and appropriately calibrated

Formative assessment

Student assessments can be categorised based on when they occur in the learning process:

  • diagnostic assessments gauge students' initial knowledge at the start of a course of study (which can be as short as a textbook unit or as long as a phase of schooling, e.g. lower secondary)
  • formative assessments monitor how well students are mastering the content during the course of study
  • summative assessments evaluate students' mastery of the content at the end of the course of study.

As noted in a RISE Insight Note (emphasis added),

Formative assessments of foundational skills are a powerful tool for aligning instruction with children’s learning levels on an ongoing basis: they provide feedback on children’s learning gains and enable continuous recalibration of instruction to children’s current learning levels. For example, in the TaRL approaches described in Section III, formative assessments are used to continually assess children’s learning progress and to determine when to move on to new skills. Additionally, formative assessment can provide an effective form of retrieval practice for students (i.e., actively recalling previously learned knowledge) which can improve the retention and flexibility of prior knowledge (Dunlosky et al., 2013). That said, while formative assessments are desirable, they do not appear in all ALIGNS approaches. Conducting accurate and efficient formative assessments can be a technically demanding task, and equipping teachers at scale to conduct such assessments can be resource-intensive, meaning such assessments may not be feasible or appropriate in all contexts.[12]
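
As a concrete illustration of the kind of decision rule such formative assessment can support, the hypothetical Python sketch below moves a child to the next skill level only after they meet an accuracy threshold on two consecutive checks. The skill levels, threshold, and function names are invented for illustration, not drawn from any specific TaRL programme.

```python
# Illustrative skill progression, loosely modelled on TaRL-style reading groups.
LEVELS = ["letters", "words", "paragraphs", "story"]
MASTERY_THRESHOLD = 0.8  # hypothetical: 80% of items correct
CONSECUTIVE_PASSES = 2   # hypothetical: on two checks in a row

def update_level(child, accuracy):
    """Record one formative check; promote the child one level once they
    meet the threshold on the required number of consecutive checks."""
    child["passes"] = child["passes"] + 1 if accuracy >= MASTERY_THRESHOLD else 0
    if child["passes"] >= CONSECUTIVE_PASSES and child["level"] < len(LEVELS) - 1:
        child["level"] += 1
        child["passes"] = 0
    return child

child = {"name": "Asha", "level": 0, "passes": 0}
for accuracy in [0.6, 0.85, 0.9, 0.75, 0.8, 0.95]:
    child = update_level(child, accuracy)
    print(f"check at {accuracy:.0%} correct -> teach at level: {LEVELS[child['level']]}")
```

The point of such a rule is not the particular numbers but the feedback loop: each assessment directly updates what the child is taught next, which is what makes the assessment formative rather than summative.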

The considerations discussed above about coherence of information systems with implementation contexts also apply to formative assessment. As noted, in low-capability settings, teachers may need extensive support in learning how to design and administer formative assessment. For example, Sinha, Banerji & Wadhwa (2016) found that large proportions of teachers in Bihar could neither identify students’ mistakes, nor explain how to correct those mistakes, nor formulate appropriate and accurate assessment items.[13] That said, teachers need to be empowered not only with technical capacity in formative assessment, but also with the discretion to reorient classroom lessons in response to assessment results rather than moving in lockstep with the curriculum.

Additionally, interventions to promote formative assessment need to account for local actors' perceptions and motivations, as discussed for information systems as a whole. For example, a randomised evaluation by J-PAL of a formative assessment programme in Haryana, India found no improvements in student test scores, even though the intervention included both teacher training and the formal incorporation of assessment and monitoring practices into teachers’ responsibilities. The authors posit that this lack of impact may be because teachers saw the programme as an additional administrative process to be fulfilled for its own standalone sake, rather than a subsystem designed to support the overall goal of improving student outcomes.[14]

As with other forms of assessment, formative assessment can be very time-consuming for teachers to administer. Hence, the resource intensity and precision of an assessment need to be balanced against teachers' other responsibilities in the particular context, as well as the purposes for which the assessment is being used. While rigorously standardised assessments are valuable for some purposes, informal, non-documented formative assessment may adequately facilitate teachers' ongoing support to individual students. For example, the South African NGO Funda Wande advises teachers to conduct ongoing, informal assessments during shared reading sessions in everyday lessons:

Shared Reading offers the teacher the opportunity to conduct ongoing informal assessment. The teacher needs to pay very careful attention to individual learners as well as to the class as a whole.

  1. Watch learners very carefully, paying attention to who is not following, who joins in and who seems only to be echoing the others.
  2. Listen carefully to hear if learners lag behind in the reading.
  3. Notice which learners respond to the questions you ask about the text.
  4. Notice how learners practise what you have taught in the mini-lesson.
  5. Make mental notes of learners’ problems so you can assist them during Group Guided Reading. (p. 41)[15]

Types of assessments available to country governments

The table below provides a classification of different types of learning assessments available today.[16]

  • Civil society led: Learning assessments of samples of students conducted by civil society organisations; usually cover very basic literacy and numeracy skills; may be representative at the subnational or national level. Examples: ASER in India; Uwezo in Tanzania, Kenya, and Uganda.
  • National: An assessment conducted by the national government to assess student performance; should be nationally representative, based on either a sample or a census; can be conducted at any point in the school cycle, but is most useful if conducted at multiple points during the schooling process. Examples: SABER in Colombia; IDEB in Brazil.
  • Regional: An assessment generated by a regional or international organisation to be administered in multiple countries; conducted on nationally representative samples; not necessarily comparable across countries or over time. Examples: early rounds of PASEC and SACMEQ.
  • Regional comparable: An assessment generated by a regional or international organisation; administered by countries with technical support; conducted on nationally representative samples; comparable across countries and over time. Examples: later rounds of PASEC and SACMEQ; LLECE and its subsequent rounds.
  • International comparable: An assessment generated by an international organisation; administered with technical support; comparable across countries and over time. Examples: PISA and TIMSS.

Using student assessments for teacher appraisal

SECTION TO BE DEVELOPED AND LINKED TO Teacher appraisal


Rothstein (2009), Briggs & Domingue (2011), and Garrett & Steinberg (2015) on bias and unreliability of VAMs (value-added models)

Koretz (2008), Measuring Up

Amrein-Beardsley (2014), Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability

Check references for Chilean example of teacher appraisal using portfolios, etc. (e.g. Taut et al., 2011)

Check the Gates MET Project conclusions about the optimal mix of sources in teacher appraisal

See also

Related publications

  • Bruns, B., Akmal, M., and Birdsall, N. 2019. The Political Economy of Testing in Latin America and Sub-Saharan Africa. RISE Working Paper Series. 19/032. https://doi.org/10.35489/BSG-RISE-WP_2019/032
  • Gray, L., & Baird, J.-A. (2020). Systemic influences on standard setting in national examinations. Assessment in Education: Principles, Policy & Practice, 27(2), 137–141. https://doi.org/10.1080/0969594X.2020.1750116 (editorial introduction; see also the other articles in this issue, which look at specific countries)
  • Koretz, D. M. (2008). Measuring up: What educational testing really tells us. Harvard University Press.

References

  1. Pritchett, L., 2013. The rebirth of education: schooling ain’t learning. Center for Global Development, Washington, D.C.
  2. Burdett, N. 2016. The Good, the Bad, and the Ugly - Testing as a Key Part of the Education Ecosystem. RISE Working Paper Series. 16/010. https://doi.org/10.35489/BSG-RISE-WP_2016/010.
  3. Burdett, N. 2017. Review of High Stakes Examination Instruments in Primary and Secondary School in Developing Countries. RISE Working Paper Series. 17/018. https://doi.org/10.35489/BSG-RISE-WP_2017/018.
  4. Hwa, Y-Y. 2020, October 23. Combatting Cheating: Ideas from India and Indonesia. RISE Programme blog. https://riseprogramme.org/blog/combatting-cheating-ideas-india-indonesia
  5. Berkhout, E. et al. 2020. From Cheating to Learning: An Evaluation of Fraud Prevention on National Exams in Indonesia. RISE Working Paper Series. 20/046. https://doi.org/10.35489/BSG-RISE-WP_2020/046.
  6. Singh, A. 2020. Myths of Official Measurement: Auditing and Improving Administrative Data in Developing Countries. RISE Working Paper Series. 20/042. https://doi.org/10.35489/BSG-RISE-WP_2020/042.
  7. Atuhurra, J. and Kaffenberger, M. 2020. System (In)Coherence: Quantifying the Alignment of Primary Education Curriculum Standards, Examinations, and Instruction in Two East African Countries. RISE Working Paper Series. 20/057. https://doi.org/10.35489/BSG-RISE-WP_2020/057
  8. Belafi, C., Hwa, Y., and Kaffenberger, M. 2020. Building on Solid Foundations: Prioritising Universal, Early, Conceptual and Procedural Mastery of Foundational Skills. RISE Insight Series. 2020/021. https://doi.org/10.35489/BSG-RISE-RI_2020/021.
  9. Bruns, B. 2017, April 18. For Faster Education Progress, We Need to Know What Kids Know. https://riseprogramme.org/node/374
  10. Gustafsson, M., 2019. The case for statecraft in education: The NDP, a recent book on governance, and the New Public Management inheritance (No. WP16/2019), Stellenbosch Economic Working Papers. University of Stellenbosch. https://www.ekon.sun.ac.za/wpapers/2019/wp162019
  11. Bruns, B., Akmal, M., and Birdsall, N. 2019. The Political Economy of Testing in Latin America and Sub-Saharan Africa. RISE Working Paper Series. 19/032. https://doi.org/10.35489/BSG-RISE-WP_2019/032.
  12. Hwa, Y., Kaffenberger, M. and Silberstein, J. 2020. Aligning Levels of Instruction with Goals and the Needs of Students (ALIGNS): Varied Approaches, Common Principles. RISE Insight Series. 2020/022. https://doi.org/10.35489/BSG-RISE-RI_2020/022.
  13. Sinha, S., Banerji, R., and Wadhwa, W. 2016. Teacher Performance in Bihar, India: Implications for Education. Directions in Development: Human Development. Washington, D.C.: World Bank Group. http://documents.worldbank.org/curated/en/484381467993218648/Teacher-performance-in-Bihar-India-implications-for-education
  14. Berry, J., Kannan, H., Mukherji, S., Shotland, M., 2020. Failure of frequent assessment: An evaluation of India’s continuous and comprehensive evaluation program. Journal of Development Economics 143, 102406. https://doi.org/10.1016/j.jdeveco.2019.102406
  15. Funda Wande. 2020. Reading Academy: Booklet 3, CAPS Reading Activities 1. Available from https://fundawande.org/learning-resources
  16. Akmal, M., Birdsall, N., and Bruns, B. (forthcoming). A structured model of the dynamics of student learning in developing countries, with applications to policy, p. 1.