Through funding cuts and bumps, integration and resegregation, panics and reforms, world wars and culture wars, American students have consistently learned at least one thing well: how to whip out a No. 2 pencil and mark exam answers on a sheet printed with row after row of bubbles. Whether you are an iPad baby or a Baby Boomer, odds are that you have filled in at least a few, if not a few hundred, of these machine-graded multiple-choice forms. They have long been the key ingredient in an alphabet soup of standardized tests, both national (SAT, ACT, TOEFL, LSAT, GRE) and local (SHSAT, STAAR, WVGSA). And they are used in both $50,000-a-year academies and the most impoverished public schools, where the classic green or blue Scantron answer sheets can accompany daily quizzes in every subject.
Machine grading, now synonymous with the brand Scantron the way tissues are with Kleenex, is so popular because it can provide rapid and straightforward results for millions of students. In turn, this technology has ushered in an epoch of multiple-choice testing. Why does English class involve not just writing essays but also choosing which of four potential themes a passage represents? Why does calculus require not just writing proofs but selecting the correct solution from various predetermined numbers? That is largely because of the Scantron and its brethren.
But soon, the country may have its first generation in decades not trained to instinctively fill in a series of tiny answer bubbles with no stray marks. The SAT will go fully digital next year; the ACT, AP exams, and numerous state tests have already done so or will follow. Taking class quizzes, too, could one day involve not bubbling in an answer sheet but typing on a keyboard or tapping a tablet. Perhaps no single invention has shaped American education as fundamentally as automatic multiple-choice scoring. Now its demise could do the same.
An American student in the early 1900s might not have taken a single multiple-choice test throughout their time in school. At that point, assessments tended to center on essays, projects, oral exams, and other assignments that required more time for students to answer and teachers to grade, Linda Darling-Hammond, an emeritus professor of education at Stanford and a longtime federal education policy maker, told me. That model was more holistic than a multiple-choice test, but also prone to subjectivity and bias—and only possible, in part, because far fewer children received a formal education.
Soon, however, teachers and government officials sought ways to evaluate rapidly increasing numbers of students. In 1900, roughly 10 percent of teens attended high school; by 1940, some 70 percent did. Colleges, too, were figuring out how to choose among much larger pools of applicants. It was no longer feasible for educators “to rely on their eyes and ears” to evaluate students, Jack Schneider, an education historian at the University of Massachusetts at Amherst, told me. Schools and school districts needed data.
The multiple-choice test just made sense. Although some standardized tests did exist as early as 1845, they involved more open-ended questions. The first multiple-choice exam in the United States was a reading assessment administered in Kansas during World War I. Several others emerged shortly after, including a military aptitude test in 1917—which was soon adapted into a version for students—and then the SAT in 1926. Having limited, fixed answers to each question created a uniform way to numerically represent and sort students—some into college, others into trade school, and so on. Even without machines, administrators and teachers could much more quickly grade multiple-choice tests by hand than they could read an essay or geometry proof.
Assessing students through multiple-choice tests, of course, presumed that the exams provided objective insights into students’ abilities. They did not; instead, many exams only confirmed existing biases around race and class, Sevan Terzian, a historian of American education at the University of Florida, told me. Accurate or not, rising numbers of students were enrolling in school and taking these exams, exposing the limitations of human graders. “With lots of students taking these exams … this becomes really important: the ability to quickly grade all those exams so that it’s possible to get scores in a timely way so students can move on,” Ethan Hutt, who studies education and testing at the University of North Carolina at Chapel Hill, told me. Speed was crucial for exams that could influence college admissions, grades, and graduation. In search of greater efficiency, IBM released the first automatic-scoring machine in 1937, which worked by sensing the electrical conductivity of pencil marks.
But the real breakthrough came in the 1950s, when Everett Lindquist, a co-creator of the ACT, invented an optical-mark recognition system that remains the basis of many test-grading devices used today. The technology identified marks using light instead of electricity and was much faster, capable of scoring some 4,000 tests an hour compared with the IBM machine’s 800. Lindquist’s scanner, he wrote in his patent application, would make it “possible to perform the desired scoring, converting, analyzing and reporting operations in a matter of days, even hours, as compared to weeks. In other words, it is unnecessary to have a staff of from 50 to 100 persons.”
Soon, machine grading was everywhere. Test scores became “like a GDP measure for education” during the Cold War, Hutt told me, and in a country where education is so decentralized, knowing where a school stood relative to others became crucial—and easier to determine in the 1960s thanks to computers that could store and process large amounts of data. It was this “drive for comparison scores that really leads to the obsession with standardized tests,” Schneider said.
By the time Scantron was founded in 1972, machine grading had already made multiple-choice tests a key part of American education, and an enormous push for statewide tests only increased the demand for scoring technology. The company and its business model helped make those tests even more pervasive: Scantron provided scoring machines for cheap, and turned a profit by selling answer sheets to a captive market of schools and school districts. Teachers had already been borrowing the A/B/C/D format from standardized tests for years, but Scantron provided smaller, affordable scanners that made doing so even easier. As of 2019, Scantron served 96 of what it referred to as the “top 100 school districts in the United States” and printed some 800 million sheets globally each year; its scanners can process 15,000 sheets an hour. Teachers and leaders who already believed that these tests provided neutral assessments of ability found “the technology to grade these multiple-choice exams very appealing,” Terzian said.
Nearly every aspect of American education has now bent to Scantron and machine grading. The technology enabled 21st-century laws like No Child Left Behind to massively proliferate testing and tie student scores to funding. Schools are physically transformed, converting their libraries and gymnasiums and auditoriums and computer labs into test-taking, -collection, and -grading centers; they also cough up 15 to 20 cents per sheet. Students bring boxes of No. 2 pencils on exam days (the graphite is particularly opaque and easier for the scanner to register), share Scantron memes, and try to devise ways to cheat by marking multiple bubbles; educators “teach to the test,” and children learn to think in terms of the A/B/C/D format, Becky Pringle, the president of the National Education Association, one of the two major teachers’ unions in the country, told me.
The dominance of bubble-in answer sheets and the thin red mark next to wrong answers, however, is beginning to erode. Many standardized tests are now offering more open-ended questions intended to measure higher-order thinking, Darling-Hammond said. And physical answer sheets are slowly giving way to computer screens, a transition the pandemic and remote schooling accelerated: State tests, college-admissions exams, and other assessments across the country are going digital. For now, many online exams aren’t meaningfully different. Come January, for the first time in decades, the SAT will forgo bubble sheets, but it will still be stuffed with the same kind of multiple-choice questions. Checking multiple-choice answers by hand, running an answer sheet through a Scantron machine, and grading instantly on a screen are all different technologies for evaluating the same sort of exam and extracting the same sort of data, whether from graphite or the click of a cursor.
That is the case for now, at least. Computers could well transform American testing by allowing for more creative and interactive questions, Kara McWilliams, the vice president of product innovation and development at ETS, a testing company that provides exams such as the GRE, told me. McWilliams also runs the company’s AI lab, which is using advanced AI models to both create and help score test questions. After subject-matter experts annotate a huge number of essays, for instance, an AI program trained on those human evaluations could grade tests on its own, with a person still verifying its final output. Computers might similarly be used to grade oral assessments or foreign-language exams, such as whether a student asked to translate “apple” into Spanish has pronounced manzana correctly. Just as machine grading allowed for wide-scale multiple-choice tests, students might eventually end up answering more free-form questions and writing more essays that are graded just as quickly and easily as a Scantron form is today. A spokesperson for Scantron told me that the company is proud of its “digital solutions” and “looking forward to our continued impact over the next 50 years and beyond.”
If the epoch of multiple-choice tests is truly ending, the assessments won’t necessarily be missed. Not only is the format inherently reductive—bubble-in question-and-answer forms have also been prone to bias. In turn, they have spawned decades of debate over whether America’s standardized tests are more racist, sexist, or classist than alternatives such as essays and oral exams.
The shift to computers still may not free us from these fights. Scantron and AI are two versions of a computer that gives rapid feedback purporting to be more objective than a teacher could ever be. Yet the results of, say, a statewide multiple-choice math test still have to be translated into how to better teach a student who might be lagging behind. Insights from computer programs, too—especially given AI models’ many biases and inaccuracies—are unlikely to escape the same failures of human interpretation. Better data are still only as good as what educators do with them.