MIT Asked 17,000 Workers What AI Can Actually Do at Their Jobs
A new MIT FutureTech study flipped the automation forecast: instead of experts predicting AI impact, 17,000+ workers evaluated real LLM outputs on their own tasks. The results upend conventional wisdom about who is most exposed.
Workers themselves just told MIT what AI can and cannot do at their jobs — and the answer flips a lot of conventional wisdom about automation. Across 17,000+ evaluations of real LLM outputs by people doing those exact jobs, frontier models hit "minimally sufficient quality" on 50% to 75% of text-based tasks. That is closer to a tide than a tidal wave, and the surprise is who is most exposed.
If you are a paralegal or a research scientist convinced your work is too nuanced for AI, the new data is on your side — for now. If you are in installation, repair, transportation logistics, or construction administration, the same data is much less comforting. The MIT FutureTech team, led by Matthias Mertens and Neil Thompson, published the preliminary findings in March 2026 as "Crashing Waves vs. Rising Tides" (arXiv:2604.01363).
This study matters because almost every prior automation forecast — Frey & Osborne 2013, OECD 2018, Goldman Sachs 2023, Anthropic Economic Index — has been top-down: experts or models look at a task taxonomy and guess. Mertens and Thompson flipped the telescope.
A bottom-up measurement nobody had really done
Instead of asking economists or even asking AI itself, the MIT team recruited workers with relevant on-the-job experience and showed them actual LLM output for tasks from their own occupation. The workers scored each output on a 1–9 scale. The headline metric is binary: did the model produce work of "minimally sufficient" quality — a rating of 7 or higher, meaning a human would need no edits?
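The headline metric is simple enough to state in code. The sketch below is illustrative only (not the authors' analysis code, and the ratings are hypothetical): each worker evaluation is a 1–9 quality score, and a task output counts as a success if it is rated 7 or higher.

```python
# Illustrative sketch of the study's headline metric, as described in the text.
# Each evaluation is a 1-9 quality rating; an output "succeeds" at rating >= 7
# (the "minimally sufficient" bar).

def success_rate(ratings, threshold=7):
    """Fraction of evaluations at or above the minimally-sufficient bar."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(r >= threshold for r in ratings) / len(ratings)

# Hypothetical worker ratings for one task's LLM outputs:
ratings = [9, 7, 4, 8, 6, 7, 3, 7]
print(f"{success_rate(ratings):.1%}")  # prints "62.5%"
```

Aggregating this binary outcome across 17,000+ evaluations is what produces the 50% to 75% band reported above.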
The scope is unusually large:
- 17,000+ task evaluations completed so far (survey ongoing)
- 3,000+ broad task families drawn from the U.S. Department of Labor's O\*NET database
- 20,000+ unique task instances generated from over 10,000 O\*NET tasks
- 40+ different LLMs tested, including frontier 2025 models
- Tasks filtered by GPT-4 to retain only those with at least 10% time-saving potential
[Fact] Across all models surveyed, success rates landed in the 50% to 75% band. [Fact] By Q3 2025, frontier models reached 50% success on tasks that take a human "approximately a day" of work.
That last number is the one that should make every white-collar worker pause. A year ago, similar studies were measuring AI on 15-minute tasks. The horizon has moved.
The occupations workers themselves rank as most exposed
Here is where the bottom-up methodology produces results most labor economists did not predict. The occupational groups with the highest LLM success rates were not law firms or research labs. They were:
- Installation, Maintenance, and Repair — 72.5% success rate
- Construction and Extraction — 71.0%
- Transportation and Material Moving — 70.6%
- Food Preparation and Serving — 65.5%
At first that looks absurd. AI cannot fix a refrigerator or drive a truck. But the workers in those fields said something subtler: the text-based portions of their jobs — work orders, dispatch routing, safety logs, parts ordering, supplier emails, training material, customer communication, regulatory checklists — turn out to be highly automatable. The hands-on physical work is still safe. The paperwork that surrounds it is not.
This is consistent with what dispatchers, fleet managers, and field supervisors have been quietly observing for two years. The "blue-collar" occupational families now have substantial administrative layers — and that administrative layer is exactly what large language models handle best.
If your role is largely the administrative coordination of a physical operation, the MIT data suggests you should treat the next two years as preparation time, not denial time.
The occupations workers say AI still cannot do
The flip side surprised the authors as much as the high-exposure list did. The groups with the lowest success rates were exactly the ones that most public commentary has labeled "first to go":
- Legal — 46.8% (the lowest in the entire study)
- Life, Physical, and Social Science — 51.8%
- Architecture and Engineering — 52.8%
[Claim] These are knowledge-work fields where domain experts evaluating real LLM output said, repeatedly, "this would need substantial editing before I could use it." The gap between an impressive demo and minimally sufficient professional-grade output is widest exactly where stakes — liability, replicability, structural safety — are highest.
[Estimate] One reading is that legal, scientific, and engineering work require chains of verified reasoning rather than fluent paragraphs. Another is that experts in these fields apply a stricter quality bar than experts in, say, food service. Both can be true. The practical implication is the same: alarm about paralegals and biotech researchers has run ahead of the data. Lawyers and scientists reviewing the actual outputs are unimpressed.
Why "rising tide" not "crashing wave" is the most important sentence in the paper
The metaphor in the title is the part worth memorizing. A crashing wave would mean AI suddenly becomes capable of one whole occupation at once — say, all paralegals replaced in 18 months. A rising tide means broad, gradual lift across the entire task landscape — a 15% productivity bump for almost every text-based worker over five years, with displacement concentrated in specific tasks, not specific people.
The MIT data shows the rising-tide pattern, not the crashing-wave pattern. The success-versus-task-difficulty curve is "surprisingly flat," the authors write — progress is widespread rather than punctuated. They explicitly note: "progress typically resembles a rising tide, with widespread gains across many tasks simultaneously."
[Fact] This is the same pattern that the Anthropic Economic Index has been reporting from a completely different vantage point — the actual conversation logs of Claude users. Two methodologies, two data sources, one converging finding: AI is not vaporizing job categories. It is reshaping every job category at once.
That is much harder to plan for politically, but easier to plan for personally. If 60% of your tasks become 30% faster, your job does not disappear — it changes. You either absorb more work or free up time for the parts AI cannot do.
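The arithmetic behind that claim is worth making explicit. As a simplifying assumption, read "30% faster" as "those tasks take 30% less time"; the numbers are the hypothetical ones from the paragraph above, not figures from the study.

```python
# Back-of-envelope check of the rising-tide arithmetic described in the text:
# if a share of your working time gets a uniform speedup, how much total
# time is freed? (Assumes "30% faster" means a 30% time reduction.)

def time_saved(affected_share, time_reduction):
    """Fraction of total working time saved.

    affected_share: fraction of time spent on tasks AI can accelerate
    time_reduction: fractional time cut on those tasks (0.30 = 30% less time)
    """
    return affected_share * time_reduction

saved = time_saved(0.60, 0.30)
print(f"{saved:.0%} of total time freed")  # prints "18% of total time freed"
```

An 18% time dividend is a meaningful productivity gain, but it is role redesign, not replacement, which is the tide-versus-wave distinction in practice.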
What the authors are careful to tell you the data does not mean
The honesty in section 6 of the paper is worth quoting. The authors flag four limitations every reader should hold in mind:
- The findings cover text-based or partially text-based tasks only. Most physical work is excluded by construction.
- Results "will not translate directly to shares of job automation" because of "last mile" deployment costs, integration friction, regulatory constraints, and selection effects in which tasks were surveyed.
- The projections assume "AI progress continues at the pace observed over the past two years" — an upper-bound scenario, not a forecast.
- The survey is ongoing. Numbers may shift as more occupations are sampled.
[Claim] In other words: this is the best evidence so far on what AI can already do, judged by people who actually do the work. It is not yet evidence on what employers will deploy or workers will lose. Those two questions have always been different, and the gap between them is where policy lives.
What to actually do this week if you are one of the workers in scope
For installation/repair, transportation logistics, construction admin, and food-service operations: the text-and-coordination layer of your role is in scope. Catalog which 20% of your weekly tasks are pure text — emails, scheduling, reports, customer communication, parts ordering — and learn one AI tool that handles them. Use the time freed up to deepen the physical or interpersonal craft that the MIT data still says AI cannot match.
For legal, scientific, and engineering professionals: the data says you have more runway than your LinkedIn feed suggests. Use that runway to build AI literacy on the offense — not to dismiss the technology, but to become the person in your firm or lab who knows exactly what it can and cannot deliver. The MIT score for your function is in the paper. Read it.
For everyone else — the giant middle of office and administrative work, healthcare support, education, sales, customer service — your tasks are statistically near the 65% global mean. Rising tide territory. Expect a productivity gain, expect role redesign, do not expect mass replacement in the next 24 months.
Sources
- Mertens, M., Thompson, N., et al. (2026). _Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks_. arXiv:2604.01363
- MIT FutureTech research program: https://futuretech.mit.edu
- U.S. Department of Labor O\*NET database (task taxonomy underlying the study)
Update History
- 2026-05-14: Initial publication based on MIT FutureTech preliminary findings (March 2026 draft).
_This post was produced with AI-assisted analysis (Claude Opus 4.7). The underlying data is from peer-reviewable preprint research at MIT FutureTech; interpretations and emphasis are editorial. We will update this post as the MIT survey expands its occupational coverage._