What Staff engineers actually do in 2026 versus what the career ladder says they should
Most IC rubrics still measure the skills AI is making table stakes. The three gaps your career ladder does not cover.
Most IC career ladders have not changed much since 2019. They still describe Staff engineers the same way: leads technical direction, mentors junior engineers, reduces ambiguity, writes the hard code that nobody else can. The definition was accurate then. The problem is that several things on that list are now much easier to do, and the things that actually require Staff-level judgment in 2026 are not on it.
This is not a complaint about career ladders — they are hard to maintain, and most companies update them every three to five years at best. It is a practical observation: if you are being evaluated against criteria that no longer align with the actual high-value work in your organisation, you will optimise for the wrong things. And if you are a manager calibrating performance reviews, you may be measuring the wrong signals.
What IC ladders were designed to measure
The canonical Senior and Staff engineer rubric, written somewhere around 2015–2019 at most companies, assumes implementation is the bottleneck. The Staff engineer is valuable because she can write the complex distributed-systems code that junior and mid-level engineers cannot. She is the one who catches the subtle concurrency bug, who knows how the database behaves at 10x current load, who can explain the six-year-old legacy system to anyone who asks.
Ownership of a system meant understanding its internals deeply, because that understanding was hard to acquire. Technical leadership meant being the person who could execute on the hard things when others were blocked. The career criteria reflected that: ship complex work, lead technical projects, multiply your team's output.
These are real skills. The ladder was measuring something meaningful. The question is whether it is measuring the same thing today.
Where the actual work shifted
By mid-2025, AI coding tools had crossed a threshold. They did not replace engineers, and still have not, but they changed which parts of engineering were scarce. Several activities had previously been Staff-level signals in many organisations:
- Explaining a complex subsystem to a new hire, with specific examples and context
- Writing a first draft of a design document from a product brief
- Reviewing a pull request for common patterns and obvious issues
- Writing the scaffolding for a new service: API client, repository layer, basic tests
AI handles the first two adequately most of the time. It catches a reasonable share of the third. It does the fourth entirely. Teams that adopted AI-assisted coding seriously in 2025 are asking a different question from "can we build this?" They are asking "is what we built actually right?"
Junior and mid-level engineers produce more code, faster. What they still lack is the judgment to know whether the code is solving the right problem, whether the abstraction will age well, whether the AI missed a class of input that will matter in production. That judgment is where Staff-level work has concentrated.
The seniority signal has not disappeared. It has moved. Code output used to separate Senior from Mid. Now the separator is something harder to name on a rubric: the judgment layer that sits above what AI can produce.
Three gaps that 2026 ladders miss
Three categories of work create real Staff-level impact now but do not appear in most rubrics.
Spec precision
The most underrated skill in an AI-heavy engineering team is twofold: writing a specification that produces the right output when fed to an AI system, and knowing, from the output, when the spec was wrong versus when the AI was wrong.
This is adjacent to prompt engineering, but broader. It is the skill of translating ambiguous product intent into a precise-enough description of the problem that the solution space is meaningfully constrained. A junior engineer uses AI to write code. A Staff engineer writes the spec that gets AI-generated code into the right solution space in the first iteration, not the fifth.
Most IC rubrics do not measure this at all. There is "leads technical design," which is related. But leading a design review and writing a spec that another system can execute from without clarification are different skills, and most ladders do not distinguish them.
The feedback loop is fast and honest: if AI-generated code from your spec needs three rounds of major correction, the spec was not precise enough. That is a measurable output. Most teams are not yet measuring it.
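One way to make that measurement concrete is to log, per spec, how many rounds of major correction the generated code needed before it was accepted. The sketch below is illustrative only: the `SpecRun` record, the spec names, and the sample numbers are hypothetical, and the threshold of three rounds simply mirrors the rule of thumb above.

```python
from dataclasses import dataclass


@dataclass
class SpecRun:
    spec_id: str                  # hypothetical identifier for a written spec
    major_correction_rounds: int  # rounds of major rework before acceptance


def imprecise_spec_rate(runs: list[SpecRun], threshold: int = 3) -> float:
    """Fraction of specs whose generated code needed `threshold` or more
    rounds of major correction. Returns 0.0 when there is no data."""
    if not runs:
        return 0.0
    flagged = sum(1 for r in runs if r.major_correction_rounds >= threshold)
    return flagged / len(runs)


# Hypothetical sample data for one quarter.
runs = [
    SpecRun("checkout-retry", 0),
    SpecRun("invoice-export", 3),
    SpecRun("rate-limiter", 1),
    SpecRun("audit-log", 4),
]

rate = imprecise_spec_rate(runs)
print(f"{rate:.0%} of specs needed 3+ major correction rounds")
```

What counts as a "major" correction is a team judgment call; the value of the number comes from tracking it consistently over time, not from the exact definition.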
AI system design and inference cost
A growing category of Staff-level work that most ladders have not absorbed: owning the architecture of systems where an LLM is in the critical path.
This includes designing retrieval pipelines that degrade gracefully when context windows fill, managing token budgets across inference calls without degrading output quality, and knowing when to add a cheaper re-ranker versus a more expensive context step. At production volumes, inference cost shows up on the P&L. Someone has to own that trade-off. In most organisations, nobody's career rubric names it.
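As a rough illustration of what "degrading gracefully" can mean in a retrieval step, the sketch below keeps the highest-scoring chunks that fit a fixed token budget and drops the rest, rather than failing or truncating mid-chunk. The `Chunk` type, the scores, and the token counts are assumptions made for the example; a real pipeline would take them from its own retriever and tokenizer, and might prefer a different packing strategy.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever (higher is better)
    tokens: int   # token count, assumed precomputed with the model's tokenizer


def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within `budget`.

    A chunk that does not fit is skipped, and smaller lower-ranked chunks
    may still be added afterwards.
    """
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if used + chunk.tokens <= budget:
            kept.append(chunk)
            used += chunk.tokens
    return kept


# Hypothetical retrieved chunks and a 1,000-token budget for context.
chunks = [
    Chunk("a", 0.9, 400),
    Chunk("b", 0.7, 500),
    Chunk("c", 0.4, 300),
    Chunk("d", 0.2, 100),
]
kept = pack_context(chunks, budget=1000)
print([c.text for c in kept], sum(c.tokens for c in kept))
```

The design choice worth owning here is the policy, not the loop: whether to drop whole chunks, summarise them, or route to a cheaper re-ranker first is exactly the kind of trade-off the paragraph above describes.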
At $10–30 per million tokens, a poorly designed inference pipeline running against a large model at volume will produce a meaningful cost overrun. The Staff engineer who owns that system owns both the quality and the cost — the same way a Staff engineer who owns a Postgres-backed service owns both the query latency and the disk spend.
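A back-of-the-envelope calculation shows how quickly this adds up. The workload figures below are hypothetical, and the single blended price per million tokens is a simplification (providers typically price input and output tokens differently); the rate used is the low end of the range quoted above.

```python
def monthly_inference_cost(
    requests_per_day: int,
    tokens_per_request: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Estimated spend in USD, assuming one blended per-token price."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 200,000 requests per day at $10 per million tokens.
bloated = monthly_inference_cost(200_000, 6_000, 10.0)  # context stuffed on every call
trimmed = monthly_inference_cost(200_000, 2_000, 10.0)  # context trimmed to what is needed

print(f"bloated: ${bloated:,.0f} per month")
print(f"trimmed: ${trimmed:,.0f} per month")
print(f"difference: ${bloated - trimmed:,.0f} per month")
```

Under these assumed numbers, sending three times as many tokens per request costs $360,000 a month instead of $120,000, which is the scale at which someone has to own the trade-off.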
Recognising confident-but-wrong output
AI systems fail in a specific way: they are confident and wrong. Not wrong in obvious ways — those get caught immediately. Wrong in the way that requires domain knowledge to spot: a technically valid answer that solves the wrong problem, makes an implicit assumption that is false in your context, or misses a constraint the model had no access to.
This failure mode did not exist in the old rubric because there was no system in the loop that failed that way. Code failed loudly, at compile time or at runtime. AI output fails silently, producing something that looks right, passes tests, and breaks in production for reasons the model could not have known.
Catching this reliably is a Staff-level skill. Most ladders do not describe it because the language for it does not yet exist in most rubric frameworks. But in teams that have been running AI-generated code in production for a year, the engineers who catch these failures before they ship are clearly doing something that matters at a level the rubric was not designed to capture.
| Activity | Old rubric weight | 2026 reality |
|---|---|---|
| Writing complex implementation code | Core Staff signal | Table stakes; AI-assisted at most seniority levels now |
| First-draft design documents | Staff-level work | First draft is commoditised; quality review is what matters |
| PR review for common patterns | Core signal | Routine patterns are cheap to catch; non-obvious domain errors are the signal |
| Writing precise, executable specs | Not measured explicitly | High value; determines AI output quality in the first iteration |
| LLM infrastructure and cost design | Not applicable | Significant; maps directly to P&L at production scale |
| Catching domain-context errors in AI output | Not applicable | High value; requires judgment that AI cannot supply |
How some teams are recalibrating
A few patterns from teams that have been running AI-heavy engineering long enough to have opinions.
The most direct: some teams have added explicit rubric criteria around "AI force multiplier" — not just "uses AI tools" (which any engineer does now) but "creates systems and specifications that multiply the output of others using AI." That is the closest thing to a functional Staff definition for 2026.
A second pattern: separating code contribution from engineering judgment more clearly in performance reviews. Good ladders always tried to do this, but as AI raises the floor on code output, the relative weight of judgment has to go up. Teams that have not reweighted are promoting people with high output but low architectural judgment. It looks fine until a system breaks in a way that is expensive to fix.
A third pattern, still rare: including LLM system design as an explicit competency at L5 and above. Most companies still treat this as a specialisation. That is likely to change as more organisations move LLMs into critical production paths and the cost shows up in ways that can be neither ignored nor attributed to a single team.
What none of these teams did was rewrite the whole ladder from scratch. The underlying values — reducing ambiguity, multiplying team output, technical leadership — are still right. What changed is the evidence that satisfies them.
“The rubric says ‘demonstrates technical leadership.’ In an AI-forward team in 2026, that phrase increasingly means: this person's judgment is what ensures we are building the right thing at the right quality.”
For ICs navigating this gap now
If you are working toward Staff or Principal and your company's rubric has not updated, the practical reality is this: the written criteria are not the only criteria you will be evaluated against. Your manager and skip-level are forming opinions about your impact in the world as it actually is, not the world the rubric describes.
Three things that create visible Staff-level impact in this environment:
- Get in the habit of writing specs that others, AI or human, can execute from without needing clarification. If AI-generated code from your spec needs three rounds of major correction, the spec was not precise enough. That feedback loop is fast; use it.
- Take ownership of one AI-in-production system, even a small one. Practical experience with inference cost, latency budgets, and quality degradation patterns puts you ahead of people who only use AI as a coding tool — and gives you something concrete to discuss in promotion conversations.
- When AI output is confidently wrong in your domain, write up the failure. Not just the fix, but the pattern. "AI gets this class of problem wrong because X" is useful institutional knowledge, and writing it is exactly what Staff engineers are supposed to do.
The ladder will catch up. Most of them do, eventually. The question is whether you will be measured by the 2019 rubric or the 2026 reality when your next promotion conversation happens.