Role Definition
| Field | Value |
|---|---|
| Job Title | Model Alignment Researcher |
| Seniority Level | Mid-Level |
| Primary Function | Conducts original research in RLHF, reward modelling, Constitutional AI, mechanistic interpretability, and value alignment at frontier AI labs. Designs novel techniques to ensure AI systems behave in accordance with human intentions — inventing new reward functions, improving preference learning pipelines, developing scalable oversight methods, and researching how to formally represent and encode human values into AI training. This is theoretical and mathematical research, not applied engineering. |
| What This Role Is NOT | NOT an AI Safety Researcher (broader scope — red-teaming, adversarial robustness, safety evals, policy; scored 85.2 Green). NOT an ML/AI Engineer (builds production models). NOT a Reinforcement Learning Engineer (implements RL systems; scored 64.7 Green). NOT an AI Governance Lead (manages compliance and policy). Alignment research is specifically the science of making AI systems reliably do what humans want. |
| Typical Experience | 3-7 years. PhD in ML, mathematics, CS, or physics typically required. Publication record at NeurIPS, ICML, ICLR on alignment-specific topics. Prior work at frontier labs (Anthropic, OpenAI, DeepMind) or alignment-focused organisations (MIRI, MATS, FAR.AI, Redwood Research, ARC). |
Seniority note: Junior alignment researchers (post-PhD, 0-2 years) would still score Green but lower — more execution of established research agendas, less agenda-setting. Goal-Setting drops from 3 to 2. Senior researchers setting alignment research direction score deeper Green.
Protective Principles + AI Growth Correlation
| Principle | Score (0-3) | Rationale |
|---|---|---|
| Embodied Physicality | 0 | Fully digital. All work occurs in compute environments, on whiteboards, and in mathematical proofs. |
| Deep Interpersonal Connection | 1 | Collaborative research with team members. Some stakeholder communication on alignment findings. Core value is intellectual and mathematical, not relational. |
| Goal-Setting & Moral Judgment | 3 | Defines what "aligned AI" means mathematically. Sets research agendas for problems with no precedent — choosing which reward modelling approaches to pursue, what constitutes adequate value alignment, which interpretability directions reveal genuine model cognition. Every research direction is a judgment call about how to make AI do what humans want. |
| Protective Total | 4/9 | |
| AI Growth Correlation | 2 | Recursive dependency: more powerful AI models require more sophisticated alignment techniques. RLHF, Constitutional AI, and reward modelling exist because AI capability is advancing. You cannot automate the work of aligning AI — that requires genuine mathematical novelty and moral reasoning about human values. |
Quick screen result: Protective 4 + Correlation 2 = Likely Green Zone (Accelerated). Proceed to confirm.
Task Decomposition (Agentic AI Scoring)
| Task | Time % | Score (1-5) | Weighted | Aug/Disp | Rationale |
|---|---|---|---|---|---|
| Novel alignment research (RLHF improvement, Constitutional AI, scalable oversight, debate-based alignment) | 25% | 1 | 0.25 | NOT INVOLVED | Irreducibly human. Inventing new alignment techniques for unprecedented AI capabilities requires genuine mathematical novelty. No training data exists for alignment solutions to problems that haven't been conceived. This is frontier mathematical science. |
| Mechanistic interpretability & value representation research | 20% | 1 | 0.20 | NOT INVOLVED | Irreducibly human. Understanding how neural networks internally represent concepts and values — reverse-engineering representations that the model's creators don't yet understand — requires forming novel hypotheses about systems whose internal structure is poorly characterised. |
| Reward modelling research (reward hacking mitigation, multi-objective rewards, process reward models) | 15% | 1 | 0.15 | NOT INVOLVED | Irreducibly human. Designing reward functions that faithfully capture human values without being exploitable is an open mathematical problem. Reward hacking, where models optimise the proxy reward rather than the true objective, has no algorithmic solution (a toy sketch follows below the task totals). Each new model capability creates new reward specification challenges. |
| Experimental implementation & evaluation (training runs, ablations, benchmarking alignment quality) | 15% | 2 | 0.30 | AUGMENTATION | AI assists with experiment infrastructure, automated evaluation suites, and scaling interpretability analysis. But designing what experiments to run, interpreting unexpected results, and determining whether an alignment technique actually works requires researcher judgment. |
| Publishing, peer review & conference presentation | 10% | 2 | 0.20 | AUGMENTATION | AI drafts sections, assists with literature reviews, and checks mathematical proofs. The core intellectual contribution — the novel alignment insight, the mathematical formulation, the experimental design — is the researcher's. |
| Cross-team collaboration, mentoring & stakeholder communication | 10% | 1 | 0.10 | NOT INVOLVED | Training the next generation of alignment researchers, collaborating across teams, communicating alignment findings to leadership and policymakers. Human trust and intellectual mentorship in a field where the stakes are existential. |
| Prototype alignment techniques for production systems | 5% | 2 | 0.10 | AUGMENTATION | Translating theoretical alignment research into implementations that can be tested on production models. AI assists with code generation, but the researcher decides what to build and validates whether the implementation matches the theoretical properties. |
| Total | 100% | | 1.30 | | |
Task Resistance Score: 6.00 - 1.30 = 4.70/5.0 (resistance inverts the 1-5 automatability scale: 6 minus the weighted total).
Displacement/Augmentation split: 0% displacement, 30% augmentation, 70% not involved.
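As a check on the arithmetic, the weighted total and resistance score can be reproduced directly from the table. A minimal Python sketch; the weights and scores are copied from the table above, and the 6-minus inversion is the formula as stated:

```python
# Reproduce the Task Decomposition arithmetic from the table above.
# Each entry: (share of task time, agentic-AI score on the 1-5 scale).
tasks = [
    (0.25, 1),  # novel alignment research
    (0.20, 1),  # mechanistic interpretability & value representation
    (0.15, 1),  # reward modelling research
    (0.15, 2),  # experimental implementation & evaluation
    (0.10, 2),  # publishing, peer review & presentation
    (0.10, 1),  # collaboration, mentoring & communication
    (0.05, 2),  # prototyping for production systems
]

weighted = sum(share * score for share, score in tasks)  # 1.30
resistance = 6.00 - weighted                             # 4.70

print(f"Weighted total: {weighted:.2f}; Task Resistance: {resistance:.2f}/5.0")
```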
Reinstatement check (Acemoglu): Strongly positive. AI creates entirely new alignment research tasks: Constitutional AI refinement, GRPO and process reward models, multi-agent alignment for agentic systems, machine unlearning, alignment of recursive self-improvement, formal verification of alignment properties. The task portfolio expands with every capability advance. This role is not merely persisting — it is accelerating.
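The reward-hacking failure mode flagged in the reward modelling row can be made concrete with a deliberately tiny sketch. Everything in it is hypothetical: the action names and reward numbers are invented for illustration. The point is only that an optimiser which sees a flawed proxy, not the true objective, will reliably select the exploit:

```python
# Toy reward hacking: a greedy optimiser on a proxy reward diverges from
# the true objective the proxy was meant to track. All values hypothetical.
actions = {
    # action: (true value, proxy reward as measured)
    "solve_task_properly": (1.0, 0.8),  # genuinely useful; proxy undercounts it
    "partially_solve":     (0.5, 0.5),
    "game_the_metric":     (0.0, 1.0),  # exploits a flaw in the proxy measurement
}

proxy_optimal = max(actions, key=lambda a: actions[a][1])
truly_optimal = max(actions, key=lambda a: actions[a][0])

print(f"proxy-optimal: {proxy_optimal} (true value {actions[proxy_optimal][0]})")
print(f"truly optimal: {truly_optimal} (true value {actions[truly_optimal][0]})")
```

The research problem described in the table is designing proxies (reward models) for which this gap cannot be exploited at scale, which is why the row is scored irreducibly human.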
Evidence Score
| Dimension | Score (-2 to 2) | Evidence |
|---|---|---|
| Job Posting Trends | 2 | ZipRecruiter shows 60 AI alignment postings in San Francisco alone ($111K-$500K). Alignment researcher postings embedded within the ~3,200 AI safety researcher postings (+78% YoY). Every frontier lab actively hiring: Anthropic Alignment Science team, OpenAI Human Alignment team, Google DeepMind ASAT. MATS Summer 2026 expanding to 120 fellows — largest ever. |
| Company Actions | 2 | All frontier labs expanding dedicated alignment teams. Anthropic published recommended alignment research directions (Feb 2025). OpenAI posted dedicated Human Alignment Consumer Devices researcher roles (RLHF, reward modelling, preference learning). DeepMind's AGI Safety & Alignment Team hiring Research Scientists. No evidence of any cuts — the opposite. |
| Wage Trends | 1 | Mid-level total comp $200K-$400K+ at frontier labs. Base salary $160K-$250K. Alignment specialists command 25-45% premiums over general AI positions due to scarcity. ARC ML Researcher salaries of $107K-$197K (annualised from monthly figures). Growing above inflation but concentrated at frontier labs, not broadly distributed across the economy. |
| AI Tool Maturity | 1 | AI assists with experiment infrastructure and automated evaluation. But inventing new alignment techniques — the mathematical novelty of Constitutional AI, RLHF improvements, reward specification — has no viable AI replacement. Anthropic observed exposure for Computer and Information Research Scientists: 34.0% — moderate, predominantly augmentation not displacement. |
| Expert Consensus | 2 | Universal agreement. WEF ranks AI/ML specialists #1 fastest-growing role through 2030. Frontier lab leadership all publicly state alignment is their top research priority. EU AI Act mandates human oversight. International AI Safety Report 2026 reinforces institutional commitment to alignment research. |
| Total | 8 |
Barrier Assessment
Reframed question: What prevents AI execution even when programmatically possible?
| Barrier | Score (0-2) | Rationale |
|---|---|---|
| Regulatory/Licensing | 1 | No formal licensing, but PhD is de facto requirement. EU AI Act mandates human oversight for high-risk AI. US EO 14110 requires safety research by human researchers. Creates structural demand but not a licensing barrier per se. |
| Physical Presence | 0 | Fully remote capable. Research is conducted computationally and mathematically. |
| Union/Collective Bargaining | 0 | Tech sector, at-will employment. No collective bargaining protections. |
| Liability/Accountability | 1 | If a frontier model causes harm due to inadequate alignment — reward hacking, value misspecification, deceptive alignment — accountability traces to the alignment team. Misaligned AI represents catastrophic risk. Someone must own the decision that "this model is sufficiently aligned to deploy." |
| Cultural/Ethical | 2 | Strong societal resistance to AI aligning itself. The recursive trust problem — "can we trust AI to determine its own values?" — is a core philosophical objection that creates structural demand for human alignment researchers. Misaligned AI is increasingly framed as an existential risk. Society demands that humans, not AI, make the fundamental decisions about what AI systems should value. |
| Total | 4/10 |
AI Growth Correlation Check
Confirmed at +2. This is the strongest possible position — the role has a recursive dependency on AI growth itself.
- Every advance in AI capability creates new alignment problems requiring novel mathematical solutions.
- More powerful models are harder to align — RLHF that worked for GPT-3 is insufficient for GPT-5.
- Agentic AI systems introduce multi-agent alignment challenges that didn't exist two years ago.
- Constitutional AI, process reward models, and debate-based alignment are all emerging techniques that create new research agendas.
- The fundamental question — "how do we formally specify what we want AI to do?" — becomes harder, not easier, as AI grows more capable.
This qualifies as Green Zone (Accelerated): AI Growth Correlation = 2 AND JobZone Score >= 48.
JobZone Composite Score (AIJRI)
| Input | Value |
|---|---|
| Task Resistance Score | 4.70/5.0 |
| Evidence Modifier | 1.0 + (8 x 0.04) = 1.32 |
| Barrier Modifier | 1.0 + (4 x 0.02) = 1.08 |
| Growth Modifier | 1.0 + (2 x 0.05) = 1.10 |
Raw: 4.70 x 1.32 x 1.08 x 1.10 = 7.3704
JobZone Score: (7.3704 - 0.54) / 7.93 x 100 = 86.1/100
Zone: GREEN (Green >= 48, Yellow 25-47, Red <25)
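The composite arithmetic can be verified end to end. A minimal sketch using the inputs above; the 0.54 offset and 7.93 divisor are copied verbatim from the normalisation step (they appear to map the raw-score range onto 0-100, but their derivation belongs to the project's methodology, not this section):

```python
# Reproduce the AIJRI composite from the inputs in the table above.
task_resistance = 4.70
evidence_mod = 1.0 + 8 * 0.04  # 1.32
barrier_mod  = 1.0 + 4 * 0.02  # 1.08
growth_mod   = 1.0 + 2 * 0.05  # 1.10

raw = task_resistance * evidence_mod * barrier_mod * growth_mod  # 7.3704
jobzone = (raw - 0.54) / 7.93 * 100                              # 86.1

print(f"Raw: {raw:.4f}; JobZone Score: {jobzone:.1f}/100")
```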
Sub-Label Determination
| Metric | Value |
|---|---|
| % of task time scoring 3+ | 0% |
| AI Growth Correlation | 2 |
| Sub-label | Green (Accelerated) — Growth Correlation = 2 AND JobZone Score >= 48 |
Assessor override: None — formula score accepted. The 86.1 calibrates correctly against AI Safety Researcher (85.2). Alignment research scores marginally higher because it is more theoretical and mathematical — 70% of task time is irreducible (vs 50% for the broader safety researcher) — reflecting the genuine novelty required for reward specification, Constitutional AI design, and formal value alignment. The slightly lower evidence (8 vs 9) reflects a narrower niche market, while the higher barriers (4 vs 3) reflect the catastrophic risk framing of misaligned AI.
Assessor Commentary
Score vs Reality Check
The 86.1 is honest, and it is the highest score in the project, narrowly ahead of AI Safety Researcher (85.2). The marginal difference is justified: alignment research is the purest theoretical subset of AI safety, with 70% of task time at Score 1 (irreducible). The 4.70 Task Resistance exceeds the Safety Researcher's 4.60 because alignment work is more mathematical and theoretical: designing reward functions and value representations is harder to automate than red-teaming or adversarial robustness testing. The barrier score (4/10) slightly exceeds the Safety Researcher's (3/10) because the cultural barrier around AI self-alignment is stronger than the general safety trust deficit. No borderline concerns: the score sits 38 points above the Green threshold.
What the Numbers Don't Capture
- Extreme concentration risk. Perhaps 200-500 alignment researchers globally work at the frontier. The majority sit at 4-5 labs. If frontier AI development consolidates or slows, the job market contracts dramatically. This role is the least diversified by employer of any assessed role.
- Supply shortage confound. Wages and demand reflect a talent pool measured in hundreds, not thousands. If fellowship pipelines (MATS, Anthropic Fellows, SERI) scale successfully, wage premiums may compress even as the role stays Green. The $300K+ total comp reflects extreme scarcity.
- Technique evolution risk. Alignment methods evolve faster than almost any other research field. RLHF dominated 2023; DPO/GRPO emerged 2024-2025; Constitutional AI and process reward models are reshaping the landscape. A researcher who specialises in one technique and doesn't adapt risks obsolescence within a Green Zone role.
- Function-spending vs people-spending. Frontier labs invest heavily in alignment infrastructure (automated evaluation, interpretability tooling) that could reduce the number of researchers needed per alignment insight, even as total alignment investment grows.
Who Should Worry (and Who Shouldn't)
If you're inventing new alignment techniques, designing novel reward functions, or conducting original interpretability research at a frontier lab — you're in the strongest career position in the AI economy. Every capability advance creates more work for you. The mathematical novelty required is irreplaceable.
If you're primarily running established RLHF pipelines, implementing published alignment techniques, or benchmarking models against existing safety evaluations without contributing novel research — you're closer to an RL Engineer (64.7) than an Alignment Researcher (86.1). The protection comes from mathematical creativity, not pipeline execution.
The single biggest factor: originality of research contribution. The $300K+ roles go to researchers who invent new ways to specify rewards, represent values, and verify alignment. Running someone else's Constitutional AI prompts on a new model is engineering, not alignment research.
What This Means
The role in 2028: Alignment researchers in 2028 will tackle alignment for increasingly autonomous multi-agent systems, recursive self-improvement, and models with superhuman capabilities in specific domains. Process reward models will have matured, Constitutional AI will have evolved beyond text, and formal verification of alignment properties will be an active research frontier. Automated tools will handle routine alignment benchmarking, freeing researchers to focus on the hardest open problems: specifying values for systems whose capabilities exceed human understanding.
Survival strategy:
- Maintain frontier mathematical contributions. Novel reward modelling techniques, improved RLHF/DPO/GRPO methods, formal value alignment proofs — original research published at top venues is the primary career currency.
- Build depth across the alignment stack. Specialise in reward modelling, interpretability, or Constitutional AI — but understand the full alignment pipeline. The most valuable researchers can connect theoretical alignment properties to practical training outcomes.
- Develop cross-lab relationships. The alignment community is small and collaborative. Conference presence, cross-lab collaborations, and mentoring build the network that sustains a long career in a field with extreme employer concentration.
Timeline: This role strengthens over the next 10+ years. The driver is AI capability growth itself: more powerful systems require more sophisticated alignment research. Demand declines only if AI development slows or a complete, verified solution to the alignment problem is found; there is currently no indication of either.