Agentic AI Engineer
The generalist of the agentic stack. Owns the full path from prompt to production: tool design, retry logic, eval harnesses, latency budgets, cost ceilings. Equally fluent in API contracts and prompt design.
Indicative comp: $180K – $260K base (US, senior)
Ranges are indicative US base salary at senior level. Actual offers depend on company stage, equity, and candidate strength.
What this role actually owns
- Design and ship multi-step agent loops with tool use, memory, and graceful failure modes.
- Own evals — define golden sets, write graders, run regression checks before every release.
- Tune system prompts against measurable rubrics, not vibes.
- Own the cost and latency profile of agent runs in production.
- Partner with PM and design on what the agent surfaces, hides, and asks for.
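To make the first bullet concrete, here is a minimal sketch of the kind of multi-step agent loop this role owns: model call, tool execution with retries, and graceful failure fed back to the model instead of a crash. Everything here is illustrative (the `call_model` stub, the `search` tool, the retry counts are assumptions, not a specific framework's API):

```python
# Hypothetical tool registry; names are illustrative.
TOOLS = {
    "search": lambda query: f"results for {query!r}",
}

def call_model(messages):
    """Stub standing in for a real LLM API call.

    Returns a tool call on the first turn, then a final answer once a
    tool result is already in the transcript.
    """
    if any(m["role"] == "tool" for m in messages):
        return {"type": "answer", "text": "done: " + messages[-1]["content"]}
    return {"type": "tool_call", "name": "search", "args": {"query": "agent evals"}}

def run_agent(user_msg, max_steps=5, max_retries=2):
    """Multi-step loop: call model, execute tools, feed results back."""
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "answer":
            return reply["text"]
        tool = TOOLS.get(reply["name"])
        if tool is None:
            # Graceful failure: report the problem to the model, don't crash.
            messages.append({"role": "tool", "content": f"error: unknown tool {reply['name']}"})
            continue
        result = "error: retries exhausted"
        for _attempt in range(max_retries + 1):
            try:
                result = tool(**reply["args"])
                break
            except Exception as exc:  # retry transient tool failures
                result = f"error: {exc}"
        messages.append({"role": "tool", "content": str(result)})
    return "gave up after max_steps"
```

The bounded `max_steps` and per-tool retry cap are exactly the "graceful failure modes" the bullet refers to: the loop always terminates and always returns something the caller can handle.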
What we screen for
- 5+ years software engineering, 1+ year shipping production LLM features.
- Has run an agent in production — not just notebooks. Can talk concretely about a failure mode they fixed.
- Comfortable with streaming, tool use, and structured output APIs.
- Strong opinions about evals; can describe a grader they wrote.
- Bonus: open-source contributions to LangGraph, Inspect, BAML, or similar.
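"Can describe a grader they wrote" is a deliberately concrete bar. A minimal version, sketched here with hypothetical fields and thresholds (the golden set, the `agent` stub, and the 0.95 gate are all assumptions for illustration), looks like this:

```python
# Golden set: inputs paired with phrases the agent's answer must contain.
GOLDEN_SET = [
    {"input": "refund policy?", "must_contain": ["30 days"]},
    {"input": "shipping time?", "must_contain": ["5-7", "business days"]},
]

def agent(prompt):
    """Stub for the system under test; a real harness calls the deployed agent."""
    canned = {
        "refund policy?": "Refunds within 30 days.",
        "shipping time?": "Ships in 5-7 business days.",
    }
    return canned.get(prompt, "")

def grade(case, output):
    """Binary grader: pass iff every required phrase appears in the output."""
    return all(phrase in output for phrase in case["must_contain"])

def run_regression(threshold=0.95):
    """Gate a release on golden-set pass rate, as in the evals bullet above."""
    results = [grade(case, agent(case["input"])) for case in GOLDEN_SET]
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold
```

A candidate who clears this bar can explain why the grader is binary vs. scored, how the golden set was sampled, and what the threshold does to release cadence.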
Sample job description
A starting point you can paste into your ATS and adjust. The exact wording matters less than the rubric — the bullets above are what we'll calibrate against during search.
Agentic AI Engineer
Builds production agents end-to-end — tool use, memory, evals, and the unglamorous reliability work in between.
You'll own:
- Design and ship multi-step agent loops with tool use, memory, and graceful failure modes.
- Own evals — define golden sets, write graders, run regression checks before every release.
- Tune system prompts against measurable rubrics, not vibes.
- Own the cost and latency profile of agent runs in production.
We're looking for:
- 5+ years software engineering, 1+ year shipping production LLM features.
- Has run an agent in production — not just notebooks. Can talk concretely about a failure mode they fixed.
- Comfortable with streaming, tool use, and structured output APIs.
- Strong opinions about evals; can describe a grader they wrote.
- Bonus: open-source contributions to LangGraph, Inspect, BAML, or similar.