
Agentic AI Engineer

The generalist of the agentic stack. Owns the full path from prompt to production: tool design, retry logic, eval harnesses, latency budgets, cost ceilings. Equally fluent in API contracts and prompt design.

Indicative comp: $180K – $260K base (US, senior)

Ranges are indicative US base salary at senior level. Actual offers depend on company stage, equity, and candidate strength.

What this role actually owns

  • Design and ship multi-step agent loops with tool use, memory, and graceful failure modes.
  • Own evals — define golden sets, write graders, run regression checks before every release.
  • Tune system prompts against measurable rubrics, not vibes.
  • Own the cost and latency profile of agent runs in production.
  • Partner with PM and design on what the agent surfaces, hides, and asks for.

What we screen for

  • 5+ years software engineering, 1+ year shipping production LLM features.
  • Has run an agent in production — not just notebooks. Can talk concretely about a failure mode they fixed.
  • Comfortable with streaming, tool use, and structured output APIs.
  • Strong opinions about evals; can describe a grader they wrote.
  • Bonus: open-source contributions to LangGraph, Inspect, BAML, or similar.

Sample job description

A starting point you can paste into your ATS and adjust. The exact wording matters less than the rubric — the bullets above are what we'll calibrate against during search.

Agentic AI Engineer

Builds production agents end-to-end — tool use, memory, evals, and the unglamorous reliability work in between.

You'll own:

  • Design and ship multi-step agent loops with tool use, memory, and graceful failure modes.
  • Own evals — define golden sets, write graders, run regression checks before every release.
  • Tune system prompts against measurable rubrics, not vibes.
  • Own the cost and latency profile of agent runs in production.

We're looking for:

  • 5+ years software engineering, 1+ year shipping production LLM features.
  • Has run an agent in production — not just notebooks. Can talk concretely about a failure mode they fixed.
  • Comfortable with streaming, tool use, and structured output APIs.
  • Strong opinions about evals; can describe a grader they wrote.
  • Bonus: open-source contributions to LangGraph, Inspect, BAML, or similar.