AWS Incident Response (AIR) keeps AWS working for millions of customers. When major incidents hit, AIR leads the response, coordinating resolvers across AWS and driving mitigation. We move fast, but not carelessly, obsessing over observability of the cloud and perpetually improving our detection and response speed and accuracy. We ensure each incident drives improvements that strengthen AWS. It's a high-visibility, high-impact role with a global view of AWS health that few teams get to see.
The Role
As a Support Engineer on AIR's Seattle team, you'll be on the front line of AWS incident response. You'll lead high-severity calls, triage complex failures across distributed systems, coordinate resolver teams, and drive incidents to mitigation while millions of customers depend on the outcome. Between incidents, you'll obsess over metrics and detection analysis, building dashboards and mechanisms that surface problems before customers notice. You will drive operational improvements that make the incident management ecosystem faster and more accurate.
This isn't a role where you watch dashboards and robotically follow runbooks. You'll deep-dive the largest, most complex technical environment in the world. You'll develop expertise across AWS services, networking, and infrastructure. You'll own operational processes end-to-end and use data to find the next leap in how we protect the cloud. If interested, you'll also have the opportunity to grow your development skills by taking on coding projects matched to your ability level.
This role includes participation in an on-call rotation, including some weekends and holidays.
Key job responsibilities
Incident Response
Lead high-severity incident response calls. Triage, coordinate resolvers across AWS service teams, communicate clearly under pressure, and drive incidents to mitigation. Manage escalations and ensure accurate documentation throughout.
Operational Excellence and Detection
Own and run operational health reviews. Build and maintain dashboards, metrics, and monitoring that surface trends before they become incidents. Obsess over detection accuracy and speed. Detect patterns across events and drive proactive mechanisms to prevent recurrence.
Metrics and Analysis
Deep-dive operational data to identify systemic issues, measure response effectiveness, and prioritize improvements. Use metrics to tell the story of what's working, what's degrading, and where the next risk is hiding.
Process and Tooling Improvement
Identify gaps in operational processes, documentation, and tooling. Build or improve mechanisms that reduce time-to-detection and time-to-mitigation. Use data to prioritize where effort has the highest impact.
Automation and Generative AI
Leverage scripting, generative AI, and automation to accelerate incident response, improve detection, and reduce toil. Identify opportunities where AI can augment human judgment during incidents or surface insights from operational data at scale.
Driving Continuous Improvement
Ensure each incident makes AWS stronger. Work with service teams to ensure learnings from incidents drive corrective actions and that follow-through happens. Close the loop between what broke and what gets fixed.
- 2+ years of technical support experience
- Direct experience participating in incident response for production systems
- Strong understanding of operating systems (Linux), networking fundamentals, and distributed systems
- Experience with operational monitoring, alerting, and metrics (CloudWatch, Datadog, Grafana, or equivalent)
- Demonstrated ability to troubleshoot complex technical problems spanning multiple systems or services
- Experience scripting or programming in at least one modern language (Python, Bash, Go, or similar)
- Ability to clearly break down technical complexity for a wide range of audiences, from engineers to senior leadership, without relying on jargon
- Familiarity with incident management tooling and workflows
- Experience with AWS services and cloud infrastructure
- Experience using generative AI or automation to solve operational problems or accelerate workflows
- Track record of authoring post-incident analyses (post-mortems) and driving corrective actions to completion
- Experience building operational dashboards, runbooks, or automation that improved team efficiency
- Experience coordinating across globally distributed teams and time zones
Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.
Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, WA, Seattle - 90,400.00 - 158,200.00 USD annually