Global Production Systems Engineer

at Meta

Location

Menlo Park, CA; New Albany, OH

Compensation

$144k–$204k USD

Type

Full-time

Posted

2 weeks ago


Market range · company + function + seniority

p25 $176k · target $204k · p75 $245k · n=61

Posted $204k · in the market band

Job description

Meta is seeking an experienced, forward-thinking Production Systems Engineer to join the Data Center Operations team. Our data centers, and the tens of thousands of servers installed in them, are the foundation on which our rapidly scaling infrastructure operates efficiently and our services are delivered. Meta is at the leading edge of the global data center industry in both how data centers are designed and how they are operated. This role requires prioritizing competing workstreams based on operational impact and adjusting plans as infrastructure needs evolve.

The candidate we seek is an IT professional with deep experience using a diverse set of software tools to identify automation solutions for complex operational issues. The role is deeply cross-functional: it takes the technical needs of frontline users into account to identify and automate diagnostic tooling, enabling high-quality, efficient delivery of production servers. The candidate should be able to perform deep data analysis to drive decisions on the top priorities for automating server repairs in a hyperscale environment. This role requires driving solutions through code and collaborating effectively with globally distributed teams through clear written and verbal communication. Experience managing servers, programming in scripting languages, and administering Linux systems is required.

Responsibilities

Identify and root-cause systemic issues in the fleet and drive resolutions. Deliver maximum server fleet uptime and utilization rates by using data to understand hardware failure conditions and their root causes
Write and review code, develop documentation, and debug the hardest problems, live, on some of the largest and most complex systems in the world
Own and develop diagnostic tooling requirements to run the fleet
Own and drive the escalation process for Data Center Operations to identify, root-cause, and solve complex tooling and hardware issues affecting the fleet
Execute operational validation and verification activities for new product integrations
Collaborate consistently with cross-functional tooling teams to help determine root causes and provide input into their development process, with an operations-centric view of how open issues are affecting the fleet
Build cross-functional relationships and influence policies and procedures to improve global data center operations
Mentor team members to evaluate and identify better ways to resolve issues and define updates to tools and processes
Travel up to 25% to support global data center operations

Minimum Qualifications

Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
6+ years of experience in production systems engineering, infrastructure engineering, or systems software development for large-scale hardware environments
6+ years of experience with hardware lifecycle management, fleet automation, or data center operations systems spanning compute, storage, or networking infrastructure
Experience developing systems software or tooling in Python, PHP, C, or C++ for Linux-based production environments at scale
Experience in configuration and maintenance of applications such as web servers, load balancers, relational databases, storage systems and messaging systems
Experience communicating technical designs and infrastructure decisions through written documentation and cross-functional stakeholder alignment across engineering and operations teams
Experience designing or operating configuration management and infrastructure-as-code systems for large heterogeneous hardware fleets
Experience supporting global, multi-site data center infrastructure deployments including hardware qualification and regional rollout coordination
Familiarity with distributed systems monitoring, alerting, and automated remediation pipelines at hyperscale

Visa

Meta sponsors H-1B
