Meta is seeking a forward-thinking, experienced Production Systems Engineer to join the Data Center Operations team. Our data centers, and the tens of thousands of servers installed in them, are the foundation upon which our rapidly scaling infrastructure efficiently operates and upon which our innovative services are delivered. Meta is at the leading edge of the global data center industry in terms of both how data centers are designed and how they are operated. This role requires prioritizing competing workstreams based on operational impact and adjusting plans as infrastructure needs evolve.
The candidate we seek is a forward-thinking IT professional with deep experience in using a range of software tools to identify automation solutions for complex operational issues. This role is deeply cross-functional and considers the technical needs of frontline users to identify and automate diagnostic tooling, which enables high-quality, efficient delivery of production servers. The candidate should be able to perform deep data analysis to drive decisions on the top priorities for automating repairs on servers in a hyperscale environment. This role requires driving solutions through code and collaborating effectively with globally distributed teams via clear written and verbal communication. Experience managing servers, programming in scripting languages, and administering Linux systems is required.
Responsibilities
- Identify and root cause systemic issues in the fleet and drive resolutions. Deliver maximum server fleet uptime and utilization rates by leveraging data to understand hardware failure conditions and their root causes
- Write and review code, develop documentation, and debug the hardest problems, live, on some of the largest and most complex systems in the world
- Own and develop diagnostic tooling requirements to run the fleet
- Own and drive the escalation process for Data Center Operations to identify, root cause, and solve complex tooling and hardware issues affecting the fleet
- Execute operational validation and verification activities for new product integration
- Collaborate consistently with cross-functional tooling teams to help determine root causes and provide input into their development process, with an operations-centric view of how open issues are affecting the fleet
- Build cross-functional relationships and influence policies and procedures to improve global data center operations
- Mentor team members to evaluate and identify better ways to resolve issues and define updates to tools and processes
- Travel up to 25% to support global data center operations
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of experience in production systems engineering, infrastructure engineering, or systems software development for large-scale hardware environments
- 6+ years of experience with hardware lifecycle management, fleet automation, or data center operations systems spanning compute, storage, or networking infrastructure
- Experience developing systems software or tooling in Python, PHP, C, or C++ for Linux-based production environments at scale
- Experience in configuration and maintenance of applications such as web servers, load balancers, relational databases, storage systems and messaging systems
- Experience communicating technical designs and infrastructure decisions through written documentation and cross-functional stakeholder alignment across engineering and operations teams
- Experience designing or operating configuration management and infrastructure-as-code systems for large heterogeneous hardware fleets
- Experience supporting global, multi-site data center infrastructure deployments including hardware qualification and regional rollout coordination
- Familiarity with distributed systems monitoring, alerting, and automated remediation pipelines at hyperscale