Rack Scale Serviceability & Telemetry Architect

at AMD

Location

Austin, Texas

Compensation

$179k–$306k USD

Type

Full-time

Posted

Yesterday

Remote

Yes

Market range · company + function + seniority

$217k—$268k—$318k

p25 · target · p75 · n=106

Posted $306k · in the market band

Job description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

Rack Scale Serviceability & Telemetry Architect

THE TEAM

AMD’s Data Center GPU Systems Architecture team defines next-generation AMD Instinct platforms and complete rack-scale solutions for hyperscale AI and HPC deployments. We work across silicon, GPU system firmware, server and board architecture, BMC/platform firmware, management software, security, validation, manufacturing, and ecosystem partners to turn product strategy into deployable, serviceable, production-ready platforms.

THE ROLE

AMD is seeking a Principal Member of Technical Staff (PMTS) to own the architecture for rack-scale serviceability and telemetry across AMD Instinct product lines and complete rack-scale solutions. This is a highly visible technical leadership role responsible for defining the end-to-end manageability, observability, and serviceability architecture spanning node, chassis/tray, rack, and fleet domains. You will drive the strategy, architecture, execution, and delivery of standards-based solutions for inventory, discovery, health monitoring, telemetry, eventing, diagnostics, firmware lifecycle management, and field service workflows across the full AMD rack-scale stack.

In this role, you will independently own a critical cross-product architecture area and drive alignment across GPU/SoC architecture, server/platform architecture, BIOS/UEFI, BMC and embedded software, security, RAS, validation, ODM/OEM partners, and customer-facing teams. The role spans early concept definition through bring-up, validation, deployment, and post-launch improvement.

THE PERSON

The ideal candidate is a deeply technical system architect with strong first-principles thinking and a track record of delivering manageability, telemetry, and serviceability solutions for servers, accelerators, storage, networking, or rack-scale AI/HPC platforms. You are equally comfortable setting long-range technical direction and diving hands-on into protocol definitions, interface design, telemetry models, bring-up, debug, and root-cause analysis. You thrive in ambiguity, influence without authority, raise execution quality across teams, and exemplify AMD’s values through direct, humble, collaborative, and inclusive leadership.

KEY RESPONSIBILITIES

Define and own the end-to-end rack-scale serviceability and telemetry architecture for AMD Instinct-based solutions, spanning node BMC, chassis/rack management, service processors/controllers, management network, and fleet-level observability integration.
Define the standards strategy and interface architecture using DMTF Redfish, PLDM, MCTP, and related specifications, maximizing standards compliance while establishing AMD/OEM extensions only where required.
Drive OpenBMC-based architecture and implementation direction for BMC and rack management controllers, including D-Bus object models, bmcweb/Redfish requirements, sensor and FRU inventory models, logging, eventing, firmware update, and debug workflows.
Architect telemetry frameworks for health, power, thermal, inventory, error, utilization, and service data. Define schemas, metric taxonomies, triggers, event models, aggregation, retention, and reporting strategies required for at-scale observability and automated service operations.
Define platform serviceability flows covering discovery, inventory correlation, fault isolation, diagnostics, crashdump and error capture, remote recovery, FRU replacement, firmware/driver update orchestration, and return-to-service procedures.
Partner with GPU/SoC architects, board and system architects, firmware and software teams, security/RAS, validation, manufacturing, and customer engineering to translate requirements into production-ready architecture and deliverables.
Work closely with ODM/OEMs and ecosystem partners to review designs, close gaps, guide implementation trade-offs, and deliver robust reference solutions and customer platforms on schedule.
Drive validation and conformance strategy for manageability and telemetry, including interoperability, Redfish/PLDM compliance, fault injection, service workflow validation, scale testing, and field debug methodology.
Influence future AMD Instinct platform roadmaps using insights from bring-up, partner integrations, deployment learnings, and telemetry-driven data.
Represent AMD in relevant standards and open-source communities, including DMTF and OpenBMC forums, and guide upstream/downstream strategy where appropriate.
Mentor engineers and architects across the organization and serve as the senior technical point of contact for rack-scale serviceability and telemetry.

PREFERRED EXPERIENCE

Expert-level experience in platform architecture, system management, BMC/embedded firmware, server manageability, or adjacent domains, including significant time in architect or technical leadership roles.
Proven experience defining serviceability/manageability architecture for servers, accelerators, storage, networking, or rack-scale infrastructure in datacenter, cloud, AI, or HPC environments.
Deep knowledge of DMTF Redfish, including schema design, OEM extension strategy, eventing, update service, and telemetry concepts such as MetricReportDefinition/Metric Reports; strong understanding of PLDM/MCTP for platform inventory, monitoring, control, and update workflows.
Strong hands-on experience with OpenBMC, including Yocto/OpenEmbedded, D-Bus, systemd, bmcweb/Redfish, phosphor services, firmware update flows, sensor frameworks, and log/event handling.
Experience with embedded Linux, ARM-based BMC SoCs, U-Boot, Linux kernel/device driver concepts, device tree, and low-level interfaces such as I2C/I3C, SPI, UART, GPIO, SMBus/PMBus, and related platform-management buses.
Strong understanding of server/platform RAS and serviceability features such as health monitoring, error logging, crashdump, diagnostics, inventory/FRU management, and remote recovery.
Experience with secure manageability architectures, including secure boot, root of trust, attestation, firmware signing, SPDM, and protection of out-of-band management paths.
Experience creating architecture specifications, product requirements, conformance plans, validation strategies, and design reviews that drive execution across multiple internal teams and external partners.
Strong programming and scripting background in C/C++, Python, and shell, with the ability to debug across firmware, hardware, and system software boundaries.
Experience with large-scale telemetry or observability pipelines, metrics consumers, or fleet operations tooling is strongly preferred.
Experience with AMD server or GPU platforms, AI/HPC system design, liquid cooling/power/thermal infrastructure, or OCP-aligned rack architectures is a plus.
Strong written and verbal communication skills with proven ability to influence senior engineering leadership, customers, and strategic partners.

ACADEMIC CREDENTIALS

Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field. Advanced degree preferred.

LOCATION

Austin, Texas preferred. Other AMD datacenter engineering locations may be considered based on team alignment and business needs.

This role is not eligible for visa sponsorship.

#LI-BW2

#LI-HYBRID

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD’s “Responsible AI Policy” is available here.

This posting is for an existing vacancy.
