Job Information

Amazon Lead Engineer for Manufacturing and Datacenter Lab, Trainium Manufacturing, Quality and Reliability in Austin, Texas

Description

Within the Trainium Manufacturing Quality & Reliability (TRN MQR) organization, we are establishing a critical new function that bridges manufacturing outcomes with datacenter operational performance. We are seeking a talented and motivated Manufacturing & Datacenter Preparedness Lab Leader to build and lead this strategic capability in Austin, Texas.

This role will report to the leader of Trainium Manufacturing Quality & Reliability and serve as the essential feedback loop between our ODM/JDM/CM manufacturing operations and AWS datacenter fleet performance. You will establish and operate a specialized preparedness lab focused on analyzing datacenter performance of manufactured Trainium systems to identify root causes of field rework and repairs, feeding critical insights back into manufacturing processes, test strategies, and design improvements.

You will participate in the early phase of manufacturing line development for our next-generation servers and racks, improving manufacturing flows that inform system design, manufacturing, and fleet operations. You will manage early lifecycle changes, identify initial product quality improvements, and drive supplier quality activities to technical root cause. The ideal candidate has experience in design or manufacturing and is capable of making wide-ranging business decisions on behalf of the organization.

You'll join a diverse team working across Manufacturing Engineering, Manufacturing Test Engineering, and Quality & Reliability Engineering. You'll collaborate with people across AWS Data Center Engineering, Hardware Design, ODM/JDM/CM partners, and datacenter operations teams to help us deliver the highest standards for safety and reliability while providing seemingly infinite capacity at the lowest possible cost for our customers. And you'll experience an inclusive culture that welcomes bold ideas and empowers you to own them to completion.

Key job responsibilities

  • Own operational production performance of Trainium systems across the entire product lifecycle, from manufacturing through datacenter deployment and fleet operations

  • Design and build a preparedness lab replicating datacenter conditions for assembly, repair, and system testing

  • Define and drive assembly and repair recipes in the manufacturing lab as the baseline prior to high volume manufacturing and datacenter deployment

  • Ensure all manufacturing and datacenter test flows are regressed in the manufacturing lab prior to deployment

  • Influence hardware design strategy for Design for Manufacturing (DFM), Design for Reliability (DFR), and Design for Test (DFT) based on field failure analysis

  • Establish data-driven analytics frameworks connecting manufacturing test data to datacenter performance, leveraging ML techniques to predict field failures

  • Build and mentor a cross-functional team spanning manufacturing, test, quality, and reliability engineering; perform technical promotion assessments as a force multiplier

  • Collaborate with AWS datacenter operations teams to understand failure modes, repair patterns, and operational challenges firsthand; translate operator insights and field learnings into actionable manufacturing process improvements and design changes

  • Drive continuous improvement reducing failure rates and lifecycle degradation through rapid feedback loops

  • Develop or adapt manufacturing processes at the ODM and CM, including defining fixture requirements, critical assembly requirements, test methodology, and signal integrity, power, and heat management requirements

About the team

Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro (K2), Graviton, Inferentia, and Trainium families of processors.

Machine Learning Annapurna functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization.

We are the Trainium Servers and Systems organization under MLA focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability.

This position is in the Manufacturing, Quality and Reliability team.

Basic Qualifications

  • BS or MS degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, Industrial Engineering, or related technical fields

  • 8+ years industry experience in one or more of the following: Manufacturing Engineering, Test Engineering, Quality Engineering, Reliability Engineering, or Datacenter Infrastructure Engineering

  • 7+ years working directly with engineering teams in cross-functional environments

  • Experience with AI/ML acceleration systems, high-performance computing servers, or complex multi-rack systems

  • Demonstrated track record delivering stable, performant hardware solutions meeting cost and quality targets

  • Experience with System Mechanical & Thermal design for air-cooled and liquid-cooled systems

  • Strong problem-solving capabilities to isolate, define, and resolve complex problems spanning manufacturing quality and field reliability

  • Experience with root cause analysis methodologies (8D, 5-Why, Fishbone, FMEA) and implementing corrective/preventive actions

  • Proficiency in data analysis tools, statistical methods, and programming or scripting (Python, Bash/shell scripting) in Linux environments

  • Ability to take a hardware concept from requirements through fabrication and deployment

  • Ability to collaborate effectively with teams spanning multiple sites and develop detailed specifications for product teams

  • Experience working with ODMs, JDMs, component vendors, and internal design teams on cross-boundary triaging, debugging, and resolving issues

  • Strong communication skills with ability to influence senior leadership and cross-functional stakeholders

  • Experience in Design for Manufacturing (DFM), also known as Design for Manufacturability, a product design approach that focuses on optimizing the ease and cost of manufacturing a product

  • Ability to take on a complex hardware engineering problem and design a project strategy that splits work appropriately for parallel development

Preferred Qualifications

  • Experience in management of datacenter operations, facility engineering operations, information technology critical environment facilities, advanced high volume manufacturing, datacenter build-outs and scaling, or similar fields

  • Knowledge of AWS services including compute, storage, networking, security, databases, machine learning, and serverless technologies

  • Experience understanding electrical and mechanical systems involved in critical data center operations including systems such as feeders, transformers, generators, switch gear, UPS systems, ATS units, PDU units, chillers, pumps, or air handling units

  • Master's degree in Electrical Engineering, Mechanical Engineering, Computer Engineering, Industrial Engineering, or related technical fields

  • Hands-on design experience with enterprise hardware design, sled level design, and rack level designs

  • Track record of implementing data-driven process improvements that measurably reduced field failure rates or improved manufacturing yield

  • Experience with liquid cooling systems, direct-to-chip cooling, coolant distribution units (CDUs), and thermal management for AI/ML workloads

Amazon is an equal opportunity employer and does not discriminate on the basis of protected veteran status, disability, or other legally protected status.

Our inclusive culture empowers Amazonians to deliver the best results for our customers. If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit https://amazon.jobs/content/en/how-we-hire/accommodations for more information. If the country/region you’re applying in isn’t listed, please contact your Recruiting Partner.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and an option for Supplemental Life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, TX, Austin - 159,200.00 - 215,300.00 USD annually
