Job Information
Meta Hardware Systems Engineer, NPI in Menlo Park, California
Summary:
Meta is seeking a highly skilled and experienced Systems/Hardware Engineer to join our Release to Production (RTP) team. The RTP team is responsible for the end-to-end Hardware Lifecycle of all Meta servers, including prototyping, pre-production hands-on system validation, hardware debugging, and stress testing. As a Systems/Hardware Engineer, you will work closely with various teams, including HW/SW co-design teams, hardware designers, networking teams, system manufacturers, component vendors, capacity engineering, production engineering, production services, and data center operations teams, to enable new systems that will be deployed in our production data centers.
Required Skills:
Hardware Systems Engineer, NPI Responsibilities:
Interface with external vendors and internal teams to understand system architecture and develop Hardware Fault Management for various server products
Drive new platform enablement, hardware validation, tooling specification and integration, customer workload testing, and experiment creation to detect and diagnose hardware/firmware/software health issues
Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues
Leverage understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operation quality and cost/efficiency
Develop visibility through data visualization and implement systemic solutions to hardware health issues
Troubleshoot, diagnose, and root cause system failures, isolating components/failure scenarios while working with internal & external stakeholders
Lead bring-up, validation, and deployment of cutting-edge hardware systems in lab and datacenter environments
Design and implement robust system-level test plans, including functional, stress, and performance tests
Enhance hardware reliability by creating data visualizations and implementing systemic solutions to address recurring health issues
Minimum Qualifications:
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
6+ years of work experience in one or more domains such as: ASIC development, compute, AI-ML hardware/software, storage, memory, network, server interconnect technologies, or similar
Knowledge of architecture and components on one of the following products: server/PC/Laptop
Development or debug experience in one or more areas: hardware fault management, error reporting, error handling on hardware products
Experience with Python, C/C++ and/or similar languages, within a Linux environment, for server system management, automation, version control, CI/CD, or similar
Demonstrated problem-solving skills, with a track record of troubleshooting and resolving complex technical issues
Demonstrated communication and collaboration skills, with a track record of working effectively with cross-functional teams
Experience working in a matrix organization
Preferred Qualifications:
7+ years of experience with a subset of one of the following domains: Compute Systems, Storage Systems, Accelerated Compute Systems/HPC, Kernel/Firmware Development and/or test, Post Silicon Bringup
Experience with x86 or ARM-based CPUs and their subsystems (e.g. memory, inter-chiplet communications, RAS/DFT, performance management, power management)
Working/functional knowledge of common bus protocols such as I2C, SPI, USB, LP/DDR, and/or PCIe
Hands-on experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries. Hands-on experience managing/debugging Linux servers
Understanding of the hardware development process and how it pertains to test strategy. Experience authoring test plans for complex chipsets for functional, stress and performance testing
Familiarity with debugging tools for systems-on-chip (SoCs), e.g. JTAG, GDB, DSTREAM, Trace32
Experienced in the integration of lab tools for automated workflows with large scale deployments. Proficiency in continuous integration/continuous delivery tools
2+ years of experience scripting automation in Python or equivalent
Public Compensation:
$144,000/year to $204,000/year + bonus + equity + benefits
Industry: Internet
Equal Opportunity:
Meta is proud to be an Equal Employment Opportunity and Affirmative Action employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender, gender identity, gender expression, transgender status, sexual stereotypes, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics. We also consider qualified applicants with criminal histories, consistent with applicable federal, state and local law. Meta participates in the E-Verify program in certain locations, as required by law. Please note that Meta may leverage artificial intelligence and machine learning technologies in connection with applications for employment.
Meta is committed to providing reasonable accommodations for candidates with disabilities in our recruiting process. If you need any assistance or accommodations due to a disability, please let us know at accommodations-ext@meta.com.