IBM Site Reliability Engineer in Yorktown Heights, New York

Introduction

IBM is seeking an experienced Site Reliability Engineer (SRE) to play a critical role in ensuring the reliability, availability, and performance of the IBM Quantum platform. In this role, you will collaborate closely with our quantum services development teams to design, build, monitor, and scale systems that power one of the world’s most advanced quantum computing platforms.

As an SRE within IBM Quantum, you are on the front line, ensuring seamless operations, rapid recovery, and user trust through operational excellence. Every day brings new engineering challenges, from responding to incidents and building automation to enhancing observability and driving system‑wide reliability improvements. You will support researchers, developers, and enterprise users exploring the future of computing while applying modern SRE principles to a deep‑tech environment.

This role is ideal for true SRE practitioners—engineers who have done SRE as their primary job.

Your role and responsibilities

  • Ensure high availability, resilience, and scalability of IBM Quantum platforms and services.

  • Lead incident response, participate in war room activities, and drive post‑incident reviews and corrective actions.

  • Collaborate with development teams to debug, deploy, and maintain quantum workloads and backend services.

  • Establish, refine, and maintain observability across logs, metrics, traces, and alerting systems.

  • Design and build internal tools, automations, and operational workflows to improve efficiency and reduce toil.

  • Champion operational ownership, ensuring every quantum job runs reliably with full traceability.

  • Drive platform‑wide improvements using operational insights, incident learnings, and reliability patterns.

Required technical and professional expertise

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience.

  • 2-5 years of proven professional experience specifically as a Site Reliability Engineer.

  • Strong systems‑thinking ability to correlate logs, traces, metrics, and code across distributed workloads.

  • Hands‑on experience with incident management, production operations, and on‑call responsibilities.

  • Experience with modern observability tools (Grafana, Sysdig, Jaeger, etc.).

  • Familiarity with Kubernetes, Linux internals, and programming in Python or Go.

  • Ability to work across development, infrastructure, and platform teams.

  • Ability to transform incident learnings into automation, fixes, or architectural improvements.

  • Understanding of SLI/SLO/SLA frameworks and reliability metrics.

Preferred technical and professional experience

  • Master’s degree in Computer Science, Engineering, or related field.

  • Experience with IBM Cloud services.

  • Familiarity with Qiskit or quantum computing concepts.

IBM is committed to creating a diverse environment and is proud to be an equal-opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, gender, gender identity or expression, sexual orientation, national origin, caste, genetics, pregnancy, disability, neurodivergence, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.