Job Information
Teladoc Health Inc Sr. Site Reliability Engineer in PURCHASE, New York
Join the team leading the next evolution of virtual care. At Teladoc Health, you are empowered to bring your true self to work while helping millions of people live their healthiest lives. Here you will be part of a high-performance culture where colleagues embrace challenges, drive transformative solutions, and create opportunities for growth. Together, we're transforming how better health happens.

Summary of Position
We're looking for a Senior Site Reliability Engineer to own reliability, automation, and infrastructure-as-code for our modern Data & AI platform. In this role, you'll ensure our Azure-based data ecosystem is reliable, scalable, and efficient. You'll build Terraform-first infrastructure, improve developer experience, and support a healthcare environment where uptime and data reliability directly impact patient care.

Essential Duties and Responsibilities

Infrastructure as Code
* Build and maintain Terraform modules for data platform services (Snowflake, Airbyte, Astronomer, dbt, Kafka).
* Develop IaC standards, GitOps workflows, and automated CI/CD pipelines using GitHub Actions.
* Migrate manual configurations to fully codified infrastructure and enable self-service provisioning for engineers.

Platform Reliability & Operations
* Implement monitoring, alerting, and SLOs/SLIs for data pipelines and platform components.
* Lead incident response, root cause analysis, and postmortems.
* Create automation, runbooks, and self-healing capabilities to reduce MTTR.

Cross-Cloud Architecture
* Design secure connectivity patterns between Azure and AWS vendor systems.
* Troubleshoot networking, VPN, private endpoints, DNS, and MFT integrations.

Automation & Developer Experience
* Build CI/CD pipelines using GitHub Actions for infrastructure changes with comprehensive testing (terraform plan, validate, compliance checks).
* Implement policy-as-code using tools like Sentinel, OPA, or Azure Policy integrated into GitHub workflows.
* Develop testing frameworks for infrastructure code (Terratest, kitchen-terraform) with automated execution in GitHub Actions.
* Improve abstractions and tooling to streamline development workflows.

Performance & Cost Optimization
* Optimize Snowflake compute usage and Airflow/dbt performance.
* Apply cloud cost management practices and tagging strategies.
* Support capacity planning and forecasting.

Systems Troubleshooting & Problem Resolution
* Lead complex troubleshooting efforts across distributed systems spanning multiple cloud providers.
* Debug integration issues with Kafka streams, CDC patterns, and real-time data pipelines.
* Resolve platform-wide incidents involving Snowflake, Astronomer, Airbyte, and downstream BI tools (PowerBI, Tableau, Cube Cloud).
* Partner with vendors for escalated support cases and coordinate resolution across multiple teams.

The time spent on each responsibility reflects an estimate and is subject to change dependent on business needs.

Supervisory Responsibilities
No

Qualifications Expected for Position
* 7+ years in Site Reliability Engineering, DevOps, or Platform Engineering roles.
* 5+ years production experience with Terraform at scale.
* Strong Azure expertise; AWS experience beneficial.
* Experience operating cloud-based data platforms (Snowflake, Airflow, etc.).
* Expert GitHub knowledge (pull requests, Actions, branching strategies).
* Strong troubleshooting skills across distributed systems, networking, and data pipelines.
* Proficient in Python, Bash, PowerShell; able to read SQL and YAML/JSON.
* Strong experience with containerization and orchestration (Docker, Kubernetes).

Bonus Qualifications
* Healthcare data experience (FHIR, HL7, claims data)
* Kafka experience, dbt administration, BI tools (PowerBI/Tableau).
* Experience with data quality frameworks and synthetic data generation
* Policy-as-code t