Job Information
ERCOT Site Reliability Engineer (Java focused) Sr or Lead in Taylor, Texas
At ERCOT, our diverse and dynamic work environment provides a platform on which employees can work together to build the future of the Texas power grid and wholesale market utilizing the latest technologies and resources. We encourage you to join our talented, dedicated workforce to develop world-class solutions for today’s and tomorrow’s energy challenges while learning new skills and growing your career.
ERCOT is committed to fostering inclusion at all levels of our company. It is the cornerstone of our corporate values of accountability, leadership, innovation, trust, and expertise. We know that individuals with a wide variety of talents, ideas, and experiences propel the innovation that drives our success. An inclusive and diverse workforce strengthens us and allows for a collaborative environment to solve the challenges that face our industry today and in the future.
JOB SUMMARY
ERCOT is seeking a Senior or Lead Site Reliability Engineer (SRE) with strong Java application expertise to ensure the availability, performance, and reliability of mission-critical systems. This role follows ERCOT-specific SRE processes and principles, which include managing site failover between two datacenters and, in the future, treating Azure as an extended datacenter. You will work deeply with Java codebases while owning production health and operational excellence.
JOB DUTIES INCLUDE:
Core Responsibilities
Own reliability, availability, latency, and scalability of Java-based systems
Define and track SLIs, SLOs, and error budgets
Design and maintain monitoring, alerting, logging, and dashboards
Lead incident response and conduct blameless postmortems
Reduce operational toil through automation and tooling
Review system designs for reliability and failure modes
(Lead level) Establish reliability standards and mentor engineers
Java & Application Responsibilities
Debug and improve Java applications (Spring Boot preferred)
Perform JVM tuning and performance analysis
Diagnose failures across databases, messaging, and APIs
Partner with development teams to improve resilience and recovery
On-Call & Incident Response
Participate in an on-call rotation for supported services
Focus on engineering solutions rather than repetitive manual work
Emphasize post-incident learning and automation
Track and actively reduce toil
EXPERIENCE:
5+ years (Senior) or 10+ years (Lead) in SRE, DevOps, or Production Engineering
Strong Java experience (Spring-based systems)
Experience with distributed, high-availability systems
Expertise in observability tools (metrics, logs, traces)
CI/CD experience (Git, Maven, Jenkins)
Strong cross-layer debugging skills
CS or related degree required
PREFERRED
Python
Kubernetes or OpenShift
Microsoft Azure
Kafka or ActiveMQ
Infrastructure automation (Terraform, Azure Resource Manager, Ansible, Liquibase)
Chaos or load testing experience
Observability & Production Tooling
Strong hands-on experience with observability and APM platforms such as Splunk, Dynatrace, and Datadog
Expertise in using Metrics, Logs, Traces, and Profiling (MLTP) to troubleshoot complex production incidents
Experience with the Grafana LGTM stack for observability (Loki for logs, Grafana for dashboards and visualization, Tempo for traces, and Mimir for metrics)
Experience correlating application performance data with system behavior to identify root causes and prevent recurrence
WORK LOCATION – Taylor, TX:
Employees are required to be on-site in Taylor, TX a minimum of 2 days per week, or more as business needs require, as determined by management.
On-site schedules are flexible or may be rotated based on business needs as determined by the Manager.
Remote work is required to be performed from your Texas residence.
Employees may opt to work on-site more than required or 100% of the time.
The foregoing description reflects the minimum qualifications and the essential functions of the position that must be performed proficiently with or without reasonable accommodation for individuals with disabilities. It is not an exhaustive list of the duties expected to be performed, and management may, at its discretion, revise or require that other or different tasks be performed as assigned. This job description is not intended to create a contract of employment with ERCOT. Both ERCOT and the employee may exercise their employment-at-will rights at any time.
ERCOT is firmly committed to equal employment for all qualified persons without regard to race, sex, medical condition, religion, age, creed, national origin, citizenship status, marital status, sexual orientation, physical or mental disability, ancestry, veteran status, genetic information or any other protected category under federal, state or local law.
Expected Salary Range:
$99,230 - $168,715