Job Information
Intercontinental Exchange (ICE) Senior Site Reliability Engineer in Pune, India
Job Description
Job Purpose
Site Reliability Engineer (SRE) to assist with day-to-day activities supporting SRE services related to incidents. Build actionable alerts and automation to prevent incidents, detect performance bottlenecks, and identify maintenance activities.
Responsibilities
Employ deep troubleshooting skills to improve the availability, performance, and security of IMT Services
Code and automate applications on the Cloud Platform
Implement automated tests, automated deployments, and operational tools
Collaborate with Product and Support teams to plan and deploy product releases
Work with Cloud Platform and Operations leaders to develop narratives, backlog grooming, epic planning, and overall sprint planning processes
Work with Engineering leadership to build shared services that meet the requirements and needs of the platform and application teams
Ensure services are designed with 24/7 availability and operational readiness and rigor
Implement proactive monitoring, alerting, trend analysis, and self-healing systems
Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
Contribute to product development / engineering as needed to ensure Quality of Service of Highly Available services
Identify, evaluate, and execute preventive measures to minimize or avoid impact to the customer experience (proactive vs. customer-escalated)
Resolve product/service defects or design changes, infrastructure changes, or operational changes
Partner with other SREs and lead by example: be a contributor more than a delegator
Knowledge and Experience
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
7+ years of Systems/Applications automation in 24x7 Production support services environments
Fluency with one or more current-generation scripting languages (Python/Shell/Perl/PHP/Ruby) and/or Java development and/or .NET
Excellent troubleshooting skills, utilizing a systematic problem-solving approach
Demonstrated experience in designing, analysing, and diagnosing large-scale distributed systems, plus Windows Server and/or Linux system internals (system libraries, file systems, client-server protocols)
Experience with elastic scalability, fault tolerance, and other cloud architecture patterns
Experience operating on AWS (both PaaS and IaaS offerings)
Experience in both Windows (2k8R2+) and Linux
Experience with Continuous Integration and Continuous Delivery concepts
Hands-on experience with infrastructure-as-code tools such as Terraform or CloudFormation, and/or Chef, SaltStack, Ansible, or Puppet
Good to have: experience with containerization tools such as Docker
Proven strength in SaaS services and experience in massive-scale web operations