Job Information
iCIMS Sr. Site Reliability Engineer in Dublin 2, Ireland
Job Overview
We are seeking an experienced Sr. Engineer, Site Reliability (SRE) to drive technical excellence within our global Site Reliability Engineering organization. This role is essential to maintaining and improving the reliability, scalability, and performance of our multi-cloud SaaS platform serving thousands of customers worldwide. The successful candidate will provide hands-on technical expertise in incident response, system optimization, and reliability engineering practices across our complex technology stack. Off-hours support is required as needed.
This is a hybrid position based at Windmill Lane, Dublin 2, our strategic hub for AI development in Ireland.
Responsibilities
Technical Leadership:
Provide technical guidance within a team of 5+ site reliability engineers across one or more geographic regions (US, Ireland, or India)
Provide technical mentorship and skill development for team members
Contribute to technical decision-making for complex reliability and performance challenges
Conduct architecture reviews and provide guidance on system design for reliability
Facilitate post-incident reviews and support implementation of preventive measures
Incident Management & Response:
Participate in enterprise-wide incident management, ensuring rapid prevention, detection, response, and resolution
Develop and maintain runbooks and emergency response procedures
Conduct root cause analysis and ensure comprehensive documentation
Participate in 24/7 on-call rotation and escalation procedures across global teams
Interface with Engineering teams and Incident Manager during critical incident resolution
Platform Reliability & Performance:
Monitor and optimize multi-cloud infrastructure (AWS primary, Azure, GCP)
Ensure reliability of core services: AWS resources, Auth0/Okta authentication, databases (SQL Server, PostgreSQL, MongoDB), and legacy Java applications
Implement and maintain SLIs, SLOs, and error budgets for assigned services
Drive capacity planning and performance optimization initiatives
Automation & Tooling:
Design automation solutions to reduce manual operational overhead
Develop monitoring strategies using New Relic, Grafana, and Sumo Logic
Create infrastructure-as-code for reliable deployments
Build self-healing systems and automated remediation workflows
Qualifications
Technical Experience:
6+ years in SRE, DevOps, or Infrastructure Engineering roles with 2+ years in senior positions
Deep hands-on experience with multi-cloud environments (AWS required, Azure preferred)
Strong Linux system administration and troubleshooting
Experience with containerization (Docker) and orchestration (Kubernetes, ECS)
Proficiency with monitoring tools (New Relic, Grafana, Prometheus)
Leadership & Communication:
Proven track record mentoring and guiding technical teams
Experience serving as technical expert during critical incidents
Strong communication skills with engineering teams and stakeholders
Cross-functional collaboration in agile environments
SRE & Operations:
Demonstrated success implementing SRE principles in large-scale production environments
Experience with ITIL frameworks and tools
Background in establishing and maintaining SLAs for enterprise SaaS products
Education/Certifications/Licenses:
Bachelor’s degree in Computer Science, Engineering, Information Systems, or related technical field
Equivalent combination of education and experience will be considered
Preferred
Authentication and identity management systems knowledge
Infrastructure-as-code tools (Terraform, CloudFormation)
Education/Certifications/Licenses:
Cloud certifications (AWS, Azure, or Google Cloud)
Kubernetes certifications
New Relic/Grafana monitoring certifications
Linux certifications (RHCE, LPIC-2)