Job Information
The Hartford Staff Engineer, Reliability in Hyderabad, India
IND - Staff Engineer, Reliability - GCC070
We’re determined to make a difference and are proud to be an insurance company that goes well beyond coverages and policies. Working here means having every opportunity to achieve your goals – and to help others accomplish theirs, too. Join our team as we help shape the future.
The Cloud Services Team is searching for a Reliability Engineer. The candidate must have hands-on experience operating and engineering services on Google Cloud Platform (GCP), including data, compute, and observability services. The team is accountable for the operations, engineering, and governance of 200+ cloud technologies across a multi-cloud environment. The role requires helping mature operational practices for GCP workloads as part of our multi-cloud strategy. This is an excellent opportunity for someone who is interested in a mix of strategy and hands-on work. The ideal candidate should feel comfortable working with teammates at all levels of the organization, including leadership.
Key Responsibilities
Assist in the development, maintenance, and operations of IT services spanning 200+ infra services in our Cloud transformation landscape.
Develop solutions and drive adoption of enterprise solutions such as Cyber Protection, Disaster Recovery, and Security enhancements across line-of-business teams.
Drive improvement, through automation, of software delivered as a service from an efficiency and simplicity perspective.
Provide clear operational documents and construction/support specifications to the IT user base.
Provide insight into operational Metrics across the entire Cloud Environment.
Consult with customers on new requirements, design questions, and functionality configurations for environments on and off premises.
Deliver the tooling and capabilities needed to enable cloud compliance, metrics and reporting, and the cost management roadmap and strategy.
Participate in incident resolution and change implementation as necessary. This may occasionally include support during non-standard hours.
Operate and improve reliability for production workloads running on Google Cloud Platform (GCP), focusing on availability, scalability, and operational readiness rather than application development.
Own day‑to‑day operational concerns for core GCP services including Compute Engine, GKE, Cloud Run, BigQuery, Cloud Storage, and supporting platform services.
Provide operational support for BigQuery platforms including job performance troubleshooting, capacity planning, quota management, dataset permissions, and cost optimization (slot usage, reservations, and quotas).
Support Vertex AI platforms from an operations and reliability standpoint, including environment readiness, access controls, monitoring, pipeline execution health, and incident response (not model development).
Build and maintain observability standards using Cloud Monitoring, Cloud Logging, Error Reporting, and custom SLI/SLO dashboards for GCP workloads.
Implement alerting strategies aligned to error budgets and production reliability goals; reduce alert noise and prevent toil.
Execute incident response, triage, and post‑incident analysis for GCP services, contributing to PIRs and corrective actions.
Develop and maintain runbooks, operational playbooks, and escalation workflows for GCP services.
Drive automation-first operations, including self‑healing patterns using Cloud Functions, Cloud Run jobs, Scheduler, and event‑driven remediation.
Enforce and operate GCP security and governance controls, including IAM, service accounts, Org Policies, VPC Service Controls, KMS, Secret Manager, and networking guardrails.
Partner with engineering and data teams to review designs for operability, resiliency, and supportability, ensuring workloads meet production readiness standards before launch.
Required Skills & Experience
Expert understanding of how applications should be engineered, following best practices for fault tolerance, separation of duties, observability, and operator friendliness.
Self-motivated and results-oriented, with the ability to work both in a team environment and independently.
Strong hands-on experience with BigQuery, including performance tuning, cost management, and governance.
Experience with Vertex AI, including pipelines, model deployment, model monitoring, and integration with BigQuery.
Deep knowledge of Cloud IAM, service accounts, Workload Identity Federation, and principle-of-least-privilege controls.
Experience with GKE operations (clusters, node pools, autoscaling, workload identity, Istio/Anthos optional).
Understanding of Cloud Storage, Pub/Sub, Dataflow, Dataproc, and Cloud Composer for data/ML workflows.
Experience building CI/CD pipelines targeting GCP using Cloud Build, Artifact Registry, and Terraform.
Ability to troubleshoot GCP networking: VPCs, firewall rules, private service access, interconnects/VPN.
Nice to Have
Intermediate knowledge of Terraform and CloudFormation required.
Intermediate Microsoft Office skills.
Hands-on experience with advanced GCP services such as Vertex AI, BigQuery, Dataflow, Pub/Sub, Cloud Run, and GKE.
Experience creating org-level policies, security baselines, and automation patterns for GCP environments.
What We Offer
Collaborative work environment with global teams.
Competitive compensation and comprehensive benefits.
Continuous learning and growth opportunities in geospatial and risk analytics technologies.