Job Information
HTC Global Services Inc Platform Engineer – Cloud Infrastructure & SRE in Dearborn, Michigan
Job Description:
Platform Engineering - Cloud Infrastructure & SRE Engineer
Overview:
Platform Engineering builds and operates shared infrastructure and paved paths that help product teams deliver securely, reliably, and quickly. This role leans toward cloud infrastructure, DevOps, and Site Reliability Engineering (SRE), with strong software development skills.
What you’ll do:
Design, build, and operate cloud infrastructure and platform capabilities (networking, compute, Kubernetes, CI/CD, secrets, certificates, identity).
Define and improve reliability using service-level indicators (SLIs), service-level objectives (SLOs), and error budgets.
Implement observability (metrics, logs, traces) with actionable alerting focused on user impact.
Create self-service workflows and automation (infrastructure as code, GitOps, build/release pipelines) that reduce toil.
Improve security and compliance through least-privilege access, secure defaults, policy-as-code, and continuous hardening.
Participate in on-call rotation, incident response, and post-incident reviews; drive systemic fixes and runbook quality.
Partner with application teams to improve deployability, resilience, and cost efficiency (capacity planning, autoscaling, graceful degradation).
Required Skills:
Experience operating production cloud platforms and services (e.g., GCP/AWS/Azure) with an SRE mindset.
Strong fundamentals in Linux, networking, distributed systems, and debugging complex production issues.
Proficiency with infrastructure as code and automation (e.g., Terraform, Helm/Kustomize, GitOps tooling).
Experience with containers and orchestration (Docker, Kubernetes) and modern CI/CD.
Programming and scripting ability (e.g., Go, Python, Java, TypeScript) to build tooling and automate workflows.
Clear communication, effective incident leadership, and a customer-focused approach to platform work.
Preferred Skills:
Experience defining SLIs/SLOs and implementing SLO-based alerting and dashboards.
Observability platform experience (e.g., Prometheus/Grafana, OpenTelemetry, centralized logging).
Policy-as-code and supply chain security (e.g., OPA/Rego, SLSA concepts, SBOMs, artifact signing).
Experience building golden paths (container images, templates, reference architectures, paved pipelines) adopted by multiple teams.
Cost optimization experience (FinOps practices, capacity forecasting, right-sizing, multi-tenant platform controls).
How we work:
Automate first: eliminate repeatable manual work; measure and reduce toil.
Reliability is a feature: design for failure with timeouts, retries with jitter, idempotency, and graceful degradation.
Small, safe changes: incremental delivery, clear rollback strategies, and continuous improvement.
Engineering excellence: design reviews, blameless postmortems, and strong documentation/runbooks.
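To make "retries with jitter" concrete, here is a minimal Python sketch of exponential backoff with full jitter. It is an illustration only, not a description of any existing HTC tooling: the function name `retry_with_jitter`, its parameters, and the default values are all assumptions, and it should only wrap operations that are idempotent.

```python
import random
import time


def retry_with_jitter(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper for illustration. `sleep` and `rng` are injectable
    so the backoff schedule can be tested without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff ceiling for this attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients retrying at once do not synchronize.
            sleep(ceiling * rng())
```

A real tool would usually catch only errors known to be transient (for example timeouts) rather than every `Exception`, and would bound total elapsed time as well as attempt count.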
What success looks like:
Platform capabilities are easy to adopt, well-documented, and measurably reduce lead time for change.
Reliability improves over time (SLO attainment, reduced incident frequency/severity, faster MTTR).
Security posture improves via secure-by-default patterns and automated controls.
Skills Required: Cloud Infrastructure, Python, GCP, Platform Support, Kubernetes
Cloud Infrastructure Expectation:
A candidate has provisioned and operated production-grade infrastructure on a major cloud provider. For example, they designed a multi-region GCP network topology using VPCs, subnets, firewall rules, and Cloud NAT, managed with Terraform and deployed via a GitOps pipeline. They understand networking primitives, IAM boundaries, and compute options, and can explain the tradeoffs between managed and self-hosted services.
Python Expectation:
A candidate has written production Python tooling or automation. For example, a script that queries the GCP Asset Inventory API to identify over-provisioned IAM bindings, generates a report, and opens a Jira ticket for remediation. Code is structured, testable (pytest), and handles errors and retries gracefully. Not just glue scripts, but maintainable tools used by a team.
GCP Expectation:
A candidate has hands-on experience operating GCP services in a real platform context. For example, running workloads on Cloud Run, using Workload Identity for pod-level IAM, configuring policies, managing secrets in Secret Manager, and setting up VPC Service Controls. They can reason about GCP-specific reliability and security patterns, not just surface-level console familiarity.
Platform Support Expectation:
A candidate has acted as a platform team member supporting internal developer customers. For example, they owned an on-call rotation, triaged and resolved incidents for shared Kubernetes or CI/CD infrastructure, led a blameless postmortem, and shipped a runbook improvement or systemic fix that prevented recurrence. They approach support as an engineering problem, not just a queue.
Kubernetes Expectation:
A candidate has operated Kubernetes clusters in production. For example, they managed cluster upgrades on GKE, wrote and debugged Helm charts or Kustomize overlays, configured RBAC and Network Policies, implemented HPA/VPA for autoscaling, and troubleshot pod scheduling failures, OOMKills, or service mesh connectivity issues. They understand the control plane well enough to debug it, not just deploy to it.
Skills Preferred: Go, Cloud Architecture, Reliability Engineering
Go Expectation:
A candidate has written Go for platform tooling or infrastructure automation. For example, a Kubernetes admission webhook (validating or mutating) that enforces security policies on workloads, or a CLI tool that wraps kubectl and Vault APIs to simplify developer secret management. Code should be idiomatic Go with proper error handling, context propagation, and unit tests.
Cloud Architecture Expectation:
A candidate has contributed to or led the design of a multi-team or multi-service platform architecture. For example, they designed a shared services network hub-and-spoke model on GCP, defined the golden path for how product teams onboard to the platform (container image standards, CI/CD templates, service mesh configuration), and documented reference architectures adopted by multiple teams. They can articulate tradeoffs and present designs for review.
Reliability Engineering Expectation:
A candidate has formally implemented SRE practices rather than having only conceptual familiarity with them. For example, they defined SLIs (e.g., request success rate, latency p99) and SLOs for a shared platform service, configured SLO-based alerting in Prometheus/Grafana that pages on burn rate rather than raw errors, maintained an error budget, and used that budget to gate or slow feature releases. They can explain how reliability engineering changes team behavior around change management.
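As an illustration of paging on burn rate rather than raw errors, the Python sketch below computes burn rate as the observed error ratio divided by the error budget (1 minus the SLO target) and pages only when both a long and a short window exceed a threshold. The function names, the window tuples, and the default threshold of 14.4 (a commonly cited value for a 1-hour window against a 30-day SLO) are assumptions for this example; in practice this logic would typically live in Prometheus alerting rules rather than application code.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO period; higher values mean it runs out sooner.
    """
    if total == 0:
        return 0.0  # No traffic in the window, so nothing was burned.
    error_budget = 1.0 - slo_target  # e.g. about 0.001 for a 99.9% SLO
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Each window is a hypothetical (errors, total) tuple. Requiring the short
    window as well avoids paging for an incident that has already recovered.
    """
    long_rate = burn_rate(long_window[0], long_window[1], slo_target)
    short_rate = burn_rate(short_window[0], short_window[1], slo_target)
    return long_rate >= threshold and short_rate >= threshold


if __name__ == "__main__":
    # 2% errors over the long window against a 99.9% SLO burns roughly 20x.
    print(should_page((200, 10_000), (30, 1_000), slo_target=0.999))
```

The same burn-rate figure can feed the release gate mentioned above: when the remaining budget for the period is exhausted, feature rollouts slow down until reliability recovers.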
Experience Required:
- Engineer 3 Exp: Prac. In 2 coding lang. or adv. Prac. in 1 lang. 6+ years in IT; 4+ years in development
#LI-SK11 #LI-Hybrid
What Makes HTC A Great Place To Build Your Future
HTC Global Services wants you to join our team. Come build new things with us and advance your career. At HTC Global, you’ll collaborate with experts, work alongside clients, and be part of high-performing teams driving success together. You’ll have long-term opportunities to grow your career and develop skills in the latest emerging technologies.
At HTC Global Services, our employees have access to a comprehensive benefits package. Benefits can include Group Health (Medical, Dental, and Vision), Paid Time Off, Paid Holidays, 401(k) matching, Group Life and Disability insurance, Professional Development opportunities, Wellness programs, and a variety of other perks.
Our success as a company is built on inclusion and diversity. HTC Global Services is committed to providing a workplace free from discrimination and harassment, where every employee is treated with dignity and respect. We celebrate differences and believe that diverse cultures, perspectives, and skills drive innovation and success. HTC is an Equal Opportunity Employer and a proud National Minority Supplier. We seek to empower each individual, fostering an environment where everyone feels valued, included, and respected.