Job Information
Insight Global Linux Systems Administrator (HPC) in Colorado Springs, Colorado
Job Description
Insight Global is looking for a senior-level Linux Systems Administrator with strong HPC skills. This individual will be responsible for handling all L1/L2 operational support related to HPC (High Performance Computing) operations for Colorado Springs. They will support HPC business users at the OS level, troubleshoot issues related to HPC hardware, and monitor SLURM and overall HPC cluster health (nodes, storage, interconnect, schedulers). They will handle all operational issues through SNOW, respond to alerts from monitoring tools (Nagios, Prometheus, Grafana), restart failed services and jobs where procedures exist, and coordinate with the hardware vendor on hardware-related issues.
Day to Day activities:
1. Monitor SNOW tickets and perform basic triage
2. Perform storage checks (quota and utilization)
3. Monitor SLURM (queue, job state) and escalate to the next level where necessary
4. Attend regular standup meetings
5. Analyze performance issues and work with other team members to clear bottlenecks
6. Work closely with the patching team to ensure that HPC services are available after critical patching
7. Restart nodes that are in a hung state
This will be a permanent position based in the Colorado Springs office. It will be a hybrid role, with a couple of days a week in the office and potential for remote work. The salary ranges from $75k to $95k depending on experience.
We are a company committed to creating diverse and inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity/affirmative action employer that believes everyone matters. Qualified candidates will receive consideration for employment regardless of their race, color, ethnicity, religion, sex (including pregnancy), sexual orientation, gender identity and expression, marital status, national origin, ancestry, genetic factors, age, disability, protected veteran status, military or uniformed service member status, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com. To learn more about how we collect, keep, and process your private information, please review Insight Global's Workforce Privacy Policy: https://insightglobal.com/workforce-privacy-policy/.
Skills and Requirements
Bachelor's or equivalent degree in Computer Science
5+ years of advanced Linux system administration experience
2+ years supporting HPC in a Linux environment
Hands-on experience with HPC schedulers (SLURM preferred)
Knowledge of MPI, parallel computing, and job profiling
Experience with high-speed interconnects (InfiniBand, Omni-Path)
Storage expertise (GPFS/Spectrum Scale, Lustre, NFS)
Scripting and automation (Bash, Python)
Understanding of CPU/GPU architectures and NUMA
Networking fundamentals (TCP/IP, RDMA, firewalls)
Experience with xCAT, Bright Cluster Manager, or similar
Experience in Linux automation through shell scripting
Knowledge of containerization and orchestration technologies (Docker, Kubernetes)