HPC Engineer
Job Details
We are seeking a skilled and motivated HPC Engineer to support our High-Performance Computing (HPC) and Digital Platform operations. This role plays a critical part in maintaining and optimizing Linux-based systems in a complex, data-driven environment. You’ll work alongside a global IT team to ensure seamless operation of our HPC and cloud infrastructure, supporting scientific discovery and data-intensive workloads.
This position requires flexibility to work variable shifts, including occasional evenings, weekends, or holiday coverage based on business and system support needs.
Key Responsibilities:
- Install, configure, and maintain Linux systems and related applications across HPC and cloud environments
- Monitor system performance, analyze logs, and proactively identify and address potential issues
- Provide technical support to end users, resolving system, hardware, and software issues
- Manage system backups, software upgrades, and security patches
- Support in-house software, troubleshoot performance issues, and ensure adherence to IT policies
- Utilize ticketing systems to manage and resolve support requests efficiently
- Collaborate with cross-functional teams to support evolving business and research needs
- Contribute to automation efforts using tools such as Ansible, GitLab, Puppet, or equivalent
- Support job scheduling and workload management, preferably using SLURM
- Stay current with evolving technologies and best practices in HPC and cloud computing
Required Skills & Qualifications:
- Bachelor’s degree in Computer Science, IT, Engineering, or a related field, or equivalent work experience
- 1–5 years of hands-on Linux system administration experience, preferably in an HPC environment
- Proficient in scripting (Bash, Python, or Perl)
- Experience with Docker and container orchestration
- Familiarity with configuration management tools (Ansible, Chef, Puppet, Salt, etc.)
- Exposure to SQL-based databases such as MySQL or MariaDB
- Strong troubleshooting, problem-solving, and communication skills
- Ability to work variable shifts in a 24/7 environment as needed
Preferred Qualifications:
- Experience with SLURM workload manager
- Exposure to DevOps practices and tools (CI/CD, Kubernetes, OpenStack)
- Understanding of hardware infrastructure, including CPU, GPU, and storage systems
- Cloud administration experience (AWS, Azure, GCP, etc.)
- Certifications such as CompTIA Network+, CCNA, or ITIL Foundation
Location: Houston, Texas
Schedule: Hybrid – 4 days onsite, 1 day work-from-home
Employment Type: Full-Time | Variable Shifts Required