SRNL Principal Systems Engineer, Windows & Linux
We are seeking a highly experienced High Performance Computing (HPC) Systems Engineer to join our dynamic and growing team. The selected candidate will work in a challenging, team-oriented environment supporting SRNL’s high performance computing clusters. You should have a strong desire to think creatively, identify new technologies and opportunities, and communicate the potential benefits of those choices to others within the team and to our research partners.
In this role you will:
• Serve as the Team Lead for the SRNL HPC team and provide high-level system administration support.
• Lead the design and implementation of Linux-based HPC, infrastructure, storage systems, and parallel file system servers and clusters.
• Develop the HPC growth and modernization program to include cloud technologies for HPC environments.
• Team with HPC researchers and vendors to develop and deploy HPC operational technologies such as scheduling, resource management, and monitoring for HPC on-premises and cloud environments.
• Establish and monitor appropriate metrics to direct continual improvements in HPC systems performance and operations.
• Participate in all aspects of the HPC system lifecycle including facility integration, standup, acceptance testing, performance benchmarking, operational support, and reclamation.
• Maintain all aspects of the systems, including security, networks, file systems, system software installation, and user support.
• Analyze and tune the performance of complex computer, network, file system, and disk subsystems.
• Develop tools and procedures to monitor and automate system tasks on servers and clusters.
• Monitor and conduct installations of software releases, operating system patches, and third-party utilities, with an emphasis on overall system security.
• Work closely with internal customers, management, system engineers, and Operations staff to ensure quality of service for end users.
• Troubleshoot complex system issues and determine their root cause.
• Respond to system problems and user questions in person, via email, and via a trouble ticket system.
• Perform other duties as assigned.