Senior Linux HPC Storage Engineer
Job Description
- Must be able to work a hybrid work schedule in Oak Ridge, TN
- Must be eligible for a federal security clearance (U.S. citizenship required)
Major Duties/Responsibilities
- Architect, deploy, and manage large-scale HPC storage systems, including parallel file systems such as Lustre, GPFS/Spectrum Scale, BeeGFS, and WEKA.
- Design, implement, and operate large-scale Ceph storage clusters for HPC and research workloads, delivering reliable, high-performance object, block, and file storage services.
- Ensure the availability, performance, scalability, and security of production storage environments.
- Administer and optimize enterprise storage platforms such as Qumulo and NetApp in support of HPC and research workloads.
- Design, deploy, and maintain archival storage solutions including Spectra Logic BlackPearl and large-scale tape libraries to ensure long-term data preservation and accessibility.
- Integrate high-performance, enterprise, and archival storage layers into cohesive tiered storage architectures that balance cost, scalability, and performance for diverse scientific workflows.
- Leverage automation and monitoring solutions to minimize day-to-day maintenance while identifying opportunities to optimize system performance and management.
- Collaborate with researchers and technical POCs to support large data workflows and optimize I/O performance for scientific workloads.
- Automate storage provisioning, monitoring, and maintenance using scripting and configuration management tools.
- Diagnose and resolve complex storage and I/O-related issues in high-throughput, low-latency HPC environments.
- Evaluate emerging storage technologies (NVMe, object storage, hierarchical storage management, burst buffers) and contribute to strategic planning for future HPC systems.
- Work with 24/7 operations staff to streamline monitoring and troubleshooting, significantly reducing the need for off-hours support.
- Deliver ORNL’s mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote equal opportunity by fostering a respectful workplace.
Basic Qualifications
- A BS degree in computer science, computer engineering, information technology, information systems, science, engineering, or related discipline and 8–12 years of relevant professional experience; or an equivalent combination of education and experience.
- Candidates with a master’s degree: 7–10 years of relevant experience.
- Candidates with a PhD: 4–6 years of relevant experience.
- Five (5) or more years managing UNIX/Linux systems.
- Demonstrated experience managing HPC storage and large-scale enterprise storage systems.
- Three (3) or more years working with configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet.
- Proficiency with at least one scripting language (Bash, Python, Perl, etc.).
- Strong Linux administration and advanced troubleshooting experience.
- Experience supporting large data systems and/or HPC scientific workloads.
- Strong desire to innovate and evaluate new technologies for HPC and storage environments.
- Collaborative approach and ability to become a trusted advisor to research teams.
Preferred Qualifications
- Active DOE Q, DoD Top Secret, or TS/SCI clearance is strongly preferred.
- Solid understanding of multiple operating systems and HPC cluster technologies.
- Experience with Rocky/CentOS/RHEL, Ubuntu, VMware.
- Understanding of HPC job schedulers (SLURM) and user support workflows.
- Experience with container technologies in HPC environments.
- Experience with multiple system deployment mechanisms (Warewulf, PXE boot, Cobbler, Bright).
- Experience with GPU clusters (NVIDIA, AMD) for AI/ML and scientific workloads.
- Deep expertise with high-performance parallel file systems (Lustre, GPFS/Spectrum Scale, BeeGFS, WEKA).
- Knowledge of storage networking (InfiniBand, NVMe-oF, SAN/NAS architectures).
- Familiarity with RAID, ZFS, and object storage technologies.
- Strong background in performance monitoring, benchmarking, and I/O optimization.
- Experience with monitoring systems such as Grafana, Checkmk, Nagios, Zabbix, Ganglia.
- Previous experience working in a government, scientific, or other highly technical environment.
- Strong documentation skills and ability to prepare web-based documentation.

