Linux Engineer GPU

  • Posted on: Feb 09, 2025
  • California Creative Solutions Inc
  • Bethesda, Maryland
  • Salary: Not Available
  • Full-time

Job Title: Linux Engineer GPU

Job Type: Full-time

Job Location: Bethesda, Maryland, United States

Remote: No

Job Description:

Responsibilities:

  • Manage and maintain Linux servers (Red Hat/CentOS, Ubuntu) in a multi-enclave enterprise environment.
  • Provide technical support, administration, and monitoring of Linux systems and NVIDIA DGX-1 and A100 servers in physical and virtual environments.
  • Troubleshoot hardware and software issues, including server failures, network connectivity problems, and application errors.
  • Implement security updates, patches, and configurations to harden systems and protect against vulnerabilities.
  • Monitor system performance and resource utilization, identifying and resolving bottlenecks.
  • Automate system administration tasks using scripting languages like Bash and Python.

DevOps and Configuration Management:

  • Use DevOps tools (Ansible, Salt, GitLab) to automate configuration management, software updates, and system maintenance.
  • Maintain and improve system availability through proactive monitoring and automation.
  • Collaborate with developers and hardware architects to debug issues, define new requirements, and optimize workflows.

Resource Management:

  • Monitor the resource management system (Slurm) to keep resource allocation efficient and aligned with organizational priorities.
  • Work directly with users and management to plan and allocate resources effectively.
  • Communicate clearly and proactively regarding resource availability and scheduling.

Incident Response and Support:

  • Provide technical support to users, troubleshooting issues and resolving incidents in a timely manner.
  • Analyze recurring problems and implement solutions to prevent recurrence.
  • Document incident resolution steps and contribute to root cause analysis efforts.
  • Participate in an on-call rotation to provide 24/7/365 support during outages and emergencies.

Qualifications:

  • Bachelor's degree in Computer Science or a related field and 6+ years of relevant experience (additional experience may be considered in lieu of a degree).
  • 2+ years of experience administering Linux servers (Red Hat/CentOS, Ubuntu).
  • Hands-on experience troubleshooting server hardware failures.
  • Proficiency with configuration management tools (Ansible, Salt).
  • Strong understanding of networking services (DNS, NFS, LDAP, DHCP).
  • Experience with shell scripting and/or Python for automation.
  • Knowledge of Linux security best practices.
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and interpersonal skills.
  • Ability to work independently and as part of a team.
  • DoD 8570.11 IAT Level II certification (Security+ CE, CCNA-Security, GSEC, or SSCP) and an appropriate computing environment (CE) certification.

Preferred:

  • Experience with container technologies (Docker, Kubernetes).
  • Familiarity with monitoring tools (Prometheus/Grafana).
  • Knowledge of distributed resource scheduling systems (Slurm, LSF).
  • Experience with CUDA and GPU-accelerated computing systems.
  • Basic understanding of deep learning frameworks and algorithms.

Position Details

Posted: Feb 09, 2025

Employment: Full-time

Salary: Not Available

Snaprecruit ID: SD-CIE-92ec4a5006a85c935d7d36030a4e775bb8eedebc7d1f895f238084030274f041

City: Bethesda

Job Origin: CIEPAL_ORGANIC_FEED

