Sr. Site Reliability Engineers

In United States

JOB TITLE: Sr. Site Reliability Engineers

JOB LOCATION: Houston, United States

JOB DESCRIPTION:

Title: Sr. Site Reliability Engineers

Duration: 18 months

Location: Houston, TX

 

Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other DSX production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments.

 

As an SRE you will:

- Be on a PagerDuty rotation to respond to availability incidents and support service engineers handling customer incidents.

- Use your on-call shift to prevent incidents from happening.

- Run our infrastructure with Terraform and Kubernetes.

- Use monitoring and alerting to alert on symptoms, not outages.

- Document every action so that your findings turn into repeatable actions (playbooks) and then into automation.

- Improve the deployment process: we want to make it as boring as possible.

- Design, build and maintain core infrastructure pieces that allow DSX to scale to support hundreds and then thousands of concurrent users.

- Debug production issues across services and levels of the stack.

- Plan the growth of the DSX infrastructure.

 

You may be a fit for this role if you:

- Think about systems, and particularly edge cases and failure modes.

- Know your way around Linux and the Unix shell.

- Have strong programming skills, preferably in Node.js, but it could be Python, Go, .NET or even Ruby.

- Have an urge to collaborate and communicate asynchronously.

- Have an urge to document all the things so you don't need to learn the same thing twice.

- Have an enthusiastic, go-for-it attitude. When you see something broken, you can't help but fix it.

- Have an urge to deliver quickly and iterate fast.

- Have experience with Nginx, Docker, Kubernetes, Terraform, or similar technologies.

- Have good experience with GitHub.

 

 

Projects you could work on:

- Coding infrastructure automation with GitHub Actions and Terraform.

- Improving our Prometheus monitoring or building new metrics.

- Helping to deploy new versions of DSX.

- Helping to plan, prepare for, and execute the migration of DSX from virtual machines running on Azure to cloud-native container-based deployments with Kubernetes using Azure Kubernetes Service.

 

Technical

General knowledge of 4 of the following areas of technical expertise, with deep knowledge in 1 area:

- Implementing "Infrastructure as Code" using Terraform and GitHub CI/CD for automation.

- Load balancing of the application including Proxies and CDN.

- Kubernetes and containerising our system.

- Administering a high-availability MSSQL cluster.

- Monitoring and Metrics in Prometheus and Grafana, and their integrations with Slack/PagerDuty.

- Logging infrastructure.

- Backend storage management and scaling.

- Disaster Recovery and High Availability strategy.

- Contributing to code for services and automation.

 

Execution

1. Provide emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed.

2. Propose ideas and solutions within the infrastructure team to reduce the workload by automation.

3. Plan, design and execute solutions within the team to reach specific, agreed-upon goals.

4. Plan and execute configuration change operations both at the application and the infrastructure level.

5. Actively look for opportunities to improve the availability and performance of the system by applying the lessons from monitoring and observation.

Collaboration and Communication

Position Details

POSTED: Oct 18, 2022

SNAPRECRUIT ID: S16577568730617260

LOCATION: United States

CITY: Houston

JOB ORIGIN: OORWIN_ORGANIC_FEED
