Site Reliability Engineer Splunk Prometheus Grafana

Blue Ribbon Global Technologies

Sunnyvale, California

Full-time
Salary: 82 per hour

Posted on: Sep 05, 2024

JOB TITLE:

Site Reliability Engineer Splunk Prometheus Grafana

JOB TYPE:

Full-time

JOB LOCATION:

Sunnyvale, California, United States

REMOTE:

JOB DESCRIPTION:

Note: Please do not add anyone until after this call. I have scheduled this call for Friday 9/6/24 @ 1pm EST.

Description:

This is a Site Reliability Engineer role for the Sam's Cash Application team.

Role and Responsibilities include:

Production Ticket Handling and Troubleshooting:

Requires knowledge of: strong analytical and problem-solving skills; root cause analysis (RCA); root cause corrective action (RCCA). Guides team members in RCA and RCCA to identify the origins of defects and performance gaps and to prevent them. Analyzes complex problems involving multiple parties, networks, hardware, software, and cloud computing technologies. Assesses immediate restoration versus root cause based on consequences and resource requirements. Analyzes issues and plans a series of steps to enhance an application's availability and reliability, potentially including reconfiguration, integration, removal, or addition of application components. Analyzes trends to proactively prevent incidents and provides historical summary reports.

Disaster Recovery Planning:

Requires knowledge of: disaster recovery procedures and processes; enterprise disaster recovery systems. Coordinates partial and full tests of contingency and disaster recovery plans. Creates and maintains data center contingency documents and action plans. Defines and documents contingency and disaster recovery procedures. Leads the identification of critical functions for the assigned area of responsibility. Creates and tests plans for operating in a remote backup environment. Coordinates the day-to-day activities of control measures used in recovery plans.

Monitoring and Alerting:

Requires knowledge of: monitoring and alerting tools (Splunk, Prometheus, Grafana); monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); distributed tracing; alerting logic. Establishes metrics to monitor network, software, or system performance. Establishes SLOs/SLAs to determine availability goals of systems and services. Sets alerting priorities by identifying the most important systems based on criticality. Oversees daily system monitoring, including verifying the integrity and availability of all hardware and services, reviewing system and application logs, and verifying the completion of scheduled jobs. Leads end-to-end audits of monitors and alarms based on subsystem knowledge. Provides proactive updates to executive leadership on potential customer-impacting issues. Analyzes systems and makes recommendations to prevent possible incidents using knowledge of complex and company-wide systems.

Data Reporting and Metrics:

Advanced SQL skills to pull complex data reports from multiple sources; familiarity with Databricks or GCP BigQuery; ability to write advanced Splunk queries that join multiple indexes to stitch data together; and use of a data-driven decision-making process to analyze the impact of production issues and prioritize them.

Additional Information:

What project or initiative will they be working on?

Sam's Cash Reward Project

Will this role be hybrid?

If hybrid, how many days per week will they need to be in the office?

2-3 times a week

Top 3 Skills Needed or Required

Strong technical, analytical, and problem-solving skills; experience triaging and troubleshooting production issues
Monitoring and alerting skills (Splunk, Prometheus, Grafana)
Data reporting and metrics skills (SQL, Python, PySpark, Databricks)

What is the makeup of the team?

Team of 8 engineers, including Java backend engineers, Site Reliability Engineer, and Data Engineers, supporting Sam's Cash Core Application Operations.

Additional Job Details

Location can be Sunnyvale, CA; Bentonville, AR; or Dallas, TX

Required Skills: Grafana
Additional Skills: Cloud Developer

Position Details

POSTED:

Sep 05, 2024

EMPLOYMENT:

Full-time

SALARY:

82 per hour

SNAPRECRUIT ID:

SD-1dbdf0d92e981692096a5ce51e1ba51b308f3659549507d2708344682fdb8c73

CITY:

Sunnyvale

Job Origin:

CIEPAL_ORGANIC_FEED
