Site Reliability Engineer (SRE)
RTP, NC
Long Term Contract
Responsibilities:
- Manage AWS/GCP cloud infrastructure and Kubernetes resources; troubleshoot applications in the runtime environment.
- Manage and performance-tune either databases (Postgres, Redis, Cassandra, Elasticsearch) or streaming data pipelines (Kafka; knowledge of the Flink, Storm, Spark, and Kubeflow frameworks is desirable).
- Write and maintain runbooks for knowledge-driven automated processes and bots.
- Collaborate with developers and quality engineering teams to automate the monitoring, alerting, availability, and scalability of our applications and systems.
- Proactively monitor, diagnose, and resolve issues in a 24x7 multicloud environment (AWS / GCP), including participation in an on-call rotation.
- Analyze failures and support software engineers in debugging production issues across microservices and distributed platforms.
- Follow SRE best practices and procedures.
Technical Skills:
- Experience maintaining production systems on AWS and/or GCP.
- Experience with Linux, Python, and shell scripting.
- Experience maintaining Kubernetes clusters and managing and debugging containerized applications (Golang, Java, Python).
- Understanding of Kafka, Spark, Storm, Cassandra, Elasticsearch, PostgreSQL, Redis (ElastiCache), ZooKeeper, Nginx, AWS S3/GCP GS.
- Understanding of infrastructure-as-code software (e.g. Terraform, AWS CloudFormation, Google Cloud Deployment Manager).
- Experience with continuous integration practices and tools (Jenkins, Travis CI, CircleCI, etc.).
- Experience with monitoring solutions such as CloudWatch, Stackdriver, Prometheus, Thanos, Graphite, Grafana, ELK, Alert Logic, and Datadog.
- Experience with logging service solutions.

