Sr Principal Engineer, IT Resiliency
Job Mode: Hybrid
Notice period: 15 days or 30 days
Roles & Responsibilities:
End-to-End Engineering Leadership: Oversee the design and implementation of resilient engineering across the technology domains.
Cloud and On-Premises Infrastructure Expertise: Design and review resilient solutions in both cloud-based and on-premises environments.
Chaos Engineering Infrastructure Initiatives: Lead chaos engineering efforts to proactively identify and mitigate potential system weaknesses.
Standards for Monitoring and Alerting: Collaborate with teams to evolve existing standards for system monitoring and alerting to ensure rapid detection and response.
Resiliency Architecture Reviews: Represent the IT Resiliency Office at the Architectural Review Board.
Enterprise-wide Collaboration and Stakeholder Management: Collaborate with teams across the organization to align and prioritize resiliency and recovery efforts.
Automation: Apply expertise in infrastructure as code (IaC) and tools such as Ansible.
Incident Response and Recovery: Integrate with the post-mortem process following major incidents to identify opportunities for enhancing resiliency.
Development: Evangelize standards and practices across the Technology organization to strengthen our resiliency posture.
Reporting and Documentation: Develop standardized, regular reporting to the Leadership team on resilience activities, risks, and improvements.
Experience & Qualifications:
Bachelor's degree or equivalent experience.
5-10 years of experience in platform engineering with a focus on IaC, DevOps practices, and orchestration tools.
Preferred but not required: experience as a team lead or hands-on technical manager who can engage with projects and deliver them to completion.
A track record of successfully architecting and deploying enterprise-level solutions that prioritize system uptime and data integrity across various operational scenarios.
Demonstrated ability to design and implement systems that ensure high availability, support massive transaction volumes, and facilitate seamless disaster recovery processes.
Experience in infrastructure and service architecture and engineering, including functional and technical requirements gathering and solution development.
Strong dedication to customer needs, with excellent communication and the ability to build lasting relationships, alongside the capability to articulate complex resilience strategies in a clear and impactful manner.
Deep insight into the complexities of multi-AZ and multi-Region cloud platforms, with a keen understanding of how these impact system resilience and disaster recovery planning.
Proven experience in the ongoing management of mission-critical systems that require constant uptime, including out-of-hours support and rapid response to incidents.
Knowledgeable in evaluating and deciding on trade-offs between consistency, availability, and partition tolerance, especially in the context of system failures and recovery strategies.
Well-versed in various cloud service models such as SaaS, PaaS, and IaaS, with hands-on experience in designing resilient services on leading public cloud platforms.
Proficient in Chaos Engineering principles and practices, with experience in designing and conducting experiments to validate the system's capability to withstand turbulent conditions.
Skilled in implementing observability solutions that provide real-time insights into the performance and health of systems, aiding in proactive issue detection and resolution.
Practical experience operating in an Agile development environment.