Lead SRE - AWS, Terraform
Aumni
You will have public cloud expertise and the ability to take the lead in defining observability practices to ensure stability, performance and fast recovery while making it easier for our teams to build better products for our clients. You ensure requirements are accounted for in your products’ design, test service level indicators for effectiveness and customer experience, and drive implementation to production.
Job responsibilities
- Design, code, test, and deliver software to automate manual operational work, including self-healing and resiliency patterns for public cloud services and engineering teams.
- Define and implement a telemetry strategy, including rollout of APM (application performance monitoring) and cloud telemetry.
- Act as a culture carrier and site reliability adoption champion for your team by demonstrating site reliability principles and practices every day and mentoring technologists within the organization.
- Troubleshoot priority and escalation incidents, facilitate blameless post-mortems and ensure permanent closure of incidents and subsequent problem tasks.
- Engage with development teams throughout their SDLC and evangelize building software for reliability and scale, so that minimal refactoring or changes are needed later.
- Identify application patterns and analytics in support of better service level objectives
- Design automated software and product upgrades, change management, and release management solutions.
- Provide comprehensive and ongoing guidance, tools, and solutions to support the firm’s growth.
- Work towards becoming an expert on the applications and platforms in your remit by understanding their interdependencies and limitations and by driving the evolution and debugging of their critical components.
Required qualifications, capabilities and skills
- Bachelor’s degree or equivalent experience in a software engineering discipline
- Demonstrated experience working with a major public cloud provider (Amazon Web Services) and infrastructure as code (Terraform)
- An advanced understanding of site reliability culture and principles, a track record of implementing site reliability within an application or platform, and experience applying key SRE concepts such as SLOs and error budgets
- Advanced knowledge of and experience with observability capabilities across applications (metrics, tracing, SLOs), alerting, and telemetry collection, and the ability to design critical and golden signal monitoring and dashboards
- Strong communication skills and a desire to mentor and educate others on site reliability engineering principles and practices
Preferred qualifications, capabilities and skills
- Experience defining non-functional standards and blueprints related to supportability – logging, alerting, resiliency patterns, etc.
- Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks)
- Ability to partner with and influence architecture teams in defining non-functional application supportability standards
- Proven leadership skills with a drive for continuous improvement
- AWS Cloud Certification, Linux Foundation CKA/CKAD, Terraform Associate and other relevant certifications are a plus
J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world’s most prominent corporations, governments, wealthy individuals and institutional investors. Our first-class business in a first-class way approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.
J.P. Morgan’s Commercial & Investment Bank is a global leader across banking, markets, securities services and payments. Corporations, governments and institutions throughout the world entrust us with their business in more than 100 countries. The Commercial & Investment Bank provides strategic advice, raises capital, manages risk and extends liquidity in markets around the world.
Assume a critical role in ensuring the operational stability of the Sales, Research and Data AWS portfolio.