Senior SRE & Technology Support Lead - Applied AI/ML
Aumni
Take on a critical role in shaping the future of a globally recognized firm, with a direct and significant impact in a space built for top achievers in site reliability.
Job responsibilities
- Design, implement, and optimize SRE best practices for AI/ML infrastructure, focusing on reliability, scalability, security, and operational efficiency.
- Develop and maintain automated monitoring, alerting, and incident response systems.
- Manage day-to-day support issues, conduct root cause analysis, and drive continuous improvement to reduce repeat errors and enhance system stability.
- Demonstrate and champion site reliability culture and practices, and exert technical influence throughout your team.
- Lead initiatives to improve the reliability and stability of your team’s applications and platforms, using data-driven analytics to improve service levels.
- Collaborate with team members to identify comprehensive service level indicators, and with stakeholders to establish reasonable service level objectives and error budgets with customers.
- Demonstrate a high level of technical expertise within one or more technical domains, and proactively identify and solve technology-related bottlenecks in your areas of expertise.
- Act as the main point of contact during major incidents for your application, identifying and resolving issues quickly to avoid financial losses.
- Document and share knowledge within your organization via internal forums and communities of practice.
Required qualifications, capabilities, and skills
- 5+ years in site reliability or infrastructure engineering roles.
- Deep expertise in AWS cloud services, infrastructure automation (Terraform, CloudFormation), and monitoring tools (Prometheus, Grafana, CloudWatch).
- Strong problem-solving, communication, and collaboration skills.
- Experience with CI/CD pipelines, operational stability, and risk management.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or related field (or equivalent experience).
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with the ability to implement these practices within an application or platform.
- Fluency in at least one programming language (e.g., Python, Java).
- Deep knowledge of software applications and technical processes, with emerging depth in one or more technical disciplines.
- Proficiency and experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection, using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience with containers and container orchestration (e.g., Docker, ECS, Kubernetes).
- Experience troubleshooting common networking technologies and issues.
- Drive to self-educate and evaluate new technology.
- Advanced English skills.
Please submit your resume in English
J.P. Morgan is a global leader in financial services, providing strategic advice and products to the world’s most prominent corporations, governments, wealthy individuals and institutional investors. Our first-class business in a first-class way approach to serving clients drives everything we do. We strive to build trusted, long-term partnerships to help our clients achieve their business objectives.
J.P. Morgan’s Commercial & Investment Bank is a global leader across banking, markets, securities services and payments. Corporations, governments and institutions throughout the world entrust us with their business in more than 100 countries. The Commercial & Investment Bank provides strategic advice, raises capital, manages risk and extends liquidity in markets around the world.
Lead the reliability and support of AI/ML infrastructure on AWS and on-prem, ensuring operational excellence and team development.