You will own and drive reliability outcomes at scale for real-time, distributed payment and transaction processing platforms that must meet strict SLAs, SLOs, and regulatory requirements.
Responsibilities
- Define reliability architecture and standards across services, platforms, and infrastructure.
- Design and evolve enterprise-grade observability platforms (metrics, logs, traces, SLOs/SLIs).
- Lead incident response for high-severity production issues, driving root-cause analysis and fixes.
- Set strategy and drive adoption of SRE best practices, including error budgets and capacity modeling.
- Architect automation platforms to eliminate toil and enable safe production releases.
Required Skills
- 10+ years of software engineering experience building large-scale, distributed systems.
- Significant experience operating mission-critical systems in Payments, FinTech, or Banking environments.
- Expertise in AWS, Azure, and GCP.
- Proficiency with Prometheus, Grafana, and Datadog for monitoring.
- Strong command of Linux and Python.
- Experience with configuration management using Ansible.
Preferred Skills
- Experience with Splunk, the ELK stack, or Oracle RDBMS.
- Familiarity with CI/CD platforms and release automation in regulated environments.