Description

You will own and drive reliability outcomes at scale for real-time, distributed payment and transaction-processing platforms that must meet strict SLAs, SLOs, and regulatory requirements.

Responsibilities

  • Define reliability architecture and standards across services, platforms, and infrastructure.
  • Design and evolve enterprise-grade observability platforms (metrics, logs, traces, SLOs/SLIs).
  • Lead incident response for high-severity production issues, driving root-cause analysis and fixes.
  • Set strategy and drive adoption of SRE best practices, including error budgets and capacity modeling.
  • Architect automation platforms to eliminate toil and enable safe production releases.

Required Skills

  • 10+ years of software engineering experience building large-scale, distributed systems.
  • Significant experience operating mission-critical systems in Payments, FinTech, or Banking environments.
  • Expertise in AWS, Azure, and GCP.
  • Proficiency with Prometheus, Grafana, and Datadog for monitoring.
  • Strong command of Linux and Python.
  • Experience with configuration management using Ansible.

Preferred Skills

  • Experience with Splunk, the ELK stack, or Oracle RDBMS.
  • Familiarity with CI/CD platforms and release automation in regulated environments.

Education

Graduate in any discipline.