New York, NY, USA
Job description:
Define, implement, and maintain observability (monitoring, logging, tracing) and actionable alerting aligned to service health.
Drive incident management: on-call readiness, triage, incident command support, communications, and post-incident reviews (RCA).
Reduce operational toil through automation (runbooks-to-automation, self-healing, deployment/rollback automation).
Establish reliability standards: SLOs/SLIs, error budgets, production readiness reviews, and release risk controls.
Performance and reliability engineering: capacity planning, load/performance analysis, resilience testing, and failure-mode mitigation.
Partner with engineering teams to improve operational hygiene (deployability, rollback strategy, configuration, secrets, dependency management).
Skills
Required
· Strong grounding in SRE/DevOps practices: incident management, blameless postmortems, SLOs/SLIs, error budgets, production readiness.
· Experience building/operating monitoring and alerting, and using logs/metrics to diagnose issues.
· Automation/scripting skills (e.g., Python, PowerShell, Bash) and ability to reduce manual operational work.
· Strong understanding of cloud-based platforms such as Azure Databricks with Unity Catalog, AWS S3, and AWS RDS.
· Strong experience with ETL/ELT work.
· Understanding of CI/CD concepts, safe deployment patterns, rollback strategies, and change risk controls.
Preferred
· Experience with cloud environments and infrastructure-as-code.
· Experience with large datasets (multi-million-row datasets).
· Experience with container orchestration and modern runtime platforms (where applicable).
· Experience building dashboards and reliability reporting for executives and delivery teams.
Any Graduate