New York, NY, USA

Posted On: 2 days ago

Experience: 8+ years

Availability: Onsite

Openings: 1

Category: Site Reliability Engineer

Tenure: No Preference/Any

Job description:

Define, implement, and maintain observability (monitoring, logging, tracing) and actionable alerting aligned to service health.

Drive incident management: on-call readiness, triage, incident command support, communications, and post-incident reviews (RCA).

Reduce operational toil through automation (runbooks-to-automation, self-healing, deployment/rollback automation).

Establish reliability standards: SLOs/SLIs, error budgets, production readiness reviews, and release risk controls.

Performance and reliability engineering: capacity planning, load/performance analysis, resilience testing, and failure-mode mitigation.

Partner with engineering teams to improve operational hygiene (deployability, rollback strategy, configuration, secrets, dependency management).

Skills
Required

· Strong grounding in SRE/DevOps practices: incident management, blameless postmortems, SLOs/SLIs, error budgets, production readiness.

· Experience building/operating monitoring and alerting, and using logs/metrics to diagnose issues.

· Automation/scripting skills (e.g., Python, PowerShell, Bash) and ability to reduce manual operational work.

· Strong understanding of cloud-based platforms such as Azure Databricks with Unity Catalog, AWS S3, and AWS RDS.

· Strong experience with ETL/ELT work.

· Understanding of CI/CD concepts, safe deployment patterns, rollback strategies, and change risk controls.

Preferred

· Experience with cloud environments and infrastructure-as-code.

· Experience with large datasets (multi-million-row datasets).

· Experience with container orchestration and modern runtime platforms (where applicable).

· Experience building dashboards and reliability reporting for executives and delivery teams.

Education: Any graduate
