Description
Key Responsibilities
Application Reliability & Issue Resolution
- Analyze and troubleshoot application failures, latency issues, and degraded performance across distributed systems
- Perform deep-dive root cause analysis (RCA) to identify underlying application-level issues
- Work with engineering teams to quickly isolate failing components and dependencies
- Reduce MTTR (Mean Time to Resolution) through improved diagnostics and runbooks
Application Observability & Diagnostics
- Assess the current application landscape and identify gaps in logging, tracing, and monitoring
- Implement and enhance application-level observability (logs, metrics, traces)
- Enable faster issue identification by improving service visibility and dependency mapping
- Define and standardize health checks and alerting strategies for applications
System Understanding & Mapping
- Develop a clear understanding of application architecture, data flows, and service dependencies
- Build and maintain application topology and dependency maps
- Identify single points of failure and performance bottlenecks
Performance Engineering
- Analyze application performance and recommend improvements for scalability and responsiveness
- Identify issues related to threading, memory, database interactions, and API latency
- Work with developers to optimize code paths, queries, and service interactions
Incident Management & Process Improvement
- Lead or support incident triage and war-room calls
- Improve incident response processes and escalation paths
- Create and maintain runbooks, playbooks, and troubleshooting guides
- Identify recurring issues and drive permanent fixes rather than temporary patches
Collaboration & Engineering Enablement
- Partner with application development teams to embed reliability best practices
- Provide guidance on error handling, resiliency patterns, and fault tolerance
- Equip teams with tools and practices for self-service diagnostics
Required Skills & Experience
- 5–10 years of experience in application engineering, production support, or SRE roles
- Strong experience in application troubleshooting and debugging (Java/.NET/Node.js preferred)
- Solid understanding of distributed systems and microservices architectures
- Experience with application logs, debugging tools, and performance profiling
- Familiarity with observability tools (Splunk, Dynatrace, AppDynamics, Datadog, etc.)
- Strong understanding of API behavior, database interactions, and system integrations
- Experience working in production support and incident management environments
Preferred Skills
- Experience implementing distributed tracing (OpenTelemetry, Jaeger, Zipkin)
- Knowledge of cloud environments (AWS/Azure/GCP)
- Exposure to resiliency patterns (circuit breakers, retries, fallbacks)
- Experience with performance tuning and load analysis