Site Reliability Engineer

TruEvan Technologies

New Jersey, United States

Posted On: 15+ days ago

Experience: 10+ years

Availability: Hybrid

Openings: 1

Category: Site Reliability Engineer

Tenure: Contract - Corp-to-Corp

Description

Key Responsibilities

Application Reliability & Issue Resolution

Analyze and troubleshoot application failures, latency issues, and degraded performance across distributed systems

Perform deep-dive root cause analysis (RCA) to identify underlying application-level issues
Work with engineering teams to quickly isolate failing components and dependencies
Reduce MTTR (Mean Time to Resolution) through improved diagnostics and runbooks

Application Observability & Diagnostics

Assess current application landscape and identify gaps in logging, tracing, and monitoring

Implement and enhance application-level observability (logs, metrics, traces)
Enable faster issue identification by improving service visibility and dependency mapping
Define and standardize health checks and alerting strategies for applications

System Understanding & Mapping

Develop a clear understanding of application architecture, data flows, and service dependencies

Build and maintain application topology and dependency maps
Identify single points of failure and performance bottlenecks

Performance Engineering

Analyze application performance and recommend improvements for scalability and responsiveness

Identify issues related to threading, memory, database interactions, and API latency
Work with developers to optimize code paths, queries, and service interactions

Incident Management & Process Improvement

Lead or support incident triage and war-room calls

Improve incident response processes and escalation paths
Create and maintain runbooks, playbooks, and troubleshooting guides
Identify recurring issues and drive permanent fixes rather than temporary patches

Collaboration & Engineering Enablement

Partner with application development teams to embed reliability best practices

Provide guidance on error handling, resiliency patterns, and fault tolerance
Enable teams with tools and practices for self-service diagnostics

Required Skills & Experience

5–10 years of experience in application engineering, production support, or SRE roles

Strong experience in application troubleshooting and debugging (Java/.NET/Node.js preferred)
Solid understanding of distributed systems and microservices architectures
Experience with application logs, debugging tools, and performance profiling
Familiarity with observability tools (Splunk, Dynatrace, AppDynamics, Datadog, etc.)
Strong understanding of API behavior, database interactions, and system integrations
Experience working in production support / incident management environments

Preferred Skills

Experience implementing distributed tracing (OpenTelemetry, Jaeger, Zipkin)

Knowledge of cloud environments (AWS/Azure/GCP)
Exposure to resiliency patterns (circuit breakers, retries, fallbacks)

Experience with performance tuning and load analysis

Key Skills

CI/CD, SRE, Splunk, Dynatrace, AppDynamics, Datadog, Azure, GCP, Java, .NET

Education

Any Graduate
