Coordinate production incident response and work with application, infrastructure, database, network, platform, and support teams to resolve issues.
Support critical ecommerce business functions, including checkout, order processing, payments, inventory management, promotions, and fulfillment integrations.
Use Dynatrace and Splunk to analyze logs, metrics, traces, service health indicators, latency, error rates, and system dependencies.
Design and maintain operational dashboards for engineering teams and business stakeholders.
Improve monitoring and alerting by defining effective thresholds and reducing unnecessary notifications.
Conduct root cause analysis, document findings, and implement corrective and preventive actions.
Develop and maintain operational documentation, runbooks, standard operating procedures, and troubleshooting guides.
Identify opportunities for automation to improve operational efficiency and incident response.
Provide technical guidance to engineers and support collaboration across distributed teams.
Communicate system status, business impact, risks, and remediation plans to stakeholders.
Required Qualifications
8+ years of experience in Site Reliability Engineering, Production Support, DevOps, Platform Engineering, or Application Operations.
Strong hands-on experience with Dynatrace and Splunk.
Solid understanding of microservices, APIs, distributed systems, Kubernetes, containers, and cloud platforms.
Experience supporting large-scale ecommerce or enterprise production environments.
Knowledge of incident management, root cause analysis, monitoring, alerting, and service reliability practices.
Experience troubleshooting application performance issues, including latency, throughput, error rates, resource utilization, database performance, and external service dependencies.
Strong written and verbal communication skills.
Experience leading technical teams and coordinating work across multiple locations.
Preferred Qualifications
Experience in retail or ecommerce environments.
Familiarity with checkout, payment, order management, inventory, or fulfillment systems.
Knowledge of Dynatrace DQL, Grail, Smartscape, Davis AI, OpenPipeline, and Service Level Objectives (SLOs).
Experience with ServiceNow, Jira, PagerDuty, Microsoft Teams, or similar operational tools.
Experience with Python, shell scripting, or other automation technologies.
Exposure to automated incident analysis, runbook automation, or self-healing operational workflows.