Analyze system monitoring and observability data to identify performance issues, reliability risks, and operational improvement opportunities.
Lead enterprise-level incident triage, troubleshooting, and root-cause analysis activities.
Use modern monitoring and observability platforms to improve system reliability and service performance.
Collaborate with application owners, infrastructure teams, DevOps engineers, and system administrators to investigate and resolve complex technical issues.
Evaluate system designs, workflows, and application architectures to identify reliability enhancements.
Partner with development and identity/access management teams on advanced technical investigations.
Monitor enterprise applications and technology stacks, and provide actionable recommendations to improve stability and performance.
Support continuous improvement initiatives focused on operational excellence, service availability, and system resilience.
Lead technical problem-solving efforts and mentor teams in resolving complex operational challenges.
Required Qualifications
Strong Site Reliability Engineering (SRE) background with hands-on experience in incident management and enterprise troubleshooting.
At least 3 years of in-depth, hands-on experience with two or more of the following enterprise monitoring and observability tools:
Dynatrace
Splunk
SolarWinds
ServiceNow Operator Workspace
Extensive experience in one or more of the following technology domains:
Networking
Windows Systems
Desktop Infrastructure
Unix/Linux
AWS Cloud
Azure Cloud
Middleware Technologies
Java/JavaScript Development
Database Administration and Engineering
8+ years of experience working with IT operational metrics, system reliability indicators, application performance monitoring, and service quality measurements.
8+ years of experience deploying, maintaining, and troubleshooting complex enterprise-scale applications.
Experience working with cross-functional technical teams in large-scale environments.
Experience with cloud technologies, service virtualization, and modern SaaS/PaaS platforms.
Proficiency with Microsoft Office applications, including Word, Excel, and PowerPoint.
Demonstrated experience leading teams through complex technical challenges.
High School Diploma/GED with significant relevant professional experience, or a Master's degree in Computer Science, Engineering, Information Technology, or a related technical discipline with substantial relevant experience.
Preferred Qualifications
Experience with test-driven development (TDD), distributed systems, microservices, and cloud-native architectures.
Familiarity with additional monitoring and performance management tools.
Experience working in rapidly changing, highly dynamic enterprise environments.
Strong written and verbal communication skills with the ability to interact effectively with technical and non-technical stakeholders.
Excellent analytical, critical-thinking, and problem-solving abilities.
Experience managing or collaborating with distributed and virtual teams.
Prior experience supporting regulated or large-scale government, healthcare, or enterprise environments.
Experience with ServiceNow platforms and workflows is highly desirable.