Site Reliability Engineer

Intime Infotech Inc
Memphis, TN, USA

Description

ty for critical applications on RHEL, targeting 99.9%+ availability.
Build and maintain CI/CD pipelines and release automation (e.g., Jenkins, GitLab CI/CD), including artifact management, approvals, and rollbacks.
Automate deployment, configuration, and operational tasks using Bash, Python, and configuration management tools (e.g., Ansible).
Lead incident response: triage, root cause analysis, remediation, and post-incident learning; continuously reduce MTTR and change failure rate.
Implement and enhance observability (logs, metrics, traces) using tools such as Splunk/Dynatrace; create actionable dashboards and alerts.
Perform performance engineering: capacity planning, tuning (OS, JVM, MQ, DB connections), and throughput optimization.
Manage security posture: vulnerability remediation, patching, certificate and key management, and adherence to hardening standards.
Partner with business analysts, QA, developers, and release managers to deliver changes through the SDLC; contribute to requirements, design, testing, and documentation.
Review solution designs for scalability, reliability, and conformance with enterprise architecture and risk controls.
Create and maintain runbooks, standard operating procedures, and knowledge articles; mentor junior associates and lead technical design sessions.
Participate in on-call rotation and change/release management processes (e.g., CAB), including after-hours maintenance windows when required.
Support vendor integrations and coordinate with third parties as needed for incident resolution and upgrades.
Contribute to disaster recovery planning and validation (failover testing, recovery procedures, and resilience improvements).

Required qualifications:

High-energy, take-charge mindset with a strong sense of ownership and follow-through.
Client-focused approach with strong communication and collaboration skills across distributed teams.
Demonstrated critical thinking with advanced troubleshooting and debugging in complex, high-availability environments.
Proven ability to independently drive solutions while coordinating across multiple teams and stakeholders.
Experience leading technical design discussions and documenting decisions and standards.
Bachelor’s degree in Computer Science, MIS, or a related field, or equivalent experience.

Technical must-haves:

8+ years supporting high-availability Unix/Linux platforms in production (RHEL, AIX, or Solaris).
Proficiency with CI/CD tooling (e.g., Jenkins, GitLab CI/CD) and Git-based workflows.
Strong scripting/automation skills (Bash, Python) and hands-on experience with Ansible (or Puppet/Chef).
Experience running and optimizing SQL queries and working with relational databases (e.g., Oracle, PostgreSQL, SQL Server).
Experience with message-oriented middleware (e.g., IBM MQ; Kafka/RabbitMQ a plus).
Solid understanding of networking fundamentals, DNS, load balancing, TLS/PKI, and certificates.
Observability tooling experience (e.g., Splunk/ELK, Prometheus/Grafana, AppDynamics/Dynatrace) with practical alerting/dashboards.
Production support experience, including on-call participation and incident/problem management best practices.

Preferred qualifications (nice to have):

SRE/operational excellence background with experience improving MTTR, change success rate, and automation coverage.
Containerization and orchestration (Docker, Kubernetes or OpenShift) in hybrid/cloud environments.
Experience with secrets management and hardening standards (e.g., CIS benchmarks).
Familiarity with regulated/financial services environments and associated risk and control practices.
Exposure to event streaming (Kafka), API gateways, and microservices architectures.

Key Skills

Jenkins, GitLab CI/CD, Automation, Splunk, Dynatrace, Ansible, Bash, Python, Kafka

Education

Bachelor's degree

Posted On: Today
Experience: 8+ years
Availability: On Site
Openings: 1
Category: Site Reliability Engineer
Tenure: Contract - Corp-to-Corp Position