- ty for critical applications on RHEL, targeting 99.9%+ availability.
- Build and maintain CI/CD pipelines and release automation (e.g., Jenkins, GitLab CI/CD), including artifact management, approvals, and rollbacks.
- Automate deployment, configuration, and operational tasks using Bash, Python, and configuration management tools (e.g., Ansible).
- Lead incident response: triage, root cause analysis, remediation, and post-incident learning; continuously reduce MTTR and change failure rate.
- Implement and enhance observability (logs, metrics, traces) using tools such as Splunk/Dynatrace; create actionable dashboards and alerts.
- Perform performance engineering: capacity planning, tuning (OS, JVM, MQ, DB connections), and throughput optimization.
- Manage security posture: vulnerability remediation, patching, certificate and key management, and adherence to hardening standards.
- Partner with business analysts, QA, developers, and release managers to deliver changes through the SDLC; contribute to requirements, design, testing, and documentation.
- Review solution designs for scalability, reliability, and conformance with enterprise architecture and risk controls.
- Create and maintain runbooks, standard operating procedures, and knowledge articles; mentor junior associates and lead technical design sessions.
- Participate in on-call rotation and change/release management processes (e.g., CAB), including after-hours maintenance windows when required.
- Support vendor integrations and coordinate with third parties as needed for incident resolution and upgrades.
- Contribute to disaster recovery planning and validation (failover testing, recovery procedures, and resilience improvements).
Required qualifications:
- High-energy, take-charge mindset with a strong sense of ownership and follow-through.
- Client-focused approach with strong communication and collaboration skills across distributed teams.
- Demonstrated critical thinking with advanced troubleshooting and debugging in complex, high-availability environments.
- Proven ability to independently drive solutions while coordinating across multiple teams and stakeholders.
- Experience leading technical design discussions and documenting decisions and standards.
- Bachelor’s degree in Computer Science, MIS, or a related field, or equivalent experience.
Technical must-haves:
- 8+ years supporting high-availability Unix/Linux platforms in production (RHEL, AIX, or Solaris).
- Proficiency with CI/CD tooling (e.g., Jenkins, GitLab CI/CD) and Git-based workflows.
- Strong scripting/automation skills (Bash, Python) and hands-on experience with Ansible (or Puppet/Chef).
- Experience running and optimizing SQL queries and working with relational databases (e.g., Oracle, PostgreSQL, SQL Server).
- Experience with message-oriented middleware (e.g., IBM MQ; Kafka/RabbitMQ a plus).
- Solid understanding of networking fundamentals, DNS, load balancing, TLS/PKI, and certificates.
- Observability tooling experience (e.g., Splunk/ELK, Prometheus/Grafana, AppDynamics/Dynatrace) with practical alerting/dashboards.
- Production support experience, including on-call participation and incident/problem management best practices.
Preferred qualifications (nice to have):
- SRE/operational excellence background with experience improving MTTR, change success rate, and automation coverage.
- Containerization and orchestration (Docker, Kubernetes or OpenShift) in hybrid/cloud environments.
- Experience with secrets management and hardening standards (e.g., CIS benchmarks).
- Familiarity with regulated/financial services environments and associated risk and control practices.
- Exposure to event streaming (Kafka), API gateways, and microservices architectures.