Description

  •  ty for critical applications on RHEL, targeting 99.9%+ availability.
  • Build and maintain CI/CD pipelines and release automation (e.g., Jenkins, GitLab CI/CD), including artifact management, approvals, and rollbacks.
  • Automate deployment, configuration, and operational tasks using Bash, Python, and configuration management tools (e.g., Ansible).
  • Lead incident response: triage, root cause analysis, remediation, and post-incident learning; continuously reduce MTTR and change failure rate.
  • Implement and enhance observability (logs, metrics, traces) using tools such as Splunk/Dynatrace; create actionable dashboards and alerts.
  • Perform performance engineering: capacity planning, tuning (OS, JVM, MQ, DB connections), and throughput optimization.
  • Manage security posture: vulnerability remediation, patching, certificate and key management, and adherence to hardening standards.
  • Partner with business analysts, QA, developers, and release managers to deliver changes through the SDLC; contribute to requirements, design, testing, and documentation.
  • Review solution designs for scalability, reliability, and conformance with enterprise architecture and risk controls.
  • Create and maintain runbooks, standard operating procedures, and knowledge articles; mentor junior associates and lead technical design sessions.
  • Participate in the on-call rotation and in change/release management processes (e.g., CAB), including after-hours maintenance windows when required.
  • Support vendor integrations and coordinate with third parties as needed for incident resolution and upgrades.
  • Contribute to disaster recovery planning and validation (failover testing, recovery procedures, and resilience improvements).

 

Required qualifications:

  • High-energy, take-charge mindset with a strong sense of ownership and follow-through.
  • Client-focused approach with strong communication and collaboration skills across distributed teams.
  • Demonstrated critical thinking with advanced troubleshooting and debugging in complex, high-availability environments.
  • Proven ability to independently drive solutions while coordinating across multiple teams and stakeholders.
  • Experience leading technical design discussions and documenting decisions and standards.
  • Bachelor’s degree in Computer Science, MIS, or a related field, or equivalent experience.

 

Technical must-haves:

  • 8+ years supporting high-availability Unix/Linux platforms in production (RHEL, AIX, or Solaris).
  • Proficiency with CI/CD tooling (e.g., Jenkins, GitLab CI/CD) and Git-based workflows.
  • Strong scripting/automation skills (Bash, Python) and hands-on experience with Ansible (or Puppet/Chef).
  • Experience running and optimizing SQL queries and working with relational databases (e.g., Oracle, PostgreSQL, SQL Server).
  • Experience with message-oriented middleware (e.g., IBM MQ; Kafka/RabbitMQ a plus).
  • Solid understanding of networking fundamentals, DNS, load balancing, TLS/PKI, and certificates.
  • Observability tooling experience (e.g., Splunk/ELK, Prometheus/Grafana, AppDynamics/Dynatrace) with practical alerting/dashboards.
  • Production support experience, including on-call participation and incident/problem management best practices.

 

Preferred qualifications (nice to have):

  • SRE/operational excellence background with experience improving MTTR, change success rate, and automation coverage.
  • Containerization and orchestration (Docker, Kubernetes or OpenShift) in hybrid/cloud environments.
  • Experience with secrets management and hardening standards (e.g., CIS benchmarks).
  • Familiarity with regulated/financial services environments and associated risk and control practices.
  • Exposure to event streaming (Kafka), API gateways, and microservices architectures.

Education

Bachelor's degree