Alpharetta, GA, USA
>> Expertise in UNIX/Linux administration, AWS/Azure cloud monitoring, Terraform/Ansible, and Prometheus/Grafana observability
>> Production experience in SRE / Infrastructure / ops for large-scale systems
>> Strong programming/scripting skills (Python, Go, Java, or equivalent)
>> Deep experience with containerization (Docker) and orchestration (Kubernetes, etc.)
>> Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)
>> Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures
>> Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)
>> Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)
>> Solid experience in capacity planning, performance tuning, scaling, and incident response
>> Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements
>> Experience in regulated environments (financial services, compliance, audit, security) is a strong plus
>> Excellent communication, documentation, and cross-team collaboration skills
>> Proven track record of reducing operational toil via automation
Requirements:
>> 6+ years of experience as a Site Reliability Engineer or in a similar role
>> Hands-on experience in supporting IaaS platforms with networking and system engineering knowledge
>> Bachelor's degree