As a member of the Crypto Services SRE team, you will support the reliability and scalability of its systems and services.
Responsibilities
- Implement and maintain monitoring, observability, alerting, and logging systems.
- Design and deploy automated processes and tooling, including Ansible playbooks and API monitoring tools.
- Monitor key performance metrics to identify opportunities for optimization and efficiency.
- Collaborate with cross-functional teams to troubleshoot incidents, perform root cause analysis, and prevent recurrence.
- Document workflows and procedures, and validate runbooks.
Required Skills
- 5+ years of experience in operations or site reliability engineering.
- Strong Linux/Unix fundamentals and system administration skills.
- Proficiency in shell scripting, such as Bash or Zsh.
- Experience with interpreted or compiled languages such as Python, Perl, C/C++, Go, or Java.
- Hands-on experience with configuration management and Infrastructure as Code using Ansible, Puppet, Terraform/Terragrunt, or CloudFormation.
- Practical knowledge of containerization with Docker or Podman and orchestration with Kubernetes or Apache Mesos.
- Understanding of network security, TCP/IP, and encryption principles, including PKI and key exchange protocols, along with tooling such as OpenSSL.
- Familiarity with SRE principles such as monitoring, alerting, error budgets, and fault analysis.