Description

As a member of the Crypto Services SRE team, you will support the reliability and scalability of the team's systems and services.

Responsibilities

  • Implement and maintain monitoring, observability, alerting, and logging systems.
  • Design and deploy automated processes and tooling, including Ansible playbooks and API monitoring tools.
  • Monitor key performance metrics to identify opportunities for optimization and efficiency.
  • Collaborate with cross-functional teams to troubleshoot incidents, perform root cause analysis, and prevent recurrence.
  • Document workflows and procedures, and validate runbooks.

Required Skills

  • 5+ years of experience in operations or site reliability engineering.
  • Strong Linux/Unix fundamentals and system administration skills.
  • Proficiency in shell scripting, such as Bash or Zsh.
  • Experience with interpreted or compiled languages such as Python, Perl, C/C++, Go, or Java.
  • Hands-on experience with configuration management and Infrastructure as Code using Ansible, Puppet, Terraform/Terragrunt, or CloudFormation.
  • Practical knowledge of containerization with Docker or Podman and orchestration with Kubernetes or Apache Mesos.
  • Understanding of network security, TCP/IP, and encryption principles, including PKI and key exchange protocols, along with tooling such as OpenSSL.
  • Familiarity with SRE principles such as monitoring, alerting, error budgets, and fault analysis.

Education

Graduate in any discipline.