You will own the reliability, availability, scalability, and performance of mission-critical telecom platforms and services.
Responsibilities
- Ensure high availability and performance of telecom network platforms, OSS/BSS applications, and customer-facing services.
- Develop automation frameworks for deployment, monitoring, incident response, and capacity management.
- Design, implement, and maintain end-to-end observability solutions (logs, metrics, traces, alerts).
- Lead incident response, root cause analysis, and postmortems for outages and performance degradation.
- Implement SRE best practices (error budgets, SLIs, SLOs) aligned with telecom standards.
Required Skills
- 3–7 years of experience in SRE, DevOps, or systems engineering roles (telecom domain experience preferred).
- Strong knowledge of Linux/Unix system administration and of network and telecom protocols and technologies (TCP/IP, DNS, SIP/VoIP, SS7, Diameter, 5G Core).
- Hands-on experience with cloud platforms (AWS, Azure, GCP, or private telco clouds such as OpenStack or VMware).
- Proficiency in infrastructure-as-code and configuration management tools (Ansible, Terraform, Chef, Puppet).
- Experience with CI/CD tools and pipelines (Jenkins, GitLab CI/CD, ArgoCD).
- Experience with monitoring and observability tools (Prometheus, Grafana, ELK/EFK, Splunk, OpenTelemetry, Nagios, Zabbix).
- Programming or scripting proficiency in Python, Go, shell, or Java.
- Knowledge of telecom OSS/BSS, API integrations, and microservices architecture.
- Bachelor’s or Master’s degree in Computer Science, Telecommunications, Information Technology, or a related field.