Description
You will automate operational processes and maintain system reliability across distributed environments.
Responsibilities
- Design and develop software solutions to improve operations and reduce development effort.
- Reduce application Mean Time To Recover (MTTR) through monitoring, alerting, troubleshooting, and automated recovery.
- Productize existing tools, ensuring they are scalable and provide ecosystem insights.
- Engage with internal teams and vendors to understand product roadmaps.
- Automate repetitive tasks identified in operational workflows.
Required Skills
- 5+ years of experience in a technical role.
- Experience operating critical distributed systems and ensuring their high availability and performance.
- Proficiency in Java or scripting languages such as Python, JavaScript (Node.js), or shell.
- Experience with real-time/streaming data using technologies like WebSockets or Apache Kafka.
- Experience building consumable dashboards and monitoring with tools like Grafana or the ELK Stack (Elasticsearch, Logstash, Kibana).
- Strong problem-solving skills and comfort digging deep into complex issues.
- Experience with cloud platforms, specifically Azure or AWS.
- Familiarity with writing and running SQL queries.
Preferred Skills
- Experience with application performance monitoring (APM) tools and networking concepts (e.g., load balancing).
- Experience with automated testing and with performance tuning of Java applications.