You will build and maintain network automation frameworks for large-scale datacenter environments supporting AI/ML workloads.
Responsibilities
Build Ansible playbooks, roles, and modules to automate device configurations, software upgrades, and compliance checks across multi-vendor environments.
Design and implement Python-based automation scripts and tools to integrate with APIs, orchestration platforms, and monitoring systems.
Automate OS core networking configurations on servers (Linux / Windows / Hypervisor) including bonding, VLANs, routing tables, kernel network parameters, MTU tuning, and NIC performance optimization.
Collaborate with infrastructure, network engineering, and DevOps teams to deliver seamless provisioning and scaling of GPU/TPU clusters.
Ensure network automation solutions meet high-performance computing (HPC) requirements such as low latency, high throughput, and fault tolerance.
Required Skills
Expertise in Ansible for large-scale network automation (playbook development, dynamic inventory, custom modules).
Strong proficiency in Python for scripting, API integrations (REST, NETCONF, gNMI), and device interaction (NAPALM, Paramiko).
Deep understanding of networking concepts including BGP, EVPN-VXLAN, MPLS, QoS, and leaf-spine architectures.
Knowledge of Linux / Windows / Hypervisor OS core networking, including network stack tuning, NIC bonding, SR-IOV, and performance tuning for HPC/AI workloads.
Experience in Private Cloud environments supporting HPC/AI workloads.
Familiarity with CI/CD pipelines (GitLab, Jenkins) for deploying automation at scale.
Knowledge of network observability, telemetry, and streaming protocols (gRPC, sFlow, Prometheus).