Apply Now!

Senior Site Reliability Engineer (SRE) – Production Stability & TechOps

----

Who we are and what we are looking for

Alpha Networks, one of the leading providers of hybrid TV back-ends and smart video platforms, offers its customers the right products and services to enhance and implement their content strategy.

Its innovative, reliable, and technically advanced back-end smart TV platform, combined with its Gecko Studio applications, provides a simple and efficient way to distribute pay-TV content over any type of network to any type of device.

Headquartered in Belgium, Alpha Networks is internationally recognized by network and mobile operators as well as content owners and broadcasters all over the world, including Orange, Vodafone, Blackpills, TeleCentro, SFR, Bouygues, and Canal+.

As part of our operational transformation initiative (OPS 3.0), we are looking for a Senior Site Reliability Engineer to strengthen the technical leadership and reliability of our production clusters.

You will be directly responsible for ensuring the stability and resilience of 30 client platforms hosted across 6 dynamic Kubernetes clusters (on-prem and cloud), and for helping structure and mature our TechOps practices.

Why you should join us

We are cool people: we like to take it easy and have fun during and after work hours.
We are working on innovative solutions with cutting-edge technologies.
We are very flexible about working hours and hybrid work: we rely on objectives more than on hours worked.
We have offices in Belgium, France, Spain, Morocco, Brazil, and other countries, leaving plenty of opportunities to work abroad.
Although we are a 150+ person group, you may have opportunities to grow quickly within the company and become a squad lead sooner than you think.
Alpha Networks is the European leader in the OTT industry, a sector that keeps expanding. We work for customers of all sizes, such as Bouygues Telecom, Orange, Blackpills, and Vodafone.

Key Responsibilities

Production Operations Leadership
- Lead and secure weekly production updates across platforms.
- Diagnose and resolve incidents across the stack (infra to application).
- Take ownership of the 24/7 stability and continuity of production systems.
TechOps Maturity & Process Ownership
- Design, formalize, and continuously improve operational procedures (incident handling, service recovery, alert handling, etc.).
- Provide recovery procedures for PostgreSQL, Redis, RabbitMQ, and Elasticsearch for the L1/L2 support teams.
- Build strong collaboration with Customer Care and SOC teams (procedures, documentation, escalations).
Automation & Reliability
- Contribute to CI/CD and IaC pipelines.
- Automate observability and alerting hand-in-hand with SREs and developers.
- Drive post-mortems, root cause analysis, and preventive actions.
Team Support & Influence
- Act as a technical mentor for junior SREs, SOC, and L2 teams.
- Advocate for a production-first culture and reliability best practices.
- Take initiative and act as a Doer and Guardian of production.

Our Stack

Infrastructure: Kubernetes, Helm, Docker, GitLab CI, on-prem & cloud (VM & bare metal).
Databases & Middleware: PostgreSQL, Redis, RabbitMQ, Elasticsearch.
Monitoring & Tools: Prometheus, Grafana, Alertmanager, Kibana.
Languages: Bash, Python, YAML, JSON.
Other: Git, S3, OpsGenie, Slack.

Required Skills & Profile

5+ years of experience in production environments or operations teams.
Strong expertise in incident management, deployments, and system debugging.
Ability to structure, document, and standardize operational tasks.
Good communicator and team player, fluent in French and English.
Autonomous, accountable, and motivated by ownership and service reliability.
Experience with high-availability environments and mission-critical systems.

Bonus Points

Experience leading production releases (MEP) or incident bridges with external clients.
Familiarity with operational dashboards and alert lifecycle tuning.
Contributions to scaling or refactoring multi-tenant platforms.
Experience working in SaaS or PaaS production teams.
