Full-time
On-site
London

Responsibilities:
Own the day-to-day stability and performance of our Azure data platform (Synapse, Databricks, ADF, Power BI).
Act as the primary point of contact for incidents and outages, driving resolution, root cause analysis, and clear stakeholder communication.
Define, implement, and enforce SLAs for critical pipelines, datasets, and reporting assets.
Run FinOps forums with business stakeholders to improve cost transparency, accountability, and efficiency across the platform.
Oversee CI/CD pipelines and deployments, ensuring reliable, safe, and compliant delivery of data platform changes.
Champion monitoring, observability, and automation to detect and resolve issues proactively while reducing manual intervention.
Develop and maintain operational runbooks, escalation protocols, and incident playbooks to strengthen resilience.
Partner with data engineering and analytics teams to align operational strategy with business goals and future platform roadmap.

Experience with ITIL or formal service management frameworks, Azure Purview or data governance tooling, FinOps certification, telecoms domain knowledge, or regulatory compliance expertise (e.g., GDPR).

Skills:
Operational Leadership: Proven track record in leading operations for large-scale data platforms, ensuring stability, performance, and stakeholder trust.
Incident & SLA Management: Skilled in incident triage, root cause analysis, escalation handling, and defining/enforcing SLAs with cross-functional teams.
Azure Data Stack: Hands-on experience with Azure Synapse, Databricks, ADF, and Power BI, with the ability to guide best practices and optimisations.
Automation & CI/CD: Familiar with CI/CD processes and automation to streamline deployments and reduce manual intervention.
FinOps Mindset: Experience in cost management, usage reporting, and running forums with business stakeholders to drive accountability and efficiency.
Monitoring & Observability: Knowledge of modern monitoring, alerting, and data quality frameworks to ensure proactive platform health management.
Communication & Stakeholder Management: Clear, structured communicator who can translate technical issues into business impact and manage expectations at all levels.
Problem-Solving: Calm under pressure, able to lead teams through outages, resolve conflicts, and drive continuous improvement.
