Case Study · 2019–2021 · EDF via NeoStair EURL

Building an AI/ML Platform
from Scratch at Enterprise Scale.

100+ Daily training jobs at peak
5 days Time-to-production (down from ~6 weeks)
30% Infrastructure cost reduction
10+ Data science teams served

01 / Context

The Problem Before the Platform

In 2019, EDF's data science division had strong individual talent but no shared infrastructure. Each team had its own way of going from notebook to production — which meant, in practice, that very few things made it to production at all.

The real cost was not operational. It was strategic: EDF was investing heavily in AI capabilities that couldn't be deployed reliably enough to generate business value. A model that takes 6 weeks to go live and has no monitoring is not a production asset — it's a prototype.

The mandate: build the platform that changes that equation. No existing team to inherit. No reference architecture to copy. A regulated industrial environment with strict data residency constraints.

02 / Constraints That Shaped Everything

Data residency

Certain workloads couldn't touch AWS. This required a hybrid on-prem/cloud approach from day one, not as a future option but as a hard architectural constraint.

Legacy integration

Oracle and SAP systems couldn't be replaced. The platform had to consume and produce data in their formats, with their latency characteristics.

Team heterogeneity

10+ data science teams with varying Python maturity, different domain contexts, and strong opinions about tooling. Governance without killing autonomy was the tightest constraint.

No big-bang migration

Existing models in production couldn't be stopped. The platform had to be built alongside live systems, with zero tolerance for disruption.

03 / Architecture

Six Layers, One Contract

Ingestion

Kafka · Change Data Capture · REST connectors

Event-driven from day one. No polling.
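A minimal sketch of what an event-driven ingestion consumer can look like, using kafka-python. The topic, group id, and broker address are illustrative, not EDF's actual configuration, and the landing-zone writer is a placeholder.

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker settings, for illustration only.
consumer = KafkaConsumer(
    "meter-readings",                      # example topic, not the real one
    bootstrap_servers="broker:9092",
    group_id="ingestion-meter-readings",
    enable_auto_commit=False,              # commit only after a successful write
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Push-based consumption: the loop blocks until events arrive, no polling schedule.
for message in consumer:
    record = message.value
    # Hand the event to the landing-zone writer (placeholder for the real sink).
    print(f"partition={message.partition} offset={message.offset} payload={record}")
    consumer.commit()  # at-least-once: commit only after the record is persisted
```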

Compute

Spark on Kubernetes · PySpark · Airflow DAGs

Containerized workloads, portable across AWS and OpenShift.
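A sketch of the kind of PySpark job the compute layer runs. Paths, columns, and the app name are invented for illustration; the point is that the job code itself is identical whether the container is submitted to AWS or to OpenShift.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The same job image runs on AWS or OpenShift; only the submit-time master URL
# and service account differ, so the job code stays portable.
spark = SparkSession.builder.appName("curate-events").getOrCreate()

# Illustrative paths and columns; the real datasets are EDF-internal.
raw = spark.read.json("s3a://datalake/raw/events/")

curated = (
    raw
    .withColumn("event_date", F.to_date("event_ts"))
    .dropDuplicates(["event_id"])
    .filter(F.col("event_date").isNotNull())
)

curated.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3a://datalake/curated/events/"
)

spark.stop()
```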

Storage

S3 Lakehouse · Delta Lake · DBT transformations

Single source of truth with version history.
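A short sketch of the "version history" part, assuming a Spark session already configured with the delta-spark package. The table path is illustrative; Delta's time travel is what lets a training run be reproduced against a fixed snapshot.

```python
from pyspark.sql import SparkSession

# Assumes a cluster configured with delta-spark and the Delta SQL extensions.
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

table_path = "s3a://datalake/curated/consumption_delta"  # illustrative path

# Append today's batch; Delta records it as a new table version.
daily = spark.read.parquet("s3a://datalake/curated/events/")
daily.write.format("delta").mode("append").save(table_path)

# Current state of the table.
latest = spark.read.format("delta").load(table_path)

# Time travel: re-read the table exactly as it was at an earlier version.
as_of_v3 = (
    spark.read.format("delta")
    .option("versionAsOf", 3)
    .load(table_path)
)

print(latest.count(), as_of_v3.count())
```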

ML Layer

MLflow · Feature Store · Model Registry · CI/CD

Training-serving contract enforced at registration.
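A minimal sketch of the registry side of the CI/CD flow, assuming MLflow's stage-based registry API of that era. The model name and version are illustrative; the idea is that the serving layer always resolves a stage alias to exactly one validated version.

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()  # tracking/registry URI taken from MLFLOW_TRACKING_URI

MODEL_NAME = "load-forecast"   # illustrative model name
VERSION = "7"                  # version the CI pipeline just validated

# Promote the validated version and archive whatever was serving before,
# so "Production" always points at a single, contract-checked version.
client.transition_model_version_stage(
    name=MODEL_NAME,
    version=VERSION,
    stage="Production",
    archive_existing_versions=True,
)
```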

Serving

FastAPI · Redis · API Gateway · Istio service mesh

Sub-100ms p95 inference latency.
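A hedged sketch of what an inference endpoint in this layer can look like: FastAPI handles the request, online features come from Redis, and the model is resolved from the MLflow registry. Connection settings, key layout, and the model URI are illustrative.

```python
import json

import mlflow.pyfunc
import pandas as pd
import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Illustrative connection settings and model URI, not EDF's actual values.
feature_cache = redis.Redis(host="redis", port=6379, decode_responses=True)
model = mlflow.pyfunc.load_model("models:/load-forecast/Production")


class PredictionRequest(BaseModel):
    customer_id: str


@app.post("/predict")
def predict(req: PredictionRequest):
    # Online features are precomputed by the pipelines and cached in Redis,
    # keyed by entity id; a cache miss is treated as an unknown entity.
    raw = feature_cache.get(f"features:{req.customer_id}")
    if raw is None:
        raise HTTPException(status_code=404, detail="features not found")

    features = pd.DataFrame([json.loads(raw)])
    result = model.predict(features)  # assumes an array-like prediction output
    return {"customer_id": req.customer_id, "prediction": float(result[0])}
```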

Observability

Prometheus · Grafana · ELK · Evidently drift detection

Every model, every pipeline, every API — observable.
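A small sketch of how a serving process can expose the metrics Prometheus scrapes and Grafana plots, using prometheus_client. Metric names, labels, and the fake model call are illustrative.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes /metrics, Grafana plots them.
PREDICTIONS = Counter(
    "model_predictions_total", "Predictions served", ["model", "version"]
)
LATENCY = Histogram(
    "model_inference_seconds", "Inference latency in seconds", ["model"]
)


def predict(features):
    # Placeholder for the real model call.
    time.sleep(random.uniform(0.005, 0.05))
    return 0.0


def handle_request(features):
    # Observe latency and throughput per model, per version.
    with LATENCY.labels(model="load-forecast").time():
        result = predict(features)
    PREDICTIONS.labels(model="load-forecast", version="7").inc()
    return result


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request({"hour": 12})
```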

04 / Architecture Decisions — The Reasoning

What was chosen, what was rejected, and why. These are the decisions that determined the platform's long-term operability.

Airflow over Prefect for pipeline orchestration

✓ Airflow · Prefect

Prefect had a better developer experience in 2019, but Airflow had 3 years of production battle-testing and a larger operator community. For a platform serving 10+ teams with varying Python skills, debuggability and community support outweighed ergonomics. Trade: onboarding friction for operational resilience.
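For orientation, a minimal Airflow 2.x DAG skeleton in the shape the platform's pipelines took; the DAG id, task names, and callables are placeholders rather than real pipeline code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features():
    # Placeholder: real tasks submit Spark jobs and DBT runs.
    print("extracting features")


def train_model():
    print("training and registering the model")


default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_load_forecast_training",   # illustrative DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    features = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    features >> train   # explicit dependency, visible in the Airflow UI
```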

MLflow over custom model registry

✓ MLflow (with 3 targeted customizations) · Bespoke registry

The instinct was to build custom from day one. We resisted. MLflow covered ~90% of needs out of the box. We customized exactly three things: upstream metadata schema enforcement, downstream deployment contracts, and rollback trigger logic. That saved an estimated 6 months of engineering and avoided a future maintenance burden.
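As a sketch of the first customization, registration can be wrapped so that a model without the required contract metadata is simply refused. The required tag names here are illustrative, not EDF's actual schema.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Illustrative required metadata; the real schema is EDF-internal.
REQUIRED_TAGS = {"owner_team", "training_dataset_version", "feature_schema_version"}


def register_with_contract(model_uri: str, name: str, tags: dict):
    """Refuse registration unless the training-serving contract metadata is present."""
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(f"registration rejected, missing metadata: {sorted(missing)}")

    # Standard MLflow registration, only reached once the contract is satisfied.
    version = mlflow.register_model(model_uri=model_uri, name=name)

    client = MlflowClient()
    for key, value in tags.items():
        client.set_model_version_tag(name=name, version=version.version, key=key, value=value)
    return version
```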

Feature store: selective adoption

✓ Feature store for ~40% of use cases · Feature store for everything

Feature stores solve training-serving skew but introduce significant operational overhead. Break-even: justified for sub-hour refresh cycles and cross-team feature reuse. For the remaining 60%, versioned feature pipelines with documented contracts were simpler, cheaper, and more debuggable.
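For the 60%, a versioned feature pipeline can be as plain as the sketch below: the pipeline version is part of the output path and a small contract describing the columns is published next to the data. Paths, columns, and the contract fields are illustrative.

```python
import json
from datetime import date

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("customer-features-v2").getOrCreate()

PIPELINE_VERSION = "v2"   # bumped whenever the feature logic changes
OUTPUT = f"s3a://datalake/features/customer_daily/{PIPELINE_VERSION}/{date.today()}"

# Illustrative feature logic over an illustrative Delta table.
events = spark.read.format("delta").load("s3a://datalake/curated/consumption_delta")
features = events.groupBy("customer_id").agg(
    F.avg("kwh").alias("avg_daily_kwh"),
    F.max("kwh").alias("peak_daily_kwh"),
)
features.write.mode("overwrite").parquet(OUTPUT)

# Documented contract published alongside the data, so consumers know exactly
# what they are reading without a feature-store lookup.
contract = {
    "pipeline_version": PIPELINE_VERSION,
    "refresh": "daily",
    "columns": {"customer_id": "string", "avg_daily_kwh": "double", "peak_daily_kwh": "double"},
}
print(json.dumps(contract, indent=2))  # in practice, written next to OUTPUT
```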

OpenShift alongside AWS for regulated workloads

✓ Hybrid AWS + OpenShift · AWS-only

For certain workloads, EDF's regulated data couldn't leave on-premises infrastructure. OpenShift allowed us to run identical container workloads on-prem with the same tooling as AWS. Trade: operational complexity for compliance. The service mesh (Istio) became the critical abstraction layer.

05 / What Almost Failed

Governance nearly killed adoption in month 3. The first architecture review board was too prescriptive. Teams saw it as a bottleneck, not a service. Adoption stalled. We rebuilt the ARB process: reference blueprints instead of mandatory standards, office hours instead of quarterly gates, opt-in tooling with clear upgrade paths. The reversal cost 6 weeks. The lesson: governance without perceived value is friction.

The Kafka consumer lag incident. At 40+ daily jobs, a misconfigured consumer group caused lag to accumulate silently across three pipelines. No alert fired because the metric existed but had no threshold. Two data science teams ran on stale features for 11 hours before detection. After: every Kafka consumer group has a lag SLO, and Evidently drift alerts are cross-correlated with upstream lag metrics.
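A sketch of the lag check behind that SLO, using kafka-python's committed and end offsets. Topic, group, and threshold are illustrative; in production the value would be exported to Prometheus and alerted on rather than printed.

```python
from kafka import KafkaConsumer, TopicPartition  # pip install kafka-python

# Illustrative settings; each consumer group gets its own lag SLO.
TOPIC = "feature-events"
GROUP = "feature-pipeline-team-a"
LAG_SLO = 10_000  # max tolerated messages behind, per partition

consumer = KafkaConsumer(
    bootstrap_servers="broker:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    lag = end_offsets[tp] - committed   # how far this group is behind the log head
    status = "BREACH" if lag > LAG_SLO else "ok"
    print(f"{TOPIC}[{tp.partition}] group={GROUP} lag={lag} ({status})")
```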

06 / Results

Time-to-production · 5 days
~6 weeks (manual) → 5 days (automated CI/CD + containerized serving)

Infrastructure cost · −30%
Baseline multi-M€ AWS bill → −30% via FinOps: spot instances, reserved capacity, idle cleanup

Platform uptime · 99.9% SLA
Ad-hoc, no SLA → 99.9% across 100+ daily training jobs and inference services

Incident resolution · −60%
~4h median → <90 min via runbooks + automated alerting

Stack

Python · FastAPI · Apache Spark · Kafka · Airflow · DBT · MLflow · Docker · Kubernetes · OpenShift · AWS S3 · Redis · Istio · Terraform · GitLab CI · Prometheus · Grafana · ELK · Evidently