Case Study · 2019–2021 · EDF via NeoStair EURL
Building an AI/ML Platform
from Scratch at Enterprise Scale.
01 / Context
The Problem Before the Platform
In 2019, EDF's data science division had strong individual talent but no shared infrastructure. Each team had its own way of going from notebook to production — which meant, in practice, that very few things made it to production at all.
The real cost was not operational. It was strategic: EDF was investing heavily in AI capabilities that couldn't be deployed reliably enough to generate business value. A model that takes 6 weeks to go live and has no monitoring is not a production asset — it's a prototype.
The mandate: build the platform that changes that equation. No existing team to inherit. No reference architecture to copy. A regulated industrial environment with strict data residency constraints.
02 / Constraints That Shaped Everything
Data residency
Certain workloads couldn't touch AWS. This forced a hybrid on-prem/cloud approach from day one: not a future option, but a hard architectural constraint.
Legacy integration
Oracle and SAP systems couldn't be replaced. The platform had to consume and produce data in their formats, with their latency characteristics.
Team heterogeneity
10+ data science teams with varying Python maturity, different domain contexts, and strong opinions about tooling. Governance without killing autonomy was the tightest constraint.
No big-bang migration
Existing models in production couldn't be stopped. The platform had to be built alongside live systems, with zero tolerance for disruption.
03 / Architecture
Five Layers, One Contract
Ingestion: Kafka · Change Data Capture · REST connectors
Event-driven from day one. No polling.
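A minimal sketch of that ingestion loop, assuming the confluent-kafka Python client; the broker address, topic, and consumer group are illustrative. "No polling" refers to source systems: changes arrive as CDC events rather than scheduled batch pulls.

```python
# Minimal event-driven ingestion loop (all names illustrative).
from confluent_kafka import Consumer

def handle_event(payload: bytes) -> None:
    # Stand-in for the downstream write (e.g. append to the lakehouse).
    print(payload)

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # placeholder broker
    "group.id": "feature-ingestion",     # hypothetical consumer group
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,         # commit only after a successful write
})
consumer.subscribe(["cdc.oracle.meter_readings"])  # hypothetical CDC topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())
        handle_event(msg.value())
        consumer.commit(message=msg, asynchronous=False)  # at-least-once
finally:
    consumer.close()
```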
Compute: Spark on Kubernetes · PySpark · Airflow DAGs
Containerized workloads, portable across AWS and OpenShift.
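As a sketch of what a pipeline in this layer could look like: an Airflow DAG launching a containerized Spark job via KubernetesPodOperator. The DAG id, image, and script path are hypothetical, and the operator's import path varies across provider versions.

```python
# Sketch: a daily pipeline as an Airflow DAG running a containerized Spark job.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="daily_feature_build",            # hypothetical pipeline name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    # The same image runs unchanged on EKS (AWS) or OpenShift; only the
    # cluster connection differs, which is what makes the workload portable.
    KubernetesPodOperator(
        task_id="spark_feature_build",
        name="spark-feature-build",
        image="registry.internal/spark-jobs:latest",  # placeholder registry
        cmds=["spark-submit"],
        arguments=["/app/build_features.py"],
        get_logs=True,
    )
```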
Storage: S3 Lakehouse · Delta Lake · DBT transformations
Single source of truth with version history.
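A minimal PySpark sketch of what "version history" buys here: every Delta write creates a new table version, and training jobs can pin an exact one. Paths and data are illustrative.

```python
# Sketch: versioned writes and time travel with Delta Lake on PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-demo")
    # Delta Lake extensions; assumes the delta-spark package is installed.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://lakehouse/features/meter_readings"  # placeholder table path

# Each overwrite creates a new table version instead of destroying history.
df = spark.createDataFrame([(1, 42.0)], ["meter_id", "reading"])
df.write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier version for reproducible training.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```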
ML platform: MLflow · Feature Store · Model Registry · CI/CD
Training-serving contract enforced at registration.
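One way to picture that contract: an MLflow model signature captured at logging time, which pins input/output schemas so serving can reject mismatched payloads. The model and registry name below are illustrative, not the actual EDF ones.

```python
# Sketch: pinning the training-serving contract with an MLflow signature.
import mlflow
from mlflow.models.signature import infer_signature
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = RandomForestRegressor().fit(X, y)

# The signature records input/output schemas alongside the model artifact.
signature = infer_signature(X, model.predict(X))

with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        signature=signature,
        registered_model_name="load_forecaster",  # hypothetical registry name
    )
```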
Serving: FastAPI · Redis · API Gateway · Istio service mesh
Sub-100ms p95 inference latency.
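A sketch of the low-latency serving path: a FastAPI endpoint reading pre-materialized features from Redis. The route, key layout, and scoring stub are illustrative; the real service would load a registered model.

```python
# Sketch: FastAPI inference endpoint backed by a Redis feature cache.
import json

import redis
from fastapi import FastAPI, HTTPException

app = FastAPI()
cache = redis.Redis(host="redis", port=6379, decode_responses=True)  # placeholder host

def score(features: dict) -> float:
    # Stand-in for the real model call (e.g. an MLflow pyfunc predict).
    return sum(features.values())

@app.get("/predict/{entity_id}")
def predict(entity_id: str):
    # Serving reads features from Redis, never from the lakehouse;
    # that cache hit is what keeps p95 under 100ms.
    raw = cache.get(f"features:{entity_id}")
    if raw is None:
        raise HTTPException(status_code=404, detail="features not materialized")
    return {"prediction": score(json.loads(raw))}
```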
Observability (cross-cutting): Prometheus · Grafana · ELK · Evidently drift detection
Every model, every pipeline, every API — observable.
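A sketch of a scheduled drift check using Evidently's Report API (0.2-era and later); the column, values, and alert action are illustrative.

```python
# Sketch: drift check comparing the serving window against the training
# reference with Evidently's DataDriftPreset.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.DataFrame({"load_mw": [100.0, 110.0, 95.0, 105.0]})  # training sample
current = pd.DataFrame({"load_mw": [140.0, 150.0, 160.0, 155.0]})   # serving window

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

result = report.as_dict()
# The preset's first metric summarizes dataset-level drift as a boolean.
if result["metrics"][0]["result"]["dataset_drift"]:
    print("drift detected on serving features")  # would page the owning team
```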
04 / Architecture Decisions — The Reasoning
What was chosen, what was rejected, and why. These are the decisions that determined the platform's long-term operability.
Airflow over Prefect for pipeline orchestration
Prefect had a better developer experience in 2019, but Airflow had 3 years of production battle-testing and a larger operator community. For a platform serving 10+ teams with varying Python skills, debuggability and community support outweighed ergonomics. Trade: onboarding friction for operational resilience.
MLflow over custom model registry
The instinct was to build a custom registry from day one. We resisted. MLflow covered ~90% of needs out of the box, so we customized exactly three things: upstream metadata schema enforcement, downstream deployment contracts, and rollback trigger logic. That saved an estimated 6 months of engineering and spared us a permanent maintenance burden.
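To make the first customization concrete, here is a hypothetical sketch of a registration wrapper that enforces a metadata schema before a version enters the registry. The required tag names are assumptions, not the actual EDF schema.

```python
# Sketch: refuse registration unless required metadata tags are present.
import mlflow
from mlflow.tracking import MlflowClient

# Hypothetical required tags; the real schema would be project-specific.
REQUIRED_TAGS = {"owner_team", "training_dataset_version", "rollback_contact"}

def register_with_contract(model_uri: str, name: str, tags: dict) -> None:
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        raise ValueError(f"refusing to register {name}: missing tags {missing}")
    version = mlflow.register_model(model_uri, name)
    client = MlflowClient()
    for key, value in tags.items():
        client.set_model_version_tag(name, version.version, key, value)
```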
Feature store: selective adoption
Feature stores solve training-serving skew but introduce significant operational overhead. Our break-even: a store is justified for features with sub-hour refresh cycles or cross-team reuse, which covered roughly 40% of ours. For the remaining 60%, versioned feature pipelines with documented contracts were simpler, cheaper, and more debuggable.
OpenShift alongside AWS for regulated workloads
EDF's regulated data couldn't leave on-premise for certain workloads. OpenShift allowed us to run identical container workloads on-prem with the same tooling as AWS. Trade: operational complexity for compliance. The service mesh (Istio) became the critical abstraction layer.
05 / What Almost Failed
Governance killed adoption in month 3. The first architecture review board was too prescriptive. Teams saw it as a bottleneck, not a service. Adoption stalled. We rebuilt the ARB process: reference blueprints instead of mandatory standards, office hours instead of quarterly gates, opt-in tooling with clear upgrade paths. The reversal cost 6 weeks. The lesson: governance without perceived value is friction.
The Kafka consumer lag incident. At 40+ daily jobs, a misconfigured consumer group caused lag to accumulate silently across three pipelines. No alert fired because the metric existed but had no threshold. Two data science teams ran on stale features for 11 hours before detection. The fix: every Kafka consumer group now has a lag SLO, and Evidently drift alerts are cross-correlated with upstream lag metrics.
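A minimal sketch of the kind of check that closes that gap, using kafka-python to compare committed offsets against log end offsets. The group, broker, and threshold are illustrative; in production the result feeds a Prometheus alert rather than stdout.

```python
# Sketch: compare consumer-group lag against an explicit SLO threshold.
from kafka import KafkaAdminClient, KafkaConsumer

GROUP_ID = "feature-ingestion"  # hypothetical consumer group
LAG_SLO = 10_000                # max tolerated messages behind, per partition

admin = KafkaAdminClient(bootstrap_servers="kafka:9092")
committed = admin.list_consumer_group_offsets(GROUP_ID)

consumer = KafkaConsumer(bootstrap_servers="kafka:9092")
end_offsets = consumer.end_offsets(list(committed))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    if lag > LAG_SLO:
        # In production this fires an alert instead of printing.
        print(f"LAG SLO BREACH {tp.topic}[{tp.partition}]: {lag} messages behind")
```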
06 / Results
Time-to-production
~6 weeks (manual) → 5 days (automated CI/CD + containerized serving)
Infrastructure cost
Baseline multi-M€ AWS bill → −30% via FinOps: spot instances, reserved capacity, idle cleanup
Platform uptime
Ad-hoc, no SLA → 99.9% across 100+ daily training jobs and inference services
Incident resolution
~4h median → <90 min via runbooks + automated alerting
Stack