Published 12 June 2026

Data Engineer - Foundational

Harmattan AI
Paris, Île-de-France 75000, France. Permanent contract (CDI)

About Us

Harmattan AI is a next-generation defense prime building autonomous and scalable defense systems. Following the close of a $200 million Series B that valued the company at $1.4 billion, we are expanding our teams and capabilities to deliver mission-critical systems to allied forces.

Our work is guided by clear values: building technologies with real-world impact, pursuing excellence in everything we do, setting ambitious goals, and taking on the hardest technical challenges. We operate in a demanding environment where rigor, ownership, and execution are expected.

About the Role

As a Data Engineer on the Foundational team, you will serve as the "plumber" for deep learning, building the massive, high-performance data infrastructure required to power our foundational models. Based in Paris, you will manage terabytes, and eventually petabytes, of raw, unstructured, and noisy video data (EO and IR). Your mission is to ensure our ML engineers spend their time designing architectures, not waiting for data loaders or wrangling corrupted files.

Responsibilities
  • Multi-Modal Ingestion Pipeline: Build ETL/ELT pipelines to extract, decode, and store raw Electro-Optical (EO) and Infrared (IR) video from field logs into highly optimised formats such as WebDataset, TFRecord, or Parquet.
  • Sensor Synchronisation & Alignment: Develop algorithms to programmatically synchronise EO and IR frames temporally and spatially to provide paired inputs for model training.
  • High-Throughput Data Loading: Architect storage-to-GPU pipelines to ensure multi-node training clusters maintain >90% GPU utilisation without I/O bottlenecks.
  • Distributed Processing: Write and optimise distributed data processing jobs using tools like Apache Spark, Ray, or Apache Beam to process thousands of hours of tactical video logs.
  • Data Quality & Versioning: Implement automated quality checks to filter corrupted or blank frames and maintain 100% reproducible training runs through robust versioning and lineage tracking.
  • Infrastructure Evaluation: Assess and implement advanced storage solutions (e.g., MinIO, S3 tiering) to manage growing datasets while optimising for cost and latency.

Candidate Requirements
  • Educational Background: A BS or MS in Computer Science, Software Engineering, or Distributed Systems is highly preferred. Deep knowledge of operating systems, networking, and parallel computing is essential.
  • Technical Experience: 5-6+ years of experience building and maintaining terabyte-scale pipelines for unstructured data (video, images, or point clouds).
  • Performance Optimisation: Proven track record of maximising multi-node GPU utilisation and optimising data loaders for frameworks like PyTorch or JAX.
  • Tooling Expertise: Strong command of distributed computing tools (Spark, Ray, Beam) and ML data versioning tools (DVC, Apache Iceberg, or Pachyderm).
  • Adaptability & Ownership: A systems-thinker who thrives in a fast-paced startup environment and views messy data as an engineering problem to be solved via automation.
  • Commitment: 100% dedication to Harmattan AI's mission of providing a defensive edge to allied nations through ethical, high-impact technology.


We look forward to hearing how you can help shape the future of autonomous defense systems at Harmattan AI.
