2H Hands-on Lab, 120 min
Engineering multimodal AI video pipelines at scale: from zero to hero
This session demonstrates how to design scalable multimodal pipelines for large‑scale video analysis. It covers synchronizing audio and video, mitigating noise and hallucinations, and managing cost, latency, and compliance. Attendees learn to transform raw streams into queryable, auditable outputs using ASR, visual embeddings, and higher‑level recognition tasks.
Diana Ortega, Open Innovation AI
Fabian Gutierrez, Independent
Friday, April 24, 10:30-12:30
Neuilly 152
A growing number of applications rely on video as a primary data source. Analyzing video at scale requires more than running individual models; it demands well-designed multimodal pipelines that combine vision, audio, and text while remaining accurate, cost-efficient, and compliant.
In this session, we build a high-throughput pipeline for video streams. Participants will see how raw feeds are transformed into aligned, queryable components by orchestrating ASR and visual embeddings, and by producing higher-level outputs such as speaker identity, facial recognition, and summaries under noisy conditions.
We focus on key engineering challenges: audio-video synchronization and clock drift, stream fragmentation and context handling, hallucination mitigation, and scaling the system while keeping latency and cost under control and preserving resilience.
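As an illustration of the synchronization and clock-drift challenge, here is a minimal sketch and not the pipeline built in the session. It assumes drift between the audio and video clocks is roughly linear and that a few anchor pairs (the same event timestamped on both clocks) are available. The names `Segment`, `fit_drift`, and `align` are hypothetical and chosen for this example only.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Segment:
    """One ASR segment, timed on the audio clock (seconds)."""
    start: float
    end: float
    text: str


def fit_drift(anchors):
    """Least-squares fit of video_t = scale * audio_t + offset.

    `anchors` is a list of (audio_t, video_t) pairs for events observed on
    both clocks. How those pairs are obtained is outside this sketch.
    """
    n = len(anchors)
    if n < 2:
        raise ValueError("need at least two anchor pairs")
    mean_a = sum(a for a, _ in anchors) / n
    mean_v = sum(v for _, v in anchors) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in anchors)
    if var_a == 0:
        raise ValueError("anchor audio timestamps must differ")
    cov = sum((a - mean_a) * (v - mean_v) for a, v in anchors)
    scale = cov / var_a
    offset = mean_v - scale * mean_a
    return scale, offset


def align(segments, frame_times, scale, offset):
    """Map each segment to the frames whose timestamps fall inside it.

    `frame_times` must be sorted video-clock timestamps (seconds).
    Returns (text, first_frame_index, last_frame_index) tuples; segments
    that contain no frame timestamp are skipped.
    """
    out = []
    for seg in segments:
        # Move the segment boundaries onto the video clock.
        start = scale * seg.start + offset
        end = scale * seg.end + offset
        first = bisect_left(frame_times, start)
        last = bisect_right(frame_times, end) - 1
        if first <= last:
            out.append((seg.text, first, last))
    return out
```

In a long-running stream, a single global fit is unlikely to hold; refitting over a sliding window of recent anchors would be one way to follow drift that changes over time.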
Key takeaways include designing scalable multimodal pipelines, solving alignment and signal-quality issues at scale, and applying practical patterns for building compliant, auditable, and resilient video analysis systems.
Familiarity with data pipelines and basic ML concepts is helpful but not required.
Diana Ortega
Lead Data Engineer at Open Innovation AI, Diana has over 15 years of experience designing and implementing large-scale platforms. Her expertise includes high-throughput data architectures, distributed systems, relational and NoSQL data modeling, and cloud-native solutions. She currently focuses on building AI-enabled data platforms, including RAG pipelines and agentic systems, while mentoring teams on architecture, scalability, and software craftsmanship.
Fabian Gutierrez
Software architect with a passion for software craftsmanship. I believe the difference between good code and great software comes down to one thing: intentionality. My work is to design solutions that do more than just work: they scale, they last, and they truly serve the business they were built for.