Hands-on Lab (120 min)
Engineering multimodal AI video pipelines at scale: from zero to hero
This session demonstrates how to design scalable multimodal pipelines for large‑scale video analysis. It covers synchronizing audio and video, mitigating noise and hallucinations, and managing cost, latency, and compliance. Attendees learn to transform raw streams into queryable, auditable outputs using ASR, visual embeddings, and higher‑level recognition tasks.
Diana Ortega, Open Innovation AI
Kaisar Barlybay, Open Innovation AI
When and where
Friday, April 24, 10:30-12:30
TBA 15
Today, a growing number of applications rely on video as a primary data source. Analyzing video at scale requires more than running individual models; it demands well designed multimodal pipelines that combine vision, audio, and text while remaining accurate, cost-efficient, and compliant.
In this session, we build a high-throughput pipeline for video streams. Participants will see how raw feeds are transformed into aligned, queryable components by orchestrating ASR and visual embeddings, and by producing higher-level outputs such as speaker identity, facial recognition, and summaries under noisy conditions.
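The core idea of "aligned, queryable components" can be sketched as timestamp alignment between ASR output and per-frame visual embeddings. This is a minimal illustrative sketch, not the session's actual pipeline; the `AsrSegment`/`Frame` shapes and the `align` helper are assumptions for the example.

```python
from dataclasses import dataclass
from bisect import bisect_left

# Hypothetical data shapes: an ASR segment spans a time window,
# a frame carries a capture timestamp and a visual embedding.
@dataclass
class AsrSegment:
    start: float   # seconds
    end: float     # seconds
    text: str

@dataclass
class Frame:
    ts: float            # capture timestamp, seconds
    embedding: list      # visual embedding vector

def align(segments, frames):
    """For each ASR segment, collect the frames whose timestamps
    fall inside the segment's [start, end) window."""
    frames = sorted(frames, key=lambda f: f.ts)
    times = [f.ts for f in frames]
    out = []
    for seg in segments:
        lo = bisect_left(times, seg.start)   # first frame at or after start
        hi = bisect_left(times, seg.end)     # first frame at or after end
        out.append((seg, frames[lo:hi]))
    return out
```

Once segments and frames are paired this way, downstream tasks (speaker identity, facial recognition, summaries) can query a single time-indexed structure instead of three unsynchronized streams.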
We focus on key engineering challenges: audio-video synchronization and clock drift, stream fragmentation and context handling, hallucination mitigation, and scaling the system while controlling latency, cost, and resilience.
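Clock drift between audio and video capture devices is one of the alignment problems mentioned above. A common simple remedy, shown here as a hedged sketch (the session may use a different method), is to fit a linear map between the two clocks from paired anchor timestamps and remap one stream onto the other:

```python
# Hypothetical sketch: fit video_ts ≈ a * audio_ts + b by least
# squares from paired anchor points (e.g. periodic sync markers),
# then remap audio timestamps onto the video clock. The function
# names are illustrative, not from any specific library.

def fit_drift(pairs):
    """Least-squares fit of video_ts = a * audio_ts + b
    given pairs of (audio_ts, video_ts)."""
    n = len(pairs)
    sx = sum(a for a, _ in pairs)
    sy = sum(v for _, v in pairs)
    sxx = sum(a * a for a, _ in pairs)
    sxy = sum(a * v for a, v in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def to_video_clock(audio_ts, a, b):
    """Map an audio-clock timestamp onto the video clock."""
    return a * audio_ts + b
```

A slope `a` slightly different from 1.0 captures drift (clocks ticking at different rates), while `b` captures a fixed offset; re-fitting periodically keeps long-running streams aligned.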
Key takeaways include designing scalable multimodal pipelines, solving alignment and signal-quality issues at scale, and applying practical patterns for building compliant, auditable, and resilient video analysis systems.
Familiarity with data pipelines and basic ML concepts is helpful but not required.
Diana Ortega
Lead Data Engineer at Open Innovation AI, Diana has over 15 years of experience designing and implementing large-scale platforms. Her expertise includes high-throughput data architectures, distributed systems, relational and NoSQL data modeling, and cloud-native solutions. She currently focuses on building AI-enabled data platforms, including RAG pipelines and agentic systems, while mentoring teams on architecture, scalability, and software craftsmanship.
Kaisar Barlybay
Senior Data/Platform Engineer at Open Innovation AI, working on enterprise data infrastructure: ETL/ELT pipelines, real-time processing, multimodal AI pipelines, and observability. Part of a team building the data layer from source acquisition through transformation to serving.
Previously: data warehousing for industrial operations, research infrastructure in aerospace, NLP analytics platforms. Master's in Computer Science, Bachelor's in Mathematics. Stack: Python, Airflow, Kafka, ClickHouse, Kubernetes.