Java · Conference · 40 min
Building and Running LLMs on GPUs Directly from Java with TornadoVM and GPULlama3.java
This session presents GPULlama3.java, an open-source framework enabling efficient, GPU-accelerated LLM inference directly from Java. Leveraging TornadoVM, it supports modern models and data types, integrates with LangChain4j and Quarkus, and offers live demos across diverse hardware, highlighting seamless Java-based GPU computing and performance profiling tools for AI workloads.
Michalis Papadimitriou, University of Manchester / TornadoVM
When and where
Thursday, April 23, 10:55-11:35
MC 3
Java stands at a critical inflection point in the AI revolution. Running LLM inference on GPUs traditionally forces developers to leave the JVM for Python and CUDA, or to accept CPU-bound performance, neither of which is feasible for enterprise Java systems. With modern JDK features like the Vector API and frameworks such as llama3.java and JLama, developers can already perform efficient CPU inference for LLMs. GPU computing in Java is also maturing, with TornadoVM and Project Babylon enabling seamless offloading of computations to GPUs.
This session introduces GPULlama3.java, an open-source framework built atop llama3.java and TornadoVM that transparently accelerates LLM inference on GPUs. It supports half-precision and quantized data types (FP16, Q8, Q4) mapped directly to GPU-native types, GPU-optimized matrix operations, and fast Flash Attention, and it is compatible with Llama 2/3, Mistral, Qwen2/3, Phi-3, and Gemma3 models. Integration with LangChain4j and Quarkus allows developers to build GPU-powered inference engines entirely in Java with minimal overhead.
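To give a flavor of what quantized data types like Q8 involve, here is a minimal, hypothetical sketch of block-wise 8-bit quantization in plain Java. The block size, rounding, and scale encoding are illustrative assumptions for this sketch, not taken from GPULlama3.java's actual source.

```java
// Hypothetical sketch of block-wise Q8 quantization: each block of weights
// stores one float scale plus one signed byte per value. This illustrates
// the general scheme behind quantized LLM weight formats; details here are
// assumptions, not GPULlama3.java internals.
public class Q8Sketch {
    static final int BLOCK = 32; // values per quantization block (assumed)

    // Quantize one block: scale = max|x| / 127, q[i] = round(x[i] / scale).
    static byte[] quantize(float[] x, float[] scaleOut) {
        float max = 0f;
        for (float v : x) max = Math.max(max, Math.abs(v));
        float scale = max / 127f;
        scaleOut[0] = scale;
        byte[] q = new byte[x.length];
        for (int i = 0; i < x.length; i++) {
            q[i] = (byte) Math.round(scale == 0f ? 0f : x[i] / scale);
        }
        return q;
    }

    // Dequantize: x'[i] = q[i] * scale (lossy round-trip, error <= scale/2).
    static float[] dequantize(byte[] q, float scale) {
        float[] x = new float[q.length];
        for (int i = 0; i < q.length; i++) x[i] = q[i] * scale;
        return x;
    }

    public static void main(String[] args) {
        float[] block = new float[BLOCK];
        for (int i = 0; i < BLOCK; i++) block[i] = (float) Math.sin(i) * 2f;
        float[] s = new float[1];
        byte[] q = quantize(block, s);
        float[] back = dequantize(q, s[0]);
        float maxErr = 0f;
        for (int i = 0; i < BLOCK; i++) {
            maxErr = Math.max(maxErr, Math.abs(block[i] - back[i]));
        }
        System.out.println("max abs round-trip error = " + maxErr);
    }
}
```

Storing a quarter (Q8) or an eighth (Q4) of the bytes per weight is what makes multi-billion-parameter models fit in GPU memory; the framework maps these compact representations onto GPU-native types so kernels can consume them directly.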
Finally, the session features a live demo of GPULlama3.java running on any JDK and on hardware ranging from Apple Silicon to high-end NVIDIA GPUs, with GPU-accelerated LLM inference integrated with Quarkus and agentic LangChain4j workflows. The session will also demonstrate how to leverage TornadoVM’s profiling tools for GPU performance analysis, complementing standard JVM tooling.
Michalis Papadimitriou
Michalis Papadimitriou is a Research Fellow at the University of Manchester and a Staff Software Engineer on the TornadoVM team. His core expertise includes open-source software development, hardware abstractions for high-level programming languages, compiler optimizations for GPU computing, and enabling large language model (LLM) inference on GPUs for the Java Virtual Machine (JVM).
Michalis is focused on advancing GPU acceleration for machine learning workloads on the JVM through the TornadoVM framework and actively maintains the GPULlama3.java project.
Before joining the University of Manchester, he worked on a range of software stacks at Huawei Technologies and contributed to the open-source machine learning compiler Apache TVM while working for OctoAI (formerly OctoML), which was later acquired by NVIDIA.