Data & (Gen)AI
Conference · 50 min
BEGINNER

Running open large language models in production with Ollama and serverless GPUs

Many companies are interested in running open large language models such as Gemma and Llama because it gives them full control over deployment options, the timing of model upgrades, and the private data that goes into the model. Ollama is a popular open-source LLM inference server that works great on localhost and in a container. In this talk, you’ll learn how to deploy an application that uses an open model with Ollama on Cloud Run, using serverless GPUs that scale to zero.
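The deployment the abstract describes could look roughly like the following. This is a hypothetical sketch, not the speaker's exact setup: the project, repository, service name, and region are placeholders, and it assumes a container image (for example, built from the `ollama/ollama` base image with the model pre-pulled) has already been pushed to Artifact Registry.

```shell
# Deploy an Ollama-based service to Cloud Run with a serverless GPU.
# All names below are placeholders. Cloud Run GPUs require
# always-allocated CPU, hence --no-cpu-throttling.
gcloud run deploy ollama-gemma \
  --image=europe-docker.pkg.dev/my-project/my-repo/ollama-gemma \
  --region=europe-west1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --concurrency=4 \
  --min-instances=0
```

With `--min-instances=0`, Cloud Run scales the service down to zero instances when there is no traffic, so you pay for the GPU only while requests are being served.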

Wietse Venema, Google

When and where

Thursday, October 10, 13:50-14:40
Room 10
Ollama
Deployment
Serverless
LLM
Speakers
Wietse Venema


Google

Netherlands

Wietse Venema is an engineer at Google Cloud. He wrote the O’Reilly book on Cloud Run.