Install Oobabooga Text Generation WebUI With Llama 3.2 Free on Colab

Sep 29, 2024 · The Local Lab

What Is Oobabooga's Text Generation WebUI?

Oobabooga's Text Generation WebUI (often just called "ooba" or "text-gen-webui") is a free, open-source interface for running large language models locally. Think of it as the Automatic1111 of text AI — a powerful, community-built web app that abstracts away the command-line complexity and gives you a feature-rich UI for loading, chatting with, and configuring language models.

It supports a wide range of backends, model formats, and use cases — from casual chatting to API-compatible serving that other apps can connect to. If you want maximum flexibility and control over how you run local LLMs, text-gen-webui is one of the go-to tools in the community.

🔌 Multiple Backends: Supports Transformers, llama.cpp, ExLlamav2, and more — swap between them per-model.

🔗 OpenAI-Compatible API: Exposes a local API server that other apps (like Open WebUI) can connect to as if it were OpenAI.

💬 Flexible Chat Modes: Chat mode, instruct mode, and notebook mode — each optimized for a different interaction style.

📝 Auto Prompt Formatting: Automatically applies the correct prompt template (ChatML, Llama, Alpaca, etc.) for each model.

🎛️ Fine-Grained Parameters: Temperature, top-p, top-k, repetition penalty — full control over generation behavior.

🧩 LoRA Fine-Tuning: Load and apply LoRA adapters to base models for personalized behavior without full retraining.

But Don't I Need a Powerful GPU?

Locally, yes — running Llama 3.2 well requires a decent GPU. But Google Colab gives you free access to an NVIDIA T4 GPU with 16GB of VRAM, running in the cloud via your browser. You don't install anything on your machine. You don't need a gaming PC. You just need a Google account.

The T4 is powerful enough to run Llama 3.2 in 4-bit quantized form (GGUF or GPTQ), which gives you a fast, high-quality experience. It's an ideal way to test the full Oobabooga feature set before committing to a local GPU setup — or just a reliable free option if you don't have the hardware at all.
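A quick back-of-envelope calculation shows why 4-bit quantization makes this practical. The helper below is a rough estimator, not a precise formula — the 1.5 GB overhead figure for KV cache and activations is an assumption for illustration:

```python
# Rough VRAM estimate for a quantized model.
# Rule of thumb: 4-bit quantization stores ~0.5 bytes per parameter,
# plus overhead for the KV cache and activations (the 1.5 GB here is
# an illustrative assumption, not a measured value).

def estimate_vram_gb(n_params_billion: float, bits: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Approximate VRAM needed to load and run a quantized model."""
    weight_gb = n_params_billion * 1e9 * (bits / 8) / 1e9
    return weight_gb + overhead_gb

# Llama 3.2 3B at 4-bit: ~1.5 GB of weights plus overhead, comfortably
# within the T4's 16 GB. Even an 8B-class model at 4-bit fits with room
# to spare.
print(round(estimate_vram_gb(3), 1))  # ~3.0 GB total
print(round(estimate_vram_gb(8), 1))  # ~5.5 GB total
```

The headroom left over on the T4 is what lets you raise the context length without running out of memory.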

💡 Free tier limits: Google Colab's free tier gives you T4 access, but with session time limits — typically a few hours per session. For extended use, Colab Pro ($10/month) gives more consistent GPU access and longer sessions. The setup process is identical either way.

Choosing Your Backend

Oobabooga supports multiple inference backends. The one you choose affects which model formats you can load and how fast they run:

| Backend | Model Formats | Best For |
| --- | --- | --- |
| Transformers | HuggingFace safetensors, PyTorch | Broad compatibility, easy model loading from HF Hub |
| llama.cpp | GGUF | Recommended for Colab: fast, low VRAM usage, best for quantized models |
| ExLlamav2 | EXL2 | Highest speed on NVIDIA GPUs, great for 24GB+ VRAM |

For Colab with the T4 GPU, llama.cpp with GGUF is the recommended backend — it loads fast, uses memory efficiently, and gives you great performance on the available 16GB VRAM.

Setting Up Oobabooga on Google Colab

1. Sign Into Your Google Account

Make sure you're logged into the Google account you want to use. Colab sessions are tied to your account and any files you save to Google Drive will persist between sessions.

2. Open the Oobabooga Colab Notebook

The community maintains Colab notebooks for text-gen-webui. Open the notebook from the video description or search for "oobabooga text generation webui colab" on GitHub. Click the "Open in Colab" badge to load it into your account.

3. Enable the T4 GPU Runtime

In Colab, go to Runtime → Change runtime type, set Hardware Accelerator to T4 GPU, and click Save. This step is essential — without GPU acceleration, model loading and generation will be extremely slow.

4. Connect to the Runtime

Click the Connect button (top right of the notebook). Colab will spin up a virtual machine with the T4 GPU attached. Wait for the RAM and Disk indicators to appear — that means you're connected.

5. Configure and Run the Setup Cell

The notebook has a configuration cell at the top. Set your backend to llama.cpp, select your desired model (Llama 3.2 in GGUF format), and run the cell. The notebook will install Oobabooga, download the model from Hugging Face, and start the WebUI server automatically.
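If you're curious what the setup cell is doing under the hood, it typically boils down to a few shell commands run from Python. This is a sketch of the general shape, not the contents of any specific notebook — the GGUF repository name is a placeholder, and flag names should be checked against the current text-generation-webui repo:

```python
# Typical shape of a Colab setup cell (illustrative, not a specific notebook).

# Clone the WebUI and install its dependencies:
!git clone https://github.com/oobabooga/text-generation-webui
%cd text-generation-webui
!pip install -r requirements.txt -q

# Download a 4-bit GGUF build of the model from Hugging Face
# (replace the placeholder with a real GGUF repo):
!huggingface-cli download <user>/<llama-3.2-gguf-repo> \
    --include "*Q4_K_M.gguf" --local-dir models

# Launch with the llama.cpp loader; --share prints a public Gradio URL:
!python server.py --loader llama.cpp --share
```

Knowing this shape makes it easier to debug a notebook cell that fails partway through, since you can re-run individual commands.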

6. Open the Public URL

Once the setup completes, the notebook outputs a public URL (via Gradio's sharing feature or a tunnel like ngrok). Click that link to open the Oobabooga WebUI in a new browser tab — it's running in the cloud but accessible from anywhere.

7. Load Your Model and Start Chatting

In the WebUI, go to the Model tab and confirm Llama 3.2 is loaded. Switch to the Chat tab, select your preferred chat mode (Chat or Instruct), and start your conversation. The model runs on the T4 GPU with full generation controls available in the Parameters tab.

What You Can Do with the WebUI

Chat and Instruct Modes

The Chat tab gives you a conversational back-and-forth interface similar to ChatGPT. Instruct mode is designed for task-specific prompts where you want a single, focused response rather than an ongoing dialogue. Both automatically apply the correct prompt format for Llama 3.2.
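To see what "automatically apply the correct prompt format" saves you from, here is what the Llama 3-family template looks like when written out by hand. The helper function is purely illustrative — it is not part of the WebUI, which builds this string for you:

```python
# Sketch of the Llama 3-family chat template that text-gen-webui applies
# automatically in instruct mode (shown only to illustrate what auto
# prompt formatting handles for you).

def llama3_prompt(system: str, user: str) -> str:
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = llama3_prompt("You are a helpful assistant.", "What is a GGUF file?")
print(prompt)
```

Sending a model raw text without its expected template is a common cause of rambling or broken output, which is why the automatic formatting matters.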

Notebook Mode

Notebook mode gives you a raw text completion interface — you type a partial sentence or prompt and the model continues it. This is useful for creative writing, exploring how a model handles open-ended generation, or testing prompt engineering without the chat wrapper.

Parameters Tab

This is where Oobabooga really shines over simpler interfaces. You have full control over the sampling settings that shape generation: temperature, top-p, top-k, repetition penalty, and more.
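These controls are easiest to reason about with a toy example. The sketch below, written in plain standard-library Python with made-up logits, applies temperature scaling, then top-k, then top-p (nucleus) filtering, in the order most samplers use. It illustrates the concepts only; it is not Oobabooga's actual implementation:

```python
import math

def sample_filter(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p filtering to toy logits.
    Returns renormalized probabilities over the surviving tokens."""
    # Temperature: lower values sharpen the distribution, higher flatten it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest set whose cumulative probability >= top_p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}

# Four toy token logits; tight settings prune the unlikely tail.
dist = sample_filter([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9)
print(dist)
```

Repetition penalty works differently: it downweights tokens that already appeared in the output before the steps above run.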

OpenAI-Compatible API

Oobabooga can expose your local model as an OpenAI-compatible API endpoint. Any app that supports OpenAI's API (Open WebUI, Cursor, custom scripts) can connect to it by pointing at localhost:5000 (or the Colab public URL) with any API key string. This lets you use Llama 3.2 as a drop-in replacement for GPT in your tools.
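As a sketch of what a client request looks like, the standard-library snippet below builds a chat-completion call against the default local port. The base URL, model name, and key are placeholders following the description above — on Colab, substitute the public URL the notebook printed:

```python
# Minimal sketch of a request to text-gen-webui's OpenAI-compatible API,
# using only the standard library. Base URL and model name are placeholders;
# the server serves whichever model is currently loaded.

import json
import urllib.request

BASE_URL = "http://localhost:5000/v1"  # or your Colab public URL + /v1

payload = {
    "model": "llama-3.2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer not-needed",  # any non-empty key works
    },
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the request shape matches OpenAI's API, the official `openai` client library also works if you point its `base_url` at the same endpoint.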

🔄 What's changed since this post was published (Sep 2024): Llama 3.2 introduced genuinely capable small models (1B and 3B) alongside the 11B and 90B multimodal variants. The 3B model runs exceptionally well on Colab's free T4 tier. Oobabooga has also continued adding features — the core setup process remains the same, but check the GitHub repo for the latest Colab notebook link, as these are periodically updated.

Going Further: Local Install

Running on Colab is great for getting started without any hardware investment. When you're ready to move to a local install — for longer sessions, more privacy, and the ability to run models 24/7 — Oobabooga installs cleanly on Windows, Linux, and Mac via a one-click installer script from the official GitHub repository.

The interface and features are identical between Colab and local installs. Everything you learn in Colab transfers directly — same tabs, same parameters, same backends. Colab is the perfect low-commitment way to explore before investing in hardware.

📦 Want to skip the setup?

The Local Lab offers pre-configured AI installer packages so you can get running in minutes, not hours.

Browse the Store →