Gemma 4: Google's Most Powerful Open Model — And How to Run It Locally
Google just released Gemma 4 — a family of open-weight models with multimodal input, 256K context, native function calling, and a built-in "thinking mode". Here's everything you need to know, plus a step-by-step guide to running it on your own machine.
On April 2, 2026, Google DeepMind released Gemma 4 — the most capable open-weight model family they've ever shipped. Four sizes, multimodal from day one, 256K token context window, built-in reasoning mode, function calling — and under an Apache 2.0 license that lets you use it commercially with no restrictions.
For developers who want to run powerful AI locally — without sending data to cloud APIs, without per-token costs, and with full control — this is a big deal.
📖 Official Gemma 4 announcement • Google AI for Developers • Hugging Face collection
What Is Gemma 4?
Gemma 4 is a family of open-weight models built by Google DeepMind. "Open-weight" means the model weights are publicly released — you can download them, run them on your own hardware, fine-tune them, and deploy them privately.
The family has four variants:
| Model | Type | Context Window | Best For |
|---|---|---|---|
| Gemma 4 E2B | Effective 2B Dense | 128K tokens | Mobile, edge, on-device |
| Gemma 4 E4B | Effective 4B Dense | 128K tokens | Laptops, consumer hardware |
| Gemma 4 26B A4B | 26B Mixture-of-Experts | 256K tokens | Workstations, powerful GPUs |
| Gemma 4 31B | 31B Dense | 256K tokens | Server-grade, highest accuracy |
The "Effective" in E2B/E4B means the models use architectural tricks (such as activating only a subset of their total parameters per token) to deliver competitive performance at the memory and compute cost of a true 2B or 4B model.
Key Features
1. Multimodal Input — Text, Images, and Audio
All four models accept text and image inputs out of the box. The E2B and E4B models also support native audio input — you can send a recording and have the model transcribe, summarize, or respond to it.
This makes Gemma 4 a genuinely multimodal model, not a text model with image support bolted on as an afterthought.
2. 128K / 256K Token Context Window
The smaller E2B/E4B models support 128K tokens (approximately 90,000 words). The larger 26B and 31B models support 256K tokens — enough to send an entire codebase, a full book, or months of conversation history in a single prompt.
For RAG-based applications or agentic workflows where context accumulates quickly, this is a meaningful upgrade over Gemma 3's limits.
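Before stuffing a codebase or book into a prompt, it helps to sanity-check that it fits. A common rule of thumb is ~4 characters per token for English text; the real count depends on the tokenizer, and the helper and reserve value below are illustrative:

```python
# Rough context-budget check: ~4 characters per token is a common
# rule of thumb for English text (the exact count is tokenizer-dependent).
def fits_in_context(text: str, context_tokens: int = 256_000,
                    reserve_for_output: int = 4_096) -> bool:
    estimated_tokens = len(text) / 4
    return estimated_tokens <= context_tokens - reserve_for_output

# A 600,000-character codebase is roughly 150K tokens:
doc = "x" * 600_000
print(fits_in_context(doc, context_tokens=256_000))  # True  (26B/31B)
print(fits_in_context(doc, context_tokens=128_000))  # False (E2B/E4B)
```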
3. Thinking Mode (Configurable Chain-of-Thought)
Every Gemma 4 model includes a built-in "thinking mode" that lets the model reason step-by-step before generating its final answer. This is similar to how OpenAI's o3 models work, but it's configurable — you can turn it on for hard problems and off when you need fast responses.
messages = [
{
"role": "user",
"content": "Solve this step by step: a train leaves at 60 km/h..."
}
]
# Thinking mode is triggered by the model's instruction tuning.
# For models supporting explicit think tokens, include in system prompt:
# "Think carefully through each step before answering."
Benchmarks show that thinking mode dramatically improves performance on math (AIME 2026: 89.2% for 31B) and complex code generation (LiveCodeBench v6: 80.0%).
4. Native Function Calling
Gemma 4 implements function calling using built-in special tokens — not prompt tricks. This makes tool use significantly more reliable than earlier approaches where you had to engineer the model to output structured JSON and parse it yourself.
The model manages the full function-calling lifecycle: decide to call a function → format the call → receive the result → integrate into its response.
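Gemma 4's special tokens stay internal to the model; local servers typically surface tool use through the OpenAI-style tools schema instead. A minimal sketch of that lifecycle, where the `get_weather` tool and the response shape are illustrative assumptions rather than anything from the Gemma docs:

```python
import json

# Hypothetical tool definition in the OpenAI-style schema that
# OpenAI-compatible local servers generally accept.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def extract_tool_call(message: dict):
    """Pull (name, arguments) out of an assistant message if the model
    decided to call a tool; return None for a plain text reply."""
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], json.loads(fn["arguments"])

# Shape of an assistant message when the model chooses to call the tool:
reply = {"role": "assistant",
         "tool_calls": [{"function": {"name": "get_weather",
                                      "arguments": '{"city": "Hanoi"}'}}]}
print(extract_tool_call(reply))  # ('get_weather', {'city': 'Hanoi'})
```

Your code runs the function, appends the result as a `tool` message, and sends the conversation back for the model to integrate into its final answer.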
5. 140+ Languages
Gemma 4 was natively trained on over 140 languages, including Vietnamese 🇻🇳. If you're building multilingual applications or serving non-English users, this is a significant advantage over models trained primarily on English data.
6. Apache 2.0 License
This matters enormously. Apache 2.0 means:
- ✅ Commercial use permitted
- ✅ Modify and distribute
- ✅ Use in proprietary products
- ✅ No royalties
You own your deployment.
Performance Benchmarks
Here's how Gemma 4's instruction-tuned models compare across standard benchmarks:
| Benchmark | Description | 31B Dense | 26B MoE | E4B | E2B |
|---|---|---|---|---|---|
| Arena AI (text) | Human preference ranking | 1452 | 1441 | — | — |
| MMLU Pro | Knowledge Q&A | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 | Advanced math | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 | Code generation | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond | Graduate-level science | 84.3% | 82.3% | 58.6% | 43.4% |
| τ2-bench | Agentic task completion | 86.4% | 85.5% | 57.5% | 29.4% |
Note: The 31B model ranks among the top open-weight models on Arena AI's global leaderboard for text — competitive with models from Anthropic and Meta.
Which Variant Should You Pick?
- Have a high-end laptop or M-series Mac? → `gemma-4-E4B-it` (4B params, fast, great quality)
- Have a gaming GPU (16GB+ VRAM)? → `gemma-4-26B-A4B-it` (MoE, efficient)
- Have a workstation GPU (32GB+ VRAM)? → `gemma-4-31B-it` (best quality)
- Building for mobile / edge devices? → `gemma-4-E2B-it` (lightest, fastest)
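Those rules of thumb can be captured in a small helper. The thresholds mirror the list above; `pick_variant` and its defaults are illustrative, so adjust them for your own hardware:

```python
def pick_variant(vram_gb: float = 0, ram_gb: float = 8,
                 target: str = "desktop") -> str:
    """Map the hardware rules of thumb above to a model ID."""
    if target == "edge":
        return "gemma-4-E2B-it"   # mobile / edge: lightest
    if vram_gb >= 32:
        return "gemma-4-31B-it"   # workstation GPU: best quality
    if vram_gb >= 16:
        return "gemma-4-26B-A4B-it"  # gaming GPU: efficient MoE
    if ram_gb >= 6:
        return "gemma-4-E4B-it"   # laptops: the sweet spot
    return "gemma-4-E2B-it"

print(pick_variant(vram_gb=24))  # gemma-4-26B-A4B-it
print(pick_variant(ram_gb=16))   # gemma-4-E4B-it
```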
Option 1: Run with Ollama (Fastest Way)
Ollama is a CLI tool that makes it trivially easy to download and run any supported model.
Install Ollama
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download/windows
Pull and Run Gemma 4
# Pull the E4B variant (best balance for laptops)
ollama pull gemma4:e4b
# Run in interactive chat mode
ollama run gemma4:e4b
Available tags:
ollama pull gemma4:e2b # ~1.5GB — lightest
ollama pull gemma4:e4b # ~3.2GB — great for laptops
ollama pull gemma4:26b # ~16GB — needs a good GPU
ollama pull gemma4:31b # ~20GB — server-grade
Once running, you'll get an interactive REPL:
>>> What is the CAP theorem in distributed systems?
The CAP theorem states that a distributed data store can only guarantee
two of the following three properties simultaneously...
Use Ollama's REST API
Ollama runs a local server on http://localhost:11434 that exposes two APIs: its own native endpoints (like /api/generate below) and an OpenAI-compatible endpoint, so any app that speaks the OpenAI API format can talk to it:
curl http://localhost:11434/api/generate \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:e4b",
"prompt": "Write a Go function that reverses a string",
"stream": false
}'
Or using the OpenAI-compatible endpoint:
curl http://localhost:11434/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "gemma4:e4b",
"messages": [
{"role": "user", "content": "Write a Go function that reverses a string"}
]
}'
Option 2: Run with LM Studio (GUI)
LM Studio provides a desktop app with a clean GUI — ideal if you prefer not to use the terminal.
- Download from lmstudio.ai and install
- Open the Search tab (magnifying glass icon)
- Search for "Gemma 4"
- Select the variant that matches your hardware — LM Studio will warn you if the model requires more RAM than your machine has
- Download — it will pull from Hugging Face automatically
- Switch to the AI Chat tab, select the model, and start chatting
LM Studio also exposes a local API server on port 1234, compatible with the OpenAI API format:
# Start the server in LM Studio (Settings > Local Server > Start)
curl http://localhost:1234/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "Hello, Gemma 4!"}],
"temperature": 0.7,
"max_tokens": 200
}'
Option 3: Run with Python + Hugging Face Transformers
For developers who want direct programmatic control, the transformers library gives you full access to the model.
Install Dependencies
pip install -U transformers torch accelerate
# For quantization (reduces VRAM needs dramatically):
pip install bitsandbytes
Login to Hugging Face
huggingface-cli login
# Enter your token from https://huggingface.co/settings/tokens
Basic Text Generation
from transformers import pipeline
# E2B is the lightest — swap with "google/gemma-4-e4b-it" for better quality
model_id = "google/gemma-4-e2b-it"
pipe = pipeline(
"text-generation",
model=model_id,
device_map="auto", # auto-selects GPU, MPS (Mac), or CPU
)
messages = [
{
"role": "system",
"content": "You are a senior backend engineer. Keep responses concise."
},
{
"role": "user",
"content": "Explain the difference between optimistic and pessimistic locking."
}
]
output = pipe(messages, max_new_tokens=512, temperature=0.7)
print(output[0]["generated_text"][-1]["content"])
Run with 4-bit Quantization (Reduced Memory)
If you don't have enough VRAM for the full model, quantization reduces memory by ~4x with minimal quality loss:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
model_id = "google/gemma-4-26b-a4b-it"
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Chat template — use the tokenizer's built-in chat template
messages = [
{"role": "user", "content": "Write a Python decorator that logs function calls."}
]
# Apply the model's chat template
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
outputs = model.generate(
input_ids,
max_new_tokens=512,
do_sample=True,
temperature=0.7,
top_p=0.95,
)
# Decode only the newly generated tokens
response = tokenizer.decode(
outputs[0][input_ids.shape[-1]:],
skip_special_tokens=True
)
print(response)
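The ~4x figure is straightforward arithmetic: weights dominate memory, so bytes scale with bits per parameter. A rough estimate, where the 1GB overhead term is an assumed fudge factor for quantization constants and buffers:

```python
def model_memory_gb(n_params: float, bits_per_param: float,
                    overhead_gb: float = 1.0) -> float:
    """Rough weight-memory estimate; ignores the KV cache and
    activations, which grow with context length."""
    return n_params * bits_per_param / 8 / 1e9 + overhead_gb

# 26B parameters in bf16 (16-bit) vs 4-bit NF4:
print(round(model_memory_gb(26e9, 16)))  # ~53 GB: multi-GPU territory
print(round(model_memory_gb(26e9, 4)))   # ~14 GB: fits a 16GB card
```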
Multimodal: Image + Text Input
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "google/gemma-4-e4b-it"
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Load a local image (a URL string also works in the message below)
image = Image.open("architecture_diagram.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},  # the image travels inside the message
            {"type": "text", "text": "Describe this architecture diagram and identify any potential single points of failure."}
        ]
    }
]

# The processor's chat template handles both the text and the image
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt
input_len = inputs["input_ids"].shape[-1]
response = processor.decode(outputs[0][input_len:], skip_special_tokens=True)
print(response)
Advanced Config: Modelfile for Ollama
You can create a custom Modelfile to configure Gemma 4's behavior permanently — system prompt, temperature, context size:
# Modelfile — save this to a file called "Modelfile"
FROM gemma4:e4b
# System prompt — sets the model's persona
SYSTEM """
You are a senior backend engineer specializing in Java and Go.
You write clean, production-ready code with proper error handling.
Keep responses concise. Prefer code examples over long explanations.
"""
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
# Context window size (tokens)
PARAMETER num_ctx 32768
# Max tokens to generate
PARAMETER num_predict 1024
Build and run your custom model:
ollama create my-backend-assistant -f Modelfile
ollama run my-backend-assistant
# Now it remembers the persona across all sessions:
>>> Write a Go implementation of a rate limiter using token bucket algorithm
Hardware Requirements Summary
| Model | RAM/VRAM | Inference Speed | Recommended For |
|---|---|---|---|
| E2B | ~4GB RAM | Very fast | Any modern laptop, M1 Mac |
| E4B | ~6GB RAM | Fast | M1/M2 Mac, gaming laptops |
| 26B MoE (Q4) | ~12GB VRAM | Moderate | RTX 3080/4080, Mac Studio |
| 31B Dense (Q4) | ~18GB VRAM | Slower | RTX 4090, Mac Pro, workstation |
Mac users: Ollama and LM Studio both support Apple Silicon's Metal GPU. The E4B model runs at ~30-40 tokens/sec on an M2 MacBook Pro — genuinely usable for daily coding work.
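Those throughput numbers translate directly into response latency. A quick sanity check, ignoring prompt-processing time:

```python
def generation_time_s(n_tokens: int, tokens_per_sec: float) -> float:
    """Rough wall-clock time for a response, excluding prompt processing."""
    return n_tokens / tokens_per_sec

# A 512-token answer at the ~35 tok/s cited for E4B on an M2 MacBook Pro:
print(round(generation_time_s(512, 35), 1))  # ~14.6 seconds
```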
Key Takeaways
- Gemma 4 is the best open-weight model family available as of April 2026: multimodal, long context, and Apache 2.0 licensed.
- Start with Ollama: `ollama pull gemma4:e4b && ollama run gemma4:e4b` is the fastest path to a working local AI in under 5 minutes.
- For laptops, E4B is the sweet spot: fast, capable, ~6GB RAM, with native support for images and system prompts.
- For production Python apps, use `transformers` with `bitsandbytes` quantization to fit the larger models into consumer GPU VRAM.
- Multimodal support is real: you can send images and audio to the model with no extra setup, not just text.
- Thinking mode enables complex reasoning: enable it for math, algorithmic problems, and multi-step planning tasks.
Running a model locally isn't just a curiosity anymore — it's a practical choice when you need privacy (sensitive documents), cost control (no per-token billing), or offline capability. Gemma 4 makes that choice easy.
Thanks for reading! If you're curious about how to engineer reliable systems around these models, check out my Prompt Engineering, Context Engineering, and Harness Engineering deep dive. 🚀