Gemma 4: Google's Most Powerful Open Model — And How to Run It Locally

· 10 min read
Hieu Nguyen
Senior Software Engineer at OCB

Google just released Gemma 4 — a family of open-weight models with multimodal input, 256K context, native function calling, and a built-in "thinking mode". Here's everything you need to know, plus a step-by-step guide to running it on your own machine.

On April 2, 2026, Google DeepMind released Gemma 4 — the most capable open-weight model family they've ever shipped. Four sizes, multimodal from day one, 256K token context window, built-in reasoning mode, function calling — and under an Apache 2.0 license that lets you use it commercially with no restrictions.

For developers who want to run powerful AI locally — without sending data to cloud APIs, without per-token costs, and with full control — this is a big deal.

📖 Resources: Official Gemma 4 announcement · Google AI for Developers · Hugging Face collection

What Is Gemma 4?

Gemma 4 is a family of open-weight models built by Google DeepMind. "Open-weight" means the model weights are publicly released — you can download them, run them on your own hardware, fine-tune them, and deploy them privately.

The family has four variants:

| Model | Type | Context Window | Best For |
|---|---|---|---|
| Gemma 4 E2B | Effective 2B Dense | 128K tokens | Mobile, edge, on-device |
| Gemma 4 E4B | Effective 4B Dense | 128K tokens | Laptops, consumer hardware |
| Gemma 4 26B A4B | 26B Mixture-of-Experts | 256K tokens | Workstations, powerful GPUs |
| Gemma 4 31B | 31B Dense | 256K tokens | Server-grade, highest accuracy |

The "Effective" in E2B/E4B means these models use architectural efficiency techniques to deliver the quality of a larger model while running with the memory footprint of a 2B or 4B dense model. (Mixture-of-Experts routing is the 26B A4B model's trick: it activates only a ~4B-parameter subset of its experts per token.)


Key Features

1. Multimodal Input — Text, Images, and Audio

All four models accept text and image inputs out of the box. The E2B and E4B models also support native audio input — you can send a recording and have the model transcribe, summarize, or respond to it.

This makes Gemma 4 a genuinely multimodal model, not a text model with image support bolted on as an afterthought.

2. 128K / 256K Token Context Window

The smaller E2B/E4B models support 128K tokens (approximately 90,000 words). The larger 26B and 31B models support 256K tokens — enough to send an entire codebase, a full book, or months of conversation history in a single prompt.

For RAG-based applications or agentic workflows where context accumulates quickly, this is a meaningful upgrade over Gemma 3's limits.
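To sanity-check whether a document fits in the window, a common heuristic is roughly 4 characters per token for English text. The helper below is a hypothetical sketch using that heuristic; real counts vary by tokenizer and language:

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and code.
    # Actual counts depend on the model's tokenizer.
    return len(text) // 4

def fits_in_context(text: str, context_window: int = 256_000) -> bool:
    # Leave ~20% headroom for the system prompt and the model's reply.
    return estimate_tokens(text) <= int(context_window * 0.8)

doc = "word " * 90_000            # roughly the size the article mentions
print(estimate_tokens(doc))       # 112500
print(fits_in_context(doc))       # True
```

For precise counts, use the model tokenizer's own `encode` method instead of the heuristic.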

3. Thinking Mode (Configurable Chain-of-Thought)

Every Gemma 4 model includes a built-in "thinking mode" that lets the model reason step-by-step before generating its final answer. This is similar to how OpenAI's o3 models work, but it's configurable — you can turn it on for hard problems and off when you need fast responses.

messages = [
    {
        "role": "user",
        "content": "Solve this step by step: a train leaves at 60 km/h..."
    }
]
# Thinking mode is triggered by the model's instruction tuning.
# For models supporting explicit think tokens, include in system prompt:
# "Think carefully through each step before answering."

Benchmarks show that thinking mode dramatically improves performance on math (AIME 2026: 89.2% for 31B) and complex code generation (LiveCodeBench v6: 80.0%).

4. Native Function Calling

Gemma 4 implements function calling using built-in special tokens — not prompt tricks. This makes tool use significantly more reliable than earlier approaches where you had to engineer the model to output structured JSON and parse it yourself.

The model manages the full function-calling lifecycle: decide to call a function → format the call → receive the result → integrate into its response.
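To make that lifecycle concrete, here is a minimal local simulation of the round trip. The tool, the JSON call shape, and the message layout are illustrative stand-ins; with a real server the model emits the structured call itself:

```python
import json

# Illustrative tool the model can call (hypothetical example).
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 31, "condition": "sunny"}

TOOLS = {"get_weather": get_weather}

# Steps 1-2: the model decides to call a function and formats the call.
# (Illustrative output shape; real servers return structured tool_calls.)
model_output = '{"name": "get_weather", "arguments": {"city": "Hanoi"}}'

# Step 3: the application executes the call and captures the result.
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])

# Step 4: the result goes back as a tool message, which the model
# then integrates into its final natural-language response.
messages = [
    {"role": "user", "content": "What's the weather in Hanoi?"},
    {"role": "assistant", "tool_calls": [call]},
    {"role": "tool", "content": json.dumps(result)},
]
print(messages[-1]["content"])
```

The point of native special tokens is that steps 1, 2, and 4 happen inside the model reliably; your application only has to implement step 3.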

5. 140+ Languages

Gemma 4 was natively trained on over 140 languages, including Vietnamese 🇻🇳. If you're building multilingual applications or serving non-English users, this is a significant advantage over models trained primarily on English data.

6. Apache 2.0 License

This matters enormously. Apache 2.0 means:

  • ✅ Commercial use permitted
  • ✅ Modify and distribute
  • ✅ Use in proprietary products
  • ✅ No royalties

You own your deployment.


Performance Benchmarks

Here's how Gemma 4's instruction-tuned models compare across standard benchmarks:

| Benchmark | Description | 31B Dense | 26B MoE | E4B | E2B |
|---|---|---|---|---|---|
| Arena AI (text) | Human preference ranking | 1452 | 1441 | | |
| MMLU Pro | Knowledge Q&A | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 | Advanced math | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 | Code generation | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond | Graduate-level science | 84.3% | 82.3% | 58.6% | 43.4% |
| τ2-bench | Agentic task completion | 86.4% | 85.5% | 57.5% | 29.4% |

Note: The 31B model ranks among the top open-weight models on Arena AI's global leaderboard for text — competitive with models from Anthropic and Meta.

Which Variant Should You Pick?

  • High-end laptop or M-series Mac? → gemma-4-E4B-it (4B params, fast, great quality)
  • Gaming GPU (16GB+ VRAM)? → gemma-4-26B-A4B-it (MoE, efficient)
  • Workstation GPU (32GB+ VRAM)? → gemma-4-31B-it (best quality)
  • Mobile / edge devices? → gemma-4-E2B-it (lightest, fastest)
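That decision tree is easy to encode. The thresholds in this sketch simply mirror the recommendations above; they are a rough guide, not official sizing guidance:

```python
def pick_variant(vram_gb: float = 0, ram_gb: float = 8) -> str:
    # Thresholds follow the article's recommendations; adjust for your
    # quantization level and leave headroom for the OS and KV cache.
    if vram_gb >= 32:
        return "gemma-4-31B-it"        # best quality, server-grade
    if vram_gb >= 16:
        return "gemma-4-26B-A4B-it"    # MoE, efficient on gaming GPUs
    if ram_gb >= 6:
        return "gemma-4-E4B-it"        # laptops, M-series Macs
    return "gemma-4-E2B-it"            # mobile / edge

print(pick_variant(vram_gb=24))   # gemma-4-26B-A4B-it
print(pick_variant(ram_gb=16))    # gemma-4-E4B-it
```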

Option 1: Run with Ollama (Fastest Way)

Ollama is a CLI tool that makes it trivially easy to download and run any supported model.

Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download/windows

Pull and Run Gemma 4

# Pull the E4B variant (best balance for laptops)
ollama pull gemma4:e4b

# Run in interactive chat mode
ollama run gemma4:e4b

Available tags:

ollama pull gemma4:e2b # ~1.5GB — lightest
ollama pull gemma4:e4b # ~3.2GB — great for laptops
ollama pull gemma4:26b # ~16GB — needs a good GPU
ollama pull gemma4:31b # ~20GB — server-grade

Once running, you'll get an interactive REPL:

>>> What is the CAP theorem in distributed systems?

The CAP theorem states that a distributed data store can only guarantee
two of the following three properties simultaneously...

Use Ollama's REST API

Ollama runs a local server on http://localhost:11434 that exposes both its native API and an OpenAI-compatible endpoint, so any app that speaks the OpenAI API format can talk to it:

curl http://localhost:11434/api/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "prompt": "Write a Go function that reverses a string",
    "stream": false
  }'

Or using the OpenAI-compatible endpoint:

curl http://localhost:11434/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [
      {"role": "user", "content": "Write a Go function that reverses a string"}
    ]
  }'
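The same OpenAI-compatible endpoint is reachable from Python with just the standard library. A minimal sketch, assuming the default Ollama port and the `gemma4:e4b` tag pulled earlier:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str) -> bytes:
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return json.dumps(body).encode("utf-8")

def chat(prompt: str, model: str = "gemma4:e4b") -> str:
    req = request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:   # requires Ollama running locally
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# With Ollama running:
# print(chat("Write a Go function that reverses a string"))
```

For anything beyond a quick script, the official `openai` Python client works too: point its `base_url` at `http://localhost:11434/v1` and pass any non-empty string as the API key.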

Option 2: Run with LM Studio (GUI)

LM Studio provides a desktop app with a clean GUI — ideal if you prefer not to use the terminal.

  1. Download from lmstudio.ai and install
  2. Open the Search tab (magnifying glass icon)
  3. Search for "Gemma 4"
  4. Select the variant that matches your hardware — LM Studio will warn you if the model requires more RAM than your machine has
  5. Download — it will pull from Hugging Face automatically
  6. Switch to the AI Chat tab, select the model, and start chatting

LM Studio also exposes a local API server on port 1234, compatible with the OpenAI API format:

# Start the server in LM Studio (Settings > Local Server > Start)

curl http://localhost:1234/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, Gemma 4!"}],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Option 3: Run with Python + Hugging Face Transformers

For developers who want direct programmatic control, the transformers library gives you full access to the model.

Install Dependencies

pip install -U transformers torch accelerate
# For quantization (reduces VRAM needs dramatically):
pip install bitsandbytes

Login to Hugging Face

huggingface-cli login
# Enter your token from https://huggingface.co/settings/tokens

Basic Text Generation

from transformers import pipeline

# E2B is the lightest — swap with "google/gemma-4-e4b-it" for better quality
model_id = "google/gemma-4-e2b-it"

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",  # auto-selects GPU, MPS (Mac), or CPU
)

messages = [
    {
        "role": "system",
        "content": "You are a senior backend engineer. Keep responses concise."
    },
    {
        "role": "user",
        "content": "Explain the difference between optimistic and pessimistic locking."
    }
]

output = pipe(messages, max_new_tokens=512, temperature=0.7)
print(output[0]["generated_text"][-1]["content"])

Run with 4-bit Quantization (Reduced Memory)

If you don't have enough VRAM for the full model, quantization reduces memory by ~4x with minimal quality loss:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "google/gemma-4-26b-a4b-it"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Write a Python decorator that logs function calls."}
]

# Apply the tokenizer's built-in chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
)

# Decode only the newly generated tokens
response = tokenizer.decode(
    outputs[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)

Multimodal: Image + Text Input

from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

model_id = "google/gemma-4-e4b-it"
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Load an image from a local file
image = Image.open("architecture_diagram.png")

# Embed the image directly in the message content
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this architecture diagram and identify any potential single points of failure."}
        ]
    }
]

# The processor's chat template handles both the image and the text
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)

Advanced Config: Modelfile for Ollama

You can create a custom Modelfile to configure Gemma 4's behavior permanently — system prompt, temperature, context size:

# Modelfile — save this to a file called "Modelfile"
FROM gemma4:e4b

# System prompt — sets the model's persona
SYSTEM """
You are a senior backend engineer specializing in Java and Go.
You write clean, production-ready code with proper error handling.
Keep responses concise. Prefer code examples over long explanations.
"""

# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
# Context window size (tokens)
PARAMETER num_ctx 32768
# Max tokens to generate
PARAMETER num_predict 1024

Build and run your custom model:

ollama create my-backend-assistant -f Modelfile
ollama run my-backend-assistant

# Now it remembers the persona across all sessions:
>>> Write a Go implementation of a rate limiter using token bucket algorithm

Hardware Requirements Summary

| Model | RAM/VRAM | Inference Speed | Recommended For |
|---|---|---|---|
| E2B | ~4GB RAM | Very fast | Any modern laptop, M1 Mac |
| E4B | ~6GB RAM | Fast | M1/M2 Mac, gaming laptops |
| 26B MoE (Q4) | ~12GB VRAM | Moderate | RTX 3080/4080, Mac Studio |
| 31B Dense (Q4) | ~18GB VRAM | Slower | RTX 4090, Mac Pro, workstation |
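These figures line up with a back-of-envelope estimate: weights take roughly parameter count times bytes per parameter, plus 20-30% overhead for the KV cache and runtime buffers. A rough sketch of that arithmetic, not a guarantee for any particular runtime:

```python
def estimate_vram_gb(params_billion: float, bits: int = 4,
                     overhead: float = 1.25) -> float:
    # Weights: params × (bits / 8) bytes, plus ~25% overhead for the
    # KV cache, activations, and runtime buffers.
    weights_gb = params_billion * (bits / 8)
    return round(weights_gb * overhead, 1)

print(estimate_vram_gb(31))           # 19.4 — close to the table's ~18GB
print(estimate_vram_gb(4, bits=8))    # 5.0
```

Quantizing from 16-bit to 4-bit is where the "~4x memory reduction" claim earlier in the article comes from: the same parameter count stored in a quarter of the bits.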

Mac users: Ollama and LM Studio both support Apple Silicon's Metal GPU. The E4B model runs at ~30-40 tokens/sec on an M2 MacBook Pro — genuinely usable for daily coding work.


Key Takeaways

  1. Gemma 4 is the best open-weight model family available as of April 2026 — multimodal, long context, and Apache 2.0 licensed.

  2. Start with Ollama: `ollama pull gemma4:e4b && ollama run gemma4:e4b` is the fastest path to a working local AI in under 5 minutes.

  3. For laptops, E4B is the sweet spot: fast, capable, ~6GB RAM, and now supports images and system prompts natively.

  4. For production Python apps, use transformers with bitsandbytes quantization on larger models to fit them into consumer GPU VRAM.

  5. Multimodal support is real — you can send images and audio to the model with no extra setup, not just text.

  6. Thinking mode enables complex reasoning — enable it for math, algorithmic problems, and multi-step planning tasks.


Running a model locally isn't just a curiosity anymore — it's a practical choice when you need privacy (sensitive documents), cost control (no per-token billing), or offline capability. Gemma 4 makes that choice easy.

Thanks for reading! If you're curious about how to engineer reliable systems around these models, check out my Prompt Engineering, Context Engineering, and Harness Engineering deep dive. 🚀