
Local AI in Your Terminal: Ollama + Gemma 4 + ShellGPT

Run Google's Gemma 4 locally with Ollama and wire it into your terminal workflow via ShellGPT — from shell commands to image OCR.

Running AI locally means zero API costs, no rate limits, and your data never leaves your machine. With Google's Gemma 4 model, Ollama, and ShellGPT, you can handle everything from generating shell commands to extracting text from images, all from the terminal.

Here's the full setup and the workflows that actually matter.

Why Run AI Locally?

Cloud APIs like OpenAI's and Claude's are great, but they come with tradeoffs:

Approach                    Cost      Privacy                    Speed                  Offline
Cloud API (OpenAI, Claude)  $5-20/mo  Data sent to server        Depends on network     No
Local Ollama + Gemma 4      Free      100% local                 Fast on Apple Silicon  Yes
Hybrid (local + cloud)      Varies    Local for sensitive tasks  Best of both           Partial

Local AI works best for quick shell tasks, file processing, and anything involving private data. Cloud APIs stay in the picture for complex reasoning and web-connected tasks.

What is Gemma 4?

Gemma 4 is Google DeepMind's latest open-weights model family, released April 2, 2026. It comes in four sizes:

Model       RAM     Active params  Best for
gemma4:e2b  ~3 GB   2B             Edge devices, fastest
gemma4:e4b  ~5 GB   4B             Daily terminal use (default)
gemma4:26b  ~10 GB  4B (MoE)       Quality close to 13B, speed of 4B
gemma4:31b  ~20 GB  31B            Flagship, needs beefy RAM

The e4b is the default when you pull gemma4. It runs great on Apple Silicon's unified memory. The 26B MoE variant is the sleeper pick: only 4B parameters activate per token, so you get near-13B quality at 4B speed.

Key upgrades over Gemma 3: configurable thinking mode (chain-of-thought on/off per request), native function calling, 128K-256K context window, and image input on all sizes.

Setup: 3 Steps

Step 1: Install Ollama

Ollama manages and serves local models. On macOS:

brew install ollama

Or, on Linux, use the official install script:

curl -fsSL https://ollama.com/install.sh | sh

Start the server (with Homebrew you can instead run it as a background service via brew services start ollama):

ollama serve

Step 2: Pull Gemma 4

# Default (e4b, ~9.6 GB download)
ollama pull gemma4

# Or pick a specific size
ollama pull gemma4:e2b    # smallest
ollama pull gemma4:26b    # MoE sweet spot

Verify it's installed:

ollama list
# NAME             SIZE
# gemma4:latest    9.6 GB
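If you script against the model list (say, to check a model exists before a cron job runs), the tabular output parses easily. A minimal sketch; the sample output below is hard-coded so it runs without Ollama installed, and assumes the NAME / ID / SIZE / MODIFIED columns of recent Ollama versions:

```python
# Parse `ollama list` output into (name, size) pairs.
# SAMPLE stands in for: subprocess.run(["ollama", "list"],
#   capture_output=True, text=True).stdout
SAMPLE = """\
NAME             ID              SIZE      MODIFIED
gemma4:latest    a1b2c3d4e5f6    9.6 GB    2 days ago
gemma4:e2b       f6e5d4c3b2a1    3.0 GB    5 days ago
"""

def parse_ollama_list(text: str) -> list[tuple[str, str]]:
    """Return (model_name, size) for each installed model, skipping the header row."""
    models = []
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) >= 4:
            # parts[2] and parts[3] are the size value and unit, e.g. "9.6", "GB"
            models.append((parts[0], f"{parts[2]} {parts[3]}"))
    return models

print(parse_ollama_list(SAMPLE))
# → [('gemma4:latest', '9.6 GB'), ('gemma4:e2b', '3.0 GB')]
```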

Step 3: Wire Up ShellGPT

ShellGPT (sgpt) is a CLI tool that sends your prompts to an LLM. Install it with pip install shell-gpt, then point it at your local Ollama server by editing ~/.config/shell_gpt/.sgptrc:

DEFAULT_MODEL=gemma4
OPENAI_API_KEY=dummy
OPENAI_BASE_URL=http://localhost:11434/v1
API_BASE_URL=http://localhost:11434/v1

That's it. ShellGPT now talks to Gemma 4 running locally.
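This works because Ollama exposes an OpenAI-compatible endpoint at /v1, so ShellGPT just sends standard chat-completion requests to it. A sketch of the same request built by hand (the network call itself is left commented out so the snippet runs without a server):

```python
# What ShellGPT does under the hood: an OpenAI-style chat request
# against Ollama's /v1 endpoint. Standard library only.
import json
from urllib import request

def build_chat_request(prompt: str, model: str = "gemma4") -> request.Request:
    """Build (but don't send) a chat-completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer dummy"},  # Ollama ignores the key
    )

req = build_chat_request("find all files larger than 100MB")
# With `ollama serve` running, you would then do:
#   body = json.load(request.urlopen(req))
#   print(body["choices"][0]["message"]["content"])
```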

Daily Workflows

Here's how this setup fits into real terminal work.

┌──────────────────────────────────────────┐
│               Your Terminal              │
│                                          │
│  Ctrl+L ──▶ sgpt -s "find large files"   │
│              │                           │
│              ▼                           │
│  ┌────────────────────┐                  │
│  │  Ollama (local)    │                  │
│  │  Gemma 4 e4b       │                  │
│  └────────┬───────────┘                  │
│           │                              │
│           ▼                              │
│  Generated command ready to execute      │
│                                          │
│  ollama run gemma4 "prompt" image.png    │
│           │                              │
│           ▼                              │
│  Extracted text / analysis returned      │
└──────────────────────────────────────────┘

Shell Command Generation (Ctrl+L)

The most frequent use case. Instead of Googling shell syntax, ask Gemma 4:

# Generate a command (shell mode)
sgpt -s "find all files larger than 100MB in current directory"
# Output: find . -size +100M -type f

# Generate and execute directly
sgpt -se "compress all PNG files in this folder"

# Complex piped commands you'd never remember
sgpt -s "show top 10 processes by memory usage sorted descending"

Tip: Many terminals (iTerm2, Warp) let you bind Ctrl+L to clear the line and prepend sgpt -s. This turns your terminal into an AI-powered command palette.
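For bash and zsh, ShellGPT ships its own hotkey setup: sgpt --install-integration adds a Ctrl+L binding to your shell rc file. A trimmed-down sketch of the zsh widget it installs, in case you prefer to add it by hand (the widget name and exact flags here are simplified from the bundled integration):

```shell
# ~/.zshrc — rewrite the current command line in place with Ctrl+L.
# Note: this shadows zsh's default Ctrl+L (clear-screen).
_sgpt_zsh() {
    if [[ -n "$BUFFER" ]]; then
        # Feed the typed text to sgpt's shell mode and replace the line
        BUFFER=$(sgpt -s --no-interaction <<< "$BUFFER")
        zle end-of-line
    fi
}
zle -N _sgpt_zsh
bindkey '^l' _sgpt_zsh
```

Type a plain-English description, hit Ctrl+L, and the line is replaced with a runnable command.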

Image Processing and OCR

Gemma 4 supports image input on all model sizes. Pass an image path directly to Ollama:

# Extract text from a document image
ollama run gemma4 "Extract all text from this image" /path/to/document.png

# Describe a screenshot
ollama run gemma4 "What does this screenshot show?" ~/Desktop/screenshot.png

# Read a receipt
ollama run gemma4 "List all items and prices" receipt.jpg

We tested this on a battery test report cover page. Gemma 4 correctly read the model number (L135F72), chemistry (LiFePO4), serial number (EU7223092221143), and all specs without any errors.
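Scripting image tasks goes through Ollama's native REST API, where images travel as base64 strings in the "images" field of /api/generate. A sketch of the payload construction; the image bytes below are a placeholder, not a real PNG:

```python
# Build an OCR request for Ollama's /api/generate endpoint.
import base64
import json

def build_ocr_payload(image_bytes: bytes,
                      prompt: str = "Extract all text from this image") -> str:
    return json.dumps({
        "model": "gemma4",
        "prompt": prompt,
        # Ollama expects base64-encoded image data, not file paths
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # one JSON object back instead of a token stream
    })

# In practice: build_ocr_payload(open("/path/to/document.png", "rb").read())
payload = build_ocr_payload(b"\x89PNG-placeholder")
# POST it with: curl http://localhost:11434/api/generate -d "$payload"
```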

Chat Mode for Multi-Turn Tasks

When you need back-and-forth conversation:

# Start a named chat session
sgpt --chat debug "I'm getting a segfault in my C program"
# Follow up in the same session
sgpt --chat debug "here's the backtrace: ..."
sgpt --chat debug "how do I fix the null pointer on line 42?"

Code Review from a Diff

# Pipe a git diff for review
git diff | sgpt "review this diff for bugs"

# Explain unfamiliar code
cat script.py | sgpt "explain what this code does"
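If you run reviews often, the pipe is worth shortening to one word. A small config sketch; the alias names here are our choice, not part of ShellGPT:

```shell
# ~/.zshrc (or ~/.bashrc) — one-word review helpers
alias airev='git diff | sgpt "review this diff for bugs"'
alias airevs='git diff --staged | sgpt "review this diff for bugs"'
```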

Controlling Thinking Mode

Gemma 4 has chain-of-thought thinking enabled by default. This improves reasoning quality but adds latency. For quick tasks like command generation, you can turn it off:

# Disable thinking for faster responses
ollama run gemma4 --think=false "list all docker containers"
# (inside an interactive session, /set nothink does the same)

# Re-enable thinking for complex tasks
ollama run gemma4 --think=true "debug this error message: ..."

Through the API, pass "think": false in your request body. When using ShellGPT for shell commands (sgpt -s), thinking mode adds unnecessary overhead since you just need a one-liner back. Consider creating a Modelfile with thinking disabled for shell tasks:

# Modelfile.shell
FROM gemma4
PARAMETER think false

# Build it, then point ShellGPT at the new model:
ollama create gemma4-shell -f Modelfile.shell
# In .sgptrc: DEFAULT_MODEL=gemma4-shell

Best Practices

  • Keep Ollama running as a service so it's always ready. First query loads the model (~5 seconds), subsequent queries are instant.
  • Use shell mode (-s) for commands, plain mode for questions. Shell mode strips explanation and gives you runnable output.
  • Match model size to task. e4b handles 90% of terminal tasks. Pull 26b only if you need better reasoning for code review.
  • Clean up old models to save disk: ollama rm gemma3:4b frees gigabytes.
  • Set request timeout in .sgptrc to 60+ seconds for first-load scenarios: REQUEST_TIMEOUT=60.
  • Image tasks go through ollama run directly, not sgpt. ShellGPT doesn't pass image files, so use ollama run gemma4 "prompt" /path/to/image.png.
  • Disable thinking for speed-sensitive tasks. Use --think=false (or /set nothink in an interactive session), or create a dedicated Modelfile with thinking off for shell command generation.

Key Takeaways

  • Ollama + Gemma 4 gives you a capable local AI that runs free on Apple Silicon
  • ShellGPT bridges the gap between your terminal and the model
  • sgpt -s for shell commands, ollama run gemma4 for image processing
  • The e4b variant (default) is the sweet spot for daily terminal use
  • Disable thinking mode with --think=false for faster responses on simple tasks
  • Keep cloud APIs for tasks that need web access or deeper reasoning. Local AI handles the rest.