
GPU-Accelerated Inference

Project Beacon implements GPU-accelerated inference for LLM workloads through a host-delegation architecture, yielding roughly 25x faster responses than CPU-only inference.

Architecture Overview

Container → Host GPU Delegation

┌─────────────────┐     HTTP API     ┌──────────────────┐
│     Client      │ ───────────────→ │   Host Ollama    │
│    Container    │                  │  (GPU-enabled)   │
│   (HTTP only)   │                  │                  │
└─────────────────┘                  └──────────────────┘
                                              │
                                              ▼
                                     ┌──────────────────┐
                                     │   GPU Hardware   │
                                     │   (Metal/CUDA)   │
                                     └──────────────────┘

Key Components

  • Client Containers: Lightweight HTTP-only containers that make API calls
  • Host Ollama: GPU-accelerated inference server running on host
  • Model Loading: Pre-pulled models with GPU layer offloading
  • Networking: Secure localhost binding with container bridge access

Performance Results

Benchmark Comparison

Architecture | Response Time | Success Rate   | GPU Utilization
Before (CPU) | 30-70s        | 60% (timeouts) | 0%
After (GPU)  | 1.25-2.83s    | 100%           | 90%+

Model Performance (Apple M1 Max)

Model         | Size   | Avg Response | GPU Layers
Llama 3.2:1b  | 1.23GB | 1.25s        | 17/17
Qwen 2.5:1.5b | 0.92GB | 2.83s        | 29/29
Mistral 7b    | 3.83GB | 2.28s        | All

Hardware Tiers

Tier 1: CPU-Only (Fallback)

  • Target: 4-8GB RAM, no GPU
  • Models: llama3.2:1b, gemma3:1b
  • Performance: 10-30s response times

Tier 2: Entry GPU (8-12GB VRAM)

  • Target: RTX 3060, Apple M1/M2, RX 6600
  • Models: Small to medium quantized models
  • Performance: 1-3s response times

Tier 3: Mid-Range GPU (16-24GB VRAM)

  • Target: RTX 4070 Ti, Apple M1 Max/Ultra
  • Models: Full-size models, multiple concurrent
  • Performance: 0.5-2s response times

Tier 4: High-End GPU (24GB+ VRAM)

  • Target: RTX 4090, A100, H100
  • Models: Large models, high throughput
  • Performance: <0.5s response times
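
For a concrete (purely illustrative) view of how these tiers might drive model selection, the sketch below maps detected GPU memory to a tier and a default model. The function name, thresholds, and latency ranges simply mirror the tiers above and are not part of Project Beacon's codebase:

# tier_select.py -- hypothetical helper mirroring the hardware tiers above;
# thresholds, model picks, and latency ranges are illustrative only.
def select_tier(vram_gb: float | None) -> dict:
    """Map available GPU memory (GB) to a hardware tier and a default model."""
    if vram_gb is None or vram_gb < 8:        # Tier 1: CPU-only fallback
        return {"tier": 1, "model": "llama3.2:1b", "expected_latency_s": (10, 30)}
    if vram_gb < 16:                          # Tier 2: entry GPU (8-12GB VRAM)
        return {"tier": 2, "model": "qwen2.5:1.5b", "expected_latency_s": (1, 3)}
    if vram_gb < 24:                          # Tier 3: mid-range GPU (16-24GB VRAM)
        return {"tier": 3, "model": "mistral:7b", "expected_latency_s": (0.5, 2)}
    return {"tier": 4, "model": "mistral:7b", "expected_latency_s": (0.1, 0.5)}  # Tier 4: 24GB+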

Implementation

Container Configuration

Client containers follow the Dockerfile.client pattern:

FROM python:3.11-slim
RUN pip install requests numpy pandas
# Copy only the benchmark files -- no model weights or GPU libraries in the image
WORKDIR /app
COPY benchmark.py .
ENV OLLAMA_BASE_URL=http://host.docker.internal:11434
ENTRYPOINT ["python3", "benchmark.py"]
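
For orientation, here is a minimal sketch of what the benchmark.py entrypoint could look like; it is not the project's actual script. It assumes Ollama's standard /api/generate endpoint, the environment variables shown on this page, and an illustrative output path under /tmp:

# benchmark.py -- minimal sketch of the client-side benchmark, not the
# project's actual implementation; assumes Ollama's /api/generate endpoint.
import json
import os
import time
import requests

BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
MODEL = os.environ.get("BENCHMARK_MODEL", "llama3.2:1b")

def run_once(prompt: str) -> float:
    """Delegate one non-streaming generation to the host Ollama and return latency."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    latencies = [run_once("Summarize GPU offloading in one sentence.") for _ in range(5)]
    result = {"model": MODEL, "avg_seconds": round(sum(latencies) / len(latencies), 2)}
    # /tmp is bind-mounted to ./results in the compose file below
    with open("/tmp/benchmark_result.json", "w") as fh:
        json.dump(result, fh)
    print(result)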

Host Ollama Setup

# Start with GPU acceleration
OLLAMA_GPU_LAYERS=999 OLLAMA_GPU_MEMORY=40GB ollama serve

# Verify GPU usage
ollama run llama3.2:1b "test" --verbose
# Should show: "load_tensors: offloaded X/X layers to GPU"

Docker Compose Integration

services:
  llama:
    image: beacon/llama-client:latest
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - BENCHMARK_MODEL=llama3.2:1b
    volumes:
      - ./results:/tmp
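    # Note (assumption, not part of the tested setup): on Linux hosts without
    # Docker Desktop, host.docker.internal usually needs an explicit mapping, e.g.
    # extra_hosts:
    #   - "host.docker.internal:host-gateway"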

Security

  • Localhost Binding: Ollama bound to 127.0.0.1:11434 only
  • Container Access: Restricted via Docker's host.docker.internal bridge
  • No External Exposure: GPU inference server not accessible from network

Monitoring

Metrics Collection

# Basic metrics via ollama-metrics.py
{
  "gpu_stats": {"gpu_name": "Apple M1 Max"},
  "models_loaded": 4,
  "inference_test": {
    "status": "success",
    "duration_seconds": 0.152
  }
}
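
A minimal sketch of how such a collector could produce these fields is shown below; it assumes Ollama's standard /api/tags and /api/generate endpoints and omits the platform-specific GPU probe behind gpu_stats:

# metrics_sketch.py -- illustrative ollama-metrics.py-style collector; assumes
# Ollama's standard HTTP API and skips the platform-specific GPU probe.
import time
import requests

BASE_URL = "http://127.0.0.1:11434"  # metrics run on the host, next to Ollama

def collect_metrics(model: str = "llama3.2:1b") -> dict:
    """Count models available on the host and time a short test generation."""
    models = requests.get(f"{BASE_URL}/api/tags", timeout=10).json().get("models", [])

    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/generate",
        json={"model": model, "prompt": "ping", "stream": False},
        timeout=60,
    )
    return {
        "models_loaded": len(models),
        "inference_test": {
            "status": "success" if resp.status_code == 200 else "error",
            "duration_seconds": round(time.time() - start, 3),
        },
    }

if __name__ == "__main__":
    print(collect_metrics())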

Performance Targets

Metric            | Target | Current
Timeout Rate      | <2%    | 0%
Avg Response Time | <3s    | 1.25s
GPU Utilization   | >80%   | 90%+

Deployment

Production Checklist

  • GPU drivers installed (NVIDIA/AMD)
  • Ollama configured with GPU acceleration
  • Client containers built and tested
  • Monitoring and alerting configured
  • Performance benchmarks validated

Scaling Considerations

  • Model Pre-loading: Keep frequently used models warm
  • Concurrent Requests: Balance GPU memory vs throughput
  • Failover: Graceful degradation to CPU when the GPU is unavailable (see the sketch after this list)
  • Multi-GPU: Load balancing across multiple GPUs
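
As a sketch of the failover consideration above, under the assumption of a CPU-only fallback endpoint exposed via a hypothetical OLLAMA_FALLBACK_URL variable (not part of the current configuration):

# failover_sketch.py -- hypothetical graceful degradation from the GPU-backed
# host endpoint to a slower CPU-only fallback; variable names are illustrative.
import os
import requests

PRIMARY_URL = os.environ.get("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
FALLBACK_URL = os.environ.get("OLLAMA_FALLBACK_URL")  # e.g. a CPU-only Ollama instance

def generate_with_failover(model: str, prompt: str) -> dict:
    """Try the GPU-backed endpoint first; fall back to CPU with a longer timeout."""
    endpoints = [(PRIMARY_URL, 30)]
    if FALLBACK_URL:
        endpoints.append((FALLBACK_URL, 120))  # CPU inference needs a generous timeout

    last_error: Exception | None = None
    for url, timeout in endpoints:
        try:
            resp = requests.post(
                f"{url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # try the next endpoint
    raise RuntimeError(f"all inference endpoints failed: {last_error}")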

Troubleshooting

Common Issues

GPU Not Detected:

# Check GPU availability
ollama ps
# Running models should report GPU (e.g. "100% GPU"), not CPU

High CPU Usage:

# Verify containers using client architecture
docker ps --format "table {{.Image}}\t{{.Command}}"
# Should show beacon/*-client images, not ollama/ollama

Slow Inference:

# Check GPU layer offloading
ollama show model:name --verbose | grep -i gpu
# Should show layers offloaded to GPU

Next Steps

  • Deploy on NVIDIA/AMD production hosts
  • Implement multi-model concurrent serving
  • Add advanced observability and alerting
  • Scale testing across multiple regions