
GPU-Accelerated Inference

Project Beacon implements GPU-accelerated inference for LLM workloads through a host-delegation architecture, yielding roughly 25x faster responses than CPU-only inference.

Architecture Overview

Container → Host GPU Delegation

┌─────────────────┐     HTTP API     ┌──────────────────┐
│     Client      │ ───────────────→ │   Host Ollama    │
│    Container    │                  │  (GPU-enabled)   │
│   (HTTP only)   │                  │                  │
└─────────────────┘                  └──────────────────┘
                                              │
                                              ▼
                                     ┌──────────────────┐
                                     │   GPU Hardware   │
                                     │   (Metal/CUDA)   │
                                     └──────────────────┘

Key Components

  • Client Containers: Lightweight HTTP-only containers that make API calls
  • Host Ollama: GPU-accelerated inference server running on host
  • Model Loading: Pre-pulled models with GPU layer offloading
  • Networking: Secure localhost binding with container bridge access

Performance Results

Benchmark Comparison

Architecture | Response Time | Success Rate   | GPU Utilization
Before (CPU) | 30-70s        | 60% (timeouts) | 0%
After (GPU)  | 1.25-2.83s    | 100%           | 90%+

Model Performance (Apple M1 Max)

Model         | Size   | Avg Response | GPU Layers
Llama 3.2:1b  | 1.23GB | 1.25s        | 17/17
Qwen 2.5:1.5b | 0.92GB | 2.83s        | 29/29
Mistral 7b    | 3.83GB | 2.28s        | All

Hardware Tiers

Tier 1: CPU-Only (Fallback)

  • Target: 4-8GB RAM, no GPU
  • Models: llama3.2:1b, gemma3:1b
  • Performance: 10-30s response times

Tier 2: Entry GPU (8-12GB VRAM)

  • Target: RTX 3060, Apple M1/M2, RX 6600
  • Models: Small to medium quantized models
  • Performance: 1-3s response times

Tier 3: Mid-Range GPU (16-24GB VRAM)

  • Target: RTX 4070 Ti, Apple M1 Max/Ultra
  • Models: Full-size models, multiple concurrent
  • Performance: 0.5-2s response times

Tier 4: High-End GPU (24GB+ VRAM)

  • Target: RTX 4090, A100, H100
  • Models: Large models, high throughput
  • Performance: <0.5s response times
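
For a concrete (purely illustrative) view of how these tiers might drive model selection, the sketch below maps detected GPU memory to a tier and a default model. The function name, thresholds, and latency ranges simply mirror the tiers above and are not part of Project Beacon's codebase:

# tier_select.py -- hypothetical helper mirroring the hardware tiers above;
# thresholds, model picks, and latency ranges are illustrative only.
def select_tier(vram_gb: float | None) -> dict:
    """Map available GPU memory (GB) to a hardware tier and a default model."""
    if vram_gb is None or vram_gb < 8:        # Tier 1: CPU-only fallback
        return {"tier": 1, "model": "llama3.2:1b", "expected_latency_s": (10, 30)}
    if vram_gb < 16:                          # Tier 2: entry GPU (8-12GB VRAM)
        return {"tier": 2, "model": "qwen2.5:1.5b", "expected_latency_s": (1, 3)}
    if vram_gb < 24:                          # Tier 3: mid-range GPU (16-24GB VRAM)
        return {"tier": 3, "model": "mistral:7b", "expected_latency_s": (0.5, 2)}
    return {"tier": 4, "model": "mistral:7b", "expected_latency_s": (0.1, 0.5)}  # Tier 4: 24GB+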

Implementation

Container Configuration

Client containers follow the Dockerfile.client pattern:

FROM python:3.11-slim
RUN pip install requests numpy pandas
# Copy only the benchmark files -- no model weights or GPU libraries in the image
WORKDIR /app
COPY benchmark.py .
ENV OLLAMA_BASE_URL=http://host.docker.internal:11434
ENTRYPOINT ["python3", "benchmark.py"]
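
For orientation, here is a minimal sketch of what the benchmark.py entrypoint could look like; it is not the project's actual script. It assumes Ollama's standard /api/generate endpoint, the environment variables shown on this page, and an illustrative output path under /tmp:

# benchmark.py -- minimal sketch of the client-side benchmark, not the
# project's actual implementation; assumes Ollama's /api/generate endpoint.
import json
import os
import time
import requests

BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
MODEL = os.environ.get("BENCHMARK_MODEL", "llama3.2:1b")

def run_once(prompt: str) -> float:
    """Delegate one non-streaming generation to the host Ollama and return latency."""
    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return time.time() - start

if __name__ == "__main__":
    latencies = [run_once("Summarize GPU offloading in one sentence.") for _ in range(5)]
    result = {"model": MODEL, "avg_seconds": round(sum(latencies) / len(latencies), 2)}
    # /tmp is bind-mounted to ./results in the compose file below
    with open("/tmp/benchmark_result.json", "w") as fh:
        json.dump(result, fh)
    print(result)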

Host Ollama Setup

# Start with GPU acceleration
OLLAMA_GPU_LAYERS=999 OLLAMA_GPU_MEMORY=40GB ollama serve

# Verify GPU usage
ollama run llama3.2:1b "test" --verbose
# Should show: "load_tensors: offloaded X/X layers to GPU"

Docker Compose Integration

services:
  llama:
    image: beacon/llama-client:latest
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      - BENCHMARK_MODEL=llama3.2:1b
    volumes:
      - ./results:/tmp
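    # Note (assumption, not part of the tested setup): on Linux hosts without
    # Docker Desktop, host.docker.internal usually needs an explicit mapping, e.g.
    # extra_hosts:
    #   - "host.docker.internal:host-gateway"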

Security

  • Localhost Binding: Ollama bound to 127.0.0.1:11434 only
  • Container Access: Restricted via Docker's host.docker.internal bridge
  • No External Exposure: GPU inference server not accessible from network

Monitoring

Metrics Collection

# Basic metrics via ollama-metrics.py
{
  "gpu_stats": {"gpu_name": "Apple M1 Max"},
  "models_loaded": 4,
  "inference_test": {
    "status": "success",
    "duration_seconds": 0.152
  }
}
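
A minimal sketch of how such a collector could produce these fields is shown below; it assumes Ollama's standard /api/tags and /api/generate endpoints and omits the platform-specific GPU probe behind gpu_stats:

# metrics_sketch.py -- illustrative ollama-metrics.py-style collector; assumes
# Ollama's standard HTTP API and skips the platform-specific GPU probe.
import time
import requests

BASE_URL = "http://127.0.0.1:11434"  # metrics run on the host, next to Ollama

def collect_metrics(model: str = "llama3.2:1b") -> dict:
    """Count models available on the host and time a short test generation."""
    models = requests.get(f"{BASE_URL}/api/tags", timeout=10).json().get("models", [])

    start = time.time()
    resp = requests.post(
        f"{BASE_URL}/api/generate",
        json={"model": model, "prompt": "ping", "stream": False},
        timeout=60,
    )
    return {
        "models_loaded": len(models),
        "inference_test": {
            "status": "success" if resp.status_code == 200 else "error",
            "duration_seconds": round(time.time() - start, 3),
        },
    }

if __name__ == "__main__":
    print(collect_metrics())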

Performance Targets

Metric            | Target | Current
Timeout Rate      | <2%    | 0%
Avg Response Time | <3s    | 1.25s
GPU Utilization   | >80%   | 90%+

Deployment

Production Checklist

  • GPU drivers installed (NVIDIA/AMD)
  • Ollama configured with GPU acceleration
  • Client containers built and tested
  • Monitoring and alerting configured
  • Performance benchmarks validated

Scaling Considerations

  • Model Pre-loading: Keep frequently used models warm
  • Concurrent Requests: Balance GPU memory vs throughput
  • Failover: Graceful degradation to CPU when the GPU is unavailable (see the sketch after this list)
  • Multi-GPU: Load balancing across multiple GPUs
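
As a sketch of the failover consideration above, under the assumption of a CPU-only fallback endpoint exposed via a hypothetical OLLAMA_FALLBACK_URL variable (not part of the current configuration):

# failover_sketch.py -- hypothetical graceful degradation from the GPU-backed
# host endpoint to a slower CPU-only fallback; variable names are illustrative.
import os
import requests

PRIMARY_URL = os.environ.get("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
FALLBACK_URL = os.environ.get("OLLAMA_FALLBACK_URL")  # e.g. a CPU-only Ollama instance

def generate_with_failover(model: str, prompt: str) -> dict:
    """Try the GPU-backed endpoint first; fall back to CPU with a longer timeout."""
    endpoints = [(PRIMARY_URL, 30)]
    if FALLBACK_URL:
        endpoints.append((FALLBACK_URL, 120))  # CPU inference needs a generous timeout

    last_error: Exception | None = None
    for url, timeout in endpoints:
        try:
            resp = requests.post(
                f"{url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # try the next endpoint
    raise RuntimeError(f"all inference endpoints failed: {last_error}")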

Troubleshooting

Common Issues

GPU Not Detected:

# Check GPU availability
ollama ps
# Running models should report GPU (e.g. "100% GPU"), not CPU

High CPU Usage:

# Verify containers using client architecture
docker ps --format "table {{.Image}}\t{{.Command}}"
# Should show beacon/*-client images, not ollama/ollama

Slow Inference:

# Check GPU layer offloading
ollama show model:name --verbose | grep -i gpu
# Should show layers offloaded to GPU

Next Steps

  • Deploy on NVIDIA/AMD production hosts
  • Implement multi-model concurrent serving
  • Add advanced observability and alerting
  • Scale testing across multiple regions