The autoscaler found a deal: A100 PCIe, 80GB VRAM, $0.34/hour. Great price. I spun it up, deployed my ML worker, and watched it process images.
160 products per minute. Not bad.
Then I did the math. The A100 was using 24GB of its 80GB. My batch size was 8—tuned for an RTX 4090 with 24GB. I was renting a sports car and driving it in first gear.
The fix took 20 lines of code. Throughput jumped to 400 products per minute. Same GPU. Same hourly rate. Just better utilization.
The Problem with Hardcoded Batch Sizes
My original code looked like this:
```python
BATCH_SIZE = 8  # Works on RTX 4090

async def process_batch(items):
    # Collect 8 items, process together
    ...
```
This was tuned through trial and error on my development GPU. Too high? OOM crash. Too low? Wasted capacity. I found the sweet spot and hardcoded it.
The problem: my production GPU wasn't my development GPU.
The autoscaler picks whatever's cheapest and available. Sometimes that's an RTX 4090 (24GB). Sometimes it's an A100 (80GB). Sometimes it's an L40 (48GB). The spot market doesn't care about my hardcoded constants.
With a static batch size of 8, I was:
- Under-utilizing big GPUs: A100 could handle 32, but I was doing 8
- Crashing on small GPUs: If I'd tuned for A100, RTX 4090 would OOM
- Leaving money on the table: Paying the same hourly rate for 1/4 the throughput
The Fix: Detect and Adapt
The GPU knows how much VRAM it has. Ask it.
```python
import subprocess

def get_gpu_vram_gb() -> int:
    """Detect GPU VRAM using nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True
    )
    vram_mb = int(result.stdout.strip())
    return vram_mb // 1024
```
Now map VRAM to optimal batch size:
```python
def get_optimal_batch_size(vram_gb: int) -> int:
    """Batch size based on available VRAM.

    Model uses ~14GB base. Each image in batch adds ~0.5GB.
    Leave headroom for safety.
    """
    if vram_gb >= 80:     # A100, H100
        return 32
    elif vram_gb >= 48:   # A40, L40
        return 16
    elif vram_gb >= 24:   # RTX 4090, RTX 3090
        return 8
    else:                 # Smaller GPUs
        return 4
```
```python
# At startup
GPU_VRAM = get_gpu_vram_gb()
BATCH_SIZE = get_optimal_batch_size(GPU_VRAM)
print(f"Detected {GPU_VRAM}GB VRAM, using batch size {BATCH_SIZE}")
```
Now the same code runs optimally on any GPU. RTX 4090 gets batch size 8. A100 gets batch size 32. No OOM crashes. No wasted capacity.
The Math Behind the Mapping
How did I get those numbers? I profiled the model:
| Component | VRAM Usage |
|---|---|
| Base model (Qwen2.5-VL-7B) | ~14GB |
| Embedding model (FashionSigLIP) | ~2GB |
| CUDA overhead | ~2GB |
| Per-image working memory | ~0.5GB |
Available for batching = Total VRAM - Base - Embedding - Overhead
For an A100 (80GB):
- Available = 80 - 14 - 2 - 2 = 62GB
- Max batch = 62 / 0.5 = 124 images
But I don't go to 124. Why?
- Memory fragmentation. CUDA doesn't always pack perfectly.
- Peak vs average. Some operations spike higher than average.
- Safety margin. OOM crashes are expensive. Lost work, restart time.
- Diminishing returns. Batch 32 vs 64 isn't 2x throughput—there's overhead.
So I use conservative estimates: 32 for 80GB, 16 for 48GB, 8 for 24GB. Leaves ~50% headroom.
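For reference, here is the same arithmetic as a short sketch. The constants come from the profiling table above; the function names are mine:

```python
# Profiled memory budget (approximate figures from the table above).
BASE_MODEL_GB    = 14.0   # Qwen2.5-VL-7B
EMBEDDING_GB     = 2.0    # FashionSigLIP
CUDA_OVERHEAD_GB = 2.0
PER_IMAGE_GB     = 0.5

def theoretical_max_batch(vram_gb: float) -> int:
    """Upper bound from the formula; ignores fragmentation and peak spikes."""
    available = vram_gb - BASE_MODEL_GB - EMBEDDING_GB - CUDA_OVERHEAD_GB
    return int(available / PER_IMAGE_GB)

def estimated_usage_gb(batch_size: int) -> float:
    """Average-case VRAM footprint for a given batch size."""
    return BASE_MODEL_GB + EMBEDDING_GB + CUDA_OVERHEAD_GB + batch_size * PER_IMAGE_GB

print(theoretical_max_batch(80))  # 124 images
print(estimated_usage_gb(32))     # 34.0 GB -- well under half of an 80GB card
```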
What I Got Wrong Initially
I forgot to set prefetch accordingly. Batch size 32 is useless if the queue only gives you 16 messages at a time.
```python
# Prefetch should match batch processing
await channel.set_qos(prefetch_count=BATCH_SIZE * 2)
```
The 2x multiplier ensures the next batch is ready while the current one processes.
I didn't log the detection. First production run, I had no idea what batch size was actually being used. Now it's the first thing in the logs:
```
Detected GPU VRAM: 80GB
Using batch size: 32
```
If something's wrong, I see it immediately.
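The same two lines work with the standard logging module if you prefer it over print; the logger name here is arbitrary:

```python
import logging

logger = logging.getLogger("worker")

# First thing after detection: record what the worker decided to run with.
logger.info("Detected GPU VRAM: %dGB", GPU_VRAM)
logger.info("Using batch size: %d", BATCH_SIZE)
```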
I assumed nvidia-smi would always work. On one instance, nvidia-smi wasn't in PATH. The detection failed silently and defaulted to batch size 8. Now I handle the error explicitly:
```python
def get_gpu_vram_gb() -> int:
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:
            return int(result.stdout.strip().split('\n')[0]) // 1024
    except Exception:
        pass
    return 24  # Conservative default
```
I didn't verify at runtime. The formula said 32 should work, but I'd never actually tested it. First time it ran on an A100, I watched the VRAM graph nervously. It peaked at 58GB. Formula was right, but I didn't know until I saw it.
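If you'd rather not watch the graph by hand, the same nvidia-smi query style used for detection can report live usage. A minimal sketch; the helper names and the 80% warning threshold are my own choices:

```python
import subprocess

def get_gpu_vram_used_gb() -> int:
    """Current VRAM usage in GB, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, timeout=10
    )
    return int(result.stdout.strip().split('\n')[0]) // 1024

def check_vram_headroom() -> None:
    """Warn when actual usage gets close to the detected limit (call after each batch)."""
    used = get_gpu_vram_used_gb()
    if used > GPU_VRAM * 0.8:  # 80% is an arbitrary safety line, tune to taste
        print(f"WARNING: VRAM usage {used}GB is within 20% of the {GPU_VRAM}GB limit")
```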
Beyond Batch Size
The same principle applies to other resources:
Image download concurrency:
```python
# More VRAM = faster processing = need more images ready
IMAGE_CONCURRENCY = BATCH_SIZE * 2
```
Worker threads:
```python
from multiprocessing import cpu_count

# Match parallelism to what the GPU can handle
WORKER_THREADS = min(BATCH_SIZE, cpu_count())
```
Timeout thresholds:
```python
# Bigger batches take longer
BATCH_TIMEOUT = 1.0 + (BATCH_SIZE * 0.1)  # 1s base + 0.1s per item
```
When one resource adapts, others often need to follow.
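One way to keep these knobs consistent is to derive them all from the same detection step. A minimal sketch; the WorkerSettings container and derive_settings helper are illustrative, not what the worker literally uses:

```python
from dataclasses import dataclass
from multiprocessing import cpu_count

@dataclass
class WorkerSettings:
    """Every GPU-derived knob in one place, computed once at startup."""
    batch_size: int
    image_concurrency: int
    worker_threads: int
    batch_timeout: float

def derive_settings(vram_gb: int) -> WorkerSettings:
    batch_size = get_optimal_batch_size(vram_gb)
    return WorkerSettings(
        batch_size=batch_size,
        image_concurrency=batch_size * 2,        # keep downloads ahead of the GPU
        worker_threads=min(batch_size, cpu_count()),
        batch_timeout=1.0 + batch_size * 0.1,    # 1s base + 0.1s per item
    )
```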
The Pattern
```python
# 1. Detect available resources at startup
resources = detect_hardware()

# 2. Calculate optimal settings from resources
settings = calculate_settings(resources)

# 3. Log what you're using
log(f"Hardware: {resources}, Settings: {settings}")

# 4. Apply settings
apply_settings(settings)

# 5. Monitor actual usage
while running:
    actual_usage = measure_usage()
    if actual_usage > threshold:
        log_warning(f"Approaching limit: {actual_usage}")
```
The key: settings are derived from hardware, not hardcoded. When the hardware changes, settings adapt automatically.
The Checklist
If you implement this:
- Detect hardware resources at startup (VRAM, RAM, CPU cores)
- Map resources to operational parameters (batch size, concurrency, timeouts)
- Log detected resources and derived settings
- Handle detection failures with conservative defaults
- Set related parameters consistently (prefetch, concurrency, timeouts)
- Monitor actual resource usage to validate your formulas
- Test on actual hardware variations, not just the formula
- Leave safety margins (50% headroom is reasonable)
- Document the mapping logic for future maintainers
When NOT to Use This
- Homogeneous infrastructure. If every instance is identical, static config is simpler.
- Resource detection is unreliable. If you can't trust the detection, don't adapt to it.
- Testing is impractical. If you can't test all hardware variations, conservative static limits are safer.
- Simplicity matters more. Dynamic adaptation adds complexity. For small workloads, it's not worth it.
The Takeaway
For months, I tuned batch sizes on my laptop and deployed them to production. It worked—but only because I was renting similar GPUs.
The moment I started using spot market GPUs, the mismatch hurt. Small GPUs crashed. Big GPUs wasted capacity. Same code, wildly different results.
Now the code asks the GPU what it can handle, then adapts. An RTX 4090 gets batch size 8. An A100 gets batch size 32. The formula is explicit, the detection is logged, and the code runs optimally wherever it lands.
The GPU market doesn't give you consistent hardware. Your code shouldn't assume it does.
Where This Applies
- GPU inference workers (VLM, embedding, classification)
- Batch processing systems on heterogeneous infrastructure
- Spot instance workloads with variable hardware
- Kubernetes pods with different resource limits
- Any system where the hardware varies but the code doesn't