Running FLUX.1 OmniControl on a Consumer GPU: A Docker Implementation tested on RTX 3060



🎯 TL;DR: Subject-Driven Image Generation on 12GB VRAM

Large AI models like FLUX.1-schnell typically require datacenter GPUs with 48GB+ VRAM. Problem: Most developers and hobbyists only have access to consumer RTX cards, which typically offer 6-12GB of VRAM (with the exception of the expensive 4090/5090 cards, which can go up to 32GB).

Solution: Using mmgp (Memory Management for GPU Poor) with Docker containerization enables FLUX.1 OmniControl to run on RTX 3060 12GB through 8-bit quantization, dynamic VRAM/RAM offloading, and selective layer loading. The implementation provides a Gradio web interface generating 512x512 images in ~10 seconds after initial model loading, with models persisting in system RAM to avoid reload overhead.

Technical Approach: Profile 3 configuration quantizes the T5 text encoder (8.8GB → ~4.4GB), pins the FLUX transformer (22.7GB) to reserved system RAM, and dynamically loads only active layers to VRAM during inference. Tested and validated on RTX 3060 12GB with 64GB system RAM running Windows 11 + WSL2 + Docker Desktop.

Complete Implementation: All code, Dockerfile, and setup instructions are available at github.com/Ricky-G/docker-ai-models/omnicontrol


Recently, I wanted to experiment with OmniControl, a subject-driven image generation model that extends FLUX.1-schnell with LoRA adapters for precise control over object placement. The challenge? The model requirements listed 48GB+ VRAM, and I only had an RTX 3060 with 12GB sitting in my workstation.

This is a common frustration in the AI development community. Research papers showcase impressive results on expensive datacenter hardware, but practical implementation on consumer GPUs requires significant engineering effort. Could I actually run this model locally without upgrading to an RTX 4090/5090 or paying for an Azure VM with an A100?

The answer turned out to be yes - with some clever memory management and containerization. This blog post walks through the complete process of dockerizing OmniControl to run efficiently on a 12GB consumer GPU.

What is FLUX.1 OmniControl?

Before diving into the technical implementation, let’s understand what we’re working with. FLUX.1 OmniControl is a subject-driven image generation model that extends the base FLUX.1-schnell diffusion model with LoRA (Low-Rank Adaptation) adapters for precise control over object placement and composition.

Unlike traditional text-to-image models where you only provide a text prompt, OmniControl allows you to:

  • Subject Consistency: Provide a reference image of a specific object (like a toy, person, or product) and have it accurately reproduced in generated images
  • Spatial Control: Specify exactly where in the scene you want objects placed
  • Style Preservation: Maintain the visual characteristics of the reference object across different contexts and environments

Think of it as “Photoshop + AI” - you can place your specific objects into any scene you can describe with text. This makes it incredibly powerful for product visualization, creative content generation, and prototyping visual concepts.

The trade-off? The model is massive - requiring over 30GB of model weights to achieve this level of control and quality. This is where the engineering challenge begins.

The Challenge: Model Size vs Available VRAM

Let’s start with the hard numbers:

FLUX.1-schnell model components:

  • Transformer: 22.7GB (torch.bfloat16)
  • T5 Text Encoder: 8.8GB
  • CLIP Text Encoder: 162MB
  • VAE: ~1GB
  • OminiControl LoRA: 200MB
  • Total: ~32.8GB of model weights

Available hardware:
I was constrained by my existing workstation specs:

  • RTX 3060: 12GB VRAM
  • System RAM: 64GB DDR4
  • Storage: 1TB NVMe SSD + 2TB HDD
  • CPU: Intel i7-11700K
  • OS: Windows 11 + WSL2 (Ubuntu 22.04)
  • Docker Desktop with NVIDIA Container Toolkit

The gap is obvious - we need nearly 3x more VRAM than the GPU provides. Traditional approaches like FP16 precision or model pruning weren’t going to cut it. We needed something more aggressive.

Understanding mmgp: Memory Management for GPU Poor

The key enabler for this project is mmgp (Memory Management for GPU Poor), a Python library specifically designed to run large models on consumer hardware. Here’s how it works:

8-Bit Quantization

mmgp uses quanto to quantize large model components from 16-bit to 8-bit precision:

  • T5 encoder: 8.8GB → ~4.4GB (50% reduction)
  • Quality impact: Minimal for text encoding tasks
  • Speed impact: Slight increase in encoding time (~10-15%)
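mmgp performs this quantization internally, but to illustrate the idea, here is roughly what 8-bit weight quantization of the T5 encoder looks like when done directly with the optimum-quanto library. This is a minimal sketch; the import path and the "text_encoder_2" subfolder are assumptions based on the standard diffusers layout of the FLUX.1-schnell repository:

```python
import torch
from transformers import T5EncoderModel
from optimum.quanto import quantize, freeze, qint8

# Load the T5 text encoder in bfloat16 (~8.8GB of weights).
# In the diffusers layout of FLUX.1-schnell, the T5 encoder
# lives in the "text_encoder_2" subfolder.
text_encoder = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    subfolder="text_encoder_2",
    torch_dtype=torch.bfloat16,
)

# Convert the weights to 8-bit integers and freeze them in place (~4.4GB).
quantize(text_encoder, weights=qint8)
freeze(text_encoder)
```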

Dynamic VRAM/RAM Offloading

Instead of keeping all model weights in VRAM, mmgp maintains a “working set”:

  • Critical layers: Loaded to VRAM during active use
  • Inactive layers: Offloaded to pinned system RAM
  • Transfers: Handled automatically during forward passes
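mmgp's own offloading logic is more involved, but the core trick of hooking each layer so it is copied to VRAM just before its forward pass and evicted afterwards can be sketched with plain PyTorch hooks. This is an illustration of the technique, not mmgp's actual code:

```python
import torch
from torch import nn

def attach_offload_hooks(model: nn.Module, device: str = "cuda") -> None:
    """Keep only the currently executing block in VRAM."""
    for block in model.children():
        # Just before a block runs, move its weights to the GPU.
        def pre_hook(module, inputs):
            module.to(device, non_blocking=True)

        # As soon as it has produced its output, move it back to CPU RAM.
        def post_hook(module, inputs, output):
            module.to("cpu", non_blocking=True)
            return output

        block.register_forward_pre_hook(pre_hook)
        block.register_forward_hook(post_hook)
```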

RAM Pinning Strategy

Models are loaded once from disk to system RAM (one-time cost), then:

  • Pinned memory allocation: 75% of system RAM reserved (48GB in my case)
  • Fast transfers: Pinned RAM → VRAM takes ~200ms for 1GB
  • Persistent storage: Models stay in RAM across generations
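To see why pinning matters, here is a small timing sketch comparing a host-to-device copy from ordinary pageable RAM against page-locked (pinned) RAM. The absolute numbers depend on your PCIe and memory configuration, so treat them as illustrative:

```python
import time
import torch

size_gb = 1
num_elements = size_gb * 1024**3 // 2  # float16 = 2 bytes per element

pageable = torch.empty(num_elements, dtype=torch.float16)
pinned = torch.empty(num_elements, dtype=torch.float16, pin_memory=True)

for name, host_tensor in [("pageable", pageable), ("pinned", pinned)]:
    torch.cuda.synchronize()
    start = time.time()
    host_tensor.to("cuda", non_blocking=True)  # async copy for pinned memory
    torch.cuda.synchronize()
    print(f"{name}: {time.time() - start:.3f}s to transfer {size_gb}GB")
```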

Profile System

mmgp provides 5 preconfigured profiles:

| Profile | Target VRAM | Strategy | Use Case |
|---------|-------------|----------|----------|
| 1 | 16-24GB | Full model in VRAM | Maximum speed |
| 2 | 12-16GB | Partial VRAM + RAM | Balanced |
| 3 | 12GB | Quantization + pinning | RTX 3060 sweet spot |
| 4 | 8-12GB | Aggressive quantization | Lower-end cards |
| 5 | 6-8GB | Minimal VRAM usage | GPU Poor mode |

For RTX 3060, Profile 3 provides the best balance between speed and stability.

Prerequisites: What You’ll Need

Before starting the implementation, ensure you have the following components set up:

Hardware Requirements

Minimum Configuration:

  • NVIDIA GPU: 12GB VRAM (RTX 3060, 3060 Ti, or better)
  • System RAM: 64GB DDR4/DDR5 (48GB will be pinned for model storage)
  • Storage: 50GB free space (35GB for models + overhead)
  • CPU: Any modern multi-core processor

Recommended Configuration:

  • GPU: RTX 3060 12GB or RTX 4060 Ti 16GB
  • RAM: 64GB or more
  • Storage: NVMe SSD for faster startup times (HDD works but adds 2-3 min to load times)

Software Requirements

Windows Users:

  • Windows 11 (Windows 10 with WSL2 also works)
  • WSL2 installed and configured
  • Docker Desktop for Windows (latest version)
  • NVIDIA Container Toolkit (installed via Docker Desktop)

Linux Users:

  • Ubuntu 22.04 or similar distribution
  • Docker Engine (latest version)
  • NVIDIA Container Toolkit
  • NVIDIA drivers (version 525+)

Account Requirements

  • A Hugging Face account and access token (passed to the container as HF_TOKEN in the Quick Start below) for downloading the model weights

Verification Steps

Before proceeding, verify your setup:

# Check GPU availability
nvidia-smi

# Verify Docker installation
docker --version

# Test GPU access in Docker
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

If all commands execute successfully, you’re ready to begin!

Docker Architecture: Why Containerization?

With prerequisites confirmed, let’s talk about why Docker is the right choice for this project. Running large AI models involves complex dependency chains - specific versions of PyTorch, CUDA libraries, Python packages, and system libraries that can conflict with your existing environment.

💡 Want to Skip Ahead?

The complete Docker implementation, including the Dockerfile, all Python code, and deployment scripts, is available in my GitHub repository: docker-ai-models/omnicontrol

You can clone and run it immediately, or continue reading to understand how it works under the hood.

Containerization solves this by:

  • Isolating dependencies from your host system
  • Ensuring reproducibility across different machines
  • Simplifying deployment - one command to run the entire stack
  • Enabling version control of the entire environment

The containerization approach provides several additional benefits:

  • Eliminates dependency conflicts
  • Ensures reproducible builds
  • Simplifies deployment across machines
  • Isolates model storage from application code

Container Structure

nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
├── Python 3.10 + CUDA libraries
├── PyTorch 2.0 with CUDA support
├── Diffusers + Transformers
├── mmgp for memory management
├── Gradio for web interface
└── Custom FLUX integration code

Volume Mounts

-v D:\_Models\omnicontrol:/app/models    # Persistent model storage (34GB)

Models download once to the host system and persist across container rebuilds. This is critical for development iteration - rebuilding the container doesn’t trigger 30-minute model downloads.

GPU Access

--gpus all    # Exposes all NVIDIA GPUs to container

Docker Desktop + NVIDIA Container Toolkit handles GPU passthrough automatically on Windows via WSL2.
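Beyond the nvidia-smi test from the prerequisites, a quick PyTorch check run inside the running container (for example via docker exec -it omnicontrol python) confirms that the framework itself can see the card:

```python
import torch

# Expect True, "NVIDIA GeForce RTX 3060" and roughly 12288 MiB
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(f"{torch.cuda.get_device_properties(0).total_memory / 1024**2:.0f} MiB")
```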

Implementation Details: From Code to Running System

Now that we understand the architecture and tools, let’s dive into how everything works in practice. This section covers the actual startup sequence, performance characteristics, and a critical optimization that makes this entire approach viable.

⚡ Key Performance Insight Ahead

One of the biggest challenges in this implementation was preventing VRAM from being cleared after each generation, which would cause 80+ second reload times. The solution? A single line of code change that reduced subsequent generation times from 120s to 10s. We’ll cover this critical fix in detail below.

Startup Sequence

The container initialization follows this sequence:

1. GPU Detection (~1 second)

nvidia-smi --query-gpu=name,memory.total --format=csv
# Output: NVIDIA GeForce RTX 3060, 12288 MiB

2. Profile Selection (automatic)

vram = get_gpu_memory()
if vram >= 11000:
    profile = 3  # 12GB optimized
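The get_gpu_memory() helper isn't shown above; a hypothetical implementation using PyTorch (rather than parsing nvidia-smi output) could look like this:

```python
import torch

def get_gpu_memory() -> int:
    """Total VRAM of GPU 0 in MiB, or 0 if no CUDA device is visible."""
    if not torch.cuda.is_available():
        return 0
    return torch.cuda.get_device_properties(0).total_memory // (1024 ** 2)
```

On an RTX 3060 this returns 12288, which clears the 11000 MiB threshold and selects Profile 3.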

3. Model Loading (2-3 minutes from HDD)

  • FLUX.1-schnell: Downloads from HuggingFace (~22.7GB)
  • OminiControl LoRA: Downloads adapter weights (~200MB)
  • Loads to CPU first, then applies mmgp profiling

4. mmgp Profiling (1-2 minutes)

  • Quantizes T5 encoder to 8-bit
  • Allocates 48GB pinned RAM (75% of 64GB)
  • Hooks model layers for dynamic offloading

5. Gradio Launch (~5 seconds)

  • Web interface starts on port 7860
  • Ready to accept generation requests

Total first-run time: 5-10 minutes (mostly downloading models)
Subsequent runs: ~3 minutes (loading from disk to RAM)

Generation Performance

First generation after startup:

  • Time: ~110 seconds
  • Breakdown:
    • VRAM loading: 80 seconds (22.7GB from RAM → VRAM)
    • Actual inference: 30 seconds (8 steps @ 512x512)
  • GPU memory: Climbs from 3GB → 10-12GB

Subsequent generations:

  • Time: ~10 seconds (target achieved!)
  • VRAM stays at 10-12GB between generations
  • No reload overhead

Critical Fix: Preventing VRAM Clearing

🚨 This Section Contains The Key Optimization

Initial testing revealed a major performance bottleneck that would have made this entire approach impractical. Understanding and fixing this issue is critical for achieving acceptable performance.

The Problem:

During initial testing, the first image generation took about 110 seconds (expected), but every subsequent generation also took 110+ seconds. Monitoring GPU memory usage revealed the issue:

  • After generation completes: VRAM drops from 10-12GB back to 3GB
  • Next generation starts: 80 seconds spent reloading models from RAM to VRAM
  • Inference runs: 30 seconds of actual generation
  • Total: 110 seconds per image, no matter how many you generate

This made the system unusable for practical work - imagine waiting nearly 2 minutes for every single image!

The Root Cause:

Diagnosis revealed that FLUX’s generation code was calling maybe_free_model_hooks() after every inference pass. This function is designed to free memory for systems running multiple models or tight memory scenarios, but in our case where we want to generate multiple images in sequence, it was counterproductive.

The culprit was in src/flux/generate.py:

# BEFORE (problematic)
def generate():
    # ... generation code ...
    self.maybe_free_model_hooks()  # ❌ Unloads everything from VRAM!

# AFTER (fixed)
def generate():
    # ... generation code ...
    # DISABLED: Keep models in VRAM between generations
    # self.maybe_free_model_hooks()  # ✅ Models stay loaded

The Impact:

This single line change transformed the performance profile:

| Metric | Before Fix | After Fix | Improvement |
|--------|------------|-----------|-------------|
| First generation | 110s | 110s | (same) |
| Second generation | 110s | 10s | 11x faster |
| Third generation | 110s | 10s | 11x faster |
| VRAM after gen | 3GB | 10-12GB | (persistent) |

Suddenly, generating 10 images went from over 18 minutes to just over 3 minutes (110s + 9 × 10s ≈ 200s). This made the difference between "technically possible but impractical" and "actually usable for real work."

📂 See the Implementation:

The complete modified FLUX generation code with this optimization is available in the GitHub repository at src/flux/generate.py. You can see exactly how the model loading and generation pipeline is structured, along with all the mmgp integration code.

Real-World Testing Results

Test Configuration

Hardware: RTX 3060 12GB, Intel i7-11700K, 64GB DDR4, Micron NVMe main drive, 7200RPM HDD secondary

OS: Windows 11 + WSL2 (Ubuntu 22.04)

Docker: Desktop 4.28 with NVIDIA Container Toolkit

Model: FLUX.1-schnell + OminiControl subject_512.safetensors

Settings: 8 inference steps, 512x512 resolution

Generation Tests

Test 1: Cold start

Prompt: "A film photography shot. This item is placed on a wooden desk 
in a cozy study room. Warm afternoon sunlight streams through
the window."
Subject: Toy robot figure
Time: 108 seconds
Quality: Excellent, subject preserved with accurate placement

Test 2: Immediate follow-up

Prompt: "On Christmas evening, on a crowded sidewalk, this item sits 
covered in snow wearing a Santa hat."
Subject: Same toy robot
Time: 11 seconds
Quality: Excellent, consistent subject representation

Test 3: Third generation

Prompt: "Underwater photography. This item sits on a coral reef with 
tropical fish swimming around it."
Subject: Same toy robot
Time: 10 seconds
Quality: Good, some water distortion artifacts (expected)

Resource Monitoring

During active generation:

  • GPU Utilization: 95-100%
  • VRAM Usage: 10.2GB / 12GB (85%)
  • System RAM: 52GB / 64GB (model pinning)
  • CPU Usage: 15-20% (mainly data preprocessing)
  • Power Draw: 170W (RTX 3060 TDP)

Storage Impact

HDD vs SSD comparison (estimated):

  • HDD: 2-3 minutes initial load from disk
  • SSD: 30-45 seconds initial load (roughly 4x faster)
  • During generation: No difference (models in RAM)

Recommendation: SSD for faster startup, but not required for generation performance.

Lessons Learned: What Works and What Doesn’t

After extensive testing and iteration, here are the key insights organized by category. These lessons can save you hours of troubleshooting if you’re implementing something similar.

Memory Management Insights

✅ Profile 3 is the Sweet Spot for 12GB Cards

Tested all five mmgp profiles extensively. Profile 3 provides the perfect balance:

  • Stable VRAM usage at 85% capacity (10.2GB / 12GB)
  • Fast inference times (10s per image)
  • No OOM errors or crashes across 100+ test generations

Profiles 1-2 required more VRAM than available, while Profiles 4-5 were unnecessarily slow.

✅ RAM Pinning Eliminates the Disk Bottleneck

The 75% RAM allocation strategy (48GB pinned) was crucial:

  • First load: 2-3 minutes from HDD to RAM (one-time cost)
  • Subsequent loads: <5 seconds from pinned RAM to VRAM
  • Models persist across generations with zero disk I/O

Without pinning, every generation would require disk access - absolutely impractical.

⚠️ WSL2 Memory Limits Are Deceiving

Initial attempts with default WSL2 settings failed. The issue:

  • Host system: 64GB RAM available
  • WSL2 container: Only sees ~31GB (50% default limit)
  • mmgp profile calculation: Incorrectly assumes full RAM available

Solution: Explicitly configure .wslconfig to allocate more memory to WSL2 (for example, setting memory=48GB under the [wsl2] section of %UserProfile%\.wslconfig), or force mmgp to use the perc_reserved_mem_max=0.75 parameter.

❌ Auto-Offloading Strategies Don’t Work Well

Tried mmgp’s offloadAfterEveryCall feature - it caused frequent crashes:

  • Unpredictable VRAM usage patterns
  • Race conditions between loading/offloading
  • No performance benefit over persistent loading

Lesson: For sequential generation workloads, keep models loaded.

Storage and I/O Optimization

📊 HDD vs SSD Impact Analysis

| Phase | HDD | SSD | Impact |
|-------|-----|-----|--------|
| Initial model download | 5-10 min | 5-10 min | Network-bound |
| First load (disk → RAM) | 2-3 min | 30-45 sec | 4x faster |
| RAM → VRAM transfer | 200ms/GB | 200ms/GB | RAM speed |
| During generation | 0 disk I/O | 0 disk I/O | No difference |
Key Insight: SSD only matters for startup time. Once models are in RAM, storage speed is irrelevant. If you’re doing many generations in one session, HDD is perfectly acceptable.

💾 Volume Mount Strategy Was Critical

Storing models on the host filesystem (-v D:\_Models:/app/models) provided:

  • Persistence across container rebuilds
  • Ability to share models between different containers
  • Easy backup and version management
  • No re-downloading during development iterations

Without this, every code change would require re-downloading 35GB of models.

Configuration and Deployment

✅ Gradio Provided Zero-Effort UI

Using Gradio for the web interface was brilliant:

  • 20 lines of Python for complete web UI
  • Automatic file upload handling
  • Built-in image preview and download
  • No frontend development required

Alternative approaches (Flask, React frontend) would have taken days vs hours.
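As a rough idea of what those ~20 lines look like, here is a minimal Gradio interface with the same shape; generate_image is a placeholder standing in for the actual FLUX.1 + OminiControl pipeline call, not the repository's real code:

```python
import gradio as gr

def generate_image(subject_image, prompt, steps=8):
    # Placeholder: the real implementation would run the FLUX.1 +
    # OminiControl pipeline here. Echoing the reference image back
    # keeps this sketch runnable end to end.
    return subject_image

demo = gr.Interface(
    fn=generate_image,
    inputs=[
        gr.Image(type="pil", label="Subject reference"),
        gr.Textbox(label="Prompt"),
        gr.Slider(1, 20, value=8, step=1, label="Inference steps"),
    ],
    outputs=gr.Image(label="Generated image"),
    title="FLUX.1 OmniControl",
)

demo.launch(server_name="0.0.0.0", server_port=7860)
```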

✅ Docker Isolated the Complexity

Containerization proved invaluable:

  • No conflicts with host Python environment
  • Reproducible across machines (tested on 3 different PCs)
  • Easy version control of entire stack
  • Simple deployment (docker run and done)

❌ Profile 1 Caused Out-of-Memory Errors

Attempted to use Profile 1 (full model in VRAM) for maximum speed:

  • Required 16GB+ VRAM
  • RTX 3060’s 12GB couldn’t handle it
  • Resulted in CUDA OOM errors mid-generation

Lesson: Always profile your actual available memory, not theoretical specs.

Quality and Performance Trade-offs

✅ 8-Bit Quantization Had Minimal Quality Impact

Side-by-side comparison of T5 encoder outputs:

  • FP16 (original): Baseline quality
  • INT8 (quantized): <5% subjective quality difference
  • Memory savings: 8.8GB → 4.4GB (50% reduction)

Conclusion: For text encoding tasks, 8-bit quantization is essentially free VRAM.

📈 Generation Speed Met Targets

| Goal | Result | Status |
|------|--------|--------|
| First gen < 2 min | 110s | ✅ Achieved |
| Subsequent gen < 15s | 10s | ✅ Exceeded |
| VRAM stable | 10-12GB consistent | ✅ Achieved |
| Quality acceptable | Excellent outputs | ✅ Achieved |

The 10-second generation time makes this practical for real creative work.

Production Considerations

For Personal Use

This setup works great for:

  • Hobbyist AI experimentation
  • Content creation (social media, art projects)
  • Proof-of-concept development
  • Learning FLUX.1 architecture

For Commercial Use

Consider these factors:

  • Generation time: 10s/image × 1000 images = 2.8 hours
  • Scalability: Single GPU, no batch processing
  • Reliability: Consumer GPU thermal throttling under sustained load
  • Support: mmgp is community-maintained, not enterprise-supported

For production workloads, consider:

  • Cloud GPUs (Azure N-Series VMs or Azure Container Apps with GPU nodes) - minimum A40/A100
  • Local GPU upgrade to RTX 4090 or A6000
  • Batch processing optimizations
  • Multiple parallel containers

Docker Hub & Repository

The complete implementation is available:

  • GitHub: docker-ai-models/omnicontrol
  • README: Full setup instructions and troubleshooting
  • Dockerfile: Production-ready container definition
  • Source code: Custom FLUX integration with mmgp

Quick Start

# Clone repository
git clone https://github.com/Ricky-G/docker-ai-models.git
cd docker-ai-models/omnicontrol

# Build container
docker build -t omnicontrol .

# Run with HuggingFace token
docker run -d --gpus all --name omnicontrol \
  -p 7860:7860 \
  -v D:\_Models\omnicontrol:/app/models \
  -e HF_TOKEN=your_token_here \
  omnicontrol

# Access web interface
# http://localhost:7860

Conclusion

Running FLUX.1 OmniControl on a 12GB RTX 3060 is not only possible but practical. Through careful memory management with mmgp, strategic quantization, and container optimization, we achieved:

  • ✅ 10-second generation times (after initial load)
  • ✅ Stable VRAM usage across multiple generations
  • ✅ No quality degradation from quantization
  • ✅ Reproducible Docker deployment

The key insight: Memory management is more important than raw VRAM capacity. With the right tools and configuration, consumer GPUs can run models designed for datacenter hardware.

If you have an RTX 3060 (or similar 12GB card) collecting dust because you thought it couldn’t handle modern AI models, give this approach a try. The democratization of AI isn’t just about open-source models - it’s about making them runnable on hardware people actually own.


Hardware tested: RTX 3060 12GB, 64GB RAM, Windows 11 + WSL2

Software stack: Docker Desktop, NVIDIA Container Toolkit, mmgp 3.6.9

Model: FLUX.1-schnell + OminiControl LoRA

Performance: 10s per 512x512 image (8 steps)

