Running FLUX.1 OmniControl on a Consumer GPU: A Docker Implementation Tested on an RTX 3060
🎯 TL;DR: Subject-Driven Image Generation on 12GB VRAM
Large AI models like FLUX.1-schnell typically require datacenter GPUs with 48GB+ VRAM. Problem: Most developers and hobbyists only have access to consumer RTX cards, which in most cases offer 6-12GB of VRAM (the exception being the expensive 4090/5090 cards, which go up to 32GB).
Solution: Using mmgp (Memory Management for GPU Poor) with Docker containerization enables FLUX.1 OmniControl to run on an RTX 3060 12GB through 8-bit quantization, dynamic VRAM/RAM offloading, and selective layer loading. The implementation provides a Gradio web interface that generates 512x512 images in ~10 seconds once the models are warm (the first generation takes ~110 seconds), with models persisting in system RAM to avoid reload overhead.
Technical Approach: Profile 3 configuration quantizes the T5 text encoder (8.8GB → ~4.4GB), pins the FLUX transformer (22.7GB) to reserved system RAM, and dynamically loads only active layers to VRAM during inference. Tested and validated on RTX 3060 12GB with 64GB system RAM running Windows 11 + WSL2 + Docker Desktop.
Complete Implementation: All code, Dockerfile, and setup instructions are available at github.com/Ricky-G/docker-ai-models/omnicontrol
Recently, I wanted to experiment with OmniControl, a subject-driven image generation model that extends FLUX.1-schnell with LoRA adapters for precise control over object placement. The challenge? The model requirements listed 48GB+ VRAM, and I only had an RTX 3060 with 12GB sitting in my workstation.
This is a common frustration in the AI development community. Research papers showcase impressive results on expensive datacenter hardware, but practical implementation on consumer GPUs requires significant engineering effort. Could I actually run this model locally without upgrading to an RTX 4090/5090 or paying for an Azure VM with an A100?
The answer turned out to be yes - with some clever memory management and containerization. This blog post walks through the complete process of dockerizing OmniControl to run efficiently on a 12GB consumer GPU.
What is FLUX.1 OmniControl?
Before diving into the technical implementation, let’s understand what we’re working with. FLUX.1 OmniControl is a subject-driven image generation model that extends the base FLUX.1-schnell diffusion model with LoRA (Low-Rank Adaptation) adapters for precise control over object placement and composition.
Unlike traditional text-to-image models where you only provide a text prompt, OmniControl allows you to:
- Subject Consistency: Provide a reference image of a specific object (like a toy, person, or product) and have it accurately reproduced in generated images
- Spatial Control: Specify exactly where in the scene you want objects placed
- Style Preservation: Maintain the visual characteristics of the reference object across different contexts and environments
Think of it as “Photoshop + AI” - you can place your specific objects into any scene you can describe with text. This makes it incredibly powerful for product visualization, creative content generation, and prototyping visual concepts.
The trade-off? The model is massive - requiring over 30GB of model weights to achieve this level of control and quality. This is where the engineering challenge begins.
The Challenge: Model Size vs Available VRAM
Let’s start with the hard numbers:
FLUX.1-schnell model components:
- Transformer: 22.7GB (torch.bfloat16)
- T5 Text Encoder: 8.8GB
- CLIP Text Encoder: 162MB
- VAE: ~1GB
- OminiControl LoRA: 200MB
- Total: ~32.8GB of model weights
Available hardware:
I am constrained by my existing workstation specs:
- RTX 3060: 12GB VRAM
- System RAM: 64GB DDR4
- Storage: 1TB NVMe SSD + 2TB HDD
- CPU: Intel i7-11700K
- OS: Windows 11 + WSL2 (Ubuntu 22.04)
- Docker Desktop with NVIDIA Container Toolkit
The gap is obvious - we need nearly 3x more VRAM than the GPU provides. Traditional approaches like FP16 precision or model pruning weren’t going to cut it. We needed something more aggressive.
Understanding mmgp: Memory Management for GPU Poor
The key enabler for this project is mmgp (Memory Management for GPU Poor), a Python library specifically designed to run large models on consumer hardware. Here’s how it works:
8-Bit Quantization
mmgp uses quanto to quantize large model components from 16-bit to 8-bit precision:
- T5 encoder: 8.8GB → ~4.4GB (50% reduction)
- Quality impact: Minimal for text encoding tasks
- Speed impact: Slight increase in encoding time (~10-15%)
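For intuition, here is roughly what that quantization step looks like in isolation, using Hugging Face's optimum-quanto directly. This is a standalone sketch of the technique, not mmgp's internal code; the checkpoint and subfolder names follow the standard diffusers FLUX layout.

```python
# Sketch: 8-bit weight quantization of the FLUX T5 text encoder with optimum-quanto.
import torch
from transformers import T5EncoderModel
from optimum.quanto import quantize, freeze, qint8

t5 = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    subfolder="text_encoder_2",          # the 8.8GB T5-XXL encoder
    torch_dtype=torch.bfloat16,
)
quantize(t5, weights=qint8)  # swap linear weights for int8 values plus scales
freeze(t5)                   # materialize the int8 weights and drop the bf16 originals (~50% smaller)
```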
Dynamic VRAM/RAM Offloading
Instead of keeping all model weights in VRAM, mmgp maintains a “working set”:
- Critical layers: Loaded to VRAM during active use
- Inactive layers: Offloaded to pinned system RAM
- Transfers: Handled automatically during forward passes
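Conceptually this is done with PyTorch module hooks. The toy sketch below illustrates the mechanism only - it is not mmgp's actual implementation, and it uses plain Linear layers as stand-ins for transformer blocks:

```python
import torch
import torch.nn as nn

def to_vram_before_forward(module, inputs):
    # Pull this layer into VRAM just before it runs.
    module.to("cuda", non_blocking=True)

def to_ram_after_forward(module, inputs, output):
    # Push the layer back to system RAM once it has finished.
    module.to("cpu", non_blocking=True)
    return output

# Stand-in for a stack of transformer blocks that lives in system RAM.
blocks = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(4)])
for block in blocks:
    block.register_forward_pre_hook(to_vram_before_forward)
    block.register_forward_hook(to_ram_after_forward)

x = torch.randn(1, 4096, device="cuda")
for block in blocks:   # at any moment, only the active layer occupies VRAM
    x = block(x)
```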
RAM Pinning Strategy
Models are loaded once from disk to system RAM (one-time cost), then:
- Pinned memory allocation: 75% of system RAM reserved (48GB in my case)
- Fast transfers: Pinned RAM → VRAM takes ~200ms for 1GB
- Persistent storage: Models stay in RAM across generations
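The difference pinning makes is easy to measure yourself. The snippet below times a 1GB host-to-GPU copy with pageable versus pinned memory; absolute numbers depend on your PCIe generation and RAM, and the ~200ms/GB figure above is from my machine:

```python
import time
import torch

def time_copy_to_gpu(cpu_tensor, label):
    torch.cuda.synchronize()
    start = time.perf_counter()
    cpu_tensor.to("cuda", non_blocking=True)
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.3f}s for 1GB")

one_gb = torch.empty(256 * 1024 * 1024, dtype=torch.float32)   # 1GiB of fp32 in pageable RAM
time_copy_to_gpu(one_gb, "pageable RAM -> VRAM")
time_copy_to_gpu(one_gb.pin_memory(), "pinned RAM   -> VRAM")
```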
Profile System
mmgp provides 5 preconfigured profiles:
| Profile | Target VRAM | Strategy | Use Case |
|---|---|---|---|
| 1 | 16-24GB | Full model in VRAM | Maximum speed |
| 2 | 12-16GB | Partial VRAM + RAM | Balanced |
| 3 | 12GB | Quantization + pinning | RTX 3060 sweet spot |
| 4 | 8-12GB | Aggressive quantization | Lower-end cards |
| 5 | 6-8GB | Minimal VRAM usage | GPU Poor mode |
For RTX 3060, Profile 3 provides the best balance between speed and stability.
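If you want the container to pick a profile automatically instead of hard-coding one, a simple mapping from detected VRAM to the table above is enough. The thresholds below are illustrative and simply mirror the table; they are not an official mmgp heuristic:

```python
import torch

def pick_mmgp_profile(device: int = 0) -> int:
    """Map detected VRAM to an mmgp profile number (thresholds follow the table above)."""
    vram_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    if vram_gb >= 16:
        return 1        # full model in VRAM
    if vram_gb > 12:
        return 2        # partial VRAM + RAM
    if vram_gb > 8:
        return 3        # quantization + pinning - the RTX 3060 sweet spot
    if vram_gb > 6:
        return 4        # aggressive quantization
    return 5            # GPU Poor mode

print(f"Selected mmgp profile: {pick_mmgp_profile()}")
```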
Prerequisites: What You’ll Need
Before starting the implementation, ensure you have the following components set up:
Hardware Requirements
Minimum Configuration:
- NVIDIA GPU: 12GB VRAM (RTX 3060 12GB or better)
- System RAM: 64GB DDR4/DDR5 (48GB will be pinned for model storage)
- Storage: 50GB free space (35GB for models + overhead)
- CPU: Any modern multi-core processor
Recommended Configuration:
- GPU: RTX 3060 12GB or RTX 4060 Ti 16GB
- RAM: 64GB or more
- Storage: NVMe SSD for faster startup times (HDD works but adds 2-3 min to load times)
Software Requirements
Windows Users:
- Windows 11 (Windows 10 with WSL2 also works)
- WSL2 installed and configured
- Docker Desktop for Windows (latest version)
- NVIDIA Container Toolkit (installed via Docker Desktop)
Linux Users:
- Ubuntu 22.04 or similar distribution
- Docker Engine (latest version)
- NVIDIA Container Toolkit
- NVIDIA drivers (version 525+)
Account Requirements
- HuggingFace Account: Required to download models
- HuggingFace Token: Generate a read-access token at huggingface.co/settings/tokens
Verification Steps
Before proceeding, verify your setup:
```bash
# Check GPU availability
nvidia-smi

# Check that Docker can see the GPU (uses the same base image as the container)
docker run --rm --gpus all nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04 nvidia-smi
```
If all commands execute successfully, you’re ready to begin!
Docker Architecture: Why Containerization?
With prerequisites confirmed, let’s talk about why Docker is the right choice for this project. Running large AI models involves complex dependency chains - specific versions of PyTorch, CUDA libraries, Python packages, and system libraries that can conflict with your existing environment.
💡 Want to Skip Ahead?
The complete Docker implementation, including the Dockerfile, all Python code, and deployment scripts, is available in my GitHub repository: docker-ai-models/omnicontrol
You can clone and run it immediately, or continue reading to understand how it works under the hood.
Containerization solves this by:
- Isolating dependencies from your host system and eliminating conflicts
- Ensuring reproducible builds across different machines
- Simplifying deployment - one command to run the entire stack
- Enabling version control of the entire environment
- Isolating model storage from application code
Container Structure
The container builds on NVIDIA's CUDA runtime image, with Python, PyTorch, and the application layered on top:

```dockerfile
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
```
Volume Mounts
```bash
-v D:\_Models\omnicontrol:/app/models   # Persistent model storage (34GB)
```
Models download once to the host system and persist across container rebuilds. This is critical for development iteration - rebuilding the container doesn’t trigger 30-minute model downloads.
GPU Access
```bash
--gpus all   # Exposes all NVIDIA GPUs to container
```
Docker Desktop + NVIDIA Container Toolkit handles GPU passthrough automatically on Windows via WSL2.
Implementation Details: From Code to Running System
Now that we understand the architecture and tools, let’s dive into how everything works in practice. This section covers the actual startup sequence, performance characteristics, and a critical optimization that makes this entire approach viable.
⚡ Key Performance Insight Ahead
One of the biggest challenges in this implementation was preventing VRAM from being cleared after each generation, which would cause 80+ second reload times. The solution? A single-line code change that reduced subsequent generation times from ~110s to 10s. We'll cover this critical fix in detail below.
Startup Sequence
The container initialization follows this sequence:
1. GPU Detection (~1 second)
```bash
nvidia-smi --query-gpu=name,memory.total --format=csv
```
2. Profile Selection (automatic)
```python
vram = get_gpu_memory()   # app helper reporting available VRAM; 12GB maps to Profile 3 (see table above)
```
3. Model Loading (2-3 minutes from HDD)
- FLUX.1-schnell: Downloads from HuggingFace (~22.7GB)
- OminiControl LoRA: Downloads adapter weights (~200MB)
- Loads to CPU first, then applies mmgp profiling
4. mmgp Profiling (1-2 minutes)
- Quantizes T5 encoder to 8-bit
- Allocates 48GB pinned RAM (75% of 64GB)
- Hooks model layers for dynamic offloading
5. Gradio Launch (~5 seconds)
- Web interface starts on port 7860
- Ready to accept generation requests
Total first-run time: 5-10 minutes (mostly downloading models)
Subsequent runs: ~3 minutes (loading from disk to RAM)
Generation Performance
First generation after startup:
- Time: ~110 seconds
- Breakdown:
- VRAM loading: 80 seconds (22.7GB from RAM → VRAM)
- Actual inference: 30 seconds (8 steps @ 512x512)
- GPU memory: Climbs from 3GB → 10-12GB
Subsequent generations:
- Time: ~10 seconds (target achieved!)
- VRAM stays at 10-12GB between generations
- No reload overhead
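A quick way to watch this behaviour from inside the container is to print PyTorch's allocator stats between generations. These are standard torch calls; the 10-12GB figures above come from nvidia-smi, which also counts non-PyTorch overhead:

```python
import torch

def report_vram(tag: str) -> None:
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated: {allocated:.1f} GB, reserved: {reserved:.1f} GB")

# Call e.g. report_vram("after generation") after each pipeline call:
# if the numbers collapse back to a few GB, the model is being evicted between runs.
```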
Critical Fix: Preventing VRAM Clearing
🚨 This Section Contains The Key Optimization
Initial testing revealed a major performance bottleneck that would have made this entire approach impractical. Understanding and fixing this issue is critical for achieving acceptable performance.
The Problem:
During initial testing, the first image generation took about 110 seconds (expected), but every subsequent generation also took 110+ seconds. Monitoring GPU memory usage revealed the issue:
- After generation completes: VRAM drops from 10-12GB back to 3GB
- Next generation starts: 80 seconds spent reloading models from RAM to VRAM
- Inference runs: 30 seconds of actual generation
- Total: 110 seconds per image, no matter how many you generate
This made the system unusable for practical work - imagine waiting nearly 2 minutes for every single image!
The Root Cause:
Diagnosis revealed that FLUX’s generation code was calling maybe_free_model_hooks() after every inference pass. This function is designed to free memory for systems running multiple models or tight memory scenarios, but in our case where we want to generate multiple images in sequence, it was counterproductive.
The culprit was in src/flux/generate.py. In simplified form (the real pipeline call takes more arguments), the change looks like this:

```python
# BEFORE (problematic): diffusers' cleanup hook tears down the offloaded model after every call
image = pipeline(prompt=prompt, **generation_kwargs).images[0]
pipeline.maybe_free_model_hooks()   # releases VRAM -> forces an ~80s reload on the next generation

# AFTER (fixed): skip the cleanup so the model stays resident between generations
image = pipeline(prompt=prompt, **generation_kwargs).images[0]
# pipeline.maybe_free_model_hooks()  # intentionally removed for sequential generation
```
The Impact:
This single line change transformed the performance profile:
| Metric | Before Fix | After Fix | Improvement |
|---|---|---|---|
| First generation | 110s | 110s | (same) |
| Second generation | 110s | 10s | 11x faster |
| Third generation | 110s | 10s | 11x faster |
| VRAM after gen | 3GB | 10-12GB | (persistent) |
Suddenly, generating 10 images went from roughly 18 minutes to just over 3 minutes (110s + 9 × 10s = 200s). This made the difference between “technically possible but impractical” and “actually usable for real work.”
📂 See the Implementation:
The complete modified FLUX generation code with this optimization is available in the GitHub repository at src/flux/generate.py. You can see exactly how the model loading and generation pipeline is structured, along with all the mmgp integration code.
Real-World Testing Results
Test Configuration
Hardware: RTX 3060 12GB, Intel i7-11700K, 64GB DDR4, Micron NVMe main drive, 7200RPM HDD secondary
OS: Windows 11 + WSL2 (Ubuntu 22.04)
Docker: Desktop 4.28 with NVIDIA Container Toolkit
Model: FLUX.1-schnell + OminiControl subject_512.safetensors
Settings: 8 inference steps, 512x512 resolution
Generation Tests
Test 1: Cold start

```
Prompt: "A film photography shot. This item is placed on a wooden desk
```

Test 2: Immediate follow-up

```
Prompt: "On Christmas evening, on a crowded sidewalk, this item sits
```

Test 3: Third generation

```
Prompt: "Underwater photography. This item sits on a coral reef with
```
Resource Monitoring
During active generation:
- GPU Utilization: 95-100%
- VRAM Usage: 10.2GB / 12GB (85%)
- System RAM: 52GB / 64GB (model pinning)
- CPU Usage: 15-20% (mainly data preprocessing)
- Power Draw: 170W (RTX 3060 TDP)
Storage Impact
HDD vs SSD comparison (estimated):
- HDD: 2-3 minutes initial load from disk
- SSD: 30-45 seconds initial load (roughly 4x faster)
- During generation: No difference (models in RAM)
Recommendation: SSD for faster startup, but not required for generation performance.
Lessons Learned: What Works and What Doesn’t
After extensive testing and iteration, here are the key insights organized by category. These lessons can save you hours of troubleshooting if you’re implementing something similar.
Memory Management Insights
✅ Profile 3 is the Sweet Spot for 12GB Cards
Tested all five mmgp profiles extensively. Profile 3 provides the perfect balance:
- Stable VRAM usage at 85% capacity (10.2GB / 12GB)
- Fast inference times (10s per image)
- No OOM errors or crashes across 100+ test generations
Profiles 1-2 required more VRAM than available, while Profiles 4-5 were unnecessarily slow.
✅ RAM Pinning Eliminates the Disk Bottleneck
The 75% RAM allocation strategy (48GB pinned) was crucial:
- First load: 2-3 minutes from HDD to RAM (one-time cost)
- Subsequent loads: <5 seconds from pinned RAM to VRAM
- Models persist across generations with zero disk I/O
Without pinning, every generation would require disk access - absolutely impractical.
⚠️ WSL2 Memory Limits Are Deceptive
Initial attempts with default WSL2 settings failed. The issue:
- Host system: 64GB RAM available
- WSL2 container: Only sees ~31GB (50% default limit)
- mmgp profile calculation: Incorrectly assumes full RAM available
Solution: Explicitly configure .wslconfig (the memory= setting under the [wsl2] section) to allocate more memory to WSL2, or force mmgp to use the perc_reserved_mem_max=0.75 parameter.
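Before blaming mmgp, it is worth confirming how much RAM the container actually sees. One psutil call is enough (assuming psutil is installed in the image):

```python
import psutil

visible_gb = psutil.virtual_memory().total / 1024**3
print(f"RAM visible inside WSL2/container: {visible_gb:.1f} GB")
# With default WSL2 settings this reports ~31 GB despite 64 GB on the host.
```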
❌ Auto-Offloading Strategies Don’t Work Well
Tried mmgp’s offloadAfterEveryCall feature - it caused frequent crashes:
- Unpredictable VRAM usage patterns
- Race conditions between loading/offloading
- No performance benefit over persistent loading
Lesson: For sequential generation workloads, keep models loaded.
Storage and I/O Optimization
📊 HDD vs SSD Impact Analysis
| Phase | HDD | SSD | Impact |
|---|---|---|---|
| Initial model download | 5-10 min | 5-10 min | Network-bound |
| First load (disk → RAM) | 2-3 min | 30-45 sec | 4x faster |
| RAM → VRAM transfer | 200ms/GB | 200ms/GB | RAM speed |
| During generation | 0 disk I/O | 0 disk I/O | No difference |
Key Insight: SSD only matters for startup time. Once models are in RAM, storage speed is irrelevant. If you’re doing many generations in one session, HDD is perfectly acceptable.
💾 Volume Mount Strategy Was Critical
Storing models on the host filesystem (-v D:\_Models:/app/models) provided:
- Persistence across container rebuilds
- Ability to share models between different containers
- Easy backup and version management
- No re-downloading during development iterations
Without this, every code change would require re-downloading 35GB of models.
Configuration and Deployment
✅ Gradio Provided Zero-Effort UI
Using Gradio for the web interface was brilliant:
- 20 lines of Python for complete web UI
- Automatic file upload handling
- Built-in image preview and download
- No frontend development required
Alternative approaches (Flask, React frontend) would have taken days vs hours.
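To give a sense of scale, the whole UI boils down to something like the sketch below. The generate_image function and its arguments are illustrative placeholders for the real mmgp-wrapped pipeline call that lives in the repository:

```python
import gradio as gr

def generate_image(reference, prompt, steps=8):
    """Placeholder: in the real app this calls the mmgp-wrapped FLUX + OminiControl pipeline."""
    return run_pipeline(reference, prompt, num_inference_steps=steps)  # hypothetical helper

demo = gr.Interface(
    fn=generate_image,
    inputs=[
        gr.Image(type="pil", label="Reference object"),
        gr.Textbox(label="Prompt"),
        gr.Slider(1, 16, value=8, step=1, label="Inference steps"),
    ],
    outputs=gr.Image(label="Generated image"),
    title="FLUX.1 OmniControl",
)
# 0.0.0.0 so the interface is reachable from outside the container on port 7860
demo.launch(server_name="0.0.0.0", server_port=7860)
```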
✅ Docker Isolated the Complexity
Containerization proved invaluable:
- No conflicts with host Python environment
- Reproducible across machines (tested on 3 different PCs)
- Easy version control of entire stack
- Simple deployment (docker run and done)
❌ Profile 1 Caused Out-of-Memory Errors
Attempted to use Profile 1 (full model in VRAM) for maximum speed:
- Required 16GB+ VRAM
- RTX 3060’s 12GB couldn’t handle it
- Resulted in CUDA OOM errors mid-generation
Lesson: Always profile your actual available memory, not theoretical specs.
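A concrete way to do that check at startup, rather than trusting the spec sheet (torch.cuda.mem_get_info is a standard PyTorch call returning bytes):

```python
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Free VRAM: {free_bytes / 1024**3:.1f} GB of {total_bytes / 1024**3:.1f} GB total")
# The free figure is what matters: other processes on the host may already hold VRAM.
```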
Quality and Performance Trade-offs
✅ 8-Bit Quantization Had Minimal Quality Impact
Side-by-side comparison of T5 encoder outputs:
- FP16 (original): Baseline quality
- INT8 (quantized): <5% subjective quality difference
- Memory savings: 8.8GB → 4.4GB (50% reduction)
Conclusion: For text encoding tasks, 8-bit quantization is essentially free VRAM.
📈 Generation Speed Met Targets
| Goal | Result | Status |
|---|---|---|
| First gen < 2 min | 110s | ✅ Achieved |
| Subsequent gen < 15s | 10s | ✅ Exceeded |
| VRAM stable | 10-12GB consistent | ✅ Achieved |
| Quality acceptable | Excellent outputs | ✅ Achieved |
The 10-second generation time makes this practical for real creative work.
Production Considerations
For Personal Use
This setup works great for:
- Hobbyist AI experimentation
- Content creation (social media, art projects)
- Proof-of-concept development
- Learning FLUX.1 architecture
For Commercial Use
Consider these factors:
- Generation time: 10s/image × 1000 images = 2.8 hours
- Scalability: Single GPU, no batch processing
- Reliability: Consumer GPU thermal throttling under sustained load
- Support: mmgp is community-maintained, not enterprise-supported
For production workloads, consider:
- Cloud GPUs (Azure N-Series VMs or Azure Container Apps with GPU nodes) - minimum A40/A100
- Local GPU upgrade to RTX 4090 or A6000
- Batch processing optimizations
- Multiple parallel containers
Docker Hub & Repository
The complete implementation is available:
- GitHub: docker-ai-models/omnicontrol
- README: Full setup instructions and troubleshooting
- Dockerfile: Production-ready container definition
- Source code: Custom FLUX integration with mmgp
Quick Start
```bash
# Clone repository
git clone https://github.com/Ricky-G/docker-ai-models.git
cd docker-ai-models/omnicontrol

# Build and run: see the repository README for the exact docker build/run
# commands, model volume path, and HuggingFace token setup
```
Conclusion
Running FLUX.1 OmniControl on a 12GB RTX 3060 is not only possible but practical. Through careful memory management with mmgp, strategic quantization, and container optimization, we achieved:
- ✅ 10-second generation times (after initial load)
- ✅ Stable VRAM usage across multiple generations
- ✅ No quality degradation from quantization
- ✅ Reproducible Docker deployment
The key insight: Memory management is more important than raw VRAM capacity. With the right tools and configuration, consumer GPUs can run models designed for datacenter hardware.
If you have an RTX 3060 (or similar 12GB card) collecting dust because you thought it couldn’t handle modern AI models, give this approach a try. The democratization of AI isn’t just about open-source models - it’s about making them runnable on hardware people actually own.
Hardware tested: RTX 3060 12GB, 64GB RAM, Windows 11 + WSL2
Software stack: Docker Desktop, NVIDIA Container Toolkit, mmgp 3.6.9
Model: FLUX.1-schnell + OminiControl LoRA
Performance: 10s per 512x512 image (8 steps)
References:
- Memory optimization via mmgp (Memory Management for GPU Poor)
- FLUX.1-schnell Model
- OminiControl LoRA
- Docker Implementation Repository
Image Credits:
- Main image generated by GPT-Image-1.5