Building an AI (Home) Lab


Based on a Case Study of KI-Kompetenzzentrum Medien (KI.M)

On-Premise AI Infrastructure

Lightning Talk

Aron Homberg
November 2025

The Journey

  • Mission: Launch an AI prototyping lab for BLM + KI.M
  • Timeline: Few months from idea to production
  • Approach: 100% on-prem deployment
  • Goal: Fast AI prototyping for media
  • Scope: GPU Hardware, Inference Engines, Containerization, Prototype Development, Model Evaluation
Achievement: One of Bavaria's first fully on-prem AI labs, now live with open-weight models
KI.M Reallabor Team

KI.M's On-Premise AI Architecture

  • Hardware: H200 GPU
  • Containers: NVIDIA Container Toolkit
  • Inference Engine: vLLM / SGLang / Ollama
  • Models: Open weight, up to 120B
  • Prototypes: Custom software

Infrastructure Layer

  • GPU Server + Support Server (Reverse Proxy, Monitoring, etc.)
  • Container orchestration / Load Balancing
  • Wireguard + Internet Network Access
  • Storage systems / Backup System
Server Hardware

Containerized Services & Prototypes

  • Inference engines with specific model configs (vLLM, SGLang, Ollama)
  • LangFuse, MongoDB, PostgreSQL, Minio, Redis, ClickHouse, etc.
  • Models in inference:
    • gpt-oss-120b
    • Qwen3-Embedding-8B
    • Qwen3-Omni-30B-A3B-Instruct
    • ...

What Hardware Do I Work with?

KI.M Reallabor

  • Hardware: 1x NVIDIA H200 NVL (141GB), 1x DGX Spark
  • Software: vLLM, SGLang, Ollama, HF Transformers; Pytorch + Unsloth; ...
Why the NVIDIA DGX Spark? 🫩

  • Prototypes using sparse/small models in limited user demo sessions (1-2 concurrent users)
  • Long-running and slow batch/training jobs
  • Testing large models at slow speed (verification)
  • Fine-tuning large models (with highly optimized Unsloth kernels)

Methodologies

  • Evaluate the best open-weight models suitable for SMEs (≤ 50k € investment)
  • To build impactful prototypes, use the smallest models that do the job, orchestrated together to achieve larger goals
  • Optimize for peak performance within the usable context window, not for maximum token usage
  • Create modular tools/services, each doing one job well, so you can later mix-and-match LLMs, embedding, reranking, TTS, STT, etc.
  • Multi-process GPU sharing: use a 1:1 model/task assignment, so models are loaded in parallel within VRAM limits (LLM, embedding, ASR, etc.)
  • Per-model optimization: Specialized inference config per inference engine/model (we are GPU-bound, not CPU/RAM-bound)
  • Create replicable prototypes, write great documentation, publish benchmarks with diverse hardware

You're GPU-Rich!

But How Do I Run AI at Home?

The Big Dream

Let's run state-of-the-art models like:
  • gpt-oss-120b
  • Qwen3-Next-80B-A3B-Instruct
  • Qwen3-Embedding-8B
locally for ultimate privacy, total control, custom finetuning, prototyping, learning, and decent performance for a single user use-case.
Goal: No external APIs. Mix-and-match LLMs, TTS, STT, ...

The Reality

  • Reality Check: Inference of large models requires roughly >80 GB of VRAM without offloading to RAM.
  • Usually, that's a >10k € investment - if we go for high-end GPUs that are optimized for memory bandwidth.
  • But... if we don't need to scale (single-user use case), and if we use sparse/MoE models, we can get away with cheaper unified-memory APU/NPU systems from AMD, Apple, or even NVIDIA.
Guess: How low can you go in budget? 10k €? 5k €? 2k €?
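As a rough sanity check before buying, you can estimate the memory footprint yourself. A minimal back-of-envelope sketch (the quantization factors, KV-cache, and overhead numbers are illustrative assumptions, not vendor specs):

```python
# Rough VRAM estimate for LLM inference: weights + KV cache + runtime overhead.
def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float = 8.0, overhead_gb: float = 2.0) -> float:
    weights_gb = total_params_b * bytes_per_param   # e.g. 4-bit ≈ 0.5 bytes/param
    return weights_gb + kv_cache_gb + overhead_gb

# gpt-oss-120b (~117B params) at ~4-bit (MXFP4): ≈ 68 GB -> fits into 96 GB unified memory
print(estimate_vram_gb(117, 0.5))
# Llama 3.3 70B at 8-bit: ≈ 80 GB -> already at the limit of a single consumer box
print(estimate_vram_gb(70, 1.0))
```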

Option 1: AMD Strix Halo (APU)

The Cheap High-VRAM, High-Risk Option

  • Hardware: Mini PC with AMD Ryzen™ AI Max+ PRO 395 / Radeon 8060S (gfx1151)
  • Memory: 128GB LPDDR5x Unified (not really) Memory (max. 96GB assignable as VRAM)
  • NPU: 50 TOPS XDNA 2 NPU (for ONNX / FastFlowLM)

Where to Buy?

  • Option A (The Risk): ~€1,600
    Brand: Bosgame M5 AI
    Risk: Budget brands (like Beelink, GMKtec) have a history of hardware flaws and poor support. Bosgame has a better reputation, but the risk remains. Systems notoriously overheat!
  • Option B (The "Safe" Bet): ~€2,500
    Brand: Framework Desktop (Strix Halo Base)
    Benefit: Superior quality, cooling, modularity, reliability, and support. No risk of overheating! A decent option for a modern workstation.
Bosgame M5
Framework Desktop (Strix Halo Base)

Strix Halo: Performance & Reality

🚀 Performance

  • Benchmark Sparse MoE Model, MXFP4 (gpt-oss-120b): ~30 tokens/sec
  • Benchmark Q4K-XL (GLM-4.5-Air-UD-Q4K-XL-GGUF): ~15-20 tokens/sec
  • Benchmark Dense Model (Llama 3.3 70B Q8): ~2.5 tokens/sec
  • Memory Bandwidth: ~215 GB/s (real-world)
  • TOPS: 79 GPU TOPS @ INT8 (304 TOPS @ INT4 sparse), plus 50 TOPS XDNA 2 NPU
  • Gaming/rendering performance comparable to an NVIDIA RTX 4070 Mobile

⚠️ The Catch: The Software (ROCm)

  • NO CUDA. You must use the ROCm + HIP ecosystem (effectively: Lemonade Server)
  • Vulkan is Key: The Vulkan backend is critical. HIP is ~40% less efficient
  • Tooling: Requires specific tools like lemonade-sdk, ONNX Runtime, and FastFlowLM
  • Clustering: Possible, but not the seamless RDMA experience of NVIDIA
Strix Halo Benchmark Results
Verdict: Amazing performance-per-dollar for "prosumers" if you are willing to navigate the ROCm software stack (a challenge, but doable)

Option 2: NVIDIA DGX Spark

The "Vendor/Dealer Risk-Free" Developer Box for professionals

  • Hardware: NVIDIA GB10 Grace Blackwell Superchip (ARM CPU + GPU)
  • Memory: 128GB LPDDR5x TRUE Unified Memory
  • Software: Full CUDA Stack. DGX OS (Ubuntu), PyTorch, TensorRT, ...
  • Clustering: "Deluxe" clustering via ConnectX-7 (200 Gbps RDMA), up to 2 units directly interlinked
  • Price: from ~€3,500 (1 TB storage version)
This is a developer machine, NOT an inference machine. It's designed to develop code that scales 1:1 on a large DGX cluster.
NVIDIA DGX Spark

DGX Spark: Performance & Reality

🐢 Performance

  • Benchmark Sparse MoE Model (gpt-oss-120b): 30-50 tokens/sec
  • Benchmark Dense Model (Llama 3.3 70B Q8, Ollama): ~3-4 tokens/sec
  • Memory Bandwidth: 273 GB/s (real-world)
  • TOPS: 1000 (FP4, sparse)

⚠️ The Catch: The Hardware Bottleneck

  • The Problem: Only 273 GB/s memory bandwidth - LLM inference is memory-bound
    (6.5x slower than RTX Pro 6000)
  • Usage Profile: Excellent for Sparse/MoE models, but struggles with dense models
  • Software: 100% CUDA stack, seamless dev/prod compatibility
  • Clustering: "Deluxe" RDMA experience via NVIDIA ConnectX, back-to-back (2 units)
Verdict: Buy ONLY if your workloads fit sparse/MoE models, you require full CUDA compatibility for enterprise dev, or you need a box for long-running batch/training jobs!

Option 3: Apple Mac Studio (M4 Max or even M3 Ultra)

The "It Just Works" Option for whom can afford it

  • Hardware: Apple M4 Max (16c CPU, 40c GPU, 16c NPU)
  • Memory: 128GB TRUE Unified Memory (M3 Ultra up to 512GB)
  • Memory Bandwidth: 546 GB/s (M3 Ultra up to 819 GB/s)
  • TOPS: ? (36 TOPS for NPU)
  • Software: macOS with Metal (GPU) and the MLX framework, plus the MPS PyTorch backend - just use LM Studio if you're a beginner
  • Price: ~€4,200
The NPU (Neural Engine) is currently under-utilized for most inference workloads; almost everything runs on the GPU cores (Metal / MPS). There are some exceptions, though - check out möbius and their amazing work on ANE/NPU inference. It still uses unified memory, but it lets you parallelize the inference workload across the ANE/NPU and the GPU cores.
M4 Max 128GB

Mac Studio: Performance & Reality

🛠️ The Software Ecosystem

  • Inference Engines:
    • MLX / mlx-lm: Apple's native, optimized framework
    • llama.cpp (Metal Backend): The community standard, runs everything
    • LM Studio: Easy-to-use GUI (uses llama.cpp / MLX)
  • Model Support: Excellent
    • mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit runs well
    • Optimized gpt-oss-120b (Unsloth GGUF, also MLX) models are available

📊 Benchmarks

  • gpt-oss-120b: up to 30 - 40 t/s (Unsloth or MLX) - VCoder variant
  • Qwen3-Next-80B-A3B: up to 60 t/s (Q8 MLX)
  • Llama-3.3-70B-Q8: up to ~4-5 t/s
Verdict: The most stable, efficient, and silent option. A fantastic "prosumer" choice if you don't need CUDA (you can MLX-convert your models and use an MLX-based inference engine).
M4 Max Benchmark (LM Studio, t/s generated)

For Startups: Leveling Up!

Why would a Startup be interested in on-premise hardware? (for training & inference)

1. 💰 Cost

  • Breakeven: For high-volume, 24/7 workloads, on-prem hardware often pays for itself in < 12 months
  • Cloud (Token APIs): Easy to start, but scales expensively. You pay for every single token, forever
  • On-Premise: High initial CAPEX (cost to buy). Buy only if your workload is known and sustained

2. 🔒 Data Privacy & Control

  • On-Premise: Data never leaves your machines
  • Critical for: GDPR, sensitive IP, client data, legal documents
  • Legal benefits: No need for complex Data Processing Agreements (AVV) with third-party API providers

3. 🚀 Performance & Latency

  • On-Premise: You get dedicated, predictable, low-latency performance, but you WON'T be able to scale horizontally without engineering effort and investment. But DO YOU realistically scale that fast?
  • Cloud API (Token based): Subject to "noisy neighbor" problems (shared resource), and unpredictable latency/issues
  • Dedicated Rental Cloud GPUs: No "noisy neighbors", but often not economical > 12 months

4. 🎨 Customization

  • Run any model you want (not just what the API offers), how you want
  • Use any inference engine (vLLM, sglang, llama.cpp)
  • Fine-tune and deploy custom models instantly
Critical: While a token-based API has a near-zero talent overhead, an on-premise stack (even when colocated) is complex. It requires significant MLOps and infrastructure engineering talent. What if I don't have access to a datacenter? Where do I actually operate my hardware?
If you lack your own datacenter infrastructure, you can host/colocate your GPU server hardware in a datacenter in Germany, e.g. via AIME.

Open Weight vs. Open Source

Startups often seek their own IP. But how do you build it? You either learn and invent or you adapt. But open models are open, right?

Open Source (e.g., AllenAI Olmo3)

  • You get the weights of the model
  • You get the documentation/paper
  • You get the dataset used to train it
  • You get the training code, eval code etc.
  • You can replicate the entire process on-premise (if you could afford it)
For building your own IP and learning how it really works, "Open Source" is all you need.

Open Weight (e.g., Qwen3)

  • You get the model weights (the final product)
  • You might get some source code
  • You do not get the training data or the full "recipe"
  • You can use and finetune the model, but you cannot replicate it from scratch
  • You need to check the license (some contain restrictions for commercial use)
For on-premise inference, "Open Weight" is all you need (if the license allows you to).

Do You Really Need Fine-Tuning?

As a Startup or Student, I might want to finetune a model to build my IP (if the license allows me to). But is it a good idea?

Let's have the model fit our data!

  • Problem: Your model doesn't know your specific data or follow your specific format
  • Old Solution: Fine-tune the model on thousands of examples
Current Solution: Fix the prompts and the context first.

Fix Before Fine-Tuning:

  • Prompt Engineering / Optimizing (Cheap & Fast)
  • RAG (Retrieval-Augmented Generation) / Long Context (Medium)
  • Fine-Tuning (Complicated & Risky)
Fine-tuning larger models is complicated and requires a well-curated dataset and expertise.

Skipping Finetuning 1: Auto-Optimizing Prompts

Let the model write its own prompts.

  • Tool: DSPy (dspy.ai)
  • Paper: Link
  • Concept: You don't write prompts, you write programs
  • You define the steps (e.g., "Think", "Retrieve", "Answer") and the goal
  • DSPy's "Optimizer" (like GEPA) will automatically test hundreds of prompts and few-shot examples to find the best possible prompt to achieve your goal
  • ~+10% gains on AIME 2025 with GPT-4.1 Mini
GEPA auto-prompting illustration
Prompt Engineering transformed into a (brute-force) optimization problem.
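A minimal DSPy sketch of this idea, assuming a local OpenAI-compatible endpoint (the URL, API key, and model name are placeholders, and optimizer APIs vary between DSPy releases):

```python
import dspy

# Point DSPy at a local OpenAI-compatible server (e.g. vLLM); URL/model are placeholders.
lm = dspy.LM("openai/gpt-oss-120b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# You write the program (steps + goal), not the prompt itself.
program = dspy.ChainOfThought(QA)
print(program(question="What limits LLM decode speed on consumer hardware?").answer)

# An optimizer (e.g. GEPA or MIPROv2) can then "compile" this program against a
# metric and a small training set, searching for better prompts/few-shot examples.
```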

Skipping Finetuning 2: Agentic Context Engine (ACE)

  • Tool: ACE (Agentic Context Engine) (Link)
  • Paper: Link
  • Concept: A new RAG technique that creates a "virtual file system" or "context tree" for the AI
  • Instead of just "dumping" documents into the context, ACE creates a structured hierarchy
  • The model can then "navigate" this context (e.g., cd /project_docs, ls, cat /summary.txt)
  • +10.6% on agents and +8.6% on finance
Agentic Context Engine illustration
Drastically improves RAG performance on complex, multi-document tasks without any fine-tuning.

If You Really Must Fine-Tune...

  • Technique: QLoRA (Quantized Low-Rank Adaptation)
    • Freezes the main model (in 4-bit)
    • Trains only a tiny set of "adapter" weights on top of the quantized base (QLoRA)
    • Result: You can fine-tune gpt-oss-120b on a single GPU with 66GB VRAM Link
  • Tool: Unsloth (unsloth.ai)
    • A drop-in replacement for Hugging Face
    • ~1.5x faster training and 70% less VRAM, 10x longer context lengths compared to standard QLoRA
    • Uses highly optimized Triton kernels
    • QLoRA requirements: gpt-oss-20b = 14GB VRAM • gpt-oss-120b = 65GB VRAM.
      BF16 LoRA requirements: gpt-oss-20b = 44GB VRAM • gpt-oss-120b = 210GB VRAM.
If you want cheap inference, you should always choose a sparse/MoE model as your base for fine-tuning/QLoRA.
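For orientation, a minimal Unsloth QLoRA setup might look like the sketch below (checkpoint name, rank, and target modules are illustrative defaults, not a tuned recipe):

```python
from unsloth import FastLanguageModel

# Load the base model frozen in 4-bit (QLoRA); the checkpoint name is an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach small trainable LoRA adapters; only these weights get updated.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Training then typically runs via TRL's SFTTrainer on your curated dataset.
```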

Which Model Should I Fine-Tune?

  • ⚠️ Dense Models (e.g., Llama 3.3 70B):
    • All 70 billion parameters are activated for every single token
    • Requires massive memory bandwidth
  • ✅ Sparse (MoE) Models (e.g., gpt-oss-120b, qwen3-next etc.):
    • Total Parameters: 120B (looks huge!)
    • Active Parameters: Only ~5.1B are activated per token (out of 117B total)
    • Result: Runs as fast as a 30B model, but with the knowledge of a 120B model
    • Training: Much cheaper than for dense models
Stop thinking about parameter count and benchmarks. Start thinking about architecture.
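The architecture argument can be made concrete with a back-of-envelope decode-speed estimate (illustrative assumption: decode is memory-bound, so each generated token must stream the active weights once; real engines add KV-cache traffic and overhead, so expect lower real-world numbers):

```python
# tokens/sec ceiling ≈ memory bandwidth / bytes of ACTIVE parameters per token.
def est_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                       bytes_per_param: float) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# At DGX Spark-class bandwidth (~273 GB/s):
print(est_tokens_per_sec(273, 5.1, 0.5))  # MoE, ~5.1B active @ 4-bit -> ~107 t/s ceiling
print(est_tokens_per_sec(273, 70, 1.0))   # dense 70B @ 8-bit -> ~4 t/s ceiling
```

This is why gpt-oss-120b reaches usable speeds on unified-memory boxes while dense 70B models crawl.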

Training non-LLM models

For startups, there are endless business opportunities in building small, task-specific AI models that perform much better on specific tasks when trained on exclusively accessible datasets, with an efficient AI model architecture. If you have this opportunity, you might train your own model from scratch (e.g., computer vision, recommendation engines, scientific AI) - by the way, that's what we do at NeuraMancer.ai:

  • Just use PyTorch to implement it. It's the industry and research standard
  • The entire ecosystem (data loaders, optimizers, distributed training) is built around it
  • Use Triton kernels and/or torch.compile() for unparalleled speed!
  • Train in FP8 (Hopper) or even FP4 natively (Blackwell)
  • Training is a different workload than LLM inference. It is compute-bound, so raw TFLOPS (Tensor Cores) matter more -> rent a training cluster if you lack the capital
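To make "just use PyTorch" concrete: a minimal compiled training step might look like this sketch (the toy model, data, and hyperparameters are placeholders; native FP8/FP4 training additionally needs libraries such as Transformer Engine):

```python
import torch
import torch.nn as nn

# Toy model standing in for your custom (non-LLM) architecture.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
model = torch.compile(model)  # kernel fusion / graph compilation for faster steps

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.bfloat16):  # mixed-precision forward/backward
        loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(64, 128, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")
print(train_step(x, y))
```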

Token-Based APIs (EU)

You want a "serverless" experience, paid per token, but need EU data privacy:

Pro:
  • Simple to integrate (OpenAI-compatible API endpoints, LangChain, LangGraph)
  • No upfront cost
  • EU providers are 40-60% cheaper than US hyperscalers (AWS, GCP, Azure)
Con:
  • Not an option if you train your own models
  • Not an option if you fine-tune models
  • STACKIT: extremely limited model selection
  • Telekom (OTC): absurdly expensive
The Catch: You must pin the Data Processing Agreement (AVV/DPA) to be EU AI Act compliant.
Providers (Pros / Cons):
  • Scaleway (FR/NL/PL): Pros: sub-200ms TTFT, good price, 1M free tokens
  • OVHcloud (FR/DE/PL): Pros: cheapest provider
  • Regolo.ai (IT): Pros: zero-retention policy (max privacy), 100% renewable energy
  • Nebius (EU/Global): Pros: best model selection, 99.9% SLA, zero data retention; Cons: data centers are global, must pin to EU via contract

Our Startup Needs GPU Hardware!

Typical Use Case: A small team needs to train custom models, fine-tune, or serve sparse MoE models at relatively small scale, on-premise in the EU. If the model is not sparse, it must be a dense Small Language Model (SLM).

Small MVP Setup

  • Build: 1-2x NVIDIA RTX Pro 6000 Blackwell (96-192 GB total VRAM)
  • Price: ~€15,000 - €30,000 all-in
  • Why: Massive VRAM and bandwidth handles almost any workload. ECC memory for reliability

Why not RTX 5090 32GB? It can be useful for lean inference machines, but only for SLMs/sparse models.

RTX 5090 vs RTX 6000 Pro - Benchmark Link

Other NVIDIA Hardware Options

Best option in 2025 / early 2026 for Startups:

  • NVIDIA RTX Pro 6000 Blackwell (96GB).
  • Smaller cards (RTX Pro 4000, 4500, 5000, etc.) or older generations (Ada, etc.) are cheaper and viable for smaller models (SLMs) when scaling inference, but mind the acceleration features.
  • Blackwell supports FP4 acceleration, while older generations do not.

1x NVIDIA RTX 6000 Pro - Offerings

primeLine 1x RTX 6000 Pro AIME 1x RTX 6000 Pro

2x NVIDIA RTX 6000 Pro - Offerings

primeLine 2x RTX 6000 Pro AIME 2x RTX 6000 Pro

Trusted Vendors/Dealers: Where to Buy?

Validate the NVIDIA dealer status: NVIDIA Partner Network

My personal favs (from customer experience):

  • AIME (Germany, Berlin, Berlin)
    • Exceptional software stack and service
    • Deep expertise in AI workflows
    • Offer cloud rental and hardware sales
  • Amber (Germany, Oberhaching/Munich, Bavaria)
    • Highly competent and reliable
    • Often more price-competitive in direct negotiation
    • Focus purely on hardware solutions
  • primeLine Solutions (Germany, Bad Oeynhausen, NRW)
    • Most transparent communication regarding in-stock status and price changes
    • Best online shop product offerings
AIME/Amber: You can't go wrong with either. I'd choose AIME for a premium, software-inclusive experience, and Amber for excellent, reliable hardware at a slightly better price point. When neither offers a product I'm looking for, primeLine Solutions usually has it at a relatively competitive price.

Containerization for Dummies

Now our startup has the hardware, but how do we run the model? You either ship your own inference server (your own PyTorch inference code), or you use one of the popular, highly optimized inference servers.

How do I run this in production?

  • Don't: Run python my_app.py or an inference server inside a screen session, etc.
  • Do: Use Containers
  • I publish such containers! Here's an example: qwen3-omni-vllm-docker
Key Tool: NVIDIA Container Toolkit. This allows your Docker containers to access the GPU with full hardware acceleration.

Why?

  • Isolation: Each AI model and its dependencies (CUDA, PyTorch, etc.) live in their own sealed box using a specific inference server (vLLM, sglang, ollama etc.)
  • Portability: The exact same container runs on your laptop, your on-prem server, or in the cloud
  • Scalability: Easy to manage, update, and deploy multiple copies - easy to evolve into larger deployments - such as Kubernetes or smaller ones using e.g. Kamal
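Once an inference server runs in such a container, your prototypes simply talk to it over its OpenAI-compatible HTTP API. A minimal sketch, assuming a vLLM container listening on localhost:8000 (endpoint, API key, and model name are placeholders):

```python
from openai import OpenAI

# Talk to the containerized inference server via its OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # must match the model the container actually serves
    messages=[{"role": "user", "content": "Summarize why containers help here."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```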

Many Inference Jobs On One GPU

You have one NVIDIA RTX 6000 Pro, but 5 different apps. How do they share it? Each app might use a different model, hosted and optimized by a different containerized inference server (sglang vs vLLM).

Option 1: Shared Process Model (via Container)

  • How it works: One containerized inference server (like vLLM) loads the model into VRAM once, then handles all incoming requests and batches them together. You can also start as many containers (replicas) as your VRAM allows if you need more containerized inference engines or special configurations
  • Pro: Extremely efficient VRAM use. Containerization overhead < 0.9%
  • Con: If your containers consume more VRAM than available, an Out-of-Memory (OOM) error becomes an incident. You must plan well (reserved VRAM fractions, managed replica counts) - see the sketch below
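A minimal sketch of the VRAM-reservation idea using vLLM's gpu_memory_utilization setting (fractions and model names are illustrative; in production each engine would run in its own container with its own fraction):

```python
from vllm import LLM, SamplingParams

# Each engine reserves a fixed fraction of the GPU's VRAM so several models
# can coexist on one card without fighting over memory.
chat = LLM(model="openai/gpt-oss-20b", gpu_memory_utilization=0.60)  # ~60% of VRAM
draft = LLM(model="Qwen/Qwen3-0.6B", gpu_memory_utilization=0.15)    # ~15% of VRAM

out = chat.generate(["Say hello."], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```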

Option 2: MIG (Multi-Instance GPU)

  • How it works: Hardware-level partitioning. You physically slice the GPU's VRAM into (e.g.) 2 smaller, fully isolated GPUs. Each slice gets its own guaranteed VRAM and compute
  • Pro: Total, secure isolation. Feels like multiple GPUs
  • Con: Inefficient. The VRAM split is fixed; if one instance is idle, its VRAM is wasted, and you can't run inference for models larger than a single slice
  • Best for: Securely serving completely different clients/models on one chip. Usually used by cloud GPU compute providers

Scaling faster than you can buy

A nice problem to have! But what do you do? In this case, you might want to rent GPU hardware for < 12 months.

Here are some of my preferred options in Germany:

  • AIME GPU Cloud (Germany):
    • Link
    • Rent massive multi-GPU nodes (H100, etc.) by the hour/day
    • Great for burstable training/fine-tuning
  • IP-Projects (Germany):
    • Link
    • Rent single-GPU inference servers monthly
    • NVIDIA® RTX™ 4000 SFF Ada, NVIDIA® RTX™ 5000 Ada, NVIDIA® RTX™ 6000 Ada
    • from 262€ / month up to 737€ / month
  • Hetzner (Germany):
    • Link / GEX44
    • Intel® Core™ i5-13500 (13th gen, 6 performance + 8 efficiency cores), 64 GB DDR4, 3.84 TB NVMe SSD storage, 1x 1 Gbit/s backbone connection
    • NVIDIA® RTX™ 4000 SFF Ada, NVIDIA® RTX™ 6000 Ada
    • from 184€ (+setup fee) to 967€ / month; Good value, but often older and more limited hardware
Spot-market GPU resources: Often unreliable, often hosted outside of the EU, huge risk regarding data privacy

Smarter Scaling And Cost Savings By Optimization

Let's make inference FASTER and CHEAPER:

  • Use Quantized Models:
    • Use 4-bit, 8-bit, or FP8 models, or dynamic quantization (see Unsloth), where some layers are quantized more aggressively than others
    • Tools like Unsloth provide dynamically quantized models that are extremely fast
  • Use a Modern Inference Engine:
    • vLLM: Best for high-throughput (many users)
    • SGLang: Best for low-latency (TTFT) and highly optimized setups
    • llama.cpp/Ollama: Best for CPU and non-NVIDIA (Mac, AMD)
    • möbius: On edge devices, facilitating ANE/NPU inference (not yet mainstream)
  • Optimize Tokens/Second:
    • Use Speculative Decoding: a small "draft" model (e.g., Qwen3-0.6B) generates 5-10 tokens, and the large model (e.g., gpt-oss-120b) validates them all in one step (see the sketch after this list)
  • Get More Context:
    • Use RoPE Scaling (Linear, NTK, YaRN) to extend a model's context window (e.g., from 8k to 32k)
  • Optimize Time-To-First-Token (TTFT):
    • Use LMCache to cache the prompt processing (prefill). Can be up to 7.7x faster
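To illustrate the speculative decoding idea, here is a conceptual sketch (not any specific engine's API; draft_next and target_next are hypothetical single-token helpers, and real engines verify all draft tokens in one batched forward pass):

```python
# Conceptual sketch of speculative decoding with greedy acceptance.
def speculative_step(draft_next, target_next, prefix, k=5):
    # 1) The small draft model cheaply proposes k tokens.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)          # hypothetical helper: next token from the draft model
        drafted.append(t)
        ctx.append(t)
    # 2) The large target model checks the proposals and keeps the longest
    #    matching prefix (done token by token here; real engines batch this).
    accepted = []
    for t in drafted:
        expected = target_next(list(prefix) + accepted)  # hypothetical helper
        accepted.append(expected)
        if expected != t:
            break                    # first mismatch: keep the target's token and stop
    return accepted                  # several tokens per target "step" when drafts match
```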

EU AI Act: What To Do NOW (Nov 2025)

This is what your legal/compliance team needs to be doing.

✅ Immediately (Overdue!)

  • Unacceptable Risk: (BANNED) e.g., social scoring, real-time biometric surveillance
  • High-Risk: (STRICT) e.g., HR (hiring), critical infrastructure, medical devices
  • Limited Risk: (TRANSPARENCY) e.g., Chatbots, deepfakes (must be labeled)
  • Minimal Risk: (NO RULES) e.g., spam filters
  • Audit & Remove: Check all systems against Art. 5 (Banned Systems). Stop using them
  • Ensure AI Literacy: (Art. 4) You are already required to be training your staff
German Law (KI-VO): The draft is out. The Bundesnetzagentur (BNetzA) will be the central market surveillance authority. Regulations will be enforced from October 2026 at the earliest (expect delays and changes!).

⏰ Urgent (Deadline: August 2026 - 9 months left)

  • Full AI Inventory: Document every AI system you use or plan to use
  • Classify Risk: Is it High-Risk? (e.g., sorting resumes? YES.)
  • Identify Role: Are you a "Provider" or "Deployer"?
  • Implement Transparency: Label all AI-generated content and chatbots
  • Start Now (for High-Risk): Begin building your Risk Management System, Quality Management System (QMS), and Technical Documentation
  • Check Contracts: Audit all 3rd-party vendor contracts (incl. AVV/DPA)

Other Legal Hurdles

It's not just the AI Act.

🇪🇺 GDPR (Datenschutzgrundverordnung)

  • The AI Act does not replace GDPR
  • If you process personal data, you still need a legal basis, data processing agreements (AVV), and to conduct data protection impact assessments (DSFA)
  • On-Premise is your friend here.

🇪🇺 EU Product Liability Directive

  • If your AI system causes harm (e.g., gives bad medical or financial advice), you (the provider/deployer) can be held liable, especially if it is a Software-as-a-Service (SaaS) product
  • How to protect yourself: Meticulous documentation, rigorous testing, and clear user disclaimers

Certified (Free) Training & Resources

  • Looking for structured, certified learning on AI regulation or compliance?
  • As of now (Nov 2025), there is no regulation of AI training providers/certifications in Germany.
  • You should be safe having your employees get certifications from reputable AI training providers.
  • I recommend the free offering of KI-Campus (ki-campus.org). It offers a wide range of free, high-quality online courses and micro-certifications in German, covering AI, data ethics, the AI Act, and law.
  • Include such courses as part of onboarding and annual staff training for compliance (Art. 4).
  • Do you need further information? Contact me - I can share more information privately and refer you to expert lawyers.

I'm an Enterprise! I Need A CTO Pitch

"I am an SME, I have a budget. My CTO asked me for strategy advice. What do I propose?"

The Enterprise-Grade Proposal:

  • Hardware: Start with a certified NVIDIA DGX System (e.g., DGX H100/H200 or DGX Spark clusters)
    • Why: It's the industry standard. It comes with full enterprise support, guaranteed performance, and a seamless software stack (DGX OS). It's the "no-one-gets-fired-for-buying-IBM" choice

 

  • Software: Deploy on Kubernetes (K8s)
    • Why: It's the standard for scalable, resilient container orchestration
  • Conformance: Adhere to the new CNCF Certified Kubernetes AI Conformance program (k8s-ai-conformance)
    • Why: This ensures your K8s cluster is specifically configured for AI workloads (e.g., GPU scheduling, networking), making it standardized and future-proof

My Current Top Model Picks

Of course, you can always have a look at leaderboards like LMArena, MTEB & co., but they lag behind new model releases. Here are my current personal picks for different tasks:

LLM (Chat & Reasoning)

  • OpenAI/gpt-oss-120b: Best overall open-weight model for reasoning (effort: high) - especially for typical chatbot cases
  • Qwen/qwen3-next-80b-A3B-instruct: Incredible performance, task-dependent better than gpt-oss-120b

Embedding (RAG & Semantic Search)

  • Qwen/qwen3-embedding-8b: Best overall performance in real-world use cases
  • ibm-granite/granite-embedding-models - Various well-performing embedding and reranking models of different sizes

OCR (Text Recognition)

  • datalab-to/chandra: SOTA for complex documents
  • (Previously: AllenAI olmOCR 2)

ASR (Speech-to-Text)

  • Realtime: nvidia/parakeet-tdt-0.6b-v3 (faster) or nvidia/canary-1b-v2 (higher precision)
  • Best Whisper-like: nyrahealth/CrisperWhisper or unsloth/nyrahealth/CrisperWhisper - for MLX-based inference, use my model kyr0/crisperwhisper-unsloth-mlx-8b
  • qwen3-omni-30b-A3B-instruct: Very high multilingual precision even in noisy environments; Audio classification even for non-voice audio (music, fx etc.)
  • Potential runner-up: FAIR (Meta) omnilingual-asr

VLM (Vision / Multimodal)

  • Qwen/qwen3-omni: Excellent object detection and grounding for images and videos
  • Qwen/qwen3-vl: A smaller model that performs very well for its size
  • OpenBMB/MiniCPM-V-4_5: An even smaller model that performs very well for its size

🔮 2025 to 2026: Hybrid Architectures Become Standard

🚀 2025: The Shift From Experimental to Standard

  • Hybrid SSM + Attention Architectures: Interleaved SSM (State Space Model) + Attention blocks today, with more intra-layer mixing emerging
  • Why It Matters: ~3× faster throughput on long-context workloads vs. pure original Transformer baselines
  • Growing Industry Trend: AI21 (Jamba), Google (Griffin), Alibaba (Qwen3/Gated DeltaNet)

🎯 What We Might See in 2026

  • 🌟 Ultra-Sparse MoE: Tens to hundreds of experts, with only a small fraction active per token (e.g. Qwen3-Next)
  • 🌟 Context Windows: 256K–1M native-token contexts becoming standard, not just frontier-only
  • 🌟 Hardware Optimization: Specialized kernels and runtimes for hybrid primitives, dual-mode (NPU/APU/TPU + GPU) inference engines
  • 🌟 Dominantly O(n) Inference: More layers scale roughly linearly with context length, improving long-context throughput vs. quadratic attention
  • 🌟 Memory Efficiency: 500K to 1M-token contexts with more graceful scaling instead of sharp memory cliffs
  • 🌟 Reduced Performance Degradation: Longer contexts remain usable with slower quality drop-off than in traditional transformers

🛠️ Hardware-Model Co-Design Matures

  • 🌟 Specialized Silicon: Continued advances in APUs, NPUs, unified memory, native quantization, etc.
  • 🌟 Modular Inference: Deploy SSM on CPUs, attention on NPUs, MoE experts distributed
  • 🌟 On-Device Inference: 7B–13B models run efficiently on consumer hardware and high-end mobile devices with mixed linear/SSM–attention architectures

🛣️ Potential Outcomes

  • 🌟 Close-to-Linear Attention Models: Sub-quadratic scaling architectures compete directly with hybrids
  • 🌟 Adaptive Routing: Models dynamically choose computation type by activating layers based on the context
  • 🌟 On-Device Ramps Up: Efficient inference enables broad consumer deployment; SMEs increasingly adopt on-premise AI infrastructure in the EU
  • 🌟 EU AI Act Revision: A more realistic timeline and/or a simplified framework will be enforced

How Do I Stay Up-to-Date?

It's a full-time job. Here is my strategy:

1. Primary Sources:

  • LinkedIn: Follow VIP users and orgs
  • GitHub: Follow VIP users and orgs
  • HuggingFace: Follow VIP users and orgs
  • arXiv: Daily check for new papers (ML, CS.AI, CS.CL), linked via https://papers.cool
  • Social Media platforms: Follow VIP users and orgs

2. Aggregators:

  • Reddit: r/LocalLLaMA, r/vLLM etc. (for practical/hardware info), r/MachineLearning
  • Hacker News: For general tech/AI trends

3. Community (The most important):

  • Active participation in Discord servers (e.g., Unsloth AI, llama.cpp)
  • Direct discussion with peers, researchers, and other AI engineers
  • This is what I'm trying to fix!

minloss.club – Join the Waitlist

I'm introducing a new, ML-focused community for engaged AI/ML practitioners and researchers – engineering-first discussions on MLOps and model architecture, without hype.

minloss.club

  • What: A free, curated, but invitation-only (gated) community for ML/AI researchers, developers, and ML ops – focused entirely on research, engineering, sharing information, discussing ideas.
  • Why: To share practical knowledge, code, and strategies for training, and scaling real-world, production-grade AI systems from small on-premise systems to large, planet-scale deployments
  • Who: ML/AI researchers, developers, and MLOps engineers - no pure vibe coders, but professionals; people who prioritize rigorous engineering over buzzwords and self-promotion.
  • How: New papers, repos, ideas and models are surfaced early, deduplicated, and filtered for signal over noise, and summarized – so your time is spent on what actually improves outcomes -> min(loss).
  • Where: A private micro-community (Mattermost), complemented by a public blog and a mailing list.

Join the waitlist via WhatsApp

WhatsApp QR
Vision: To support and connect the people who actually design, build, and operate real world AI systems.

Contact & Resources

Thank you for your attention!

Download Slides

Connect with me on LinkedIn

Aron LinkedIn