Building an AI (Home) Lab


Based on a Case Study of KI-Kompetenzzentrum Medien (KI.M)

On-Premise AI Infrastructure

Lightning Talk

Aron Homberg
November 2025

The Journey

  • Mission: Launch an AI prototyping lab for BLM + KI.M
  • Timeline: Few months from idea to production
  • Approach: 100% on-prem deployment
  • Goal: Fast AI prototyping for media
  • Scope: GPU Hardware, Inference Engines, Containerization, Prototype Development, Model Evaluation
Achievement: One of Bavaria's first fully on-prem AI labs, now live with open-weight models
KI.M Reallabor Team

KI.M's On-Premise AI Architecture

  • Hardware: H200 GPU
  • Containers: NVIDIA Container Toolkit
  • Inference Engine: vLLM / SGLang / Ollama
  • Models: Open weight, up to 120B
  • Prototypes: Custom software

Infrastructure Layer

  • GPU Server + Support Server (Reverse Proxy, Monitoring, etc.)
  • Container orchestration / Load Balancing
  • Wireguard + Internet Network Access
  • Storage systems / Backup System
Server Hardware

Containerized Services & Prototypes

  • Inference engines with specific model configs (vLLM, SGLang, Ollama)
  • LangFuse, MongoDB, PostgreSQL, Minio, Redis, ClickHouse, etc.
  • Models in inference:
    • gpt-oss-120b
    • Qwen3-Embedding-8B
    • Qwen3-Omni-30B-A3B-Instruct
    • ...

What Hardware Do I Work with?

KI.M Reallabor

  • Hardware: 1x NVIDIA H200 NVL (141GB), 1x DGX Spark
  • Software: vLLM, SGLang, Ollama, HF Transformers; Pytorch + Unsloth; ...
Why the NVIDIA DGX Spark? 🫩

  • Prototypes using sparse/small models in limited user demo sessions (1-2 concurrent users)
  • Long-running and slow batch/training jobs
  • Testing large models at slow speed (verification)
  • Fine-tuning large models (with highly optimized Unsloth kernels)

Methodologies

  • Evaluate the best open-weight models suitable for SMEs (≤ 50k € investment)
  • To build impactful prototypes, use the smallest models that do the job, orchestrated together to achieve larger goals
  • Optimize for peak performance within the usable context window, not for maximum token usage
  • Create modular tools/services, each doing one job well, so you can later mix-and-match LLMs, embedding, reranking, TTS, STT, etc.
  • Multi-process GPU sharing: use a 1:1 model/task assignment, so models are loaded in parallel within VRAM limits (LLM, embedding, ASR, etc.)
  • Per-model optimization: Specialized inference config per inference engine/model (we are GPU-bound, not CPU/RAM-bound)
  • Create replicable prototypes, write great documentation, publish benchmarks with diverse hardware

You're GPU-Rich!

But How Do I Run AI at Home?

The Big Dream

Let's run state-of-the-art models like:
  • gpt-oss-120b
  • Qwen3-Next-80B-A3B-Instruct
  • Qwen3-Embedding-8B
locally for ultimate privacy, total control, custom finetuning, prototyping, learning, and decent performance for a single user use-case.
Goal: No external APIs. Mix-and-match LLMs, TTS, STT, ...

The Reality

  • Reality Check: Inference of large models requires roughly >80 GB of VRAM without offloading to RAM.
  • Usually, that's a >10k € investment - if we go for high-end GPUs that are optimized for memory bandwidth.
  • But... if we don't need to scale (single-user use case), and if we use sparse/MoE models, we can get away with cheaper unified-memory APU/NPU systems from AMD, Apple, or even NVIDIA.
Guess: How low can you go in budget? 10k €? 5k €? 2k €?
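As a rough sanity check before buying, you can estimate the memory footprint yourself. A minimal back-of-envelope sketch (the quantization factors, KV-cache, and overhead numbers are illustrative assumptions, not vendor specs):

```python
# Rough VRAM estimate for LLM inference: weights + KV cache + runtime overhead.
def estimate_vram_gb(total_params_b: float, bytes_per_param: float,
                     kv_cache_gb: float = 8.0, overhead_gb: float = 2.0) -> float:
    weights_gb = total_params_b * bytes_per_param   # e.g. 4-bit ≈ 0.5 bytes/param
    return weights_gb + kv_cache_gb + overhead_gb

# gpt-oss-120b (~117B params) at ~4-bit (MXFP4): ≈ 68 GB -> fits into 96 GB unified memory
print(estimate_vram_gb(117, 0.5))
# Llama 3.3 70B at 8-bit: ≈ 80 GB -> already at the limit of a single consumer box
print(estimate_vram_gb(70, 1.0))
```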

Option 1: AMD Strix Halo (APU)

The Cheap High-VRAM, High-Risk Option

  • Hardware: Mini PC with AMD Ryzen™ AI Max+ PRO 395 / Radeon 8060S (gfx1151)
  • Memory: 128GB LPDDR5x Unified (not really) Memory (max. 96GB assignable as VRAM)
  • NPU: 50 TOPS XDNA 2 NPU (for ONNX / FastFlowLM)

Where to Buy?

  • Option A (The Risk): ~€1,600
    Brand: Bosgame M5 AI
    Risk: Budget brands (like Beelink, GMKtec) have a history of hardware flaws and poor support. Bosgame has a better reputation, but the risk remains. Systems notoriously overheat!
  • Option B (The "Safe" Bet): ~€2,500
    Brand: Framework Desktop (Strix Halo Base)
    Benefit: Superior quality, cooling, modularity, reliability, and support. No risk of overheating! A decent option for a modern workstation.
Bosgame M5
Framework Desktop (Strix Halo Base)

Strix Halo: Performance & Reality

🚀 Performance

  • Benchmark Sparse MoE Model, MXFP4 (gpt-oss-120b): ~30 tokens/sec
  • Benchmark Q4K-XL (GLM-4.5-Air-UD-Q4K-XL-GGUF): ~15-20 tokens/sec
  • Benchmark Dense Model (Llama 3.3 70B Q8): ~2.5 tokens/sec
  • Memory Bandwidth: ~215 GB/s (real-world)
  • TOPS: 79 GPU TOPS @ INT8 (304 TOPS @ INT4 sparse), plus 50 TOPS XDNA 2 NPU
  • Gaming/rendering performance comparable to an NVIDIA RTX 4070 Mobile

⚠️ The Catch: The Software (ROCm)

  • NO CUDA. You must use the ROCm + HIP ecosystem (effectively: Lemonade Server)
  • Vulkan is Key: The Vulkan backend is critical. HIP is ~40% less efficient
  • Tooling: Requires specific tools like lemonade-sdk, ONNX Runtime, and FastFlowLM
  • Clustering: Possible, but not the seamless RDMA experience of NVIDIA
Strix Halo Benchmark Results
Verdict: Amazing performance-per-dollar for "prosumers" if you are willing to navigate the ROCm software stack (a challenge, but doable)

Option 2: NVIDIA DGX Spark

The "Vendor/Dealer Risk-Free" Developer Box for professionals

  • Hardware: NVIDIA GB10 Grace Blackwell Superchip (ARM CPU + GPU)
  • Memory: 128GB LPDDR5x TRUE Unified Memory
  • Software: Full CUDA Stack. DGX OS (Ubuntu), PyTorch, TensorRT, ...
  • Clustering: "Deluxe" clustering via ConnectX-7 (200 Gbps RDMA), up to 2 units directly interlinked
  • Price: from ~€3,500 (1 TB storage version)
This is a developer machine, NOT an inference machine. It's designed to develop code that scales 1:1 on a large DGX cluster.
NVIDIA DGX Spark

DGX Spark: Performance & Reality

🐢 Performance

  • Benchmark Sparse MoE Model (gpt-oss-120b): 30-50 tokens/sec
  • Benchmark Dense Model (Llama 3.3 70B Q8, Ollama): ~3-4 tokens/sec
  • Memory Bandwidth: 273 GB/s (real-world)
  • TOPS: 1000 (FP4, sparse)

⚠️ The Catch: The Hardware Bottleneck

  • The Problem: Only 273 GB/s memory bandwidth - LLM inference is memory-bound
    (6.5x slower than RTX Pro 6000)
  • Usage Profile: Excellent for Sparse/MoE models, but struggles with dense models
  • Software: 100% CUDA stack, seamless dev/prod compatibility
  • Clustering: "Deluxe" RDMA experience via NVIDIA ConnectX, back-to-back (2 units)
Verdict: Buy ONLY if your workloads fit sparse/MoE models, you require full CUDA compatibility for enterprise dev, or you need a box for long-running batch/training jobs!

Option 3: Apple Mac Studio (M4 Max or even M3 Ultra)

The "It Just Works" Option for whom can afford it

  • Hardware: Apple M4 Max (16c CPU, 40c GPU, 16c NPU)
  • Memory: 128GB TRUE Unified Memory (M3 Ultra up to 512GB)
  • Memory Bandwidth: 546 GB/s (M3 Ultra up to 819 GB/s)
  • TOPS: ? (36 TOPS for NPU)
  • Software: macOS with Metal (GPU) and the MLX framework, plus the MPS PyTorch backend - just use LM Studio if you're a beginner
  • Price: ~€4,200
The NPU (Neural Engine) is currently under-utilized for most inference workloads; almost everything runs on the GPU cores (Metal / MPS). There are some exceptions, though - check out möbius and their amazing work on ANE/NPU inference. It still uses unified memory, but it lets you parallelize the inference workload across the ANE/NPU and the GPU cores.
M4 Max 128GB

Mac Studio: Performance & Reality

🛠️ The Software Ecosystem

  • Inference Engines:
    • MLX / mlx-lm: Apple's native, optimized framework
    • llama.cpp (Metal Backend): The community standard, runs everything
    • LM Studio: Easy-to-use GUI (uses llama.cpp / MLX)
  • Model Support: Excellent
    • mlx-community/Qwen3-Next-80B-A3B-Instruct-8bit runs well
    • Optimized gpt-oss-120b (Unsloth GGUF, also MLX) models are available

📊 Benchmarks

  • gpt-oss-120b: up to 30 - 40 t/s (Unsloth or MLX) - VCoder variant
  • Qwen3-Next-80B-A3B: up to 60 t/s (Q8 MLX)
  • Llama-3.3-70B-Q8: up to ~4-5 t/s
Verdict: The most stable, efficient, and silent option. A fantastic "prosumer" choice if you don't need CUDA (you can MLX-convert your models and use an MLX-based inference engine).
M4 Max Benchmark (LM Studio, t/s generated)

For Startups: Leveling Up!

Why would a Startup be interested in on-premise hardware? (for training & inference)

1. 💰 Cost

  • Breakeven: For high-volume, 24/7 workloads, on-prem hardware often pays for itself in < 12 months
  • Cloud (Token APIs): Easy to start, but scales expensively. You pay for every single token, forever
  • On-Premise: High initial CAPEX (cost to buy). Buy only if your workload is known and sustained

2. 🔒 Data Privacy & Control

  • On-Premise: Data never leaves your machines
  • Critical for: GDPR, sensitive IP, client data, legal documents
  • Legal benefits: No need for complex Data Processing Agreements (AVV) with third-party API providers

3. 🚀 Performance & Latency

  • On-Premise: You get dedicated, predictable, low-latency performance, but you WON'T be able to scale horizontally without engineering effort and investment. But DO YOU realistically scale that fast?
  • Cloud API (Token based): Subject to "noisy neighbor" problems (shared resource), and unpredictable latency/issues
  • Dedicated Rental Cloud GPUs: No "noisy neighbors", but often not economical > 12 months

4. 🎨 Customization

  • Run any model you want (not just what the API offers), how you want
  • Use any inference engine (vLLM, sglang, llama.cpp)
  • Fine-tune and deploy custom models instantly
Critical: While a token-based API has a near-zero talent overhead, an on-premise stack (even when colocated) is complex. It requires significant MLOps and infrastructure engineering talent. What if I don't have access to a datacenter? Where do I actually operate my hardware?
If you lack your own datacenter infrastructure, you can host/colocate your GPU server hardware in a datacenter in Germany, e.g. via AIME.

Open Weight vs. Open Source

Startups often seek their own IP. But how do you build it? You either learn and invent or you adapt. But open models are open, right?

Open Source (e.g., AllenAI Olmo3)

  • You get the weights of the model
  • You get the documentation/paper
  • You get the dataset used to train it
  • You get the training code, eval code etc.
  • You can replicate the entire process on-premise (if you could afford it)
For building your own IP and learning how it really works, "Open Source" is all you need.

Open Weight (e.g., Qwen3)

  • You get the model weights (the final product)
  • You might get some source code
  • You do not get the training data or the full "recipe"
  • You can use and finetune the model, but you cannot replicate it from scratch
  • You need to check the license (some contain restrictions for commercial use)
For on-premise inference, "Open Weight" is all you need (if the license allows you to).

Do You Really Need Fine-Tuning?

As a Startup or Student, I might want to finetune a model to build my IP (if the license allows me to). But is it a good idea?

Let's have the model fit our data!

  • Problem: Your model doesn't know your specific data or follow your specific format
  • Old Solution: Fine-tune the model on thousands of examples
Current Solution: Fix the prompts and the context first.

Fix Before Fine-Tuning:

  • Prompt Engineering / Optimizing (Cheap & Fast)
  • RAG (Retrieval-Augmented Generation) / Long Context (Medium)
  • Fine-Tuning (Complicated & Risky)
Fine-tuning larger models is complicated and requires a well-curated dataset and expertise.

Skipping Finetuning 1: Auto-Optimizing Prompts

Let the model write its own prompts.

  • Tool: DSPy (dspy.ai)
  • Paper: Link
  • Concept: You don't write prompts, you write programs
  • You define the steps (e.g., "Think", "Retrieve", "Answer") and the goal
  • DSPy's "Optimizer" (like GEPA) will automatically test hundreds of prompts and few-shot examples to find the best possible prompt to achieve your goal
  • ~+10% gains on AIME 2025 with GPT-4.1 Mini
GEPA auto-prompting illustration
Prompt Engineering transformed into a (brute-force) optimization problem.
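A minimal DSPy sketch of this idea, assuming a local OpenAI-compatible endpoint (the URL, API key, and model name are placeholders, and optimizer APIs vary between DSPy releases):

```python
import dspy

# Point DSPy at a local OpenAI-compatible server (e.g. vLLM); URL/model are placeholders.
lm = dspy.LM("openai/gpt-oss-120b", api_base="http://localhost:8000/v1", api_key="local")
dspy.configure(lm=lm)

class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# You write the program (steps + goal), not the prompt itself.
program = dspy.ChainOfThought(QA)
print(program(question="What limits LLM decode speed on consumer hardware?").answer)

# An optimizer (e.g. GEPA or MIPROv2) can then "compile" this program against a
# metric and a small training set, searching for better prompts/few-shot examples.
```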

Skipping Finetuning 2: Agentic Context Engine (ACE)

  • Tool: ACE (Agentic Context Engine) (Link)
  • Paper: Link
  • Concept: A new RAG technique that creates a "virtual file system" or "context tree" for the AI
  • Instead of just "dumping" documents into the context, ACE creates a structured hierarchy
  • The model can then "navigate" this context (e.g., cd /project_docs, ls, cat /summary.txt)
  • +10.6% on agents and +8.6% on finance
Agentic Context Engine illustration
Drastically improves RAG performance on complex, multi-document tasks without any fine-tuning.

If You Really Must Fine-Tune...

  • Technique: QLoRA (Quantized Low-Rank Adaptation)
    • Freezes the main model (in 4-bit)
    • Trains only a tiny set of "adapter" weights on top of the quantized base (QLoRA)
    • Result: You can fine-tune gpt-oss-120b on a single GPU with 66GB VRAM Link
  • Tool: Unsloth (unsloth.ai)
    • A drop-in replacement for Hugging Face
    • ~1.5x faster training and 70% less VRAM, 10x longer context lengths compared to standard QLoRA
    • Uses highly optimized Triton kernels
    • QLoRA requirements: gpt-oss-20b = 14GB VRAM • gpt-oss-120b = 65GB VRAM.
      BF16 LoRA requirements: gpt-oss-20b = 44GB VRAM • gpt-oss-120b = 210GB VRAM.
If you want cheap inference, you should always choose a sparse/MoE model as your base for fine-tuning/QLoRA.
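For orientation, a minimal Unsloth QLoRA setup might look like the sketch below (checkpoint name, rank, and target modules are illustrative defaults, not a tuned recipe):

```python
from unsloth import FastLanguageModel

# Load the base model frozen in 4-bit (QLoRA); the checkpoint name is an assumption.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach small trainable LoRA adapters; only these weights get updated.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# Training then typically runs via TRL's SFTTrainer on your curated dataset.
```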

Which Model Should I Fine-Tune?

  • ⚠️ Dense Models (e.g., Llama 3.3 70B):
    • All 70 billion parameters are activated for every single token
    • Requires massive memory bandwidth
  • ✅ Sparse (MoE) Models (e.g., gpt-oss-120b, qwen3-next etc.):
    • Total Parameters: 120B (looks huge!)
    • Active Parameters: Only ~5.1B are activated per token (out of 117B total)
    • Result: Runs as fast as a 30B model, but with the knowledge of a 120B model
    • Training: Much cheaper than for dense models
Stop thinking about parameter count and benchmarks. Start thinking about architecture.
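The architecture argument can be made concrete with a back-of-envelope decode-speed estimate (illustrative assumption: decode is memory-bound, so each generated token must stream the active weights once; real engines add KV-cache traffic and overhead, so expect lower real-world numbers):

```python
# tokens/sec ceiling ≈ memory bandwidth / bytes of ACTIVE parameters per token.
def est_tokens_per_sec(bandwidth_gb_s: float, active_params_b: float,
                       bytes_per_param: float) -> float:
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

# At DGX Spark-class bandwidth (~273 GB/s):
print(est_tokens_per_sec(273, 5.1, 0.5))  # MoE, ~5.1B active @ 4-bit -> ~107 t/s ceiling
print(est_tokens_per_sec(273, 70, 1.0))   # dense 70B @ 8-bit -> ~4 t/s ceiling
```

This is why gpt-oss-120b reaches usable speeds on unified-memory boxes while dense 70B models crawl.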

Training non-LLM models

For startups, there are endless business opportunities in building small, task-specific AI models that perform much better on specific tasks when trained on exclusively accessible datasets, with an efficient AI model architecture. If you have this opportunity, you might train your own model from scratch (e.g., computer vision, recommendation engines, scientific AI) - by the way, that's what we do at NeuraMancer.ai:

  • Just use PyTorch to implement it. It's the industry and research standard
  • The entire ecosystem (data loaders, optimizers, distributed training) is built around it
  • Use Triton kernels and/or torch.compile() for unparalleled speed!
  • Train in FP8 (Hopper) or even FP4 natively (Blackwell)
  • Training is a different workload than LLM inference. It is compute-bound, so raw TFLOPS (Tensor Cores) matter more -> rent a training cluster if you lack the capital
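To make "just use PyTorch" concrete: a minimal compiled training step might look like this sketch (the toy model, data, and hyperparameters are placeholders; native FP8/FP4 training additionally needs libraries such as Transformer Engine):

```python
import torch
import torch.nn as nn

# Toy model standing in for your custom (non-LLM) architecture.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()
model = torch.compile(model)  # kernel fusion / graph compilation for faster steps

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.bfloat16):  # mixed-precision forward/backward
        loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

x = torch.randn(64, 128, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")
print(train_step(x, y))
```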

Token-Based APIs (EU)

You want a "serverless" experience, paid per token, but need EU data privacy:

Pro:
  • Simple to integrate (OpenAI-compatible API endpoints, LangChain, LangGraph)
  • No upfront cost
  • EU providers are 40-60% cheaper than US hyperscalers (AWS, GCP, Azure)
Con:
  • Not an option if you train your own models
  • Not an option if you fine-tune models
  • STACKIT: extremely limited model selection
  • Telekom (OTC): absurdly expensive
The Catch: You must pin the Data Processing Agreement (AVV/DPA) to be EU AI Act compliant.
Providers (Pros / Cons):
  • Scaleway (FR/NL/PL): Pros: sub-200ms TTFT, good price, 1M free tokens
  • OVHcloud (FR/DE/PL): Pros: cheapest provider
  • Regolo.ai (IT): Pros: zero-retention policy (max privacy), 100% renewable energy
  • Nebius (EU/Global): Pros: best model selection, 99.9% SLA, zero data retention; Cons: data centers are global, must pin to EU via contract

Our Startup Needs GPU Hardware!

Typical Use Case: A small team needs to train custom models, fine-tune, or serve sparse MoE models at relatively small scale, on-premise in the EU. If the model is not sparse, it must be a dense Small Language Model (SLM).

Small MVP Setup

  • Build: 1-2x NVIDIA RTX Pro 6000 Blackwell (96-192 GB total VRAM)
  • Price: ~€15,000 - €30,000 all-in
  • Why: Massive VRAM and bandwidth handles almost any workload. ECC memory for reliability

Why not RTX 5090 32GB? It can be useful for lean inference machines, but only for SLMs/sparse models.

RTX 5090 vs RTX 6000 Pro - Benchmark Link

Other NVIDIA Hardware Options

Best option in 2025 / early 2026 for Startups:

  • NVIDIA RTX Pro 6000 Blackwell (96GB).
  • Smaller cards (RTX Pro 4000, 4500, 5000, etc.) or older generations (Ada, etc.) are cheaper and viable for smaller models (SLMs) when scaling inference, but mind the acceleration features.
  • Blackwell supports FP4 acceleration, while older generations do not.

1x NVIDIA RTX 6000 Pro - Offerings

primeLine 1x RTX 6000 Pro AIME 1x RTX 6000 Pro

2x NVIDIA RTX 6000 Pro - Offerings

primeLine 2x RTX 6000 Pro AIME 2x RTX 6000 Pro

Trusted Vendors/Dealers: Where to Buy?

Validate the NVIDIA dealer status: NVIDIA Partner Network

My personal favs (from customer experience):

  • AIME (Germany, Berlin, Berlin)
    • Exceptional software stack and service
    • Deep expertise in AI workflows
    • Offer cloud rental and hardware sales
  • Amber (Germany, Oberhaching/Munich, Bavaria)
    • Highly competent and reliable
    • Often more price-competitive in direct negotiation
    • Focus purely on hardware solutions
  • primeLine Solutions (Germany, Bad Oeynhausen, NRW)
    • Most transparent communication regarding in-stock status and price changes
    • Best online shop product offerings
AIME/Amber: You can't go wrong with either. I'd choose AIME for a premium, software-inclusive experience, and Amber for excellent, reliable hardware at a slightly better price point. When neither offers a product I'm looking for, primeLine Solutions usually has it at a relatively competitive price.

Containerization for Dummies

Now our startup has the hardware, but how do we run the model? You either ship your own inference server (your own PyTorch inference code), or you use one of the popular, highly optimized inference servers.

How do I run this in production?

  • Don't: Run python my_app.py or an inference server inside a screen session, etc.
  • Do: Use Containers
  • I publish such containers! Here's an example: qwen3-omni-vllm-docker
Key Tool: NVIDIA Container Toolkit. This allows your Docker containers to access the GPU with full hardware acceleration.

Why?

  • Isolation: Each AI model and its dependencies (CUDA, PyTorch, etc.) live in their own sealed box using a specific inference server (vLLM, sglang, ollama etc.)
  • Portability: The exact same container runs on your laptop, your on-prem server, or in the cloud
  • Scalability: Easy to manage, update, and deploy multiple copies - easy to evolve into larger deployments - such as Kubernetes or smaller ones using e.g. Kamal
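Once an inference server runs in such a container, your prototypes simply talk to it over its OpenAI-compatible HTTP API. A minimal sketch, assuming a vLLM container listening on localhost:8000 (endpoint, API key, and model name are placeholders):

```python
from openai import OpenAI

# Talk to the containerized inference server via its OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # must match the model the container actually serves
    messages=[{"role": "user", "content": "Summarize why containers help here."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```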

Many Inference Jobs On One GPU

You have one NVIDIA RTX 6000 Pro, but 5 different apps. How do they share it? Each app might use a different model, hosted and optimized by a different containerized inference server (sglang vs vLLM).

Option 1: Shared Process Model (via Container)

  • How it works: One containerized inference server (like vLLM) loads the model into VRAM once, then handles all incoming requests and batches them together. You can also start as many containers (replicas) as your VRAM allows if you need more containerized inference engines or special configurations
  • Pro: Extremely efficient VRAM use. Containerization overhead < 0.9%
  • Con: If your containers consume more VRAM than available, an Out-of-Memory (OOM) error becomes an incident. You must plan well (reserved VRAM fractions, managed replica counts) - see the sketch below
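A minimal sketch of the VRAM-reservation idea using vLLM's gpu_memory_utilization setting (fractions and model names are illustrative; in production each engine would run in its own container with its own fraction):

```python
from vllm import LLM, SamplingParams

# Each engine reserves a fixed fraction of the GPU's VRAM so several models
# can coexist on one card without fighting over memory.
chat = LLM(model="openai/gpt-oss-20b", gpu_memory_utilization=0.60)  # ~60% of VRAM
draft = LLM(model="Qwen/Qwen3-0.6B", gpu_memory_utilization=0.15)    # ~15% of VRAM

out = chat.generate(["Say hello."], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```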

Option 2: MIG (Multi-Instance GPU)

  • How it works: Hardware-level partitioning. You physically slice the GPU's VRAM into (e.g.) 2 smaller, fully isolated GPUs. Each slice gets its own guaranteed VRAM and compute
  • Pro: Total, secure isolation. Feels like multiple GPUs
  • Con: Inefficient. The VRAM split is fixed; if one instance is idle, its VRAM is wasted, and you can't run inference for models larger than a single slice
  • Best for: Securely serving completely different clients/models on one chip. Usually used by cloud GPU compute providers

Scaling faster than you can buy

A nice problem to have! But what do you do? In this case, you might want to rent GPU hardware for < 12 months.

Here are some of my preferred options in Germany:

  • AIME GPU Cloud (Germany):
    • Link
    • Rent massive multi-GPU nodes (H100, etc.) by the hour/day
    • Great for burstable training/fine-tuning
  • IP-Projects (Germany):
    • Link
    • Rent single-GPU inference servers monthly
    • NVIDIA® RTX™ 4000 SFF Ada, NVIDIA® RTX™ 5000 Ada, NVIDIA® RTX™ 6000 Ada
    • from 262€ / month up to 737€ / month
  • Hetzner (Germany):
    • Link / GEX44
    • Intel® Core™ i5-13500 (13th gen, 6 performance + 8 efficiency cores), 64 GB DDR4, 3.84 TB NVMe SSD storage, 1x 1 Gbit/s backbone connection
    • NVIDIA® RTX™ 4000 SFF Ada, NVIDIA® RTX™ 6000 Ada
    • from 184€ (+setup fee) to 967€ / month; Good value, but often older and more limited hardware
Spot-market GPU resources: Often unreliable, often hosted outside of the EU, huge risk regarding data privacy

Smarter Scaling And Cost Savings By Optimization

Let's make inference FASTER and CHEAPER:

  • Use Quantized Models:
    • Use 4-bit, 8-bit, or FP8 models, or dynamic quantization (see Unsloth), where some layers are quantized more aggressively than others
    • Tools like Unsloth provide dynamically quantized models that are extremely fast
  • Use a Modern Inference Engine:
    • vLLM: Best for high-throughput (many users)
    • SGLang: Best for low-latency (TTFT) and highly optimized setups
    • llama.cpp/Ollama: Best for CPU and non-NVIDIA (Mac, AMD)
    • möbius: On edge devices, facilitating ANE/NPU inference (not yet mainstream)
  • Optimize Tokens/Second:
    • Use Speculative Decoding: a small "draft" model (e.g., Qwen3-0.6B) generates 5-10 tokens, and the large model (e.g., gpt-oss-120b) validates them all in one step (see the sketch after this list)
  • Get More Context:
    • Use RoPE Scaling (Linear, NTK, YaRN) to extend a model's context window (e.g., from 8k to 32k)
  • Optimize Time-To-First-Token (TTFT):
    • Use LMCache to cache the prompt processing (prefill). Can be up to 7.7x faster
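To illustrate the speculative decoding idea, here is a conceptual sketch (not any specific engine's API; draft_next and target_next are hypothetical single-token helpers, and real engines verify all draft tokens in one batched forward pass):

```python
# Conceptual sketch of speculative decoding with greedy acceptance.
def speculative_step(draft_next, target_next, prefix, k=5):
    # 1) The small draft model cheaply proposes k tokens.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)          # hypothetical helper: next token from the draft model
        drafted.append(t)
        ctx.append(t)
    # 2) The large target model checks the proposals and keeps the longest
    #    matching prefix (done token by token here; real engines batch this).
    accepted = []
    for t in drafted:
        expected = target_next(list(prefix) + accepted)  # hypothetical helper
        accepted.append(expected)
        if expected != t:
            break                    # first mismatch: keep the target's token and stop
    return accepted                  # several tokens per target "step" when drafts match
```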

EU AI Act: What To Do NOW (Nov 2025)

This is what your legal/compliance team needs to be doing.

✅ Immediately (Overdue!)

  • Unacceptable Risk: (BANNED) e.g., social scoring, real-time biometric surveillance
  • High-Risk: (STRICT) e.g., HR (hiring), critical infrastructure, medical devices
  • Limited Risk: (TRANSPARENCY) e.g., Chatbots, deepfakes (must be labeled)
  • Minimal Risk: (NO RULES) e.g., spam filters
  • Audit & Remove: Check all systems against Art. 5 (Banned Systems). Stop using them
  • Ensure AI Literacy: (Art. 4) You are already required to be training your staff
German Law (KI-VO): The draft is out. The Bundesnetzagentur (BNetzA) will be the central market surveillance authority. Regulations will be enforced from October 2026 at the earliest (expect delays and changes!).

⏰ Urgent (Deadline: August 2026 - 9 months left)

  • Full AI Inventory: Document every AI system you use or plan to use
  • Classify Risk: Is it High-Risk? (e.g., sorting resumes? YES.)
  • Identify Role: Are you a "Provider" or "Deployer"?
  • Implement Transparency: Label all AI-generated content and chatbots
  • Start Now (for High-Risk): Begin building your Risk Management System, Quality Management System (QMS), and Technical Documentation
  • Check Contracts: Audit all 3rd-party vendor contracts (incl. AVV/DPA)

Other Legal Hurdles

It's not just the AI Act.

🇪🇺 GDPR (Datenschutzgrundverordnung)

  • The AI Act does not replace GDPR
  • If you process personal data, you still need a legal basis, data processing agreements (AVV), and to conduct data protection impact assessments (DSFA)
  • On-Premise is your friend here.

🇪🇺 EU Product Liability Directive

  • If your AI system causes harm (e.g., gives bad medical or financial advice), you (the provider/deployer) can be held liable, especially if it is a Software-as-a-Service (SaaS) product
  • How to protect yourself: Meticulous documentation, rigorous testing, and clear user disclaimers

Certified (Free) Training & Resources

  • Looking for structured, certified learning on AI regulation or compliance?
  • As of now (Nov 2025), there is no regulation of AI training providers/certifications in Germany.
  • You should be safe having your employees get certifications from reputable AI training providers.
  • I recommend the free offering of KI-Campus (ki-campus.org). It offers a wide range of free, high-quality online courses and micro-certifications in German, covering AI, data ethics, the AI Act, and law.
  • Include such courses as part of onboarding and annual staff training for compliance (Art. 4).
  • Do you need further information? Contact me - I can share more information privately and refer you to expert lawyers.

I'm an Enterprise! I Need A CTO Pitch

"I am an SME, I have a budget. My CTO asked me for strategy advice. What do I propose?"

The Enterprise-Grade Proposal:

  • Hardware: Start with a certified NVIDIA DGX System (e.g., DGX H100/H200 or DGX Spark clusters)
    • Why: It's the industry standard. It comes with full enterprise support, guaranteed performance, and a seamless software stack (DGX OS). It's the "no-one-gets-fired-for-buying-IBM" choice

 

  • Software: Deploy on Kubernetes (K8s)
    • Why: It's the standard for scalable, resilient container orchestration
  • Conformance: Adhere to the new CNCF Certified Kubernetes AI Conformance program (k8s-ai-conformance)
    • Why: This ensures your K8s cluster is specifically configured for AI workloads (e.g., GPU scheduling, networking), making it standardized and future-proof

My Current Top Model Picks

Of course, you can always have a look at leaderboards like LMArena, MTEB & co., but they lag behind new model releases. Here are my current personal picks for different tasks:

LLM (Chat & Reasoning)

  • OpenAI/gpt-oss-120b: Best overall open-weight model for reasoning (effort: high) - especially for typical chatbot cases
  • Qwen/qwen3-next-80b-A3B-instruct: Incredible performance, task-dependent better than gpt-oss-120b

Embedding (RAG & Semantic Search)

  • Qwen/qwen3-embedding-8b: Best overall performance in real-world use cases
  • ibm-granite/granite-embedding-models - Various well-performing embedding and reranking models of different sizes

OCR (Text Recognition)

  • datalab-to/chandra: SOTA for complex documents
  • (Previously: AllenAI olmOCR 2)

ASR (Speech-to-Text)

  • Realtime: nvidia/parakeet-tdt-0.6b-v3 (faster) or nvidia/canary-1b-v2 (higher precision)
  • Best Whisper-like: nyrahealth/CrisperWhisper or unsloth/nyrahealth/CrisperWhisper - for MLX-based inference, use my model kyr0/crisperwhisper-unsloth-mlx-8b
  • qwen3-omni-30b-A3B-instruct: Very high multilingual precision even in noisy environments; Audio classification even for non-voice audio (music, fx etc.)
  • Potential runner-up: FAIR (Meta) omnilingual-asr

VLM (Vision / Multimodal)

  • Qwen/qwen3-omni: Excellent object detection and grounding for images and videos
  • Qwen/qwen3-vl: A smaller model that performs very well for its size
  • OpenBMB/MiniCPM-V-4_5: An even smaller model that performs very well for its size

🔮 2025 to 2026: Hybrid Architectures Become Standard

🚀 2025: The Shift From Experimental to Standard

  • Hybrid SSM + Attention Architectures: Interleaved SSM (State Space Model) + Attention blocks today, with more intra-layer mixing emerging
  • Why It Matters: ~3× faster throughput on long-context workloads vs. pure original Transformer baselines
  • Growing Industry Trend: AI21 (Jamba), Google (Griffin), Alibaba (Qwen3/Gated DeltaNet)

🎯 What We Might See in 2026

  • 🌟 Ultra-Sparse MoE: Tens to hundreds of experts, with only a small fraction active per token (e.g. Qwen3-Next)
  • 🌟 Context Windows: 256K–1M native-token contexts becoming standard, not just frontier-only
  • 🌟 Hardware Optimization: Specialized kernels and runtimes for hybrid primitives, dual-mode (NPU/APU/TPU + GPU) inference engines
  • 🌟 Dominantly O(n) Inference: More layers scale roughly linearly with context length, improving long-context throughput vs. quadratic attention
  • 🌟 Memory Efficiency: 500K to 1M-token contexts with more graceful scaling instead of sharp memory cliffs
  • 🌟 Reduced Performance Degradation: Longer contexts remain usable with slower quality drop-off than in traditional transformers

🛠️ Hardware-Model Co-Design Matures

  • 🌟 Specialized Silicon: Continued advances in APUs, NPUs, unified memory, native quantization, etc.
  • 🌟 Modular Inference: Deploy SSM on CPUs, attention on NPUs, MoE experts distributed
  • 🌟 On-Device Inference: 7B–13B models run efficiently on consumer hardware and high-end mobile devices with mixed linear/SSM–attention architectures

🛣️ Potential Outcomes

  • 🌟 Close-to-Linear Attention Models: Sub-quadratic scaling architectures compete directly with hybrids
  • 🌟 Adaptive Routing: Models dynamically choose computation type by activating layers based on the context
  • 🌟 On-Device Ramps Up: Efficient inference enables broad consumer deployment; SMEs increasingly adopt on-premise AI infrastructure in the EU
  • 🌟 EU AI Act Revision: A more realistic timeline and/or a simplified framework will be enforced

How Do I Stay Up-to-Date?

It's a full-time job. Here is my strategy:

1. Primary Sources:

  • LinkedIn: Follow VIP users and orgs
  • GitHub: Follow VIP users and orgs
  • HuggingFace: Follow VIP users and orgs
  • arXiv: Daily check for new papers (ML, CS.AI, CS.CL), linked via https://papers.cool
  • Social Media platforms: Follow VIP users and orgs

2. Aggregators:

  • Reddit: r/LocalLLaMA, r/vLLM etc. (for practical/hardware info), r/MachineLearning
  • Hacker News: For general tech/AI trends

3. Community (The most important):

  • Active participation in Discord servers (e.g., Unsloth AI, llama.cpp)
  • Direct discussion with peers, researchers, and other AI engineers
  • This is what I'm trying to fix!

minloss.club – Join the Waitlist

I'm introducing a new, ML-focused community for engaged AI/ML practitioners and researchers – engineering-first discussions on MLOps and model architecture, without hype.

minloss.club

  • What: A free, curated, but invitation-only (gated) community for ML/AI researchers, developers, and ML ops – focused entirely on research, engineering, sharing information, discussing ideas.
  • Why: To share practical knowledge, code, and strategies for training, and scaling real-world, production-grade AI systems from small on-premise systems to large, planet-scale deployments
  • Who: ML/AI researchers, developers, and MLOps engineers - no pure vibe coders, but professionals; people who prioritize rigorous engineering over buzzwords and self-promotion.
  • How: New papers, repos, ideas and models are surfaced early, deduplicated, and filtered for signal over noise, and summarized – so your time is spent on what actually improves outcomes -> min(loss).
  • Where: A private micro-community (Mattermost), complemented by a public blog and a mailing list.

Join the waitlist via WhatsApp

WhatsApp QR
Vision: To support and connect the people who actually design, build, and operate real world AI systems.

Contact & Resources

Thank you for your attention!

Download Slides

Connect with me on LinkedIn

Aron LinkedIn