WIP: This article is actively being written. Content may be incomplete or change without notice.

Optimizing Qwen3.5 27B & 122B for agentic intelligence on quad RTX 3090s

#vllm #qwen3.5 #rtx-3090 #speculative-decoding
Aggregate generation throughput by concurrency across 200W, 250W, and 300W power levels

The stage

I have 4x RTX 3090s; two of them are NVLinked, and one is actually an RTX 3090 Ti. All are undervolted, usually set to 175W as a sweet spot. The system also has one sad NVIDIA A4000 that serves as a dump for all the unloved and unwanted models I may need to run alongside the main model: Whisper, maybe some TTS, or some ComfyUI.

Additionally, I have a Strix Halo machine in the form of the Bosgame M5 with 128GB of 8000MHz RAM. It is most useful for hosting smaller fine-tuned models, uncensored MoE models in MXFP4, or ComfyUI with VRAM-hungry models for video, image, or 3D generation.

Finding the current meta

Each time a new open-weights model drops, there is a possibility for the current meta of local hosting to shift.

For the longest time, the unquestioned king of quality at that size was GLM-4.5-Air. Some may say it was gpt-oss-120b, but they were wrong, unless the goal was just to generate a lot of tokens very fast.

After all, not all models are created equal.

Autonomous agentic use case

My main use case is a multi-agent system that works somewhat like OpenClaw, but was built before crab fever consumed everything; it is based on OpenCode agents compiled with a custom framework. It is essentially a set of orchestrator agents layered multiple levels deep. Because of that, I can use even very weak local models without worrying that access to thousands of tools would overwhelm the agent. Currently the system is composed of 150+ subagents that route tasks to each other while using tools via bash.

This architecture promotes economical intelligence: I don't need one slow but extremely large model contorted via quantization to fit into 96GB of VRAM. On the other hand, decomposing each task into extremely small and simple steps would also be counterproductive; I need something that can handle complex tasks without requiring me to essentially hard-code each agentic path explicitly.
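The layered-orchestrator idea can be sketched in a few lines. This is a toy illustration, not the actual framework; the `Agent` class and keyword-based `route` method are made up for this example:

```python
# Toy sketch of layered orchestrator routing: each orchestrator only sees
# the handful of subagents directly below it, so no single model is ever
# exposed to thousands of tools at once. All names here are illustrative.

class Agent:
    def __init__(self, name, tools=None, children=None):
        self.name = name
        self.tools = tools or []        # bash-invokable tools this agent owns
        self.children = children or []  # subagents it can delegate to

    def route(self, task: str) -> str:
        # Delegate to the first child whose name appears in the task;
        # otherwise handle the task locally with this agent's own tools.
        for child in self.children:
            if child.name in task:
                return child.route(task)
        return f"{self.name} handles {task!r} with tools {self.tools}"

# Two layers of orchestrators above the leaf workers.
git_agent = Agent("git", tools=["bash:git"])
tests_agent = Agent("tests", tools=["bash:pytest"])
code_orch = Agent("code", children=[git_agent, tests_agent])
root = Agent("root", children=[code_orch])

print(root.route("code tests run the suite"))
```

The point of the layering is that each model call only ever sees a handful of routing options, which is why even weak local models hold up.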

The winning config

Currently Qwen3.5-27B is the king of efficiency: with MTP-based speculative decoding and the KV cache quantized to FP8, the model can achieve up to 4,800 tokens/s prefill throughput while maintaining solid generation speeds across 4 GPUs.
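As a rough intuition for why two speculative tokens pay off: with per-token draft acceptance rate a and k speculative tokens per step, the standard estimate of expected tokens emitted per verification step is (1 - a^(k+1)) / (1 - a). A quick sketch, with illustrative (not measured) acceptance rates:

```python
# Expected tokens emitted per verification step with k speculative tokens
# and per-token acceptance rate a (standard speculative-decoding estimate):
#   E = 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)
def expected_tokens_per_step(a: float, k: int) -> float:
    return (1 - a ** (k + 1)) / (1 - a)

# Acceptance rates here are hypothetical, just to show the shape of the curve.
for a in (0.6, 0.7, 0.8):
    print(a, round(expected_tokens_per_step(a, k=2), 2))
```

With k=2 (as in the config below) and decent acceptance rates, each verification step yields roughly 2 tokens instead of 1, which is where the generation speedup comes from.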

```yaml
name: vllm-thinking

services:
  vllm:
    image: vllm/vllm-openai:v0.19.0
    restart: unless-stopped
    runtime: nvidia
    shm_size: 8gb
    ipc: host
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,2,3,4
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - RAY_memory_monitor_refresh_ms=0
      - NCCL_CUMEM_ENABLE=0
      - NCCL_NVLINK_DISABLE=0
      - VLLM_ENABLE_CUDAGRAPH_GC=1
      - VLLM_USE_FLASHINFER_SAMPLER=1
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    volumes:
      - "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
    ports:
      - "8082:8000"
    command: >
      --model cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
      --served-model-name cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
      --quantization compressed-tensors
      --port 8000
      --host 0.0.0.0
      --tensor-parallel-size 4
      -O3
      --max-model-len 262144
      --gpu-memory-utilization 0.9
      --dtype auto
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --limit-mm-per-prompt '{"image":10,"video":2}'
      --enable-prefix-caching
      --disable-custom-all-reduce
      --kv-cache-dtype fp8
      --max-num-seqs 12
      --max-num-batched-tokens 8192
      --compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,12]}'
      --trust-remote-code
      --no-use-tqdm-on-load
      --generation-config auto
      --attention-backend FLASHINFER
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
```
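Once the container reports healthy, a quick way to poke it is the OpenAI-compatible chat endpoint on the mapped host port. A minimal stdlib-only sketch; the model name and port 8082 are taken from the compose file above, while `build_chat_request` and `send_chat` are helper names made up for this example:

```python
import json
import urllib.request

# Assemble an OpenAI-compatible chat payload for the vLLM server.
def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

# POST the payload to the server and return the assistant's reply text.
def send_chat(payload: dict, base_url: str = "http://localhost:8082") -> str:
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

payload = build_chat_request(
    "cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8", "Say hello in one word."
)
# print(send_chat(payload))  # uncomment once the healthcheck passes
```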

Power vs performance

I tested at three power limits: 200W (my usual undervolt), 250W, and 300W. The difference is not subtle.

| Power | Samples | Avg gen (t/s) | Median gen (t/s) | P95 gen (t/s) | Avg prefill (t/s) | Median prefill (t/s) |
|-------|---------|---------------|------------------|---------------|-------------------|----------------------|
| 200W  | 161     | 54.8          | 50.6             | 126.2         | 974.1             | 681.2                |
| 250W  | 47      | 151.6         | 164.3            | 219.1         | 872.8             | 379.6                |
| 300W  | 53      | 108.9         | 114.3            | 213.5         | 1161.6            | 722.7                |

Going from 200W to 250W nearly triples generation throughput. Going to 300W actually drops generation back down while prefill climbs a bit, likely because thermal throttling kicks in on the non-Ti cards or the power delivery starts bouncing.
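Normalizing the average generation throughput from the table above by the per-card power limit makes the efficiency argument concrete. This is a relative index only (total board power is roughly 4x the per-card limit), with the numbers copied straight from the table:

```python
# Generation efficiency at each per-card power limit, using the average
# aggregate throughput from the table above. Units: t/s per watt of the
# per-card limit, so a relative index rather than true system efficiency.
avg_gen = {200: 54.8, 250: 151.6, 300: 108.9}

efficiency = {watts: round(tps / watts, 3) for watts, tps in avg_gen.items()}
print(efficiency)  # 250W is the clear winner on tokens per watt
```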

Generation throughput by power level and concurrency — grouped bars with min-max whiskers

Prefill throughput by power level and concurrency — grouped bars with min-max whiskers

Generation by concurrency

Data from a real agentic workload, capped at 6 concurrent requests (beyond that, everything falls apart anyway). All throughput values are in tokens/s.

| Power limit | Concurrent reqs | Samples | Avg gen (t/s) | Median gen (t/s) | Peak gen (t/s) | Per-request (t/s) |
|-------------|-----------------|---------|---------------|------------------|----------------|-------------------|
| 200W        | 1               | 90      | 53            | 49               | 107            | 53                |
| 200W        | 2               | 37      | 76            | 71               | 153            | 38                |
| 200W        | 3               | 6       | 56            | 74               | 93             | 19                |
| 200W        | 4               | 11      | 52            | 54               | 149            | 13                |
| 200W        | 5               | 5       | 23            | 20               | 53             | 5                 |
| 200W        | 6               | 7       | 26            | 2                | 99             | 4                 |
| 250W        | 1               | 2       | 116           | 116              | 119            | 116               |
| 250W        | 2               | 5       | 178           | 194              | 214            | 89                |
| 250W        | 3               | 9       | 172           | 207              | 245            | 57                |
| 250W        | 4               | 10      | 143           | 171              | 277            | 36                |
| 250W        | 5               | 4       | 156           | 155              | 164            | 31                |
| 250W        | 6               | 2       | 144           | 144              | 211            | 24                |
| 300W        | 1               | 8       | 65            | 83               | 127            | 65                |
| 300W        | 2               | 21      | 113           | 103              | 218            | 57                |
| 300W        | 3               | 15      | 148           | 167              | 211            | 49                |
| 300W        | 4               | 11      | 118           | 132              | 231            | 30                |
| 300W        | 5               | 12      | 73            | 65               | 202            | 15                |
| 300W        | 6               | 2       | 28            | 28               | 53             | 5                 |
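The per-request column is just the aggregate average divided by the concurrency level, rounded to the nearest integer. A quick sanity check over a few rows, with values copied from the table:

```python
# Sanity-check the per-request column: aggregate avg gen t/s / concurrency.
# Tuples are (power_w, concurrency, avg_gen, reported_per_request),
# copied from a few rows of the table above.
rows = [
    (200, 2, 76, 38),
    (250, 4, 143, 36),
    (300, 5, 73, 15),
]

for power, conc, avg, reported in rows:
    assert round(avg / conc) == reported, (power, conc)
print("per-request column consistent")
```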

Aggregate generation throughput by concurrency

Aggregate prefill throughput by concurrency

Per-request generation throughput across concurrency levels — the metric that matters for agentic latency

Box plot distributions for generation and prefill throughput by power level

Per-request generation throughput drops fast beyond 2-3 concurrent requests regardless of power level. For agentic workloads where latency matters, 250W at 1-3 concurrent requests is the sweet spot on this rig.