Optimizing Qwen3.5 27B & 122B for agentic intelligence on Quad RTX 3090
The stage
I have 4x RTX 3090, two of them NVLinked and one of them an RTX 3090 Ti. All are undervolted, usually set to 175W as a sweet spot. This system also has one sad NVIDIA A4000 that is a dump for all the unloved and unwanted models I may need to run concurrently with the main model, like Whisper, maybe some STT, or ComfyUI.
Additionally I have a Strix Halo machine in the form of a Bosgame M5 with 128 GB of 8000 MHz RAM, which is most useful for hosting smaller fine-tuned models, uncensored MoE models in MXFP4, or ComfyUI with VRAM-hungry models for video, image, or 3D model generation.
Finding the current meta
Each time a new open-weights model drops, there is a possibility for the current meta of local hosting to shift.
For the longest time the unquestioned king of quality at that size was GLM-4.5-Air. Some may say it was gpt-oss-120b, but they were wrong, unless the goal was just to generate a lot of tokens very fast.
After all not all models are created equal.
Autonomous agentic use case
My main use case is a multi-agent system that works somewhat like OpenClaw, but was built before crab fever consumed everything; it is based on OpenCode agents compiled with a custom framework. It is basically a bunch of orchestrator agents layered multiple levels deep. Because of that, I can use even very weak local models without worrying that access to thousands of tools would overwhelm the agent. Currently the system is composed of 150+ subagents that route tasks to each other while using tools via bash.

This architecture promotes economical intelligence: I don't need one slow but extremely large model contorted via quantization to fit in 96 GB of VRAM. On the other hand, decomposing each task into extremely small and simple steps would also be counterproductive; I need something that can handle complex tasks but won't require me to basically hard-code each agentic path explicitly.
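To make the routing idea concrete, here is a minimal sketch of the orchestrator-and-subagent pattern described above. The class names, the keyword-based `route` logic, and the `run_tool` helper are all hypothetical; they only illustrate how a shallow orchestrator can delegate to narrow subagents that call tools via bash, not how my actual framework works.

```python
# Hypothetical sketch of the orchestrator/subagent routing pattern.
# Names and structure are illustrative only, not the real framework.
import subprocess
from dataclasses import dataclass, field


def run_tool(cmd: str) -> str:
    """Run a shell command, the way subagents use tools via bash."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout.strip()


@dataclass
class Subagent:
    name: str
    keywords: list[str]  # crude routing signal, just for the sketch

    def handle(self, task: str) -> str:
        # A real subagent would call the LLM with a narrow prompt and a
        # small tool set; here we just shell out to a single placeholder tool.
        return run_tool(f"echo '{self.name} handling: {task}'")


@dataclass
class Orchestrator:
    subagents: list[Subagent] = field(default_factory=list)

    def route(self, task: str) -> str:
        # Pick the first subagent whose keywords match the task.
        for agent in self.subagents:
            if any(k in task.lower() for k in agent.keywords):
                return agent.handle(task)
        return "no subagent matched; escalate or decompose further"


if __name__ == "__main__":
    orch = Orchestrator([
        Subagent("search-agent", ["find", "search", "grep"]),
        Subagent("edit-agent", ["edit", "refactor", "fix"]),
    ])
    print(orch.route("search the repo for TODO markers"))
```

In the real system the orchestrators themselves are agents, so this routing step repeats several layers deep before a leaf subagent ever touches a tool.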
The winning config
Currently Qwen3.5-27B is the king of efficiency. By combining MTP-based speculative decoding with FP8 KV-cache quantization, the model can achieve up to 4800 tokens/s of prefill throughput while maintaining solid generation speeds on 4 GPUs.
```yaml
name: vllm-thinking
services:
  vllm:
    image: vllm/vllm-openai:v0.19.0
    restart: unless-stopped
    runtime: nvidia
    shm_size: 8gb
    ipc: host
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,2,3,4
      - CUDA_DEVICE_ORDER=PCI_BUS_ID
      - RAY_memory_monitor_refresh_ms=0
      - NCCL_CUMEM_ENABLE=0
      - NCCL_NVLINK_DISABLE=0
      - VLLM_ENABLE_CUDAGRAPH_GC=1
      - VLLM_USE_FLASHINFER_SAMPLER=1
      - PYTORCH_ALLOC_CONF=expandable_segments:True
    volumes:
      - "/mnt/ssd-4tb/ai_models/models/hub:/root/.cache/huggingface/hub"
    ports:
      - "8082:8000"
    command: >
      --model cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
      --served-model-name cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8
      --quantization compressed-tensors
      --port 8000
      --host 0.0.0.0
      --tensor-parallel-size 4
      -O3
      --max-model-len 262144
      --gpu-memory-utilization 0.9
      --dtype auto
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --reasoning-parser qwen3
      --limit-mm-per-prompt '{"image":10,"video":2}'
      --enable-prefix-caching
      --disable-custom-all-reduce
      --kv-cache-dtype fp8
      --max-num-seqs 12
      --max-num-batched-tokens 8192
      --compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,12]}'
      --trust-remote-code
      --no-use-tqdm-on-load
      --generation-config auto
      --attention-backend FLASHINFER
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
      --override-generation-config '{"temperature":1.0,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 300s
```
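Once the container reports healthy, the server exposes vLLM's OpenAI-compatible API on port 8082. A minimal client sketch, assuming the `openai` Python package is installed and no API key is enforced (the key value is a placeholder):

```python
# Minimal client sketch against the vLLM OpenAI-compatible endpoint above.
# Assumes the `openai` package is installed; the api_key value is arbitrary.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8",  # must match --served-model-name
    messages=[{"role": "user", "content": "List three uses of NVLink, one line each."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```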
Power vs performance
Tested at three power limits: 200W (my usual undervolt), 250W and 300W. The difference is not subtle.
| Power | Samples | Avg Gen (t/s) | Median Gen (t/s) | P95 Gen (t/s) | Avg Prefill (t/s) | Median Prefill (t/s) |
|---|---|---|---|---|---|---|
| 200W | 161 | 54.8 | 50.6 | 126.2 | 974.1 | 681.2 |
| 250W | 47 | 151.6 | 164.3 | 219.1 | 872.8 | 379.6 |
| 300W | 53 | 108.9 | 114.3 | 213.5 | 1161.6 | 722.7 |
250W nearly triples generation throughput over 200W. Going to 300W actually drops generation back down while prefill climbs a bit, likely because thermal throttling kicks in on the non-Ti cards or the power delivery starts bouncing.
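For reference, a sketch of how per-run power caps like these can be applied. The GPU indices and the 250W value are only examples, and `nvidia-smi -pl` may require root or persistence mode depending on the driver setup:

```python
# Sketch: cap each 3090 to a given power limit before a benchmark run.
# GPU indices and wattage are examples; adjust to your own topology.
import subprocess

GPUS = [0, 2, 3, 4]      # same devices as NVIDIA_VISIBLE_DEVICES above
POWER_LIMIT_W = 250

for gpu in GPUS:
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )
```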


Generation by concurrency
Data from a real agentic workload, capped at 6 concurrent requests (beyond that everything falls apart anyway). All throughput values are in tokens/s.
| Power limit | Concurrent reqs | Samples | Avg gen t/s | Median gen t/s | Peak gen t/s | Per-request t/s |
|---|---|---|---|---|---|---|
| 200W | 1 | 90 | 53 | 49 | 107 | 53 |
| 200W | 2 | 37 | 76 | 71 | 153 | 38 |
| 200W | 3 | 6 | 56 | 74 | 93 | 19 |
| 200W | 4 | 11 | 52 | 54 | 149 | 13 |
| 200W | 5 | 5 | 23 | 20 | 53 | 5 |
| 200W | 6 | 7 | 26 | 2 | 99 | 4 |
| 250W | 1 | 2 | 116 | 116 | 119 | 116 |
| 250W | 2 | 5 | 178 | 194 | 214 | 89 |
| 250W | 3 | 19 | 172 | 207 | 245 | 57 |
| 250W | 4 | 10 | 143 | 171 | 277 | 36 |
| 250W | 5 | 4 | 156 | 155 | 164 | 31 |
| 250W | 6 | 2 | 144 | 144 | 211 | 24 |
| 300W | 1 | 8 | 65 | 83 | 127 | 65 |
| 300W | 2 | 21 | 113 | 103 | 218 | 57 |
| 300W | 3 | 15 | 148 | 167 | 211 | 49 |
| 300W | 4 | 11 | 118 | 132 | 231 | 30 |
| 300W | 5 | 12 | 73 | 65 | 202 | 15 |
| 300W | 6 | 2 | 28 | 28 | 53 | 5 |




Per-request generation throughput drops fast beyond 2-3 concurrent requests regardless of power level. For agentic workloads where latency matters, 250W at 1-3 concurrent requests is the sweet spot on this rig.
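If your orchestration layer lets you control fan-out, the simplest way to stay in that sweet spot is to cap client-side concurrency. A minimal sketch using an asyncio semaphore with the async OpenAI client; the limit of 3 mirrors the findings above, everything else is illustrative:

```python
# Sketch: cap concurrent requests to the vLLM server at the observed sweet spot.
# Uses the async OpenAI client; prompts and the limit of 3 are illustrative.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8082/v1", api_key="not-needed")
limit = asyncio.Semaphore(3)  # per-request t/s degrades quickly past ~3


async def ask(prompt: str) -> str:
    async with limit:  # queue excess requests instead of over-batching the server
        resp = await client.chat.completions.create(
            model="cyankiwi/Qwen3.5-27B-AWQ-BF16-INT8",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=128,
        )
        return resp.choices[0].message.content


async def main() -> None:
    prompts = [f"Summarize task {i} in one sentence." for i in range(8)]
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print("\n".join(results))


asyncio.run(main())
```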