```
🥲 Failed to load model
error loading model: error loading model architecture: unknown model architecture: 'qwen3vlmoe'
```
[ModelLoadingProvider] Estimate to use 20742417866.19 bytes when loaded. (model: 20229396257.23, context: 513021608.96000004) Previous estimation: 27496266757.399998 bytes.
[ModelLoadingProvider] Estimate to use 20742417866.19 bytes when loaded. (model: 20229396257.23, context: 513021608.96000004) Previous estimation: 27496266757.399998 bytes.
[ModelLoadingProvider] Estimate to use 20742417866.19 bytes when loaded. (model: 20229396257.23, context: 513021608.96000004) Previous estimation: 27496266757.399998 bytes.
[ModelLoadingProvider] Requested to load model huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated/ggml-model-Q4_K_M.gguf with opts {
identifier: {
desired: 'huihui-qwen3-vl-30b-a3b-instruct-abliterated',
conflictBehavior: 'bump'
},
excludeUserModelDefaultConfigLayer: true,
instanceLoadTimeConfig: { fields: [] },
ttlMs: undefined,
bypassGuardrails: false
}
[ModelLoadingProvider] Estimate to use 20742417866.19 bytes when loaded. (model: 20229396257.23, context: 513021608.96000004) Previous estimation: 27496266757.399998 bytes.
[ModelLoadingProvider] Started loading model huihui-ai/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated/ggml-model-Q4_K_M.gguf
[ModelProxyObject(id=huihui-qwen3-vl-30b-a3b-instruct-abliterated)] Forking LLMWorker with custom envVars: {"LD_LIBRARY_PATH":"/home/user/.lmstudio/extensions/backends/vendor/linux-llama-cuda-vendor-v1"}
[ProcessForkingProvider][NodeProcessForker] Spawned process 255331
17:17:15.532 › [LMSInternal][Client=LM Studio][Endpoint=loadModel] Error in channel handler: Error: Received load-error
at _0x528116. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:915:141356)
at _0x22deea._0xadffd (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:6583)
at _0x22deea.emit (node:events:518:28)
at _0x22deea.onChildMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:104:209372)
at _0x22deea.onChildMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:4055)
at _0x44c735. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:104:208379)
at _0x44c735.emit (node:events:518:28)
at ChildProcess. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:587:21216)
at ChildProcess.emit (node:events:518:28)
at emit (node:internal/child_process:949:14)
- Caused By: Error: Failed to load model
at _0x4465ad.LLMEngineWrapper.load (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:85:15399)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async _0x3bf452.loadModel (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:113:10517)
at async _0x3bf452.handleMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:113:2096)
[LMSInternal][Client=LM Studio][Endpoint=loadModel] Error in loadModel channel Error: Received load-error
at _0x528116. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:915:141356)
at _0x22deea._0xadffd (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:6583)
at _0x22deea.emit (node:events:518:28)
at _0x22deea.onChildMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:104:209372)
at _0x22deea.onChildMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:4055)
at _0x44c735. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:104:208379)
at _0x44c735.emit (node:events:518:28)
at ChildProcess. (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:587:21216)
at ChildProcess.emit (node:events:518:28)
at emit (node:internal/child_process:949:14) - Caused By: Error: Failed to load model
at _0x4465ad.LLMEngineWrapper.load (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:85:15399)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async _0x3bf452.loadModel (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:113:10517)
at async _0x3bf452.handleMessage (/tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/lib/llmworker.js:113:2096) {
title: 'Failed to load model',
cause: "error loading model: error loading model architecture: unknown model architecture: 'qwen3vlmoe'"
}
17:17:15.533 › [LMSInternal][Client=LM Studio][Endpoint=loadModel] No instance reference assigned before error
17:17:15.533 › [LMSInternal][Client=LM Studio][Endpoint=countTokens] Error in RPC handler: Error: Model is unloaded.
at /tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:3713
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
17:17:15.533 › [LMSInternal][Client=LM Studio][Endpoint=countTokens] Error in RPC handler: Error: Model is unloaded.
at /tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:3713
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
17:17:15.533 › Unhandled Rejection at: {} reason: Error: Model is unloaded.
at /tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:3713
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
17:17:15.534 › Unhandled Rejection at: {} reason: Error: Model is unloaded.
at /tmp/.mount_LM-Stud6Wda5/resources/app/.webpack/main/index.js:119:3713
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
[ProcessForkingProvider][NodeProcessForker] Exited process 255331
For testing, we only used the executables under tr-qwen3-vl-b6910-ef91761, or executables recompiled from its source code; LM Studio has not been tested. tr-qwen3-vl-b6910-ef91761 is a branch of llama.cpp.
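For anyone who wants to reproduce that setup by recompiling, here is a minimal build sketch, not an official recipe. It assumes the branch/tag is fetched from the Thireus/llama.cpp fork whose release page is mentioned later in this thread, and that a standard CMake toolchain is installed; adjust the URL, ref, and CUDA flag to your environment.

```
# Minimal sketch (assumption: the ref lives in the Thireus/llama.cpp fork
# mentioned later in this thread; adjust if your source differs).
git clone https://github.com/Thireus/llama.cpp.git
cd llama.cpp
git checkout tr-qwen3-vl-b6910-ef91761   # tag/branch name as given in this thread
cmake -B build -DGGML_CUDA=ON            # omit -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j
# The multimodal CLI used in the log below is produced at build/bin/llama-mtmd-cli
```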
It has been tested, and it doesn't load it :)
It shows up in the model list with vision and tool use enabled and all that, but the above happens.
(base) user@user:~$ cd ~/llama.cpp
export GGML_NO_CUDA=1
./build/bin/llama-mtmd-cli \
    -m models/Qwen3-VL-30B-A3B-Q4_K_S.gguf \
    --mmproj models/mmproj-Qwen3-VL-30B-A3B-F16.gguf \
    --image /home/user/G2nqQD7W0AAEVA0.jpeg \
    -p "Find the kitty" \
    -ngl 0 \
    --threads 32 \
    -b 256
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
build: 6706 (ef4c5b87) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 32 key-value pairs and 579 tensors from models/Qwen3-VL-30B-A3B-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3vlmoe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 VL 30B A3B Thinking
llama_model_loader: - kv 3: general.finetune str = Thinking
llama_model_loader: - kv 4: general.basename str = Qwen3-VL
llama_model_loader: - kv 5: general.size_label str = 30B-A3B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: qwen3vlmoe.block_count u32 = 48
llama_model_loader: - kv 8: qwen3vlmoe.context_length u32 = 262144
llama_model_loader: - kv 9: qwen3vlmoe.embedding_length u32 = 2048
llama_model_loader: - kv 10: qwen3vlmoe.feed_forward_length u32 = 6144
llama_model_loader: - kv 11: qwen3vlmoe.attention.head_count u32 = 32
llama_model_loader: - kv 12: qwen3vlmoe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 13: qwen3vlmoe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 14: qwen3vlmoe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3vlmoe.expert_used_count u32 = 8
llama_model_loader: - kv 16: qwen3vlmoe.attention.key_length u32 = 128
llama_model_loader: - kv 17: qwen3vlmoe.attention.value_length u32 = 128
llama_model_loader: - kv 18: qwen3vlmoe.expert_count u32 = 128
llama_model_loader: - kv 19: qwen3vlmoe.expert_feed_forward_length u32 = 768
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 29: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 14
llama_model_loader: - type f32: 241 tensors
llama_model_loader: - type q4_K: 327 tensors
llama_model_loader: - type q5_K: 10 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Small
print_info: file size = 16.25 GiB (4.57 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3vlmoe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 2048
print_info: n_layer = 48
print_info: n_head = 32
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 6144
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 30B.A3B
print_info: model params = 30.53 B
print_info: general.name = Qwen3 VL 30B A3B Thinking
print_info: n_ff_exp = 768
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_REPACK model buffer size = 15387.75 MiB
load_tensors: CPU_Mapped model buffer size = 16533.65 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 256
llama_context: n_ubatch = 256
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache: CPU KV buffer size = 384.00 MiB
llama_kv_cache: size = 384.00 MiB ( 4096 cells, 48 layers, 1/1 seqs), K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CPU compute buffer size = 152.38 MiB
llama_context: graph nodes = 2983
llama_context: graph splits = 106 (with bs=256), 1 (with bs=1)
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name: Qwen3 VL 30B A3B Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 459
clip_model_loader: n_kv: 24
clip_model_loader: has vision encoder
clip_ctx: CLIP using CPU backend
load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2048
--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 1033.28 MiB
load_hparams: metadata size: 0.16 MiB
alloc_compute_meta: CPU compute buffer size = 2.08 MiB
main: loading model: models/Qwen3-VL-30B-A3B-Q4_K_S.gguf
encoding image slice...
image slice encoded in 14562 ms
decoding image batch 1/4, n_tokens_batch = 256
image decoded (batch 1/4) in 5770 ms
decoding image batch 2/4, n_tokens_batch = 256
image decoded (batch 2/4) in 5989 ms
decoding image batch 3/4, n_tokens_batch = 256
image decoded (batch 3/4) in 6109 ms
decoding image batch 4/4, n_tokens_batch = 256
image decoded (batch 4/4) in 5993 ms
Looking at the faces, most are human, but maybe one has cat features. Let's scan through. The image has many faces, so I need to look for something that's not human. Wait, maybe the "kitty" is a cat hidden among the people. Let's check each face.
Wait, the user mentioned "Find the kitty", so in the crowd of people, there's a cat. Let's look for a cat's face. Maybe the cat is drawn as a face with cat ears or a cat nose. Let's check the details.
Looking at the image, the faces are all human-like, but maybe one is a cat. Let's see. Wait, maybe the cat is in the lower part. Wait, the red circle is pointing to something. Wait, the red circle is on the right side, maybe that's the cat. Wait, no, the red circle is a marker. Wait, the problem is to find the kitty, so the answer is the cat that's hidden.
Wait, maybe the cat is the one with the cat ears. Let's check. In the crowd, there's a face that's a cat. Let's look for features like whiskers, a cat nose, or cat ears. Wait, the image is a bit small, but let's try.
Wait, maybe the cat is the one in the lower right area. Wait, no. Wait, the correct answer is that the cat is the face with the cat's features. Let me think. In some of these puzzles, the cat is hidden as a face with cat-like features. Let's check the faces.
Wait, looking at the faces, there's one that has a cat's face. Let's see. The face with the cat ears. Wait, maybe the face in the lower left? No. Wait, the red circle is pointing to a face. Wait, the red circle is on the right side, maybe that's the cat. Wait, no, the red circle is probably the answer marker.
Wait, maybe the cat is the one with the whiskers. Let's check. In the image, there's a face that has whiskers, which is a cat. Let's look for that.
Wait, I think the cat is the face that has a cat's nose and whiskers. Let's find that. Looking at the crowd, there's a face with whiskers, which is the cat. Let's locate it.
Wait, the correct answer is that the cat is the face with whiskers, so the position is... Let's see. In the image, the cat is the one with the whiskers, so the answer is the specific face. But since it's a visual puzzle, the answer is the cat's position.
Wait, maybe the cat is in the lower right. Wait, no. Let's think again. The problem is to find the kitty, so the answer is the cat hidden among the people. Let's check each face.
Wait, here's a thought: in some of these puzzles, the cat is the one with the cat's face, so the whiskers. Let's look for a face with whiskers. Yes, there's a face with whiskers, which is the cat. Let's find that.
Looking at the image, the cat is the face with whiskers, so the answer is that specific face. Since the user is asking to find it, the answer is the cat's position. But since it's a text-based response, maybe the answer is the description.
Wait, no, the user probably wants the location. But in the image, the cat is the one with whiskers. Let me confirm. Yes, in the crowd, there's a face that's a cat with whiskers. So the answer is that cat.
The "kitty" in the crowd is a cat face hidden among the human faces. It is distinguished by whiskers and a cat-like expression. After carefully scanning the image, the cat is located in the lower-left section of the crowd, where one of the faces features whiskers and a feline appearance.
Answer: The cat is the face with whiskers in the lower-left area of the crowd.
llama_perf_context_print: load time = 18119.73 ms
llama_perf_context_print: prompt eval time = 38913.90 ms / 1035 tokens ( 37.60 ms per token, 26.60 tokens per second)
llama_perf_context_print: eval time = 116456.82 ms / 959 runs ( 121.44 ms per token, 8.23 tokens per second)
llama_perf_context_print: total time = 156988.41 ms / 1994 tokens
llama_perf_context_print: graphs reused = 955
llama is useful
For testing, we only used the executables under tr-qwen3-vl-b6910-ef91761, or executables recompiled from its source code; LM Studio has not been tested. tr-qwen3-vl-b6910-ef91761 is a branch of llama.cpp.
Hi, thanks a lot for your excellent quantization work. I'm new to using this model with LM Studio on Windows with a GPU. I downloaded your model and the bin-win-cuda-12.8-x64.zip build from the Thireus/llama.cpp release page (tag tr-qwen3-vl-b6910-ef91761). After extracting it, I copied the files into LMStudio\llama.cpp, overwriting the existing files with the same names. I selected CUDA 12 llama.cpp as the default engine, but LM Studio still fails to load the model and shows: 🥲 Failed to load the model; error loading model: error loading model architecture: unknown model architecture: 'qwen3vlmoe'.
Did I do something wrong? My LM Studio is the latest version, 0.3.31.