Carlo Moro

cnmoro

AI & ML interests

None yet

Recent Activity

replied to JonnaMat's post about 2 hours ago

⚡ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline - up to 2x inference speedup, training-free. Previously required custom Docker images; now it's just:

```
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

✨ The plugin activates automatically via vLLM's `vllm.general_plugins` entry point. No source patches, no custom imports.

🧩 Supported models (full collection): https://huggingface.co/Qwen Qwen3, https://huggingface.co/meta-llama Llama3, https://huggingface.co/google Gemma3, https://huggingface.co/nvidia Cosmos-Reason2 - BF16 and W4A16 variants. https://huggingface.co/collections/embedl/flashhead

📊 https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks

🔧 Benchmark it yourself:

```
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```

FlashHead shines at low batch sizes; the typical real-time / on-device use case. 🚀

reacted to JonnaMat's post with 🚀 about 2 hours ago

liked a dataset about 2 hours ago

badlogicgames/pi-mono

New activity in principled-intelligence/gemma-4-E2B-it-text-only 8 days ago

GGUF?

#1 opened 8 days ago by

New activity in cnmoro/Qwen2.5-0.5B-Portuguese-v1 about 2 months ago

Improve language tag

#3 opened 12 months ago by

New activity in NoesisLab/Spartacus-1B-Instruct about 2 months ago

High RAM usage

#1 opened about 2 months ago by

New activity in cnmoro/LFM-Q4-GGUFS 3 months ago

Portuguese GGUFs?

#1 opened 3 months ago by

New activity in mradermacher/model_requests 5 months ago

GGUF Request - LFM2-ColBERT-350M

#1488 opened 5 months ago by

New activity in huggingface/InferenceSupport 5 months ago

LiquidAI/LFM2-ColBERT-350M

#5834 opened 5 months ago by

New activity in LiquidAI/LFM2-ColBERT-350M 5 months ago

Usage via API

#2 opened 5 months ago by

New activity in 12bitmisfit/Qwen3-30B-A3B-Instruct-2507_Pruned_REAP-15B-A3B-GGUF 6 months ago

Multilingual degradation

#1 opened 6 months ago by

New activity in ibm-granite/granite-4.0-h-tiny 6 months ago

Fill-In-the-Middle (FIM) code completions

#4 opened 6 months ago by

New activity in artificialguybr/Surya-OCR 8 months ago

What is surya-ocr version

#3 opened about 1 year ago by

New activity in cnmoro/gliclass-modern-large-v3.0-onnx 9 months ago

Full text construction issue

#1 opened 9 months ago by

New activity in knowledgator/gliclass-base-v3.0 9 months ago

ONNX

#1 opened 9 months ago by

New activity in mradermacher/model_requests 9 months ago

langtech-languagemodeling/IberianLLM-7B-Instruct

#1165 opened 9 months ago by

New activity in VLM2Vec/VLM2Vec-V2.0 9 months ago

Any doc for this model for usage?

#2 opened 9 months ago by

New activity in moondream/moondream-2b-2025-04-14-4bit 10 months ago

Repetition Penalty

#4 opened 10 months ago by

New activity in google/gemma-3-27b-it 11 months ago

Image encoding in the prompt

#64 opened 12 months ago by

areumtecnologia

New activity in moondream/moondream-2b-2025-04-14-4bit 11 months ago

moondream_station question

#1 opened 11 months ago by

New activity in cnmoro/Qwen2.5-0.5B-Portuguese-v2 about 1 year ago

Adding the Open Portuguese LLM Leaderboard Evaluation Results

#1 opened about 1 year ago by

leaderboard-pt-pr-bot

New activity in cnmoro/Qwen2.5-0.5B-Portuguese-v1 about 1 year ago

Adding the Open Portuguese LLM Leaderboard Evaluation Results

#2 opened about 1 year ago by

leaderboard-pt-pr-bot

New activity in cnmoro/Qwen2.5-0.5B-Portuguese-Hybrid-Reasoning about 1 year ago

Adding the Open Portuguese LLM Leaderboard Evaluation Results

#1 opened about 1 year ago by

leaderboard-pt-pr-bot