Carlo Moro
cnmoro
63 followers · 92 following
cnmoro
carlo-moro-4a20a7132
AI & ML interests
None yet
Recent Activity
replied to JonnaMat's post about 2 hours ago
⚡ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin

flash-head replaces the dense LM head with a two-stage retrieval pipeline - up to 2x inference speedup, training-free. Previously required custom Docker images; now it's just:

```
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
```

✨ The plugin activates automatically via vLLM's `vllm.general_plugins` entry point. No source patches, no custom imports.

🧩 Supported models (full collection): https://huggingface.co/Qwen Qwen3, https://huggingface.co/meta-llama Llama3, https://huggingface.co/google Gemma3, https://huggingface.co/nvidia Cosmos-Reason2 - BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead

📊 https://huggingface.co/spaces/embedl/Edge-Inference-Benchmarks

🔧 Benchmark it yourself:

```
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1

# Baseline comparison
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
```

FlashHead shines at low batch sizes; the typical real-time / on-device use case. 🚀
reacted to JonnaMat's post with 🚀 about 2 hours ago
liked a dataset about 2 hours ago
badlogicgames/pi-mono
Organizations
cnmoro's activity
New activity in
principled-intelligence/gemma-4-E2B-it-text-only
8 days ago
GGUF?
#1 opened 8 days ago by
cnmoro
New activity in
cnmoro/Qwen2.5-0.5B-Portuguese-v1
about 2 months ago
Improve language tag
#3 opened 12 months ago by
lbourdois
New activity in
NoesisLab/Spartacus-1B-Instruct
about 2 months ago
High RAM usage
1
#1 opened about 2 months ago by
cnmoro
New activity in
cnmoro/LFM-Q4-GGUFS
3 months ago
Portuguese GGUFs?
2
#1 opened 3 months ago by
xande-p
New activity in
mradermacher/model_requests
5 months ago
GGUF Request - LFM2-ColBERT-350M
1
#1488 opened 5 months ago by
cnmoro
New activity in
huggingface/InferenceSupport
5 months ago
LiquidAI/LFM2-ColBERT-350M
👍
3
#5834 opened 5 months ago by
cnmoro
New activity in
LiquidAI/LFM2-ColBERT-350M
5 months ago
Usage via API
2
#2 opened 5 months ago by
cnmoro
New activity in
12bitmisfit/Qwen3-30B-A3B-Instruct-2507_Pruned_REAP-15B-A3B-GGUF
6 months ago
Multilingual degradation
➕
1
#1 opened 6 months ago by
cnmoro
New activity in
ibm-granite/granite-4.0-h-tiny
6 months ago
Fill-In-the-Middle (FIM) code completions
1
#4 opened 6 months ago by
cnmoro
New activity in
artificialguybr/Surya-OCR
8 months ago
What is surya-ocr version
5
#3 opened about 1 year ago by
Redgalaxy2
New activity in
cnmoro/gliclass-modern-large-v3.0-onnx
9 months ago
Full text construction issue
1
#1 opened 9 months ago by
BioMike
New activity in
knowledgator/gliclass-base-v3.0
9 months ago
ONNX
2
#1 opened 9 months ago by
cnmoro
New activity in
mradermacher/model_requests
9 months ago
langtech-languagemodeling/IberianLLM-7B-Instruct
1
#1165 opened 9 months ago by
cnmoro
New activity in
VLM2Vec/VLM2Vec-V2.0
9 months ago
Any doc for this model for usage?
➕
1
2
#2 opened 9 months ago by
htian01
New activity in
moondream/moondream-2b-2025-04-14-4bit
10 months ago
Repetition Penalty
#4 opened 10 months ago by
cnmoro
New activity in
google/gemma-3-27b-it
11 months ago
Codificação de imagens no prompt
6
#64 opened 12 months ago by
areumtecnologia
New activity in
moondream/moondream-2b-2025-04-14-4bit
11 months ago
moondream_station question
2
#1 opened 11 months ago by
cnmoro
New activity in
cnmoro/Qwen2.5-0.5B-Portuguese-v2
about 1 year ago
Adding the Open Portuguese LLM Leaderboard Evaluation Results
#1 opened about 1 year ago by
leaderboard-pt-pr-bot
New activity in
cnmoro/Qwen2.5-0.5B-Portuguese-v1
about 1 year ago
Adding the Open Portuguese LLM Leaderboard Evaluation Results
#2 opened about 1 year ago by
leaderboard-pt-pr-bot
New activity in
cnmoro/Qwen2.5-0.5B-Portuguese-Hybrid-Reasoning
about 1 year ago
Adding the Open Portuguese LLM Leaderboard Evaluation Results
#1 opened about 1 year ago by
leaderboard-pt-pr-bot