**HuggingFace:** [https://huggingface.co/MedAIBase/AntAngelMed](https://huggingface.co/MedAIBase/AntAngelMed)

**ModelScope:** [https://modelscope.cn/models/MedAIBase/AntAngelMed](https://modelscope.cn/models/MedAIBase/AntAngelMed)

**Github:** [https://github.com/MedAIBase/AntAngelMed/tree/main](https://github.com/MedAIBase/AntAngelMed/tree/main)
9
+ # Introduction
10
+
11
+ **AntAngelMed is Officially Open Source! 🚀 **
12
+
13
+ **AntAngelMed**, developed by **Ant Group** and the **Health Commission of Zhejiang Province**, is the largest and most powerful open-source medical language model to date.
14
+

# Core Highlights

+ 🏆 **World-leading performance on authoritative benchmarks**: AntAngelMed surpasses all open-source models and a range of top proprietary models on OpenAI's HealthBench, and ranks first overall on MedAIBench, an authoritative Chinese benchmark.
+ 🧠 **Advanced medical capabilities**: AntAngelMed acquires its professional medical capabilities through a rigorous three-stage training pipeline: continual pre-training on medical corpora, supervised fine-tuning with high-quality instructions, and GRPO-based reinforcement learning. This process equips the model with deep medical knowledge, sophisticated diagnostic reasoning, and robust adherence to safety and ethics.
+ ⚡ **Extremely efficient inference**: Leveraging [Ling-flash-2.0](https://arxiv.org/abs/2507.17702)'s high-efficiency MoE architecture, AntAngelMed matches the performance of ~40B dense models while activating only 6.1B of its 100B total parameters. It achieves over 200 tokens/s on H20 hardware and supports a 128K context length.

# 📊 Benchmark Results

## HealthBench

[**HealthBench**](https://arxiv.org/abs/2505.08775) is an open-source medical evaluation benchmark released by OpenAI, designed to assess how large language models (LLMs) perform in realistic, multi-turn medical dialogues. AntAngelMed achieves outstanding results on this benchmark, ranking first among all open-source models, with a particularly large margin on the challenging HealthBench-Hard subset.

## MedAIBench

[**MedAIBench**](https://www.medaibench.cn) is an authoritative medical LLM evaluation system developed by the National Artificial Intelligence Medical Industry Pilot Facility. AntAngelMed also **ranks first overall**, demonstrating strong professionalism and safety across the board, especially in medical knowledge Q&A and medical ethics/safety.

![](https://intranetproxy.alipay.com/skylark/lark/0/2025/png/135556672/1765855632812-4659290c-8f89-4378-aa40-1df0fcbd6e78.png)

**Figure | AntAngelMed ranks first among open-source models on HealthBench and first overall on MedAIBench.**

## MedBench

[**MedBench**](https://arxiv.org/abs/2511.14439) is a scientific and rigorous benchmark for evaluating LLMs in the Chinese healthcare domain, comprising 36 independently curated evaluation datasets with approximately 700,000 samples. AntAngelMed ranks first on the MedBench self-assessment leaderboard and leads across five core dimensions: medical knowledge question answering, medical language understanding, medical language generation, complex medical reasoning, and safety and ethics. These results highlight the model's professionalism, safety, and clinical applicability.

![](https://intranetproxy.alipay.com/skylark/lark/0/2025/png/1591/1766130714462-1a4d7350-6255-4bd7-a01b-79fa6f9161ed.png)

**Figure | AntAngelMed ranks first on the MedBench self-assessment leaderboard.**

# 🔧 Technical Features

## Professional three-stage training pipeline

AntAngelMed employs a carefully designed three-stage training process to deeply integrate general capabilities with medical expertise:

+ **Continual Pre-Training:** Starting from Ling-flash-2.0, AntAngelMed is continually pre-trained on large-scale, high-quality medical corpora (encyclopedias, web text, academic publications), injecting profound domain and world knowledge.
+ **Supervised Fine-Tuning (SFT):** A multi-source, heterogeneous, high-quality instruction dataset is constructed at this stage. General data (math, programming, logic) strengthen AntAngelMed's core chain-of-thought capabilities, while medical scenarios (doctor–patient Q&A, diagnostic reasoning, safety/ethics) provide deep adaptation for improved clinical performance.
+ **Reinforcement Learning (RL):** Using the [**GRPO**](https://arxiv.org/pdf/2402.03300) algorithm and task-specific reward models, RL precisely shapes model behavior, emphasizing empathy, structural clarity, and safety boundaries, and encouraging evidence-based reasoning on complex cases to reduce hallucinations and improve accuracy (a brief recap of the GRPO objective is given below).
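
For readers unfamiliar with GRPO, here is a brief recap of its core idea as described in the cited paper (this is background on the algorithm, not a description of AntAngelMed's specific reward design). For each prompt, a group of G responses is sampled and scored with rewards r_1, ..., r_G, and each response's advantage is computed relative to its own group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

This provides a baseline without training a separate value model, which makes the method comparatively lightweight for large-scale RL.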

![](https://intranetproxy.alipay.com/skylark/lark/0/2025/jpeg/135556672/1765944098319-b6dc6933-3a6a-4d85-ae97-e9d98c6983c5.jpeg)

**Figure | Professional three-stage training pipeline**

## Efficient MoE architecture with high-speed inference

AntAngelMed inherits Ling-flash-2.0's advanced design. Guided by the [Ling Scaling Laws](https://arxiv.org/abs/2507.17702), the model uses a **1/32 activation-ratio MoE** and is comprehensively optimized across core components, including expert granularity, shared-expert ratio, attention balance, auxiliary-loss-free sigmoid routing, an MTP layer, QK-Norm, and Partial-RoPE.

These refinements allow **small-activation** MoE models to deliver up to **7× efficiency** over similarly sized dense architectures. In other words, with only 6.1B activated parameters, AntAngelMed can match the performance of ~40B dense models. Because of its small activated parameter count, AntAngelMed also offers substantial speed advantages:

+ On H20 hardware, inference exceeds **200 tokens/s**, about **3× faster** than a 36B dense model.
+ With **YaRN extrapolation**, it supports a **128K context length**; as output length grows, relative speedups can reach 7× or more.

![Figure | Model Architecture Diagram (https://huggingface.co/inclusionAI/Ling-flash-2.0)](https://intranetproxy.alipay.com/skylark/lark/0/2025/png/1591/1764724109582-56e0ca94-e8fd-4f49-a233-f9afe9e12801.png)

We have also specifically optimized AntAngelMed for inference acceleration with **FP8 quantization combined with EAGLE3 speculative decoding**. At a concurrency of 32, this approach significantly boosts inference throughput compared to FP8 alone, with improvements of **71% on HumanEval, 45% on GSM8K**, and **as high as 94% on Math-500**, striking a robust balance between inference performance and model stability.

# Quickstart

## 🤗 Hugging Face Transformers

Here is a code snippet showing how to chat with the model using `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MedAIBase/AntAngelMed"  # model_id or your_local_model_path

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What should I do if I have a headache?"
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)
# Strip the prompt tokens so that only the newly generated reply is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
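
For interactive use, you can optionally stream the reply token by token instead of waiting for the full generation. Below is a minimal sketch using `transformers`' `TextStreamer`, reusing the `model`, `tokenizer`, and `model_inputs` objects from the snippet above:

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they are generated;
# skip_prompt avoids echoing the chat-templated prompt back.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **model_inputs,
    max_new_tokens=16384,
    streamer=streamer,
)
```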

## 🤖 ModelScope

If you are in mainland China, we strongly recommend downloading and using the model from 🤖 [ModelScope](https://modelscope.cn/organization/MedAIBase).
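
For example, you can pre-download the weights with `modelscope`'s `snapshot_download` and point the snippets in this README at the returned local path (the `cache_dir` below is just an illustrative location):

```python
from modelscope import snapshot_download

# Download the model files from ModelScope and get the local directory.
local_dir = snapshot_download("MedAIBase/AntAngelMed", cache_dir="./models")
print(local_dir)  # use this path as model_name / model_path in the other examples
```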

## Deployment on NVIDIA A100

### vLLM

vLLM supports both offline batched inference and an OpenAI-compatible API server for online inference.

#### Environment Preparation

Please prepare the following environment:

```bash
pip install vllm==0.11.0
```

#### Inference

```python
from modelscope import AutoTokenizer
from vllm import LLM, SamplingParams

def main():
    model_path = "MedAIBase/AntAngelMed"  # model_id or your_local_model_path
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        repetition_penalty=1.05,
        max_tokens=16384,
    )
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        dtype="bfloat16",
        tensor_parallel_size=4,
    )

    prompt = "What should I do if I have a headache?"
    messages = [
        {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    outputs = llm.generate([text], sampling_params)
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    main()
```
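
As an alternative to applying the chat template manually, recent vLLM releases also expose `LLM.chat`, which accepts OpenAI-style messages directly. A minimal sketch reusing the `llm` and `sampling_params` objects from the example above (availability of this API depends on your vLLM version):

```python
# vLLM applies the model's chat template internally when using llm.chat().
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": "What should I do if I have a headache?"},
]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```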

### SGLang

#### Environment Preparation

Prepare the following environment:

```bash
pip install sglang -U
```

You can also use the Docker image:

```bash
docker pull lmsysorg/sglang:latest
```

#### Run Inference

SGLang supports both BF16 and FP8 models, depending on the dtype of the checkpoint in `${MODEL_PATH}`; both use the same commands below.

+ Start the server:

```bash
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --host 0.0.0.0 --port $PORT \
    --trust-remote-code \
    --attention-backend fa3 \
    --tensor-parallel-size 4 \
    --served-model-name AntAngelMed
```

+ Client:

```bash
curl -s http://localhost:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "AntAngelMed", "messages": [{"role": "user", "content": "What should I do if I have a headache?"}]}'
```
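
Because the endpoint is OpenAI-compatible, you can also call it from Python with the official `openai` client. A minimal sketch (the port below is a placeholder for the `$PORT` used when starting the server):

```python
from openai import OpenAI

# Any non-empty API key works for a local OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="AntAngelMed",  # matches --served-model-name
    messages=[{"role": "user", "content": "What should I do if I have a headache?"}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```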

More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).

## Deployment on Ascend 910B

### vLLM-Ascend

vLLM-Ascend (vllm-ascend) is a community-maintained hardware backend that enables vLLM to run on Ascend NPUs.

#### Environment Preparation

We recommend running this model on an Ascend Atlas 800I A2 server in the 64 GB × 8 memory configuration.

We recommend using Docker for deployment. Please prepare the environment by following the steps below:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.11.0rc3
```

Next, start and enter the container with the following commands, then perform the subsequent steps inside the container:

```bash
NAME=your_container_name
MODEL_PATH=/absolute/path/to/your/local/model  # if you already have the weights locally

docker run -itd --privileged --name=$NAME --net=host \
    --shm-size=1000g \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device /dev/devmm_svm \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v $MODEL_PATH:$MODEL_PATH \
    quay.io/ascend/vllm-ascend:v0.11.0rc3 \
    bash

docker exec -u root -it $NAME bash
```

For both offline and online inference with vLLM, make sure the following environment variables are set in the terminal before execution:

```bash
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE="AIV"
export NPU_MEMORY_FRACTION=0.97
export TASK_QUEUE_ENABLE=1
export OMP_NUM_THREADS=100
export ASCEND_LAUNCH_BLOCKING=0
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# You can use the ModelScope mirror to speed up the download:
export VLLM_USE_MODELSCOPE=true
```

#### Offline Inference

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "MedAIBase/AntAngelMed"  # model_id or your_local_model_path
tokenizer = AutoTokenizer.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)
llm = LLM(model=model_path,
          dtype="float16",
          tensor_parallel_size=4,
          gpu_memory_utilization=0.97,
          enable_prefix_caching=True,
          enable_expert_parallel=True,
          trust_remote_code=True)

prompt = "What should I do if I have a headache?"
messages = [
    {"role": "system", "content": "You are AntAngelMed, a helpful medical assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
```

#### Online Inference

```bash
model_id=MedAIBase/AntAngelMed
taskset -c 0-23 python3 -m vllm.entrypoints.openai.api_server \
    --model $model_id \
    --max-num-seqs 200 \
    --tensor-parallel-size 4 \
    --data-parallel-size 2 \
    --enable-expert-parallel \
    --gpu-memory-utilization 0.97 \
    --served-model-name AntAngelMed \
    --max-model-len 32768 \
    --port 8080 \
    --enable-prefix-caching \
    --block-size 128 \
    --async-scheduling \
    --trust-remote-code
```

```bash
curl http://0.0.0.0:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "AntAngelMed",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What should I do if I have a headache?"
        }
    ],
    "temperature": 0.6
}'
```

For detailed guidance, please refer to the vLLM-Ascend quick start [here](https://docs.vllm.ai/projects/ascend/zh-cn/latest/quick_start.html).

# License

This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ling-V2/blob/master/LICENCE).