PyTorch
gpt2
Muhammadidrees committed
Commit 7bceb01 · verified · 1 Parent(s): 3039fde

Upload folder using huggingface_hub

Files changed (8)
  1. .gitattributes +0 -1
  2. README.md +149 -0
  3. config.json +38 -0
  4. merges.txt +0 -0
  5. pytorch_model.bin +3 -0
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +1 -0
  8. vocab.json +0 -0
.gitattributes CHANGED
@@ -25,7 +25,6 @@
  *.safetensors filter=lfs diff=lfs merge=lfs -text
  saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
  *.tflite filter=lfs diff=lfs merge=lfs -text
  *.tgz filter=lfs diff=lfs merge=lfs -text
  *.wasm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,149 @@
+ ---
+ license: bigscience-bloom-rail-1.0
+ datasets:
+ - pubmed
+ widget:
+ - text: 'Photosynthesis is'
+ ---
+
+ # Model Card for BioMedLM 2.7B
+
+ Note: This model was previously known as PubMedGPT 2.7B, but we have changed the name due to a request from the NIH, which holds the trademark for "PubMed".
+
+ Paper: [BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text](https://arxiv.org/abs/2403.18421)
+
+ BioMedLM 2.7B is a new language model trained exclusively on biomedical abstracts and papers from [The Pile](https://pile.eleuther.ai/). This GPT-style model can achieve strong results on a variety of biomedical NLP tasks, including a new state-of-the-art performance of 50.3% accuracy on the MedQA biomedical question answering task.
+
+ As an autoregressive language model, BioMedLM 2.7B is also capable of natural language generation. However, we have only begun to explore the generation capabilities and limitations of this model, and we emphasize that this model’s generation capabilities are for research purposes only and not suitable for production. In releasing this model, we hope to advance both the development of biomedical NLP applications and best practices for responsibly training and utilizing domain-specific language models; issues of reliability, truthfulness, and explainability are top of mind for us.
+
+ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.edu/) and [MosaicML](https://www.mosaicml.com/).
+
+ # Table of Contents
+
+ - [Model Card for BioMedLM 2.7B](#model-card-for-biomedlm-27b)
+ - [Table of Contents](#table-of-contents)
+ - [Model Details](#model-details)
+ - [Model Description](#model-description)
+ - [Uses](#uses)
+ - [Direct Use](#direct-use)
+ - [Downstream Use](#downstream-use)
+ - [Out-of-Scope Use](#out-of-scope-use)
+ - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
+ - [Recommendations](#recommendations)
+ - [Training Details](#training-details)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+ - [Preprocessing](#preprocessing)
+ - [Technical Specifications](#technical-specifications)
+ - [Model Architecture and Objective](#model-architecture-and-objective)
+ - [Compute Infrastructure](#compute-infrastructure)
+
+ # Model Details
+
+ ## Model Description
+
+ BioMedLM 2.7B is a new language model trained exclusively on biomedical abstracts and papers from [The Pile](https://pile.eleuther.ai/). This GPT-style model can achieve strong results on a variety of biomedical NLP tasks, including a new state-of-the-art performance of 50.3% accuracy on the MedQA biomedical question answering task.
+
+ As an autoregressive language model, BioMedLM 2.7B is also capable of natural language generation. However, we have only begun to explore the generation capabilities and limitations of this model, and we emphasize that this model’s generation capabilities are for research purposes only and not suitable for production. In releasing this model, we hope to advance both the development of biomedical NLP applications and best practices for responsibly training and utilizing domain-specific language models; issues of reliability, truthfulness, and explainability are top of mind for us.
+
+ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.edu/) and [MosaicML](https://www.mosaicml.com/).
+
+ - **Developed by:** Stanford CRFM, MosaicML
+ - **Shared by:** Stanford CRFM
+ - **Model type:** Language model
+ - **Language(s) (NLP):** en
+ - **License:** [bigscience-bloom-rail-1.0](https://huggingface.co/spaces/bigscience/license)
+
+ # Uses
+
+ This model is licensed under the terms of the [BigScience Open RAIL-M license](https://huggingface.co/spaces/bigscience/license) used for [BLOOM](https://huggingface.co/bigscience/bloom-1b1). Please note that, among other restrictions, this license forbids use of the model (or derivatives thereof) "To provide medical advice and medical results interpretation." If you are concerned that your use case would fall under the "letter" of this restriction, but not the "spirit," you can contact us to discuss.
+
+ ## Direct Use
+
+ It is possible to use this model to generate text, which is useful for experimentation and for understanding its capabilities. It should not be used directly for production or for work that may directly impact people.
+
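As an illustration of that kind of experimentation (not part of the committed model card), a minimal generation sketch with the Hugging Face `transformers` library might look like the following, assuming the uploaded checkpoint and tokenizer load as a standard GPT-2 model; the repository id is a placeholder to replace with this repo's actual path:

```python
# Minimal sketch: sampling from the uploaded GPT-2-style checkpoint with transformers.
# "stanford-crfm/BioMedLM" is used as an illustrative repository id only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

repo_id = "stanford-crfm/BioMedLM"  # assumption: substitute the correct repo id
tokenizer = GPT2Tokenizer.from_pretrained(repo_id)
model = GPT2LMHeadModel.from_pretrained(repo_id)
model.eval()

inputs = tokenizer("Photosynthesis is", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,          # matches the "text-generation" defaults in config.json
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```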
+ ## Downstream Use
+
+ The main way we have used this model is by fine-tuning it for downstream question answering tasks, and we recommend using it that way.
+
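For a rough sense of what such fine-tuning could look like, here is a minimal sketch (not the authors' recipe, and not part of the committed model card); the sequence-classification head, label count, example texts, and repository id are all illustrative assumptions:

```python
# Illustrative only: attaching a classification head to the checkpoint for a
# downstream QA-style task and computing one training loss.
import torch
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer

repo_id = "stanford-crfm/BioMedLM"  # assumption: substitute the correct repo id
tokenizer = GPT2Tokenizer.from_pretrained(repo_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token by default

model = GPT2ForSequenceClassification.from_pretrained(repo_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

batch = tokenizer(
    ["Question text with candidate answer A", "Question text with candidate answer B"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])                 # placeholder labels
outputs = model(**batch, labels=labels)
outputs.loss.backward()                       # a fine-tuning loop would step an optimizer here
```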
+ ## Out-of-Scope Use
+
+ We do not recommend using this model for natural language generation in a production environment, fine-tuned or otherwise.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ While this model is capable of generating natural language text, we have only begun to explore this capability and its limitations. Understanding these limitations is especially important in a domain like medicine. Therefore, **we strongly recommend against using this model in production for natural language generation.**
+
+ # Training Details
+
+ ## Training Data
+
+ This model was trained on the PubMed Abstracts and Full Text from [The Pile](https://pile.eleuther.ai/).
+
+ ## Training Procedure
+
+ The model was trained on [MosaicML Cloud](https://www.mosaicml.com/cloud), a platform designed for large workloads like LLMs. Using the [Composer](https://github.com/mosaicml/composer) training library and [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html), it was easy to enable multi-node training across 128 A100-40GB GPUs, and the total run was completed in ~6.25 days. The model was trained with batch size=1024 and sequence length=1024 for 300B tokens using Decoupled AdamW with the following settings:
+
+ | Setting | Value |
+ | --- | ------ |
+ | lr | 1.6e-4 |
+ | eps | 1e-8 |
+ | betas | \[0.9, 0.95\] |
+ | weight decay | 1.6e-5 |
+
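As a small sketch of the settings in this table (not part of the committed model card, and not the authors' training script), they could be written out with Composer's `DecoupledAdamW`; the tiny module below is only a stand-in for the real 2.7B-parameter model:

```python
# Illustrative only: the Decoupled AdamW settings from the table above,
# expressed with Composer's DecoupledAdamW optimizer.
import torch
from composer.optim import DecoupledAdamW

model = torch.nn.Linear(2560, 2560)  # placeholder module, not BioMedLM itself
optimizer = DecoupledAdamW(
    model.parameters(),
    lr=1.6e-4,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=1.6e-5,
)
```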
+ The training process was very smooth and did not suffer from any divergences.
+
+ As we were preparing the training run, we were unsure of the benefits of training out to 300B tokens for language model perplexity and downstream task performance. While most models of this scale (e.g. GPT-Neo 2.7B) are trained to 300-400B tokens, the datasets those models use are vastly larger than PubMed. For instance, The Pile is 8x the size of its PubMed subcorpora.
+
+ Fortunately, we did continue to see steady perplexity improvements on the validation and training sets for the entirety of training, and preliminary experiments showed improved downstream task performance as we trained out to the full 300B tokens. Our takeaway was that it was indeed worth it to train for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models make.
+
+ ### Preprocessing
+
+ The model uses a custom tokenizer trained on the PubMed Abstracts. When building domain-specific models, we have found it important to use a tokenizer trained on in-domain text to maximize performance on downstream tasks. A key benefit is that common biomedical terms are represented as entire tokens.
+
+ For instance, all of the following terms are tokenized into single tokens by the biomedical tokenizer but into multiple tokens by the standard GPT-2 tokenizer:
+
+ | Biomedical tokenizer | Standard GPT-2 tokenizer |
+ | --- | --- |
+ | chromatography | chrom/atography |
+ | cytotoxicity | cyt/ot/oxicity |
+ | Immunohistochemistry | Immun/oh/ist/ochemistry |
+ | photosynthesis | photos/ynthesis |
+ | probiotic | prob/iotic |
+
+ This allows the model to encode information about these concepts in their individual token representations, rather than spreading it across subword tokens such as “oh” that are shared with many other terms.
+
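A small sketch of how this comparison could be reproduced with `transformers` (not part of the committed model card); `stanford-crfm/pubmed_gpt_tokenizer` is the `name_or_path` recorded in this upload's tokenizer_config.json, and loading it from the Hub under that id is an assumption:

```python
# Illustrative comparison of the in-domain tokenizer with the standard GPT-2 tokenizer.
from transformers import GPT2Tokenizer

biomed_tok = GPT2Tokenizer.from_pretrained("stanford-crfm/pubmed_gpt_tokenizer")
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

for term in ["chromatography", "cytotoxicity", "photosynthesis", "probiotic"]:
    print(f"{term!r}: biomedical={biomed_tok.tokenize(term)} gpt2={gpt2_tok.tokenize(term)}")
# Expected pattern from the table above: a single token per term for the
# biomedical tokenizer, several subword pieces for the GPT-2 tokenizer.
```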
+ # Technical Specifications
+
+ ## Model Architecture and Objective
+
+ BioMedLM 2.7B is a standard GPT-2 implementation (trained with Flash Attention) with the following hyperparameters:
+
+ | Hyperparameter | Value |
+ | ----------- | ----- |
+ | hidden size | 2560 |
+ | heads | 20 |
+ | layers | 32 |
+ | vocab size | 28896 |
+ | sequence length | 1024 |
+
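As a sketch (not part of the committed model card), these hyperparameters map onto the `GPT2Config` fields recorded in this commit's config.json; building a model from that config gives a randomly initialized network of roughly this size:

```python
# Illustrative mapping of the hyperparameter table onto a transformers GPT2Config,
# using the values from this commit's config.json.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    n_embd=2560,       # hidden size
    n_head=20,         # attention heads
    n_layer=32,        # transformer layers
    n_positions=1024,  # sequence length
    vocab_size=28896,
    bos_token_id=28895,
    eos_token_id=28895,
    scale_attn_by_inverse_layer_idx=True,
)

# Instantiating the full model needs on the order of 10 GB of RAM in float32.
model = GPT2LMHeadModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")
```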
+ ## Compute Infrastructure
+
+ The model was trained on [MosaicML Cloud](https://www.mosaicml.com/cloud), a platform designed for large workloads like LLMs. Using the [Composer](https://github.com/mosaicml/composer) training library and [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html), it was easy to enable multi-node training across 128 A100-40GB GPUs, and the total run was completed in ~6.25 days.
config.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "activation_function": "gelu_new",
+   "architectures": [
+     "GPT2LMHeadModel"
+   ],
+   "attn_pdrop": 0.1,
+   "bos_token_id": 28895,
+   "embd_pdrop": 0.1,
+   "eos_token_id": 28895,
+   "initializer_range": 0.02,
+   "layer_norm_epsilon": 1e-05,
+   "model_type": "gpt2",
+   "n_ctx": 1024,
+   "n_embd": 2560,
+   "n_head": 20,
+   "n_inner": null,
+   "n_layer": 32,
+   "n_positions": 1024,
+   "reorder_and_upcast_attn": false,
+   "resid_pdrop": 0.1,
+   "scale_attn_by_inverse_layer_idx": true,
+   "scale_attn_weights": true,
+   "summary_activation": null,
+   "summary_first_dropout": 0.1,
+   "summary_proj_to_labels": true,
+   "summary_type": "cls_index",
+   "summary_use_proj": true,
+   "task_specific_params": {
+     "text-generation": {
+       "do_sample": true,
+       "max_length": 50
+     }
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.21.3",
+   "use_cache": false,
+   "vocab_size": 28896
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dede638b0b9011559bb1ece1a2d4fa19f13e84658e835f419ac80a7f4bed786c
+ size 10706653655
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"add_prefix_space": false, "model_max_length": 1024, "special_tokens_map_file": null, "name_or_path": "stanford-crfm/pubmed_gpt_tokenizer", "tokenizer_class": "GPT2Tokenizer", "unk_token": "<|endoftext|>", "bos_token": "<|endoftext|>", "eos_token": "<|endoftext|>"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff