mazesmazes committed
Commit f85864a · verified · 1 Parent(s): c5d4b32

Training in progress - step 500

Files changed (5):
  1. README.md +51 -185
  2. asr_config.py +2 -4
  3. asr_modeling.py +5 -4
  4. asr_pipeline.py +28 -0
  5. asr_processing.py +0 -2
README.md CHANGED
@@ -1,207 +1,73 @@
  ---
- base_model: Qwen/Qwen3-1.7B
- library_name: peft
- pipeline_tag: text-generation
+ license: mit
+ language:
+ - en
+ datasets:
+ - speechbrain/LoquaciousSet
+ base_model:
+ - openai/whisper-large-v3-turbo
+ - HuggingFaceTB/SmolLM3-3B
+ pipeline_tag: automatic-speech-recognition
  tags:
- - base_model:adapter:Qwen/Qwen3-1.7B
- - lora
- - transformers
+ - asr
+ - speech-recognition
+ - audio
+ - smollm
+ - whisper
+ - mlp
  ---

- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
-
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
-
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
+ # Tiny Audio
+
+ A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the [Tiny Audio](https://github.com/alexkroman/tiny-audio) codebase—a minimal, hackable framework for training ASR models.
+
+ ## Architecture
+
+ ```
+ Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
+ ```
+
+ **MLP Projector:**
+ - Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
+ - Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
+ - Output normalization: RMSNorm

  ## Training Details

- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
- ### Framework versions
-
- - PEFT 0.18.0
+ | | |
+ |---|---|
+ | **Dataset** | LoquaciousSet (25,000 hours) |
+ | **Hardware** | Single NVIDIA A40 40GB |
+ | **Training Time** | ~24 hours |
+ | **Cost** | ~$12 |
+ | **Trainable Parameters** | ~12M (projector only) |
+
+ ## Performance
+
+ **Word Error Rate (WER): 12.14%** on LoquaciousSet test set.
+
+ See the [community leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard) for comparisons.
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+
+ result = pipe("path/to/audio.wav")
+ print(result["text"])
+ ```
+
+ ## Limitations
+
+ - English only
+ - Optimized for 16kHz audio; other sample rates are resampled automatically
+ - Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
+ - Maximum audio length limited by context window
+
+ ## Learn More
+
+ - **[Train your own model](https://github.com/alexkroman/tiny-audio)** — The full codebase with training scripts
+ - **[Free 3-hour course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md)** — Build your own ASR system from scratch
+ - **[Submit to leaderboard](https://github.com/alexkroman/tiny-audio#leaderboard)** — Share your trained model
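
The Architecture section of the new README describes the only trained component: a projector that compresses the Whisper encoder output 4x in time and maps it into the LLM embedding space. The sketch below is illustrative only, not the repository's actual projector module; the class name, conv channel widths, and kernel sizes are assumptions (so the parameter count will not match the ~12M figure exactly), while the 1280/2048 dimensions and the Conv → Linear → GELU → Linear → RMSNorm layout follow the card.

```python
import torch
import torch.nn as nn


class MLPProjectorSketch(nn.Module):
    """Illustrative projector: 4x temporal downsampling, 2-layer MLP, RMSNorm output."""

    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 2048):
        super().__init__()
        # Two stride-2 1D convolutions give the 4x sequence compression described in the card.
        self.conv1 = nn.Conv1d(encoder_dim, encoder_dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(encoder_dim, encoder_dim, kernel_size=3, stride=2, padding=1)
        # Linear (1280 -> 2048) -> GELU -> Linear (2048 -> 2048), as listed in the card.
        self.fc1 = nn.Linear(encoder_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)
        self.norm = nn.RMSNorm(llm_dim)  # requires PyTorch >= 2.4

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, encoder_dim) hidden states from the frozen Whisper encoder.
        x = x.transpose(1, 2)            # -> (batch, encoder_dim, time) for Conv1d
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))      # time axis shrinks by ~4x
        x = x.transpose(1, 2)            # -> (batch, time/4, encoder_dim)
        x = self.fc2(self.act(self.fc1(x)))
        return self.norm(x)              # -> (batch, time/4, llm_dim), fed to the frozen LLM


# Shape check: 30 s of audio yields 1500 Whisper encoder frames -> ~375 projected frames.
frames = torch.randn(1, 1500, 1280)
print(MLPProjectorSketch()(frames).shape)  # torch.Size([1, 375, 2048])
```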
asr_config.py CHANGED
@@ -14,11 +14,10 @@ class ASRConfig(transformers.PretrainedConfig):
         attn_implementation: str = "flash_attention_2",
         model_dtype: str = "bfloat16",
         num_beams: Optional[int] = None,
-        system_prompt: str = "/no_think /system_override",
+        system_prompt: str = "You are a helpful assistant.",
         user_prompt: str = "Please transcribe this English audio into text: <audio>",
         encoder_dim: Optional[int] = None,
         llm_dim: Optional[int] = None,
-        encoder_stride: int = 2,  # Temporal downsampling factor of audio encoder (legacy, use encoder_conv_layers)
         # Encoder conv layers: list of (padding, kernel_size, stride) tuples
         # Default is Whisper/GLM-ASR structure: conv1(k=3,s=1,p=1) + conv2(k=3,s=2,p=1)
         encoder_conv_layers: Optional[list] = None,
@@ -52,7 +51,7 @@
         # Set default generation parameters (greedy decoding only)
         generation_defaults = {
             "num_beams": 1,
-            "max_new_tokens": 96,
+            "max_new_tokens": 256,
             "repetition_penalty": 1.0,
             "length_penalty": 1.0,
             "no_repeat_ngram_size": 0,
@@ -70,7 +69,6 @@
         self.user_prompt = user_prompt
         self.encoder_dim = encoder_dim
         self.llm_dim = llm_dim
-        self.encoder_stride = encoder_stride
         # Default conv layers for Whisper/GLM-ASR: [(pad, kernel, stride), ...]
         self.encoder_conv_layers = encoder_conv_layers or [(1, 3, 1), (1, 3, 2)]
         self.audio_sample_rate = audio_sample_rate
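
These config changes swap the Qwen-style `/no_think` system prompt for a generic one, drop the legacy `encoder_stride` argument in favor of `encoder_conv_layers`, and raise the default `max_new_tokens` from 96 to 256 so longer clips are less likely to be cut off mid-transcript. For reference, transformers pipelines also accept per-call generation overrides, so the baked-in defaults can usually be adjusted at inference time without editing the config; a minimal sketch follows (whether every option is honored depends on the repo's custom ASRPipeline):

```python
from transformers import pipeline

# Load the custom pipeline shipped with the repo (asr_pipeline.py) via trust_remote_code.
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)

# Per-call override of the generation defaults baked into ASRConfig
# (e.g. a tighter token budget for short clips). Illustrative only.
result = pipe("path/to/audio.wav", generate_kwargs={"max_new_tokens": 128})
print(result["text"])
```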
asr_modeling.py CHANGED
@@ -96,7 +96,6 @@ class ASRModel(PreTrainedModel, GenerationMixin):
         super().__init__(config)

         self.system_prompt = config.system_prompt
-        self.encoder_stride = config.encoder_stride
         target_dtype = getattr(torch, config.model_dtype)

         # Audio encoder (frozen)
@@ -121,7 +120,10 @@
         self.generation_config.length_penalty = config.length_penalty
         self.generation_config.repetition_penalty = config.repetition_penalty
         self.generation_config.no_repeat_ngram_size = config.no_repeat_ngram_size
-        self.generation_config.eos_token_id = self.tokenizer.convert_tokens_to_ids("<|im_end|>")
+        self.generation_config.eos_token_id = [
+            self.tokenizer.convert_tokens_to_ids("<|im_end|>"),
+            self.tokenizer.convert_tokens_to_ids("<|endoftext|>"),
+        ]
         self.generation_config.pad_token_id = self.tokenizer.pad_token_id

         # Feature extractor for audio preprocessing
@@ -145,7 +147,7 @@
         encoder_kwargs = {
             "attn_implementation": config.attn_implementation,
             "low_cpu_mem_usage": True,
-            "torch_dtype": dtype,
+            "dtype": dtype,
         }

         if "whisper" in config.audio_model_id.lower():
@@ -296,7 +298,6 @@
             feature_extractor=self.feature_extractor,
             tokenizer=self.tokenizer,
             projector=self.projector,
-            encoder_stride=self.encoder_stride,
             encoder_conv_layers=self.config.encoder_conv_layers,
         )
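
The modeling changes register two end-of-sequence tokens instead of one, and rename the `torch_dtype` kwarg to `dtype`, which appears to track the corresponding argument rename in recent transformers releases. `GenerationConfig.eos_token_id` accepts either a single id or a list, and decoding stops at whichever id is produced first. A small illustration, assuming the tokenizer defines both special tokens as the diff implies (the tokenizer id below is an assumption; the model resolves these ids from its own LLM tokenizer):

```python
from transformers import AutoTokenizer, GenerationConfig

tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

gen_cfg = GenerationConfig(
    eos_token_id=[
        tok.convert_tokens_to_ids("<|im_end|>"),     # chat-template end-of-turn
        tok.convert_tokens_to_ids("<|endoftext|>"),  # plain end-of-text
    ],
    pad_token_id=tok.pad_token_id,
)
# Passing gen_cfg (or configuring model.generation_config the same way) makes
# generate() treat either token as a stopping point.
```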
 
asr_pipeline.py CHANGED
@@ -476,4 +476,32 @@ class ASRPipeline(transformers.AutomaticSpeechRecognitionPipeline):
         text = self.tokenizer.decode(tokens, skip_special_tokens=True).strip()
         # Strip <think>...</think> tags (Qwen3 doesn't respect /no_think prompt)
         text = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
+        # Truncate if a word repeats more than 3 times consecutively
+        text = self._truncate_repetitions(text, max_repeats=3)
         return {"text": text}
+
+    def _truncate_repetitions(self, text: str, max_repeats: int = 3) -> str:
+        """Truncate text when a word repeats more than max_repeats times consecutively.
+
+        Args:
+            text: Input text to check for repetitions
+            max_repeats: Maximum allowed consecutive repetitions (default 3)
+
+        Returns:
+            Truncated text if repetition detected, otherwise original text
+        """
+        words = text.split()
+        if len(words) <= max_repeats:
+            return text
+
+        repeat_count = 1
+        for i in range(1, len(words)):
+            if words[i].lower() == words[i - 1].lower():
+                repeat_count += 1
+                if repeat_count > max_repeats:
+                    # Keep up to max_repeats of the repeated word
+                    return " ".join(words[:i])
+            else:
+                repeat_count = 1
+
+        return text
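
The new `_truncate_repetitions` guard cuts the transcript at the point where a word repeats more than `max_repeats` times in a row, which tames the runaway loops greedy decoding can fall into; note that it drops everything after the offending position, not just the extra copies. A standalone copy of the same logic for a quick behavioral check (illustrative, outside the pipeline class):

```python
# Same comparison logic as _truncate_repetitions in the diff, pulled out as a free function.
def truncate_repetitions(text: str, max_repeats: int = 3) -> str:
    words = text.split()
    if len(words) <= max_repeats:
        return text
    repeat_count = 1
    for i in range(1, len(words)):
        if words[i].lower() == words[i - 1].lower():
            repeat_count += 1
            if repeat_count > max_repeats:
                return " ".join(words[:i])
        else:
            repeat_count = 1
    return text


print(truncate_repetitions("the cat sat sat sat sat on the mat"))
# -> "the cat sat sat sat"  (everything after the third consecutive repeat is dropped)
print(truncate_repetitions("no repeats here at all"))
# -> unchanged
```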
asr_processing.py CHANGED
@@ -26,14 +26,12 @@ class ASRProcessor(ProcessorMixin):
         feature_extractor,
         tokenizer,
         projector=None,
-        encoder_stride: int = 2,
         encoder_conv_layers: Optional[list] = None,
     ):
         self.feature_extractor = feature_extractor
         self.tokenizer = tokenizer
         self.audio_token_id = tokenizer.convert_tokens_to_ids(self.AUDIO_TOKEN)
         self.projector = projector
-        self.encoder_stride = encoder_stride  # Legacy, kept for compatibility
         self.encoder_conv_layers = encoder_conv_layers or self.DEFAULT_ENCODER_CONV_LAYERS

     def _compute_encoder_output_length(self, mel_length: int) -> int:
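
With `encoder_stride` removed, the processor derives the encoder's output length from the `(padding, kernel_size, stride)` tuples in `encoder_conv_layers` alone. The standard Conv1d length formula makes the intent concrete; this is a sketch of the calculation, not the repository's exact `_compute_encoder_output_length` implementation:

```python
# Illustrative: derive the encoder output length from conv layer parameters
# instead of a single fixed stride (dilation assumed to be 1).
def conv_output_length(mel_length: int, conv_layers) -> int:
    length = mel_length
    for padding, kernel_size, stride in conv_layers:
        # L_out = floor((L_in + 2*pad - kernel) / stride) + 1
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


# Default Whisper/GLM-ASR structure from the diff: conv1(k=3, s=1, p=1) + conv2(k=3, s=2, p=1).
print(conv_output_length(3000, [(1, 3, 1), (1, 3, 2)]))  # 3000 mel frames -> 1500 encoder frames
```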