Update README.md
README.md
@@ -2664,6 +2664,12 @@ embeddings = model.encode(['How is the weather today?', 'What is the current wea
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
 
+If you only want to handle shorter sequences, such as 2k tokens, pass the `max_length` parameter to the `encode` function:
+
+```python
+embeddings = model.encode(['Very long ... document'], max_length=2048)
+```
+
 For long sequences, it's recommended to perform inference using Flash Attention. Using Flash Attention allows you to increase the batch size and throughput for long sequence lengths.
 We include an experimental implementation for Flash Attention, shipped with the model.
 Install the following triton version:
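
For context, the snippet below sketches how the added lines fit together end to end. It is a hedged example, not part of the diff: the checkpoint name and the `trust_remote_code=True` loading path are assumptions, and only `encode`, its `max_length` parameter, and the cosine-similarity check come from the README text above.

```python
# Minimal end-to-end sketch of the usage described in the diff. Assumptions:
# - the checkpoint name below is a placeholder for whichever model ships this README;
# - trust_remote_code=True is assumed, since encode() and the experimental
#   Flash Attention implementation are shipped with the model;
# - encode() returning numpy arrays is assumed, matching the cos_sim example above.
from numpy.linalg import norm
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en",  # placeholder checkpoint name
    trust_remote_code=True,
)

# Cap inputs at 2k tokens instead of the model's full context window.
embeddings = model.encode(
    ["Very long ... document", "How is the weather today?"],
    max_length=2048,
)

# Cosine similarity between the two embeddings, as in the snippet above.
print(embeddings[0] @ embeddings[1] / (norm(embeddings[0]) * norm(embeddings[1])))
```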