Update README.md
README.md
---
license: apache-2.0
language:
- ko
pipeline_tag: text-classification
---

# formal_classifier
formal classifier or honorific classifier

## Korean honorific (존댓말) / casual (반말) speech classifier

Some time ago, I introduced a simple method for classifying Korean sentences as honorific (존댓말) or casual (반말) speech using a Korean morphological analyzer.<br>
However, when I tried to apply that method in practice, it produced errors in many cases.

For example:
```bash
저번에 교수님께서 자료 가져오라했는데 기억나?
```
In this sentence ("The professor told us to bring the materials last time, remember?"), the honorific particle "께서" often caused the whole sentence to be misjudged as honorific speech.<br>
So this time I built a deep learning model, and I would like to share the process.

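To make the failure mode concrete, here is a minimal sketch of a marker-based rule. It is not the morphological-analyzer method mentioned above: `HONORIFIC_MARKERS`, `naive_is_honorific`, and the marker list itself are hypothetical, chosen only to reproduce the kind of error described.

```python
# Hypothetical marker list: honorific particles/endings a simple rule might look for
HONORIFIC_MARKERS = ("께서", "세요", "습니다", "입니다")


def naive_is_honorific(sentence):
    """Flag a sentence as honorific if any marker appears anywhere in it."""
    return any(marker in sentence for marker in HONORIFIC_MARKERS)


casual = "저번에 교수님께서 자료 가져오라했는데 기억나?"
polite = "저번에 교수님께서 자료 가져오라하셨는데 기억나세요?"

# Both sentences contain "께서", so the rule cannot tell them apart
print(naive_is_honorific(casual))  # True, even though the sentence ends in casual speech
print(naive_is_honorific(polite))  # True
```

The particle "께서" honors the person being talked about (the professor), not the listener, which is why a rule that only looks for markers misfires here.
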
#### If you just want to use the model right away, the code below works as-is.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("j5ng/kcbert-formal-classifier")
tokenizer = AutoTokenizer.from_pretrained("j5ng/kcbert-formal-classifier")

formal_classifier = pipeline(task="text-classification", model=model, tokenizer=tokenizer)
print(formal_classifier("저번에 교수님께서 자료 가져오라했는데 기억나?"))
# [{'label': 'LABEL_0', 'score': 0.9999139308929443}]
```

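The pipeline returns generic `LABEL_0` / `LABEL_1` names. Going by the label tables in this README (0 for casual sentences, 1 for honorific ones), a small helper can turn that output into readable names. This is only a sketch: `LABEL_NAMES` and `to_readable` are hypothetical names, not part of the model or of `transformers`, and the mapping is an assumption inferred from the examples in this README.

```python
# Hypothetical mapping, inferred from this README's examples:
# label 0 = casual speech (반말), label 1 = honorific speech (존댓말).
LABEL_NAMES = {"LABEL_0": "casual", "LABEL_1": "honorific"}


def to_readable(results):
    """Convert text-classification pipeline output into (name, score) pairs."""
    return [(LABEL_NAMES.get(r["label"], r["label"]), round(r["score"], 4)) for r in results]


# Output format copied from the quick-start example above
sample = [{"label": "LABEL_0", "score": 0.9999139308929443}]
print(to_readable(sample))
```

Unknown label names are passed through unchanged, so the helper does not hide a mismatch if the model's labels differ from the assumed ones.
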
***

### Dataset sources

#### Smilegate speech-style dataset (korean SmileStyle Dataset)
: https://github.com/smilegate-ai/korean_smile_style_dataset

#### AI Hub emotional dialogue corpus
: https://www.aihub.or.kr/

#### Dataset download (the AI Hub data can only be downloaded manually)
```bash
wget https://raw.githubusercontent.com/smilegate-ai/korean_smile_style_dataset/main/smilestyle_dataset.tsv
```

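Once the TSV is downloaded, it has to be turned into labeled sentences. The sketch below shows one way this could look using only the standard library. The column names `formal` and `informal`, the 1/0 labeling, and the helper `build_pairs` are all assumptions for illustration; check the header of the real file, and note that this is not the code in `get_train_data.py`.

```python
import csv
import io


def build_pairs(tsv_file, formal_col="formal", informal_col="informal"):
    """Turn a style-parallel TSV into (sentence, label) pairs.

    Assumption: `formal_col` holds honorific sentences (label 1) and
    `informal_col` holds casual ones (label 0). Empty cells are skipped.
    """
    pairs = []
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        for col, label in ((formal_col, 1), (informal_col, 0)):
            sentence = (row.get(col) or "").strip()
            if sentence:
                pairs.append((sentence, label))
    return pairs


# Tiny in-memory stand-in for the downloaded file (made-up layout)
sample_tsv = io.StringIO(
    "formal\tinformal\n"
    "기억나세요?\t기억나?\n"
    "\t나도 스시 좋아함\n"
)
print(build_pairs(sample_tsv))
```

For the real file, the same function could be called on `open("smilestyle_dataset.tsv", encoding="utf-8", newline="")`.
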
### Development environment
```bash
Python3.9
```

```bash
torch==1.13.1
transformers==4.26.0
pandas==1.5.3
emoji==2.2.0
soynlp==0.0.493
datasets==2.10.1
```

#### Model used
beomi/kcbert-base
- GitHub : https://github.com/Beomi/KcBERT
- HuggingFace : https://huggingface.co/beomi/kcbert-base

***

## Data
```bash
get_train_data.py
```

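### Examples
Label `0` marks casual speech (반말) and label `1` marks honorific speech (존댓말).

|sentence|English gloss|label|
|------|------|---|
|공부를 열심히 해도 열심히 한 만큼 성적이 잘 나오지 않아|Even when I study hard, my grades don't reflect the effort|0|
|아들에게 보내는 문자를 통해 관계가 회복되길 바랄게요|I hope the relationship recovers through the text message you send to your son|1|
|참 열심히 사신 보람이 있으시네요|Living so diligently has really paid off for you|1|
|나도 스시 좋아함 이번 달부터 영국 갈 듯|I like sushi too; looks like I'll be going to the UK starting this month|0|
|본부장님이 내가 할 수 없는 업무를 계속 주셔서 힘들어|It's tough because the division head keeps giving me work I can't do|0|
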
### Distribution
|label|train|test|
|------|---|---|
|0|133,430|34,908|
|1|112,828|29,839|

***

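The counts in the distribution table can be turned into class shares and a train/test ratio with a few lines of arithmetic. The snippet below is just a worked calculation on those four numbers; the variable names are illustrative.

```python
# Sentence counts copied from the distribution table above
counts = {
    "train": {0: 133_430, 1: 112_828},
    "test": {0: 34_908, 1: 29_839},
}

totals = {split: sum(by_label.values()) for split, by_label in counts.items()}
grand_total = sum(totals.values())

for split, by_label in counts.items():
    casual_share = by_label[0] / totals[split]
    print(f"{split}: {totals[split]:,} sentences, label 0 share = {casual_share:.1%}")

print(f"train fraction of all data = {totals['train'] / grand_total:.1%}")
```

Both splits come out at roughly 54% label 0, and the training split holds about 79% of all sentences, so the classes are only mildly imbalanced.
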
## Training (train)
```bash
python3 modeling/train.py
```

***

## Prediction (inference)
```bash
python3 inference.py
```

```python
# Excerpt: methods of the classifier class. `self.predict(text)` is assumed to
# return class probabilities shaped like [[p_label_0, p_label_1]].
def formal_percentage(self, text):
    # Probability of label 1 (honorific speech), rounded to two decimals
    return round(float(self.predict(text)[0][1]), 2)

def print_message(self, text):
    result = self.formal_percentage(text)
    if result > 0.5:
        # "존댓말입니다" = "is honorific speech"; "확률" = "probability"
        print(f'{text} : 존댓말입니다. ( 확률 {result*100}% )')
    else:
        # "반말입니다" = "is casual speech"
        print(f'{text} : 반말입니다. ( 확률 {((1 - result)*100)}% )')
```

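`formal_percentage` reads index 1 of the first row returned by `self.predict`, which suggests that `predict` yields per-class probabilities. One common way to obtain those from a classifier's raw scores (logits) is a softmax. The sketch below shows that step in plain Python; `softmax` and the example logits are illustrative assumptions, not code taken from `inference.py`.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for [label 0 (casual), label 1 (honorific)]
probs = softmax([-2.0, 3.0])
formal_probability = round(probs[1], 2)
print(formal_probability)
```

With two classes, the probability of label 0 is simply one minus the probability of label 1, which is why the casual branch can report `1 - result`.
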
Result
```
저번에 교수님께서 자료 가져오라하셨는데 기억나세요? : 존댓말입니다. ( 확률 99.19% )
저번에 교수님께서 자료 가져오라했는데 기억나? : 반말입니다. ( 확률 92.86% )
```
The first sentence, which uses honorific endings, is classified as honorific speech (존댓말) with a probability of 99.19%. The second, its casual counterpart, is classified as casual speech (반말) with a probability of 92.86%.

***

## Citation
```bash
@misc{SmilegateAI2022KoreanSmileStyleDataset,
  title         = {SmileStyle: Parallel Style-variant Corpus for Korean Multi-turn Chat Text Dataset},
  author        = {Seonghyun Kim},
  year          = {2022},
  howpublished  = {\url{https://github.com/smilegate-ai/korean_smile_style_dataset}},
}
```

```bash
@inproceedings{lee2020kcbert,
  title={KcBERT: Korean Comments BERT},
  author={Lee, Junbum},
  booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
  pages={437--440},
  year={2020}
}
```