Counterfactual Augmentation for Robust Authorship Representation Learning

SIGIR License

ERLAS is the official repository for the paper "Counterfactual Augmentation for Robust Authorship Representation Learning". The framework generates style-counterfactual examples by retrieving, for each text, the most content-similar texts written by different authors on the same topics or domains.
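The retrieval idea can be illustrated with a minimal sketch. This is not the paper's implementation: the toy content vectors, the cosine scoring, and the function name `retrieve_counterfactuals` are all assumptions made for illustration. For each text it picks the most content-similar text written by a different author.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_counterfactuals(content_vecs, authors):
    """Hypothetical helper: for each text, return the index of the most
    content-similar text by a different author, or None if there is none."""
    result = []
    for vec, author in zip(content_vecs, authors):
        best_idx, best_sim = None, -math.inf
        for j, (other_vec, other_author) in enumerate(zip(content_vecs, authors)):
            if other_author == author:
                continue  # same author (or the text itself): not a style counterfactual
            sim = cosine(vec, other_vec)
            if sim > best_sim:
                best_idx, best_sim = j, sim
        result.append(best_idx)
    return result


# Toy content embeddings (assumed): texts 0 and 1 share a topic, as do texts 2 and 3.
content_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
authors = ["a", "b", "a", "b"]
pairs = retrieve_counterfactuals(content_vecs, authors)
```

Each retrieved pair shares content but differs in author, which is what makes it usable as a style-counterfactual example.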

Installation:

git clone https://github.com/hieum98/Counterfactual-Augmentation-for-Robust-Authorship-Representation-Learning.git
cd Counterfactual-Augmentation-for-Robust-Authorship-Representation-Learning
pip install -r requirements.txt
pip install -e .

Usage:

from ERLAS.model.erlas import ERLAS
from transformers import AutoTokenizer

model = ERLAS.from_pretrained('Hieuman/erlas')
tokenizer = AutoTokenizer.from_pretrained('Hieuman/erlas')

batch_size = 3
episode_length = 16
text = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]
text = [j for i in text for j in i]
tokenized_text = tokenizer(
    text, 
    max_length=32,
    padding="max_length", 
    truncation=True,
    return_tensors="pt"
)
# input shape after reshape: (batch_size, 1, episode_length, max_token_length)
tokenized_text["input_ids"] = tokenized_text["input_ids"].reshape(batch_size, 1, episode_length, -1)
tokenized_text["attention_mask"] = tokenized_text["attention_mask"].reshape(batch_size, 1, episode_length, -1)

author_reps, _ = model(tokenized_text['input_ids'], tokenized_text['attention_mask'])

author_reps = author_reps.squeeze(1) # [bs, hidden_size]
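The reshape in the usage snippet relies on the flattened text list keeping each author's texts contiguous, because a tensor reshape is row-major. A small plain-Python sketch of that layout follows; the helper `locate` and the variable `regrouped` are hypothetical names used only for illustration.

```python
batch_size = 3
episode_length = 16

# One episode (a list of texts) per author, as in the usage snippet.
episodes = [
    ["Foo"] * episode_length,
    ["Bar"] * episode_length,
    ["Zoo"] * episode_length,
]

# Flatten author by author, so each author's texts stay contiguous.
flat = [text for episode in episodes for text in episode]


def locate(flat_index, episode_length):
    """Map an index in the flat list to (author_index, position_in_episode)."""
    return divmod(flat_index, episode_length)


# Regroup the flat list the same way a row-major reshape would.
regrouped = [
    flat[a * episode_length:(a + 1) * episode_length]
    for a in range(batch_size)
]
```

If the texts were interleaved across authors before tokenization, the same reshape would silently mix authors within an episode, so the flattening order matters.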

Citation:

@inproceedings{10.1145/3626772.3657956,
author = {Man, Hieu and Huu Nguyen, Thien},
title = {Counterfactual Augmentation for Robust Authorship Representation Learning},
year = {2024},
isbn = {9798400704314},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3626772.3657956},
doi = {10.1145/3626772.3657956},
pages = {2347–2351},
numpages = {5},
keywords = {authorship attribution, counterfactual learning, domain generalization},
location = {Washington DC, USA},
series = {SIGIR '24}
}

This model has been pushed to the Hub using the PyTorchModelHubMixin integration.
