# Python Multilingual BibTeX Generator

This Python script generates `multilingual_papers.bib` by filtering the original `anthology+abstracts.bib` file for multilingual NLP research papers.

## Features

- **Identical Logic**: Uses the same filtering logic as the JavaScript web application
- **Comprehensive Detection**: Detects multilingual papers using keywords and language names
- **LaTeX Cleaning**: Properly handles LaTeX commands and formatting
- **Statistics**: Provides detailed statistics about the filtering process
- **Safe Operation**: Checks for existing files and asks for confirmation before overwriting

## Requirements

- Python 3.6 or higher
- No external dependencies (uses only standard library)

## Usage

1. **Place your files**: Ensure `anthology+abstracts.bib` is in the same directory as the script
2. **Run the script**:
   ```bash
   python generate_multilingual_bib.py
   ```
3. **Follow prompts**: The script will ask for confirmation if `multilingual_papers.bib` already exists

## Output

The script will:

- Generate `multilingual_papers.bib` containing only multilingual papers
- Display statistics about the filtering process
- Show the top 10 most common keywords found

## Example Output

```
Reading anthology+abstracts.bib...
Parsing BibTeX entries...
Found 50000 total papers
Found 2500 multilingual papers
Generating BibTeX content...
Writing to multilingual_papers.bib...
Successfully generated multilingual_papers.bib with 2500 papers!

Statistics:
Total papers processed: 50000
Multilingual papers found: 2500
Percentage multilingual: 5.0%

Top 10 keywords found:
multilingual: 1200 papers
chinese: 800 papers
crosslingual: 600 papers
hindi: 400 papers
low-resource: 350 papers
korean: 300 papers
arabic: 250 papers
japanese: 200 papers
spanish: 180 papers
french: 150 papers
```

## Filtering Criteria

The script uses the same criteria as the web application:

### Multilingual Keywords

- multilingual, crosslingual, multi-lingual, cross-lingual
- low-resource language, low resource language
- low-resource, low resource

### Language Names

- 100+ language names including: Hindi, Chinese, Korean, Arabic, Spanish, French, German, Japanese, etc.
- Regional language variations and dialects

## Customization

You can modify the filtering criteria by editing the constants at the top of the script:

```python
MULTILINGUAL_KEYWORDS = [
    'multilingual', 'crosslingual', 'multi lingual',
    # Add your custom keywords here
]

LANGUAGE_NAMES = [
    'afrikaans', 'albanian', 'amharic', 'arabic',
    # Add more language names here
]
```

## Error Handling

The script includes robust error handling:

- Checks for input file existence
- Handles malformed BibTeX entries gracefully
- Provides clear error messages
- Asks for confirmation before overwriting existing files

## Performance

- Efficient regex-based parsing
- Memory-efficient processing for large files
- Fast keyword matching using set operations

## Troubleshooting

### File Not Found

```
Error: anthology+abstracts.bib not found in current directory.
```

**Solution**: Ensure the input file is in the same directory as the script.

### No Papers Found

```
No multilingual papers found. Check your keywords and language lists.
```

**Solution**: Verify your BibTeX file contains papers with multilingual content, or adjust the keyword lists.

### Encoding Issues

If you encounter encoding errors, the script uses UTF-8 encoding. Ensure your BibTeX file is properly encoded.
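
If you are unsure whether the input file is valid UTF-8, a quick standalone check can help locate the problem. The sketch below is not part of the script; the function names `first_invalid_utf8_offset` and `check_file` are hypothetical and exist only for illustration. It reports the byte offset of the first sequence that fails to decode.

```python
from pathlib import Path
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration; not taken from the script.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None


def check_file(path: str) -> Optional[int]:
    """Run the check on a file, reading it as raw bytes."""
    return first_invalid_utf8_offset(Path(path).read_bytes())
```

Calling `check_file("anthology+abstracts.bib")` returns `None` for a clean file. A numeric result points to where the file should be inspected or re-encoded.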

## Comparison with JavaScript Version

This Python script produces identical results to the JavaScript web application:

- Same filtering logic
- Same LaTeX cleaning
- Same BibTeX output format
- Same keyword detection

The main advantage is that it can be run independently without a web browser and provides detailed statistics about the filtering process.
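
As a rough illustration of the kind of filtering both versions perform, the sketch below cleans LaTeX markup and then looks for keywords. It is a minimal sketch under stated assumptions, not the script's actual code: the function names `clean_latex` and `find_keywords` are hypothetical, the lists are shortened, and matching on title plus abstract with word boundaries is an assumption.

```python
import re

# Shortened, illustrative lists; the real script defines longer ones.
MULTILINGUAL_KEYWORDS = [
    "multilingual", "crosslingual", "multi-lingual", "cross-lingual",
    "low-resource", "low resource",
]
LANGUAGE_NAMES = ["arabic", "chinese", "hindi", "korean"]


def clean_latex(text: str) -> str:
    """Strip common LaTeX markup so keywords can be matched on plain text."""
    # Unwrap one-argument commands, e.g. \textit{Hindi} -> Hindi.
    text = re.sub(r"\\[a-zA-Z]+\{([^{}]*)\}", r"\1", text)
    # Drop any remaining alphabetic backslash commands, e.g. \em.
    text = re.sub(r"\\[a-zA-Z]+", "", text)
    # Remove case-protecting braces such as {C}hinese.
    text = text.replace("{", "").replace("}", "")
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


def find_keywords(title: str, abstract: str = "") -> list:
    """Return the terms found in the cleaned, lowercased title and abstract."""
    text = clean_latex(f"{title} {abstract}").lower()
    found = []
    for term in MULTILINGUAL_KEYWORDS + LANGUAGE_NAMES:
        # Assumption: whole-word matching, so "hindi" does not match inside other words.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.append(term)
    return found
```

Under these assumptions, a paper counts as multilingual when `find_keywords` returns a non-empty list. The per-term results are also what a keyword tally like the one in the statistics output could be built from.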