---
license: apache-2.0
language:
- ca
- es
- en
base_model: BSC-LT/salamandra-7b-instruct-tools
library_name: transformers
pipeline_tag: text-generation
tags:
- query-parsing
- semantic-search
- structured-output
- json-generation
- multilingual
- catalan
- spanish
- LoRA
- fine-tuned
- AINA
- R&D
datasets:
- SIRIS-Lab/impuls-query-parsing
metrics:
- accuracy
model-index:
- name: IMPULS-Salamandra-7B-Query-Parser
  results:
  - task:
      type: text-generation
      name: Query Parsing
    metrics:
    - name: JSON Validity
      type: accuracy
      value: 1.0
    - name: Strict Accuracy
      type: accuracy
      value: 0.51
    - name: Relaxed Accuracy
      type: accuracy
      value: 0.65
    - name: Language Match
      type: accuracy
      value: 0.87
---

# IMPULS-Salamandra-7B-Query-Parser

A fine-tuned version of [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools) for converting natural language queries into structured JSON for R&D project semantic search.

## Model Description

This model was developed as part of the **IMPULS project** (AINA Challenge 2024), a collaboration between [SIRIS Academic](https://sirisacademic.com/) and [Generalitat de Catalunya](https://web.gencat.cat/) to build a multilingual semantic search system for Catalonia's R&D ecosystem (RIS3-MCAT platform).

The model converts natural language queries in **Catalan, Spanish, and English** into structured JSON containing:

- **Semantic query**: Core thematic content for vector search
- **Filters**: Structured metadata (funding programme, year range, location, organization type)
- **Query rewrite**: Human-readable interpretation of the query
- **Metadata**: Language detection and processing notes

### Example

**Input (Catalan):**

```
projectes d'IA en salut finançats per H2020 des de 2020
```

**Output:**

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "Horizon 2020",
    "year": ">=2020"
  },
  "organisations": [],
  "semantic_query": "intel·ligència artificial salut",
  "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
  "meta": {
    "lang": "CA"
  }
}
```

## Training Details

### Base Model

- **Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **Architecture**: LlamaForCausalLM (7B parameters)

### Fine-tuning Method

- **Technique**: LoRA (Low-Rank Adaptation)
- **Trainable parameters**: ~1% of total (~50MB adapter)

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 16 |
| Alpha | 32 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |

### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 16 (effective) |
| Learning rate | 2e-4 |
| Sequence length | 2048 |
| Precision | FP16 (mixed) |
| Optimizer | AdamW |
| LR scheduler | Cosine |
| Warmup ratio | 0.1 |

### Training Data

- **Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Training split**: 682 multilingual queries (synthetic, template-generated)
- **Language distribution**: ~33% Catalan, ~33% Spanish, ~33% English
- **Query types**: Discover (88%), Quantify (12%)

### Evaluation Data

- **Test split**: 100 real queries from domain experts (SIRIS Academic)
- **Annotation**: Manual gold-standard JSON for each query

## Evaluation Results

### Overall Performance

| Metric | Base Model | Fine-tuned |
|--------|------------|------------|
| JSON Validity | 100% | **100%** |
| Strict Accuracy | 15% | **51%** |
| Relaxed Accuracy | 29% | **65%** |
| Language Match | 53% | **87%** |
| Semantic Query Accuracy | 44% | **86%** |

### Component-level Accuracy

| Component | Accuracy |
|-----------|----------|
| Programme (H2020, FEDER, etc.) | 96% |
| Year extraction | 98% |
| Location | 91% |
| Organizations | 77% |
| Semantic Query | 86% |

### Performance by Language

| Language | Relaxed Accuracy |
|----------|------------------|
| English | 72% |
| Catalan | 64% |
| Spanish | 52% |

### Comparison with Other Models

| Model | Strict Accuracy | Relaxed Accuracy | JSON Valid |
|-------|-----------------|------------------|------------|
| **Salamandra-7B (ours)** | **51%** | **65%** | 100% |
| Qwen 2.5-7B | 47% | 65% | 100% |
| Mistral-7B | 24% | 55% | 100% |

## Usage

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SIRIS-Lab/impuls-salamandra-7b-query-parser"

# Load the tokenizer and the model in FP16, placing layers on available devices
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

# System prompt (simplified version)
system_prompt = """Convert natural language queries into structured JSON for R&D project search.
Output only valid JSON with the required schema."""

query = "projectes d'hidrogen finançats per H2020 des de 2020"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query}
]

# Render the chat template to text, then tokenize it
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.1,
        do_sample=True
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

### With 4-bit Quantization (Recommended for limited VRAM)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 4-bit weights with FP16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "SIRIS-Lab/impuls-salamandra-7b-query-parser",
    quantization_config=quantization_config,
    device_map="auto"
)
# Reduces memory from ~14GB to ~3.5GB
```

## Output Schema

```json
{
  "doc_type": "projects",
  "filters": {
    "programme": "string | null",
    "funding_level": "string | null",
    "year": "string | null",
    "location": "string | null",
    "location_level": "region | province | country | null"
  },
  "organisations": [
    {
      "type": "university | research_center | hospital | company | null",
      "name": "string | null",
      "location": "string | null",
      "location_level": "string | null"
    }
  ],
  "semantic_query": "string | null",
  "query_rewrite": "string",
  "meta": {
    "lang": "CA | ES | EN",
    "notes": "string | null"
  }
}
```

## Hardware Requirements

| Configuration | VRAM Required |
|---------------|---------------|
| FP16 (half precision) | ~14 GB |
| 4-bit quantization | ~3.5 GB |

**Recommended**: GPU with 24 GB+ VRAM (e.g., A100) for FP16, or 4-bit quantization on consumer GPUs.
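
The schema in the Output Schema section lends itself to a lightweight check on each generated response before it is passed to the search backend. The following is a minimal sketch, not part of the released code: `extract_json` and `validate_parsed_query` are hypothetical helper names, and treating all six top-level keys as required is an assumption. Because the example output earlier in this card omits optional filter fields, absent fields are treated as null.

```python
import json

# Allowed values copied from the Output Schema section; None stands for JSON null.
LOCATION_LEVELS = ("region", "province", "country", None)
ORG_TYPES = ("university", "research_center", "hospital", "company", None)
LANGS = ("CA", "ES", "EN")
# Assumption: this sketch treats every top-level key as mandatory.
REQUIRED_KEYS = (
    "doc_type", "filters", "organisations",
    "semantic_query", "query_rewrite", "meta",
)


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in a model response."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])


def validate_parsed_query(parsed: dict) -> list[str]:
    """Return a list of schema problems; an empty list means no problem was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in parsed]

    filters = parsed.get("filters")
    if isinstance(filters, dict) and filters.get("location_level") not in LOCATION_LEVELS:
        problems.append(f"invalid filters.location_level: {filters.get('location_level')!r}")

    organisations = parsed.get("organisations")
    if isinstance(organisations, list):
        for i, org in enumerate(organisations):
            if not isinstance(org, dict) or org.get("type") not in ORG_TYPES:
                problems.append(f"invalid organisations[{i}]")

    if "query_rewrite" in parsed and not isinstance(parsed["query_rewrite"], str):
        problems.append("query_rewrite must be a string")

    meta = parsed.get("meta")
    if isinstance(meta, dict) and meta.get("lang") not in LANGS:
        problems.append(f"invalid meta.lang: {meta.get('lang')!r}")

    return problems


# Usage with the example output shown earlier in this card
response = """{"doc_type": "projects",
 "filters": {"programme": "Horizon 2020", "year": ">=2020"},
 "organisations": [],
 "semantic_query": "intel·ligència artificial salut",
 "query_rewrite": "Projectes sobre IA en salut del programa H2020 des de 2020",
 "meta": {"lang": "CA"}}"""
print(validate_parsed_query(extract_json(response)))  # → []
```

A check like this only confirms structure and enumerated values. It says nothing about whether the extracted filters match the intent of the query, which is what the strict and relaxed accuracy metrics measure.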

## Limitations

- **Domain-specific**: Optimized for R&D project search queries; may not generalize well to other domains
- **Schema-bound**: Outputs follow a fixed JSON schema; cannot handle arbitrary structured formats
- **Language coverage**: Relaxed accuracy is highest on English (72%), followed by Catalan (64%); Spanish is lower (52%)
- **Complex queries**: Struggles with queries requiring numerical aggregation or ranking operations

## Intended Use

This model is designed for:

- R&D project discovery platforms (RIS3CAT, Horizon Europe portals)
- Scientific literature search systems
- Multilingual semantic search applications
- Query understanding in Catalan, Spanish, and English

## Ethical Considerations

- The training split consists of synthetic, template-generated queries; the test split consists of real queries from domain experts
- No personal or sensitive data was used in training
- The model is intended for search query parsing, not for open-ended text generation

## Citation

If you use this model, please cite:

```bibtex
@misc{impuls-salamandra-2024,
  author = {SIRIS Academic},
  title = {IMPULS-Salamandra-7B-Query-Parser: Multilingual Query Parsing for R&D Semantic Search},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/SIRIS-Lab/impuls-salamandra-7b-query-parser}}
}
```

## Acknowledgments

- **[Barcelona Supercomputing Center (BSC)](https://www.bsc.es/)** - For the Salamandra base model and AINA infrastructure
- **[Generalitat de Catalunya](https://web.gencat.cat/)** - For funding and the RIS3-MCAT platform
- **[AINA Project](https://projecteaina.cat/)** - For the AINA Challenge 2024 framework

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0), consistent with the base Salamandra model.

## Links

- **Training Dataset**: [SIRIS-Lab/impuls-query-parsing](https://huggingface.co/datasets/SIRIS-Lab/impuls-query-parsing)
- **Project Repository**: [github.com/sirisacademic/aina-impulse](https://github.com/sirisacademic/aina-impulse)
- **Base Model**: [BSC-LT/salamandra-7b-instruct-tools](https://huggingface.co/BSC-LT/salamandra-7b-instruct-tools)
- **AINA Project**: [projecteaina.cat](https://projecteaina.cat/)
- **SIRIS Academic**: [sirisacademic.com](https://sirisacademic.com/)