mtyrrell committed
Commit bc92a1b · 0 Parent(s)

Fresh start with LFS for images
.dockerignore ADDED
@@ -0,0 +1,50 @@
+ # Git files
+ .git
+ .gitignore
+
+ # Python cache
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ *.so
+ *.egg
+ *.egg-info
+ dist
+ build
+
+ # Environment files
+ .env
+ *.env
+
+ # IDE files
+ .vscode
+ .idea
+ *.swp
+ *.swo
+ *~
+
+ # OS files
+ .DS_Store
+ Thumbs.db
+
+ # Testing and sandbox
+ testing/
+ sandbox/
+
+ # Logs (will be created at runtime)
+ logs/
+ *.log
+
+ # CSV files (test data)
+ *.csv
+
+ # Large data files
+ org_renamed.csv
+ rename.csv
+ test.csv
+
+ # Documentation
+ README.md
+ CLAUDE.md
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ .env
+ .DS_Store
+ *.csv
+ *.xlsx
+ /testing/
+ /modules/__pycache__/
+ app.log
+ /sandbox/
CLAUDE.md ADDED
@@ -0,0 +1,121 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Overview
+
+ This is a **Streamlit-based web application** for pre-filtering and analyzing grant applications using machine learning models. The app processes Excel files containing application data, runs them through multiple fine-tuned LLMs for classification, and generates scored outputs with recommended filtering actions.
+
+ The application is deployed as a **Hugging Face Space** (ref. README.md metadata).
+
+ ## Running the Application
+
+ ### Local Development
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the Streamlit app
+ streamlit run app.py
+ ```
+
+ ### Environment Variables
+ Required environment variables (stored in `.env` for local development):
+ - `HF_TOKEN` - Hugging Face token for authentication and model access
+ - `<USERNAME>_HASH` - Bcrypt password hash for each user account (e.g., `USER1_HASH`)
+ - `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_VERSION`, `AZURE_OPENAI_API_KEY` - Credentials used for the Azure OpenAI client in app.py
+
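+ A password hash for a user can be generated with bcrypt and stored in `.env`. A minimal sketch (the `USER1` prefix is just an example login name):
+
+ ```python
+ import bcrypt
+
+ # Generate a salted hash and store the printed value as USER1_HASH
+ hashed = bcrypt.hashpw("my-secret-password".encode(), bcrypt.gensalt())
+ print(hashed.decode())
+ ```
+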
+ ### CUDA/GPU Support
+ The application checks for CUDA availability on startup (app.py:1-13). It automatically uses a GPU if one is available and otherwise falls back to MPS (on Apple Silicon) or CPU.
+
+ ## Architecture
+
+ ### Application Flow
+ 1. **Authentication** (modules/auth.py) - Users log in with credentials validated against bcrypt hashes from environment variables
+ 2. **File Upload** - Users upload an Excel file matching the template structure
+ 3. **Data Processing Pipeline** (modules/pipeline.py) - Core processing logic:
+    - Validates required columns
+    - Standardizes organization names and counts concepts per organization (modules/org_count.py)
+    - Cleans text fields (modules/utils.py)
+    - Runs inference through 10 classification models sequentially, plus a semantic-similarity duplicate check (modules/semantic_similarity.py)
+    - Calculates scores and generates filtering recommendations
+ 4. **Output Generation** - Produces a downloadable Excel file with analysis results
+
+ ### ML Model Pipeline
+ The app uses **10 classification models** loaded from Hugging Face:
+ - **SetFit Models (6)**: scope_lab1, scope_lab2, tech_lab1, tech_lab3, fin_lab2, bar_lab2
+ - **Transformer Pipelines (4)**: ADAPMIT_SCOPE, ADAPMIT_TECH, SECTOR (multilabel), LANG (language detection)
+
+ Models are loaded from different HF profiles:
+ - `mtyrrell/classifier_SF_*` - SetFit models
+ - `GIZ/ADAPMIT-multilabel-bge_f` - Adaptation vs. Mitigation classification
+ - `GIZ/SECTOR-multilabel-bge_f` - Sector classification
+ - `qanastek/51-languages-classifier` - Language detection
+
+ In addition, `BAAI/bge-m3` sentence embeddings are used for the duplicate-concept check.
+
+ ### Required Excel Template Columns
+ The input Excel file must contain these columns (ref. pipeline.py:113-124):
+ - `id` - Application identifier
+ - `organization` - Organization name
+ - `scope` - Scope description text
+ - `technology` - Technology description text
+ - `financial` - Financial information text
+ - `barrier` - Barrier description text
+ - `maf_funding_requested` - Requested funding amount
+ - `contributions_public_sector` - Public sector contributions
+ - `contributions_private_sector` - Private sector contributions
+ - `contributions_other` - Other contributions
+ - `mitigation_potential` - Mitigation potential (numeric, tCO2e)
+
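+ A quick way to check a file against the template before uploading (an illustrative sketch, not part of the app):
+
+ ```python
+ import pandas as pd
+
+ REQUIRED = ["id", "organization", "scope", "technology", "financial", "barrier",
+             "maf_funding_requested", "contributions_public_sector",
+             "contributions_private_sector", "contributions_other", "mitigation_potential"]
+
+ df = pd.read_excel("upload_template.xlsx")
+ missing = [c for c in REQUIRED if c not in df.columns]
+ print("Missing columns:", missing or "none")
+ ```
+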
+ ### Scoring and Filtering Logic
+ The application calculates a predicted score (0-10) based on:
+ - Individual model predictions (6 binary classifiers)
+ - Leverage calculations (lev_gt_0, lev_maf_scale)
+ - Formula: `(fin_lab2*2 + scope_lab1*2 + scope_lab2*2 + tech_lab1 + tech_lab3 + bar_lab2 + lev_gt_0 + lev_maf_scale) / 11 * 10`
+
+ Applications are labeled with one of three actions (pipeline.py:269-278), with `ERROR` as a fallback:
+ - **REJECT** - Non-English text, >6 concepts from the same org, Adaptation (not Mitigation), sector outside Energy/Transport/Industries, insufficient text length, a detected duplicate concept, or pred_score ≤ sensitivity level
+ - **PRE-ASSESSMENT** - pred_score in [sensitivity+1, sensitivity+2]
+ - **FULL-ASSESSMENT** - pred_score > sensitivity+2
+
+ Sensitivity levels (app.py:81-98):
+ - Low: 4 (fewer false negatives, ~6%)
+ - Medium: 5
+ - High: 6 (more false negatives, ~13%)
+
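+ The scoring and labeling logic, condensed (a sketch of pipeline.py's row-wise computation; `row` stands for one application with model outputs already attached, and the eligibility overrides listed above are omitted for brevity):
+
+ ```python
+ def score_and_label(row, sens_level):
+     # Weighted average of binary classifier outputs and leverage signals, scaled to 0-10
+     pred_score = round((row["fin_lab2"]*2 + row["scope_lab1"]*2 + row["scope_lab2"]*2 +
+                         row["tech_lab1"] + row["tech_lab3"] + row["bar_lab2"] +
+                         row["lev_gt_0"] + row["lev_maf_scale"]) / 11 * 10)
+     if pred_score <= sens_level:
+         return pred_score, "REJECT"
+     if pred_score <= sens_level + 2:
+         return pred_score, "PRE-ASSESSMENT"
+     return pred_score, "FULL-ASSESSMENT"
+ ```
+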
+ ### Key Modules
+
+ **modules/pipeline.py**
+ - `process_data(uploaded_file, sens_level, azure_client, azure_deployment)` - Main processing function
+ - `predict_category(df, model_name, progress_bar, repo, profile, multilabel)` - Model inference wrapper
+ - Handles both SetFit and Transformer pipeline models with different text truncation strategies
+
+ **modules/utils.py**
+ - `create_excel()` - Generates the template Excel file for download
+ - `clean_text(input_text)` - Removes special characters and normalizes whitespace
+ - `extract_predicted_labels(output, ordinal_selection, threshold)` - Extracts the top-2 labels from multilabel output
+ - `setup_logging()` - Configures a rotating file handler (logs/app.log, 1MB max, 5 backups)
+ - `getconfig(configfile_path)` - Reads settings from config.cfg
+
+ **modules/org_count.py**
+ - `standardize_organization_names(df)` - Normalizes org names using exact matching, abbreviations, and fuzzy matching (token_set_ratio; threshold 85 against the known-org dictionary, 95 for dynamic matching of unseen names)
+ - Adds `org_renamed` and `concept_count` columns to track multiple concepts from the same organization
+
+ **modules/semantic_similarity.py**
+ - `check_duplicate_concepts_semantic(...)` - Flags a concept as a duplicate when its `BAAI/bge-m3` embedding has cosine similarity ≥ 0.85 with another concept from the same organization
+
+ **modules/llm.py / modules/prompts.py / modules/models.py**
+ - `call_structured(...)` - Structured-output wrapper around the Azure OpenAI Responses API; currently only referenced by a disabled LLM-based duplicate check
+
+ **modules/auth.py**
+ - `validate_login(username, password)` - Validates against bcrypt hashes stored in environment variables
+
+ ### Session State Management
+ The app uses Streamlit session state to manage:
+ - `authenticated` - Login status
+ - `data_processed` - Whether processing is complete
+ - `df` - Processed dataframe
+ - `show_button` - Controls "Start Analysis" button visibility
+ - `processing` - Tracks processing state
+
+ ## Important Notes
+
+ - **Processing Time**: Approximately 5 minutes for 1000 applications (ref. app.py:66)
+ - **Model Truncation**: All text is truncated to a maximum of 512 tokens (pipeline.py:76-89)
+ - **Device Selection**: Automatically selects CUDA > MPS > CPU (pipeline.py:51)
+ - **Column Validation**: The app raises an error if required columns are missing from the uploaded Excel file
+ - **Organization Duplicates**: The system tracks concepts per organization and marks applications REJECT if there are more than 6 concepts from the same org
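+
+ The SetFit truncation is done manually through the underlying tokenizer, since the SetFit models take raw strings at call time. A condensed sketch of the approach used in pipeline.py:
+
+ ```python
+ from setfit import SetFitModel
+
+ model = SetFitModel.from_pretrained("mtyrrell/classifier_SF_scope_lab1")
+ tokenizer = model.model_body.tokenizer
+
+ text = "..."  # an application's scope description
+ # Truncate to the 512-token context window before predicting
+ encoded = tokenizer(text, truncation=True, max_length=512)
+ prediction = model(tokenizer.decode(encoded["input_ids"]))
+ ```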
Dockerfile ADDED
@@ -0,0 +1,40 @@
+ # NVIDIA CUDA base image for GPU support
+ FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
+
+ # Environment variables to prevent interactive prompts
+ ENV DEBIAN_FRONTEND=noninteractive
+
+ # curl is required by the HEALTHCHECK below
+ RUN apt-get update && apt-get install -y \
+     python3.10 \
+     python3-pip \
+     python3.10-dev \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+ WORKDIR $HOME/app
+
+ RUN pip install --no-cache-dir --upgrade pip
+
+ COPY --chown=user requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY --chown=user . .
+
+ # Create logs directory with proper permissions
+ USER root
+ RUN mkdir -p logs && chown -R user:user logs
+ USER user
+
+ # Expose Streamlit default port
+ EXPOSE 8501
+
+ # Health check
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1
+
+ # Run app
+ CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true", "--server.fileWatcherType=none", "--server.enableXsrfProtection=false", "--server.enableCORS=false"]
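A typical local build and run (the image tag is arbitrary; `--gpus all` requires the NVIDIA container toolkit and can be omitted for CPU-only runs):

```bash
docker build -t prefilter-app .
docker run --gpus all -p 8501:8501 --env-file .env prefilter-app
```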
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ title: Prefilter App
+ emoji: 🦀
+ colorFrom: yellow
+ colorTo: red
+ sdk: docker
+ app_port: 8501
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
__pycache__/app.cpython-311.pyc ADDED
Binary file (13.5 kB)
app.py ADDED
@@ -0,0 +1,243 @@
+ import torch
+
+ try:
+     print(f"Is CUDA available: {torch.cuda.is_available()}")
+     if torch.cuda.is_available():
+         try:
+             print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
+         except Exception as e:
+             print(f"Error getting CUDA device name: {str(e)}")
+     else:
+         print("No CUDA device available - using CPU")
+ except Exception as e:
+     print(f"Error checking CUDA availability: {str(e)}")
+     print("Continuing with CPU...")
+
+ import streamlit as st
+ import os
+ from huggingface_hub import login
+ from datetime import datetime
+ from openai import OpenAI
+ from modules.auth import validate_login
+ from modules.utils import create_excel, setup_logging, getconfig
+ from modules.pipeline import process_data
+
+ setup_logging()
+ import logging
+ from io import BytesIO
+
+ logger = logging.getLogger(__name__)
+
+ # Local development: load environment variables from .env
+ # from dotenv import load_dotenv
+ # load_dotenv()
+
+ config = getconfig("config.cfg")
+
+ @st.cache_resource
+ def get_azure_openai_client():
+     """Initialize and cache the Azure OpenAI client for the session"""
+     try:
+         AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
+         AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")
+         AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
+
+         if not all([AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_API_KEY]):
+             raise ValueError("Missing required Azure OpenAI environment variables. Please check your .env file.")
+
+         client = OpenAI(api_key=AZURE_OPENAI_API_KEY, base_url=AZURE_OPENAI_ENDPOINT)
+         logger.info("Azure OpenAI client initialized successfully")
+         return client
+     except Exception as e:
+         logger.error(f"Failed to initialize Azure OpenAI client: {str(e)}")
+         raise
+
+
+ def get_azure_deployment():
+     """Get the Azure OpenAI deployment name from the config file"""
+     try:
+         config = getconfig("config.cfg")
+         deployment = config.get("deployments", "DEPLOYMENT")
+         logger.info(f"Using Azure OpenAI deployment: {deployment}")
+         return deployment
+     except Exception as e:
+         logger.error(f"Failed to read deployment from config: {str(e)}. Using default deployment.")
+         return "gpt-4o-mini"
+
+
+ # Main app logic
+ def main():
+     # Initialize authentication state (users start logged out)
+     if 'authenticated' not in st.session_state:
+         st.session_state['authenticated'] = False
+
+     if st.session_state['authenticated']:
+         # Authenticate against the Hugging Face Hub for model access
+         hf_token = os.environ["HF_TOKEN"]
+         login(token=hf_token, add_to_git_credential=True)
+
+         # Initialize session state variables
+         if 'data_processed' not in st.session_state:
+             st.session_state['data_processed'] = False
+             st.session_state['df'] = None
+
+         # Main Streamlit app
+         st.title('Application Pre-Filtering Tool')
+
+         # Sidebar (filters)
+         with st.sidebar:
+             with st.expander("ℹ️ - Instructions", expanded=False):
+                 st.markdown(
+                     """
+ 1. **Download the Excel Template file (below)**
+ 2. **[OPTIONAL]: Select the desired filtering sensitivity level (below)**
+ 3. **Copy/paste the requisite application data into the template file. Best practice is to 'paste as values'**
+ 4. **Upload the template file in the area to the right (or click Browse files)**
+ 5. **Click 'Start Analysis'**
+
+ The tool will start processing the uploaded application data. This can take some time
+ depending on the number of applications and the length of text in each. For example, a file with 1000 applications
+ can be expected to take approximately 5 minutes.
+
+ ***NOTE** - you can also simply rename the column headers in your own file. The headers must match the column names in the template for the tool to run properly.*
+                     """
+                 )
+             # Excel template download
+             st.download_button(
+                 label="Download Excel Template",
+                 data=create_excel(),
+                 file_name="upload_template.xlsx",
+                 mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+             )
+
+             # Get sensitivity level for use in review/reject (ref. process_data function)
+             sens_options = {
+                 "Low": 4,
+                 "Medium": 5,
+                 "High": 6,
+             }
+
+             sens_input = st.sidebar.radio(
+                 label='Select the Sensitivity Level [OPTIONAL]',
+                 help='Decreasing the sensitivity level results in fewer \
+                     applications being filtered out. This also \
+                     reduces the probability of false negatives (FNs). The rate of \
+                     FNs at the lowest setting is approximately 6 percent, and \
+                     approaches 13 percent at the highest setting. \
+                     NOTE: changing this setting does not affect the raw data in the output file (only the labels)',
+                 options=list(sens_options.keys()),
+                 index=list(sens_options.keys()).index("High"),
+                 horizontal=False)
+
+             sens_level = sens_options[sens_input]
+
+             with st.expander("ℹ️ - About this app", expanded=False):
+                 st.write(
+                     """
+ This tool provides an interface for running an automated preliminary assessment of applications to a call for applications.
+
+ The tool functions by running selected text fields from each application through a series of LLMs fine-tuned for text classification (ref. diagram below).
+ The resulting output classifications are used to compute a score and a suggested pre-filtering action. The tool has been tested against
+ human assessors and exhibits an extremely low false negative rate (<6%) at a Sensitivity Level of 'Low' (i.e. rejection threshold of predicted score ≤ 4).
+                     """)
+                 st.image('images/pipeline.png')
+
+         uploaded_file = st.file_uploader("Select a file containing application pre-filtering data (see instructions in the sidebar)")
+
+         # Add session state variables if they don't exist
+         if 'show_button' not in st.session_state:
+             st.session_state['show_button'] = True
+         if 'processing' not in st.session_state:
+             st.session_state['processing'] = False
+         if 'data_processed' not in st.session_state:
+             st.session_state['data_processed'] = False
+
+         # Only show the button if show_button is True, a file is uploaded, and we are not already processing
+         if uploaded_file is not None and st.session_state['show_button'] and not st.session_state['processing']:
+             if st.button("Start Analysis", key="start_analysis"):
+                 st.session_state['show_button'] = False
+                 st.session_state['processing'] = True
+                 st.rerun()
+
+         # If we're processing, run the processing logic
+         if st.session_state['processing']:
+             try:
+                 logger.info(f"File uploaded: {uploaded_file.name}")
+
+                 if not st.session_state['data_processed']:
+                     logger.info("Starting data processing")
+                     try:
+                         # Initialize Azure OpenAI client and get deployment name
+                         azure_client = get_azure_openai_client()
+                         azure_deployment = get_azure_deployment()
+
+                         st.session_state['df'] = process_data(
+                             uploaded_file,
+                             sens_level,
+                             azure_client,
+                             azure_deployment
+                         )
+                         logger.info("Data processing completed successfully")
+                         st.session_state['data_processed'] = True
+                     except ValueError as e:
+                         # Handle specific validation errors
+                         logger.error(f"Validation error: {str(e)}")
+                         st.error(str(e))
+                         st.session_state['show_button'] = True
+                         st.session_state['processing'] = False
+                         st.rerun()
+                     except Exception as e:
+                         # Handle other unexpected errors
+                         logger.error(f"Error in process_data: {str(e)}")
+                         st.error("An unexpected error occurred. Please check your input file and try again.")
+                         st.session_state['show_button'] = True
+                         st.session_state['processing'] = False
+                         st.rerun()
+
+                 df = st.session_state['df']
+
+                 def reset_button_state():
+                     st.session_state['show_button'] = True
+                     st.session_state['processing'] = False
+                     st.session_state['data_processed'] = False
+
+                 # Create Excel buffer for download
+                 excel_buffer = BytesIO()
+                 df.to_excel(excel_buffer, index=False, engine='openpyxl')
+                 excel_buffer.seek(0)
+
+                 current_datetime = datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
+                 output_filename = f'processed_applications_{current_datetime}.xlsx'
+
+                 st.download_button(
+                     label="Download Analysis Data File",
+                     data=excel_buffer,
+                     file_name=output_filename,
+                     mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+                     on_click=reset_button_state
+                 )
+
+             except Exception as e:
+                 logger.error(f"Error processing file: {str(e)}")
+                 st.error("Failed to process the file. Please ensure your column names match the template file.")
+                 st.session_state['show_button'] = True
+                 st.session_state['processing'] = False
+                 st.rerun()
+
+     # Login form (shown until authentication succeeds)
+     else:
+         username = st.text_input("Username")
+         password = st.text_input("Password", type="password")
+         if st.button("Login"):
+             if validate_login(username, password):
+                 st.session_state['authenticated'] = True
+                 st.rerun()
+             else:
+                 st.error("Incorrect username or password")
+
+
+ main()
config.cfg ADDED
@@ -0,0 +1,2 @@
+ [deployments]
+ DEPLOYMENT=gpt-5-mini
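The deployment name is read at startup via `getconfig` (modules/utils.py); a minimal illustration of the lookup:

```python
from modules.utils import getconfig

config = getconfig("config.cfg")
print(config.get("deployments", "DEPLOYMENT"))  # -> gpt-5-mini
```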
images/pipeline.png ADDED

Git LFS Details

  • SHA256: c6966ec792e3b749e773a185e441a558bd5abbde42bb61f826fb1321af910928
  • Pointer size: 131 Bytes
  • Size of remote file: 679 kB
modules/auth.py ADDED
@@ -0,0 +1,13 @@
+ import os
+ import bcrypt
+
+ # Helper functions
+ def check_password(provided_password, stored_hash):
+     # bcrypt.checkpw extracts the salt from the stored hash and compares
+     return bcrypt.checkpw(provided_password.encode(), stored_hash)
+
+ def validate_login(username, password):
+     # Retrieve the user's hashed password from environment variables
+     user_hash = os.getenv(username.upper() + '_HASH')  # Assumes an env var like 'USER1_HASH'
+     if user_hash:
+         return check_password(password, user_hash.encode())
+     return False
modules/llm.py ADDED
@@ -0,0 +1,90 @@
+ # Helper functions for pipeline
+ from typing import Dict, Any
+ import logging
+ from openai import OpenAI
+ from modules.utils import setup_logging
+ # Imports used only by the disabled LLM duplicate check below
+ # from modules.prompts import prompt_concept
+ # from modules.models import ConceptClassify
+
+ logger = setup_logging()
+
+ def call_structured(client: OpenAI, deployment: str, system_prompt: str, user_prompt: str,
+                     response_model: type,
+                     logger: logging.Logger) -> Dict[str, Any]:
+     """Call Azure OpenAI with structured output parsed into `response_model`"""
+     if not system_prompt:
+         # Default assessor prompt if none is supplied
+         system_prompt = "You are assessing grant applications for an open funding call."
+     try:
+         # Reasoning models take a reasoning-effort setting instead of temperature
+         if deployment in ['o4-mini', 'o3', "gpt-5", "gpt-5-mini", "gpt-5-nano"]:
+             response = client.responses.parse(
+                 model=deployment,
+                 reasoning={"effort": "low"},
+                 input=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_prompt},
+                 ],
+                 text_format=response_model)
+         else:
+             response = client.responses.parse(
+                 model=deployment,
+                 input=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_prompt},
+                 ],
+                 temperature=0,
+                 text_format=response_model)
+
+         result = response.output_parsed
+
+         # Return the parsed structured output as a plain dict
+         return result.model_dump()
+
+     except Exception as e:
+         logger.error(f"Error calling Azure OpenAI for {response_model.__name__}: {e}")
+         # Signal failure to the caller
+         return None
+
+
+ # Not used - results were poor
+ # def check_duplicate_concepts(client, deployment, concept_id: str, organization: str, concept_profile: str, df) -> bool:
+ #     """
+ #     Check for duplicate concepts within the same organization using Azure OpenAI
+
+ #     Args:
+ #         client: AzureOpenAI client instance
+ #         deployment: Azure OpenAI deployment name
+ #         concept_id: ID of the current concept being checked
+ #         organization: Organization name
+ #         concept_profile: Text description of the concept to check
+ #         df: DataFrame containing all application data
+
+ #     Returns:
+ #         Boolean classification result
+ #     """
+
+ #     # Remove current concept from the dataframe
+ #     df_check = df[df['id'] != concept_id].copy()
+
+ #     # Get other concepts from the same organization
+ #     org_concepts = df_check[df_check['org_renamed'] == organization]
+ #     other_concepts = org_concepts['scope_txt'].tolist()
+
+ #     # If no other concepts from this organization, return False
+ #     if len(other_concepts) == 0:
+ #         return False
+
+ #     logger.info(f"Checking duplicates for concept ID {concept_id} from organization {organization} against {len(other_concepts)} other concept(s).")
+ #     logger.info(f"Scope text {concept_profile}")
+ #     # Construct prompt
+ #     prompt = prompt_concept(concept_profile, other_concepts)
+
+ #     response = call_structured(client, deployment, prompt, concept_profile, ConceptClassify, logger)
+
+ #     check = response['classification']
+ #     logger.info(f"Duplicate check response for concept ID {concept_id}: {check}")
+ #     if check == "YES":
+ #         return True
+ #     return False
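For reference, this is how `call_structured` would be invoked with the `ConceptClassify` schema (a hypothetical usage sketch; the endpoint and key placeholders are not real values):

```python
import logging
from openai import OpenAI
from modules.llm import call_structured
from modules.models import ConceptClassify

client = OpenAI(api_key="<AZURE_OPENAI_API_KEY>", base_url="<AZURE_OPENAI_ENDPOINT>")
result = call_structured(
    client, "gpt-5-mini",
    system_prompt="",  # empty -> falls back to the default assessor prompt
    user_prompt="Concept under review: ... Other concepts: ...",
    response_model=ConceptClassify,
    logger=logging.getLogger(__name__),
)
print(result)  # e.g. {'classification': 'NO'}
```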
modules/models.py ADDED
@@ -0,0 +1,8 @@
+ from typing import Literal
+ from pydantic import BaseModel, Field
+
+ #===================== Duplicate concepts =====================
+
+ class ConceptClassify(BaseModel):
+     classification: Literal["YES", "NO", "UNCERTAIN"] = Field(description="Is the concept duplicated in other applications? (YES/NO/UNCERTAIN)")
+
modules/org_count.py ADDED
@@ -0,0 +1,232 @@
+ import pandas as pd
+ from thefuzz import fuzz
+ import logging
+ import re
+
+ logger = logging.getLogger(__name__)
+
+
+ def standardize_organization_names(df):
+     """
+     Standardizes organization names in a DataFrame using exact matches, abbreviations, and fuzzy matching.
+
+     Args:
+         df (pd.DataFrame): DataFrame containing an 'organization' column
+
+     Returns:
+         pd.DataFrame: DataFrame with added 'org_renamed' and 'concept_count' columns
+     """
+     # Make a copy to avoid modifying the original DataFrame
+     df = df.copy()
+
+     # Sort by 'id' and reset the index so positional and label indexing agree downstream
+     df = df.sort_values('id', ascending=True).reset_index(drop=True)
+
+     # Return DataFrame as-is if 'organization' column is not present
+     if 'organization' not in df.columns:
+         logger.warning("No 'organization' column found in DataFrame. Returning DataFrame as-is.")
+
+     else:
+         logger.info("Checking org names")
+         # Dictionary of organization variations and their standardized names
+         # Cleaned up to remove leading/trailing spaces for consistency
+         org_variations = {
+             'Adventist Development Relief Agency': ['adventist development'],
+             'Asian Development Bank': ['asian development bank'],
+             'Association of the Regional Mechanism for Emissions Reductions of Boyacá, Colombia (MRRE)': ['regional mechanism for emissions reductions of boyacá'],
+             'BioCarbon Partners (BCP)': ['biocarbon partners'],
+             'Biothermica Technologies Inc': ['biothermica tech'],
+             'Brazilian Tourist Board': ['brazilian tourist board'],
+             'Caribbean Community Climate Change Centre': ['caribbean community climate'],
+             'Caritas': ['caritas'],
+             'Chemical Industries Holding Company': ['chemical industries holding company'],
+             'Climate Advocacy International (CAI)': ['climate advocacy int'],
+             'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['deutsche gesellschaft für internationale'],
+             'Deutsche Sparkassenstiftung (DSIK)': ['deutsche sparkassenstiftung'],
+             'Development Initiative for Community Impact (DICI)': ['development initiative for community impact'],
+             'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['east african centre of excellence for renewable'],
+             'Eco-Ideal': ['eco-ideal'],
+             'Electricité de France (EDF)': ['electricité de france', 'edf international networks'],
+             'The Energy and Resources Institute (TERI)': ['energy and resources institute'],
+             'Environmental Defense Fund (EDF)': ['environmental defense fund'],
+             'Food and Agriculture Organization (FAO)': ['food and agriculture organization'],
+             'Global Green Growth Institute (GGGI)': ['global green growth'],
+             'International Finance Corporation (IFC)': ['international finance corporation'],
+             'International Organization for Migration (IOM)': ['international organization for migration'],
+             'Inter-American Development Bank (IDB)': ['american development bank'],
+             'Iskandar Regional Development Authority (IRDA)': ['iskandar regional'],
+             'Islamic Development Bank': ['islamic development bank'],
+             'Malaysian Industry Government Group for High Technology (MIGHT)': ['government group for high technology'],
+             'Metallurgical Industries Holding Company': ['metallurgical industries holding company'],
+             'MicroSave Consulting (MSC)': ['microsave consulting'],
+             'Osh Technological University': ['osh technological university', 'ошский технологический университет'],
+             'Oxford Policy Management (OPM)': ['oxford policy management'],
+             'Pacific Rim Investment Management': ['pacific rim investment'],
+             'Palestinian Energy and Natural Resources Authority (PENRA)': ['palestinian energy and natural'],
+             'Rwanda Energy Group (REG) Ltd': ['rwanda energy group'],
+             'Rocky Mountain Institute (RMI)': ['rocky mountain institute'],
+             'Secretariat of the Pacific Regional Environment Programme (SPREP)': ['secretariat of the pacific regional environment programme (sprep)'],
+             'Serviço Nacional de Aprendizagem Industrial (SENAI)': ['serviço nacional de aprendizagem'],
+             'Sumy City Council': ['sumy city council'],
+             # 'Tajik Technical University': ['tajik technical university'],
+             'Uganda Development Bank Limited (UDBL)': ['uganda development bank'],
+             'United Nations Human Settlement Programme (UN-Habitat)': ['united nations human settlement', 'un-habitat'],
+             'United Nations Children\'s Fund (UNICEF)': ['united nations children'],
+             'United Nations Conference on Trade and Development (UNCTAD)': ['united nations conference on trade'],
+             'United Nations Development Programme (UNDP)': ['united nations development program'],
+             'United Nations Economic and Social Commission (ECOSOC)': ['united nations economic and social'],
+             'United Nations Environment Programme (UNEP)': ['united nations environment'],
+             'United Nations High Commissioner for Refugees (UNHCR)': ['high commissioner for refugees'],
+             'United Nations Industrial Development Organization (UNIDO)': ['united nations industrial'],
+             'United Nations Office for Project Services (UNOPS)': ['united nations office for project'],
+             'World Food Programme (WFP)': ['world food program'],
+             'World Health Organization (WHO)': ['world health organization'],
+             'World Resources Institute (WRI)': ['world resources institute'],
+             'World Wide Fund for Nature (WWF)': ['world wildlife', 'world wide fund for nature'],
+         }
+
+         # Dictionary of organization abbreviations
+         org_abbreviations = {
+             'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['GIZ'],
+             'Deutsche Sparkassenstiftung (DSIK)': ['DSIK'],
+             'Development Initiative for Community Impact (DICI)': ['DICI'],
+             'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['EACREEE'],
+             'Food and Agriculture Organization (FAO)': ['FAO'],
+             'Global Green Growth Institute (GGGI)': ['GGGI'],
+             'International Finance Corporation (IFC)': ['IFC'],
+             'International Organization for Migration (IOM)': ['IOM'],
+             'Inter-American Development Bank (IDB)': ['IDB'],
+             'United Nations Children\'s Fund (UNICEF)': ['UNICEF'],
+             'United Nations Conference on Trade and Development (UNCTAD)': ['UNCTAD'],
+             'United Nations Development Programme (UNDP)': ['UNDP'],
+             'United Nations Economic and Social Commission (ECOSOC)': ['ECOSOC'],
+             'United Nations Environment Programme (UNEP)': ['UNEP'],
+             'United Nations Industrial Development Organization (UNIDO)': ['UNIDO'],
+             'United Nations High Commissioner for Refugees (UNHCR)': ['UNHCR'],
+             'United Nations Office for Project Services (UNOPS)': ['UNOPS'],
+             'World Food Programme (WFP)': ['WFP'],
+             'World Health Organization (WHO)': ['WHO'],
+             'World Resources Institute (WRI)': ['WRI'],
+             'World Wide Fund for Nature (WWF)': ['WWF']
+         }
+
+         # Initialize result column
+         df['org_renamed'] = None
+
+         # Step 1: Process abbreviations first (highest priority for exact acronyms)
+         # Use case-insensitive matching to catch "giz", "GIZ", "Giz", etc.
+         logger.info("Processing abbreviation matches")
+         for standard_name, abbreviations in org_abbreviations.items():
+             for abbreviation in abbreviations:
+                 # Case-insensitive matching with word boundaries to avoid partial matches
+                 pattern = r'\b' + re.escape(abbreviation) + r'\b'
+                 mask = df['organization'].str.contains(pattern, case=False, regex=True, na=False) & df['org_renamed'].isna()
+                 df.loc[mask, 'org_renamed'] = standard_name
+
+         # Step 2: Process substring variations (e.g., "adventist development" in full org name)
+         # Use improved substring matching to reduce false positives
+         logger.info("Processing variation matches")
+         for standard_name, variations in org_variations.items():
+             for var in variations:
+                 # Check if already matched by abbreviation
+                 mask = df['org_renamed'].isna()
+                 if mask.sum() == 0:
+                     break
+                 # Use simple substring matching (case-insensitive)
+                 # Note: Not using word boundaries here as variations are often partial phrases
+                 org_lower = df.loc[mask, 'organization'].str.lower()
+                 submask = org_lower.str.contains(re.escape(var), regex=True, na=False)
+                 df.loc[mask & submask, 'org_renamed'] = standard_name
+
+         # Step 3: Process fuzzy matches against the dictionary for remaining unmatched organizations
+         unmatched_mask = df['org_renamed'].isna()
+         if unmatched_mask.sum() > 0:
+             logger.info(f"Processing fuzzy matches against dictionary for {unmatched_mask.sum()} unmatched organizations")
+
+             # Get unique unmatched organization names to avoid duplicate processing
+             unique_unmatched = df.loc[unmatched_mask, 'organization'].unique()
+             threshold = 85  # token_set_ratio handles extra tokens (e.g., country names) better
+
+             # Create mapping dictionary for unique unmatched orgs
+             org_mapping = {}
+             for org in unique_unmatched:
+                 org_lower = str(org).lower()
+                 best_match = None
+                 highest_ratio = 0
+
+                 # Check against all variations and standard names
+                 for standard_name, variations in org_variations.items():
+                     all_forms = [standard_name.lower()] + variations
+                     for variant in all_forms:
+                         # Use token_set_ratio for better matching with extra tokens/geographic qualifiers
+                         ratio = fuzz.token_set_ratio(org_lower, variant)
+                         if ratio > threshold and ratio > highest_ratio:
+                             highest_ratio = ratio
+                             best_match = standard_name
+
+                 if best_match:
+                     org_mapping[org] = best_match
+                     logger.debug(f"Fuzzy matched '{org}' to '{best_match}' (score: {highest_ratio})")
+
+             # Apply the mapping to all matching rows
+             if org_mapping:
+                 for original_org, standard_org in org_mapping.items():
+                     mask = (df['organization'] == original_org) & df['org_renamed'].isna()
+                     df.loc[mask, 'org_renamed'] = standard_org
+
+         # Step 4: Dynamic fuzzy matching of remaining unmatched organizations
+         # Compare new unmatched orgs against previously stored unmatched orgs
+         unmatched_mask = df['org_renamed'].isna()
+         if unmatched_mask.sum() > 0:
+             logger.info(f"Processing dynamic fuzzy matching for {unmatched_mask.sum()} remaining unmatched organizations")
+
+             # Get unmatched organizations in order of first appearance
+             unmatched_df = df[unmatched_mask].copy()
+             unmatched_df = unmatched_df.sort_values('id')
+             unique_unmatched_ordered = unmatched_df['organization'].drop_duplicates().tolist()
+
+             threshold = 95  # Higher threshold for dynamic matching to avoid false positives
+             stored_orgs = []  # List of previously seen unmatched orgs (becomes canonical)
+             org_mapping = {}  # Maps original org name to stored canonical org
+
+             for org in unique_unmatched_ordered:
+                 org_lower = str(org).lower()
+                 best_match = None
+                 highest_ratio = 0
+
+                 # Compare against all previously stored unmatched organizations
+                 for stored_org in stored_orgs:
+                     stored_org_lower = str(stored_org).lower()
+                     ratio = fuzz.token_set_ratio(org_lower, stored_org_lower)
+
+                     if ratio > threshold and ratio > highest_ratio:
+                         highest_ratio = ratio
+                         best_match = stored_org
+
+                 if best_match:
+                     # Match found - map to the stored org
+                     org_mapping[org] = best_match
+                     logger.debug(f"Dynamically matched '{org}' to '{best_match}' (score: {highest_ratio})")
+                 else:
+                     # No match - store this org for future comparisons
+                     stored_orgs.append(org)
+                     org_mapping[org] = org
+
+             # Apply the mapping to all matching rows
+             for original_org, canonical_org in org_mapping.items():
+                 mask = (df['organization'] == original_org) & df['org_renamed'].isna()
+                 df.loc[mask, 'org_renamed'] = canonical_org
+
+         # Fill remaining empty values with original organization names
+         df.loc[df['org_renamed'].isna(), 'org_renamed'] = df.loc[df['org_renamed'].isna(), 'organization']
+
+         # Add concept count (running count of concepts per standardized organization)
+         df['concept_count'] = df.groupby('org_renamed').cumcount() + 1
+
+         # Reorder columns with id, organization, org_renamed, concept_count first, followed by all others
+         cols = ['id', 'organization', 'org_renamed', 'concept_count']
+         other_cols = [col for col in df.columns if col not in cols]
+         df = df[cols + other_cols]
+
+     return df
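The fuzzy steps use `token_set_ratio` because it ignores token order and extra tokens such as geographic qualifiers. A quick illustration with thefuzz:

```python
from thefuzz import fuzz

# Extra tokens ("kenya office") barely lower the token-set score
print(fuzz.token_set_ratio(
    "united nations development programme (undp) kenya office",
    "united nations development program"))  # high score
print(fuzz.ratio("UNDP Kenya", "united nations development program"))  # much lower
```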
modules/pipeline.py ADDED
@@ -0,0 +1,339 @@
+ import re
+ import time
+ import pandas as pd
+ import streamlit as st
+ import torch
+ from setfit import SetFitModel
+ from transformers import pipeline
+ from modules.org_count import standardize_organization_names
+ from modules.utils import clean_text, extract_predicted_labels
+ # from modules.llm import check_duplicate_concepts
+ from modules.semantic_similarity import check_duplicate_concepts_semantic
+ from sentence_transformers import SentenceTransformer
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ # extract_predicted_labels (top-2 SECTOR label extraction) is imported from modules.utils
+
+ # Function to call model and run inference for varying classification tasks/models
+ def predict_category(df, model_name, progress_bar, repo, profile, multilabel=False):
+     device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu"))
+     model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']
+
+     # Model configuration mapping
+     model_config = {
+         'ADAPMIT_TECH': {'col_name': 'tech_txt', 'top_k': 1},
+         'ADAPMIT_SCOPE': {'col_name': 'scope_txt', 'top_k': 1},
+         'LANG': {'col_name': 'scope_txt', 'top_k': 1},
+         'default': {'col_name': 'scope_txt', 'top_k': None}
+     }
+
+     if model_name in model_names_sf:
+         col_name = re.sub(r'_(.*)', r'_txt', model_name)
+         model = SetFitModel.from_pretrained(profile + "/" + repo)
+         model.to(device)
+         # Get tokenizer from the model body for manual truncation
+         tokenizer = model.model_body.tokenizer
+     else:
+         # Get configuration for the model, falling back to default if not specified
+         config = model_config.get(model_name, model_config['default'])
+         col_name = config['col_name']
+         model = pipeline("text-classification",
+                          model=profile + "/" + repo,
+                          device=device,
+                          top_k=config['top_k'],
+                          truncation=True,
+                          max_length=512)
+
+     predictions = []
+     total = len(df)
+     for i, text in enumerate(df[col_name]):
+         try:
+             if model_name in model_names_sf:
+                 # Truncate text for SetFit models (no truncation flag at call time)
+                 encoded = tokenizer(text, truncation=True, max_length=512)
+                 truncated_text = tokenizer.decode(encoded['input_ids'])
+                 prediction = model(truncated_text)
+                 predictions.append(0 if prediction == 'NEGATIVE' else 1)
+             else:
+                 prediction = model(text)
+                 if model_name == 'ADAPMIT_SCOPE' or model_name == 'ADAPMIT_TECH':
+                     predictions.append(re.sub('Label$', '', prediction[0][0]['label']))
+                 elif model_name == 'SECTOR':
+                     predictions.append(extract_predicted_labels(prediction[0], threshold=0.5))
+                 elif model_name == 'LANG':
+                     predictions.append(prediction[0][0]['label'])
+         except Exception as e:
+             logger.error(f"Error processing sample {df['id'].iloc[i]}: {str(e)}")
+             st.error("Application Error. Please contact support.")
+         # Update progress bar with each iteration
+         progress = (i + 1) / total
+         progress_bar.progress(progress)
+
+     return predictions
+
+
+ # Main function to process data
+ def process_data(uploaded_file, sens_level, azure_client, azure_deployment):
+     """
+     Process uploaded application data through the ML pipeline
+
+     Args:
+         uploaded_file: Excel file containing application data
+         sens_level: Sensitivity level for filtering (4=Low, 5=Medium, 6=High)
+         azure_client: Azure OpenAI client instance for LLM calls
+         azure_deployment: Azure OpenAI deployment name
+
+     Returns:
+         Processed DataFrame with predictions and scores
+     """
+     # Define required columns and their mappings
+     required_columns = {
+         'id': 'id',
+         'scope': 'scope_txt',
+         'technology': 'tech_txt',
+         'financial': 'fin_txt',
+         'barrier': 'bar_txt',
+         'maf_funding_requested': 'maf_funding',
+         'contributions_public_sector': 'cont_public',
+         'contributions_private_sector': 'cont_private',
+         'contributions_other': 'cont_other',
+         'mitigation_potential': 'mitigation_potential'
+     }
+
+     # Read the Excel file
+     try:
+         df = pd.read_excel(uploaded_file)
+         logger.info("Data import successful")
+         # Clean up organization names
+         df = standardize_organization_names(df)
+     except Exception as e:
+         error_msg = f"Failed to read Excel file: {str(e)}"
+         logger.error(error_msg)
+         st.error("Failed to read the uploaded file. Please ensure it's a valid Excel file.")
+         raise ValueError(error_msg)
+
+     # Validate required columns
+     missing_columns = [col for col in required_columns.keys() if col not in df.columns]
+     if missing_columns:
+         error_msg = f"Missing required columns: {', '.join(missing_columns)}"
+         logger.error(error_msg)
+         st.error(error_msg)
+         raise ValueError(error_msg)
+
+     # Rename required columns while preserving all others
+     df = df.rename(columns={k: v for k, v in required_columns.items() if k in df.columns})
+
+     # Clean and process text fields
+     df.fillna('', inplace=True)
+     df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']] = df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']].applymap(clean_text)
+
+     # Define models and predictions
+     model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']
+     model_names = model_names_sf + ['ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'SECTOR', 'LANG', 'DUPLICATE_CHECK']
+     # model_names_sf = []
+     # model_names = ['ADAPMIT_SCOPE','ADAPMIT_TECH']
+     total_predictions = len(model_names) * len(df)
+     progress_count = 0
+
+     # UI setup for progress tracking
+     st.subheader("Overall Progress:")
+     patience_text = st.empty()
+     patience_text.markdown("*You may want to grab a coffee, this can take a while...*")
+     overall_progress = st.progress(0)
+     overall_start_time = time.time()
+     estimated_time_remaining_text = st.empty()
+
+     # Model processing
+     step_count = 0
+     total_steps = len(model_names)
+     for model_name in model_names:
+         logger.info(f"Loading: {model_name}")
+         step_count += 1
+         model_processing_text = st.empty()
+         model_processing_text.markdown(f'**Current Task: Processing with model "{model_name}"**')
+         model_progress = st.empty()
+         progress_bar = model_progress.progress(0)
+
+         # Load the model and run inference
+         if model_name in model_names_sf:
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='classifier_SF_' + model_name, profile='mtyrrell')
+         elif model_name == 'ADAPMIT_SCOPE':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
+         elif model_name == 'ADAPMIT_TECH':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
+         elif model_name == 'SECTOR':
+             sectors_dict = predict_category(df, model_name, progress_bar, repo='SECTOR-multilabel-bge_f', profile='GIZ', multilabel=True)
+             df['SECTOR1'] = [item['SECTOR1'] for item in sectors_dict]
+             df['SECTOR2'] = [item['SECTOR2'] for item in sectors_dict]
+         elif model_name == 'LANG':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='51-languages-classifier', profile='qanastek')
+             # df[model_name] = predict_category(df, model_name, progress_bar, repo='xlm-roberta-base-language-detection', profile='papluca')
+         elif model_name == 'DUPLICATE_CHECK':
+             # Load semantic similarity model for duplicate detection
+             device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu"))
+             logger.info(f"Loading semantic similarity model on device: {device}")
+             semantic_model = SentenceTransformer('BAAI/bge-m3', device=device)
+
+             # Process duplicate check with progress tracking
+             duplicate_results = []
+             total = len(df)
+             # enumerate gives a positional counter for the progress bar regardless of index labels
+             for i, (_, row) in enumerate(df.iterrows()):
+                 result = check_duplicate_concepts_semantic(
+                     semantic_model,
+                     row['id'],
+                     row['org_renamed'],
+                     row['scope_txt'],
+                     df
+                 )
+                 duplicate_results.append(result)
+                 # Update progress bar with each iteration
+                 progress = (i + 1) / total
+                 progress_bar.progress(progress)
+             df['duplicate_check'] = duplicate_results
+
+         logger.info(f"Completed: {model_name}")
+         model_progress.empty()
+
+         progress_count += len(df)
+         overall_progress_value = progress_count / total_predictions
+         overall_progress.progress(overall_progress_value)
+
+         # Calculate and display estimated time remaining
+         elapsed_time = time.time() - overall_start_time
+         steps_remaining = total_steps - step_count
+         if step_count > 1:
+             estimated_time_remaining = (elapsed_time / step_count) * steps_remaining
+             estimated_time_remaining_text.markdown(
+                 f"Elapsed time: {elapsed_time:.1f}s. "
+                 f"Estimated time remaining: {estimated_time_remaining:.1f}s"
+                 f" (step {step_count+1} of {len(model_names)})"
+             )
+         else:
+             estimated_time_remaining_text.write(f'Calculating time remaining... (step {step_count+1} of {len(model_names)})')
+
+         model_processing_text.empty()
+
+     patience_text.empty()
+     estimated_time_remaining_text.empty()
+
+     st.write(f'Processing complete. Total time: {elapsed_time:.1f} seconds')
+
+     # df['ADAPMIT_SCOPE_SCORE'] = df['ADAPMIT_SCOPE'].apply(
+     #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
+     # )
+     # df['ADAPMIT_TECH_SCORE'] = df['ADAPMIT_TECH'].apply(
+     #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
+     # )
+
+     # # Calculate average mitigation score
+     # df['ADAPMIT_SCORE'] = (df['ADAPMIT_SCOPE_SCORE'] + df['ADAPMIT_TECH_SCORE']) / 2
+
+     # Only classify as Adaptation if both the scope and tech classifiers agree
+     df['ADAPMIT'] = df.apply(lambda x: 'Adaptation' if x['ADAPMIT_SCOPE'] == 'Adaptation' and x['ADAPMIT_TECH'] == 'Adaptation' else 'Mitigation', axis=1)
+
+     # Convert funding columns to numeric, replacing any non-numeric values with NaN
+     df['maf_funding'] = pd.to_numeric(df['maf_funding'], errors='coerce')
+     df['cont_public'] = pd.to_numeric(df['cont_public'], errors='coerce')
+     df['cont_private'] = pd.to_numeric(df['cont_private'], errors='coerce')
+     df['cont_other'] = pd.to_numeric(df['cont_other'], errors='coerce')
+     # Same for mitigation potential
+     df['mitigation_potential'] = abs(pd.to_numeric(df['mitigation_potential'], errors='coerce'))
+
+     # Fill any NaN values with 0
+     df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']] = df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']].fillna(0)
+
+     # Get total of all leverage
+     df['lev_total'] = df.apply(lambda x: x['cont_public'] + x['cont_private'] + x['cont_other'], axis=1)
+     # Leverage > MAF request
+     # df['lev_gt_maf'] = df.apply(lambda x: 'True' if x['lev_total'] > x['maf_funding'] else 'False', axis=1)  # not used
+     # Leverage > 0 ?
+     df['lev_gt_0'] = (df['lev_total'] > 0).astype(int)
+     # Calculate leverage as percentage of MAF funding
+     df['lev_maf_%'] = df.apply(lambda x: round(x['lev_total']/x['maf_funding']*100, 2) if x['maf_funding'] != 0 else 0, axis=1)
+     # Create normalized leverage scale (0-1) where 300% leverage = 1
+     df['lev_maf_scale'] = df['lev_maf_%'].apply(lambda x: min(x/300, 1) if x > 0 else 0)
+     # EUR / tCO2e mitigation potential
+     df['cost_effectiveness'] = df.apply(lambda x: round(x['maf_funding']/x['mitigation_potential'], 2) if x['mitigation_potential'] > 0 else None, axis=1)
+     # Normalize cost_effectiveness to 0-1 scale (lower cost = higher score, capped at 1000 EUR/tCO2e)
+     df['cost_effectiveness_norm'] = df['cost_effectiveness'].apply(lambda x: max(0, 1 - (x / 1000)) if pd.notna(x) else None)
+
+     # Test whether all text fields fall short of the minimum required words
+     df['word_length_check'] = df.apply(lambda x:
+         True if len(x['scope_txt'].split()) < 10 and
+                 len(x['fin_txt'].split()) < 10 and
+                 len(x['tech_txt'].split()) < 10
+         else False, axis=1)
+
+     # Predict score
+     sector_classes = ['Energy', 'Transport', 'Industries']
+     df['pred_score'] = df.apply(lambda x: round((x['fin_lab2']*2 + x['scope_lab1']*2 + x['scope_lab2']*2 + x['tech_lab1'] + x['tech_lab3'] + x['bar_lab2'] + x['lev_gt_0'] + x['lev_maf_scale'])/11*10, 0), axis=1)
+     # Labelling logic
+     df['pred_action'] = df.apply(lambda x:
+         'REJECT' if (('concept_count' in df.columns and x['concept_count'] > 6) or
+                      x['LANG'][0:2] != 'en' or
+                      x['ADAPMIT'] == 'Adaptation' or
+                      not any(sector in [x['SECTOR1'], x['SECTOR2']] for sector in sector_classes) or
+                      x['word_length_check'] == True or
+                      x['duplicate_check'] == True or
+                      x['pred_score'] <= sens_level)
+         else 'PRE-ASSESSMENT' if sens_level + 1 <= x['pred_score'] <= sens_level + 2
+         else 'FULL-ASSESSMENT' if x['pred_score'] > sens_level + 2
+         else 'ERROR', axis=1)
+
+     # Reorder columns in final dataframe
+     column_order = ['id', 'organization', 'org_renamed', 'concept_count', 'duplicate_check', 'scope_txt', 'tech_txt', 'fin_txt', 'maf_funding', 'cont_public',
+                     'cont_private', 'cont_other', 'scope_lab1', 'scope_lab2', 'tech_lab1',
+                     'tech_lab3', 'fin_lab2', 'bar_lab2', 'ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'ADAPMIT', 'SECTOR1',
+                     'SECTOR2', 'LANG', 'lev_total', 'lev_gt_0', 'lev_maf_%', 'lev_maf_scale', 'mitigation_potential', 'cost_effectiveness', 'cost_effectiveness_norm',
+                     'word_length_check', 'pred_score', 'pred_action']
+
+     # Only include columns that exist in the DataFrame
+     final_columns = [col for col in column_order if col in df.columns]
+     df = df[final_columns]
+
+     return df
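The leverage features in numbers: an application requesting 1,000,000 with 1,500,000 in total co-contributions gets lev_maf_% = 150 and lev_maf_scale = 0.5, since 300% leverage saturates the scale at 1.0. A worked check:

```python
maf_funding, lev_total = 1_000_000, 1_500_000
lev_maf_pct = round(lev_total / maf_funding * 100, 2)  # 150.0
lev_maf_scale = min(lev_maf_pct / 300, 1)              # 0.5
print(lev_maf_pct, lev_maf_scale)
```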
modules/prompts.py ADDED
@@ -0,0 +1,21 @@
+ # Prompts library
+ from typing import List
+
+ def prompt_concept(concept: str, other_concepts: List[str]) -> str:
+     """Generate prompt for classifying concepts by similarity"""
+     prompt = f"""
+ Each organization is allowed to submit up to 6 concepts per year via a web portal. However, in some cases organizations submit the same concept multiple times.
+ This can happen for various reasons. For example, an organization may erroneously submit the same application twice because it lost access to the previous web session.
+ In such cases the duplicate concepts are usually not verbatim identical; rather, there is high semantic alignment, i.e. it is the same concept with minor superficial differences between the applications.
+ Your task is to review the concept profiles submitted by a particular organization and assess their similarity so that cases of duplicate concepts can be identified.
+
+ Here is the concept profile for review:
+ {concept}
+
+ Please review this against the following concepts and assess for duplication:
+ {other_concepts}
+
+ Please conduct your review carefully and ensure that you tag all duplicates correctly. Please return your response according to the following structure:
+ """
+     return prompt
modules/semantic_similarity.py ADDED
@@ -0,0 +1,70 @@
+ # Semantic similarity-based duplicate detection
+ import pandas as pd
+ import logging
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+ from modules.utils import setup_logging
+
+ logger = setup_logging()
+
+
+ def check_duplicate_concepts_semantic(
+     model: SentenceTransformer,
+     concept_id: str,
+     organization: str,
+     concept_profile: str,
+     df: pd.DataFrame,
+     similarity_threshold: float = 0.85
+ ) -> bool:
+     """
+     Check for duplicate concepts within the same organization using semantic similarity
+
+     Args:
+         model: SentenceTransformer model for computing embeddings
+         concept_id: ID of the current concept being checked
+         organization: Organization name
+         concept_profile: Text description of the concept to check
+         df: DataFrame containing all application data
+         similarity_threshold: Threshold for considering concepts duplicates (0-1)
+             Recommended values: 0.80 (lenient) to 0.95 (strict)
+
+     Returns:
+         Boolean classification result
+     """
+
+     # Remove current concept from the dataframe
+     df_check = df[df['id'] != concept_id].copy()
+
+     # Get other concepts from the same organization
+     org_concepts = df_check[df_check['org_renamed'] == organization]
+     other_concepts = org_concepts['scope_txt'].tolist()
+
+     # If no other concepts from this organization, return False
+     if len(other_concepts) == 0:
+         return False
+
+     # Compute embedding for current concept
+     current_embedding = model.encode(
+         concept_profile if concept_profile else "",
+         convert_to_numpy=True
+     )
+
+     # Compute embeddings for other concepts
+     other_embeddings = model.encode(
+         [text if text else "" for text in other_concepts],
+         convert_to_numpy=True
+     )
+
+     # Compute cosine similarities
+     similarities = cosine_similarity(
+         current_embedding.reshape(1, -1),
+         other_embeddings
+     )[0]
+
+     max_similarity = similarities.max() if len(similarities) > 0 else 0.0
+
+     logger.info(f"Duplicate check response for concept ID {concept_id}: max_similarity={max_similarity:.3f}")
+
+     if max_similarity >= similarity_threshold:
+         return True
+     return False
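A toy end-to-end check of the duplicate detector (illustrative only; a two-row DataFrame stands in for real application data):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from modules.semantic_similarity import check_duplicate_concepts_semantic

df = pd.DataFrame({
    "id": [1, 2],
    "org_renamed": ["Org A", "Org A"],
    "scope_txt": ["Solar mini-grids for rural clinics",
                  "Solar mini grids powering rural health clinics"],
})
model = SentenceTransformer("BAAI/bge-m3")
print(check_duplicate_concepts_semantic(model, 1, "Org A", df.loc[0, "scope_txt"], df))
# True whenever the pair's cosine similarity reaches the 0.85 threshold
```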
modules/utils.py ADDED
@@ -0,0 +1,111 @@
+ import re
+ from io import BytesIO
+ from openpyxl import Workbook
+ from openpyxl.styles import Font, PatternFill
+ import logging
+ from logging.handlers import RotatingFileHandler
+ import os
+ import configparser
+
+ def setup_logging():
+     # Set up logging
+     log_dir = 'logs'
+     os.makedirs(log_dir, exist_ok=True)
+     log_file = os.path.join(log_dir, 'app.log')
+
+     # Create a RotatingFileHandler (1MB max per file, 5 backups)
+     file_handler = RotatingFileHandler(log_file, maxBytes=1024 * 1024, backupCount=5)
+     file_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
+
+     # Configure the root logger
+     logging.basicConfig(level=logging.INFO,
+                         format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+                         handlers=[file_handler, logging.StreamHandler()])
+
+     # Return a logger instance
+     return logging.getLogger(__name__)
+
+
+ def getconfig(configfile_path: str):
+     """
+     Read the config file
+     Params
+     ----------------
+     configfile_path: file path of .cfg file
+     """
+     config = configparser.ConfigParser()
+     try:
+         with open(configfile_path) as f:
+             config.read_file(f)
+         return config
+     except Exception:
+         logging.warning("Config file not found")
+
+ # Function for creating the upload template file
+ def create_excel():
+     wb = Workbook()
+     sheet = wb.active
+     sheet.title = "template"
+     columns = ['id',
+                'organization',
+                'scope',
+                'technology',
+                'financial',
+                'barrier',
+                'maf_funding_requested',
+                'contributions_public_sector',
+                'contributions_private_sector',
+                'contributions_other',
+                'mitigation_potential']
+     sheet.append(columns)  # Appending columns to the first row
+
+     # Style the header row (A1:K1)
+     for c in sheet['A1:K1'][0]:
+         c.fill = PatternFill('solid', fgColor='bad8e1')
+         c.font = Font(bold=True)
+
+     # Save to a BytesIO object
+     output = BytesIO()
+     wb.save(output)
+     return output.getvalue()
+
+
+ # Function to clean text
+ def clean_text(input_text):
+     cleaned_text = re.sub(r"[^a-zA-Z0-9\s.,:;!?()\-\n]", "", input_text)
+     cleaned_text = re.sub(r"x000D", "", cleaned_text)
+     cleaned_text = re.sub(r"\s+", " ", cleaned_text)
+     cleaned_text = re.sub(r"\n+", "\n", cleaned_text)
+     return cleaned_text
+
+
+ # Function for extracting classifications for each SECTOR label
+ def extract_predicted_labels(output, ordinal_selection=1, threshold=0.5):
+
+     # Verify output is a list of dictionaries
+     if isinstance(output, list) and all(isinstance(item, dict) for item in output):
+         # Filter items with scores above the threshold
+         filtered_items = [item for item in output if item.get('score', 0) > threshold]
+
+         # Sort the filtered items by score in descending order
+         sorted_items = sorted(filtered_items, key=lambda x: x.get('score', 0), reverse=True)
+
+         # Extract the highest and second-highest labels
+         if len(sorted_items) >= 2:
+             highest_label = sorted_items[0].get('label')
+             second_highest_label = sorted_items[1].get('label')
+         elif len(sorted_items) == 1:
+             highest_label = sorted_items[0].get('label')
+             second_highest_label = None
+         else:
+             print("Warning: Fewer than two items above the threshold in the current list.")
+             highest_label = None
+             second_highest_label = None
+     else:
+         print("Error: Inner data is not formatted correctly. Each item must be a dictionary.")
+         highest_label = None
+         second_highest_label = None
+
+     # Output dictionary of highest and second-highest labels
+     predicted_labels = {"SECTOR1": highest_label, "SECTOR2": second_highest_label}
+     return predicted_labels
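What `extract_predicted_labels` returns for a typical multilabel SECTOR output (illustrative scores):

```python
from modules.utils import extract_predicted_labels

output = [
    {"label": "Energy", "score": 0.91},
    {"label": "Transport", "score": 0.62},
    {"label": "Agriculture", "score": 0.12},
]
print(extract_predicted_labels(output, threshold=0.5))
# {'SECTOR1': 'Energy', 'SECTOR2': 'Transport'}
```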
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ streamlit
+ pandas
+ openpyxl
+ setfit
+ bcrypt
+ # Match the CUDA 11.8 runtime in the Dockerfile base image
+ --extra-index-url https://download.pytorch.org/whl/cu118
+ torch
+ thefuzz
+ openai==2.9.0
+ python-dotenv
+ sentence-transformers
+ scikit-learn