Commit bc92a1b · Fresh start with LFS for images

Files changed:
- .dockerignore +50 -0
- .gitattributes +1 -0
- .gitignore +8 -0
- CLAUDE.md +121 -0
- Dockerfile +40 -0
- README.md +11 -0
- __pycache__/app.cpython-311.pyc +0 -0
- app.py +243 -0
- config.cfg +2 -0
- images/pipeline.png +3 -0
- modules/auth.py +13 -0
- modules/llm.py +90 -0
- modules/models.py +8 -0
- modules/org_count.py +232 -0
- modules/pipeline.py +339 -0
- modules/prompts.py +21 -0
- modules/semantic_similarity.py +70 -0
- modules/utils.py +111 -0
- requirements.txt +12 -0
.dockerignore
ADDED
@@ -0,0 +1,50 @@

```
# Git files
.git
.gitignore

# Python cache
__pycache__
*.pyc
*.pyo
*.pyd
.Python
*.so
*.egg
*.egg-info
dist
build

# Environment files
.env
*.env

# IDE files
.vscode
.idea
*.swp
*.swo
*~

# OS files
.DS_Store
Thumbs.db

# Testing and sandbox
testing/
sandbox/

# Logs (will be created at runtime)
logs/
*.log

# CSV files (test data)
*.csv

# Large data files
org_renamed.csv
rename.csv
test.csv

# Documentation
README.md
CLAUDE.md
```
.gitattributes
ADDED
@@ -0,0 +1 @@

```
*.png filter=lfs diff=lfs merge=lfs -text
```
.gitignore
ADDED
@@ -0,0 +1,8 @@

```
.env
.DS_Store
*.csv
*.xlsx
/testing/
/modules/__pycache__/
app.log
/sandbox/
```
CLAUDE.md
ADDED
@@ -0,0 +1,121 @@

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a **Streamlit-based web application** for pre-filtering and analyzing grant applications using machine learning models. The app processes Excel files containing application data, runs them through multiple fine-tuned LLMs for classification, and generates scored outputs with recommended filtering actions.

The application is deployed as a **Hugging Face Space** (ref. README.md metadata).

## Running the Application

### Local Development
```bash
# Install dependencies
pip install -r requirements.txt

# Run the Streamlit app
streamlit run app.py
```

### Environment Variables
Required environment variables (stored in `.env` for local development):
- `HF_TOKEN` - Hugging Face token for authentication and model access
- `<USERNAME>_HASH` - Bcrypt password hash per user for authentication (e.g., `USER1_HASH`)

### CUDA/GPU Support
The application checks for CUDA availability on startup (app.py:1-13). It automatically uses the GPU if one is available and otherwise falls back to CPU or MPS (on Apple Silicon).

## Architecture

### Application Flow
1. **Authentication** (modules/auth.py) - Users log in with credentials validated against bcrypt hashes from environment variables
2. **File Upload** - Users upload an Excel file matching the template structure
3. **Data Processing Pipeline** (modules/pipeline.py) - Core processing logic:
   - Validates required columns
   - Standardizes organization names and counts concepts per organization (modules/org_count.py)
   - Cleans text fields (modules/utils.py)
   - Runs inference through 10 different ML models sequentially
   - Calculates scores and generates filtering recommendations
4. **Output Generation** - Produces a downloadable Excel file with the analysis results

### ML Model Pipeline
The app uses **10 classification models** loaded from Hugging Face:
- **SetFit models (6)**: scope_lab1, scope_lab2, tech_lab1, tech_lab3, fin_lab2, bar_lab2
- **Transformer pipelines (4)**: ADAPMIT_SCOPE, ADAPMIT_TECH, SECTOR (multilabel), LANG (language detection)

Models are loaded from different HF profiles:
- `mtyrrell/classifier_SF_*` - SetFit models
- `GIZ/ADAPMIT-multilabel-bge_f` - Adaptation vs. Mitigation classification
- `GIZ/SECTOR-multilabel-bge_f` - Sector classification
- `qanastek/51-languages-classifier` - Language detection

### Required Excel Template Columns
The input Excel file must contain these columns (ref. pipeline.py:113-124):
- `id` - Application identifier
- `organization` - Organization name
- `scope` - Scope description text
- `technology` - Technology description text
- `financial` - Financial information text
- `barrier` - Barrier description text
- `maf_funding_requested` - Requested funding amount
- `contributions_public_sector` - Public sector contributions
- `contributions_private_sector` - Private sector contributions
- `contributions_other` - Other contributions
- `mitigation potential` - Mitigation potential text

### Scoring and Filtering Logic
The application calculates a predicted score (0-10) based on:
- Individual model predictions (6 binary classifiers)
- Leverage calculations (lev_gt_0, lev_maf_scale)
- Formula: `(fin_lab2*2 + scope_lab1*2 + scope_lab2*2 + tech_lab1 + tech_lab3 + bar_lab2 + lev_gt_0 + lev_maf_scale) / 11 * 10`
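The formula can be sketched directly (a minimal illustration of the weighting, not the pipeline.py implementation; inputs are the 0/1 classifier outputs and leverage flags):

```python
def pred_score(fin_lab2, scope_lab1, scope_lab2, tech_lab1,
               tech_lab3, bar_lab2, lev_gt_0, lev_maf_scale):
    # Weighted sum of 8 binary signals, normalized by the total weight (11)
    # and scaled to a 0-10 range. fin_lab2 and the two scope labels count double.
    total = (fin_lab2 * 2 + scope_lab1 * 2 + scope_lab2 * 2
             + tech_lab1 + tech_lab3 + bar_lab2 + lev_gt_0 + lev_maf_scale)
    return total / 11 * 10

print(pred_score(1, 1, 1, 1, 1, 1, 1, 1))  # all positive -> 10.0
print(pred_score(0, 0, 0, 0, 0, 0, 0, 0))  # all negative -> 0.0
```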

Applications are labeled with one of four actions (pipeline.py:269-278):
- **INELIGIBLE** - Non-English, >6 concepts from the same org, Adaptation (not Mitigation), wrong sector, or insufficient text length
- **REJECT** - pred_score ≤ sensitivity level
- **PRE-ASSESSMENT** - pred_score in [sensitivity+1, sensitivity+2]
- **FULL-ASSESSMENT** - pred_score > sensitivity+2

Sensitivity levels (app.py:81-98):
- Low: 4 (fewer false negatives, ~6%)
- Medium: 5
- High: 6 (more false negatives, ~13%)
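Put together, the thresholds above amount to the following decision rule (a sketch under the stated rules; the INELIGIBLE checks are assumed to run first and are reduced here to a single flag):

```python
def label_action(pred_score: float, sens_level: int, ineligible: bool = False) -> str:
    # sens_level is 4 (Low), 5 (Medium) or 6 (High)
    if ineligible:                      # non-English, >6 concepts, wrong sector, ...
        return "INELIGIBLE"
    if pred_score <= sens_level:
        return "REJECT"
    if pred_score <= sens_level + 2:    # i.e. in [sens_level+1, sens_level+2]
        return "PRE-ASSESSMENT"
    return "FULL-ASSESSMENT"

print(label_action(4, 6))   # REJECT at High sensitivity
print(label_action(8, 6))   # PRE-ASSESSMENT
print(label_action(9, 6))   # FULL-ASSESSMENT
```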

### Key Modules

**modules/pipeline.py**
- `process_data(uploaded_file, sens_level)` - Main processing function
- `predict_category(df, model_name, progress_bar, repo, profile, multilabel)` - Model inference wrapper
- Handles both SetFit and Transformer pipeline models with different text truncation strategies

**modules/utils.py**
- `create_excel()` - Generates a template Excel file for download
- `clean_text(input_text)` - Removes special characters and normalizes whitespace
- `extract_predicted_labels(output, ordinal_selection, threshold)` - Extracts the top-2 labels from multilabel output

**modules/org_count.py**
- `standardize_organization_names(df)` - Normalizes org names using exact matching, abbreviations, and fuzzy matching (threshold=90)
- Adds `org_renamed` and `concept_count` columns to track multiple concepts from the same organization

**modules/auth.py**
- `validate_login(username, password)` - Validates against bcrypt hashes stored in environment variables

**modules/logging_config.py**
- `setup_logging()` - Configures a rotating file handler (logs/app.log, 1 MB max, 5 backups)
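`clean_text` is described but its source is not shown in this commit; a hypothetical stdlib-`re` equivalent (the exact character set kept by the real implementation may differ):

```python
import re

def clean_text(input_text: str) -> str:
    # Drop everything except word characters, whitespace, and basic punctuation,
    # then collapse runs of whitespace into single spaces.
    cleaned = re.sub(r"[^\w\s.,;:()%/-]", " ", str(input_text))
    return re.sub(r"\s+", " ", cleaned).strip()

print(clean_text("Solar\tPV –  50MW\n(phase 1)"))  # -> "Solar PV 50MW (phase 1)"
```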

### Session State Management
The app uses Streamlit session state to manage:
- `authenticated` - Login status
- `data_processed` - Whether processing is complete
- `df` - Processed dataframe
- `show_button` - Controls "Start Analysis" button visibility
- `processing` - Tracks processing state

## Important Notes

- **Processing Time**: Approximately 5 minutes for 1000 applications (ref. app.py:66)
- **Model Truncation**: All text is truncated to a maximum of 512 tokens (pipeline.py:76-89)
- **Device Selection**: Automatically selects CUDA > MPS > CPU (pipeline.py:51)
- **Column Validation**: The app errors out if required columns are missing from the uploaded Excel file
- **Organization Duplicates**: The system tracks concepts per organization and marks applications INELIGIBLE if there are >6 concepts from the same org
Dockerfile
ADDED
@@ -0,0 +1,40 @@

```dockerfile
# NVIDIA CUDA base image for GPU support
FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Environment variables to prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive

# curl is included because the HEALTHCHECK below requires it
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    python3.10-dev \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
USER user
ENV HOME=/home/user \
    PATH=/home/user/.local/bin:$PATH
WORKDIR $HOME/app

RUN pip install --no-cache-dir --upgrade pip

COPY --chown=user requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY --chown=user . .

# Create logs directory with proper permissions
USER root
RUN mkdir -p logs && chown -R user:user logs
USER user

# Expose Streamlit default port
EXPOSE 8501

# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1

# Run app
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true", "--server.fileWatcherType=none", "--server.enableXsrfProtection=false", "--server.enableCORS=false"]
```
README.md
ADDED
@@ -0,0 +1,11 @@

```
---
title: Prefilter App
emoji: 🦀
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 8501
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
```
__pycache__/app.cpython-311.pyc
ADDED
Binary file (13.5 kB)
app.py
ADDED
@@ -0,0 +1,243 @@

```python
import torch
try:
    print(f"Is CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        try:
            print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
        except Exception as e:
            print(f"Error getting CUDA device name: {str(e)}")
    else:
        print("No CUDA device available - using CPU")
except Exception as e:
    print(f"Error checking CUDA availability: {str(e)}")
    print("Continuing with CPU...")

import streamlit as st
import os
from huggingface_hub import login
from datetime import datetime
from openai import OpenAI
from modules.auth import validate_login
from modules.utils import create_excel, setup_logging, getconfig
from modules.pipeline import process_data

setup_logging()
import logging
from io import BytesIO

logger = logging.getLogger(__name__)

# Local
# from dotenv import load_dotenv
# load_dotenv()

config = getconfig("config.cfg")

@st.cache_resource
def get_azure_openai_client():
    """Initialize and cache Azure OpenAI client for the session"""
    try:
        AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
        AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")
        AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")

        if not all([AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_API_KEY]):
            raise ValueError("Missing required Azure OpenAI environment variables. Please check your .env file.")

        client = OpenAI(api_key=AZURE_OPENAI_API_KEY, base_url=AZURE_OPENAI_ENDPOINT)
        logger.info("Azure OpenAI client initialized successfully")
        return client
    except Exception as e:
        logger.error(f"Failed to initialize Azure OpenAI client: {str(e)}")
        raise


def get_azure_deployment():
    """Get Azure OpenAI deployment name from config file"""
    try:
        config = getconfig("config.cfg")
        deployment = config.get("deployments", "DEPLOYMENT")
        logger.info(f"Using Azure OpenAI deployment: {deployment}")
        return deployment
    except Exception as e:
        logger.error(f"Failed to read deployment from config: {str(e)}. Using default deployment.")
        return "gpt-4o-mini"


# Main app logic
def main():
    # Temporarily set authentication to True for testing
    if 'authenticated' not in st.session_state:
        st.session_state['authenticated'] = False

    if st.session_state['authenticated']:
        # Remove login success message for testing
        hf_token = os.environ["HF_TOKEN"]
        login(token=hf_token, add_to_git_credential=True)

        # Initialize session state variables
        if 'data_processed' not in st.session_state:
            st.session_state['data_processed'] = False
            st.session_state['df'] = None

        # Main Streamlit app
        st.title('Application Pre-Filtering Tool')

        # Sidebar (filters)
        with st.sidebar:
            with st.expander("ℹ️ - Instructions", expanded=False):
                st.markdown(
                    """
                    1. **Download the Excel Template file (below)**
                    2. **[OPTIONAL]: Select the desired filtering sensitivity level (below)**
                    3. **Copy/paste the requisite application data into the template file. Best practice is to 'paste as values'**
                    4. **Upload the template file in the area to the right (or click browse files)**
                    5. **Click 'Start Analysis'**

                    The tool will start processing the uploaded application data. This can take some time
                    depending on the number of applications and the length of text in each. For example, a file with 1000 applications
                    could be expected to take approximately 5 minutes.

                    ***NOTE** - you can also simply rename the column headers in your own file. The headers must match the column names in the template for the tool to run properly.*
                    """
                )
            # Excel file download
            st.download_button(
                label="Download Excel Template",
                data=create_excel(),
                file_name="upload_template.xlsx",
                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
            )

            # Get sensitivity level for use in review/reject (ref. process_data function)
            sens_options = {
                "Low": 4,
                "Medium": 5,
                "High": 6,
            }

            sens_input = st.sidebar.radio(
                label='Select the Sensitivity Level [OPTIONAL]',
                help='Decreasing the level of sensitivity results in fewer '
                     'applications filtered out. This also '
                     'reduces the probability of false negatives (FNs). The rate of '
                     'FNs at the lowest setting is approximately 6 percent, and '
                     'approaches 13 percent at the highest setting. '
                     'NOTE: changing this setting does not affect the raw data in the CSV output file (only the labels)',
                options=list(sens_options.keys()),
                index=list(sens_options.keys()).index("High"),
                horizontal=False)

            sens_level = sens_options[sens_input]

            with st.expander("ℹ️ - About this app", expanded=False):
                st.write(
                    """
                    This tool provides an interface for running an automated preliminary assessment of applications for a call for applications.

                    The tool functions by running selected text fields from the application through a series of LLMs fine-tuned for text classification (ref. diagram below).
                    The resulting output classifications are used to compute a score and a suggested pre-filtering action. The tool has been tested against
                    human assessors and exhibits an extremely low false negative rate (<6%) at a Sensitivity Level of 'Low' (i.e. a rejection threshold for predicted score < 4).
                    """)
                st.image('images/pipeline.png')

        uploaded_file = st.file_uploader("Select a file containing application pre-filtering data (see instructions in the sidebar)")

        # Add session state variables if they don't exist
        if 'show_button' not in st.session_state:
            st.session_state['show_button'] = True
        if 'processing' not in st.session_state:
            st.session_state['processing'] = False
        if 'data_processed' not in st.session_state:
            st.session_state['data_processed'] = False

        # Only show the button if show_button is True, a file is uploaded, and we're not processing
        if uploaded_file is not None and st.session_state['show_button'] and not st.session_state['processing']:
            if st.button("Start Analysis", key="start_analysis"):
                st.session_state['show_button'] = False
                st.session_state['processing'] = True
                st.rerun()

        # If we're processing, show the processing logic
        if st.session_state['processing']:
            try:
                logger.info(f"File uploaded: {uploaded_file.name}")

                if not st.session_state['data_processed']:
                    logger.info("Starting data processing")
                    try:
                        # Initialize Azure OpenAI client and get deployment name
                        azure_client = get_azure_openai_client()
                        azure_deployment = get_azure_deployment()

                        st.session_state['df'] = process_data(
                            uploaded_file,
                            sens_level,
                            azure_client,
                            azure_deployment
                        )
                        logger.info("Data processing completed successfully")
                        st.session_state['data_processed'] = True
                    except ValueError as e:
                        # Handle specific validation errors
                        logger.error(f"Validation error: {str(e)}")
                        st.error(str(e))
                        st.session_state['show_button'] = True
                        st.session_state['processing'] = False
                        st.rerun()
                    except Exception as e:
                        # Handle other unexpected errors
                        logger.error(f"Error in process_data: {str(e)}")
                        st.error("An unexpected error occurred. Please check your input file and try again.")
                        st.session_state['show_button'] = True
                        st.session_state['processing'] = False
                        st.rerun()

                df = st.session_state['df']

                def reset_button_state():
                    st.session_state['show_button'] = True
                    st.session_state['processing'] = False
                    st.session_state['data_processed'] = False

                # Create Excel buffer
                excel_buffer = BytesIO()
                df.to_excel(excel_buffer, index=False, engine='openpyxl')
                excel_buffer.seek(0)

                current_datetime = datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
                output_filename = f'processed_applications_{current_datetime}.xlsx'

                st.download_button(
                    label="Download Analysis Data File",
                    data=excel_buffer,
                    file_name=output_filename,
                    mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
                    on_click=reset_button_state
                )

            except Exception as e:
                logger.error(f"Error processing file: {str(e)}")
                st.error("Failed to process the file. Please ensure your column names match the template file.")
                st.session_state['show_button'] = True
                st.session_state['processing'] = False
                st.rerun()

    # Comment out for testing
    else:
        username = st.text_input("Username")
        password = st.text_input("Password", type="password")
        if st.button("Login"):
            if validate_login(username, password):
                st.session_state['authenticated'] = True
                st.rerun()
            else:
                st.error("Incorrect username or password")


main()
```
config.cfg
ADDED
@@ -0,0 +1,2 @@

```ini
[deployments]
DEPLOYMENT=gpt-5-mini
```
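`getconfig` (imported from modules/utils.py by app.py) is not shown in this commit; a minimal sketch of how such a helper could read this INI-style file with the stdlib `configparser` (the function name and behavior are assumptions):

```python
import configparser

def getconfig(path: str) -> configparser.ConfigParser:
    # Read an INI-style config file; values are then available via
    # config.get("section", "OPTION") - option lookups are case-insensitive.
    config = configparser.ConfigParser()
    config.read(path)
    return config

# Usage with the config.cfg contents above:
# deployment = getconfig("config.cfg").get("deployments", "DEPLOYMENT")
```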
images/pipeline.png
ADDED
Binary file (stored with Git LFS)
modules/auth.py
ADDED
@@ -0,0 +1,13 @@

```python
import os
import bcrypt

# Helper functions
def check_password(provided_password, stored_hash):
    return bcrypt.checkpw(provided_password.encode(), stored_hash)

def validate_login(username, password):
    # Retrieve the user's hashed password from environment variables
    user_hash = os.getenv(username.upper() + '_HASH')  # Assumes an env var like 'USER1_HASH'
    if user_hash:
        return check_password(password, user_hash.encode())
    return False
```
modules/llm.py
ADDED
@@ -0,0 +1,90 @@

```python
# Helper functions for pipeline
from datetime import datetime, timedelta
from collections import defaultdict, namedtuple, Counter
from typing import List, Dict, Any
import torch
import logging
from transformers import pipeline
from modules.utils import setup_logging
from modules.prompts import prompt_concept
from modules.models import ConceptClassify
from openai import OpenAI

logger = setup_logging()

def call_structured(client: OpenAI, deployment: str, system_prompt: str, user_prompt: str,
                    response_model: type,
                    logger: logging.Logger) -> Dict[str, Any]:
    """Call Azure OpenAI with structured output"""
    # NOTE: the system_prompt argument is overwritten here
    system_prompt = "You are assessing grant applications for an open funding call."
    try:
        if deployment in ['o4-mini', 'o3', "gpt-5", "gpt-5-mini", "gpt-5-nano"]:
            # Reasoning models take a reasoning-effort setting and no temperature
            response = client.responses.parse(
                model=deployment,
                reasoning={"effort": "low"},
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                text_format=response_model)
        else:
            response = client.responses.parse(
                model=deployment,
                input=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=0,
                text_format=response_model)

        result = response.output_parsed

        # Return the parsed data as a plain dict
        return result.model_dump()

    except Exception as e:
        logger.error(f"Error calling Azure OpenAI for {response_model.__name__}: {e}")
        # Return None so callers can handle the failure
        return None


# Not used - results sucked
# def check_duplicate_concepts(client, deployment, concept_id: str, organization: str, concept_profile: str, df) -> bool:
#     """
#     Check for duplicate concepts within the same organization using Azure OpenAI
#
#     Args:
#         client: AzureOpenAI client instance
#         deployment: Azure OpenAI deployment name
#         concept_id: ID of the current concept being checked
#         organization: Organization name
#         concept_profile: Text description of the concept to check
#         df: DataFrame containing all application data
#
#     Returns:
#         Boolean classification result
#     """
#
#     # Remove current concept from the dataframe
#     df_check = df[df['id'] != concept_id].copy()
#
#     # Get other concepts from the same organization
#     org_concepts = df_check[df_check['org_renamed'] == organization]
#     other_concepts = org_concepts['scope_txt'].tolist()
#
#     # If no other concepts from this organization, return False
#     if len(other_concepts) == 0:
#         return False
#
#     logger.info(f"Checking duplicates for concept ID {concept_id} from organization {organization} against {len(other_concepts)} other concept(s).")
#     logger.info(f"Scope text {concept_profile}")
#     # Construct prompt
#     prompt = prompt_concept(concept_profile, other_concepts)
#
#     response = call_structured(client, deployment, prompt, concept_profile, ConceptClassify, logger)
#
#     check = response['classification']
#     logger.info(f"Duplicate check response for concept ID {concept_id}: {check}")
#     if check == "YES":
#         return True
#     return False
```
modules/models.py
ADDED
@@ -0,0 +1,8 @@

```python
from typing import Dict, Any, List, Optional, Literal
from pydantic import BaseModel, Field

#===================== Duplicate concepts =====================

class ConceptClassify(BaseModel):
    classification: Literal["YES", "NO", "UNCERTAIN"] = Field(description="Is the concept duplicated in other applications? (yes/no/uncertain)")
```
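A quick check of how this schema behaves: pydantic rejects values outside the `Literal`, and `model_dump()` returns the plain dict that `call_structured` in modules/llm.py passes back (the snippet restates the model so it is self-contained):

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class ConceptClassify(BaseModel):
    classification: Literal["YES", "NO", "UNCERTAIN"] = Field(
        description="Is the concept duplicated in other applications?")

result = ConceptClassify(classification="YES").model_dump()
print(result)  # {'classification': 'YES'}

try:
    ConceptClassify(classification="maybe")  # not in the Literal -> rejected
except ValidationError:
    print("invalid value rejected")
```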
modules/org_count.py
ADDED
|
@@ -0,0 +1,232 @@
import pandas as pd
from thefuzz import fuzz
import logging
import re

logger = logging.getLogger(__name__)


def standardize_organization_names(df):
    """
    Standardizes organization names in a DataFrame using exact matches, abbreviations, and fuzzy matching.

    Args:
        df (pd.DataFrame): DataFrame containing an 'organization' column

    Returns:
        pd.DataFrame: DataFrame with added 'org_renamed' and 'concept_count' columns
    """
    # Make a copy to avoid modifying the original DataFrame
    df = df.copy()

    # Sort DataFrame by 'id' column in ascending order
    df = df.sort_values('id', ascending=True)

    # Return DataFrame as-is if 'organization' column is not present
    if 'organization' not in df.columns:
        logger.warning("No 'organization' column found in DataFrame. Returning DataFrame as-is.")

    else:
        logger.info("Checking org names")
        # Dictionary of organization variations and their standardized names
        # Cleaned up to remove leading/trailing spaces for consistency
        org_variations = {
            'Adventist Development Relief Agency': ['adventist development'],
            'Asian Development Bank': ['asian development bank'],
            'Association of the Regional Mechanism for Emissions Reductions of Boyacá, Colombia (MRRE)': ['regional mechanism for emissions reductions of boyacá'],
            'BioCarbon Partners (BCP)': ['biocarbon partners'],
            'Biothermica Technologies Inc': ['biothermica tech'],
            'Brazilian Tourist Board': ['brazilian tourist board'],
            'Caribbean Community Climate Change Centre': ['caribbean community climate'],
            'Caritas': ['caritas'],
            'Chemical Industries Holding Company': ['chemical industries holding company'],
            'Climate Advocacy International (CAI)': ['climate advocacy int'],
            'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['deutsche gesellschaft für internationale'],
            'Deutsche Sparkassenstiftung (DSIK)': ['deutsche sparkassenstiftung'],
            'Development Initiative for Community Impact (DICI)': ['development initiative for community impact'],
            'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['east african centre of excellence for renewable'],
            'Eco-Ideal': ['eco-ideal'],
            'Electricité de France (EDF)': ['electricité de france', 'edf international networks'],
            'The Energy and Resources Institute (TERI)': ['energy and resources institute'],
            'Environmental Defense Fund (EDF)': ['environmental defense fund'],
            'Food and Agriculture Organization (FAO)': ['food and agriculture organization'],
            'Global Green Growth Institute (GGGI)': ['global green growth'],
            'International Finance Corporation (IFC)': ['international finance corporation'],
            "International Organization for Migration (IOM)": ['international organization for migration'],
            'Inter-American Development Bank (IDB)': ['american development bank'],
            'Iskandar Regional Development Authority (IRDA)': ['iskandar regional'],
            'Islamic Development Bank': ['islamic development bank'],
            'Malaysian Industry Government Group for High Technology (MIGHT)': ['government group for high technology'],
            'Metallurgical Industries Holding Company': ['metallurgical industries holding company'],
            'MicroSave Consulting (MSC)': ['microsave consulting'],
            'Osh Technological University': ['osh technological university', 'ошский технологический университет'],
            'Oxford Policy Management (OPM)': ['oxford policy management'],
            'Pacific Rim Investment Management': ['pacific rim investment'],
            'Palestinian Energy and Natural Resources Authority (PENRA)': ['palestinian energy and natural'],
            'Rwanda Energy Group (REG) Ltd': ['rwanda energy group'],
            'Rocky Mountain Institute (RMI)': ['rocky mountain institute'],
            'Secretariat of the Pacific Regional Environment Programme (SPREP)': ['secretariat of the pacific regional environment programme (sprep)'],
            'Serviço Nacional de Aprendizagem Industrial (SENAI)': ['serviço nacional de aprendizagem'],
            'Sumy City Council': ['sumy city council'],
            # 'Tajik Technical University': ['tajik technical university'],
            'Uganda Development Bank Limited (UDBL)': ['uganda development bank'],
            'United Nations Human Settlement Programme (UN-Habitat)': ['united nations human settlement', 'un-habitat'],
            'United Nations Children\'s Fund (UNICEF)': ['united nations children'],
            'United Nations Conference on Trade and Development (UNCTAD)': ['united nations conference on trade'],
            'United Nations Development Programme (UNDP)': ['united nations development program'],
            'United Nations Economic and Social Commission (ECOSOC)': ['united nations economic and social'],
            'United Nations Environment Programme (UNEP)': ['united nations environment'],
            'United Nations High Commissioner for Refugees (UNHCR)': ['high commissioner for refugees'],
            'United Nations Industrial Development Organization (UNIDO)': ['united nations industrial'],
            'United Nations Office for Project Services (UNOPS)': ['united nations office for project'],
            'World Food Programme (WFP)': ['world food program'],
            'World Health Organization (WHO)': ['world health organization'],
            'World Resources Institute (WRI)': ['world resources institute'],
            'World Wide Fund for Nature (WWF)': ['world wildlife', 'world wide fund for nature'],
        }
        # Dictionary of organization abbreviations
        org_abreviations = {
            'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['GIZ'],
            'Deutsche Sparkassenstiftung (DSIK)': ['DSIK'],
            'Development Initiative for Community Impact (DICI)': ['DICI'],
            'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['EACREEE'],
            'Food and Agriculture Organization (FAO)': ['FAO'],
            'Global Green Growth Institute (GGGI)': ['GGGI'],
            'International Finance Corporation (IFC)': ['IFC'],
            'International Organization for Migration (IOM)': ['IOM'],
            'Inter-American Development Bank (IDB)': ['IDB'],
            'United Nations Children\'s Fund (UNICEF)': ['UNICEF'],
            'United Nations Conference on Trade and Development (UNCTAD)': ['UNCTAD'],
            'United Nations Development Programme (UNDP)': ['UNDP'],
            'United Nations Economic and Social Commission (ECOSOC)': ['ECOSOC'],
            'United Nations Environment Programme (UNEP)': ['UNEP'],
            'United Nations Industrial Development Organization (UNIDO)': ['UNIDO'],
            'United Nations High Commissioner for Refugees (UNHCR)': ['UNHCR'],
            'United Nations Office for Project Services (UNOPS)': ['UNOPS'],
            'World Food Programme (WFP)': ['WFP'],
            'World Health Organization (WHO)': ['WHO'],
            'World Resources Institute (WRI)': ['WRI'],
            'World Wide Fund for Nature (WWF)': ['WWF']
        }
        # Initialize result column
        df['org_renamed'] = None

        # Step 1: Process abbreviations first (highest priority for exact acronyms)
        # Use case-insensitive matching to catch "giz", "GIZ", "Giz", etc.
        logger.info("Processing abbreviation matches")
        for standard_name, abreviations in org_abreviations.items():
            for abreviation in abreviations:
                # Case-insensitive matching with word boundaries to avoid partial matches
                pattern = r'\b' + re.escape(abreviation) + r'\b'
                mask = df['organization'].str.contains(pattern, case=False, regex=True, na=False) & df['org_renamed'].isna()
                df.loc[mask, 'org_renamed'] = standard_name

        # Step 2: Process substring variations (e.g., "adventist development" in full org name)
        # Use improved substring matching to reduce false positives
        logger.info("Processing variation matches")
        for standard_name, variations in org_variations.items():
            for var in variations:
                # Check if already matched by abbreviation
                mask = df['org_renamed'].isna()
                if mask.sum() == 0:
                    break
                # Use simple substring matching (case-insensitive)
                # Note: Not using word boundaries here as variations are often partial phrases
                org_lower = df.loc[mask, 'organization'].str.lower()
                submask = org_lower.str.contains(re.escape(var), regex=True, na=False)
                df.loc[mask & submask, 'org_renamed'] = standard_name
        # Step 3: Process fuzzy matches against dictionary for remaining unmatched organizations
        unmatched_mask = df['org_renamed'].isna()
        if unmatched_mask.sum() > 0:
            logger.info(f"Processing fuzzy matches against dictionary for {unmatched_mask.sum()} unmatched organizations")

            # Get unique unmatched organization names to avoid duplicate processing
            unique_unmatched = df.loc[unmatched_mask, 'organization'].unique()
            threshold = 85  # token_set_ratio handles extra tokens (e.g., country names) better

            # Create mapping dictionary for unique unmatched orgs
            org_mapping = {}
            for org in unique_unmatched:
                org_lower = str(org).lower()
                best_match = None
                highest_ratio = 0

                # Check against all variations and standard names
                for standard_name, variations in org_variations.items():
                    all_forms = [standard_name.lower()] + variations
                    for variant in all_forms:
                        # Use token_set_ratio for better matching with extra tokens/geographic qualifiers
                        ratio = fuzz.token_set_ratio(org_lower, variant)
                        if ratio > threshold and ratio > highest_ratio:
                            highest_ratio = ratio
                            best_match = standard_name

                if best_match:
                    org_mapping[org] = best_match
                    logger.debug(f"Fuzzy matched '{org}' to '{best_match}' (score: {highest_ratio})")

            # Apply the mapping to all matching rows
            if org_mapping:
                for original_org, standard_org in org_mapping.items():
                    mask = (df['organization'] == original_org) & df['org_renamed'].isna()
                    df.loc[mask, 'org_renamed'] = standard_org
        # Step 4: Dynamic fuzzy matching of remaining unmatched organizations
        # Compare new unmatched orgs against previously stored unmatched orgs
        unmatched_mask = df['org_renamed'].isna()
        if unmatched_mask.sum() > 0:
            logger.info(f"Processing dynamic fuzzy matching for {unmatched_mask.sum()} remaining unmatched organizations")

            # Get unmatched organizations in order of first appearance
            unmatched_df = df[unmatched_mask].copy()
            unmatched_df = unmatched_df.sort_values('id')
            unique_unmatched_ordered = unmatched_df['organization'].drop_duplicates().tolist()

            threshold = 95  # Higher threshold for dynamic matching to avoid false positives
            stored_orgs = []  # List of previously seen unmatched orgs (becomes canonical)
            org_mapping = {}  # Maps original org name to stored canonical org

            for org in unique_unmatched_ordered:
                org_lower = str(org).lower()
                best_match = None
                highest_ratio = 0

                # Compare against all previously stored unmatched organizations
                for stored_org in stored_orgs:
                    stored_org_lower = str(stored_org).lower()
                    ratio = fuzz.token_set_ratio(org_lower, stored_org_lower)

                    if ratio > threshold and ratio > highest_ratio:
                        highest_ratio = ratio
                        best_match = stored_org

                if best_match:
                    # Match found - map to the stored org
                    org_mapping[org] = best_match
                    logger.debug(f"Dynamically matched '{org}' to '{best_match}' (score: {highest_ratio})")
                else:
                    # No match - store this org for future comparisons
                    stored_orgs.append(org)
                    org_mapping[org] = org

            # Apply the mapping to all matching rows
            for original_org, canonical_org in org_mapping.items():
                mask = (df['organization'] == original_org) & df['org_renamed'].isna()
                df.loc[mask, 'org_renamed'] = canonical_org

        # Fill remaining empty values with original organization names
        df.loc[df['org_renamed'].isna(), 'org_renamed'] = df.loc[df['org_renamed'].isna(), 'organization']

        # Add concept count
        df['concept_count'] = df.groupby('org_renamed').cumcount() + 1

        # Reorder columns with id, organization, org_renamed, concept_count first, followed by all others
        cols = ['id', 'organization', 'org_renamed', 'concept_count']
        other_cols = [col for col in df.columns if col not in cols]
        df = df[cols + other_cols]

    return df
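The matching cascade in `standardize_organization_names` (substring variations first, then fuzzy matching, then fall back to the original name) can be sketched without pandas or thefuzz. In this hypothetical `match_org` helper, `difflib.SequenceMatcher` stands in for `fuzz.token_set_ratio`, so the scores differ, but the control flow is the same:

```python
import re
from difflib import SequenceMatcher

def match_org(name, org_variations, threshold=0.85):
    # Step 1: case-insensitive substring match against known variations
    lowered = str(name).lower()
    for standard_name, variations in org_variations.items():
        for var in variations:
            if re.search(re.escape(var), lowered):
                return standard_name
    # Step 2: fuzzy match against standard names and variations
    # (SequenceMatcher is a stdlib stand-in for thefuzz.token_set_ratio)
    best_match, highest_ratio = None, 0.0
    for standard_name, variations in org_variations.items():
        for variant in [standard_name.lower()] + variations:
            ratio = SequenceMatcher(None, lowered, variant).ratio()
            if ratio > threshold and ratio > highest_ratio:
                best_match, highest_ratio = standard_name, ratio
    # Step 3: keep the original name if nothing matched
    return best_match if best_match else name

variations = {'World Health Organization (WHO)': ['world health organization']}
print(match_org('World Health Organization - Kenya Office', variations))
# World Health Organization (WHO)
```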
modules/pipeline.py
ADDED
@@ -0,0 +1,339 @@
import re
import time
import pandas as pd
from io import BytesIO
import streamlit as st
import torch
from setfit import SetFitModel
from transformers import pipeline
from openpyxl import Workbook
from openpyxl.styles import Font, NamedStyle, PatternFill
from openpyxl.styles.differential import DifferentialStyle
from modules.org_count import standardize_organization_names
from modules.utils import clean_text, extract_predicted_labels
# from modules.llm import check_duplicate_concepts
from modules.semantic_similarity import check_duplicate_concepts_semantic
from sentence_transformers import SentenceTransformer
import logging

logger = logging.getLogger(__name__)

# Function for extracting classifications for each SECTOR label
# (note: this local definition shadows the import from modules.utils)
def extract_predicted_labels(output, ordinal_selection=1, threshold=0.5):

    # verify output is a list of dictionaries
    if isinstance(output, list) and all(isinstance(item, dict) for item in output):
        # filter items with scores above the threshold
        filtered_items = [item for item in output if item.get('score', 0) > threshold]

        # sort the filtered items by score in descending order
        sorted_items = sorted(filtered_items, key=lambda x: x.get('score', 0), reverse=True)

        # extract the highest and second-highest labels
        if len(sorted_items) >= 2:
            highest_label = sorted_items[0].get('label')
            second_highest_label = sorted_items[1].get('label')
        elif len(sorted_items) == 1:
            highest_label = sorted_items[0].get('label')
            second_highest_label = None
        else:
            print("Warning: Less than two items above the threshold in the current list.")
            highest_label = None
            second_highest_label = None
    else:
        print("Error: Inner data is not formatted correctly. Each item must be a dictionary.")
        highest_label = None
        second_highest_label = None

    # Output dictionary of highest and second-highest labels to the all_predicted_labels list
    predicted_labels = {"SECTOR1": highest_label, "SECTOR2": second_highest_label}
    return predicted_labels
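Because `extract_predicted_labels` is pure Python, it can be checked in isolation. A condensed copy (warning prints omitted) exercised with made-up classifier output:

```python
def extract_predicted_labels(output, ordinal_selection=1, threshold=0.5):
    # Condensed copy of the function above (warning prints omitted)
    if isinstance(output, list) and all(isinstance(item, dict) for item in output):
        filtered_items = [item for item in output if item.get('score', 0) > threshold]
        sorted_items = sorted(filtered_items, key=lambda x: x.get('score', 0), reverse=True)
        highest_label = sorted_items[0].get('label') if len(sorted_items) >= 1 else None
        second_highest_label = sorted_items[1].get('label') if len(sorted_items) >= 2 else None
    else:
        highest_label = None
        second_highest_label = None
    return {"SECTOR1": highest_label, "SECTOR2": second_highest_label}

# Sample multi-label classifier output (scores are made up)
scores = [
    {'label': 'Energy', 'score': 0.91},
    {'label': 'Transport', 'score': 0.62},
    {'label': 'Health', 'score': 0.12},  # below the 0.5 threshold, dropped
]
print(extract_predicted_labels(scores))
# {'SECTOR1': 'Energy', 'SECTOR2': 'Transport'}
```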
# Function to call model and run inference for varying classification tasks/models
def predict_category(df, model_name, progress_bar, repo, profile, multilabel=False):
    device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_built() else torch.device("cpu"))
    model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']

    # Model configuration mapping
    model_config = {
        'ADAPMIT_TECH': {'col_name': 'tech_txt', 'top_k': 1},
        'ADAPMIT_SCOPE': {'col_name': 'scope_txt', 'top_k': 1},
        'LANG': {'col_name': 'scope_txt', 'top_k': 1},
        'default': {'col_name': 'scope_txt', 'top_k': None}
    }

    if model_name in model_names_sf:
        col_name = re.sub(r'_(.*)', r'_txt', model_name)
        model = SetFitModel.from_pretrained(profile + "/" + repo)
        model.to(device)
        # Get tokenizer from the model
        tokenizer = model.model_body.tokenizer
    else:
        # Get configuration for the model, falling back to default if not specified
        config = model_config.get(model_name, model_config['default'])
        col_name = config['col_name']
        model = pipeline("text-classification",
                         model=profile + "/" + repo,
                         device=device,
                         top_k=config['top_k'],
                         truncation=True,
                         max_length=512)

    predictions = []
    # probabilities = []
    total = len(df)
    for i, text in enumerate(df[col_name]):
        try:
            if model_name in model_names_sf:
                # Truncate text for SetFit models
                encoded = tokenizer(text, truncation=True, max_length=512)
                truncated_text = tokenizer.decode(encoded['input_ids'])
                prediction = model(truncated_text)
                predictions.append(0 if prediction == 'NEGATIVE' else 1)
            else:
                prediction = model(text)
                if model_name == 'ADAPMIT_SCOPE' or model_name == 'ADAPMIT_TECH':
                    predictions.append(re.sub('Label$', '', prediction[0][0]['label']))
                elif model_name == 'SECTOR':
                    predictions.append(extract_predicted_labels(prediction[0], threshold=0.5))
                elif model_name == 'LANG':
                    predictions.append(prediction[0][0]['label'])
        except Exception as e:
            logger.error(f"Error processing sample {df['id'][i]}: {str(e)}")
            st.error("Application Error. Please contact support.")
        # Update progress bar with each iteration
        progress = (i + 1) / total
        progress_bar.progress(progress)

    return predictions
# Main function to process data
def process_data(uploaded_file, sens_level, azure_client, azure_deployment):
    """
    Process uploaded application data through ML pipeline

    Args:
        uploaded_file: Excel file containing application data
        sens_level: Sensitivity level for filtering (4=Low, 5=Medium, 6=High)
        azure_client: AzureOpenAI client instance for LLM calls
        azure_deployment: Azure OpenAI deployment name

    Returns:
        Processed DataFrame with predictions and scores
    """
    # Define required columns and their mappings
    required_columns = {
        'id': 'id',
        'scope': 'scope_txt',
        'technology': 'tech_txt',
        'financial': 'fin_txt',
        'barrier': 'bar_txt',
        'maf_funding_requested': 'maf_funding',
        'contributions_public_sector': 'cont_public',
        'contributions_private_sector': 'cont_private',
        'contributions_other': 'cont_other',
        'mitigation_potential': 'mitigation_potential'
    }

    # Read the Excel file
    try:
        df = pd.read_excel(uploaded_file)
        logger.info("Data import successful")
        # Clean up organization names
        df = standardize_organization_names(df)

    except Exception as e:
        error_msg = f"Failed to read Excel file: {str(e)}"
        logger.error(error_msg)
        st.error("Failed to read the uploaded file. Please ensure it's a valid Excel file.")
        raise ValueError(error_msg)

    # Validate required columns
    missing_columns = [col for col in required_columns.keys() if col not in df.columns]
    if missing_columns:
        error_msg = f"Missing required columns: {', '.join(missing_columns)}"
        logger.error(error_msg)
        st.error(error_msg)
        raise ValueError(error_msg)

    # Rename required columns while preserving all others
    df = df.rename(columns={k: v for k, v in required_columns.items() if k in df.columns})

    # Clean and process text fields
    df.fillna('', inplace=True)
    df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']] = df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']].applymap(clean_text)

    # Define models and predictions
    model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']
    model_names = model_names_sf + ['ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'SECTOR', 'LANG', 'DUPLICATE_CHECK']
    # model_names_sf = []
    # model_names = ['ADAPMIT_SCOPE', 'ADAPMIT_TECH']
    total_predictions = len(model_names) * len(df)
    progress_count = 0

    # UI setup for progress tracking
    st.subheader("Overall Progress:")
    patience_text = st.empty()
    patience_text.markdown("*You may want to grab a coffee, this can take a while...*")
    overall_progress = st.progress(0)
    overall_start_time = time.time()
    estimated_time_remaining_text = st.empty()

    # Model processing
    step_count = 0
    total_steps = len(model_names)
    for model_name in model_names:
        logger.info(f"Loading: {model_name}")
        step_count += 1
        model_processing_text = st.empty()
        model_processing_text.markdown(f'**Current Task: Processing with model "{model_name}"**')
        model_progress = st.empty()
        progress_bar = model_progress.progress(0)

        # Load the model and run inference
        if model_name in model_names_sf:
            df[model_name] = predict_category(df, model_name, progress_bar, repo='classifier_SF_' + model_name, profile='mtyrrell')
        elif model_name == 'ADAPMIT_SCOPE':
            df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
        elif model_name == 'ADAPMIT_TECH':
            df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
        elif model_name == 'SECTOR':
            sectors_dict = predict_category(df, model_name, progress_bar, repo='SECTOR-multilabel-bge_f', profile='GIZ', multilabel=True)
            df['SECTOR1'] = [item['SECTOR1'] for item in sectors_dict]
            df['SECTOR2'] = [item['SECTOR2'] for item in sectors_dict]
        elif model_name == 'LANG':
            df[model_name] = predict_category(df, model_name, progress_bar, repo='51-languages-classifier', profile='qanastek')
            # df[model_name] = predict_category(df, model_name, progress_bar, repo='xlm-roberta-base-language-detection', profile='papluca')
        elif model_name == 'DUPLICATE_CHECK':
            # Load semantic similarity model for duplicate detection
            device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_built() else torch.device("cpu"))
            logger.info(f"Loading semantic similarity model on device: {device}")
            semantic_model = SentenceTransformer('BAAI/bge-m3', device=device)

            # Process duplicate check with progress tracking
            duplicate_results = []
            total = len(df)
            for i, row in df.iterrows():
                result = check_duplicate_concepts_semantic(
                    semantic_model,
                    row['id'],
                    row['org_renamed'],
                    row['scope_txt'],
                    df
                )
                duplicate_results.append(result)
                # Update progress bar with each iteration
                progress = (i + 1) / total
                progress_bar.progress(progress)
            df['duplicate_check'] = duplicate_results

        logger.info(f"Completed: {model_name}")
        model_progress.empty()

        progress_count += len(df)
        overall_progress_value = progress_count / total_predictions
        overall_progress.progress(overall_progress_value)

        # Calculate and display estimated time remaining
        elapsed_time = time.time() - overall_start_time
        steps_remaining = total_steps - step_count
        if step_count > 1:
            estimated_time_remaining = (elapsed_time / step_count) * steps_remaining
            estimated_time_remaining_text.markdown(
                f"Elapsed time: {elapsed_time:.1f}s. "
                f"Estimated time remaining: {estimated_time_remaining:.1f}s"
                f" (step {step_count+1} of {len(model_names)})"
            )
        else:
            estimated_time_remaining_text.write(f'Calculating time remaining... (step {step_count+1} of {len(model_names)})')

        model_processing_text.empty()

    patience_text.empty()
    estimated_time_remaining_text.empty()

    st.write(f'Processing complete. Total time: {elapsed_time:.1f} seconds')

    # df['ADAPMIT_SCOPE_SCORE'] = df['ADAPMIT_SCOPE'].apply(
    #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
    # )
    # df['ADAPMIT_TECH_SCORE'] = df['ADAPMIT_TECH'].apply(
    #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
    # )

    # # Calculate average mitigation score
    # df['ADAPMIT_SCORE'] = (df['ADAPMIT_SCOPE_SCORE'] + df['ADAPMIT_TECH_SCORE']) / 2

    df['ADAPMIT'] = df.apply(lambda x: 'Adaptation' if x['ADAPMIT_SCOPE'] == 'Adaptation' and x['ADAPMIT_TECH'] == 'Adaptation' else 'Mitigation', axis=1)

    # Convert funding columns to numeric, replacing any non-numeric values with NaN
    df['maf_funding'] = pd.to_numeric(df['maf_funding'], errors='coerce')
    df['cont_public'] = pd.to_numeric(df['cont_public'], errors='coerce')
    df['cont_private'] = pd.to_numeric(df['cont_private'], errors='coerce')
    df['cont_other'] = pd.to_numeric(df['cont_other'], errors='coerce')
    # same for mitigation potential
    df['mitigation_potential'] = abs(pd.to_numeric(df['mitigation_potential'], errors='coerce'))

    # Fill any NaN values with 0
    df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']] = df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']].fillna(0)

    # Get total of all leverage
    df['lev_total'] = df.apply(lambda x: x['cont_public'] + x['cont_private'] + x['cont_other'], axis=1)
    # Leverage > MAF request
    # df['lev_gt_maf'] = df.apply(lambda x: 'True' if x['lev_total'] > x['maf_funding'] else 'False', axis=1)  # not used
    # Leverage > 0 ?
    df['lev_gt_0'] = (df['lev_total'] > 0).astype(int)
    # Calculate leverage as percentage of MAF funding
    df['lev_maf_%'] = df.apply(lambda x: round(x['lev_total'] / x['maf_funding'] * 100, 2) if x['maf_funding'] != 0 else 0, axis=1)
    # Create normalized leverage scale (0-1) where 300% leverage = 1
    df['lev_maf_scale'] = df['lev_maf_%'].apply(lambda x: min(x / 300, 1) if x > 0 else 0)
    # EUR / tCO2e mitigation potential
    df['cost_effectivness'] = df.apply(lambda x: round(x['maf_funding'] / x['mitigation_potential'], 2) if x['mitigation_potential'] > 0 else None, axis=1)
    # Normalize cost_effectivness to 0-1 scale (lower cost = higher score, capped at 1000 EUR/tCO2e)
    df['cost_effectivness_norm'] = df['cost_effectivness'].apply(lambda x: max(0, 1 - (x / 1000)) if x is not None else None)
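The leverage and cost-effectiveness normalizations above are plain arithmetic; pulled out of the DataFrame lambdas into standalone functions (hypothetical helper names, same formulas):

```python
def lev_maf_scale(lev_total, maf_funding):
    # Same arithmetic as the lambdas above: leverage as % of MAF funding,
    # then scaled so that 300% leverage (or more) maps to 1.0
    pct = round(lev_total / maf_funding * 100, 2) if maf_funding != 0 else 0
    return min(pct / 300, 1) if pct > 0 else 0

def cost_effectiveness_norm(maf_funding, mitigation_potential):
    # Lower EUR/tCO2e is better; costs at or beyond 1000 EUR/tCO2e score 0
    if mitigation_potential <= 0:
        return None
    cost = round(maf_funding / mitigation_potential, 2)
    return max(0, 1 - (cost / 1000))

print(lev_maf_scale(1_500_000, 1_000_000))      # 0.5
print(cost_effectiveness_norm(500_000, 2_000))  # 0.75
```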
| 303 |
+
# Test if all text fields don't have minimum required words
|
| 304 |
+
df['word_length_check'] = df.apply(lambda x:
|
| 305 |
+
True if len(x['scope_txt'].split()) < 10 and
|
| 306 |
+
len(x['fin_txt'].split()) < 10 and
|
| 307 |
+
len(x['tech_txt'].split()) < 10
|
| 308 |
+
else False, axis=1)
|
| 309 |
+
|
| 310 |
+
# Predict score
|
| 311 |
+
sector_classes = ['Energy','Transport','Industries']
|
| 312 |
+
df['pred_score'] = df.apply(lambda x: round((x['fin_lab2']*2 + x['scope_lab1']*2 + x['scope_lab2']*2 + x['tech_lab1'] + x['tech_lab3'] + x['bar_lab2'] + x['lev_gt_0']+x['lev_maf_scale'])/11*10,0), axis=1)
|
| 313 |
+
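The leverage normalisation above can be sketched as a standalone function (the function name and sample figures here are mine, for illustration only):

```python
def lev_maf_scale(lev_total: float, maf_funding: float) -> float:
    """Map leverage to a 0-1 scale, where 300% of MAF funding (or more) scores 1.0."""
    if maf_funding == 0:
        return 0.0
    pct = round(lev_total / maf_funding * 100, 2)
    return min(pct / 300, 1) if pct > 0 else 0

print(lev_maf_scale(1_500_000, 1_000_000))  # 150% leverage -> 0.5
print(lev_maf_scale(4_000_000, 1_000_000))  # 400% leverage, capped at 1
```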
# Labelling logic
df['pred_action'] = df.apply(lambda x:
    'REJECT' if (('concept_count' in df.columns and x['concept_count'] > 6) or
                 x['LANG'][0:2] != 'en' or
                 x['ADAPMIT'] == 'Adaptation' or
                 not any(sector in [x['SECTOR1'], x['SECTOR2']] for sector in sector_classes) or
                 x['word_length_check'] == True or
                 x['duplicate_check'] == True or
                 x['pred_score'] <= sens_level)
    else 'PRE-ASSESSMENT' if sens_level + 1 <= x['pred_score'] <= sens_level + 2
    else 'FULL-ASSESSMENT' if x['pred_score'] > sens_level + 2
    else 'ERROR', axis=1)

# Reorder columns in final dataframe
column_order = ['id', 'organization', 'org_renamed', 'concept_count', 'duplicate_check', 'scope_txt', 'tech_txt', 'fin_txt',
                'maf_funding', 'cont_public', 'cont_private', 'cont_other', 'scope_lab1', 'scope_lab2', 'tech_lab1',
                'tech_lab3', 'fin_lab2', 'bar_lab2', 'ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'ADAPMIT', 'SECTOR1',
                'SECTOR2', 'LANG', 'lev_total', 'lev_gt_0', 'lev_maf_%', 'lev_maf_scale', 'mitigation_potential',
                'cost_effectivness', 'cost_effectivness_norm', 'word_length_check', 'pred_score', 'pred_action']

# Only include columns that exist in the DataFrame
final_columns = [col for col in column_order if col in df.columns]
df = df[final_columns]

return df
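The score-to-action banding inside that lambda is easy to misread; a minimal sketch of the same thresholds (function name is mine, and the hard REJECT rules on language, sector, etc. are left out):

```python
def score_to_action(pred_score: int, sens_level: int) -> str:
    """Band an integer predicted score into an action by sensitivity level."""
    if pred_score <= sens_level:
        return 'REJECT'
    if pred_score <= sens_level + 2:
        return 'PRE-ASSESSMENT'
    return 'FULL-ASSESSMENT'

# With sens_level = 4: scores <= 4 reject, 5-6 pre-assessment, 7+ full assessment
print([score_to_action(s, 4) for s in (3, 5, 8)])
# ['REJECT', 'PRE-ASSESSMENT', 'FULL-ASSESSMENT']
```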
modules/prompts.py
ADDED
@@ -0,0 +1,21 @@

# Prompts library
from typing import List

def prompt_concept(concept: str, other_concepts: List[str]) -> str:
    """Generate a prompt for classifying concepts by similarity"""
    prompt = f"""
Each organization is allowed to submit up to 6 concepts per year via a web portal. However, in some cases organizations submit the same concept multiple times.
This can happen for various reasons: for example, an organization may erroneously submit the same application twice after losing access to a previous web session.
In such cases the duplicate concepts are usually not verbatim identical. More often there is simply high semantic alignment - i.e. it is the same concept, with minor superficial differences between applications.
Your task is to review the concept profiles submitted by a particular organization and assess their similarity so that duplicate concepts can be identified.

Here is the concept profile for review:
{concept}

Please review this against the following concepts and assess for duplication:
{other_concepts}

Please conduct your review carefully and ensure that you tag all duplicates correctly. Please return your response according to the following structure:
"""
    return prompt
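Note that interpolating `{other_concepts}` inserts the Python list's `repr` (brackets, quotes and all) into the prompt. A hypothetical helper, not part of the repo, could render the list as a numbered block that is easier for the model to read:

```python
from typing import List

def format_concepts(concepts: List[str]) -> str:
    """Render concept texts as a numbered block for prompt interpolation."""
    return "\n".join(f"{i}. {c}" for i, c in enumerate(concepts, start=1))

print(format_concepts(["Solar mini-grids for clinics", "Urban bus electrification"]))
```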
modules/semantic_similarity.py
ADDED
@@ -0,0 +1,70 @@

# Semantic similarity-based duplicate detection
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from modules.utils import setup_logging

logger = setup_logging()


def check_duplicate_concepts_semantic(
    model: SentenceTransformer,
    concept_id: str,
    organization: str,
    concept_profile: str,
    df: pd.DataFrame,
    similarity_threshold: float = 0.85
) -> bool:
    """
    Check for duplicate concepts within the same organization using semantic similarity.

    Args:
        model: SentenceTransformer model for computing embeddings
        concept_id: ID of the concept being checked
        organization: Organization name
        concept_profile: Text description of the concept to check
        df: DataFrame containing all application data
        similarity_threshold: Threshold (0-1) above which concepts count as duplicates.
            Recommended values: 0.80 (lenient) to 0.95 (strict)

    Returns:
        True if the concept duplicates another concept from the same organization
    """
    # Remove the current concept from the dataframe
    df_check = df[df['id'] != concept_id].copy()

    # Get other concepts from the same organization
    org_concepts = df_check[df_check['org_renamed'] == organization]
    other_concepts = org_concepts['scope_txt'].tolist()

    # If there are no other concepts from this organization, it cannot be a duplicate
    if len(other_concepts) == 0:
        return False

    # Compute the embedding for the current concept
    current_embedding = model.encode(
        concept_profile if concept_profile else "",
        convert_to_numpy=True
    )

    # Compute embeddings for the other concepts
    other_embeddings = model.encode(
        [text if text else "" for text in other_concepts],
        convert_to_numpy=True
    )

    # Cosine similarity between the current concept and each other concept
    similarities = cosine_similarity(
        current_embedding.reshape(1, -1),
        other_embeddings
    )[0]

    max_similarity = similarities.max() if len(similarities) > 0 else 0.0

    logger.info(f"Duplicate check for concept ID {concept_id}: max_similarity={max_similarity:.3f}")

    return max_similarity >= similarity_threshold
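Stripped of the embedding model, the logic above amounts to a max-cosine-similarity threshold check. A pure-Python sketch with toy vectors standing in for `model.encode` output (names and numbers are illustrative):

```python
import math
from typing import List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_duplicate(current: List[float], others: List[List[float]],
                 threshold: float = 0.85) -> bool:
    """True if any other embedding is at least `threshold` similar to `current`."""
    if not others:
        return False  # no other concepts from this organization
    return max(cosine(current, o) for o in others) >= threshold

print(is_duplicate([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0]]))  # near-parallel vector -> True
```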
modules/utils.py
ADDED
@@ -0,0 +1,111 @@

import re
from io import BytesIO
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
import logging
from logging.handlers import RotatingFileHandler
import os
import configparser


def setup_logging():
    """Set up rotating file + console logging and return a logger instance."""
    log_dir = 'logs'
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(log_dir, 'app.log')

    # Create a RotatingFileHandler (1 MB per file, 5 backups)
    file_handler = RotatingFileHandler(log_file, maxBytes=1024 * 1024, backupCount=5)
    file_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

    # Configure the root logger
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                        handlers=[file_handler, logging.StreamHandler()])

    return logging.getLogger(__name__)


def getconfig(configfile_path: str):
    """
    Read the config file.

    Params
    ----------------
    configfile_path: file path of the .cfg file
    """
    config = configparser.ConfigParser()
    try:
        with open(configfile_path) as f:
            config.read_file(f)
        return config
    except FileNotFoundError:
        logging.warning("config file not found")
        return None


# Function for creating the upload template file
def create_excel():
    wb = Workbook()
    sheet = wb.active
    sheet.title = "template"
    columns = ['id',
               'organization',
               'scope',
               'technology',
               'financial',
               'barrier',
               'maf_funding_requested',
               'contributions_public_sector',
               'contributions_private_sector',
               'contributions_other',
               'mitigation_potential']
    sheet.append(columns)  # Append column headers to the first row

    # Format the header row
    for c in sheet['A1:K1'][0]:
        c.fill = PatternFill('solid', fgColor='bad8e1')
        c.font = Font(bold=True)

    # Save to a BytesIO object
    output = BytesIO()
    wb.save(output)
    return output.getvalue()


# Function to clean text
def clean_text(input_text):
    cleaned_text = re.sub(r"[^a-zA-Z0-9\s.,:;!?()\-\n]", "", input_text)
    cleaned_text = re.sub(r"x000D", "", cleaned_text)  # Excel carriage-return artifact
    cleaned_text = re.sub(r"\s+", " ", cleaned_text)   # collapse all whitespace runs
    return cleaned_text
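A quick standalone illustration of `clean_text` (the function is mirrored here so the example runs on its own; the sample string is made up):

```python
import re

def clean_text(input_text):
    # Mirror of modules/utils.clean_text for a self-contained demo
    cleaned = re.sub(r"[^a-zA-Z0-9\s.,:;!?()\-\n]", "", input_text)
    cleaned = re.sub(r"x000D", "", cleaned)  # Excel carriage-return artifact
    cleaned = re.sub(r"\s+", " ", cleaned)   # collapse all whitespace runs
    return cleaned

# Underscores, the en dash and "%" are stripped; "x000D" and extra spaces removed
print(clean_text("Solar_x000D_ grid \u2013 50% cheaper!"))  # Solar grid 50 cheaper!
```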

# Function for extracting classifications for each SECTOR label
def extract_predicted_labels(output, ordinal_selection=1, threshold=0.5):
    # Verify output is a list of dictionaries
    if isinstance(output, list) and all(isinstance(item, dict) for item in output):
        # Keep items with scores above the threshold
        filtered_items = [item for item in output if item.get('score', 0) > threshold]

        # Sort the filtered items by score in descending order
        sorted_items = sorted(filtered_items, key=lambda x: x.get('score', 0), reverse=True)

        # Extract the highest and second-highest labels
        if len(sorted_items) >= 2:
            highest_label = sorted_items[0].get('label')
            second_highest_label = sorted_items[1].get('label')
        elif len(sorted_items) == 1:
            highest_label = sorted_items[0].get('label')
            second_highest_label = None
        else:
            logging.warning("No items above the threshold in the current list.")
            highest_label = None
            second_highest_label = None
    else:
        logging.error("Inner data is not formatted correctly. Each item must be a dictionary.")
        highest_label = None
        second_highest_label = None

    # Return a dictionary with the highest and second-highest labels
    return {"SECTOR1": highest_label, "SECTOR2": second_highest_label}
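The happy path of `extract_predicted_labels` condenses to a top-two selection over threshold-filtered scores. A runnable sketch (the classifier output below is invented for illustration; the function name is mine):

```python
def top_two_labels(output, threshold=0.5):
    # Keep items scoring above the threshold, sorted descending by score
    items = sorted((i for i in output if i.get('score', 0) > threshold),
                   key=lambda x: x.get('score', 0), reverse=True)
    return {"SECTOR1": items[0]['label'] if len(items) > 0 else None,
            "SECTOR2": items[1]['label'] if len(items) > 1 else None}

scores = [{'label': 'Energy', 'score': 0.91},
          {'label': 'Transport', 'score': 0.64},
          {'label': 'Waste', 'score': 0.12}]
print(top_two_labels(scores))  # {'SECTOR1': 'Energy', 'SECTOR2': 'Transport'}
```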
requirements.txt
ADDED
@@ -0,0 +1,12 @@

streamlit
pandas
openpyxl
setfit
bcrypt
--extra-index-url https://download.pytorch.org/whl/cu113
torch
thefuzz
openai==2.9.0
python-dotenv
sentence-transformers
scikit-learn