mtyrrell committed
Commit bc92a1b · 0 Parent(s)

Fresh start with LFS for images
.dockerignore ADDED
@@ -0,0 +1,50 @@
+ # Git files
+ .git
+ .gitignore
+
+ # Python cache
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ *.so
+ *.egg
+ *.egg-info
+ dist
+ build
+
+ # Environment files
+ .env
+ *.env
+
+ # IDE files
+ .vscode
+ .idea
+ *.swp
+ *.swo
+ *~
+
+ # OS files
+ .DS_Store
+ Thumbs.db
+
+ # Testing and sandbox
+ testing/
+ sandbox/
+
+ # Logs (will be created at runtime)
+ logs/
+ *.log
+
+ # CSV files (test data)
+ *.csv
+
+ # Large data files
+ org_renamed.csv
+ rename.csv
+ test.csv
+
+ # Documentation
+ README.md
+ CLAUDE.md
.gitattributes ADDED
@@ -0,0 +1 @@
+ *.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,8 @@
+ .env
+ .DS_Store
+ *.csv
+ *.xlsx
+ /testing/
+ /modules/__pycache__/
+ app.log
+ /sandbox/
CLAUDE.md ADDED
@@ -0,0 +1,121 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Overview
+
+ This is a **Streamlit-based web application** for pre-filtering and analyzing grant applications using machine learning models. The app processes Excel files containing application data, runs them through multiple fine-tuned LLMs for classification, and generates scored outputs with recommended filtering actions.
+
+ The application is deployed as a **Hugging Face Space** (ref. README.md metadata).
+
+ ## Running the Application
+
+ ### Local Development
+ ```bash
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run the Streamlit app
+ streamlit run app.py
+ ```
+
+ ### Environment Variables
+ Required environment variables (stored in `.env` for local development):
+ - `HF_TOKEN` - Hugging Face token for authentication and model access
+ - `<USERNAME>_HASH` - Bcrypt password hash for each user account (e.g., `USER1_HASH`)
+ - `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_VERSION`, `AZURE_OPENAI_API_KEY` - Credentials used for the Azure OpenAI client in app.py
+
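+ A password hash for a user can be generated with bcrypt and stored in `.env`. A minimal sketch (the `USER1` prefix is just an example login name):
+
+ ```python
+ import bcrypt
+
+ # Generate a salted hash and store the printed value as USER1_HASH
+ hashed = bcrypt.hashpw("my-secret-password".encode(), bcrypt.gensalt())
+ print(hashed.decode())
+ ```
+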
+ ### CUDA/GPU Support
+ The application checks for CUDA availability on startup (app.py:1-13). It automatically uses a GPU if one is available and otherwise falls back to MPS (on Apple Silicon) or CPU.
+
+ ## Architecture
+
+ ### Application Flow
+ 1. **Authentication** (modules/auth.py) - Users log in with credentials validated against bcrypt hashes from environment variables
+ 2. **File Upload** - Users upload an Excel file matching the template structure
+ 3. **Data Processing Pipeline** (modules/pipeline.py) - Core processing logic:
+    - Validates required columns
+    - Standardizes organization names and counts concepts per organization (modules/org_count.py)
+    - Cleans text fields (modules/utils.py)
+    - Runs inference through 10 classification models sequentially, plus a semantic-similarity duplicate check (modules/semantic_similarity.py)
+    - Calculates scores and generates filtering recommendations
+ 4. **Output Generation** - Produces a downloadable Excel file with analysis results
+
+ ### ML Model Pipeline
+ The app uses **10 classification models** loaded from Hugging Face:
+ - **SetFit Models (6)**: scope_lab1, scope_lab2, tech_lab1, tech_lab3, fin_lab2, bar_lab2
+ - **Transformer Pipelines (4)**: ADAPMIT_SCOPE, ADAPMIT_TECH, SECTOR (multilabel), LANG (language detection)
+
+ Models are loaded from different HF profiles:
+ - `mtyrrell/classifier_SF_*` - SetFit models
+ - `GIZ/ADAPMIT-multilabel-bge_f` - Adaptation vs. Mitigation classification
+ - `GIZ/SECTOR-multilabel-bge_f` - Sector classification
+ - `qanastek/51-languages-classifier` - Language detection
+
+ In addition, `BAAI/bge-m3` sentence embeddings are used for the duplicate-concept check.
+
+ ### Required Excel Template Columns
+ The input Excel file must contain these columns (ref. pipeline.py:113-124):
+ - `id` - Application identifier
+ - `organization` - Organization name
+ - `scope` - Scope description text
+ - `technology` - Technology description text
+ - `financial` - Financial information text
+ - `barrier` - Barrier description text
+ - `maf_funding_requested` - Requested funding amount
+ - `contributions_public_sector` - Public sector contributions
+ - `contributions_private_sector` - Private sector contributions
+ - `contributions_other` - Other contributions
+ - `mitigation_potential` - Mitigation potential (numeric, tCO2e)
+
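+ A quick way to check a file against the template before uploading (an illustrative sketch, not part of the app):
+
+ ```python
+ import pandas as pd
+
+ REQUIRED = ["id", "organization", "scope", "technology", "financial", "barrier",
+             "maf_funding_requested", "contributions_public_sector",
+             "contributions_private_sector", "contributions_other", "mitigation_potential"]
+
+ df = pd.read_excel("upload_template.xlsx")
+ missing = [c for c in REQUIRED if c not in df.columns]
+ print("Missing columns:", missing or "none")
+ ```
+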
+ ### Scoring and Filtering Logic
+ The application calculates a predicted score (0-10) based on:
+ - Individual model predictions (6 binary classifiers)
+ - Leverage calculations (lev_gt_0, lev_maf_scale)
+ - Formula: `(fin_lab2*2 + scope_lab1*2 + scope_lab2*2 + tech_lab1 + tech_lab3 + bar_lab2 + lev_gt_0 + lev_maf_scale) / 11 * 10`
+
+ Applications are labeled with one of three actions (pipeline.py:269-278), with `ERROR` as a fallback:
+ - **REJECT** - Non-English text, >6 concepts from the same org, Adaptation (not Mitigation), sector outside Energy/Transport/Industries, insufficient text length, a detected duplicate concept, or pred_score ≤ sensitivity level
+ - **PRE-ASSESSMENT** - pred_score in [sensitivity+1, sensitivity+2]
+ - **FULL-ASSESSMENT** - pred_score > sensitivity+2
+
+ Sensitivity levels (app.py:81-98):
+ - Low: 4 (fewer false negatives, ~6%)
+ - Medium: 5
+ - High: 6 (more false negatives, ~13%)
+
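+ The scoring and labeling logic, condensed (a sketch of pipeline.py's row-wise computation; `row` stands for one application with model outputs already attached, and the eligibility overrides listed above are omitted for brevity):
+
+ ```python
+ def score_and_label(row, sens_level):
+     # Weighted average of binary classifier outputs and leverage signals, scaled to 0-10
+     pred_score = round((row["fin_lab2"]*2 + row["scope_lab1"]*2 + row["scope_lab2"]*2 +
+                         row["tech_lab1"] + row["tech_lab3"] + row["bar_lab2"] +
+                         row["lev_gt_0"] + row["lev_maf_scale"]) / 11 * 10)
+     if pred_score <= sens_level:
+         return pred_score, "REJECT"
+     if pred_score <= sens_level + 2:
+         return pred_score, "PRE-ASSESSMENT"
+     return pred_score, "FULL-ASSESSMENT"
+ ```
+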
+ ### Key Modules
+
+ **modules/pipeline.py**
+ - `process_data(uploaded_file, sens_level, azure_client, azure_deployment)` - Main processing function
+ - `predict_category(df, model_name, progress_bar, repo, profile, multilabel)` - Model inference wrapper
+ - Handles both SetFit and Transformer pipeline models with different text truncation strategies
+
+ **modules/utils.py**
+ - `create_excel()` - Generates the template Excel file for download
+ - `clean_text(input_text)` - Removes special characters and normalizes whitespace
+ - `extract_predicted_labels(output, ordinal_selection, threshold)` - Extracts the top-2 labels from multilabel output
+ - `setup_logging()` - Configures a rotating file handler (logs/app.log, 1MB max, 5 backups)
+ - `getconfig(configfile_path)` - Reads settings from config.cfg
+
+ **modules/org_count.py**
+ - `standardize_organization_names(df)` - Normalizes org names using exact matching, abbreviations, and fuzzy matching (token_set_ratio; threshold 85 against the known-org dictionary, 95 for dynamic matching of unseen names)
+ - Adds `org_renamed` and `concept_count` columns to track multiple concepts from the same organization
+
+ **modules/semantic_similarity.py**
+ - `check_duplicate_concepts_semantic(...)` - Flags a concept as a duplicate when its `BAAI/bge-m3` embedding has cosine similarity ≥ 0.85 with another concept from the same organization
+
+ **modules/llm.py / modules/prompts.py / modules/models.py**
+ - `call_structured(...)` - Structured-output wrapper around the Azure OpenAI Responses API; currently only referenced by a disabled LLM-based duplicate check
+
+ **modules/auth.py**
+ - `validate_login(username, password)` - Validates against bcrypt hashes stored in environment variables
+
+ ### Session State Management
+ The app uses Streamlit session state to manage:
+ - `authenticated` - Login status
+ - `data_processed` - Whether processing is complete
+ - `df` - Processed dataframe
+ - `show_button` - Controls "Start Analysis" button visibility
+ - `processing` - Tracks processing state
+
+ ## Important Notes
+
+ - **Processing Time**: Approximately 5 minutes for 1000 applications (ref. app.py:66)
+ - **Model Truncation**: All text is truncated to a maximum of 512 tokens (pipeline.py:76-89)
+ - **Device Selection**: Automatically selects CUDA > MPS > CPU (pipeline.py:51)
+ - **Column Validation**: The app raises an error if required columns are missing from the uploaded Excel file
+ - **Organization Duplicates**: The system tracks concepts per organization and marks applications REJECT if there are more than 6 concepts from the same org
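+
+ The SetFit truncation is done manually through the underlying tokenizer, since the SetFit models take raw strings at call time. A condensed sketch of the approach used in pipeline.py:
+
+ ```python
+ from setfit import SetFitModel
+
+ model = SetFitModel.from_pretrained("mtyrrell/classifier_SF_scope_lab1")
+ tokenizer = model.model_body.tokenizer
+
+ text = "..."  # an application's scope description
+ # Truncate to the 512-token context window before predicting
+ encoded = tokenizer(text, truncation=True, max_length=512)
+ prediction = model(tokenizer.decode(encoded["input_ids"]))
+ ```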
Dockerfile ADDED
@@ -0,0 +1,40 @@
+ # NVIDIA CUDA base image for GPU support
+ FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04
+
+ # Environment variables to prevent interactive prompts
+ ENV DEBIAN_FRONTEND=noninteractive
+
+ # curl is required by the HEALTHCHECK below
+ RUN apt-get update && apt-get install -y \
+     python3.10 \
+     python3-pip \
+     python3.10-dev \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH
+ WORKDIR $HOME/app
+
+ RUN pip install --no-cache-dir --upgrade pip
+
+ COPY --chown=user requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ COPY --chown=user . .
+
+ # Create logs directory with proper permissions
+ USER root
+ RUN mkdir -p logs && chown -R user:user logs
+ USER user
+
+ # Expose Streamlit default port
+ EXPOSE 8501
+
+ # Health check
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health || exit 1
+
+ # Run app
+ CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true", "--server.fileWatcherType=none", "--server.enableXsrfProtection=false", "--server.enableCORS=false"]
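A typical local build and run (the image tag is arbitrary; `--gpus all` requires the NVIDIA container toolkit and can be omitted for CPU-only runs):

```bash
docker build -t prefilter-app .
docker run --gpus all -p 8501:8501 --env-file .env prefilter-app
```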
README.md ADDED
@@ -0,0 +1,11 @@
+ ---
+ title: Prefilter App
+ emoji: 🦀
+ colorFrom: yellow
+ colorTo: red
+ sdk: docker
+ app_port: 8501
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
__pycache__/app.cpython-311.pyc ADDED
Binary file (13.5 kB)
app.py ADDED
@@ -0,0 +1,243 @@
+ import torch
+
+ try:
+     print(f"Is CUDA available: {torch.cuda.is_available()}")
+     if torch.cuda.is_available():
+         try:
+             print(f"CUDA device: {torch.cuda.get_device_name(torch.cuda.current_device())}")
+         except Exception as e:
+             print(f"Error getting CUDA device name: {str(e)}")
+     else:
+         print("No CUDA device available - using CPU")
+ except Exception as e:
+     print(f"Error checking CUDA availability: {str(e)}")
+     print("Continuing with CPU...")
+
+ import streamlit as st
+ import os
+ from huggingface_hub import login
+ from datetime import datetime
+ from openai import OpenAI
+ from modules.auth import validate_login
+ from modules.utils import create_excel, setup_logging, getconfig
+ from modules.pipeline import process_data
+
+ setup_logging()
+ import logging
+ from io import BytesIO
+
+ logger = logging.getLogger(__name__)
+
+ # Local development: load environment variables from .env
+ # from dotenv import load_dotenv
+ # load_dotenv()
+
+ config = getconfig("config.cfg")
+
+ @st.cache_resource
+ def get_azure_openai_client():
+     """Initialize and cache the Azure OpenAI client for the session"""
+     try:
+         AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
+         AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")
+         AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
+
+         if not all([AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_VERSION, AZURE_OPENAI_API_KEY]):
+             raise ValueError("Missing required Azure OpenAI environment variables. Please check your .env file.")
+
+         client = OpenAI(api_key=AZURE_OPENAI_API_KEY, base_url=AZURE_OPENAI_ENDPOINT)
+         logger.info("Azure OpenAI client initialized successfully")
+         return client
+     except Exception as e:
+         logger.error(f"Failed to initialize Azure OpenAI client: {str(e)}")
+         raise
+
+
+ def get_azure_deployment():
+     """Get the Azure OpenAI deployment name from the config file"""
+     try:
+         config = getconfig("config.cfg")
+         deployment = config.get("deployments", "DEPLOYMENT")
+         logger.info(f"Using Azure OpenAI deployment: {deployment}")
+         return deployment
+     except Exception as e:
+         logger.error(f"Failed to read deployment from config: {str(e)}. Using default deployment.")
+         return "gpt-4o-mini"
+
+
+ # Main app logic
+ def main():
+     # Initialize authentication state (users start logged out)
+     if 'authenticated' not in st.session_state:
+         st.session_state['authenticated'] = False
+
+     if st.session_state['authenticated']:
+         # Authenticate against the Hugging Face Hub for model access
+         hf_token = os.environ["HF_TOKEN"]
+         login(token=hf_token, add_to_git_credential=True)
+
+         # Initialize session state variables
+         if 'data_processed' not in st.session_state:
+             st.session_state['data_processed'] = False
+             st.session_state['df'] = None
+
+         # Main Streamlit app
+         st.title('Application Pre-Filtering Tool')
+
+         # Sidebar (filters)
+         with st.sidebar:
+             with st.expander("ℹ️ - Instructions", expanded=False):
+                 st.markdown(
+                     """
+ 1. **Download the Excel Template file (below)**
+ 2. **[OPTIONAL]: Select the desired filtering sensitivity level (below)**
+ 3. **Copy/paste the requisite application data into the template file. Best practice is to 'paste as values'**
+ 4. **Upload the template file in the area to the right (or click Browse files)**
+ 5. **Click 'Start Analysis'**
+
+ The tool will start processing the uploaded application data. This can take some time
+ depending on the number of applications and the length of text in each. For example, a file with 1000 applications
+ can be expected to take approximately 5 minutes.
+
+ ***NOTE** - you can also simply rename the column headers in your own file. The headers must match the column names in the template for the tool to run properly.*
+                     """
+                 )
+             # Excel template download
+             st.download_button(
+                 label="Download Excel Template",
+                 data=create_excel(),
+                 file_name="upload_template.xlsx",
+                 mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+             )
+
+             # Get sensitivity level for use in review/reject (ref. process_data function)
+             sens_options = {
+                 "Low": 4,
+                 "Medium": 5,
+                 "High": 6,
+             }
+
+             sens_input = st.sidebar.radio(
+                 label='Select the Sensitivity Level [OPTIONAL]',
+                 help='Decreasing the sensitivity level results in fewer \
+                     applications being filtered out. This also \
+                     reduces the probability of false negatives (FNs). The rate of \
+                     FNs at the lowest setting is approximately 6 percent, and \
+                     approaches 13 percent at the highest setting. \
+                     NOTE: changing this setting does not affect the raw data in the output file (only the labels)',
+                 options=list(sens_options.keys()),
+                 index=list(sens_options.keys()).index("High"),
+                 horizontal=False)
+
+             sens_level = sens_options[sens_input]
+
+             with st.expander("ℹ️ - About this app", expanded=False):
+                 st.write(
+                     """
+ This tool provides an interface for running an automated preliminary assessment of applications to a call for applications.
+
+ The tool functions by running selected text fields from each application through a series of LLMs fine-tuned for text classification (ref. diagram below).
+ The resulting output classifications are used to compute a score and a suggested pre-filtering action. The tool has been tested against
+ human assessors and exhibits an extremely low false negative rate (<6%) at a Sensitivity Level of 'Low' (i.e. rejection threshold of predicted score ≤ 4).
+                     """)
+                 st.image('images/pipeline.png')
+
+         uploaded_file = st.file_uploader("Select a file containing application pre-filtering data (see instructions in the sidebar)")
+
+         # Add session state variables if they don't exist
+         if 'show_button' not in st.session_state:
+             st.session_state['show_button'] = True
+         if 'processing' not in st.session_state:
+             st.session_state['processing'] = False
+         if 'data_processed' not in st.session_state:
+             st.session_state['data_processed'] = False
+
+         # Only show the button if show_button is True, a file is uploaded, and we are not already processing
+         if uploaded_file is not None and st.session_state['show_button'] and not st.session_state['processing']:
+             if st.button("Start Analysis", key="start_analysis"):
+                 st.session_state['show_button'] = False
+                 st.session_state['processing'] = True
+                 st.rerun()
+
+         # If we're processing, run the processing logic
+         if st.session_state['processing']:
+             try:
+                 logger.info(f"File uploaded: {uploaded_file.name}")
+
+                 if not st.session_state['data_processed']:
+                     logger.info("Starting data processing")
+                     try:
+                         # Initialize Azure OpenAI client and get deployment name
+                         azure_client = get_azure_openai_client()
+                         azure_deployment = get_azure_deployment()
+
+                         st.session_state['df'] = process_data(
+                             uploaded_file,
+                             sens_level,
+                             azure_client,
+                             azure_deployment
+                         )
+                         logger.info("Data processing completed successfully")
+                         st.session_state['data_processed'] = True
+                     except ValueError as e:
+                         # Handle specific validation errors
+                         logger.error(f"Validation error: {str(e)}")
+                         st.error(str(e))
+                         st.session_state['show_button'] = True
+                         st.session_state['processing'] = False
+                         st.rerun()
+                     except Exception as e:
+                         # Handle other unexpected errors
+                         logger.error(f"Error in process_data: {str(e)}")
+                         st.error("An unexpected error occurred. Please check your input file and try again.")
+                         st.session_state['show_button'] = True
+                         st.session_state['processing'] = False
+                         st.rerun()
+
+                 df = st.session_state['df']
+
+                 def reset_button_state():
+                     st.session_state['show_button'] = True
+                     st.session_state['processing'] = False
+                     st.session_state['data_processed'] = False
+
+                 # Create Excel buffer for download
+                 excel_buffer = BytesIO()
+                 df.to_excel(excel_buffer, index=False, engine='openpyxl')
+                 excel_buffer.seek(0)
+
+                 current_datetime = datetime.now().strftime('%d-%m-%Y_%H-%M-%S')
+                 output_filename = f'processed_applications_{current_datetime}.xlsx'
+
+                 st.download_button(
+                     label="Download Analysis Data File",
+                     data=excel_buffer,
+                     file_name=output_filename,
+                     mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
+                     on_click=reset_button_state
+                 )
+
+             except Exception as e:
+                 logger.error(f"Error processing file: {str(e)}")
+                 st.error("Failed to process the file. Please ensure your column names match the template file.")
+                 st.session_state['show_button'] = True
+                 st.session_state['processing'] = False
+                 st.rerun()
+
+     # Login form (shown until authentication succeeds)
+     else:
+         username = st.text_input("Username")
+         password = st.text_input("Password", type="password")
+         if st.button("Login"):
+             if validate_login(username, password):
+                 st.session_state['authenticated'] = True
+                 st.rerun()
+             else:
+                 st.error("Incorrect username or password")
+
+
+ main()
config.cfg ADDED
@@ -0,0 +1,2 @@
+ [deployments]
+ DEPLOYMENT=gpt-5-mini
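The deployment name is read at startup via `getconfig` (modules/utils.py); a minimal illustration of the lookup:

```python
from modules.utils import getconfig

config = getconfig("config.cfg")
print(config.get("deployments", "DEPLOYMENT"))  # -> gpt-5-mini
```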
images/pipeline.png ADDED

Git LFS Details

  • SHA256: c6966ec792e3b749e773a185e441a558bd5abbde42bb61f826fb1321af910928
  • Pointer size: 131 Bytes
  • Size of remote file: 679 kB
modules/auth.py ADDED
@@ -0,0 +1,13 @@
+ import os
+ import bcrypt
+
+ # Helper functions
+ def check_password(provided_password, stored_hash):
+     # bcrypt.checkpw extracts the salt from the stored hash and compares
+     return bcrypt.checkpw(provided_password.encode(), stored_hash)
+
+ def validate_login(username, password):
+     # Retrieve the user's hashed password from environment variables
+     user_hash = os.getenv(username.upper() + '_HASH')  # Assumes an env var like 'USER1_HASH'
+     if user_hash:
+         return check_password(password, user_hash.encode())
+     return False
modules/llm.py ADDED
@@ -0,0 +1,90 @@
+ # Helper functions for pipeline
+ from typing import Dict, Any
+ import logging
+ from openai import OpenAI
+ from modules.utils import setup_logging
+ # Imports used only by the disabled LLM duplicate check below
+ # from modules.prompts import prompt_concept
+ # from modules.models import ConceptClassify
+
+ logger = setup_logging()
+
+ def call_structured(client: OpenAI, deployment: str, system_prompt: str, user_prompt: str,
+                     response_model: type,
+                     logger: logging.Logger) -> Dict[str, Any]:
+     """Call Azure OpenAI with structured output parsed into `response_model`"""
+     if not system_prompt:
+         # Default assessor prompt if none is supplied
+         system_prompt = "You are assessing grant applications for an open funding call."
+     try:
+         # Reasoning models take a reasoning-effort setting instead of temperature
+         if deployment in ['o4-mini', 'o3', "gpt-5", "gpt-5-mini", "gpt-5-nano"]:
+             response = client.responses.parse(
+                 model=deployment,
+                 reasoning={"effort": "low"},
+                 input=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_prompt},
+                 ],
+                 text_format=response_model)
+         else:
+             response = client.responses.parse(
+                 model=deployment,
+                 input=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_prompt},
+                 ],
+                 temperature=0,
+                 text_format=response_model)
+
+         result = response.output_parsed
+
+         # Return the parsed structured output as a plain dict
+         return result.model_dump()
+
+     except Exception as e:
+         logger.error(f"Error calling Azure OpenAI for {response_model.__name__}: {e}")
+         # Signal failure to the caller
+         return None
+
+
+ # Not used - results were poor
+ # def check_duplicate_concepts(client, deployment, concept_id: str, organization: str, concept_profile: str, df) -> bool:
+ #     """
+ #     Check for duplicate concepts within the same organization using Azure OpenAI
+
+ #     Args:
+ #         client: AzureOpenAI client instance
+ #         deployment: Azure OpenAI deployment name
+ #         concept_id: ID of the current concept being checked
+ #         organization: Organization name
+ #         concept_profile: Text description of the concept to check
+ #         df: DataFrame containing all application data
+
+ #     Returns:
+ #         Boolean classification result
+ #     """
+
+ #     # Remove current concept from the dataframe
+ #     df_check = df[df['id'] != concept_id].copy()
+
+ #     # Get other concepts from the same organization
+ #     org_concepts = df_check[df_check['org_renamed'] == organization]
+ #     other_concepts = org_concepts['scope_txt'].tolist()
+
+ #     # If no other concepts from this organization, return False
+ #     if len(other_concepts) == 0:
+ #         return False
+
+ #     logger.info(f"Checking duplicates for concept ID {concept_id} from organization {organization} against {len(other_concepts)} other concept(s).")
+ #     logger.info(f"Scope text {concept_profile}")
+ #     # Construct prompt
+ #     prompt = prompt_concept(concept_profile, other_concepts)
+
+ #     response = call_structured(client, deployment, prompt, concept_profile, ConceptClassify, logger)
+
+ #     check = response['classification']
+ #     logger.info(f"Duplicate check response for concept ID {concept_id}: {check}")
+ #     if check == "YES":
+ #         return True
+ #     return False
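For reference, this is how `call_structured` would be invoked with the `ConceptClassify` schema (a hypothetical usage sketch; the endpoint and key placeholders are not real values):

```python
import logging
from openai import OpenAI
from modules.llm import call_structured
from modules.models import ConceptClassify

client = OpenAI(api_key="<AZURE_OPENAI_API_KEY>", base_url="<AZURE_OPENAI_ENDPOINT>")
result = call_structured(
    client, "gpt-5-mini",
    system_prompt="",  # empty -> falls back to the default assessor prompt
    user_prompt="Concept under review: ... Other concepts: ...",
    response_model=ConceptClassify,
    logger=logging.getLogger(__name__),
)
print(result)  # e.g. {'classification': 'NO'}
```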
modules/models.py ADDED
@@ -0,0 +1,8 @@
+ from typing import Literal
+ from pydantic import BaseModel, Field
+
+ #===================== Duplicate concepts =====================
+
+ class ConceptClassify(BaseModel):
+     classification: Literal["YES", "NO", "UNCERTAIN"] = Field(description="Is the concept duplicated in other applications? (YES/NO/UNCERTAIN)")
+
modules/org_count.py ADDED
@@ -0,0 +1,232 @@
+ import pandas as pd
+ from thefuzz import fuzz
+ import logging
+ import re
+
+ logger = logging.getLogger(__name__)
+
+
+ def standardize_organization_names(df):
+     """
+     Standardizes organization names in a DataFrame using exact matches, abbreviations, and fuzzy matching.
+
+     Args:
+         df (pd.DataFrame): DataFrame containing an 'organization' column
+
+     Returns:
+         pd.DataFrame: DataFrame with added 'org_renamed' and 'concept_count' columns
+     """
+     # Make a copy to avoid modifying the original DataFrame
+     df = df.copy()
+
+     # Sort by 'id' and reset the index so positional and label indexing agree downstream
+     df = df.sort_values('id', ascending=True).reset_index(drop=True)
+
+     # Return DataFrame as-is if 'organization' column is not present
+     if 'organization' not in df.columns:
+         logger.warning("No 'organization' column found in DataFrame. Returning DataFrame as-is.")
+
+     else:
+         logger.info("Checking org names")
+         # Dictionary of organization variations and their standardized names
+         # Cleaned up to remove leading/trailing spaces for consistency
+         org_variations = {
+             'Adventist Development Relief Agency': ['adventist development'],
+             'Asian Development Bank': ['asian development bank'],
+             'Association of the Regional Mechanism for Emissions Reductions of Boyacá, Colombia (MRRE)': ['regional mechanism for emissions reductions of boyacá'],
+             'BioCarbon Partners (BCP)': ['biocarbon partners'],
+             'Biothermica Technologies Inc': ['biothermica tech'],
+             'Brazilian Tourist Board': ['brazilian tourist board'],
+             'Caribbean Community Climate Change Centre': ['caribbean community climate'],
+             'Caritas': ['caritas'],
+             'Chemical Industries Holding Company': ['chemical industries holding company'],
+             'Climate Advocacy International (CAI)': ['climate advocacy int'],
+             'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['deutsche gesellschaft für internationale'],
+             'Deutsche Sparkassenstiftung (DSIK)': ['deutsche sparkassenstiftung'],
+             'Development Initiative for Community Impact (DICI)': ['development initiative for community impact'],
+             'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['east african centre of excellence for renewable'],
+             'Eco-Ideal': ['eco-ideal'],
+             'Electricité de France (EDF)': ['electricité de france', 'edf international networks'],
+             'The Energy and Resources Institute (TERI)': ['energy and resources institute'],
+             'Environmental Defense Fund (EDF)': ['environmental defense fund'],
+             'Food and Agriculture Organization (FAO)': ['food and agriculture organization'],
+             'Global Green Growth Institute (GGGI)': ['global green growth'],
+             'International Finance Corporation (IFC)': ['international finance corporation'],
+             'International Organization for Migration (IOM)': ['international organization for migration'],
+             'Inter-American Development Bank (IDB)': ['american development bank'],
+             'Iskandar Regional Development Authority (IRDA)': ['iskandar regional'],
+             'Islamic Development Bank': ['islamic development bank'],
+             'Malaysian Industry Government Group for High Technology (MIGHT)': ['government group for high technology'],
+             'Metallurgical Industries Holding Company': ['metallurgical industries holding company'],
+             'MicroSave Consulting (MSC)': ['microsave consulting'],
+             'Osh Technological University': ['osh technological university', 'ошский технологический университет'],
+             'Oxford Policy Management (OPM)': ['oxford policy management'],
+             'Pacific Rim Investment Management': ['pacific rim investment'],
+             'Palestinian Energy and Natural Resources Authority (PENRA)': ['palestinian energy and natural'],
+             'Rwanda Energy Group (REG) Ltd': ['rwanda energy group'],
+             'Rocky Mountain Institute (RMI)': ['rocky mountain institute'],
+             'Secretariat of the Pacific Regional Environment Programme (SPREP)': ['secretariat of the pacific regional environment programme (sprep)'],
+             'Serviço Nacional de Aprendizagem Industrial (SENAI)': ['serviço nacional de aprendizagem'],
+             'Sumy City Council': ['sumy city council'],
+             # 'Tajik Technical University': ['tajik technical university'],
+             'Uganda Development Bank Limited (UDBL)': ['uganda development bank'],
+             'United Nations Human Settlement Programme (UN-Habitat)': ['united nations human settlement', 'un-habitat'],
+             'United Nations Children\'s Fund (UNICEF)': ['united nations children'],
+             'United Nations Conference on Trade and Development (UNCTAD)': ['united nations conference on trade'],
+             'United Nations Development Programme (UNDP)': ['united nations development program'],
+             'United Nations Economic and Social Commission (ECOSOC)': ['united nations economic and social'],
+             'United Nations Environment Programme (UNEP)': ['united nations environment'],
+             'United Nations High Commissioner for Refugees (UNHCR)': ['high commissioner for refugees'],
+             'United Nations Industrial Development Organization (UNIDO)': ['united nations industrial'],
+             'United Nations Office for Project Services (UNOPS)': ['united nations office for project'],
+             'World Food Programme (WFP)': ['world food program'],
+             'World Health Organization (WHO)': ['world health organization'],
+             'World Resources Institute (WRI)': ['world resources institute'],
+             'World Wide Fund for Nature (WWF)': ['world wildlife', 'world wide fund for nature'],
+         }
+
+         # Dictionary of organization abbreviations
+         org_abbreviations = {
+             'Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)': ['GIZ'],
+             'Deutsche Sparkassenstiftung (DSIK)': ['DSIK'],
+             'Development Initiative for Community Impact (DICI)': ['DICI'],
+             'East African Centre of Excellence for Renewable Energy and Efficiency (EACREEE)': ['EACREEE'],
+             'Food and Agriculture Organization (FAO)': ['FAO'],
+             'Global Green Growth Institute (GGGI)': ['GGGI'],
+             'International Finance Corporation (IFC)': ['IFC'],
+             'International Organization for Migration (IOM)': ['IOM'],
+             'Inter-American Development Bank (IDB)': ['IDB'],
+             'United Nations Children\'s Fund (UNICEF)': ['UNICEF'],
+             'United Nations Conference on Trade and Development (UNCTAD)': ['UNCTAD'],
+             'United Nations Development Programme (UNDP)': ['UNDP'],
+             'United Nations Economic and Social Commission (ECOSOC)': ['ECOSOC'],
+             'United Nations Environment Programme (UNEP)': ['UNEP'],
+             'United Nations Industrial Development Organization (UNIDO)': ['UNIDO'],
+             'United Nations High Commissioner for Refugees (UNHCR)': ['UNHCR'],
+             'United Nations Office for Project Services (UNOPS)': ['UNOPS'],
+             'World Food Programme (WFP)': ['WFP'],
+             'World Health Organization (WHO)': ['WHO'],
+             'World Resources Institute (WRI)': ['WRI'],
+             'World Wide Fund for Nature (WWF)': ['WWF']
+         }
+
+         # Initialize result column
+         df['org_renamed'] = None
+
+         # Step 1: Process abbreviations first (highest priority for exact acronyms)
+         # Use case-insensitive matching to catch "giz", "GIZ", "Giz", etc.
+         logger.info("Processing abbreviation matches")
+         for standard_name, abbreviations in org_abbreviations.items():
+             for abbreviation in abbreviations:
+                 # Case-insensitive matching with word boundaries to avoid partial matches
+                 pattern = r'\b' + re.escape(abbreviation) + r'\b'
+                 mask = df['organization'].str.contains(pattern, case=False, regex=True, na=False) & df['org_renamed'].isna()
+                 df.loc[mask, 'org_renamed'] = standard_name
+
+         # Step 2: Process substring variations (e.g., "adventist development" in full org name)
+         # Use improved substring matching to reduce false positives
+         logger.info("Processing variation matches")
+         for standard_name, variations in org_variations.items():
+             for var in variations:
+                 # Check if already matched by abbreviation
+                 mask = df['org_renamed'].isna()
+                 if mask.sum() == 0:
+                     break
+                 # Use simple substring matching (case-insensitive)
+                 # Note: Not using word boundaries here as variations are often partial phrases
+                 org_lower = df.loc[mask, 'organization'].str.lower()
+                 submask = org_lower.str.contains(re.escape(var), regex=True, na=False)
+                 df.loc[mask & submask, 'org_renamed'] = standard_name
+
+         # Step 3: Process fuzzy matches against the dictionary for remaining unmatched organizations
+         unmatched_mask = df['org_renamed'].isna()
+         if unmatched_mask.sum() > 0:
+             logger.info(f"Processing fuzzy matches against dictionary for {unmatched_mask.sum()} unmatched organizations")
+
+             # Get unique unmatched organization names to avoid duplicate processing
+             unique_unmatched = df.loc[unmatched_mask, 'organization'].unique()
+             threshold = 85  # token_set_ratio handles extra tokens (e.g., country names) better
+
+             # Create mapping dictionary for unique unmatched orgs
+             org_mapping = {}
+             for org in unique_unmatched:
+                 org_lower = str(org).lower()
+                 best_match = None
+                 highest_ratio = 0
+
+                 # Check against all variations and standard names
+                 for standard_name, variations in org_variations.items():
+                     all_forms = [standard_name.lower()] + variations
+                     for variant in all_forms:
+                         # Use token_set_ratio for better matching with extra tokens/geographic qualifiers
+                         ratio = fuzz.token_set_ratio(org_lower, variant)
+                         if ratio > threshold and ratio > highest_ratio:
+                             highest_ratio = ratio
+                             best_match = standard_name
+
+                 if best_match:
+                     org_mapping[org] = best_match
+                     logger.debug(f"Fuzzy matched '{org}' to '{best_match}' (score: {highest_ratio})")
+
+             # Apply the mapping to all matching rows
+             if org_mapping:
+                 for original_org, standard_org in org_mapping.items():
+                     mask = (df['organization'] == original_org) & df['org_renamed'].isna()
+                     df.loc[mask, 'org_renamed'] = standard_org
+
+         # Step 4: Dynamic fuzzy matching of remaining unmatched organizations
+         # Compare new unmatched orgs against previously stored unmatched orgs
+         unmatched_mask = df['org_renamed'].isna()
+         if unmatched_mask.sum() > 0:
+             logger.info(f"Processing dynamic fuzzy matching for {unmatched_mask.sum()} remaining unmatched organizations")
+
+             # Get unmatched organizations in order of first appearance
+             unmatched_df = df[unmatched_mask].copy()
+             unmatched_df = unmatched_df.sort_values('id')
+             unique_unmatched_ordered = unmatched_df['organization'].drop_duplicates().tolist()
+
+             threshold = 95  # Higher threshold for dynamic matching to avoid false positives
+             stored_orgs = []  # List of previously seen unmatched orgs (becomes canonical)
+             org_mapping = {}  # Maps original org name to stored canonical org
+
+             for org in unique_unmatched_ordered:
+                 org_lower = str(org).lower()
+                 best_match = None
+                 highest_ratio = 0
+
+                 # Compare against all previously stored unmatched organizations
+                 for stored_org in stored_orgs:
+                     stored_org_lower = str(stored_org).lower()
+                     ratio = fuzz.token_set_ratio(org_lower, stored_org_lower)
+
+                     if ratio > threshold and ratio > highest_ratio:
+                         highest_ratio = ratio
+                         best_match = stored_org
+
+                 if best_match:
+                     # Match found - map to the stored org
+                     org_mapping[org] = best_match
+                     logger.debug(f"Dynamically matched '{org}' to '{best_match}' (score: {highest_ratio})")
+                 else:
+                     # No match - store this org for future comparisons
+                     stored_orgs.append(org)
+                     org_mapping[org] = org
+
+             # Apply the mapping to all matching rows
+             for original_org, canonical_org in org_mapping.items():
+                 mask = (df['organization'] == original_org) & df['org_renamed'].isna()
+                 df.loc[mask, 'org_renamed'] = canonical_org
+
+         # Fill remaining empty values with original organization names
+         df.loc[df['org_renamed'].isna(), 'org_renamed'] = df.loc[df['org_renamed'].isna(), 'organization']
+
+         # Add concept count (running count of concepts per standardized organization)
+         df['concept_count'] = df.groupby('org_renamed').cumcount() + 1
+
+         # Reorder columns with id, organization, org_renamed, concept_count first, followed by all others
+         cols = ['id', 'organization', 'org_renamed', 'concept_count']
+         other_cols = [col for col in df.columns if col not in cols]
+         df = df[cols + other_cols]
+
+     return df
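The fuzzy steps use `token_set_ratio` because it ignores token order and extra tokens such as geographic qualifiers. A quick illustration with thefuzz:

```python
from thefuzz import fuzz

# Extra tokens ("kenya office") barely lower the token-set score
print(fuzz.token_set_ratio(
    "united nations development programme (undp) kenya office",
    "united nations development program"))  # high score
print(fuzz.ratio("UNDP Kenya", "united nations development program"))  # much lower
```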
modules/pipeline.py ADDED
@@ -0,0 +1,339 @@
+ import re
+ import time
+ import pandas as pd
+ import streamlit as st
+ import torch
+ from setfit import SetFitModel
+ from transformers import pipeline
+ from modules.org_count import standardize_organization_names
+ from modules.utils import clean_text, extract_predicted_labels
+ # from modules.llm import check_duplicate_concepts
+ from modules.semantic_similarity import check_duplicate_concepts_semantic
+ from sentence_transformers import SentenceTransformer
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ # extract_predicted_labels (top-2 SECTOR label extraction) is imported from modules.utils
+
+ # Function to call model and run inference for varying classification tasks/models
+ def predict_category(df, model_name, progress_bar, repo, profile, multilabel=False):
+     device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu"))
+     model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']
+
+     # Model configuration mapping
+     model_config = {
+         'ADAPMIT_TECH': {'col_name': 'tech_txt', 'top_k': 1},
+         'ADAPMIT_SCOPE': {'col_name': 'scope_txt', 'top_k': 1},
+         'LANG': {'col_name': 'scope_txt', 'top_k': 1},
+         'default': {'col_name': 'scope_txt', 'top_k': None}
+     }
+
+     if model_name in model_names_sf:
+         col_name = re.sub(r'_(.*)', r'_txt', model_name)
+         model = SetFitModel.from_pretrained(profile + "/" + repo)
+         model.to(device)
+         # Get tokenizer from the model body for manual truncation
+         tokenizer = model.model_body.tokenizer
+     else:
+         # Get configuration for the model, falling back to default if not specified
+         config = model_config.get(model_name, model_config['default'])
+         col_name = config['col_name']
+         model = pipeline("text-classification",
+                          model=profile + "/" + repo,
+                          device=device,
+                          top_k=config['top_k'],
+                          truncation=True,
+                          max_length=512)
+
+     predictions = []
+     total = len(df)
+     for i, text in enumerate(df[col_name]):
+         try:
+             if model_name in model_names_sf:
+                 # Truncate text for SetFit models (no truncation flag at call time)
+                 encoded = tokenizer(text, truncation=True, max_length=512)
+                 truncated_text = tokenizer.decode(encoded['input_ids'])
+                 prediction = model(truncated_text)
+                 predictions.append(0 if prediction == 'NEGATIVE' else 1)
+             else:
+                 prediction = model(text)
+                 if model_name == 'ADAPMIT_SCOPE' or model_name == 'ADAPMIT_TECH':
+                     predictions.append(re.sub('Label$', '', prediction[0][0]['label']))
+                 elif model_name == 'SECTOR':
+                     predictions.append(extract_predicted_labels(prediction[0], threshold=0.5))
+                 elif model_name == 'LANG':
+                     predictions.append(prediction[0][0]['label'])
+         except Exception as e:
+             logger.error(f"Error processing sample {df['id'].iloc[i]}: {str(e)}")
+             st.error("Application Error. Please contact support.")
+         # Update progress bar with each iteration
+         progress = (i + 1) / total
+         progress_bar.progress(progress)
+
+     return predictions
+
+
+ # Main function to process data
+ def process_data(uploaded_file, sens_level, azure_client, azure_deployment):
+     """
+     Process uploaded application data through the ML pipeline
+
+     Args:
+         uploaded_file: Excel file containing application data
+         sens_level: Sensitivity level for filtering (4=Low, 5=Medium, 6=High)
+         azure_client: Azure OpenAI client instance for LLM calls
+         azure_deployment: Azure OpenAI deployment name
+
+     Returns:
+         Processed DataFrame with predictions and scores
+     """
+     # Define required columns and their mappings
+     required_columns = {
+         'id': 'id',
+         'scope': 'scope_txt',
+         'technology': 'tech_txt',
+         'financial': 'fin_txt',
+         'barrier': 'bar_txt',
+         'maf_funding_requested': 'maf_funding',
+         'contributions_public_sector': 'cont_public',
+         'contributions_private_sector': 'cont_private',
+         'contributions_other': 'cont_other',
+         'mitigation_potential': 'mitigation_potential'
+     }
+
+     # Read the Excel file
+     try:
+         df = pd.read_excel(uploaded_file)
+         logger.info("Data import successful")
+         # Clean up organization names
+         df = standardize_organization_names(df)
+     except Exception as e:
+         error_msg = f"Failed to read Excel file: {str(e)}"
+         logger.error(error_msg)
+         st.error("Failed to read the uploaded file. Please ensure it's a valid Excel file.")
+         raise ValueError(error_msg)
+
+     # Validate required columns
+     missing_columns = [col for col in required_columns.keys() if col not in df.columns]
+     if missing_columns:
+         error_msg = f"Missing required columns: {', '.join(missing_columns)}"
+         logger.error(error_msg)
+         st.error(error_msg)
+         raise ValueError(error_msg)
+
+     # Rename required columns while preserving all others
+     df = df.rename(columns={k: v for k, v in required_columns.items() if k in df.columns})
+
+     # Clean and process text fields
+     df.fillna('', inplace=True)
+     df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']] = df[['scope_txt', 'tech_txt', 'fin_txt', 'bar_txt']].applymap(clean_text)
+
+     # Define models and predictions
+     model_names_sf = ['scope_lab1', 'scope_lab2', 'tech_lab1', 'tech_lab3', 'fin_lab2', 'bar_lab2']
+     model_names = model_names_sf + ['ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'SECTOR', 'LANG', 'DUPLICATE_CHECK']
+     # model_names_sf = []
+     # model_names = ['ADAPMIT_SCOPE','ADAPMIT_TECH']
+     total_predictions = len(model_names) * len(df)
+     progress_count = 0
+
+     # UI setup for progress tracking
+     st.subheader("Overall Progress:")
+     patience_text = st.empty()
+     patience_text.markdown("*You may want to grab a coffee, this can take a while...*")
+     overall_progress = st.progress(0)
+     overall_start_time = time.time()
+     estimated_time_remaining_text = st.empty()
+
+     # Model processing
+     step_count = 0
+     total_steps = len(model_names)
+     for model_name in model_names:
+         logger.info(f"Loading: {model_name}")
+         step_count += 1
+         model_processing_text = st.empty()
+         model_processing_text.markdown(f'**Current Task: Processing with model "{model_name}"**')
+         model_progress = st.empty()
+         progress_bar = model_progress.progress(0)
+
+         # Load the model and run inference
+         if model_name in model_names_sf:
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='classifier_SF_' + model_name, profile='mtyrrell')
+         elif model_name == 'ADAPMIT_SCOPE':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
+         elif model_name == 'ADAPMIT_TECH':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='ADAPMIT-multilabel-bge_f', profile='GIZ')
+         elif model_name == 'SECTOR':
+             sectors_dict = predict_category(df, model_name, progress_bar, repo='SECTOR-multilabel-bge_f', profile='GIZ', multilabel=True)
+             df['SECTOR1'] = [item['SECTOR1'] for item in sectors_dict]
+             df['SECTOR2'] = [item['SECTOR2'] for item in sectors_dict]
+         elif model_name == 'LANG':
+             df[model_name] = predict_category(df, model_name, progress_bar, repo='51-languages-classifier', profile='qanastek')
+             # df[model_name] = predict_category(df, model_name, progress_bar, repo='xlm-roberta-base-language-detection', profile='papluca')
+         elif model_name == 'DUPLICATE_CHECK':
+             # Load semantic similarity model for duplicate detection
+             device = torch.device("cuda") if torch.cuda.is_available() else (torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu"))
+             logger.info(f"Loading semantic similarity model on device: {device}")
+             semantic_model = SentenceTransformer('BAAI/bge-m3', device=device)
+
+             # Process duplicate check with progress tracking
+             duplicate_results = []
+             total = len(df)
+             # enumerate gives a positional counter for the progress bar regardless of index labels
+             for i, (_, row) in enumerate(df.iterrows()):
+                 result = check_duplicate_concepts_semantic(
+                     semantic_model,
+                     row['id'],
+                     row['org_renamed'],
+                     row['scope_txt'],
+                     df
+                 )
+                 duplicate_results.append(result)
+                 # Update progress bar with each iteration
+                 progress = (i + 1) / total
+                 progress_bar.progress(progress)
+             df['duplicate_check'] = duplicate_results
+
+         logger.info(f"Completed: {model_name}")
+         model_progress.empty()
+
+         progress_count += len(df)
+         overall_progress_value = progress_count / total_predictions
+         overall_progress.progress(overall_progress_value)
+
+         # Calculate and display estimated time remaining
+         elapsed_time = time.time() - overall_start_time
+         steps_remaining = total_steps - step_count
+         if step_count > 1:
+             estimated_time_remaining = (elapsed_time / step_count) * steps_remaining
+             estimated_time_remaining_text.markdown(
+                 f"Elapsed time: {elapsed_time:.1f}s. "
+                 f"Estimated time remaining: {estimated_time_remaining:.1f}s"
+                 f" (step {step_count+1} of {len(model_names)})"
+             )
+         else:
+             estimated_time_remaining_text.write(f'Calculating time remaining... (step {step_count+1} of {len(model_names)})')
+
+         model_processing_text.empty()
+
+     patience_text.empty()
+     estimated_time_remaining_text.empty()
+
+     st.write(f'Processing complete. Total time: {elapsed_time:.1f} seconds')
+
+     # df['ADAPMIT_SCOPE_SCORE'] = df['ADAPMIT_SCOPE'].apply(
+     #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
+     # )
+     # df['ADAPMIT_TECH_SCORE'] = df['ADAPMIT_TECH'].apply(
+     #     lambda x: next((item['score'] for item in x if item['label'] == 'MitigationLabel'), 0)
+     # )
+
+     # # Calculate average mitigation score
+     # df['ADAPMIT_SCORE'] = (df['ADAPMIT_SCOPE_SCORE'] + df['ADAPMIT_TECH_SCORE']) / 2
+
+     # Only classify as Adaptation if both the scope and tech classifiers agree
+     df['ADAPMIT'] = df.apply(lambda x: 'Adaptation' if x['ADAPMIT_SCOPE'] == 'Adaptation' and x['ADAPMIT_TECH'] == 'Adaptation' else 'Mitigation', axis=1)
+
+     # Convert funding columns to numeric, replacing any non-numeric values with NaN
+     df['maf_funding'] = pd.to_numeric(df['maf_funding'], errors='coerce')
+     df['cont_public'] = pd.to_numeric(df['cont_public'], errors='coerce')
+     df['cont_private'] = pd.to_numeric(df['cont_private'], errors='coerce')
+     df['cont_other'] = pd.to_numeric(df['cont_other'], errors='coerce')
+     # Same for mitigation potential
+     df['mitigation_potential'] = abs(pd.to_numeric(df['mitigation_potential'], errors='coerce'))
+
+     # Fill any NaN values with 0
+     df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']] = df[['maf_funding', 'cont_public', 'cont_private', 'cont_other']].fillna(0)
+
+     # Get total of all leverage
+     df['lev_total'] = df.apply(lambda x: x['cont_public'] + x['cont_private'] + x['cont_other'], axis=1)
+     # Leverage > MAF request
+     # df['lev_gt_maf'] = df.apply(lambda x: 'True' if x['lev_total'] > x['maf_funding'] else 'False', axis=1)  # not used
+     # Leverage > 0 ?
+     df['lev_gt_0'] = (df['lev_total'] > 0).astype(int)
+     # Calculate leverage as percentage of MAF funding
+     df['lev_maf_%'] = df.apply(lambda x: round(x['lev_total']/x['maf_funding']*100, 2) if x['maf_funding'] != 0 else 0, axis=1)
+     # Create normalized leverage scale (0-1) where 300% leverage = 1
+     df['lev_maf_scale'] = df['lev_maf_%'].apply(lambda x: min(x/300, 1) if x > 0 else 0)
+     # EUR / tCO2e mitigation potential
+     df['cost_effectiveness'] = df.apply(lambda x: round(x['maf_funding']/x['mitigation_potential'], 2) if x['mitigation_potential'] > 0 else None, axis=1)
+     # Normalize cost_effectiveness to 0-1 scale (lower cost = higher score, capped at 1000 EUR/tCO2e)
+     df['cost_effectiveness_norm'] = df['cost_effectiveness'].apply(lambda x: max(0, 1 - (x / 1000)) if pd.notna(x) else None)
+
+     # Test whether all text fields fall short of the minimum required words
+     df['word_length_check'] = df.apply(lambda x:
+         True if len(x['scope_txt'].split()) < 10 and
+                 len(x['fin_txt'].split()) < 10 and
+                 len(x['tech_txt'].split()) < 10
+         else False, axis=1)
+
+     # Predict score
+     sector_classes = ['Energy', 'Transport', 'Industries']
+     df['pred_score'] = df.apply(lambda x: round((x['fin_lab2']*2 + x['scope_lab1']*2 + x['scope_lab2']*2 + x['tech_lab1'] + x['tech_lab3'] + x['bar_lab2'] + x['lev_gt_0'] + x['lev_maf_scale'])/11*10, 0), axis=1)
+     # Labelling logic
+     df['pred_action'] = df.apply(lambda x:
+         'REJECT' if (('concept_count' in df.columns and x['concept_count'] > 6) or
+                      x['LANG'][0:2] != 'en' or
+                      x['ADAPMIT'] == 'Adaptation' or
+                      not any(sector in [x['SECTOR1'], x['SECTOR2']] for sector in sector_classes) or
+                      x['word_length_check'] == True or
+                      x['duplicate_check'] == True or
+                      x['pred_score'] <= sens_level)
+         else 'PRE-ASSESSMENT' if sens_level + 1 <= x['pred_score'] <= sens_level + 2
+         else 'FULL-ASSESSMENT' if x['pred_score'] > sens_level + 2
+         else 'ERROR', axis=1)
+
+     # Reorder columns in final dataframe
+     column_order = ['id', 'organization', 'org_renamed', 'concept_count', 'duplicate_check', 'scope_txt', 'tech_txt', 'fin_txt', 'maf_funding', 'cont_public',
+                     'cont_private', 'cont_other', 'scope_lab1', 'scope_lab2', 'tech_lab1',
+                     'tech_lab3', 'fin_lab2', 'bar_lab2', 'ADAPMIT_SCOPE', 'ADAPMIT_TECH', 'ADAPMIT', 'SECTOR1',
+                     'SECTOR2', 'LANG', 'lev_total', 'lev_gt_0', 'lev_maf_%', 'lev_maf_scale', 'mitigation_potential', 'cost_effectiveness', 'cost_effectiveness_norm',
+                     'word_length_check', 'pred_score', 'pred_action']
+
+     # Only include columns that exist in the DataFrame
+     final_columns = [col for col in column_order if col in df.columns]
+     df = df[final_columns]
+
+     return df
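The leverage features in numbers: an application requesting 1,000,000 with 1,500,000 in total co-contributions gets lev_maf_% = 150 and lev_maf_scale = 0.5, since 300% leverage saturates the scale at 1.0. A worked check:

```python
maf_funding, lev_total = 1_000_000, 1_500_000
lev_maf_pct = round(lev_total / maf_funding * 100, 2)  # 150.0
lev_maf_scale = min(lev_maf_pct / 300, 1)              # 0.5
print(lev_maf_pct, lev_maf_scale)
```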
modules/prompts.py ADDED
@@ -0,0 +1,21 @@
+ # Prompts library
+ from typing import List
+
+ def prompt_concept(concept: str, other_concepts: List[str]) -> str:
+     """Generate prompt for classifying concepts by similarity"""
+     prompt = f"""
+ Each organization is allowed to submit up to 6 concepts per year via a web portal. However, in some cases organizations submit the same concept multiple times.
+ This can happen for various reasons. For example, an organization may erroneously submit the same application twice because it lost access to the previous web session.
+ In such cases the duplicate concepts are usually not verbatim identical; rather, there is high semantic alignment, i.e. it is the same concept with minor superficial differences between the applications.
+ Your task is to review the concept profiles submitted by a particular organization and assess their similarity so that cases of duplicate concepts can be identified.
+
+ Here is the concept profile for review:
+ {concept}
+
+ Please review this against the following concepts and assess for duplication:
+ {other_concepts}
+
+ Please conduct your review carefully and ensure that you tag all duplicates correctly. Please return your response according to the following structure:
+ """
+     return prompt
modules/semantic_similarity.py ADDED
@@ -0,0 +1,70 @@
+ # Semantic similarity-based duplicate detection
+ import pandas as pd
+ import logging
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+ from modules.utils import setup_logging
+
+ logger = setup_logging()
+
+
+ def check_duplicate_concepts_semantic(
+     model: SentenceTransformer,
+     concept_id: str,
+     organization: str,
+     concept_profile: str,
+     df: pd.DataFrame,
+     similarity_threshold: float = 0.85
+ ) -> bool:
+     """
+     Check for duplicate concepts within the same organization using semantic similarity
+
+     Args:
+         model: SentenceTransformer model for computing embeddings
+         concept_id: ID of the current concept being checked
+         organization: Organization name
+         concept_profile: Text description of the concept to check
+         df: DataFrame containing all application data
+         similarity_threshold: Threshold for considering concepts duplicates (0-1)
+             Recommended values: 0.80 (lenient) to 0.95 (strict)
+
+     Returns:
+         Boolean classification result
+     """
+
+     # Remove current concept from the dataframe
+     df_check = df[df['id'] != concept_id].copy()
+
+     # Get other concepts from the same organization
+     org_concepts = df_check[df_check['org_renamed'] == organization]
+     other_concepts = org_concepts['scope_txt'].tolist()
+
+     # If no other concepts from this organization, return False
+     if len(other_concepts) == 0:
+         return False
+
+     # Compute embedding for current concept
+     current_embedding = model.encode(
+         concept_profile if concept_profile else "",
+         convert_to_numpy=True
+     )
+
+     # Compute embeddings for other concepts
+     other_embeddings = model.encode(
+         [text if text else "" for text in other_concepts],
+         convert_to_numpy=True
+     )
+
+     # Compute cosine similarities
+     similarities = cosine_similarity(
+         current_embedding.reshape(1, -1),
+         other_embeddings
+     )[0]
+
+     max_similarity = similarities.max() if len(similarities) > 0 else 0.0
+
+     logger.info(f"Duplicate check response for concept ID {concept_id}: max_similarity={max_similarity:.3f}")
+
+     if max_similarity >= similarity_threshold:
+         return True
+     return False
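A toy end-to-end check of the duplicate detector (illustrative only; a two-row DataFrame stands in for real application data):

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from modules.semantic_similarity import check_duplicate_concepts_semantic

df = pd.DataFrame({
    "id": [1, 2],
    "org_renamed": ["Org A", "Org A"],
    "scope_txt": ["Solar mini-grids for rural clinics",
                  "Solar mini grids powering rural health clinics"],
})
model = SentenceTransformer("BAAI/bge-m3")
print(check_duplicate_concepts_semantic(model, 1, "Org A", df.loc[0, "scope_txt"], df))
# True whenever the pair's cosine similarity reaches the 0.85 threshold
```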
modules/utils.py ADDED
@@ -0,0 +1,111 @@
+ import re
+ from io import BytesIO
+ from openpyxl import Workbook
+ from openpyxl.styles import Font, PatternFill
+ import logging
+ from logging.handlers import RotatingFileHandler
+ import os
+ import configparser
+
+ def setup_logging():
+     # Set up logging
+     log_dir = 'logs'
+     os.makedirs(log_dir, exist_ok=True)
+     log_file = os.path.join(log_dir, 'app.log')
+
+     # Create a RotatingFileHandler (1MB max per file, 5 backups)
+     file_handler = RotatingFileHandler(log_file, maxBytes=1024 * 1024, backupCount=5)
+     file_handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))
+
+     # Configure the root logger
+     logging.basicConfig(level=logging.INFO,
+                         format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+                         handlers=[file_handler, logging.StreamHandler()])
+
+     # Return a logger instance
+     return logging.getLogger(__name__)
+
+
+ def getconfig(configfile_path: str):
+     """
+     Read the config file
+     Params
+     ----------------
+     configfile_path: file path of .cfg file
+     """
+     config = configparser.ConfigParser()
+     try:
+         with open(configfile_path) as f:
+             config.read_file(f)
+         return config
+     except Exception:
+         logging.warning("Config file not found")
+
+ # Function for creating the upload template file
+ def create_excel():
+     wb = Workbook()
+     sheet = wb.active
+     sheet.title = "template"
+     columns = ['id',
+                'organization',
+                'scope',
+                'technology',
+                'financial',
+                'barrier',
+                'maf_funding_requested',
+                'contributions_public_sector',
+                'contributions_private_sector',
+                'contributions_other',
+                'mitigation_potential']
+     sheet.append(columns)  # Appending columns to the first row
+
+     # Style the header row (A1:K1)
+     for c in sheet['A1:K1'][0]:
+         c.fill = PatternFill('solid', fgColor='bad8e1')
+         c.font = Font(bold=True)
+
+     # Save to a BytesIO object
+     output = BytesIO()
+     wb.save(output)
+     return output.getvalue()
+
+
+ # Function to clean text
+ def clean_text(input_text):
+     cleaned_text = re.sub(r"[^a-zA-Z0-9\s.,:;!?()\-\n]", "", input_text)
+     cleaned_text = re.sub(r"x000D", "", cleaned_text)
+     cleaned_text = re.sub(r"\s+", " ", cleaned_text)
+     cleaned_text = re.sub(r"\n+", "\n", cleaned_text)
+     return cleaned_text
+
+
+ # Function for extracting classifications for each SECTOR label
+ def extract_predicted_labels(output, ordinal_selection=1, threshold=0.5):
+
+     # Verify output is a list of dictionaries
+     if isinstance(output, list) and all(isinstance(item, dict) for item in output):
+         # Filter items with scores above the threshold
+         filtered_items = [item for item in output if item.get('score', 0) > threshold]
+
+         # Sort the filtered items by score in descending order
+         sorted_items = sorted(filtered_items, key=lambda x: x.get('score', 0), reverse=True)
+
+         # Extract the highest and second-highest labels
+         if len(sorted_items) >= 2:
+             highest_label = sorted_items[0].get('label')
+             second_highest_label = sorted_items[1].get('label')
+         elif len(sorted_items) == 1:
+             highest_label = sorted_items[0].get('label')
+             second_highest_label = None
+         else:
+             print("Warning: Fewer than two items above the threshold in the current list.")
+             highest_label = None
+             second_highest_label = None
+     else:
+         print("Error: Inner data is not formatted correctly. Each item must be a dictionary.")
+         highest_label = None
+         second_highest_label = None
+
+     # Output dictionary of highest and second-highest labels
+     predicted_labels = {"SECTOR1": highest_label, "SECTOR2": second_highest_label}
+     return predicted_labels
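What `extract_predicted_labels` returns for a typical multilabel SECTOR output (illustrative scores):

```python
from modules.utils import extract_predicted_labels

output = [
    {"label": "Energy", "score": 0.91},
    {"label": "Transport", "score": 0.62},
    {"label": "Agriculture", "score": 0.12},
]
print(extract_predicted_labels(output, threshold=0.5))
# {'SECTOR1': 'Energy', 'SECTOR2': 'Transport'}
```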
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ streamlit
+ pandas
+ openpyxl
+ setfit
+ bcrypt
+ # Match the CUDA 11.8 runtime in the Dockerfile base image
+ --extra-index-url https://download.pytorch.org/whl/cu118
+ torch
+ thefuzz
+ openai==2.9.0
+ python-dotenv
+ sentence-transformers
+ scikit-learn