Commit
·
441dd2f
1
Parent(s):
a042349
debug dependencies
- .gitignore +27 -0
- README.md +59 -6
- requirements.txt +4 -0
- results.csv +3 -0
.gitignore
ADDED
|
@@ -0,0 +1,27 @@
|
| 1 |
+
# Ignore runs directory (uploaded files)
|
| 2 |
+
runs/
|
| 3 |
+
*.json
|
| 4 |
+
!results.csv
|
| 5 |
+
|
| 6 |
+
# Python
|
| 7 |
+
__pycache__/
|
| 8 |
+
*.py[cod]
|
| 9 |
+
*$py.class
|
| 10 |
+
*.so
|
| 11 |
+
.Python
|
| 12 |
+
|
| 13 |
+
# Virtual environments
|
| 14 |
+
venv/
|
| 15 |
+
env/
|
| 16 |
+
ENV/
|
| 17 |
+
|
| 18 |
+
# IDE
|
| 19 |
+
.vscode/
|
| 20 |
+
.idea/
|
| 21 |
+
*.swp
|
| 22 |
+
*.swo
|
| 23 |
+
|
| 24 |
+
# OS
|
| 25 |
+
.DS_Store
|
| 26 |
+
Thumbs.db
|
| 27 |
+
|
README.md
CHANGED
|
@@ -1,14 +1,67 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
-
emoji:
|
| 4 |
colorFrom: indigo
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 5.49.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
-
license:
|
| 11 |
-
short_description: inner development phase use
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 1 |
---
|
| 2 |
+
title: CAPTCHAv2 Leaderboard
|
| 3 |
+
emoji: π
|
| 4 |
colorFrom: indigo
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 5.49.1
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
+
license: mit
|
| 11 |
---
|
| 12 |
|
| 13 |
+
# CAPTCHAv2 Leaderboard
|
| 14 |
+
|
| 15 |
+
A comprehensive leaderboard for comparing model performance across different CAPTCHA puzzle types. This interactive dashboard allows you to:
|
| 16 |
+
|
| 17 |
+
- View performance rankings across different CAPTCHA categories
|
| 18 |
+
- Compare models using interactive visualizations
|
| 19 |
+
- Analyze cost-effectiveness of different models
|
| 20 |
+
- Upload and update results easily
|
| 21 |
+
|
| 22 |
+
## Features
|
| 23 |
+
|
| 24 |
+
- **Interactive Leaderboard Table**: Sortable rankings with color-coded performance indicators
|
| 25 |
+
- **Performance Comparison Charts**: Visual bar charts showing pass rates across models
|
| 26 |
+
- **Performance by Type**: Detailed breakdown of performance across different CAPTCHA puzzle types
|
| 27 |
+
- **Cost-Effectiveness Analysis**: Scatter plot comparing performance vs. cost
|
| 28 |
+
- **Easy Upload**: Support for CSV and JSON result files
|
| 29 |
+
|
| 30 |
+
## How to Use
|
| 31 |
+
|
| 32 |
+
1. **View the Leaderboard**: Browse the current rankings and filter by category
|
| 33 |
+
2. **Sort Results**: Sort by Pass Rate, Duration, or Cost
|
| 34 |
+
3. **Upload Results**: Use the upload section to add new evaluation results
|
| 35 |
+
4. **Compare Models**: Use the visualizations to compare different models
|
| 36 |
+
|
| 37 |
+
## Uploading Results
|
| 38 |
+
|
| 39 |
+
The leaderboard supports multiple file formats:
|
| 40 |
+
|
| 41 |
+
- **CSV files**: Aggregated results with columns for Model, Provider, Agent Framework, Type, metrics, and per-type pass rates
|
| 42 |
+
- **JSON files**: Single object or array of aggregated results
|
| 43 |
+
- **benchmark_results.json**: Per-puzzle results in JSONL format (auto-converted)
|
| 44 |
+
|
| 45 |
+
See the upload section in the app for detailed instructions and file format requirements.
|
| 46 |
+
|
| 47 |
+
## Categories
|
| 48 |
+
|
| 49 |
+
The leaderboard tracks performance across various CAPTCHA types including:
|
| 50 |
+
- Dice Count
|
| 51 |
+
- Color Cipher
|
| 52 |
+
- Color Counting
|
| 53 |
+
- Dynamic Jigsaw
|
| 54 |
+
- Mirror
|
| 55 |
+
- Set Game
|
| 56 |
+
- Shadow Plausible
|
| 57 |
+
- Spooky variants (Circle, Jigsaw, Shape Grid, Size, Text)
|
| 58 |
+
- Trajectory Recovery
|
| 59 |
+
- Transform Pipeline
|
| 60 |
+
- And more...
|
| 61 |
+
|
| 62 |
+
## Model Types
|
| 63 |
+
|
| 64 |
+
Models are automatically categorized as:
|
| 65 |
+
- **Proprietary**: Commercial models (OpenAI, Anthropic, Google, etc.)
|
| 66 |
+
- **Open source**: Open source models (Llama, Mistral, Qwen, etc.)
|
| 67 |
+
|
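Note on the "Model Types" section of the README above: the commit does not show how models are "automatically categorized", so the following is only a minimal sketch of one way such a rule could look. The function name `categorize`, the two keyword collections, and the `"Unknown"` fallback are all assumptions made for illustration; only the provider and model-family names come from the README text.

```python
# Hypothetical lookup tables built from the names listed in the README.
# The real app's rules are not part of this commit.
PROPRIETARY_PROVIDERS = {"openai", "anthropic", "google"}
OPEN_SOURCE_FAMILIES = ("llama", "mistral", "qwen")


def categorize(model: str, provider: str) -> str:
    """Return 'Proprietary', 'Open source', or 'Unknown' for a model entry."""
    # A known commercial provider decides the category first.
    if provider.strip().lower() in PROPRIETARY_PROVIDERS:
        return "Proprietary"
    # Otherwise look for an open-weights family name inside the model name.
    name = model.lower()
    if any(family in name for family in OPEN_SOURCE_FAMILIES):
        return "Open source"
    # Anything else is left for a human to label (assumed fallback).
    return "Unknown"


print(categorize("gpt-5-2025-08-07", "OpenAI"))
```

A rule like this would not reproduce every row in results.csv on its own: the `Browser-Use BU-1.0` entry is labelled Proprietary with provider `browser-use`, which suggests the app either uses a longer list or accepts the Type column as given.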
requirements.txt
ADDED
|
@@ -0,0 +1,4 @@
|
| 1 |
+
gradio>=5.49.1
|
| 2 |
+
pandas>=2.3.3
|
| 3 |
+
matplotlib>=3.10.7
|
| 4 |
+
numpy
|
results.csv
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
Model,Provider,Agent Framework,Type,Overall Pass Rate,Avg Duration (s),Avg Cost ($),Color_Cipher,Color_Counting,Dice_Count,Dynamic_Jigsaw,Static_Puzzle,Map_Parity,Mirror,Red_Dot,Set_Game,Shadow_Plausible,Spooky_Circle,Spooky_Circle_Grid,Spooky_Jigsaw,Spooky_Shape_Grid,Spooky_Size,Spooky_Text,Squiggle,Trajectory_Recovery,Transform_Pipeline
|
| 2 |
+
gpt-5-2025-08-07,OpenAI,browser-use,Proprietary,0.16363636363636364, 10, 24.1,0.15,0.0,0.0,0.2,,,0.2,0.3,,0.0,0.2,0.4,,,0.0,0.0,0.35,,
|
| 3 |
+
Browser-Use BU-1.0,browser-use,browser-use,Proprietary,0.03333333333333333, 2,8.3,0.1,0.0,0.0,,0.0,,0.1,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0
|
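Note on results.csv: the two data rows contain a stray space after some commas (`, 10`, `, 2`) and leave cells blank for puzzle types that were not run. The sketch below shows one way a reader could parse the file as committed, using only the standard library rather than the pandas dependency from requirements.txt. The helper names `load_results` and `rank_by_pass_rate` are illustrative and do not come from the app.

```python
import csv
import io

# The three lines of results.csv added in this commit, copied verbatim.
RESULTS_CSV = """\
Model,Provider,Agent Framework,Type,Overall Pass Rate,Avg Duration (s),Avg Cost ($),Color_Cipher,Color_Counting,Dice_Count,Dynamic_Jigsaw,Static_Puzzle,Map_Parity,Mirror,Red_Dot,Set_Game,Shadow_Plausible,Spooky_Circle,Spooky_Circle_Grid,Spooky_Jigsaw,Spooky_Shape_Grid,Spooky_Size,Spooky_Text,Squiggle,Trajectory_Recovery,Transform_Pipeline
gpt-5-2025-08-07,OpenAI,browser-use,Proprietary,0.16363636363636364, 10, 24.1,0.15,0.0,0.0,0.2,,,0.2,0.3,,0.0,0.2,0.4,,,0.0,0.0,0.35,,
Browser-Use BU-1.0,browser-use,browser-use,Proprietary,0.03333333333333333, 2,8.3,0.1,0.0,0.0,,0.0,,0.1,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0
"""

# Columns that hold text; every other column is treated as numeric.
META_COLUMNS = ("Model", "Provider", "Agent Framework", "Type")


def load_results(text):
    """Parse leaderboard rows; numeric cells become floats, blanks become None."""
    # skipinitialspace drops the blank that follows some commas in the file.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for raw in reader:
        row = {}
        for key, value in raw.items():
            if key in META_COLUMNS:
                row[key] = value
            else:
                row[key] = float(value) if value.strip() else None
        rows.append(row)
    return rows


def rank_by_pass_rate(rows):
    """Return rows sorted from highest to lowest overall pass rate."""
    return sorted(rows, key=lambda r: r["Overall Pass Rate"], reverse=True)


rows = rank_by_pass_rate(load_results(RESULTS_CSV))
for r in rows:
    print(f'{r["Model"]}: {r["Overall Pass Rate"]:.1%}')
```

Keeping blank cells as `None` rather than `0.0` preserves the difference between "not evaluated" and "evaluated, never passed", which matters for any per-type average.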