---
title: rag_korean_manufacturing_docs
app_file: fixed_gradio_demo.py
sdk: gradio
sdk_version: 5.39.0
---
# 🏭 Manufacturing RAG Agent
A comprehensive Retrieval-Augmented Generation (RAG) system designed specifically for manufacturing document analysis. This system can process PDFs, Excel files with embedded images, and standalone images to provide accurate answers with complete citations and metadata tracking.
## ✨ Features
### 📄 Multi-Format Document Processing
- PDF Documents: Text extraction, table detection, and embedded image processing
- Excel Files: Worksheet data extraction, embedded image processing, and table detection
- Images: OCR text extraction with preprocessing for improved accuracy
- Metadata Preservation: Complete citation tracking with page numbers, worksheet names, and cell ranges
### 🧠 Advanced RAG Capabilities
- Semantic Search: Vector-based similarity search using Qdrant
- Reranking: Improved relevance using Silicon Flow's Qwen3 reranker
- Fast LLM Inference: Sub-second response times using Groq's LPU architecture
- Citation Generation: Automatic source attribution with confidence scores
### 🔧 Production-Ready Features
- Scalable Architecture: Designed to handle up to 1TB of manufacturing data
- Incremental Processing: Efficient updates without reprocessing existing data
- Comprehensive Monitoring: Health checks, statistics, and performance metrics
- Interactive Demo: Streamlit-based web interface for easy testing
## 🏗️ Architecture

```mermaid
graph TB
    subgraph "User Interface"
        UI[Streamlit Demo]
        API[REST API]
    end

    subgraph "RAG Engine"
        QA[Question Answering]
        RET[Document Retrieval]
        RANK[Reranking]
    end

    subgraph "Processing Pipeline"
        DOC[Document Processor]
        EMB[Embedding Generator]
        OCR[Image OCR]
    end

    subgraph "Storage Layer"
        VDB[(Qdrant Vector DB)]
        MDB[(SQLite Metadata)]
        FS[(File Storage)]
    end

    subgraph "External APIs"
        GROQ[Groq LLM API]
        SF[Silicon Flow API]
    end

    UI --> QA
    QA --> RET
    RET --> RANK
    RANK --> GROQ
    DOC --> EMB
    DOC --> OCR
    EMB --> SF
    OCR --> SF
    EMB --> VDB
    DOC --> MDB
    DOC --> FS
```
## 🚀 Quick Start

### Prerequisites

- **Python 3.8+**
- **API Keys:**
  - Groq API key for LLM inference
  - Silicon Flow API key for embeddings and reranking
  - Qdrant instance (local or cloud)
- **System Dependencies:**
  - Tesseract OCR for image processing
  - PyMuPDF for PDF processing
### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd manufacturing-rag-agent
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install system dependencies:

   **macOS:**
   ```bash
   brew install tesseract
   ```

   **Ubuntu/Debian:**
   ```bash
   sudo apt-get install tesseract-ocr
   ```

   **Windows:** Download and install from the Tesseract GitHub releases page

4. Set up environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

5. Configure Qdrant:

   **Local Qdrant (Docker):**
   ```bash
   docker run -p 6333:6333 qdrant/qdrant
   ```

   Or use Qdrant Cloud and update the URL in `.env`.
### Configuration

Edit `src/config.yaml` to customize the system:

```yaml
# RAG System Configuration
rag_system:
  embedding_model: "qwen3-embedding"
  reranker_model: "qwen3-reranker"
  llm_model: "openai/gpt-oss-120b"
  chunk_size: 512
  chunk_overlap: 50
  max_context_chunks: 5
  similarity_threshold: 0.7

# Document Processing
document_processing:
  pdf_engine: "pymupdf"
  excel_engine: "openpyxl"
  ocr_engine: "tesseract"
  image_processing: true
  table_extraction: true
  max_file_size_mb: 100

# Storage Configuration
storage:
  qdrant_collection: "manufacturing_docs"
  metadata_db_path: "./data/metadata.db"
  file_storage_path: "./data/documents"
```
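In application code, the `rag_system` section above can be mapped onto a typed settings object. This is a minimal sketch; the class name and `from_dict` helper are illustrative, not part of the project's API, though the field names and defaults mirror the YAML above:

```python
from dataclasses import dataclass

@dataclass
class RagSystemSettings:
    """Typed view of the `rag_system` section of src/config.yaml."""
    embedding_model: str = "qwen3-embedding"
    reranker_model: str = "qwen3-reranker"
    llm_model: str = "openai/gpt-oss-120b"
    chunk_size: int = 512
    chunk_overlap: int = 50
    max_context_chunks: int = 5
    similarity_threshold: float = 0.7

    @classmethod
    def from_dict(cls, raw: dict) -> "RagSystemSettings":
        # Ignore unknown keys so the config file can evolve independently.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})

# Values missing from the dict fall back to the defaults above.
settings = RagSystemSettings.from_dict({"chunk_size": 256, "future_key": True})
print(settings.chunk_size)            # → 256
print(settings.similarity_threshold)  # → 0.7
```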
### Running the Demo

Launch the Streamlit demo:

```bash
python launch_rag_demo.py
```

Or run directly:

```bash
streamlit run src/rag_demo.py
```

The demo will be available at http://localhost:8501.
## 📖 Usage Guide
### 1. Document Upload

- Navigate to the "📄 Document Upload" page
- Upload your manufacturing documents (PDF, Excel, or images)
- Click "Process Documents" to ingest them into the system
- Monitor processing progress and results
### 2. Asking Questions

- Go to the "❓ Ask Questions" page
- Enter your question about the manufacturing data
- Optionally configure advanced settings:
  - Number of context chunks
  - Similarity threshold
  - Document type filters
- View the answer with detailed citations
### 3. Analytics

- Visit the "📊 Analytics" page to view:
  - Document processing statistics
  - Document type distribution
  - Processing status overview
  - Recent activity
### 4. System Monitoring

- Check the "⚙️ System Status" page for:
  - Component health checks
  - Configuration details
  - Performance metrics
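Component health checks of this kind can be aggregated with a small helper. This is a sketch under stated assumptions: the `run_health_checks` function and the stand-in checks below are illustrative, not the project's actual API; real checks would ping Qdrant, the SQLite metadata DB, and the external LLM/embedding APIs:

```python
from typing import Callable, Dict

def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each component check; a False result or any exception counts as unhealthy."""
    status: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            status[name] = "healthy" if check() else "unhealthy"
        except Exception:
            status[name] = "unhealthy"
    return status

# Illustrative stand-ins for real connectivity probes.
def llm_api_check() -> bool:
    raise TimeoutError("LLM API did not respond")

print(run_health_checks({
    "vector_db": lambda: True,
    "metadata_db": lambda: True,
    "llm_api": llm_api_check,
}))
# → {'vector_db': 'healthy', 'metadata_db': 'healthy', 'llm_api': 'unhealthy'}
```

Catching exceptions per component keeps one failing dependency from masking the status of the others.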
## 🔧 API Usage

### Document Ingestion

```python
from src.rag.ingestion_pipeline import DocumentIngestionPipeline

# Initialize pipeline
config = {...}  # Your configuration
pipeline = DocumentIngestionPipeline(config)

# Ingest a single document
result = pipeline.ingest_document("path/to/document.pdf")

# Batch ingestion
results = pipeline.ingest_batch([
    "path/to/doc1.pdf",
    "path/to/doc2.xlsx",
    "path/to/image.png",
])
```
### Question Answering

```python
from src.rag.rag_engine import RAGEngine

# Initialize RAG engine
rag_engine = RAGEngine(config)

# Ask a question
response = rag_engine.answer_question(
    "What is the average production yield for Q3?"
)

print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence_score}")
print(f"Sources: {len(response.citations)}")

# View citations
for citation in response.citations:
    print(f"Source: {citation.source_file}")
    if citation.page_number:
        print(f"Page: {citation.page_number}")
    if citation.worksheet_name:
        print(f"Sheet: {citation.worksheet_name}")
```
## 🧪 Testing

Run the test suite:

```bash
# Run all tests
pytest

# Run specific test modules
pytest src/tests/test_document_processor.py
pytest src/tests/test_rag_system.py

# Run with coverage
pytest --cov=src --cov-report=html
```
## 📊 Performance

### Benchmarks

**Document Processing:**
- PDF: ~2-5 seconds per page
- Excel: ~1-3 seconds per worksheet
- Images: ~1-2 seconds per image (with OCR)

**Query Response Time:**
- Vector search: ~100-300 ms
- Reranking: ~200-500 ms
- LLM generation: ~500-1500 ms
- Total: ~1-3 seconds per query

**Scalability:**
- Tested with up to 10,000 documents
- Supports concurrent processing
- Memory-efficient chunking strategy
### Optimization Tips
- Batch Processing: Process multiple documents together for better throughput
- Chunk Size: Adjust chunk size based on your document types
- Embedding Cache: Enable caching for repeated content
- Qdrant Optimization: Use appropriate vector size and distance metrics
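The chunk-size trade-off above can be illustrated with a minimal overlapping chunker. This sketch works at the character level for simplicity (the configured `chunk_size: 512` / `chunk_overlap: 50` likely operate on tokens); the function name is illustrative, not the project's internal API:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """Split text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next, so content cut
    at a boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 1000, chunk_size=512, overlap=50)
print(len(chunks))       # → 3
print(len(chunks[0]))    # → 512
```

Larger chunks mean fewer vectors to store and search but coarser retrieval granularity; the overlap guards against answers that straddle a boundary.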
## 🔒 Security Considerations
- API Keys: Store securely in environment variables
- File Validation: Automatic file type and size validation
- Input Sanitization: All user inputs are sanitized
- Access Control: Implement authentication for production use
- Data Privacy: Consider data residency requirements for cloud APIs
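The file type and size validation mentioned above can be sketched as follows. The function, allow-list, and limit here are illustrative assumptions (the 100 MB cap mirrors `max_file_size_mb` in the configuration), not the project's actual validator:

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".pdf", ".xlsx", ".xls", ".png", ".jpg", ".jpeg"}
MAX_FILE_SIZE_MB = 100  # mirrors max_file_size_mb in config.yaml

def validate_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of validation errors; an empty list means the file is acceptable."""
    errors = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        errors.append(f"file exceeds {MAX_FILE_SIZE_MB} MB limit")
    return errors

print(validate_upload("report.pdf", 5 * 1024 * 1024))    # → []
print(validate_upload("script.exe", 200 * 1024 * 1024))  # → two errors
```

Rejecting files before they reach the processing pipeline keeps malformed or oversized uploads from tying up OCR and embedding resources.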
## 🛠️ Troubleshooting

### Common Issues

**Tesseract Not Found:**

```bash
# Install Tesseract OCR
brew install tesseract              # macOS
sudo apt-get install tesseract-ocr  # Ubuntu
```

**Qdrant Connection Failed:**
- Check if Qdrant is running:
  ```bash
  curl http://localhost:6333/health
  ```
- Verify the URL and API key in `.env`

**API Rate Limits:**
- Check your API quotas
- Implement exponential backoff (already included)

**Memory Issues:**
- Reduce batch size in configuration
- Process documents individually for large files

**Slow Performance:**
- Check network connectivity to APIs
- Monitor Qdrant performance
- Consider local embedding models for high-volume use
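The exponential backoff mentioned under API Rate Limits can be sketched as below. This is a generic illustration, not the project's built-in implementation; the function name and parameters are assumptions:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry `call` on any exception, doubling the delay each attempt
    (capped at max_delay) and adding small random jitter to avoid
    synchronized retries; re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))

# Example: a call that fails twice before succeeding.
attempts = []
def flaky_api_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("rate limited")
    return "ok"

print(with_backoff(flaky_api_call, sleep=lambda d: None))  # → ok
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.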
### Debug Mode

Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Or set an environment variable:

```bash
export DEBUG=true
```
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch:
   ```bash
   git checkout -b feature-name
   ```
3. Make your changes and add tests
4. Run the test suite:
   ```bash
   pytest
   ```
5. Submit a pull request
### Development Setup

```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run pre-commit hooks
pre-commit install

# Run linting
flake8 src/
black src/

# Run type checking
mypy src/
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Groq for fast LLM inference
- Silicon Flow for embedding and reranking APIs
- Qdrant for vector database capabilities
- Streamlit for the interactive demo interface
- PyMuPDF for PDF processing
- OpenPyXL for Excel file handling
- Tesseract for OCR capabilities
## 📞 Support
For questions, issues, or feature requests:
- Check the Issues page
- Review the Troubleshooting section
- Create a new issue with detailed information
---

*Built with ❤️ for manufacturing excellence*