---
title: rag_korean_manufacturing_docs
app_file: fixed_gradio_demo.py
sdk: gradio
sdk_version: 5.39.0
---
# 🏭 Manufacturing RAG Agent
A comprehensive Retrieval-Augmented Generation (RAG) system designed specifically for manufacturing document analysis. This system can process PDFs, Excel files with embedded images, and standalone images to provide accurate answers with complete citations and metadata tracking.
## ✨ Features
### 📄 Multi-Format Document Processing
- PDF Documents: Text extraction, table detection, and embedded image processing
- Excel Files: Worksheet data extraction, embedded image processing, and table detection
- Images: OCR text extraction with preprocessing for improved accuracy
- Metadata Preservation: Complete citation tracking with page numbers, worksheet names, and cell ranges
### 🧠 Advanced RAG Capabilities
- Semantic Search: Vector-based similarity search using Qdrant
- Reranking: Improved relevance using Silicon Flow's Qwen3 reranker
- Fast LLM Inference: Sub-second response times using Groq's LPU architecture
- Citation Generation: Automatic source attribution with confidence scores
### 🔧 Production-Ready Features
- Scalable Architecture: Designed to handle up to 1TB of manufacturing data
- Incremental Processing: Efficient updates without reprocessing existing data
- Comprehensive Monitoring: Health checks, statistics, and performance metrics
- Interactive Demo: Streamlit-based web interface for easy testing
## 🏗️ Architecture

```mermaid
graph TB
    subgraph "User Interface"
        UI[Streamlit Demo]
        API[REST API]
    end

    subgraph "RAG Engine"
        QA[Question Answering]
        RET[Document Retrieval]
        RANK[Reranking]
    end

    subgraph "Processing Pipeline"
        DOC[Document Processor]
        EMB[Embedding Generator]
        OCR[Image OCR]
    end

    subgraph "Storage Layer"
        VDB[(Qdrant Vector DB)]
        MDB[(SQLite Metadata)]
        FS[(File Storage)]
    end

    subgraph "External APIs"
        GROQ[Groq LLM API]
        SF[Silicon Flow API]
    end

    UI --> QA
    QA --> RET
    RET --> RANK
    RANK --> GROQ
    DOC --> EMB
    DOC --> OCR
    EMB --> SF
    OCR --> SF
    EMB --> VDB
    DOC --> MDB
    DOC --> FS
```
## 🚀 Quick Start

### Prerequisites

- **Python 3.8+**
- **API Keys:**
  - Groq API key for LLM inference
  - Silicon Flow API key for embeddings and reranking
  - Qdrant instance (local or cloud)
- **System Dependencies:**
  - Tesseract OCR for image processing
  - PyMuPDF for PDF processing
### Installation

1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd manufacturing-rag-agent
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install system dependencies:

   **macOS:**
   ```bash
   brew install tesseract
   ```

   **Ubuntu/Debian:**
   ```bash
   sudo apt-get install tesseract-ocr
   ```

   **Windows:** Download and install from the Tesseract GitHub releases page

4. Set up environment variables:

   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

5. Configure Qdrant:

   **Local Qdrant (Docker):**
   ```bash
   docker run -p 6333:6333 qdrant/qdrant
   ```

   Or use Qdrant Cloud and update the URL in `.env`.
### Configuration

Edit `src/config.yaml` to customize the system:

```yaml
# RAG System Configuration
rag_system:
  embedding_model: "qwen3-embedding"
  reranker_model: "qwen3-reranker"
  llm_model: "openai/gpt-oss-120b"
  chunk_size: 512
  chunk_overlap: 50
  max_context_chunks: 5
  similarity_threshold: 0.7

# Document Processing
document_processing:
  pdf_engine: "pymupdf"
  excel_engine: "openpyxl"
  ocr_engine: "tesseract"
  image_processing: true
  table_extraction: true
  max_file_size_mb: 100

# Storage Configuration
storage:
  qdrant_collection: "manufacturing_docs"
  metadata_db_path: "./data/metadata.db"
  file_storage_path: "./data/documents"
```
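In application code, the `rag_system` section above can be mapped onto a typed settings object. This is a minimal sketch; the class name and `from_dict` helper are illustrative, not part of the project's API, though the field names and defaults mirror the YAML above:

```python
from dataclasses import dataclass

@dataclass
class RagSystemSettings:
    """Typed view of the `rag_system` section of src/config.yaml."""
    embedding_model: str = "qwen3-embedding"
    reranker_model: str = "qwen3-reranker"
    llm_model: str = "openai/gpt-oss-120b"
    chunk_size: int = 512
    chunk_overlap: int = 50
    max_context_chunks: int = 5
    similarity_threshold: float = 0.7

    @classmethod
    def from_dict(cls, raw: dict) -> "RagSystemSettings":
        # Ignore unknown keys so the config file can evolve independently.
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in raw.items() if k in known})

# Values missing from the dict fall back to the defaults above.
settings = RagSystemSettings.from_dict({"chunk_size": 256, "future_key": True})
print(settings.chunk_size)            # → 256
print(settings.similarity_threshold)  # → 0.7
```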
### Running the Demo

Launch the Streamlit demo:

```bash
python launch_rag_demo.py
```

Or run directly:

```bash
streamlit run src/rag_demo.py
```

The demo will be available at http://localhost:8501.
## 📖 Usage Guide
### 1. Document Upload

- Navigate to the "📄 Document Upload" page
- Upload your manufacturing documents (PDF, Excel, or images)
- Click "Process Documents" to ingest them into the system
- Monitor processing progress and results
### 2. Asking Questions

- Go to the "❓ Ask Questions" page
- Enter your question about the manufacturing data
- Optionally configure advanced settings:
  - Number of context chunks
  - Similarity threshold
  - Document type filters
- View the answer with detailed citations
### 3. Analytics

- Visit the "📊 Analytics" page to view:
  - Document processing statistics
  - Document type distribution
  - Processing status overview
  - Recent activity
### 4. System Monitoring

- Check the "⚙️ System Status" page for:
  - Component health checks
  - Configuration details
  - Performance metrics
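Component health checks of this kind can be aggregated with a small helper. This is a sketch under stated assumptions: the `run_health_checks` function and the stand-in checks below are illustrative, not the project's actual API; real checks would ping Qdrant, the SQLite metadata DB, and the external LLM/embedding APIs:

```python
from typing import Callable, Dict

def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run each component check; a False result or any exception counts as unhealthy."""
    status: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            status[name] = "healthy" if check() else "unhealthy"
        except Exception:
            status[name] = "unhealthy"
    return status

# Illustrative stand-ins for real connectivity probes.
def llm_api_check() -> bool:
    raise TimeoutError("LLM API did not respond")

print(run_health_checks({
    "vector_db": lambda: True,
    "metadata_db": lambda: True,
    "llm_api": llm_api_check,
}))
# → {'vector_db': 'healthy', 'metadata_db': 'healthy', 'llm_api': 'unhealthy'}
```

Catching exceptions per component keeps one failing dependency from masking the status of the others.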
## 🔧 API Usage

### Document Ingestion

```python
from src.rag.ingestion_pipeline import DocumentIngestionPipeline

# Initialize pipeline
config = {...}  # Your configuration
pipeline = DocumentIngestionPipeline(config)

# Ingest a single document
result = pipeline.ingest_document("path/to/document.pdf")

# Batch ingestion
results = pipeline.ingest_batch([
    "path/to/doc1.pdf",
    "path/to/doc2.xlsx",
    "path/to/image.png",
])
```
### Question Answering

```python
from src.rag.rag_engine import RAGEngine

# Initialize RAG engine
rag_engine = RAGEngine(config)

# Ask a question
response = rag_engine.answer_question(
    "What is the average production yield for Q3?"
)

print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence_score}")
print(f"Sources: {len(response.citations)}")

# View citations
for citation in response.citations:
    print(f"Source: {citation.source_file}")
    if citation.page_number:
        print(f"Page: {citation.page_number}")
    if citation.worksheet_name:
        print(f"Sheet: {citation.worksheet_name}")
```
## 🧪 Testing

Run the test suite:

```bash
# Run all tests
pytest

# Run specific test modules
pytest src/tests/test_document_processor.py
pytest src/tests/test_rag_system.py

# Run with coverage
pytest --cov=src --cov-report=html
```
## 📊 Performance

### Benchmarks

**Document Processing:**
- PDF: ~2-5 seconds per page
- Excel: ~1-3 seconds per worksheet
- Images: ~1-2 seconds per image (with OCR)

**Query Response Time:**
- Vector search: ~100-300 ms
- Reranking: ~200-500 ms
- LLM generation: ~500-1500 ms
- Total: ~1-3 seconds per query

**Scalability:**
- Tested with up to 10,000 documents
- Supports concurrent processing
- Memory-efficient chunking strategy
### Optimization Tips
- Batch Processing: Process multiple documents together for better throughput
- Chunk Size: Adjust chunk size based on your document types
- Embedding Cache: Enable caching for repeated content
- Qdrant Optimization: Use appropriate vector size and distance metrics
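The chunk-size trade-off above can be illustrated with a minimal overlapping chunker. This sketch works at the character level for simplicity (the configured `chunk_size: 512` / `chunk_overlap: 50` likely operate on tokens); the function name is illustrative, not the project's internal API:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list:
    """Split text into fixed-size chunks, repeating the last `overlap`
    characters of each chunk at the start of the next, so content cut
    at a boundary still appears intact in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 1000, chunk_size=512, overlap=50)
print(len(chunks))       # → 3
print(len(chunks[0]))    # → 512
```

Larger chunks mean fewer vectors to store and search but coarser retrieval granularity; the overlap guards against answers that straddle a boundary.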
## 🔒 Security Considerations
- API Keys: Store securely in environment variables
- File Validation: Automatic file type and size validation
- Input Sanitization: All user inputs are sanitized
- Access Control: Implement authentication for production use
- Data Privacy: Consider data residency requirements for cloud APIs
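The file type and size validation mentioned above can be sketched as follows. The function, allow-list, and limit here are illustrative assumptions (the 100 MB cap mirrors `max_file_size_mb` in the configuration), not the project's actual validator:

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".pdf", ".xlsx", ".xls", ".png", ".jpg", ".jpeg"}
MAX_FILE_SIZE_MB = 100  # mirrors max_file_size_mb in config.yaml

def validate_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of validation errors; an empty list means the file is acceptable."""
    errors = []
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported file type: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_MB * 1024 * 1024:
        errors.append(f"file exceeds {MAX_FILE_SIZE_MB} MB limit")
    return errors

print(validate_upload("report.pdf", 5 * 1024 * 1024))    # → []
print(validate_upload("script.exe", 200 * 1024 * 1024))  # → two errors
```

Rejecting files before they reach the processing pipeline keeps malformed or oversized uploads from tying up OCR and embedding resources.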
## 🛠️ Troubleshooting

### Common Issues

**Tesseract Not Found:**

```bash
# Install Tesseract OCR
brew install tesseract              # macOS
sudo apt-get install tesseract-ocr  # Ubuntu
```

**Qdrant Connection Failed:**
- Check if Qdrant is running:
  ```bash
  curl http://localhost:6333/health
  ```
- Verify the URL and API key in `.env`

**API Rate Limits:**
- Check your API quotas
- Implement exponential backoff (already included)

**Memory Issues:**
- Reduce batch size in configuration
- Process documents individually for large files

**Slow Performance:**
- Check network connectivity to APIs
- Monitor Qdrant performance
- Consider local embedding models for high-volume use
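The exponential backoff mentioned under API Rate Limits can be sketched as below. This is a generic illustration, not the project's built-in implementation; the function name and parameters are assumptions:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry `call` on any exception, doubling the delay each attempt
    (capped at max_delay) and adding small random jitter to avoid
    synchronized retries; re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            sleep(delay + random.uniform(0, delay * 0.1))

# Example: a call that fails twice before succeeding.
attempts = []
def flaky_api_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("rate limited")
    return "ok"

print(with_backoff(flaky_api_call, sleep=lambda d: None))  # → ok
```

Injecting `sleep` as a parameter keeps the helper testable without real delays.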
### Debug Mode

Enable debug logging:

```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

Or set an environment variable:

```bash
export DEBUG=true
```
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch:
   ```bash
   git checkout -b feature-name
   ```
3. Make your changes and add tests
4. Run the test suite:
   ```bash
   pytest
   ```
5. Submit a pull request
### Development Setup

```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run pre-commit hooks
pre-commit install

# Run linting
flake8 src/
black src/

# Run type checking
mypy src/
```
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Groq for fast LLM inference
- Silicon Flow for embedding and reranking APIs
- Qdrant for vector database capabilities
- Streamlit for the interactive demo interface
- PyMuPDF for PDF processing
- OpenPyXL for Excel file handling
- Tesseract for OCR capabilities
## 📞 Support
For questions, issues, or feature requests:
- Check the Issues page
- Review the Troubleshooting section
- Create a new issue with detailed information
---

*Built with ❤️ for manufacturing excellence*