# Comic Story Generator: Code Handover Document

**Date:** 2025-7-22

**Document Purpose:** This document provides a comprehensive technical handover for the Comic Story Generator project. It is intended for developers and future maintainers responsible for the deployment, maintenance, and extension of the application.

---

## 1. Project Overview

The Comic Story Generator is a web application that automatically creates multi-page, textless comic stories from a user-provided description. The application leverages generative AI to produce visually coherent narratives, focusing on character consistency, expressive emotion, and logical panel sequencing.

### 1.1. Core Functionality

The application is designed to translate a textual story concept into a purely visual comic strip. Key characteristics include:

* **AI-Powered Narrative:** Utilizes Google's Gemini to interpret the user's concept and break it down into a structured, panel-by-panel narrative.
* **Visual Generation:** Employs a GPT-based image model to render complete comic pages based on the AI-generated narrative structure.
* **Intelligent Panel Detection:** Uses Gemini Vision to analyze the generated full-page image and accurately detect the boundaries of each panel, ensuring precise splitting.
* **Customization:** Offers users control over the output, including:
    * **Layout:** Choice of panel count (from 4 to 24).
    * **Length:** Generation of 1 to 10 pages.
    * **Art Style:** A selection of visual styles, including "Classic Comic," "Manga," "Cartoon," "Digital Paint," and a high-contrast "Accessible" style designed for users with special needs.

### 1.2. High-Level Workflow

The generation process follows a clear, multi-step pipeline:

1. **User Input:** The user submits a short description of the desired story.
2. **Story Generation:** The `StoryGenerator` component uses Gemini to create a detailed, scene-by-scene description for each comic panel.
3. **Page Generation:** The `ComicGenerator` takes the panel descriptions and instructs the GPT-Image model to generate a single, composite image representing a full comic page with panels arranged in a grid.
4. **Layout Analysis:** The generated page is passed to the `GeminiVision` component, which analyzes the image to identify the precise coordinates and boundaries of each panel.
5. **Panel Splitting:** The application uses the coordinates from the vision analysis to accurately split the composite image into individual panel images.
6. **Final Output:** The processed panels are presented to the user as a complete, multi-page visual story.

---

## 2. System Architecture

The application is built on a modular architecture composed of three primary classes, each responsible for a distinct part of the generation pipeline.

### 2.1. System Diagram

```mermaid
classDiagram
    class StoryGenerator{
        +generate_story(description: string) : list[string]
        +enhance_visuals(panel_descriptions: list) : list[string]
    }
    class ComicGenerator{
        +generate_page(panel_descriptions: list) : Image
        +split_panels(page_image: Image, grid_layout: dict) : list[Image]
    }
    class GeminiVision{
        +analyze_layout(page_image: Image) : dict
    }
    StoryGenerator "1" -- "1" ComicGenerator : Provides panel descriptions
    ComicGenerator "1" -- "1" GeminiVision : Uses for layout analysis
```

### 2.2. Data Flow

The end-to-end data flow illustrates the interaction between the user, the application, and the underlying AI models.
```mermaid
sequenceDiagram
    participant User
    participant App
    participant Gemini as Gemini (Text/Story)
    participant GPTImage as GPT-Image (Visuals)
    participant GeminiVision as Gemini Vision (Analysis)

    User->>+App: Submits story description
    App->>+Gemini: Requests story structure from description
    Gemini-->>-App: Returns panel-by-panel text descriptions
    App->>+GPTImage: Requests comic page generation from descriptions
    GPTImage-->>-App: Returns single full-page image
    App->>+GeminiVision: Requests layout analysis of the image
    GeminiVision-->>-App: Returns coordinates of each panel
    App->>User: Displays final, split-panel comic
```

---

## 3. Setup & Installation

### 3.1. Prerequisites

* **Python:** Version 3.9 or higher.
* **API Keys:**
    * An active OpenAI API key.
    * An active Google API key with access to the Gemini family of models.

### 3.2. Installation Steps

1. **Clone the Repository:**

    ```bash
    git clone https://github.com/yourusername/Comic-Story-Generator.git
    cd Comic-Story-Generator
    ```

2. **Create and Activate a Virtual Environment:**

    ```bash
    # Create the environment
    python -m venv venv

    # Activate the environment (macOS/Linux)
    source venv/bin/activate

    # Or, activate on Windows
    # venv\Scripts\activate
    ```

3. **Install Dependencies:**

    ```bash
    pip install -r requirements.txt
    ```

4. **Configure Environment Variables:** Create a `.env` file in the project root and add your API keys.

    ```bash
    echo "OPENAI_API_KEY=your_openai_key" > .env
    echo "GOOGLE_API_KEY=your_google_key" >> .env
    ```

    *Note: Ensure the `.env` file is added to your `.gitignore` file to prevent committing secrets.*

---

## 4. Environment Variables / Secrets

The application requires the following environment variables to be set in a `.env` file at the project's root.

| Variable | Description | Required | Example |
| :--- | :--- | :--- | :--- |
| `OPENAI_API_KEY` | API key for the OpenAI service, used for GPT-Image generation. | Yes | `sk-xxxxxxxxxxxxxxxxxxxxxxxx` |
| `GOOGLE_API_KEY` | API key for Google AI services, used for Gemini (story structure) and Gemini Vision (layout analysis). | Yes | `AIzaSyxxxxxxxxxxxxxxxxxxxxx` |

---

## 5. How to Run

After completing the setup and installation steps, launch the application with the following command from the project's root directory:

```bash
python app.py
```

The application will start a local web server, and the interface will be accessible at the URL provided in the console (typically `http://127.0.0.1:7860`).

---

## 6. Deployment Instructions

[TODO] This section requires documentation for deploying the application to a production environment. Steps should include:

* Recommended hosting provider (e.g., AWS, Heroku, DigitalOcean).
* Instructions for setting up a production-grade web server (e.g., Gunicorn).
* Configuration of a reverse proxy (e.g., Nginx).
* Management of production environment variables/secrets.
* Process management (e.g., using `systemd`).

---

## 7. Core Components & Logic

The application logic is encapsulated in three main classes.

### 7.1. `StoryGenerator`

* **Responsibility:** Handles the narrative creation phase.
* **`generate_story()`:** Takes the raw user description as input. It constructs a prompt for the Gemini model to elicit a structured response containing a list of detailed text descriptions, one for each comic panel.
* **`enhance_visuals()`:** Processes the panel descriptions to add specific visual cues and optimizations, particularly for the "Accessible" style, ensuring high contrast and simplified object representation.

### 7.2. `ComicGenerator`

* **Responsibility:** Manages the visual generation and processing of the comic page.
* **`generate_page()`:** Aggregates the panel descriptions from `StoryGenerator` into a single, complex prompt for the GPT-Image model. This prompt instructs the AI to create one composite image with all panels laid out in a grid.
* **`split_panels()`:** Receives the generated page image and the layout data from `GeminiVision`. It uses this data to crop the page into individual panel images with high precision.

### 7.3. `GeminiVision`

* **Responsibility:** Performs visual analysis on the generated comic page.
* **`analyze_layout()`:** This is the core of the intelligent panel-splitting feature. It takes the full-page image as input and uses the Gemini Vision model to visually identify the boundaries of each panel. It returns a dictionary containing the coordinates and dimensions of the detected grid, which is more robust than assuming a fixed grid layout.

---

## 8. Third-party Dependencies

The complete list of Python packages is specified in `requirements.txt`. Key dependencies include:

* **`openai`**: Python client for the OpenAI API.
* **`google-generativeai`**: Python client for the Google AI (Gemini) API.
* **`python-dotenv`**: For loading environment variables from the `.env` file.
* **`Pillow`**: For image manipulation (cropping and saving).
* **[Info Needed]**: The web framework used to build `app.py` (e.g., `gradio`, `flask`, `fastapi`).

---

## 9. Testing Instructions

[TODO] A testing framework has not been established for this project. Future work should include:

* **Test Suite Setup:** Choose and configure a testing framework (e.g., `pytest`).
* **Unit Tests:** Create unit tests for individual methods in `StoryGenerator`, `ComicGenerator`, and `GeminiVision`. This should involve mocking the API calls to AI services to test the data processing logic in isolation.
* **Integration Tests:** Develop tests for the entire generation pipeline, from user input to final split panels.
* **Continuous Integration:** Set up a CI pipeline (e.g., using GitHub Actions) to run tests automatically on pull requests.

---

## 10. Troubleshooting & Common Issues

[TODO] This section should be populated as common issues are identified.
Potential areas to document include:

* **API Key Errors:** Steps to verify that API keys are correctly configured and have the necessary permissions.
* **Incoherent Stories:** Guidance on how to write effective initial descriptions to improve narrative quality.
* **Poor Panel Splitting:** Troubleshooting steps for when Gemini Vision fails to detect the layout correctly (e.g., checking image complexity, trying a different art style).
* **Long Generation Times:** Explanation of typical performance and factors that can cause delays (e.g., API provider latency, number of panels).

---

## 11. TODOs / Future Work

Based on the project's focus areas, the following are key areas for future development and contribution:

* **Core Generation Logic:**
    * Improve character consistency across multiple pages.
    * Experiment with different AI models for potentially better visual or narrative results.
    * Add support for including text (dialogue, captions) as an optional feature.
* **UI/UX Enhancements:**
    * Develop a more interactive interface for viewing and arranging panels.
    * Allow users to regenerate individual panels without restarting the entire process.
    * Add an option to export the final comic as a PDF or other formats.
* **Accessibility Improvements:**
    * Further refine the "Accessible" art style based on user feedback.
    * Implement ARIA attributes and ensure full keyboard navigability for the web interface.
    * Add an "image description" feature where a text-to-speech engine can describe the generated panels.
* **Documentation:**
    * Create a detailed API reference for developers looking to build on the platform.
    * Write user-facing guides on how to get the best results from the generator.

---

## 12. Contact / Ownership Info

* **Source Code:** [https://github.com/yourusername/Comic-Story-Generator](https://github.com/yourusername/Comic-Story-Generator)
* **License:** This project is licensed under the **MIT License**. For full details, see the `LICENSE` file in the repository.
* **Primary Contact:** [Info Needed: Add primary maintainer's name and contact information (e.g., GitHub handle or email).]
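
---

## 13. Appendix: Panel Splitting Sketch

Section 7 describes `split_panels()` as cropping the page using layout data from `analyze_layout()`, and notes that detected coordinates are more robust than a fixed grid. The following is a minimal sketch of that coordinate handling, not the project's actual implementation. The layout dictionary format (a `"panels"` list with normalized `x`, `y`, `width`, `height` values) and the helper names `layout_to_boxes` and `fixed_grid_boxes` are assumptions made for illustration; the real return format of `analyze_layout()` should be checked in the source.

```python
from typing import Dict, List, Tuple

# (left, upper, right, lower) in pixels
Box = Tuple[int, int, int, int]


def layout_to_boxes(layout: Dict, page_width: int, page_height: int) -> List[Box]:
    """Convert a detected layout into pixel crop boxes in reading order.

    ASSUMPTION: `layout` looks like
        {"panels": [{"x": 0.0, "y": 0.0, "width": 0.5, "height": 0.5}, ...]}
    with values normalized to the 0-1 range. This format is hypothetical.
    """
    boxes: List[Box] = []
    for panel in layout.get("panels", []):
        left = round(panel["x"] * page_width)
        upper = round(panel["y"] * page_height)
        right = round((panel["x"] + panel["width"]) * page_width)
        lower = round((panel["y"] + panel["height"]) * page_height)
        # Clamp to the page so a slightly-off detection cannot crop outside it.
        left, right = max(0, left), min(page_width, right)
        upper, lower = max(0, upper), min(page_height, lower)
        if right > left and lower > upper:  # skip empty or inverted boxes
            boxes.append((left, upper, right, lower))
    # Sort by top edge, then left edge. Rows whose panels are not perfectly
    # aligned would need a tolerance here; this sketch does not handle that.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def fixed_grid_boxes(rows: int, cols: int, page_width: int, page_height: int) -> List[Box]:
    """Fallback: split the page into a uniform rows x cols grid."""
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * page_width // cols,
                r * page_height // rows,
                (c + 1) * page_width // cols,
                (r + 1) * page_height // rows,
            ))
    return boxes
```

Each resulting box uses the `(left, upper, right, lower)` order that Pillow's `Image.crop()` expects, so a page image could be split with `[page.crop(box) for box in boxes]`. The fixed-grid helper is only a fallback for cases where layout detection returns nothing usable.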