RAG Document Application

An end-to-end RAG application, built from scratch on FastAPI, that runs OCR on PDFs, images, and web pages, generates embeddings with OpenAI's embedding models, and uses Pinecone as the vector database for search. It answers questions from the search results using OpenAI Chat Completions.

Checklist

Prerequisites

Python 3.11
Docker
Docker Compose (v2.20.2)
AWS S3 Credentials (API-Keys)
OpenAI API-Key
Pinecone API-Key
A prepared .env file (see Configuration)

Installation

git clone https://github.com/teamunitlab/rag-document-app.git
cd rag-document-app/
docker-compose up -d --build

Configuration

Create a .env file in the root directory and add the following environment variables:

API_KEY=your_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_pinecone_index_name
AWS_ACCESS_KEY_ID=your_aws_access_key_id
AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
AWS_DEFAULT_REGION=your_aws_region
BUCKET_NAME=your_s3_bucket_name

Ensure AWS credentials and bucket policies are correctly configured to allow S3 access.
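
A missing or blank variable is easy to overlook before starting the containers. The following is a minimal sketch of a pre-flight check, assuming the eight variable names listed above; the `missing_settings` helper is hypothetical and not part of this repository.

```python
import os

# Variable names copied from the Configuration section above.
REQUIRED_SETTINGS = (
    "API_KEY",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
    "BUCKET_NAME",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or blank in env.

    Hypothetical helper: env defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]
```

Calling `missing_settings()` with no arguments inspects the current environment; an empty list means every variable is set.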

Usage

Run the FastAPI application:
```
docker-compose up -d
```
Access the API documentation at http://localhost:8000/docs.

Endpoints

Upload File

URL: /upload
Method: POST
Description: Upload files to S3. Limited to 10 requests per minute.
Request:
- files: List of files to upload.
- API-Key: Header for API key authentication.
Response: JSON containing file IDs and URLs.
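
The upload call can be sketched with the Python standard library alone. This is an illustration under assumptions, not the project's own client: the multipart field name `files` and the `API-Key` header are taken from the description above, the base URL is the local address from Usage, and `build_upload_request` is a hypothetical helper. Check the exact request schema at /docs.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:8000"  # assumption: local address from the Usage section


def build_upload_request(files, api_key, base_url=BASE_URL):
    """Build (but do not send) a multipart POST request to /upload.

    files: list of (filename, content_bytes, content_type) tuples.
    The form field name "files" is an assumption based on the endpoint
    description; verify it against /docs.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for filename, content, content_type in files:
        head = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        parts.append(head.encode("utf-8") + content + b"\r\n")
    # Closing boundary marks the end of the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        f"{base_url}/upload",
        data=b"".join(parts),
        method="POST",
        headers={
            "API-Key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the JSON response should then contain the file IDs and URLs needed by the other endpoints.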

Process OCR

URL: /ocr
Method: POST
Description: Process OCR for a given file URL. Limited to 10 requests per minute.
Request:
- url: URL of the file to process, obtained when the file (document) is uploaded.
- API-Key: Header for API key authentication.
Response: JSON containing information about the processing status.
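
A request to this endpoint might be built as follows. The description does not say whether `url` is sent as a JSON body, a form field, or a query parameter; this sketch assumes a JSON body, and `build_ocr_request` is a hypothetical helper. Confirm the actual schema at /docs.

```python
import json
import urllib.request


def build_ocr_request(file_url, api_key, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request to /ocr for an uploaded file."""
    # Assumption: the endpoint accepts a JSON body of the form {"url": "..."}.
    payload = json.dumps({"url": file_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ocr",
        data=payload,
        method="POST",
        headers={"API-Key": api_key, "Content-Type": "application/json"},
    )
```

The `file_url` argument is the URL returned by /upload for the file in question.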

Extract Data

URL: /extract
Method: POST
Description: Answer a query using OpenAI chat completions, based on Pinecone search results for the given file ID. Limited to 10 requests per minute.
Request:
- file_id: The file ID, obtained when the file (document) is uploaded.
- query: The question to ask about your document.
- API-Key: Header for API key authentication.
Response: JSON containing the reply to the question and the top three search results.
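
To make the retrieval step concrete, here is a small in-memory stand-in for the vector search. It is not the application's code: in the real service Pinecone performs the search, and the similarity metric, the record layout (`file_id`, `text`, `vector`), and the `top_matches` helper are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vector, records, file_id, k=3):
    """Return up to k (score, text) pairs for file_id, best match first.

    records: list of dicts with hypothetical keys "file_id", "text", "vector".
    """
    # Restrict the search to chunks that belong to the requested file.
    candidates = [r for r in records if r["file_id"] == file_id]
    scored = [
        (cosine_similarity(query_vector, r["vector"]), r["text"])
        for r in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a pipeline like the one described, the query would first be embedded with the same model as the document chunks, and the texts of the top matches would be passed to the chat model as context for the answer.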
