This project showcases a document retrieval system built on hybrid embeddings. By combining dense (semantic) and sparse (lexical) embeddings, the system provides efficient, accurate search across a large collection of documents. The primary technologies used include LangChain, Qdrant, and Groq, among others.
- Hybrid Retrieval Mode: Utilizes both dense and sparse embeddings for comprehensive search results.
- Document Processing: Efficiently loads, preprocesses, and splits documents into manageable chunks.
- Contextual Question Answering: Integrates a chat model for generating concise and relevant answers based on retrieved documents.
- Flexible Deployment: Easily adaptable to different datasets and deployment environments.
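Under the hood, hybrid retrieval runs the query against both a dense and a sparse index and fuses the two ranked result lists. Qdrant and LangChain handle this internally; the sketch below illustrates one common fusion strategy, Reciprocal Rank Fusion (RRF), using hypothetical document IDs. It is a conceptual illustration, not the library's actual code.

```python
def rrf_fuse(dense_ranked, sparse_ranked, k=60):
    """Reciprocal Rank Fusion: merge two ranked lists of doc IDs.

    Each document scores 1 / (k + rank) per list it appears in,
    so documents ranked well by BOTH retrievers rise to the top.
    k=60 is the constant suggested in the original RRF paper.
    """
    scores = {}
    for ranked in (dense_ranked, sparse_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a dense and a sparse index
dense = ["doc3", "doc1", "doc7"]
sparse = ["doc1", "doc9", "doc3"]
print(rrf_fuse(dense, sparse))
```

Note how `doc1` wins even though neither retriever ranked it first: it appears near the top of both lists, which is exactly the signal hybrid retrieval is designed to exploit.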
- LangChain: Framework for building chainable NLP applications.
- Qdrant: Vector database for managing and querying embeddings.
- Groq: High-speed LLM inference platform used to serve the chat model.
- OpenAI Embeddings: Dense embedding generation.
- FastEmbedSparse: Sparse embedding generation for hybrid retrieval.
- PyPDFLoader: Document loader for PDF files.
- RecursiveCharacterTextSplitter: Utility for splitting text into chunks.
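To give a feel for what the splitting step does, here is a minimal pure-Python sketch of recursive character splitting: try the coarsest separator first (paragraphs, then lines, then words), recurse into any piece that is still too large, and greedily merge small neighbors back together up to the chunk size. The function name `recursive_split` and its defaults are illustrative; this is not the library's actual implementation, which additionally supports chunk overlap and length functions.

```python
def recursive_split(text, chunk_size=80, separators=("\n\n", "\n", " ")):
    """Illustrative sketch of recursive character splitting:
    split on the coarsest separator, recurse into oversized
    pieces, then merge adjacent pieces up to chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut at chunk_size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            pieces.append(part)
        else:
            pieces.extend(recursive_split(part, chunk_size, rest))
    # Greedily merge adjacent pieces so chunks stay near chunk_size
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = ("First paragraph about hybrid retrieval.\n\n"
       "Second paragraph, long enough that it would be "
       "split further on smaller separators if it exceeded the limit.")
for chunk in recursive_split(doc):
    print(repr(chunk))
```

Keeping splits aligned with paragraph and sentence boundaries, rather than cutting at a fixed byte offset, is what keeps each chunk semantically coherent enough to embed and retrieve on its own.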
Feel free to fork this repository and contribute by submitting pull requests. For major changes, please open an issue first to discuss what you would like to change.
Special thanks to the open-source community and contributors who make projects like this possible.