Update README.md
do-me authored Oct 21, 2023
1 parent 12d3e6e commit d4fecbe
Showing 1 changed file with 13 additions and 3 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -2,7 +2,7 @@

## [App here](https://do-me.github.io/copernicus-services-semantic-search/)

A basic semantic search app based on 834 entries from Copernicus Services Catalogue chunked and indexed (mean embedding of all content chunks) in a ~2.4MB gzipped json with all-MiniLM-L6-v2. Enter any query and hit submit or enter. App loads ~27Mb of resources of data and scripts.
A basic semantic search app based on 834 entries from the [Copernicus Services Catalogue](https://www.copernicus.eu/en/accessing-data-where-and-how/copernicus-services-catalogue), chunked and indexed (mean embedding of all content chunks) in a ~2.4 MB gzipped JSON file with [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Enter any query and hit submit or enter. The app loads ~27 MB of data and scripts. The ML model runs entirely in the browser thanks to [transformers.js](https://github.com/xenova/transformers.js).
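
To illustrate the idea, here is a minimal Python sketch of how a query embedding could be ranked against the per-document mean embeddings. It assumes cosine similarity as the scoring function and uses hypothetical field names (`title`, `embedding`) and toy 3-dimensional vectors; the app itself runs in the browser and may work differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # avoid division by zero for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_entries(query_embedding, index, top_k=3):
    """Return the top_k (title, score) pairs, best match first."""
    scored = [
        (entry["title"], cosine_similarity(query_embedding, entry["embedding"]))
        for entry in index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy index with made-up entries and 3-dimensional vectors (assumption);
# real all-MiniLM-L6-v2 embeddings have 384 dimensions.
toy_index = [
    {"title": "Sea surface temperature", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Land cover map", "embedding": [0.0, 0.2, 0.9]},
    {"title": "Ocean colour", "embedding": [0.7, 0.3, 0.1]},
]

results = rank_entries([1.0, 0.0, 0.0], toy_index, top_k=2)
for title, score in results:
    print(f"{score:.3f}  {title}")
```

Because each document is represented by a single mean vector, one similarity computation per entry is enough to rank the whole catalogue.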

![](copernicus-services-semantic-search-interface-dark.png)

@@ -12,10 +12,20 @@ If you'd like to search within the result's content, consider installing the Chr

![](semantic-finder-results.png)

It's basically performing semantic search in the results yielded by semantic search, Inception-like!
It finds the sections most relevant to your query in the actual content of the results by performing semantic search on the fly.

## Data mining tutorial

The process of creating the data dump can be repeated with the included [Jupyter Notebook](copernicus_services_miner.ipynb). You can re-run the process for updates (if you do so, please open a pull request for this repo or write so I can keep the data dump updated). The current dump holds 834 entries from 21 October 2023.
The process of creating the data dump can be repeated with the included [Jupyter Notebook](copernicus_services_miner.ipynb). It covers the whole processing pipeline:
- data mining with requests and BeautifulSoup
- preprocessing in pandas
- chunking the document text into smaller paragraphs of the right size for the ML model
- creating embeddings for each chunk
- calculating the mean embedding for each document
- saving as gzipped JSON (small file size, and easy and fast to read in JS with pako.js)
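
The last four steps can be sketched roughly as follows. This is not the notebook's code: the word-based chunk size, the field names (`title`, `text`, `embedding`) and in particular `fake_embed` are assumptions for illustration. `fake_embed` is a deterministic placeholder so the example runs without downloading a model; in a real pipeline it would be replaced by the embedding model's encode step. The mining and pandas steps are omitted.

```python
import gzip
import hashlib
import json


def chunk_text(text, max_words=100):
    """Split text into chunks of at most max_words words (assumed chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fake_embed(chunk, dim=8):
    """Placeholder embedding: a deterministic pseudo-vector from a hash, NOT a real model."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def mean_embedding(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_index(documents, max_words=100):
    """Create one mean embedding per document from its chunk embeddings."""
    index = []
    for doc in documents:
        chunks = chunk_text(doc["text"], max_words)
        if not chunks:
            continue  # skip documents without any text
        vectors = [fake_embed(chunk) for chunk in chunks]
        index.append({"title": doc["title"], "embedding": mean_embedding(vectors)})
    return index


def save_index(index, path):
    """Write the index as gzipped JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(index, handle)


def load_index(path):
    """Read a gzipped JSON index back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Averaging the chunk vectors keeps the index at one vector per document, which keeps the file small at the cost of losing chunk-level detail.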

You can re-run the process for updates (if you do, please open a pull request for this repo or write to me so I can keep the data dump updated) or use other indexing models, such as the current [MTEB](https://huggingface.co/spaces/mteb/leaderboard) leaders from the bge or gte families. You could also use a multilingual model to support search queries in other languages. The current dump holds 834 entries from 21 October 2023.

![](copernicus-services-df.png)

If you like this project, ⭐ the repo or give a shoutout on social media!
