xianchen2/Text_Retrieval_BM25

Okapi BM25

A Python implementation of the Okapi BM25 ranking function for file retrieval.

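Given a query Q containing keywords q1, ..., qn, the BM25 score of a document D is, in its standard form,

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}$$

where f(qi, D) is the frequency of qi in D, |D| is the length of D in words, avgdl is the average document length in the collection, and k1 and b are free parameters. The IDF weight of the query term qi is computed as:

$$\text{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$

where N is the total number of documents in the collection and n(qi) is the number of documents containing qi.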
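The score and IDF weight described above can be sketched in a few lines of Python. This is an illustration, not the repository's code: the names `idf` and `bm25_score`, the defaults `k1=1.5` and `b=0.75`, and the choice of the classic IDF form `log((N - n + 0.5) / (n + 0.5))` are all assumptions.

```python
import math
from collections import Counter

def idf(term, docs):
    """Classic BM25 IDF (assumed variant): log((N - n + 0.5) / (n + 0.5))."""
    N = len(docs)                           # total number of documents
    n = sum(1 for d in docs if term in d)   # documents containing the term
    return math.log((N - n + 0.5) / (n + 0.5))

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Score one tokenized document against a tokenized query."""
    tf = Counter(doc)                                # term frequencies in doc
    avgdl = sum(len(d) for d in docs) / len(docs)    # average document length
    score = 0.0
    for q in query:
        f = tf[q]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q, docs) * f * (k1 + 1) / denom
    return score

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["okapi", "bm25", "ranking"],
    ["inverted", "index"],
    ["text", "retrieval"],
    ["bm25", "bm25", "scores"],
]
```

With this IDF form, a term that appears in more than half of the documents gets a negative weight, and a term that appears in exactly half gets a weight of zero. Some implementations clamp or shift the IDF to avoid this.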

Implementation

There are two main modules:

QueryParser parses the query to produce a list.
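As a rough sketch of what such a parsing step might look like, the function below lowercases a query string and splits it into alphanumeric tokens. The name `parse_query` and the tokenization rule are assumptions for illustration; the actual QueryParser may read queries differently.

```python
import re

def parse_query(query):
    """Lowercase the query and return its alphanumeric tokens as a list."""
    return re.findall(r"[a-z0-9]+", query.lower())

terms = parse_query("Okapi BM25, ranking!")
```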

BuildIndex builds an inverted index and computes the score of each document according to the BM25 ranking function. It provides the following functions:

  • process_files: processes the corpus files to produce a dictionary
  • index_one_file & regular_index: map each word to its positions in the corresponding document
  • inverted_index: returns a dictionary keyed by word, where each value is another dictionary that maps a filename to the word's positions in that file
  • inverse_df: returns a dictionary that maps each word to its IDF
  • docLen and avgdocl: compute the length of each document and the average document length in the text collection, respectively
  • BM25scores: returns the BM25 scores of the documents
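The index structure described in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the repository's code: the helper names `index_one_doc` and `build_inverted_index`, the in-memory `corpus` dictionary, and the pre-tokenized input are all hypothetical.

```python
from collections import defaultdict

def index_one_doc(tokens):
    """Map each word to the list of positions where it occurs in one document."""
    positions = defaultdict(list)
    for pos, word in enumerate(tokens):
        positions[word].append(pos)
    return dict(positions)

def build_inverted_index(corpus):
    """Turn {filename: [tokens]} into {word: {filename: [positions]}}."""
    index = defaultdict(dict)
    for filename, tokens in corpus.items():
        for word, pos_list in index_one_doc(tokens).items():
            index[word][filename] = pos_list
    return dict(index)

# Hypothetical corpus: filenames mapped to already-tokenized text.
corpus = {
    "a.txt": ["bm25", "ranks", "documents"],
    "b.txt": ["an", "inverted", "index", "maps", "words", "to", "documents"],
}

index = build_inverted_index(corpus)

# Document lengths and the average length over the collection.
doc_len = {name: len(tokens) for name, tokens in corpus.items()}
avg_doc_len = sum(doc_len.values()) / len(doc_len)
```

The quantities that BM25 needs fall out of this structure directly: the term frequency of a word in a file is `len(index[word][filename])`, and its document frequency is `len(index[word])`.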
