This is a learning project: we had eight days to build a demo application to present at KPMG.
The project was developed for KPMG International Limited, a multinational professional services network and one of the Big Four accounting organizations. It is a natural language processing (NLP) project that aims to extract relevant information from documents and present it to the client in a structured way, so the extracted information can be incorporated into the available platforms.
Preprocessing
First we preprocess the files with a number of checks, because some documents are divided by language either vertically or horizontally. We then feed the documents to Google Document AI, which produces a dataframe and .txt files split per language.
The resulting files are checked with language detection, split if needed, and then moved to the /processed_data folder.
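A minimal sketch of that per-file language check, assuming the pretrained fastText language-identification model (lid.176.bin) and illustrative helper names; the real scripts may differ:

```python
# Hedged sketch of the language check on an extracted .txt file.
# The model path and helper name are assumptions for illustration.
import fasttext

lid_model = fasttext.load_model("lid.176.bin")  # pretrained language-ID model from fasttext.cc

def detect_language(text: str) -> str:
    """Return the ISO code fastText predicts for a piece of text, e.g. 'nl' or 'fr'."""
    # fastText's predict() does not accept newlines, so flatten the text first.
    labels, _probs = lid_model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")
```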
Training
On this data we applied clustering and fine-tuned RobBERT, a pretrained Dutch RoBERTa-based language model from KU Leuven.
Application
A webserver runs with an input form, where we upload a new CLA file in .pdf format that would normally be obtained from a feed or a similar source.
For the presentation demo we also run a Telegram bot that gives live processing updates and real-time notifications on Telegram.
Live-Processing
A file uploaded on the webserver is fed to process.py, which runs modified functions from the preprocessing step sequentially.
- For language detection we use fastText
- For summarization we use MBartForConditionalGeneration (see the sketch after this list)

*To be implemented:*
- For topic detection we will use Rake-NLTK
- For target labeling we will use our model fine-tuned from pretrained RobBERT
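As an illustration of the summarization step, a minimal sketch with Hugging Face transformers; the checkpoint name, source language and generation parameters are assumptions, not the exact settings in process.py:

```python
# Hedged sketch of abstractive summarization with MBartForConditionalGeneration.
# Checkpoint and generation parameters are illustrative assumptions.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

CHECKPOINT = "facebook/mbart-large-50"  # assumed multilingual checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(CHECKPOINT, src_lang="nl_NL")
model = MBartForConditionalGeneration.from_pretrained(CHECKPOINT)

def summarize(text: str, max_summary_tokens: int = 150) -> str:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(
        **inputs,
        num_beams=4,
        max_length=max_summary_tokens,
        early_stopping=True,
        forced_bos_token_id=tokenizer.lang_code_to_id["nl_NL"],  # keep the summary in Dutch
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```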
- CLA_meta_from_2018.csv as provided by KPMG
- CLA documents as published, in .pdf format
1. For handling the files:
- fix_languages : Checks the language of all extracted text files and switches the NL and FR versions when needed.
- move_unprocessed_files : Moves files that have not been processed yet.
- process_docs_cloud : Processes files with Google Document AI.
- split_files : Detects whether a file needs to be split vertically, based on the mean and maximum sentence length in the file.
- split_max_page_10 : Splits .pdf documents with more than 10 pages for Google Document AI (max page limit = 10); see the sketch after this list.
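A minimal sketch of that page split, assuming PyPDF2 and illustrative output naming; the actual script may be organized differently:

```python
# Hedged sketch of splitting a PDF into chunks of at most 10 pages for Document AI.
# The PyPDF2 dependency and output naming are assumptions.
from typing import List
from PyPDF2 import PdfReader, PdfWriter

def split_max_page_10(pdf_path: str, chunk_size: int = 10) -> List[str]:
    reader = PdfReader(pdf_path)
    out_paths = []
    for start in range(0, len(reader.pages), chunk_size):
        writer = PdfWriter()
        for i in range(start, min(start + chunk_size, len(reader.pages))):
            writer.add_page(reader.pages[i])
        out_path = f"{pdf_path[:-4]}_part{start // chunk_size + 1}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        out_paths.append(out_path)
    return out_paths
```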
2. For making targets for the model:
- make_targets_from_metadata.ipynb: Extracts keywords from metadata with Rake-NLTK
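A minimal sketch of that keyword extraction, assuming NLTK's Dutch stopwords (after nltk.download('stopwords') and nltk.download('punkt')) and the degree-to-frequency ranking metric described further below:

```python
# Hedged sketch of keyword extraction with Rake-NLTK for building targets.
# Language, stopwords and the number of phrases kept are assumptions.
from rake_nltk import Metric, Rake

rake = Rake(
    language="dutch",                                  # uses NLTK's Dutch stopwords
    ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO,   # degree-to-frequency ratio ranking
)

def extract_keywords(text: str, top_n: int = 10) -> list:
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases()[:top_n]
```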
3. For checking the output of text processing:
- visual_inspection : File for quick visual inspection of the extracted text files.
- (Deprecated) model : Makes a model for classification of documents. Works with BERT and a pretrained Dutch RobBERT TensorFlow model.
- model_h_robberta_clusters : Makes the targets from 100 clusters
- model_h_robberta_clusters_RUN : Runs the model on a .csv file with condensed text and cluster targets
Callable function
process_cloud.py : Function to process a .pdf file with Google Document AI. Outputs .txt files (a hedged sketch of the Document AI call follows the list below) and relies on the following files:
- split_text_horizontally : Function to process a .pdf file that has not been split with Document AI. Detects the language per paragraph and writes the output to NL and FR .txt files
- split_pdf_vertically : Function to detect whether an input .pdf has to be split. Detection method as explained above
- split_max_10_pages : Function to detect whether an input .pdf has to be split into pages for Document AI (max page limit = 10).
- get_abstr_summary : Function to produce an abstractive summary from an input text. Uses summi
- combine_txt_files : Function to combine the text files that were obtained after the split_max_10_pages split
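A minimal sketch of the synchronous Document AI call inside process_cloud.py, assuming the v1 client library and placeholder project, location and processor IDs:

```python
# Hedged sketch of the synchronous (online) Document AI request.
# Project ID, location and processor ID are placeholders, not real values.
from google.cloud import documentai_v1 as documentai

def process_pdf(pdf_path: str, project_id: str, location: str, processor_id: str) -> str:
    client = documentai.DocumentProcessorServiceClient()
    name = client.processor_path(project_id, location, processor_id)

    with open(pdf_path, "rb") as fh:
        raw_document = documentai.RawDocument(content=fh.read(), mime_type="application/pdf")

    # Online processing only accepts small documents, hence the 10-page split above.
    result = client.process_document(
        request=documentai.ProcessRequest(name=name, raw_document=raw_document)
    )
    return result.document.text  # full extracted text, later written to .txt files
```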
- server : The Flask server to run for hosting the demo webpage (a minimal sketch follows this list)
- templates : Templates for the webserver
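As a rough illustration of how the demo server could be wired up (route names, template names and the temporary save path are assumptions, not the actual server.py):

```python
# Hedged sketch of the demo Flask server with a file-upload form.
# Routes, templates and the save location are illustrative assumptions.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def upload():
    if request.method == "POST":
        uploaded = request.files["file"]              # the CLA .pdf from the form
        uploaded.save(f"/tmp/{uploaded.filename}")
        # here process.py would run the preprocessing functions on the upload
        return render_template("result.html", filename=uploaded.filename)
    return render_template("index.html")              # page with the file input form

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)
```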
- Python 3.7.9
- CUDA and cuDNN
- Google Cloud account
- Create and activate a virtual env
- Install the necessary libraries from the requirements.txt file in the main folder
- Create a Telegram webhook from the command line: `ngrok http 80`
- Get the forwarding URL from the ngrok dashboard
- Run ./app/server.py
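The webhook registration itself can be done with a single Bot API call; a minimal sketch, assuming a placeholder bot token, ngrok URL and webhook path:

```python
# Hedged sketch of pointing the Telegram webhook at the ngrok forwarding URL.
# Token, URL and the '/telegram' path are placeholders.
import requests

BOT_TOKEN = "<your-bot-token>"
NGROK_URL = "https://<your-ngrok-subdomain>.ngrok.io"

resp = requests.get(
    f"https://api.telegram.org/bot{BOT_TOKEN}/setWebhook",
    params={"url": f"{NGROK_URL}/telegram"},
)
print(resp.json())  # {'ok': True, ...} once the webhook is registered
```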
- Preprocessing is done in sequential order and with Google Document AI (synchronous)
- The model is based on clusters
- The text from the preprocessing output is processed with Rake-NLTK using the degree-to-frequency-ratio ranking metric to extract the most important phrases and keywords, while filtering out a small set of words such as 'month', 'year' and verbs (to be improved!)
- From these condensed texts we make 100 clusters with KMeans (see the sketch after this list)
- We train the model on a dataset of the processed texts with these 100 clusters as targets
  - Model = robbert-v2-dutch-base
  - Optimizer = AdamW
  - Tokenizer = RobertaTokenizer
  - Loss = CategoricalCrossentropy
- Training results are stored in /model/output
- The model is saved in /model
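A minimal sketch of building the cluster targets, assuming TF-IDF features, a fixed random_state and illustrative CSV names (the real notebooks may differ):

```python
# Hedged sketch of producing the 100 KMeans cluster targets from the condensed texts.
# Feature extraction, file names and random_state are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("condensed_texts.csv")              # assumed column: 'condensed_text'

vectorizer = TfidfVectorizer(max_features=5000)
features = vectorizer.fit_transform(df["condensed_text"])

# A fixed random_state keeps the 100 cluster labels reproducible between runs.
kmeans = KMeans(n_clusters=100, random_state=42)
df["cluster"] = kmeans.fit_predict(features)

df.to_csv("texts_with_cluster_targets.csv", index=False)  # input for the RobBERT fine-tuning
```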
- The preprocessing works well and we were able to preprocess all documents.
- Google Document AI OCR works well on .pdf files with a single language, so we either feed it single-language documents or check each paragraph with fastText language detection
- The classification model is based on RobBERT, a Dutch RoBERTa-based model from KU Leuven, and targets 100 different clusters.
  - We reach a classification accuracy of about 67% on the validation set (15% of the data)
  - The figure shows the results of a model trained on processed text with a max length of 368, for 12 epochs with batch size 4, CategoricalCrossentropy loss and the AdamW optimizer
- For the client app we made a simple tool to deliver results via Telegram. The results of processing a file are posted to a Telegram bot channel (a minimal sketch follows below).
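A minimal sketch of that Telegram notification, assuming a placeholder bot token and channel ID (the actual bot code may differ):

```python
# Hedged sketch of posting a processing result to the Telegram bot channel.
# Bot token and chat/channel ID are placeholders.
import requests

def notify_telegram(bot_token: str, chat_id: str, text: str) -> None:
    requests.post(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data={"chat_id": chat_id, "text": text},
    )

# Example with a hypothetical channel, after process.py finishes on a new CLA file:
# notify_telegram(BOT_TOKEN, "@cla_results", "Summary of new_cla.pdf: ...")
```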
There is room for many improvements:
- Language detection could perform better by feeding all documents, including the vertically split files, through Document AI OCR and then checking the language of each paragraph
- We need to find a way to remove 'Erratum' blocks
- Perhaps LDA can help us to define the topic of a document better
- Clusters should not be randomly generated for later use (simple improvement)
- Fine-tuning the extracted-text processing method might improve model results drastically
- Remove many more irrelevant words and word categories (verbs, time, ...)
- Include repetitions (to be tested)