
In this project, we use a LabNbook (https://labnbook.fr) dataset to create collaborative writing indicators.


anismhaddouche/Indicators


How to run the project?

In the following, we suppose that the LabNbook database and the versioning files (of the form id_report.gzip) are available on your local machine. Note that the project can also be run without the Prefect orchestrator, by removing the Prefect tags (@task, @flow and @logger) in all Python flow files (such as flow_0.py) and ignoring steps 1 and 2 below; a minimal sketch of these tags is given after the overview. Here is a Prefect overview of all flows executed on a subset of missions.

A Prefect Overview
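As an illustration of these tags, here is a minimal sketch with hypothetical task and flow names (not the project's actual code); deleting the @task/@flow lines and the logger call leaves plain Python functions that run without Prefect:

    from prefect import flow, task, get_run_logger

    @task
    def extract_text(mission_id: str) -> str:
        # Placeholder body; a real task would query the LabNbook database.
        return f"text of mission {mission_id}"

    @flow
    def flow_0(mission_id: str = "1376"):
        logger = get_run_logger()
        text = extract_text(mission_id)
        logger.info("Extracted %d characters", len(text))
        return text

    if __name__ == "__main__":
        flow_0()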

  1. Create a Prefect account following this link.

  2. Configure Prefect cloud following this link.

  3. Run the following command in your terminal to clone this git repository:

     git clone https://github.com/anismhaddouche/Indicators.git
    
  4. Paste the versioning folder into the data folder.

  5. Create a virtual environment with conda. If conda is not installed, follow this link.

     conda env create -f python_env.yml
    
  6. Modify these sections in the pyproject.toml file (a sketch showing how they can be read from Python is given after this list):

    1. database: to connect to the database (which is assumed to be installed on your machine):

            user = "your_user"
            password = "your_password"
            host = "localhost"
            database_name  = "your_database_name"
      
    2. missions: choose whether to run the project on all missions or only on a subset.

            all = false # true to take all missions in the versioning folder
            subset = ["1376","453","1559","1694","556","534","1640","1694","451","1237","533","647"]
      
  7. You have two options for running all flows:

    1. Local run with Prefect: open a terminal, navigate to the Indicators repository, and run the following commands:

       conda activate ml 
       python scripts/run_flows.py
      
    2. Cloud or local run with Prefect UI:

      1. Run these commands in your terminal:

             prefect server start
             prefect deployment build scripts/run_flows.py:run_flows -n "labnbook" && prefect deployment apply run_flows-deployment.yaml && prefect agent start -q default
      2. Open the Prefect UI (cloud or local) and click RUN in the Deployment menu
  8. To get some reports, run this command:

     streamlit run scripts/dashboard.py
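
For reference, here is a minimal sketch of how the [database] and [missions] sections configured in step 6 could be read from Python with the standard tomllib module (Python >= 3.11); the exact loading code used by the flows may differ:

    import tomllib  # on Python < 3.11, use the third-party tomli package instead

    with open("pyproject.toml", "rb") as f:
        config = tomllib.load(f)

    db = config["database"]        # user, password, host, database_name
    missions = config["missions"]  # all (bool) and subset (list of mission ids)

    # Select either every mission in the versioning folder or only the subset.
    mission_ids = None if missions["all"] else missions["subset"]
    print(db["host"], mission_ids)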
    

Flows description

Here we describe the Python scripts (flows) located in the scripts folder.


flow_0.py

The purpose of this flow is to connect to the previously installed LabNbook database and to prepare the LabDocs for the next flow, which computes the contribution matrices.

Task 1: extract_text_init

  • Dependencies
    • The dictionary [database] in the project.toml file.
  • Returns
    • The file data/tmp/0_labdocs_texts_init.json.gz

Task 2: extract_text

  • Dependencies:
    • The dictionary [regex_text_patterns] in the project.toml file.
  • Returns:
    • The folder data/tmp/0_missions_texts
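
The artifacts of this flow are gzip-compressed JSON data. A minimal sketch for inspecting the Task 1 output, assuming only that the file holds a single JSON value (the exact schema is defined by the flow):

    import gzip
    import json

    # Output of Task 1: a gzip-compressed JSON file.
    with gzip.open("data/tmp/0_labdocs_texts_init.json.gz", "rt", encoding="utf-8") as f:
        labdocs = json.load(f)

    print(type(labdocs), len(labdocs))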

flow_1.py

The purpose of this flow is to compute contribution matrices and some variables that describe LabDocs, such as the number of tokens, segments, etc.

Task 1: contrib_and_segmentation

  • Dependencies
    • The folder data/tmp/0_missions_texts
    • The NLP model given in the config section [nlp][spacy_model] of the project.toml file
  • Returns
    • The folder data/tmp/1_missions_contrib
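
To illustrate the segmentation part of this task, here is a minimal spaCy sketch; the model name is an assumption used for the example (the real one comes from [nlp][spacy_model] in project.toml):

    import spacy

    # Hypothetical model; the flow reads the actual name from its config.
    nlp = spacy.load("fr_core_news_sm")

    doc = nlp("Première phrase du LabDoc. Deuxième phrase.")
    segments = [sent.text for sent in doc.sents]  # sentence segmentation
    n_tokens = len(doc)                           # token count

    print(n_tokens, segments)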

flow_2.py

The purpose of this flow is to compute all indicators.

Task 1: nonsemantic_indicator

  • Dependencies
    • The dictionary [missions] in the project.toml file
  • Returns
    • The file data/tmp/2_collab.json.gz

Task 2: semantic_indicators

  • Dependencies
    • The two dictionaries [nlp][model] and [missions] in the project.toml file
    • The NLP model in the config section [config_nlp] of the project.toml file
  • Returns
    • The file data/tmp/reports/2_semantic.json

flow_3.py

The purpose of this flow is to generate some reports.

Task 1: summary_nonsemantic_indicators_csv

  • Dependencies
    • The file data/tmp/2_collab.json.gz
  • Returns
    • The file data/tmp/reports/3_summary_nonsemantic_indicators.csv and its corresponding Pandas DataFrame df_nonsemantic

Task 2: semantic_indicator_csv

  • Dependencies
    • The file data/tmp/reports/2_semantic.json
  • Returns
    • The file data/tmp/reports/3_summary_semantic_indicator.csv and its corresponding Pandas DataFrame df_semantic

Task 3: get_times

  • Dependencies
    • The Pandas DataFrames df_nonsemantic and df_semantic
  • Returns
    • The file data/tmp/reports/3_times.csv
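
As a rough sketch of how such a reporting task can combine the two DataFrames (the join key below is hypothetical; the real schemas are defined by the previous tasks):

    import pandas as pd

    df_nonsemantic = pd.read_csv("data/tmp/reports/3_summary_nonsemantic_indicators.csv")
    df_semantic = pd.read_csv("data/tmp/reports/3_summary_semantic_indicator.csv")

    # Hypothetical join key "id_labdoc"; the actual key depends on the CSVs.
    times = df_nonsemantic.merge(df_semantic, on="id_labdoc", how="inner")
    times.to_csv("data/tmp/reports/3_times.csv", index=False)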

How to improve this work?

Besides improvements to the quality of the Python code, I propose improving the all-MiniLM-L6-v2 model used in the semantic_indicator task of flow_2.py. To this end, some suggestions are given below.


Improve the all-MiniLM-L6-v2 NLP model

What does this model do?

As mentioned before, this model is used in the task semantic_indicator of flow_2.py. To get an idea of how it is used, suppose that we have a LabDoc that evolves from a version $v_1$ to a version $v_2$, where these versions may be written by the same author or by two different authors. The model takes these two versions as input and outputs a score in $[0,1]$. A value of $0$ means that the semantic contents of $v_1$ and $v_2$ are completely different, while $1$ means that they are identical. Thus, this model is used to evaluate the semantic evolution of a LabDoc over its versions, and the results are saved in the file data/tmp/2_semantic.json.

It is worth noting that this model is applied sequentially between consecutive LabDoc versions. For instance, given $v_1$, $v_2$ and $v_3$, the results are of the form

  • $similarity(v_1,v_1) = s_1 =1$
  • $similarity(v_1,v_2) = s_2$
  • $similarity(v_2,v_3) = s_3$

where, for $i=1,2,3$, the scores $s_i \in [0,1]$.

As a concrete example, here is the output for LabDoc 340270, which is a dictionary of the form {"id_labdoc": {"id_trace": ["id_user", score]}} saved in the file data/tmp/2_semantic.json.

"340270": {"5866822": ["10893", 1], "5869856": ["10917", 0.57]}, "340978": {"5885737": ["10893", 1]}

Note that the first score always equals $1$, since it is computed between the first version and itself ($similarity(v_1,v_1) = s_1 = 1$); it is kept only for implementation purposes.
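
A minimal sketch for reading these scores back, assuming the {"id_labdoc": {"id_trace": ["id_user", score]}} structure shown above:

    import json

    with open("data/tmp/2_semantic.json") as f:
        semantic = json.load(f)

    for id_labdoc, traces in semantic.items():
        for id_trace, (id_user, score) in traces.items():
            print(f"labdoc {id_labdoc}, trace {id_trace}: "
                  f"user {id_user}, similarity {score}")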

How does it work?

To compare the similarity between two versions of the same LabDoc, the process proceeds in two steps (see Figure 2 below).

  • The first step involves computing a vector of numbers in $R^p$ (a tensor) for each version, denoted as $v_1$ and $v_2$, respectively. This is known as the embedding step in natural language processing (NLP).

  • Then, we calculate the cosine similarity between these two vectors, $similarity(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$; a sketch is given below. You can refer to the Python script flow_2.py, from line 104 to line 123, to see how this calculation is performed.

    Figure 2
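
Here is a minimal sketch of these two steps with the sentence-transformers library (the two version texts are illustrative; the project's actual implementation is in flow_2.py):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Two illustrative versions of the same (hypothetical) LabDoc.
    v1 = "The reaction speeds up when the temperature increases."
    v2 = "Increasing the temperature makes the reaction faster."

    # Step 1: embedding -- one vector in R^p per version (p = 384 here).
    emb1, emb2 = model.encode([v1, v2])

    # Step 2: cosine similarity, (v1 . v2) / (||v1|| * ||v2||).
    score = float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
    print(round(score, 2))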

How to improve this model?

The objective is to improve the semantic interpretation of LabDocs by the NLP model all-MiniLM-L6-v2, by improving its embedding. Note that in this project I used this model for its implementation simplicity, in order to have a first draft. It is not well adapted to our dataset, since LabDocs contain many mathematical formulas. For future work, I suggest using a better-adapted model such as MathBERT, which is trained on scientific texts containing mathematical formulas.

To improve the embedding of our NLP model, we have to train (fine-tune) the pre-trained model on a task using our set of LabDocs. A well-adapted task here is Masked Language Modeling (MLM), an unsupervised learning technique that involves masking tokens in a text sequence and training the model to predict the missing tokens. This yields an improved embedding that better captures the semantics of the text (see this tutorial).
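
As a starting point, here is a minimal MLM fine-tuning sketch with the Hugging Face transformers and datasets libraries. The corpus, output path and hyperparameters are assumptions, and note that the MLM head of the all-MiniLM-L6-v2 checkpoint is freshly initialized (transformers will warn about this):

    from datasets import Dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    checkpoint = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)

    # Hypothetical corpus: raw LabDoc texts extracted from the dataset.
    labdoc_texts = ["Premier LabDoc ...", "Deuxième LabDoc ..."]
    dataset = Dataset.from_dict({"text": labdoc_texts})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=256)

    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

    # The collator randomly masks 15% of the tokens; the model is trained
    # to predict them, which adapts the embeddings to the LabDoc corpus.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="minilm-labdocs-mlm",
                               num_train_epochs=1,
                               per_device_train_batch_size=16),
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model("minilm-labdocs-mlm")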
