---
title: Week 6
author: Shreya Gautam
tags: [gsoc24, pipeline]
---

<!--
SPDX-License-Identifier: CC-BY-SA-4.0
SPDX-FileCopyrightText: 2024 Shreya Gautam <gautamm.shreya@gmail.com>
-->

# WEEK 6
*(July 4, 2024)*

## Attendees:
- [Kaushlendra Pratap](https://github.com/Kaushl2208)
- [Gaurav Mishra](https://github.com/GMishx)
- [Avinal Kumar](https://github.com/avinal)

## Discussion:
1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns:

| original_content | original_is_enabled | edited_content | modified_is_enabled |
|-----------------------|---------------------|-----------------------|---------------------|


You can find the updated script [here](https://github.com/ShreyaGautamm/gsoc_24/blob/895ac5814097386f816d9ae703034cbe60244819/files/copyrights_script_v2.py); a rough sketch of the approach follows.
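For illustration, here is a minimal sketch of what the fetch-and-export step could look like, assuming a PostgreSQL backend and hypothetical table and column names (`copyright`, `copyright_event`, etc.); the linked `copyrights_script_v2.py` is the authoritative version.

```python
# Hedged sketch: connection parameters, table names, and column names are
# placeholders, not the actual FOSSology schema used in the real script.
import csv
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="fossology",
                        user="fossy", password="fossy")  # placeholder credentials

QUERY = """
SELECT c.content     AS original_content,
       c.is_enabled  AS original_is_enabled,
       ce.content    AS edited_content,
       ce.is_enabled AS modified_is_enabled
FROM copyright c
LEFT JOIN copyright_event ce ON ce.copyright_fk = c.copyright_pk;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    with open("copyrights.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["original_content", "original_is_enabled",
                         "edited_content", "modified_is_enabled"])
        writer.writerows(cur.fetchall())
```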

2. Worked on automating the model training process: once 500 new entries accumulate in the database, the Safaa model should be retrained. I explored GitHub Actions and wrote a YAML workflow that checks the number of new entries and triggers the retraining script when the threshold is met. However, GitHub Actions could not connect to the locally hosted database, so I consulted the mentors; they suggested instead triggering retraining whenever a new copyright file is uploaded to the repository. This task will continue in the coming week, and updates will be provided in the following meeting. A rough sketch of the threshold check appears below.
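The sketch below shows the intended check, assuming it runs somewhere with database access (which is exactly what GitHub Actions lacked); the table name and counter file are stand-ins, not the final design.

```python
# Hedged sketch of the retraining-threshold check; all names are placeholders.
import psycopg2

RETRAIN_THRESHOLD = 500
STATE_FILE = "last_trained_count.txt"  # hypothetical bookkeeping file

conn = psycopg2.connect(host="localhost", dbname="fossology",
                        user="fossy", password="fossy")  # placeholder credentials
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM copyright;")  # placeholder table
    (current_count,) = cur.fetchone()

try:
    with open(STATE_FILE) as f:
        last_count = int(f.read().strip())
except FileNotFoundError:
    last_count = 0  # first run: everything counts as new

if current_count - last_count >= RETRAIN_THRESHOLD:
    print("retrain")  # the workflow would invoke the retraining script here
    with open(STATE_FILE, "w") as f:
        f.write(str(current_count))
else:
    print("skip")
```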

3. Explored incremental learning in Safaa. Currently, Safaa uses scikit-learn's SVM implementation, which retrains from scratch on every update. Since that implementation does not support incremental updates, I switched to scikit-learn's SGDClassifier, which does via `partial_fit`. I calculated its metric reports and found its results comparable to the SVM's. As per the mentors' suggestion, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SVM.ipynb), for SGD [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/model_implementations/copyright_classification_SGD.ipynb), and the comparison between them below, followed by a small `partial_fit` sketch. The dataset used for the implementation can be found [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/datasets/fossology-master.csv).

The results are as follows:

<div style={{ display: 'flex', justifyContent: 'space-between' }}>

<div style={{ flex: '0 0 48%' }}>

**SGD Classifier:**

| | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| 0 | 0.99 | 0.99 | 0.99 | 2878 |
| 1 | 0.96 | 0.98 | 0.97 | 1016 |

</div>

<div style={{ flex: '0 0 48%' }}>

**SVM Classifier:**

| | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| 0 | 0.99 | 0.99 | 0.99 | 2878 |
| 1 | 0.96 | 0.97 | 0.97 | 1016 |

</div>

</div>
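To make the SVM-to-SGD switch concrete, here is a minimal incremental-training sketch using scikit-learn's `SGDClassifier.partial_fit`; the vectorizer, hyperparameters, and label semantics are illustrative, not the exact Safaa configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# HashingVectorizer is stateless, so it can vectorize batches that arrive
# later without refitting a vocabulary.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
# hinge loss makes SGDClassifier a linear SVM trained with SGD, which is
# why its metrics track the SVM's so closely.
clf = SGDClassifier(loss="hinge")

CLASSES = np.array([0, 1])  # the two labels from the report above

def train_on_batch(texts, labels):
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=CLASSES)  # classes required on first call

# Initial batch (toy examples; label semantics assumed for illustration):
train_on_batch(["Copyright (c) 2024 Example Corp", "no copyright text here"], [1, 0])
# Later, once enough new labelled entries accumulate, update in place
# instead of retraining from scratch:
train_on_batch(["(C) 2023 Jane Doe"], [1])
print(clf.predict(vectorizer.transform(["Copyright 2022 ACME Inc."])))
```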

4. Started working on creating a Python library for the NER-POS tagging task. I experimented with the Stanford NER Tagger; you can find my work [here](https://github.com/ShreyaGautamm/gsoc_24/blob/33917177a876562cc4d2f7c308f7e2dbe03cd4c3/files/ner/Stanford_Tagger.ipynb). However, I plan to explore fine-tuning BERT or GPT for this task in the coming weeks. A short tagging sketch follows below.
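For reference, a small sketch of the Stanford tagger experiment via NLTK's wrapper; the jar and model paths are local assumptions (both must be downloaded from the Stanford NER site, and Java must be installed).

```python
from nltk.tag import StanfordNERTagger

# Paths below are placeholders for locally downloaded Stanford NER artifacts.
tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # pretrained 3-class CRF model
    "stanford-ner.jar",                       # Stanford NER jar (needs Java)
    encoding="utf-8",
)

text = "Copyright (c) 2024 Jane Doe, Example Corporation, Boston"
tokens = text.split()  # naive tokenization, fine for a quick experiment
print(tagger.tag(tokens))
# e.g. [('Copyright', 'O'), ..., ('Jane', 'PERSON'), ('Doe,', 'PERSON'), ...]
```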

## Subsequent Steps
* Address the GitHub Actions issue by triggering retraining when a new copyright file is uploaded to the repository. Implement everything locally first, then write the final YAML workflow to try it out on GitHub Actions.
* Explore and implement the fine-tuning of BERT or GPT for the NER-POS tagging task.
