Skip to content

This repository includes analysis and models for distinguishing "Scam" from "Non-scam" messages. The data, originally extracted from images via OCR for the ScamGuard project, is used here to apply sentiment analysis and machine learning techniques to improve scam detection.

License

Notifications You must be signed in to change notification settings

HiroshiJoe/Philippine-Scam-Analysis-ML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Scam Detection Using Text Analysis and Machine Learning

Overview

This repository contains the analysis and models used to differentiate between "Scam" and "Non-scam" messages. The data was initially extracted from images using OCR (Optical Character Recognition) as part of the ScamGuard project, which focused on developing an image-based identification app for phishing and scam messages. This research builds upon the previous work by exploring sentiment analysis and machine learning models to enhance scam detection capabilities.

Table of Contents

Project Description

This project aims to analyze and classify messages as "Scam" or "Non-scam" using various text and sentiment analysis techniques. We evaluated different machine learning models, including logistic regression and XGBoost, to determine the most effective approach for scam detection.

Methodology

Data Preprocessing

  1. Text Vectorization: Utilized TF-IDF vectorization to convert text into numerical features.
  2. Feature Engineering: Added features such as sentiment scores and text length to the model.
  3. Model Training: Tested various models, including logistic regression and XGBoost, to find the most accurate classifier.

Evaluation Metrics

  • Accuracy
  • Precision, Recall, and F1-Score
  • Confusion Matrix
  • Histograms of Sentiment Scores

Results

Model Performance

  • XGBoost Model: Achieved the highest accuracy in classifying scam and non-scam messages.
    • Accuracy: 93.75%

Sentiment Analysis

  • Scam Messages: Mean sentiment score of 0.18.
  • Non-scam Messages: Mean sentiment score of 0.16.

Word Frequency Analysis

  • Scam Messages: Top words include "free," "money," "link," etc.
  • Non-scam Messages: Top words include "lazada," "peso," "sale," etc.

Research Limitations

  • Dataset Constraints: The dataset derived from images and OCR preprocessing may contain inaccuracies or artifacts introduced during text extraction, potentially affecting the accuracy of the sentiment analysis and model performance.
  • Model Generalization: The study focused on text extracted from specific types of images (scam-related messages), which may limit the generalizability of the findings to other forms of text data or different types of scams not covered in this dataset.
  • Regional Specificity: The data used in this study is Philippine-based, which might not be representative of scam messages in other regions. The regional context and language nuances may impact the applicability of the results to other geographic locations or cultural contexts.

Recommendations

  1. Enhanced Filtering Techniques: Implement advanced filters and update them regularly to detect and filter out scam messages based on key terms and sentiment patterns.
  2. User Education: Educate users about common scam tactics and red flags through awareness campaigns and best practice guidelines.
  3. Sentiment and Emotional Analysis Integration: Utilize sentiment and emotion-based classification as part of the screening process for scam detection.
  4. Local Language Considerations: Adapt filters for local languages and collaborate with local experts to refine detection techniques.

How to Run the Code

  1. Clone the repository:
    git clone https://github.com/your-username/repository-name.git
  2. Navigate to the project directory:
    cd repository-name
  3. Install the required dependencies:
    pip install -r requirements.txt
  4. Run the analysis script:
    python analyze.py

Data Access

The dataset used in this research is private and intended for research purposes only. To obtain access to the data, please contact Heroshi Joe Abejuela directly at joeheroshi2807@gmail.com.

License

See the LICENSE file for details.

Acknowledgments

  • Heroshi Joe Abejuela: Researcher and author of this study.
  • ScamGuard Project: Previous work on image-based identification app for phishing.

Contact Information

For any inquiries or further information, please contact:

About

This repository includes analysis and models for distinguishing "Scam" from "Non-scam" messages. The data, originally extracted from images via OCR for the ScamGuard project, is used here to apply sentiment analysis and machine learning techniques to improve scam detection.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published