
prajwal-rp / plagiarism-detector


This project builds a plagiarism detector that examines a text file and performs binary classification, labeling the file as either plagiarized or not depending on how similar it is to a provided source text.

Jupyter Notebook 100.00%
machine-learning natural-language-processing plagiarism-checker plagiarism-detection feature-engineering hyperparameter-tuning longest-common-subsequence python3 random-forest-classifier randomsearch-cv


Plagiarism-Detector

Detecting plagiarism is an active area of research.


About the data

  • The dataset contains 100 files.
  • There are 5 unique tasks.
  • Unique tasks: ['a' 'b' 'c' 'd' 'e']
  • There are 5 plagiarism categories.
  • Unique categories: ['non' 'cut' 'light' 'heavy' 'orig']

  • orig refers to the source text for each task (a, b, c, d, e); each answer is compared against this source file (a Wikipedia source file).
  • non means that the file is not plagiarised.
  • The other three categories, cut > light > heavy, indicate that the answer is plagiarised.
  • cut indicates copy-and-paste plagiarism.
  • light indicates that the answer includes some copying and paraphrasing from the source.
  • heavy indicates that the answer is taken from the source with some of the words and the structure changed (a challenging type of plagiarism that is hard to detect).

[Figure: plagiarism class distribution]
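Since the detector is a binary classifier, the five categories have to be collapsed into a plagiarised / not plagiarised label. The following is a minimal sketch of one way to do that. The function name `to_binary_label` and the numeric codes (1, 0, and -1 for source files) are assumptions made for illustration and are not taken from the notebook.

```python
# Category names come from the dataset description above.
# The numeric codes are an assumption made for this sketch.
PLAGIARISED = {"cut", "light", "heavy"}


def to_binary_label(category: str) -> int:
    """Return 1 for plagiarised, 0 for not plagiarised, -1 for a source text."""
    if category == "orig":
        return -1  # source files are references, not examples to classify
    if category == "non":
        return 0
    if category in PLAGIARISED:
        return 1
    raise ValueError(f"unknown category: {category!r}")


labels = [to_binary_label(c) for c in ["non", "cut", "light", "heavy", "orig"]]
print(labels)  # [0, 1, 1, 1, -1]
```

Giving the orig files their own code makes it easy to exclude them from the training and test sets, since they serve only as the reference for comparison.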

Feature Engineering

To determine whether an answer has been plagiarized, we check its similarity to the source text. To do this, we extract similarity features. The similarity features considered are:

  • Containment features (extracted using different n-gram sizes)
  • Longest common subsequence (extracted using dynamic programming)
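The two features listed above can be sketched as follows. This is an illustrative version rather than the notebook's code: whitespace tokenization, lowercasing, normalizing both scores by the length of the answer, and the names `containment` and `lcs_norm` are all assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def containment(answer: str, source: str, n: int = 1) -> float:
    """Fraction of the answer's n-grams that also occur in the source."""
    a = ngrams(answer.lower().split(), n)
    s = ngrams(source.lower().split(), n)
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum((a & s).values())  # Counter & Counter keeps the minimum counts
    return shared / total


def lcs_norm(answer: str, source: str) -> float:
    """Word-level longest common subsequence, divided by the answer length."""
    a = answer.lower().split()
    s = source.lower().split()
    if not a:
        return 0.0
    # Dynamic programming over two rows: prev[j] is the LCS length of the
    # answer words seen so far against the first j source words.
    prev = [0] * (len(s) + 1)
    for word_a in a:
        curr = [0]
        for j, word_s in enumerate(s, start=1):
            if word_a == word_s:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1] / len(a)


answer = "the cat sat on the mat"
source = "the cat lay on the mat today"
features = [containment(answer, source, 1),
            containment(answer, source, 2),
            lcs_norm(answer, source)]
print(features)
```

Computing containment for several values of `n` yields one feature per n-gram size. Small `n` captures shared vocabulary, while larger `n` and the longest common subsequence are more sensitive to copied word order.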

Model Building

  • A Random Forest classifier is trained on the extracted features.
  • Cross-validation is used to reduce overfitting to the training data.
  • Hyperparameter tuning is performed to improve the model further.
  • The model obtained an accuracy of 94% on the test data.
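The steps above could look roughly like the sketch below, which assumes scikit-learn and uses its `RandomForestClassifier` and `RandomizedSearchCV`. The feature matrix here is synthetic stand-in data, and the parameter ranges, split ratio, and number of folds are assumptions rather than the notebook's settings, so the resulting accuracy is not expected to match the figure reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption): 100 rows of three similarity
# features in [0, 1], with plagiarised rows (y == 1) tending to score higher.
rng = np.random.default_rng(0)
n_samples = 100
y = rng.integers(0, 2, size=n_samples)
X = np.clip(
    rng.normal(loc=0.3 + 0.4 * y[:, None], scale=0.15, size=(n_samples, 3)),
    0.0,
    1.0,
)

# Hold out a test set; the 75/25 split is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical search space for the random search.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

# Each sampled setting is scored with 5-fold cross-validation on the
# training data only, so the test set stays untouched during tuning.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

# By default the search refits the best setting on the full training set.
test_accuracy = search.best_estimator_.score(X_test, y_test)
print(search.best_params_)
print(f"test accuracy: {test_accuracy:.2f}")
```

Evaluating on the held-out test set only once, after tuning, keeps the reported accuracy from being inflated by the hyperparameter search.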

The notebook for this project is included in this repository.
