nlmate - Data preprocessing for NLMs

An open-source solution for scraping, preprocessing, and labeling datasets from specified Wikipedia pages for use in training Natural Language Processing (NLP) models.

Prerequisites

  • Python 3.6 or higher
  • Wikipedia API package
  • spaCy package
  • TextBlob package
  • pandas package

Install the required packages using the following command:

pip install wikipedia-api spacy textblob pandas

Usage

  1. (Optional) Generate a list of random Wikipedia page names using generate_random_wikipedia_pages.py. By default, it generates 50 random page names:

python generate_random_wikipedia_pages.py

    This will create a file named wikipedia_pages.txt containing the Wikipedia page names.

    To use specific Wikipedia pages instead, skip the random page generation script and enter the page titles in wikipedia_pages.txt, one title per line.

  2. Run fetch_and_label_wikipedia_data.py to fetch content from the listed Wikipedia pages, preprocess the data, label it, and save the resulting dataset as a CSV file:

python fetch_and_label_wikipedia_data.py

The output file training_data.csv will contain the structured training dataset.
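
The steps above can be sketched roughly as follows. This is an illustrative outline, not the actual contents of fetch_and_label_wikipedia_data.py: the helper names (load_titles, fetch_page_text, build_rows, save_rows) and the column names title, text, and label are assumptions, the spaCy preprocessing step is not reproduced, and the Wikipedia constructor call assumes wikipedia-api 0.6 or newer, which requires a user agent string.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Optional


def load_titles(path: str = "wikipedia_pages.txt") -> List[str]:
    """Read one page title per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def fetch_page_text(title: str) -> Optional[str]:
    """Return the plain text of a page, or None if the page does not exist."""
    # Imported lazily so the other helpers can be used without the package.
    import wikipediaapi

    wiki = wikipediaapi.Wikipedia(
        user_agent="nlmate-example (you@example.com)",  # placeholder value
        language="en",
    )
    page = wiki.page(title)
    return page.text if page.exists() else None


def build_rows(
    titles: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    label_func: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Fetch each page, skip missing or empty ones, and attach a label."""
    rows = []
    for title in titles:
        text = fetch(title)
        if not text:
            continue
        rows.append({"title": title, "text": text, "label": label_func(text)})
    return rows


def save_rows(rows: List[Dict[str, str]], path: str = "training_data.csv") -> None:
    """Write the rows to a CSV file with pandas (column names are assumed)."""
    import pandas as pd

    pd.DataFrame(rows, columns=["title", "text", "label"]).to_csv(path, index=False)
```

Passing the fetch and labeling functions in as arguments keeps build_rows easy to try without network access, for example by supplying a dictionary lookup in place of fetch_page_text.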

Customization

To customize the labeling function, edit the label_func function in fetch_and_label_wikipedia_data.py. This function should take a text input and return a label based on the content of the text. The current implementation uses TextBlob sentiment analysis to assign "positive", "negative", or "neutral" labels based on the sentiment polarity of the text.
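
A minimal sketch of what a sentiment-based label_func might look like is shown below. The polarity_to_label helper and its threshold parameter are hypothetical, and the cutoff of 0.0 is an assumption: the actual thresholds used in fetch_and_label_wikipedia_data.py may differ.

```python
def polarity_to_label(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a label.

    The threshold is an assumed parameter: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def label_func(text: str) -> str:
    """Label text by its TextBlob sentiment polarity."""
    # Imported lazily so polarity_to_label can be used without TextBlob.
    from textblob import TextBlob

    return polarity_to_label(TextBlob(text).sentiment.polarity)
```

Any replacement only needs to keep the same shape: accept a string and return a label string. Raising the threshold (for example to 0.1) widens the neutral band, which can help when most encyclopedic text scores close to zero.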

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributors

  • cpstrommen
