
textPrep

a text preprocessing library for topic models

textPrep for Norwegian subtitles

This fork of textPrep extends the toolkit to work with subtitles from Norwegian TV programs. The changes include new rules, extensions to existing rules, and some small improvements to general-purpose code. Here is a complete list of the changes:

New rules

  • RemoveNumbers: Removes all instances of numbers (digits) from text (remove_numbers.py)
  • RemoveSubtitleMetadata: Removes metadata tags, characters, and phrases found in Norwegian TV program subtitles (remove_subtitle_metadata.py); a usage sketch follows this list
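
A minimal sketch of wiring the new rules into a pipeline. The import path and the method names remove_numbers and remove_subtitle_metadata are assumptions based on the naming pattern of the existing rules:

from preprocessing_pipeline import Preprocess, RemoveNumbers, RemoveSubtitleMetadata

# NOTE: method names below are assumed to mirror the existing rule pattern
rn = RemoveNumbers()
rsm = RemoveSubtitleMetadata()

pipeline = Preprocess()
pipeline.document_methods = [(rsm.remove_subtitle_metadata, str(rsm),),
                             (rn.remove_numbers, str(rn),)
                             ]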

Extensions to existing rules

  • RemoveStopWords: Extended to allow the use of a Norwegian stop word list for stop word removal (remove_stopwords.py)
  • Lemmatize: Extended to allow lemmatization in languages other than English with the help of spaCy language models (lemmatize.py)
  • PartOfSpeech: Extended to allow PoS tagging and removal in languages other than English with the help of spaCy language models (part_of_speech.py); see the sketch after this list
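
As a sketch of the Norwegian extensions: NLTK ships a Norwegian stop word list, and spaCy provides the Norwegian Bokmål model nb_core_news_sm. The import path for Lemmatize and the constructor argument shown for it are assumptions, not confirmed signatures:

from nltk.corpus import stopwords
from preprocessing_pipeline import RemoveStopWords, Lemmatize

# NLTK's stop word corpus includes Norwegian
norwegian_sw = stopwords.words('norwegian')
rsw = RemoveStopWords(extra_sw=norwegian_sw)

# hypothetical argument: hand the extended rule a spaCy model name
# (requires: python -m spacy download nb_core_news_sm)
lem = Lemmatize(spacy_model='nb_core_news_sm')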

Other changes

  • get_data_stats: Extended to calculate stop word statistics for datasets, using the stop word extension above (dataset_stats.py)
  • word_co_frequency_document: Added to calculate word co-frequencies within individual documents, so that co-frequency counting can be parallelized (common.py); see the sketch below
  • compute_metrics: Modified to return only coherence and diversity (evaluate_topic_set.py)
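
To illustrate the parallelization that word_co_frequency_document enables, here is a hedged sketch. It assumes the function takes a single tokenized document and returns a mapping from word pairs to counts, which the source does not confirm:

from multiprocessing import Pool
from collections import Counter
from settings.common import word_co_frequency_document

def parallel_co_frequencies(dataset, workers=4):
    # assumed signature: word_co_frequency_document(doc) -> {(w1, w2): count}
    with Pool(workers) as pool:
        per_doc = pool.map(word_co_frequency_document, dataset)
    # merge per-document counts into corpus-level co-frequencies
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    return dict(total)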

Requirements

To install the required packages:

pip install -r requirements.txt

Additional NLTK packages needed:

  • stopwords
  • wordnet
  • averaged_perceptron_tagger

To install the NLTK packages, run the following in a Python shell:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

Download only the required packages; the whole set of additional NLTK data is massive.

Using textPrep

Creating a pipeline and preprocessing

from preprocessing_pipeline import (Preprocess, RemovePunctuation, Capitalization, RemoveStopWords, RemoveShortWords, TwitterCleaner, RemoveUrls)

# initialize the pipeline
pipeline = Preprocess()

# initialize the rules you want to use
rp = RemovePunctuation(keep_hashtags=False)
ru = RemoveUrls()
cap = Capitalization()

# include extra data in a rule if necessary
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
stopwords_list.extend(['rt', 'amp'])

rsw = RemoveStopWords(extra_sw=stopwords_list)

# add rules to the pipeline (the stringified rule makes it easy to save the pipeline details, as shown below)
pipeline.document_methods = [(ru.remove_urls, str(ru),),
                             (rp.remove_punctuation, str(rp),),
                             (cap.lowercase, str(cap),),
                             (rsw.remove_stopwords, str(rsw),)
                             ]
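
Because each rule is stored alongside its string form, the pipeline configuration can be saved for reproducibility, for example:

# write the stringified rules to a file to record the pipeline details
with open('pipeline_config.txt', 'w') as f:
    for _, rule_name in pipeline.document_methods:
        f.write(rule_name + '\n')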

You can load your data however you want, so long as it ends up as a list of lists. We provide methods for loading CSV files with and without dates.

# load the data
def load_dataset_with_dates(path):
    # each line is expected to be: date<TAB>document text
    dataset = []
    try:
        with open(path, 'r') as f:
            for line in f:
                # keep only the text field and split it into tokens
                dataset.append(line.strip().split('\t')[1].split(' '))
        return dataset
    except FileNotFoundError:
        print('The path provided for your dataset does not exist: {}'.format(path))
        import sys
        sys.exit()

dataset = load_dataset_with_dates('data/sample_tweets.csv')
# dataset[i] = ['list', 'of', 'words', 'in', 'document_i']

# initialize the pipeline runner
from preprocessing_pipeline.NextGen import NextGen

runner = NextGen()

# preprocess the data, with some extra ngrams thrown in to ensure they are considered regardless of frequency
processed_dataset = runner.full_preprocess(dataset, pipeline, ngram_min_freq=10, extra_bigrams=None, extra_ngrams=['donald$trump', 'joe$biden', 'new$york$city'])

# assess data quality quickly and easily
from evaluation_metrics.dataset_stats import get_data_stats
print(get_data_stats(processed_dataset))

You can do some extra filtering after preprocessing, such as TF-IDF filtering:

from settings.common import word_tf_df

freq = {}
freq = word_tf_df(freq, processed_dataset)
filtered_dataset = runner.filter_by_tfidf(dataset=processed_dataset, freq=freq, threshold=0.25)

# assess data quality again 
from evaluation_metrics.dataset_stats import get_data_stats
print(get_data_stats(filtered_dataset))

Referencing textPrep

Churchill, Rob and Singh, Lisa. 2021. textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data. DATA 2021.

@inproceedings{churchill2021textprep,
  author = {Churchill, Rob and Singh, Lisa},
  title = {textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data},
  booktitle = {DATA 2021},
  year = {2021},
}


Contributors

rchurch4, magnusrr
