stopwords's Introduction

stopwords is a Go package that removes stop words from text content. If instructed to do so, it also removes HTML tags and parses HTML entities. The goal is to prepare text for use by natural language processing algorithms or text comparison algorithms such as SimHash.

Join the chat at https://gitter.im/bbalet/stopwords

It uses a curated list of the most frequent words used in these languages:

  • Arabic
  • Bulgarian
  • Czech
  • Danish
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Japanese
  • Khmer
  • Latvian
  • Norwegian
  • Persian
  • Polish
  • Portuguese
  • Romanian
  • Russian
  • Slovak
  • Spanish
  • Swedish
  • Thai
  • Turkish

If a function is called with an unsupported language, it does not fail; it applies the English filter to the content instead.
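The fallback behaviour described above can be pictured with a small lookup. This is only a sketch of the idea, not the package's own code: the stopSets table, its three-word lists, and the listFor helper are all made up for illustration.

```go
package main

import "fmt"

// stopSets is a tiny, hypothetical table; the real package ships much
// larger curated lists and does not necessarily store them this way.
var stopSets = map[string]map[string]bool{
	"en": {"the": true, "a": true, "of": true},
	"fr": {"la": true, "un": true, "le": true},
}

// listFor returns the stop word set for langCode and falls back to the
// English set when the language code is not in the table.
func listFor(langCode string) map[string]bool {
	if set, ok := stopSets[langCode]; ok {
		return set
	}
	return stopSets["en"]
}

func main() {
	fmt.Println(listFor("fr")["la"])  // looked up in the French set
	fmt.Println(listFor("xx")["the"]) // unknown code, English set is used
}
```

A practical consequence of such a fallback is that a typo in the language code goes unnoticed: the text is still filtered, just with the wrong list.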

How to use this package?

You can find an example at https://github.com/bbalet/gorelated, where the stopwords package is used in conjunction with the SimHash algorithm to find a list of related content for a static website generator:

package main

import (
      "fmt"

      "github.com/bbalet/stopwords"
)

func main() {
      //Example with 2 strings containing P html tags
      //"la", "un", etc. are (stop) words without lexical value in French
      string1 := []byte("<p>la fin d'un bel après-midi d'été</p>")
      string2 := []byte("<p>cet été, nous avons eu un bel après-midi</p>")

      //Return a string where HTML tags and French stop words have been removed
      //(CleanString takes a string, so the byte slice is converted first)
      cleanContent := stopwords.CleanString(string(string1), "fr", true)
      fmt.Println(cleanContent)

      //Get two (Sim) hashes representing the content of each string
      hash1 := stopwords.Simhash(string1, "fr", true)
      hash2 := stopwords.Simhash(string2, "fr", true)

      //Hamming distance between the two hashes (difference between contents)
      distance := stopwords.CompareSimhash(hash1, hash2)
      fmt.Println(distance)

      //Clean the content of string1 and string2, compute the Levenshtein distance
      levenshtein := stopwords.LevenshteinDistance(string1, string2, "fr", true)
      fmt.Println(levenshtein)
}

Here fr is the ISO 639-1 code for French (a BCP 47 tag is accepted as well). See https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
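For readers unfamiliar with the Hamming distance mentioned in the example, it is the number of bit positions in which two hashes differ. The sketch below computes it with the standard library only, assuming 64-bit hashes; hammingDistance is a hypothetical helper and is not claimed to be how CompareSimhash is implemented.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bits that differ between two 64-bit values:
// XOR leaves a 1 wherever the inputs disagree, OnesCount64 counts those 1s.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	// 0xB is 1011 and 0x2 is 0010 in binary; they differ in two positions.
	fmt.Println(hammingDistance(0xB, 0x2))
}
```

A small distance means the two hashed texts are likely similar; identical inputs give a distance of 0.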

How to load a custom list of stop words from a file/string?

This package comes with a predefined list of stopwords. However, two functions allow you to use your own list of words:

stopwords.LoadStopWordsFromFile(filePath, langCode, separator)
stopwords.LoadStopWordsFromString(wordsList, langCode, separator)

They will overwrite the predefined words for the given language. The file stopwords.txt provides an example.
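To make the wordsList and separator parameters concrete, here is a minimal sketch of turning a separated string into a word set. It is an illustration only: parseStopWords is a hypothetical helper, and trimming whitespace and lower-casing entries are assumptions, not documented behaviour of LoadStopWordsFromString.

```go
package main

import (
	"fmt"
	"strings"
)

// parseStopWords splits wordsList on separator and returns the entries as a
// set. Blank entries are skipped; entries are trimmed and lower-cased
// (both are assumptions made for this sketch).
func parseStopWords(wordsList, separator string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Split(wordsList, separator) {
		w = strings.ToLower(strings.TrimSpace(w))
		if w != "" {
			set[w] = true
		}
	}
	return set
}

func main() {
	set := parseStopWords("The, a ,,OF", ",")
	fmt.Println(len(set), set["the"], set["of"])
}
```

For a file with one word per line, the same idea applies with "\n" as the separator.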

How to overwrite the word segmenter?

If you don't want to strip Unicode characters in the 'Number, Decimal Digit' category, call the function DontStripDigits before using the package:

stopwords.DontStripDigits()

If you want to use your own segmenter, you can overwrite the regular expression:

stopwords.OverwriteWordSegmenter(`[\pL]+`)
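Before overwriting the segmenter, it can help to check what a pattern actually matches. The sketch below uses only Go's standard regexp package and is independent of stopwords; the segment helper and the sample text are made up, and how the package applies the expression internally is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// segment returns every non-overlapping match of pattern in text.
func segment(pattern, text string) []string {
	return regexp.MustCompile(pattern).FindAllString(text, -1)
}

func main() {
	text := "summer 2024, лето 2024"

	// \pL matches any Unicode letter, so digits and punctuation are dropped.
	fmt.Println(segment(`[\pL]+`, text))

	// Adding \p{Nd} (Number, Decimal Digit) keeps the digits as well.
	fmt.Println(segment(`[\pL\p{Nd}]+`, text))
}
```

With the first pattern only the two words are returned; with the second, each 2024 is returned as a token of its own.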

Limitations

Please note that this library doesn't break words. If you want to break words prior to using stopwords, you need to use another library that provides a binding to the ICU library.

These curated lists contain the most used words across various topics; they were not built from a corpus limited to any specialized topic.

Credits

Most of the lists were built by IR Multilingual Resources at UniNE: http://members.unine.ch/jacques.savoy/clef/index.html

License

stopwords is released under the BSD license.

stopwords's People

Contributors

bbalet, eivindam, moorereason

stopwords's Issues

Loading stop words from a file.

Hello Jacques, since I believe you are French, I am taking the liberty of writing this issue in French!

To run tests on the use of stop words with different texts, I store the stop word list in a file containing one stop word per line. This lets me test without recompiling.

It would therefore be useful to have a function that takes the path of the stop word file to load in place of the language code, e.g. CleanStringFile(content string, fileName string, cleanHTML bool)

Urdu Stop Words

I want to contribute by adding the Urdu stop words file. Do I have to submit a PR?

fatal error: concurrent map writes

This is causing a fatal error under high volume:

fatal error: concurrent map writes

goroutine 30452 [running]:
runtime.throw(0x10276a9, 0x15)
    /usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc0003fcea8 sp=0xc0003fce78 pc=0x42f1a2
runtime.mapassign_faststr(0xe7a2c0, 0xc015fe1500, 0x104e145, 0x4, 0xc0151b0e38)
    /usr/local/go/src/runtime/map_faststr.go:211 +0x417 fp=0xc0003fcf10 sp=0xc0003fcea8 pc=0x4132c7
github.com/bbalet/stopwords.LoadStopWordsFromString(0x104e101, 0xa1, 0x100b6d9, 0x2, 0x100b47b, 0x1)
    /go/pkg/mod/github.com/bbalet/[email protected]/custom.go:74 +0xd30 fp=0xc0003fcf98 sp=0xc0003fcf10 pc=0xd12a00
github.com/urbn/catalog-search-service/app/services.(*ProductService).GetProducts(0xc000576050, 0x7f4f63e60fc0, 0xc02521ec80, 0xc015df9400, 0x0, 0xc015e3c760, 0x1e, 0x1, 0x0, 0xc015e623b8, ...)
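The trace shows a map assignment reached through LoadStopWordsFromString while another goroutine was also writing to a map. In Go, unsynchronized writes to the same map from several goroutines abort the program with exactly this message. As a general illustration of the remedy, and not a description of any fix in the package, the sketch below guards a shared map with a sync.RWMutex; safeStopWords and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// safeStopWords is a hypothetical wrapper around a word set whose map is
// only touched while the mutex is held.
type safeStopWords struct {
	mu    sync.RWMutex
	words map[string]bool
}

func newSafeStopWords() *safeStopWords {
	return &safeStopWords{words: make(map[string]bool)}
}

// load adds words under the write lock, so concurrent callers are serialized.
func (s *safeStopWords) load(list []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, w := range list {
		s.words[w] = true
	}
}

// has reads under the read lock, which many readers may hold at once.
func (s *safeStopWords) has(w string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.words[w]
}

// loadConcurrently calls load from several goroutines and waits for all of them.
func loadConcurrently(s *safeStopWords, workers int, list []string) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.load(list)
		}()
	}
	wg.Wait()
}

func main() {
	s := newSafeStopWords()
	loadConcurrently(s, 100, []string{"la", "un"})
	fmt.Println(s.has("la"), s.has("xx"))
}
```

On the caller's side, a simpler option is often to load the custom list once at start-up, before any request handling goroutine runs, rather than on every request.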
