pattern's Introduction

Pattern

Pattern is a web mining module for Python. It has tools for:

  • Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
  • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
  • Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
  • Network Analysis: graph centrality and visualization.

It is well documented and bundled with 50+ examples and 350+ unit tests. The source code is licensed under BSD and available from http://www.clips.ua.ac.be/pages/pattern.

Pattern example workflow

Version

2.6

License

BSD, see LICENSE.txt for further details.

Installation

Pattern is written for Python 2.5+ (no support for Python 3 yet). The module has no external dependencies except when using LSA in the pattern.vector module, which requires NumPy (installed by default on Mac OS X). To install Pattern so that it is available in all your scripts, unzip the download and from the command line do:

cd pattern-2.6
python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

pip install pattern

If none of the above works, you can make Python aware of the module in three ways:

  • Put the pattern folder in the same folder as your script.
  • Put the pattern folder in the standard location for modules so it is available to all scripts:
    • c:\python26\Lib\site-packages\ (Windows),
    • /Library/Python/2.6/site-packages/ (Mac OS X),
    • /usr/lib/python2.6/site-packages/ (Unix).
  • Add the location of the module to sys.path in your script, before importing it:

MODULE = '/users/tom/desktop/pattern'
import sys
if MODULE not in sys.path:  # an if statement cannot follow "import sys;" on one line
    sys.path.append(MODULE)
from pattern.en import parsetree

Example

This example trains a classifier on adjectives mined from Twitter. First, tweets that contain hashtag #win or #fail are collected. For example: "$20 tip off a sweet little old lady today #win". The word part-of-speech tags are then parsed, keeping only adjectives. Each tweet is transformed to a vector, a dictionary of adjective → count items, labeled WIN or FAIL. The classifier uses the vectors to learn which other tweets look more like WIN or more like FAIL.

from pattern.web    import Twitter
from pattern.en     import tag
from pattern.vector import KNN, count

twitter, knn = Twitter(), KNN()

for i in range(1, 3):
    for tweet in twitter.search('#win OR #fail', start=i, count=100):
        s = tweet.text.lower()
        p = '#win' in s and 'WIN' or 'FAIL'
        v = tag(s)
        v = [word for word, pos in v if pos == 'JJ'] # JJ = adjective
        v = count(v) # {'sweet': 1}
        if v:
            knn.train(v, type=p)

print(knn.classify('sweet potato burger'))
print(knn.classify('stupid autocorrect'))
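
To make the "adjective → count" vector concrete, here is a minimal standard-library sketch of that one step. It does not use Pattern: the tagged word list is a hand-written stand-in for the output of tag(), and adjective_vector is a hypothetical helper name chosen for illustration.

```python
from collections import Counter

# Hand-written (word, part-of-speech) pairs standing in for tag() output.
tagged = [("sweet", "JJ"), ("little", "JJ"), ("old", "JJ"),
          ("lady", "NN"), ("sweet", "JJ")]

def adjective_vector(tagged_words):
    """Keep only adjectives (JJ) and count how often each one occurs."""
    return dict(Counter(word for word, pos in tagged_words if pos == "JJ"))

vector = adjective_vector(tagged)
print(vector)
```

Each tweet would yield one such dictionary, which is then passed to the classifier together with its WIN or FAIL label.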

Documentation

http://www.clips.ua.ac.be/pages/pattern

Reference

De Smedt, T., Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031–2035.

Contribute

The source code is hosted on GitHub and contributions or donations are welcomed, see the developer documentation. If you use Pattern in your work, please cite our reference paper.

Bundled dependencies

Pattern is bundled with the following data sets, algorithms and Python packages:

  • Beautiful Soup, Leonard Richardson
  • Brill tagger, Eric Brill
  • Brill tagger for Dutch, Jeroen Geertzen
  • Brill tagger for German, Gerold Schneider & Martin Volk
  • Brill tagger for Spanish, trained on Wikicorpus (Samuel Reese & Gemma Boleda et al.)
  • Brill tagger for French, trained on Lefff (Benoît Sagot & Lionel Clément et al.)
  • Brill tagger for Italian, mined from Wiktionary
  • English pluralization, Damian Conway
  • Spanish verb inflection, Fred Jehle
  • French verb inflection, Bob Salita
  • Graph JavaScript framework, Aslak Hellesoy & Dave Hoover
  • LIBSVM, Chih-Chung Chang & Chih-Jen Lin
  • LIBLINEAR, Rong-En Fan et al.
  • NetworkX centrality, Aric Hagberg, Dan Schult & Pieter Swart
  • PDFMiner, Yusuke Shinyama
  • Python docx, Mike Maccana
  • PyWordNet, Oliver Steele
  • simplejson, Bob Ippolito
  • spelling corrector, Peter Norvig
  • Universal Feed Parser, Mark Pilgrim
  • WordNet, Christiane Fellbaum et al.

Acknowledgements

Authors:

Contributors (chronological):

  • Frederik De Bleser
  • Jason Wiener
  • Daniel Friesen
  • Jeroen Geertzen
  • Thomas Crombez
  • Ken Williams
  • Peteris Erins
  • Rajesh Nair
    1. De Smedt
  • Radim Řehůřek
  • Tom Loredo
  • John DeBovis
  • Thomas Sileo
  • Gerold Schneider
  • Martin Volk
  • Samuel Joseph
  • Shubhanshu Mishra
  • Robert Elwell
  • Fred Jehle
  • Antoine Mazières + fabelier.org
  • Rémi de Zoeten + closealert.nl
  • Kenneth Koch
  • Jens Grivolla
  • Fabio Marfia
  • Steven Loria
  • Colin Molter + tevizz.com
  • Peter Bull
  • Maurizio Sambati
  • Dan Fu
  • Salvatore Di Dio
  • Vincent Van Asch
  • Frederik Elwert

pattern's People

Contributors

dim-an, duilio, fdb, foril, frederik-elwert, hayd, james-cirrushq, jgrivolla, kkoch986, markus-beuckelmann, napsternxg, newworldorder, pet3ris, piskvorky, pjbull, rajeshnair, relwell, ritwikgupta, sloria, waylonflinn

pattern's Issues

Python 3 todo list

I've broken up python 3 migration (#1) into the following independent tasks.

Fix the (python 2) skipped tests see #3:

  • mysql (this just needs travis to install it; I think this is no big deal, i.e. just a line in .travis.yml)
  • test command line (uses PIPE atm which doesn't work on py2.6)
  • test _keywords
  • skip MediaWikiArticle
  • I think there are a couple more from #3 (search for FIXME and TODO)

Python 3 stuff

  • print statements (#4)
  • get imports working from #4
  • sgmllib is deprecated in python 2 and removed in python 3 (!) https://docs.python.org/2/library/sgmllib.html (I've kinda worked around this by renaming html.parser, which is a hack; the fact that html.parser is in the standard library may mean this section of pattern is no longer required) clips#4
  • ... (triage remaining)
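
As an illustration of the sgmllib item above, the following is a minimal sketch of a parser built directly on the standard library's HTMLParser, imported so that it runs on both Python 2 and Python 3. LinkCollector is a hypothetical example class, not part of Pattern, and this is not the workaround actually used in the port.

```python
try:
    from html.parser import HTMLParser   # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser    # Python 2 location

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="http://www.clips.ua.ac.be/pages/pattern">Pattern</a></p>')
print(collector.links)
```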

test files passing on python 3

e.g. via nosetests test/test_xx.py

  • add separate travis py3 build for the thus far passing tests
  • test_graph #11
  • test_text en de... #16
  • test_metric
  • test_vector #17
  • test_web
  • test_db
  • remove py3 special case build (once all the above are passing!)

General stuff

  • use a single README (probably an rst is best)
  • add a MANIFEST.in - burn with fire the current os.walk stuff (other files?)
  • grab version number from the init file (rather than the pattern installation)
  • style is a bit random for importing py3 stuff atm (my fault) should be more consistent
  • remove testing suite functions (just use test_main).
  • some tests are pretty flaky (ie. numbers changing)
  • remove utf-8 print hack added to the examples (due to __future__ print_statement being sensitive to unicode) i.e. make all the things unicode. See #12
  • some tests are obscured by being so class-based (although some are still not very DRY); atm I prefixed these with Abstract..., but they should probably be ABCs (the key is that they can't start with Test, otherwise nose etc. tries to run them, and fails).
  • javascript tests (?) are currently not run; unclear whether they do anything?!
  • add examples to tests (e.g. just run all the example py files, no assertions, just running... but potentially could add some asserts?) - I have nearly done this.
  • work out which dependencies can stop being vendorized (see below)
  • deprecation messages from uses of dependencies (which have been updated), see travis/nose output
  • add travis, coveralls, landscape (sign up with pattern & pattern3 gh accounts and add this repo)
  • add banner for the above into README
  • pep8/docformatter all the things (this should be simultaneously merged with clips/pattern, otherwise merging in work to clips will be very very difficult ???)
  • decide on when this can be merged back upstream (IMO this should be asap, we don't require py3 to be ready just that py2 still works... and a quick pep8 storm :) )
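
For the item above about grabbing the version number from the init file, one common approach is to parse the `__version__` assignment out of the file's text instead of importing the package. The sketch below assumes the init file contains a line such as `__version__ = "2.6"`; read_version is a hypothetical helper name, and the sample source string stands in for the real file contents.

```python
import re

def read_version(init_source):
    """Return the string assigned to __version__ in the given source text."""
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                      init_source, re.MULTILINE)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)

# Stand-in for the text of the package's __init__.py.
sample_source = '__author__ = "someone"\n__version__ = "2.6"\n'
print(read_version(sample_source))
```

In a setup.py the source text would be read from the package's init file with open(...).read() before being passed to the helper, so that the setup script never imports the package it is installing.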

Performance

  • come up with some benchmarks to compare python 2 and 3 (and potentially the old code base)
  • profile and see what can be improved...
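
A benchmark of the kind proposed above could start from the standard library's timeit module, run unchanged under each interpreter. This is only a sketch: tokenize is a stand-in workload rather than a Pattern function, and benchmark is a hypothetical helper name.

```python
import sys
import timeit

def tokenize(text):
    # Stand-in workload; replace with the Pattern call being measured.
    return text.lower().split()

def benchmark(func, arg, repeat=3, number=1000):
    """Return the best observed time per call, in seconds."""
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number

seconds_per_call = benchmark(tokenize, "The quick brown fox " * 50)
print("%.3g s per call on Python %d.%d"
      % ((seconds_per_call,) + tuple(sys.version_info[:2])))
```

Taking the minimum over several repeats reduces noise from other processes, which matters when comparing numbers across interpreters.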

pattern3.db import CSV throws OverflowError: Python int too large to convert to C long

from pattern3.db import CSV

throws the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\db\__init__.py", line 2159, in <module>
    csvlib.field_size_limit(sys.maxsize)
OverflowError: Python int too large to convert to C long

on

WinPython 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
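
The error arises because csv.field_size_limit() stores the limit in a C long, which is 32 bits on 64-bit Windows while sys.maxsize is 64 bits. A commonly used workaround, sketched below with the standard csv module only, is to shrink the requested limit until the call succeeds. This is an illustration, not the fix adopted in pattern3, and set_max_field_size_limit is a hypothetical helper name.

```python
import csv
import sys

def set_max_field_size_limit():
    """Set the csv field size limit to the largest value the platform accepts."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long on this platform; try a smaller one.
            limit = limit // 2

applied_limit = set_max_field_size_limit()
print(applied_limit)
```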

from pattern3.search import search throws IndentationError: expected an indented block error

from pattern3.search import search

throws the error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\text\search.py", line 273
    except ImportError:
    ^
IndentationError: expected an indented block

on

WinPython 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32

and on

Ubuntu
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]

pdfminer

After trying various imports, as suggested in the wiki, I found that all imports were working, except pattern.web, which required pdfminer - a module unavailable for Python 3. I have currently simply wrapped up the relevant import statements in a try: clause in web.py in order to make the import functional, but hope to migrate to pdfminer3k - the Python 3 port for pdfminer to permanently resolve this.

The port to pdfminer3k is already complete. It was just an installation issue on my machine. Sorry about raising this issue.
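
The temporary workaround described above, wrapping the import in a try: clause, is the usual optional-dependency pattern. A minimal sketch follows. It assumes the package is importable under the name pdfminer; HAS_PDFMINER and require_pdfminer are hypothetical names used for illustration and are not taken from web.py.

```python
try:
    import pdfminer  # assumption: the installed package exposes this module name
    HAS_PDFMINER = True
except ImportError:
    pdfminer = None
    HAS_PDFMINER = False

def require_pdfminer():
    """Return the pdfminer module, or fail with a clear message when it is missing."""
    if not HAS_PDFMINER:
        raise ImportError("PDF support requires pdfminer (pdfminer3k on Python 3)")
    return pdfminer

print("PDF support available:", HAS_PDFMINER)
```

With this shape, importing the surrounding module always succeeds, and only code paths that actually need PDF parsing raise an error.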

Make Pattern compatible with Python 3

We have a $1,850 budget (PSF grant + private donations) available for developers that want to help us port Pattern to Python 3, by joining the pattern3 development team and sending pull requests. Please read the grant proposal for an overview of the objectives. Visit the wiki for helpful hints to get started, an estimated timeline and bounties.

IndentationError: expected an indented block

When I run the code in python3,

from pattern3.en import tag
from nltk.corpus import wordnet as wn

# Annotate text tokens with POS tags
def pos_tag_text(text):

    def penn_to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None

    tagged_text = tag(text)
    tagged_lower_text = [(word.lower(), penn_to_wn_tags(pos_tag))
                         for word, pos_tag in
                         tagged_text]
    return tagged_lower_text

I got the error message:
File "/Anaconda3/lib/python3.6/site-packages/pattern3/text/tree.py", line 37
except:
^
IndentationError: expected an indented block

[BUG] typo in `web/__init__.py`

Line 3756: should be attrs instead of attr

Current: if self.classes.issubset(set([s.lower() for s in e.attr.get("class", [])])) is False:

Should be: if self.classes.issubset(set([s.lower() for s in e.attrs.get("class", [])])) is False:

Note: This change makes sense in python2 pattern. pattern3 should probably be e.attributes
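
The effect of the typo can be reproduced without Pattern. In the sketch below, FakeElement is a hypothetical stand-in that only carries an attrs dictionary, as the report says the Python 2 element does; has_classes mirrors the corrected subset check.

```python
class FakeElement(object):
    """Stand-in for a parsed element; only the attrs mapping is modelled."""
    def __init__(self, attrs):
        self.attrs = attrs

def has_classes(element, classes):
    # Corrected lookup: read from element.attrs, not element.attr.
    element_classes = set(s.lower() for s in element.attrs.get("class", []))
    return set(classes).issubset(element_classes)

e = FakeElement({"class": ["Header", "wide"]})
print(has_classes(e, ["header"]))   # class names are compared in lowercase
print(hasattr(e, "attr"))           # the misspelled attribute does not exist
```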

"Modeling creativity with a semantic network of common sense" example throws error with pattern3

The "Modeling creativity with a semantic network of common sense" (https://www.clips.uantwerpen.be/pages/modeling-creativity-with-a-semantic-network-of-common-sense) example throws an error with pattern3.

from pattern.web import Google, plaintext 
from pattern.search import search 
 
def learn(concept): 
    q = 'I think %s is *' % concept 
    p = [] 
    g = Google(language='en') 
    for i in range(10): 
        for result in g.search(q, start=i, cached=True): 
            m = plaintext(result.description) 
            m = search(q, m) # use * as wildcard 
            if m: 
                p.append(m[0][-1].string) 
    return [w for w in p if w in PROPERTIES] 

for p in learn('Brussels'): 
    g.add_edge(p, 'Brussels', type='is-property-of') 

throws the error

Traceback (most recent call last):
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\web\__init__.py", line 571, in open
    return urllib.request.urlopen(request)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in learn
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\web\__init__.py", line 1400, in search
    data = url.download(cached=cached, **kwargs)
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\web\__init__.py", line 638, in download
    timeout, proxy, user_agent, referrer, authentication).read()
  File "E:\WPy-3710\python-3.7.1.amd64\lib\site-packages\pattern3\web\__init__.py", line 576, in open
    raise HTTP400BadRequest(src=e, url=url)
pattern3.web.HTTP400BadRequest

on

WinPython 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32

Readme update bundled/dependencies

I ripped out some vendorised packages (IIRC some more work can be done here), but I didn't update the readme.

  • review/finalize dependencies / bundled (vendorised) packages
  • update readme to reflect changes (e.g. python 3 is WIP/"supported")
  • discuss state of some e.g. PyWordNet... NLTK; pdfminer(s)

This may also need to be updated in the docs too.

Unicode all the things

Especially in the web module, tests atm intermittently fail, e.g. https://travis-ci.org/pattern3/pattern/jobs/41182906 (that commit passed when the PR was made, then failed when I merged it several days later, and now it passes again).

I guess we need to take care that everything is decoded as unicode once it's read in.

Would be good if we could unit test some of these so they're not intermittent!
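
One way to make this testable is to funnel every byte string through a single decoding helper at the point where data is read in, and to unit test that helper with fixed inputs instead of live web responses. The sketch below is an assumption about how that could look. to_unicode is a hypothetical helper name, and the choice of UTF-8 with a Latin-1 fallback is illustrative only.

```python
def to_unicode(data, encodings=("utf-8", "latin-1")):
    """Return text for the given bytes, trying each encoding in order."""
    if not isinstance(data, bytes):
        return data  # already text
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every listed encoding failed; replace undecodable bytes.
    return data.decode("utf-8", "replace")

print(to_unicode(b"caf\xc3\xa9") == to_unicode(b"caf\xe9"))
```

Because the inputs are literal byte strings, a test built on this helper gives the same result on every run, unlike tests that depend on what a remote server returns.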

Split up big files

I think it would be good to do some surgery to some of the larger files e.g. web/__init__.py, there is too much going on there and it would be much more readable to separate the concerns.

Aside, I personally dislike stuff in the __init__.py (something pattern does pretty much exclusively), but changing that is a big change. Not sure what consensus is about that style-wise.

This can't be done until we have the tests passing on python 3 (!) and probably better coverage.
