clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Home Page: https://github.com/clips/pattern/wiki

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 87.17%, JavaScript 12.80%, HTML 0.03%
Topics: python, machine-learning, natural-language-processing, web-mining, wordnet, sentiment-analysis, network-analysis

pattern's Introduction

Pattern


Pattern is a web mining module for Python. It has tools for:

  • Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser
  • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
  • Machine Learning: vector space model, clustering, classification (KNN, SVM, Perceptron)
  • Network Analysis: graph centrality and visualization.

It is well documented, thoroughly tested with 350+ unit tests and comes bundled with 50+ examples. The source code is licensed under BSD.

Example

This example trains a classifier on adjectives mined from Twitter (using Python 3). First, tweets that contain the hashtag #win or #fail are collected, for example: "$20 tip off a sweet little old lady today #win". Each tweet is then part-of-speech tagged, keeping only the adjectives, and transformed into a vector: a dictionary of adjective → count items, labeled WIN or FAIL. The classifier uses these vectors to learn which other tweets look more like WIN or more like FAIL.

from pattern.web import Twitter
from pattern.en import tag
from pattern.vector import KNN, count

twitter, knn = Twitter(), KNN()

for i in range(1, 3):
    for tweet in twitter.search('#win OR #fail', start=i, count=100):
        s = tweet.text.lower()
        p = 'WIN' if '#win' in s else 'FAIL'           # label each tweet by its hashtag
        v = tag(s)                                      # [(word, part-of-speech), ...]
        v = [word for word, pos in v if pos == 'JJ']    # JJ = adjective
        v = count(v)                                    # e.g. {'sweet': 1}
        if v:
            knn.train(v, type=p)

print(knn.classify('sweet potato burger'))
print(knn.classify('stupid autocorrect'))

Installation

Pattern supports Python 2.7 and Python 3.6. To install Pattern so that it is available in all your scripts, unzip the download and from the command line do:

cd pattern-3.6
python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

pip install pattern

If none of the above works, you can make Python aware of the module in three ways:

  • Put the pattern folder in the same folder as your script.
  • Put the pattern folder in the standard location for modules so it is available to all scripts:
    • c:\python36\Lib\site-packages\ (Windows),
    • /Library/Python/3.6/site-packages/ (Mac OS X),
    • /usr/lib/python3.6/site-packages/ (Unix).
  • Add the location of the module to sys.path in your script, before importing it:
MODULE = '/users/tom/desktop/pattern'
import sys
if MODULE not in sys.path:
    sys.path.append(MODULE)
from pattern.en import parsetree

Documentation

For documentation and examples see the user documentation at https://github.com/clips/pattern/wiki.

Version

3.6

License

BSD, see LICENSE.txt for further details.

Reference

De Smedt, T., Daelemans, W. (2012). Pattern for Python. Journal of Machine Learning Research, 13, 2031–2035.

Contribute

The source code is hosted on GitHub, and contributions or donations are welcome.

Bundled dependencies

Pattern is bundled with the following data sets, algorithms and Python packages:

  • Brill tagger, Eric Brill
  • Brill tagger for Dutch, Jeroen Geertzen
  • Brill tagger for German, Gerold Schneider & Martin Volk
  • Brill tagger for Spanish, trained on Wikicorpus (Samuel Reese & Gemma Boleda et al.)
  • Brill tagger for French, trained on Lefff (Benoît Sagot & Lionel Clément et al.)
  • Brill tagger for Italian, mined from Wiktionary
  • English pluralization, Damian Conway
  • Spanish verb inflection, Fred Jehle
  • French verb inflection, Bob Salita
  • Graph JavaScript framework, Aslak Hellesoy & Dave Hoover
  • LIBSVM, Chih-Chung Chang & Chih-Jen Lin
  • LIBLINEAR, Rong-En Fan et al.
  • NetworkX centrality, Aric Hagberg, Dan Schult & Pieter Swart
  • spelling corrector, Peter Norvig

Acknowledgements

Authors:

  • Tom De Smedt
  • Walter Daelemans

Contributors (chronological):

  • Frederik De Bleser
  • Jason Wiener
  • Daniel Friesen
  • Jeroen Geertzen
  • Thomas Crombez
  • Ken Williams
  • Peteris Erins
  • Rajesh Nair
  • F. De Smedt
  • Radim Řehůřek
  • Tom Loredo
  • John DeBovis
  • Thomas Sileo
  • Gerold Schneider
  • Martin Volk
  • Samuel Joseph
  • Shubhanshu Mishra
  • Robert Elwell
  • Fred Jehle
  • Antoine Mazières + fabelier.org
  • Rémi de Zoeten + closealert.nl
  • Kenneth Koch
  • Jens Grivolla
  • Fabio Marfia
  • Steven Loria
  • Colin Molter + tevizz.com
  • Peter Bull
  • Maurizio Sambati
  • Dan Fu
  • Salvatore Di Dio
  • Vincent Van Asch
  • Frederik Elwert

pattern's People

Contributors

dim-an, duilio, fdb, foril, frederik-elwert, james-cirrushq, jgrivolla, kkoch986, markus-beuckelmann, napsternxg, newworldorder, pet3ris, piskvorky, pjbull, rajeshnair, relwell, sloria, stephantul, xsardas1000


pattern's Issues

umlaut problem in parse

I am getting the following error, if I have any umlaut in my string. (eg. Häuser, etc)

Traceback (most recent call last):
  File "fetcher.py", line 28, in <module>
    main(sys.argv[1:])
  File "fetcher.py", line 14, in main
    s = parse(s, relations=False, lemmata=False)
  File "/Users/hikari/Documents/dev/python/heroku/venv/lib/python2.7/site-packages/pattern/text/de/parser/__init__.py", line 193, in parse
    m = dict((token.replace(u"ß", u"ss"), token) for sentence in s for token in sentence)
  File "/Users/hikari/Documents/dev/python/heroku/venv/lib/python2.7/site-packages/pattern/text/de/parser/__init__.py", line 193, in <genexpr>
    m = dict((token.replace(u"ß", u"ss"), token) for sentence in s for token in sentence)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
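A hedged workaround sketch (an assumption, not a confirmed fix): decode the byte string to unicode before calling parse(), so the module never has to decode it implicitly with the ASCII codec. Assumes Python 2, as in the report.

# -*- coding: utf-8 -*-
from pattern.de import parse

s = "Die Häuser sind groß."   # byte string in UTF-8
s = s.decode("utf-8")          # decode explicitly before parsing
print parse(s, relations=False, lemmata=False)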

Sentiment Load does not take optional path

When the load function in pattern.text.en.parser.sentiment is given an optional path, it sets the '_path' attribute of the Lexicon object, whereas in all other locations the 'path' attribute (without the underscore) is referenced. If the underscore is removed from load, everything works as expected. I posted a more detailed explanation to a relevant google groups thread, which should hopefully be approved soon.

Here is the load function:

def load(self, path=None):
    # Backwards compatibility with pattern.en.wordnet.Sentiment.
    if path is not None:
        self._path = path
    self._parse()

Here is where the path attribute is intended to be used, in SentiWordNet._parse_path:

def _parse_path(self):
    """ For backwards compatibility, look for SentiWordNet*.txt in:
        pattern/en/parser/, patter/en/wordnet/, or the given path.
    """
    try: f = (
        glob(os.path.join(self.path)) +
        glob(os.path.join(MODULE, self.path)) +
        glob(os.path.join(MODULE, "..", "wordnet", self.path)))[0]
    except IndexError:
        raise ImportError, "can't find SentiWordnet data file"
    return f

A quick search reveals that load() is the only place where _path is referenced.

I don't want to be stepping on anyone's toes or overstepping my bounds by pointing this out, but it does seem like load() has a typo. When '_path' in load() is changed to 'path', everything behaves as expected.

parts of speech example invalid syntax

The pattern documentation is generally excellent; this one at http://www.clips.ua.ac.be/pages/pattern-en#parser is not working:

The tag() function simply annotates words with their part-of-speech tag and returns a list of (word, tag)-tuples:

tag(string, tokenize=True, encoding='utf-8')
>>> from pattern.en import tag
>>>
>>> for word, pos in tag('I feel *happy*!')
>>>     if pos == "JJ": # Retrieve all adjectives.
>>>         print word

happy

When I try this in python 2.7.7, I get:

>>> from pattern.en import tag
>>> for word, pos in tag('I feel *happy*!')
  File "<stdin>", line 1
    for word, pos in tag('I feel *happy*!')
                                          ^
SyntaxError: invalid syntax

What is the correct syntax? Is there an example parts of speech tagger?

Cheers,
Dave
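For reference, the error is just the missing colon at the end of the for line (and the >>> prompts in the docs should not be typed). A working version, assuming Python 2 as in the report:

from pattern.en import tag

for word, pos in tag('I feel *happy*!'):
    if pos == "JJ":  # retrieve all adjectives
        print word   # prints: happy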

Not able to use Pattern on Ubuntu

Hello,

I would like to report that Pattern couldn't be used out of the box with Ubuntu (12).

I have spent 4 hours today trying to figure out how to run a Python script which uses Pattern on Ubuntu, and I've failed. The script uses SVM classification and the problems come from how Pattern supports it.

Problem N1
Pattern is dependent on 3rd party SVM classification library libsvm. It has to be installed on the system if you wish to use SVM in pattern, otherwise Pattern dies with exception.

Problem N2
Even after you install libsvm by hand using "sudo apt-get libsvm3" you will not be able to use it.

You will get the following error:
"AttributeError: /usr/lib/libsvm.so.3: undefined symbol: svm_get_sv_indices"

To repeat this problem, simply try to run this example on Ubuntu
(https://github.com/clips/pattern/blob/master/examples/01-web/08-wiktionary.py)

The problem seems to happen because Pattern requires newer libsvm module than the one being installed with "apt-get".

What am I doing wrong?
How do I properly install Pattern along with libsvm on Ubuntu?

wordnet issues

I've made a fresh install of pattern-master yesterday and I'm running into issues with wordnet:

from pattern.en import wordnet
wordnet.synsets("train")

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    wordnet.synsets("train")
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/__init__.py", line 95, in synsets
    return [Synset(s.synset) for i, s in enumerate(w)]
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 316, in __getitem__
    return self.getSenses()[index]
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 242, in getSenses
    self._senses = tuple(map(getSense, self._synsetOffsets))
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 241, in getSense
    return getSynset(pos, offset)[form]
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 1090, in getSynset
    return _dictionaryFor(pos).getSynset(offset)
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 827, in getSynset
    return _entityCache.get((pos, offset), loader)
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 1308, in get
    value = loadfn and loadfn()
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 826, in loader
    return Synset(pos, offset, _lineAt(dataFile, offset))
  File "/usr/local/lib/python2.7/dist-packages/pattern/en/wordnet/pywordnet/wordnet.py", line 366, in __init__
    (self._senseTuples, remainder) = _partition(tokens[4:], 2, string.atoi(tokens[3], 16))
  File "/usr/lib/python2.7/string.py", line 403, in atoi
    return _int(s, base)
ValueError: invalid literal for int() with base 16: '@'

this is happening in interactive use in IDLE.

When running 06-wordnet.py from the location of the unzipped download I get an error at a later moment:

Traceback (most recent call last):
  File "/home/christiaan/Downloads/pattern-master/examples/03-en/06-wordnet.py", line 46, in <module>
    s.append((a.similarity(b), word))
  File "../../pattern/text/en/wordnet/__init__.py", line 272, in similarity
    lin = 2.0 * log(lcs(self, synset).ic) / (log(self.ic * synset.ic) or 1)
ValueError: math domain error

The error is raised by the class method Synset.similarity, probably when it has to take the log of a non-positive number while working with the synsets for the words 'cat' and 'spaghetti'. Unfortunately for me this is exactly the function I'm interested in. A temporary workaround for me is to place the pattern modules on my project's path and wrap the call in a try ... except block to circumvent the ValueError, but it looks like something is broken in the wordnet implementation, although the first issue might just be a problem with my system setup / messy clips-pattern version updates.
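A minimal sketch of that try ... except workaround (hedged: it only papers over the ValueError, it does not fix the underlying data problem):

from pattern.en import wordnet

a = wordnet.synsets("cat")[0]
b = wordnet.synsets("spaghetti")[0]
try:
    score = a.similarity(b)
except ValueError:   # "math domain error" from Synset.similarity
    score = 0.0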

Server module not working

Traceback (most recent call last):
  File "api.py", line 3, in <module>
    from pattern.server import App
ImportError: No module named server

match groups in search syntax

I'd love for the search syntax to have match groups just like regex. In my preference the ? symbol and () would have the same meanings as in regex syntax, so for example if I did:

search('There be DT (JJ? NN+)', s)

then I would get a match against "There is a red ball", and match item 0 would be "red ball", and it would also match "There is a ball" and match item 0 would be "ball".

However I realise that if lots of people are relying on parentheses to mean optional then it wouldn't be easy to change that.

Failing that, how about:

search('There be DT <(JJ) NN+>', s)

There are more semantically rich possibilities, e.g.

<group1>(NN+)</group1>

However, I think that might be a little verbose and get in the way of analyzing the search syntax, which I think is better off as terse and as close as possible to regex (with which a lot of people are familiar).

Many thanks in advance
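For reference, a hedged sketch using the curly-brace match groups that later Pattern releases document for pattern.search (assuming that syntax is available here; it is not the regex-style grouping requested above):

from pattern.en import parsetree
from pattern.search import search

t = parsetree("There is a red ball", lemmata=True)
for match in search("There be DT {JJ? NN+}", t):
    print match.group(1)   # the Word objects captured by the {...} group, i.e. "red ball"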

Spelling function for pattern.nl

Is a spelling function, comparable to the one in pattern.en, also doable for Dutch and other languages?
What steps would it take?

Both pip and easy_install are broken for Pattern

easy_install with pattern never worked for me (unsure why):

(saffron)19:33:49 ~/code/saffron$ easy_install pattern
Searching for pattern
Reading https://pypi.python.org/simple/pattern/
Reading http://www.clips.ua.ac.be/pages/pattern
error: Connection reset by peer

pip install pattern used to work for me, but after upgrading pip to 1.5.2 I get:

$ pip install pattern
Downloading/unpacking pattern
  Could not find any downloads that satisfy the requirement pattern
  Some externally hosted files were ignored (use --allow-external pattern to allow).
Cleaning up...
No distributions at all found for pattern
Storing debug log for failure in /home/b/.pip/pip.log

This works:

pip install --allow-external pattern --allow-unverified pattern pattern

Problem with slash in pattern.es

Hi!
Playing with the Spanish parser, I have found an error when trying to parse a string containing "/O". More specifically, the string was "La I/O de la CPU colapsa el sistema". This only happens when the string contains "/O" ("/o" works fine).

I have been looking at the code in the parse method, and slashes are not encoded with &slash;, which is the cause of the problem.

Greets!

Support for session?

Hi,

I'm wondering if there is existing or planned support for sessions (i.e. reusing connections, support for cookies). When crawling sites that require logins, session support would really help. In the current web module I couldn't find a way to pass in a new urllib2 opener with cookielib support (https://github.com/clips/pattern/blob/master/pattern/web/__init__.py#L361); however, it is mentioned in the doc of the HTTP301Redirect exception.

Is there an existing, proper way to add support for session and possibly persist connections?

patterns.vector.corpus.lsa: word appearing twice in LSA

Hello,
I am new to LSA, but it strikes me as odd that a word would appear twice in the LSA output. I reduced a corpus of 14,000+ unique sparse features to 5 dimensions.

Is this possible for LSA? My understanding was that it performs dimensionality reduction to unique features. Currently my only thought is that perhaps I did not normalize my input to lower case, and thus words with different capitalizations appear twice in the output.
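A hedged sketch of that lower-casing normalization, reusing the Document/Corpus calls that appear elsewhere in these issues (the strings and names are illustrative):

from pattern.vector import Document, Corpus

texts = ["Sweet potato burger", "sweet potato fries"]
docs = [Document(t.lower(), name=t) for t in texts]   # normalize case before indexing
corpus = Corpus(documents=docs)
corpus.reduce()   # LSA; "Sweet" and "sweet" now collapse into one feature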

pip install broken

easy_install still works, but pip install pattern produces:

root@localhost:/home/anywhere# pip install pattern
Downloading/unpacking pattern
  Downloading pattern-1.8.zip (12.6Mb): 12.6Mb downloaded
  Running setup.py egg_info for package pattern
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
    IOError: [Errno 2] No such file or directory: '/home/anywhere/build/pattern/   setup.py'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 14, in <module>

IOError: [Errno 2] No such file or directory: '/home/anywhere/build/pattern/setu   p.py'

----------------------------------------
Command python setup.py egg_info failed with error code 1
Storing complete log in /root/.pip/pip.log

looks like the installer is looking for setup.py in the wrong folder...

root@localhost:/home/anywhere# ls /home/anywhere/build/pattern
__MACOSX  pattern  pip-egg-info
root@localhost:/home/anywhere# ls /home/anywhere/build/pattern/pattern/
en        graph        LICENSE.txt  nl          search.py  table.py  web
examples  __init__.py  metrics.py   README.txt  setup.py   vector
root@localhost:/home/anywhere#

CLIPS.pattern Corpus Reduction with L1

What is the best way to substitute L1 normalization for the default L2 in Corpus reduction for LSA purposes?

My assumption is:
setting ord in numpy.linalg.norm to 1 == L1
& setting ord in numpy.linalg.norm to -1 == negative L1

But I do not want to assume too much.
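For reference, a quick check of the numpy side of that assumption (this only illustrates numpy.linalg.norm; it says nothing about how Pattern's Corpus.reduce() uses it internally):

import numpy as np

v = np.array([3.0, -4.0])
print(np.linalg.norm(v, ord=1))    # L1 norm: |3| + |-4| = 7.0
print(np.linalg.norm(v, ord=2))    # L2 norm (the default): 5.0
print(np.linalg.norm(v, ord=-1))   # not a negated L1: 1 / (1/3 + 1/4) ≈ 1.714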

gmail count returns "ValueError: invalid literal for int() with base 10: '[NONEXISTENT] Unknown Mailbox:"

I tried to count the messages in some of my gmail folders. This seems to work for e.g. the folder "inbox"

>>> inbox=gmail.folders.get("inbox")
>>> inbox.count
147

However when I do this for a folder with a more complex name "[google ]/alle nachrichten" counting fails with an error

>>> all=gmail.folders.get("[google ]/alle nachrichten")
>>> all.count
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pattern/web/imap/__init__.py", line 212, in count
    return len(self)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pattern/web/imap/__init__.py", line 280, in __len__
    return int(response[0])
ValueError: invalid literal for int() with base 10: '[NONEXISTENT] Unknown Mailbox: [google ]/alle nachrichten (now in authenticated state) (Failure)'

It seems to me that there is a problem with folders which have spaces, brackets or slashes in their path. Can this be fixed?
As a workaround I found the following solution:

>>> len(all.search("", field=SUBJECT))
9496

Python 3 support

Pattern should start supporting Python 3. Looking at the amount of code, it is a non-trivial task and any help is much appreciated.

model.vector broken since Sep 7 commit

Hi,

It seems model.vector is not populated since commit 52bbb5d.

You can try /examples/05-vector/03-lsa.py.

LSA fails since model.vector is empty.

I don't know enough python to fix this :)

Thanks!
Damon

Feature request: access to original indexes of tokens in a text

I've started working on a fork because I have a project that requires this and that I just want to make progress on; if anyone wants to check it out, see my fork. I'll be improving it slowly; I just hacked it together to get it working.

I have some operations where I want to know where the original wordform was in a text, and this has to cover cases where things are lemmatized or just tokenized. The reasons I can't just do a find-and-replace are pretty clear: multiple identical wordforms would mess things up, and while an algorithm could obviously be created to retrieve the substring indexes despite this, it seemed better to build it in somewhere earlier in the process.

... But, being someone not as familiar with the internals of the pattern code, I thought I'd open a feature request here just so that y'all are aware that it's something that is desired, and you can maybe take a better stab at implementing it yourselves if you come to it. I'll keep my fork around though -- if it gets to a point where it could just as well be contributed, well, great. ;)

unable to pickle the output of pattern.en.split()

When trying to pickle a parse tree (generated with pattern.en.parse() and then pattern.en.split()), I get: "RuntimeError: maximum recursion depth exceeded".
Still, a pickle file was indeed created, but it couldn't be opened due to an EOF error. So apparently the data structure of Sentences() does not lend itself to pickling.
Just pickling the output of pattern.en.parse() worked fine, though.
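A minimal reproduction sketch of what is described above (assuming Python 2, as elsewhere in these reports):

import pickle
from pattern.en import parse, split

s = parse("The cat sat on the mat.")
pickle.dumps(s)          # the tagged string from parse() pickles fine
pickle.dumps(split(s))   # the Sentences object hits the recursion limit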

'configure' is parsed as NNP and NN (VB* is expected)

>>> from pattern.en import parse
>>> s = "Configure the computer."
>>> s = parse(s)
>>> print s
Configure/NNP/B-NP/O the/DT/I-NP/O computer/NN/I-NP/O ././O/O
>>> s = "Please configure the computer."
>>> s = parse(s)
>>> print s
Please/VB/B-VP/O configure/NN/B-NP/O the/DT/I-NP/O computer/NN/I-NP/O ././O/O

French sentiment analysis; issues with apostrophes

Hi,

I was trying sentiment analysis with following sentences:

import os, sys; #sys.path.insert(0, os.path.join("..", ".."))
from pattern.fr import sentiment as sentiment_fr, polarity as polarity_fr
print sentiment_fr("Moi j'aime bien les tests positifs!")

print sentiment_fr("Moi j' aime bien les tests positifs!")

I am a bit puzzled that the scores are not equal, and the first one really seems wrong.
I am not at all a Python programmer, but if I look in the fr module, it seems stuff like "j'" should be replaced by "j' " (extra space added).

Eric
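A hedged sketch of that pre-processing idea (this is the reporter's suggestion, not a confirmed fix in pattern.fr):

from pattern.fr import sentiment

s = u"Moi j'aime bien les tests positifs!"
print sentiment(s.replace(u"j'", u"j' "))   # split the clitic so "aime" can match the lexicon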

Unmusical warning

Not very serious, perhaps: I work on closed-captioning files from television, and they often contain a musical note, especially in commercials -- that's UTF-8 E2 99 AA. They trigger this loud warning:

/usr/lib/python2.7/dist-packages/pattern/text/__init__.py:979: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal and tokens[j] in ("'", """, u"?", u"?", "...", ".", "!", "?", ")", EOS):

Now that I understand what's triggering the warning, I'm not really bothered by it, so just FYI -- I don't know if there's an elegant way to handle unexpected unicode characters.

Cheers,
Dave

Extended support for static graphs and exporting graphs

In pattern.graph, I'm frequently using the node.fixed=True attribute to get more control of how a graph looks visually.
For example, http://theater.ua.ac.be/bih/october.

It would be nice to have some sort of export functionality for this, so the graphs would look nicer when printed. For instance, export to SVG or PDF.

On a related note, when I fix all of the nodes of a graph, it is no longer necessary to animate it, so I set frames=1. However, this does not work out well in Mobile Safari on iOS devices. The graph is rendered, but the nodes and edges are not drawn (only the labels). With an animated graph, this problem does not occur (although animation is very slow with heavily interconnected graphs).
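A minimal sketch of the fixed-node setup being described (the pattern.graph calls below are assumed from its documented API; the SVG/PDF export is exactly the part being requested and does not exist here):

from pattern.graph import Graph

g = Graph()
g.add_edge("pattern", "web")
g.add_edge("pattern", "vector")
for node in g.nodes:
    node.fixed = True          # pin every node so the layout is static
g.export("static_graph")       # writes an HTML/canvas visualization folder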

I can't use multiple separate taxonomies

from pattern.search import search, Taxonomy, WordNetClassifier, Classifier
my_taxonomy = Taxonomy()
#Make sure the classifiers data structure is not shared:
original_classifiers = my_taxonomy.classifiers
my_taxonomy.classifiers = []
my_taxonomy.classifiers.extend(original_classifiers)
my_taxonomy.classifiers.append(WordNetClassifier())
print search('RESPIRATORY_DISEASE', "I got a flu shot", taxonomy=my_taxonomy)
#Prints `[Match(words=[Word('flu')])]`
another_taxonomy = Taxonomy()
another_taxonomy.append('shot', type='test')
print search('TEST', "I got a flu shot", taxonomy=another_taxonomy)
#Prints `[Match(words=[Word('shot')])]`
print search('TEST', "I got a flu shot", taxonomy=my_taxonomy)
#Prints `[Match(words=[Word('shot')])]`
#But shouldn't this only be tagged for the other taxonomy?
print search('RESPIRATORY_DISEASE', "I got a flu shot", taxonomy=another_taxonomy)
#Prints `[Match(words=[Word('flu')])]`
#My new taxonomy includes the wordnet classifier from the other taxonomy instance.

Noun detection in pattern.de

I was wondering why pattern doesn't make use of more trivial heuristics for part-of-speech tagging. For example, noun detection in German is actually really easy (most of the time), because all nouns start with an uppercase letter. That means: if you know for sure that the text you are parsing is properly capitalized, then a word with a lowercase first letter cannot be a noun. Of course it doesn't work in the other direction; there would still be false positives (the first word of a sentence starts with an uppercase letter, but that doesn't mean it's a noun).

Since the parser still needs to work for all-lowercase text, I'd suggest providing an extra argument (e.g. casesensitive=True) letting the parser know what kind of text to expect.

Of course I know that 'NN' is the default part-of-speech tag, but is that really always a good idea, given that there are easy ways to say that a given word is definitely not a noun? All of this is most likely only applicable to German, I guess.

What am I missing here?
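A hedged sketch of the heuristic being proposed (illustrative only; this is not part of pattern.de):

def could_be_noun(word, sentence_initial=False):
    # For properly capitalized German text: a lowercase word that is not
    # sentence-initial cannot be a noun, so 'NN' can be ruled out for it.
    if word[:1].islower() and not sentence_initial:
        return False
    return True   # uppercase (or sentence-initial) words may or may not be nouns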

Corpus Reduction Error

When I reduce a Corpus built from a list of Documents I get:

In [246]: docs[0] # list of Documents(string, name='string name'
Out[246]: Document(id='NvOtJY1-1286', name=u'Weyes Blood')

In [247]:corpus_tags = Corpus(documents=docs)  

In [248]: corpus_tags.reduce()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-248-70f66f66e3dd> in <module>()
----> 1 ctags.reduce()

/media/FreeAgent_/Envs/scipy/local/lib/python2.7/site-packages/pattern/vector/__init__.py in latent_semantic_analysis(self, dimensions)

/media/FreeAgent_/Envs/scipy/local/lib/python2.7/site-packages/pattern/vector/__init__.py in __init__(self, corpus, k)

/media/FreeAgent_/Envs/scipy/local/lib/python2.7/site-packages/pattern/vector/__init__.py in __call__(self, vector)

AttributeError: 'Document' object has no attribute 'iteritems'

IndexError: list index out of range

When I use a taxonomy search as in the below demo code, I get a stack trace and IndexError exception

from pattern.en     import parsetree
from pattern.search import search, Pattern, Constraint, Taxonomy, WordNetClassifier

wn = Taxonomy()
wn.classifiers.append(WordNetClassifier())

p = Pattern()
p.sequence.append(Constraint.fromstring("{COLOR?}", taxonomy=wn))

pt = parsetree('the new iphone is availabe in silver, black, gold and white', relations=True, lemmata=True)
print p.search(pt)

Traceback (most recent call last):
  File "bug.py", line 11, in <module>
    print p.search(pt)
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 746, in search
    a=[]; [a.extend(self.search(s)) for s in sentence]; return a
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 750, in search
    m = self.match(sentence, _v=v)
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 770, in match
    m = self._match(sequence, sentence, start)
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 838, in _match
    if i < len(sequence) and constraint.match(w):
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 620, in match
    for p in self.taxonomy.parents(s, recursive=True):
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 331, in parents
    return unique(dfs(self._normalize(term), recursive, {}, **kwargs))
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 327, in dfs
    a.extend(classifier.parents(term, **kwargs) or [])
  File "/usr/local/lib/python2.7/dist-packages/pattern/text/search.py", line 415, in _parents
    try: return [w.senses[0] for w in self.wordnet.synsets(word, pos)[0].hypernyms()]
IndexError: list index out of range
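A hedged illustration of where the IndexError seems to come from: the WordNet classifier indexes synsets(word, pos)[0] without guarding against an empty result, and at least one word in the sample sentence (e.g. the misspelling "availabe") has no synsets at all.

from pattern.en import wordnet

print wordnet.synsets("availabe")      # -> [] (no synsets for the misspelled word)
print wordnet.synsets("availabe")[0]   # -> IndexError: list index out of range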

Oddities while parsing numbers

I'm trying to parse case counts from promed articles. In some articles they are reported in a semi-structured way. e.g.

CASES: 3
DEATHS: 1

However, the counts are being parsed in unexpected ways preventing me from capturing the numeric portion.

1 is parsed correctly:

print parsetree("CASES: 1")

[Sentence('CASES/NNS/B-NP/O :/:/O/O 1/CD/O/O')]

However, some numbers are tagged as IN

print parsetree("CASES: 2")

[Sentence('CASES/NNS/B-NP/O :/:/O/O 2/IN/B-PP/O')]

This case is very strange: ": 3" gets treated as a single word. Is it being parsed as an emoticon?

print parsetree("CASES: 3")

[Sentence('CASES/NNS/B-NP/O :3/:/O/O')]

I haven't had any problems with two digit numbers.

print parsetree("CASES: 22")

[Sentence('CASES/NNS/B-NP/O :/:/O/O 22/CD/O/O')]
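A hedged workaround sketch (plain Python, not a Pattern feature): pull the numeric portion out with a regular expression before or instead of parsing, sidestepping the tagger for these semi-structured lines.

import re

for line in ["CASES: 3", "DEATHS: 1"]:
    m = re.match(r"(\w+):\s*(\d+)", line)
    if m:
        print m.group(1), int(m.group(2))   # e.g. CASES 3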

Yahoo search query creation

Hi,

I just tried to use the Yahoo SearchEngine class. I signed up for the Yahoo BOSS API. Everything works fine, but consider this example:

yahoo = Yahoo(license=(KEY, SECRET))
yahoo.search('Yahoo Reuters Jobs')

the query that gets generated looks like this:

Yahoo_Reuters_Jobs

I would expect that the query should look something like this:
Yahoo%20Reuters%20Jobs

So I urlquote my queries before passing it to the search method:

from urllib import quote

yahoo.search(quote('Yahoo Reuters Jobs'))

which works as I expected it.

Here's the implementation of the query construction:

url = URL(url, method=GET, query={
                 "q": oauth.normalize(query.replace(" ", "_")),
             "start": 1 + (start-1) * count,
             "count": min(count, type==IMAGE and 35 or 50),
            "format": "json"
        })

Any thoughts on that or am I missing something?

allowed input to tokenizer?

In [13]: parser.parse('==relate ** what is this?', lemmata=True, collapse=False)
Out[13]: 
[[[u'==relate', 'VB', 'B-VP', 'O', u'==relate'],
  [u'**', 'NN', 'B-NP', 'O', u'**'],
  [u'what', 'WP', 'O', 'O', u'what'],
  [u'is', 'VBZ', 'B-VP', 'O', 'be'],
  [u'this', 'DT', 'O', 'O', u'this'],
  [u'?', '.', 'O', 'O', u'?']]]

Are these = and * characters in tokens (and lemmas) the intended behaviour? I was surprised to find them in my output.

How much noise does the tokenizer in en.parser tolerate? Should I include some extra pre-processing before calling the tokenizer?

optimize Corpus.df()

It is much faster (200x) to calculate df in one go for all documents; however, this has a drawback if you only need df for a few documents. There should be an option to choose between the two.

Calling download() twice on a Result object results in an error

Example run

rss = Newsfeed().search('http://feeds.feedburner.com/Techcrunch')
dld = rss[4].download()
dld = rss[4].download()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pattern/web/__init__.py", line 846, in download
    return URL(self.url).download(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pattern/web/__init__.py", line 391, in download
    return cache.get(id, unicode=False)
TypeError: get() takes no keyword arguments

sentiment does not return value between -1 and 1

Hello,

The doc states that sentiment() returns a polarity value between -1 and 1, but this does not appear to be the case. E.g. the code below gives a value even lower than -1. Why is this?

from pattern.nl import sentiment
sentiment("ik vind het heel vervelend als dat gebeurt")
(-1.0133333333333332, -1.0133333333333332)

Sentiment.synset fails to find synsets with known sentiment

Demo of problem:

    from pattern.en import sentiment
    from pattern.en.wordnet import synsets

    ss = synsets("depress", 'VB')[0]

    print sentiment(ss).assessments
    #prints [(u'depress', 0.0, 0.0, None)], which isn't a shock given:

    print sentiment.synset(ss)
    #prints (0.0, 0.0)

However, line 622 en-sentiment.xml is:

    <word form="depress" wordnet_id="v-1814396" pos="VB" sense="lower someone's spirits" polarity="-0.2" subjectivity="0.1" intensity="1.0" reliability="0.9" />

and as such:

    print sentiment._synsets['v-' + str(ss.id)]
    #prints [-0.2, 0.1, 1.0]

The cause:

It looks like this is an incorrect key miss in _synsets (and thus in __call__), secondary to the zfill(8) padding of the synset id.

It looks like you've got a number of key lengths that are possible:

print set(map(len, sentiment._synsets))
#prints {3, 7, 8, 9, 10}

Method behind xx-sentiment.xml not documented

I am trying to sort out where the text/xx/xx-sentiment.xml data come from. The unit tests include some proof of utility but I cannot find any documentation of the methodology used. After a lot of searching my best guess is that these are from De Smedt, Tom, and Walter Daelemans. "Vreselijk mooi!" (terribly beautiful): A Subjectivity Lexicon for Dutch Adjectives. LREC. 2012.

SentiWordNet is clearly called out in the docs. It would be nice to source the custom lexicon that ships with the package.

Twitter 403 Error

A fresh pip install pattern with my own Twitter API keys gave me a 403 Forbidden response every time. This was solved by changing line 1416 of pattern/web/__init__.py:

TWITTER = "http://api.twitter.com/1.1/"

to use HTTPS instead:

TWITTER = "https://api.twitter.com/1.1/"

Any reason why it's not by default? I'd recommend the change, given that (to my understanding) Twitter expects HTTPS.

hash() in Twitter example

In the example twitter.py, line 31, the code hashes the tweet id and description. This seems to lead to multiple rows with the same ID and tweet but a different hash (if you run the .py more than once). Is this expected behaviour? In my opinion, you only want one row per tweet.

How should I handle cross-chunk matching with taxonomies?

I am trying to use a taxonomy with some long terms in it, such as "fungal lung infectious disease". The chunker used by parsetree() puts "fungal lung" and "infectious disease" into separate chunks, preventing the full term in the taxonomy from matching the text. One workaround I can imagine would be to split the longer taxonomy terms into multiple taxonomies, but then I would need to use queries like search("DISEASE_BEGINNING DISEASE_INSIDE*", text). Is there a better way to handle this?
