
wordhoard's Introduction

Overview


Primary Use Case

Textual analysis is a broad term for various research methodologies used to qualitatively describe, interpret and understand text data. These methodologies are mainly used in academic research to analyze content related to media and communication studies, popular culture, sociology, and philosophy. Textual analysis allows these researchers to quickly obtain relevant insights from unstructured data. All types of information can be gleaned from textual data, especially from social media posts or news articles. Some of this information includes the overall concept of the subtext, symbolism within the text, assumptions being made and potential relative value to a subject (e.g. data science). In some cases it is possible to deduce the relative historical and cultural context of a body of text using analysis techniques coupled with knowledge from different disciplines, like linguistics and semiotics.

Word frequency is the technique used in textual analysis to measure the frequency of a specific word or word grouping within unstructured data. Measuring the number of word occurrences in a corpus allows a researcher to garner interesting insights about the text. A subset of word frequency analysis is the correlation between a given word and that word's relationship to antonyms and synonyms within the specific corpus being analyzed. Knowing these relationships is critical to improving word frequencies and topic modeling.

WordHoard was designed to assist researchers performing textual analysis to build more comprehensive lists of antonyms, synonyms, hypernyms, hyponyms and homophones.

Installation

Install the distribution via pip:

pip3 install wordhoard

General Package Utilization

Please reference the WordHoard Documentation for package usage guidance and parameters.

Sources

This package is currently designed to query these online sources for antonyms, synonyms, hypernyms, hyponyms and definitions:

  1. classicthesaurus.com
  2. collinsdictionary.com
  3. merriam-webster.com
  4. synonym.com
  5. thesaurus.com
  6. wordhippo.com
  7. wordnet.princeton.edu

Dependencies

This package has these core dependencies:

  1. backoff
  2. beautifulsoup4
  3. deckar01-ratelimit
  4. deepl
  5. lxml
  6. requests
  7. urllib3

Additional details on this package's dependencies can be found here.

Development Roadmap

If you would like to contribute to the WordHoard project please read the contributing guidelines.

Items currently under development:

  • Expanding the list of hypernyms, hyponyms and homophones
  • Adding part-of-speech filters in queries

Issues

This repository is actively maintained. Feel free to open any issues related to bugs, coding errors, broken links or enhancements.

You can also contact me at John Bumgarner with any issues or enhancement requests.

Sponsorship

If you would like to contribute financially to the development and maintenance of the WordHoard project please read the sponsorship information.

License

The MIT License (MIT). Please see License File for more information.

Author

Copyright (c) 2020 John Bumgarner

wordhoard's People

Contributors

gorluxor, johnbumgarner


wordhoard's Issues

find_definitions can return antonyms

Find definitions will sometimes return antonyms because the antonym cache is checked by mistake.

    Returns
    ----------
    :return: list of definitions
    :rtype: list
    """
    valid_word = self._validate_word()
    if valid_word:
        check_cache = caching.cache_antonyms(self._word)
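The effect of routing definition lookups through the antonym cache can be illustrated with a minimal sketch. The cache structures and helper names here are hypothetical stand-ins, not wordhoard's actual internals:

```python
# Minimal illustration of the cache-routing bug described above.
# The cache dictionaries and function names are hypothetical.
antonym_cache = {"mother": ["father"]}
definition_cache = {"mother": ["a female parent"]}

def find_definitions_buggy(word):
    # Bug: definitions are looked up in the antonym cache,
    # so a cached word returns its antonyms.
    return antonym_cache.get(word)

def find_definitions_fixed(word):
    # Fix: consult the definition cache instead.
    return definition_cache.get(word)
```

Once a word's antonyms have been cached, the buggy path returns them in place of definitions, which matches the behavior reported in the issues below.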

TypeError: object of type 'NoneType' has no len()

Hi John,

Thanks for putting wordhoard together! I keep running into the following error, though not all the time:
TypeError: object of type 'NoneType' has no len()
(details below)

What I've been doing is feeding wordhoard a file with a list of unique words, in order to find antonyms, and then writing the dictionary object out to a second file.

Here's where it works correctly:

antonyms were found for the word: bassoon [Small note -- should be 'No' antonyms I think]
Please verify that the word is spelled correctly.
None

antonyms were found for the word: bassist [Again, I think it should be 'No' antonyms]
Please verify that the word is spelled correctly.

I tried writing code like this:

        # Check if the result is not None before processing
        if wordhoard_antonym_results is not None:
            with open(output_filename, 'a') as outfile:
                outfile.write(json.dumps(wordhoard_antonym_results) + '\n')

But that didn't work, I suspect because the failure is happening in code earlier than when my code runs.
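The traceback shows request_html.py calling sys.exit(1) when retries are exhausted, which raises SystemExit before any None check in calling code can run. One workaround, sketched here with a stand-in function rather than wordhoard's real API, is to catch SystemExit around each lookup:

```python
import sys

def find_antonyms_stub(word):
    # Stand-in for a wordhoard lookup; the real code calls
    # sys.exit(1) from request_html.py when retries are exhausted.
    sys.exit(1)

def safe_find_antonyms(word):
    # Wrap the call so one failed lookup does not kill the whole run.
    try:
        return find_antonyms_stub(word)
    except SystemExit:
        return None
```

Catching SystemExit is unusual but is the only way to survive a library that exits the interpreter on a failed request; a cleaner fix would be for the library to raise a regular exception instead.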

At any rate, here is the error, which for some reason is now occurring all the time. Not sure if this is because my IP address is being blocked, or what:

mown
INFO:wordhoard.antonyms:Thesaurus.com had no antonym reference for the word mown
ERROR:wordhoard.utilities.request_html:A RequestException has occurred when requesting https://www.wordhippo.com/what-is/the-opposite-of/mown.html
ERROR:wordhoard.utilities.request_html: File "/usr/local/lib/python3.9/dist-packages/wordhoard/utilities/request_html.py", line 102, in get_website_html
response = self._requests_retry_session().get(self._url_to_scrape,
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

A RequestException has occurred.
Please review the WordHoard logs for additional information.
Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 878, in urlopen
return self.urlopen(
[Previous line repeated 2 more times]
File "/usr/local/lib/python3.9/dist-packages/urllib3/connectionpool.py", line 868, in urlopen
retries = retries.increment(method, url, response=response, _pool=self)
File "/usr/local/lib/python3.9/dist-packages/urllib3/util/retry.py", line 592, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.wordhippo.com', port=443): Max retries exceeded with url: /what-is/the-opposite-of/mown.html (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/wordhoard/utilities/request_html.py", line 102, in get_website_html
response = self._requests_retry_session().get(self._url_to_scrape,
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 600, in get
return self.request("GET", url, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/requests/adapters.py", line 556, in send
raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='www.wordhippo.com', port=443): Max retries exceeded with url: /what-is/the-opposite-of/mown.html (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 16, in <cell line: 11>
wordhoard_antonym_results = Antonyms(search_string=line.strip(), output_format='dictionary').find_antonyms()
File "/usr/local/lib/python3.9/dist-packages/backoff/_sync.py", line 105, in retry
ret = target(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ratelimit/decorators.py", line 147, in wrapper
return func(*args, **kargs)
File "/usr/local/lib/python3.9/dist-packages/wordhoard/antonyms.py", line 227, in find_antonyms
query_results = self._run_query_tasks_in_parallel()
File "/usr/local/lib/python3.9/dist-packages/wordhoard/antonyms.py", line 180, in _run_query_tasks_in_parallel
finished_tasks.append(finished_task.result())
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/usr/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/usr/lib/python3.9/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.9/dist-packages/wordhoard/antonyms.py", line 353, in _query_wordhippo
response = self._request_http_response(f'https://www.wordhippo.com/what-is/the-opposite-of/{self._word}.html')
File "/usr/local/lib/python3.9/dist-packages/wordhoard/antonyms.py", line 151, in _request_http_response
response = Query(url).get_website_html()
File "/usr/local/lib/python3.9/dist-packages/wordhoard/utilities/request_html.py", line 254, in get_website_html
sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/dist-packages/IPython/core/ultratb.py", line 1101, in get_records
return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
File "/usr/local/lib/python3.9/dist-packages/IPython/core/ultratb.py", line 248, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/IPython/core/ultratb.py", line 281, in _fixed_getinnerframes
records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
File "/usr/lib/python3.9/inspect.py", line 1543, in getinnerframes
frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
AttributeError: 'tuple' object has no attribute 'tb_frame'

MaxRetryError Traceback (most recent call last)
/usr/local/lib/python3.9/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
488 if not chunked:
--> 489 resp = conn.urlopen(
490 method=request.method,

30 frames
MaxRetryError: HTTPSConnectionPool(host='www.wordhippo.com', port=443): Max retries exceeded with url: /what-is/the-opposite-of/mown.html (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

RetryError Traceback (most recent call last)
RetryError: HTTPSConnectionPool(host='www.wordhippo.com', port=443): Max retries exceeded with url: /what-is/the-opposite-of/mown.html (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

SystemExit Traceback (most recent call last)
[... skipping hidden 1 frame]

SystemExit: 1

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
[... skipping hidden 1 frame]

/usr/local/lib/python3.9/dist-packages/IPython/core/ultratb.py in find_recursion(etype, value, records)
380 # first frame (from in to out) that looks different.
381 if not is_recursion_error(etype, value, records):
--> 382 return len(records), 0
383
384 # Select filename, lineno, func_name to track frames with

TypeError: object of type 'NoneType' has no len()

Getting the definition of a word returns its antonyms instead

from wordhoard import Definitions

word = "mother"
definition = Definitions(search_string=word).find_definitions()
print(definition)

OUTPUT: ['father']

EXPECTED OUTPUT: ['a female parent']

wordhoard_error.yaml

2022-02-25 10:14:50:wordhoard.utilities.basic_soup:ERROR: Response Status Code: 403
2022-02-25 10:14:50:wordhoard.utilities.basic_soup:ERROR: HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden.
2022-02-25 10:14:50:wordhoard.utilities.basic_soup:ERROR: Associated URL: https://www.collinsdictionary.com/dictionary/english-thesaurus/mother
2022-02-25 10:20:03:wordhoard.utilities.basic_soup:ERROR: Response Status Code: 403
2022-02-25 10:20:03:wordhoard.utilities.basic_soup:ERROR: HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden.
2022-02-25 10:20:03:wordhoard.utilities.basic_soup:ERROR: Associated URL: https://www.collinsdictionary.com/dictionary/english-thesaurus/mother
2022-02-25 10:20:53:wordhoard.utilities.basic_soup:ERROR: Response Status Code: 403
2022-02-25 10:20:53:wordhoard.utilities.basic_soup:ERROR: HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden.
2022-02-25 10:20:53:wordhoard.utilities.basic_soup:ERROR: Associated URL: https://www.collinsdictionary.com/dictionary/english-thesaurus/mother
2022-02-25 10:22:55:wordhoard.utilities.basic_soup:ERROR: Response Status Code: 403
2022-02-25 10:22:55:wordhoard.utilities.basic_soup:ERROR: HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden.
2022-02-25 10:22:55:wordhoard.utilities.basic_soup:ERROR: Associated URL: https://www.collinsdictionary.com/dictionary/english-thesaurus/mother
2022-02-25 10:23:00:wordhoard.utilities.basic_soup:ERROR: Response Status Code: 403
2022-02-25 10:23:00:wordhoard.utilities.basic_soup:ERROR: HTTP 403 is an HTTP status code meaning access to the requested resource is forbidden.

Questionable antonyms for the word mother

When using WordHoard I noticed questionable antonyms, coming from an unknown source, for the word mother:

from wordhoard import Antonyms

antonym = Antonyms('mother')
antonym_results = antonym.find_antonyms()
print(antonym_results)
['descendant', 'disassemble', 'dissuade', 'effect', 'end', 'father', 'male parent', 'result']

This output produces some questionable antonyms that have no relationship to the word mother.

Overwhelming number of synonyms

Hello,

Thank you for publishing this package. It is a highly beneficial resource.

When searching for synonyms, I noticed an unexpected behavior (bug).
For the word "good", the function find_synonyms() returns a list of 104 unique words. Among them are words that are not synonyms of "good", for example "bully", "cracking", "bad", "boss", "hard", "spanking", and a couple of additional words that I am not sure about. The behavior repeats with other words as well.

I am unsure if there is a specific website that enriches the synonyms with such words or if it is a bug in the crawling process. A possible solution may be to allow the selection of the websites on which the crawling process takes place.

I would highly recommend this option since I am unsure about the legitimacy of the other sources except for "merriam-webster" and "wordnet".

To date, I have decided to take the synonyms directly from "wordnet", as I cannot guarantee they are actually synonyms.
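Until per-source selection is available, one workaround in the spirit of this report is to post-filter the returned list against a trusted vocabulary (e.g. synonyms pulled separately from WordNet). The function and sample data below are hypothetical, not part of wordhoard:

```python
def filter_synonyms(candidates, trusted):
    """Keep only candidate synonyms that a trusted source (e.g. WordNet)
    also lists. Both inputs are plain iterables of strings."""
    trusted_set = {w.lower() for w in trusted}
    return sorted(w for w in candidates if w.lower() in trusted_set)

# Hypothetical data: candidates from find_synonyms(), trusted from WordNet.
candidates = ["fine", "bully", "cracking", "bad", "respectable"]
wordnet_synonyms = ["fine", "respectable", "sound"]
print(filter_synonyms(candidates, wordnet_synonyms))  # ['fine', 'respectable']
```

This trades recall for precision: anything the trusted source does not know is dropped, which matches the reporter's decision to rely on WordNet alone.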

Ratelimit module missing

Version 1.5.0 uses deckar01-ratelimit. In the local development environment everything works fine, but when the recent version was pushed to PyPI, an error was raised when that version was used locally.

That error is:

from ratelimit import limits, RateLimitException
ModuleNotFoundError: No module named 'ratelimit'

An issue was opened with the owner of deckar01-ratelimit to determine how to solve this problem.

There is no ETA for a fix at the moment, so pin the previous release (pip3 install wordhoard==1.4.9) until one is issued.

Why not just use deep-translator?

Hi, I stumbled across this library by chance and I can't help but notice that the code (or at least a good part of it) under wordhoard/utilities/deep_translator.py is clearly copy-pasted from the deep-translator library. Even the name of the file is the name of the library, and most of the code is exactly the same as in deep-translator, down to variable names. Why not just use the library? It is open source anyway.

Like, I understand that you added some extra stuff to it, but it would have been cool to just use the library as it is (as a dependency) or for example to give credit in your README or docs.

I think as open source developers, it would be really great if we support each other and the bare minimum in this case would be just to mention or give credits to someone for their work, because most of the time it is taken for granted.

Congratulations on your library/work. Keep up the good work.
IMHO, it would be cool if you just used deep-translator as a third-party lib, because you would get the latest updates, so more features, definitely bug fixes and, most importantly, support.

wordhoard.synonyms producing errors

An error is being written in the wordhoard_error.yaml log file when the following code is executed:

from wordhoard.synonyms import Synonyms

synonyms = Synonyms('bad').find_synonyms()
print(synonyms)

This is the error:

2021-08-15 09:38:29:wordhoard.synonyms:ERROR: An IndexError occurred in the following code segment:
2021-08-15 09:38:29:wordhoard.synonyms:ERROR: File "/Users/unknownPython_Projects/scratch_pad_testing/venv/lib/python3.9/site-packages/wordhoard/synonyms.py", line 194, in _query_synonym_com
synonyms_list = find_synonyms[2].lstrip().replace('synonyms:', '').split(',')
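The IndexError suggests that synonym.com's markup sometimes yields fewer than three segments before the failing line indexes `find_synonyms[2]`. A guarded version of that parsing step, sketched here outside wordhoard's actual code, avoids the crash by checking the length first:

```python
def parse_synonym_segments(segments):
    # Guarded version of the failing line in _query_synonym_com:
    # only index segments[2] when it actually exists.
    if len(segments) > 2:
        return [s.strip() for s in
                segments[2].lstrip().replace('synonyms:', '').split(',')]
    # Page layout changed or word not found: return no synonyms.
    return []
```

Returning an empty list lets the caller fall through to the other sources instead of logging an IndexError.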

Option to disable sorting of output?

I've noticed that the output of synonyms is sorted alphabetically, which defeats the purpose of sorting by relevance. Is there an option to disable this?

Consider allowing for design-time and runtime configurability of sources

No configuration appears to be available that would allow consumers to specify which source datasets to use. Antonyms, for instance, uses _query_synonym_com, _query_thesaurus_com and _query_thesaurus_plus, and it would be helpful to be able to suppress each one in configuration or at runtime. Additionally, the ability to leverage a local override set would be helpful in some cases.

Amazing!

Hehey, this is amazing. Is there a chance to get all words from the database in the style
hypernym: hyponyms?

Missing PYPI_description.md from github

wordhoard/setup.py

Lines 3 to 4 in 1e54f45

with open("PYPI_description.md", "r") as fh:
long_description = fh.read()

Produces an error when installing with

pip install -e .

Because the file (PYPI_description.md) does not exist in the repository.
A fix without adding the file would be to comment that block out and use
long_description = "temp description" instead.
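The suggested fallback can also be written so the file is used when present, as sketched below. This is an illustrative patch for setup.py, not the project's actual fix:

```python
# Sketch of a fallback for setup.py: read PYPI_description.md when it
# exists, otherwise substitute a placeholder description.
import os

if os.path.exists("PYPI_description.md"):
    with open("PYPI_description.md", "r") as fh:
        long_description = fh.read()
else:
    long_description = "temp description"
```

With this guard, `pip install -e .` succeeds from a clean checkout even though the file is only generated for PyPI releases.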

Synonyms returning the same result after switching to a different source

Demo code:

synonym = Synonyms(
    search_string="glaring",
    sources=[
        "synonym.com",
    ],
    rate_limit_timeout_period=1,
)
synonym_results = synonym.find_synonyms()
print(synonym_results)

synonym = Synonyms(
    search_string="glaring",
    sources=[
        "merriam-webster",
    ],
    rate_limit_timeout_period=1,
)
synonym_results = synonym.find_synonyms()
print(synonym_results)

Both calls return the same result.

Questionable behavior of find_synonyms()

With the following code in generic_utils.py:

from wordhoard import Synonyms
def synonym(word):
    syn = Synonyms(word)
    syn_res = syn.find_synonyms()
    return syn_res

Run from a terminal with a clean state:

>>> import generic_utils as gu
>>> gu.synonym('mother')
['ma', 'mom', 'mum', 'dam', 'mama', 'mater', 'mommy', 'mummy', 'mamma', 'mammy', 'momma', 'parent', 'para i', 'supermom', 
'puerpera', 'old lady', 'old woman', 'primipara', 'quadripara', 'quintipara', 'birth mother', 'mother-in-law',
'foster mother', 'female parent', 'surrogate mother', 'biological mother']
>>> gu.synonym('mother')
['noun']

env info with wordhoard==1.5.3 and python 3.10.10:

backoff==2.2.1
beautifulsoup4==4.12.2
certifi==2022.12.7
charset-normalizer==3.1.0
cloudscraper==1.2.71
deckar01-ratelimit==3.0.2
deepl==1.14.0
idna==3.4
lxml==4.9.2
pyparsing==3.0.9
requests==2.28.2
requests-toolbelt==1.0.0
soupsieve==2.4.1
urllib3==1.26.15

This seems to be some error with caching; once I was able to get an error message, but I'm not 100% sure that this is it.

ERROR:wordhoard.synonyms:A KeyError occurred in the following code segment:
ERROR:wordhoard.synonyms:  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/synonyms.py", line 571, in _query_thesaurus_com
    self._update_cache(part_of_speech_category, synonyms_list)
  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/synonyms.py", line 134, in _update_cache
    caching.insert_word_cache_synonyms(self._word, pos_category, synonyms)
  File "/<path>/.conda/envs/sam/lib/python3.10/site-packages/wordhoard/utilities/caching.py", line 65, in insert_word_cache_synonyms
    temporary_dict_synonyms[word][pos_category] += deduplicated_values

I was able to fix the behaviour by disabling caching entirely, changing the line

check_cache = self._check_cache()

to

check_cache = [False]
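The KeyError at caching.py line 65 is consistent with `+=` being applied to a part-of-speech key that has not been created yet. The structure below is an assumed reduction of the cache, not wordhoard's actual code, but it reproduces the failure mode and shows a setdefault-based fix:

```python
# Minimal reproduction of the suspected cache bug and a fix.
# The cache layout here is an assumption based on the traceback.
cache = {}

def insert_buggy(word, pos, values):
    cache.setdefault(word, {})
    cache[word][pos] += values  # KeyError when pos is new for this word

def insert_fixed(word, pos, values):
    # setdefault creates the missing list before extending it.
    cache.setdefault(word, {}).setdefault(pos, []).extend(values)
```

If this is the cause, it would also explain the `['noun']` result above: a partially written cache entry is returned on the second lookup instead of the synonym list.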
