
datasciencecampus / pygrams


Extracts key terminology (n-grams) from any large collection of documents (>1000) and forecasts emergence

Home Page: https://datasciencecampus.github.io/pygrams

License: Other

Languages: Python 99.22%, JavaScript 0.64%, CSS 0.06%, HTML 0.06%, Dockerfile 0.02%
Topics: nlp, python, scikit-learn, nltk, natural-language-processing, patents, tf-idf, emergence-calculations, dsc-projects

pygrams's Issues

Force-directed graphs no longer work

Describe the bug
Fails to load page

To Reproduce
Steps to reproduce the behavior:

  1. Open index.html
  2. Blank screen appears rather than FDG

Expected behavior
The force-directed graph (FDG) should load and render.

Screenshots
N/A

Additional context
URL is wrong for empty.json

refactor fdgprep.py

This module started off from some code we acquired from HMRC, which was fit for purpose at the time and has changed a lot since, but it is still based on the same design. I am not even convinced it is bug-free.

We need to re-design it such that:
-> we pass in the text documents together with the tf-idf matrix and the n terms that were output in the report.
-> for each document, we grab all of its terms from the tf-idf matrix and build a graph in which each node lists its neighbouring terms along with a frequency score.

For example: graph['wind_turbine'] = [('rotor_blade', 6), ('power_generator', 4), ..., ('wind_power', 2)]
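A minimal sketch of how such a graph could be built, assuming a scikit-learn tf-idf matrix and its feature names; the neighbour "frequency score" is read here as a document co-occurrence count, and all names (build_term_graph, tfidf_matrix, feature_names) are illustrative rather than the existing fdgprep.py API:

from collections import defaultdict

def build_term_graph(tfidf_matrix, feature_names, top_n=10):
    """Map each term to its neighbours, ranked by how often they co-occur in a document."""
    co_counts = defaultdict(lambda: defaultdict(int))
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        # all terms with a non-zero tf-idf score in this document
        terms = [feature_names[col] for col in csr.indices[csr.indptr[row]:csr.indptr[row + 1]]]
        for term in terms:
            for neighbour in terms:
                if neighbour != term:
                    co_counts[term][neighbour] += 1
    return {term: sorted(neighbours.items(), key=lambda kv: -kv[1])[:top_n]
            for term, neighbours in co_counts.items()}

# graph = build_term_graph(tfidf_matrix, feature_names)
# graph['wind_turbine'] -> [('rotor_blade', 6), ('power_generator', 4), ...]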

Please consider adding some tests too!

improve ngrams-unbiasing performance

Opted for binary search over the feature set, as it is cheaper than the other option of linear search through the unsorted non-zero term-features.

The best option would be to iterate the sparse matrix row-wise with sorted feature names; can we do that?
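It looks doable with scipy's CSR internals: scikit-learn's feature names come out in sorted (column) order, and the indptr/indices arrays give each row's non-zero terms directly, with no per-term search at all. A sketch with illustrative names:

def iterate_nonzero_terms(tfidf_matrix, feature_names):
    """Yield (doc_index, [(term, score), ...]) for each row,
    avoiding any search over the feature set."""
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        start, end = csr.indptr[row], csr.indptr[row + 1]
        yield row, [(feature_names[col], score)
                    for col, score in zip(csr.indices[start:end], csr.data[start:end])]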

Date range not reported

Describe the bug
When -yf or -yt is used, the date range is always reported as 'None'

To Reproduce
Steps to reproduce the behavior:

  1. python detect.py -yf 2000 -yt 2018
  2. Observe output:
Patents readied; 1,000 patents loaded
1,000 patents available after publication date sift
Dropped 0 patents due to empty abstracts
In date range None to None there are 1,000 patents
1. power supply                   3.332515  
...

Expected behavior
Output should be:

Patents readied; 1,000 patents loaded
1,000 patents available after publication date sift
Dropped 0 patents due to empty abstracts
In date range 2000-01-01 to 2018-12-31 there are 1,000 patents
1. power supply                   3.332515  
...
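A hedged sketch of how the -yf/-yt years might be expanded into the reported bounds; the function and argument names are assumptions, not the actual detect.py code:

import datetime

def date_range_from_years(year_from=None, year_to=None):
    """Expand -yf / -yt year arguments into the full dates used in the report line."""
    date_from = datetime.date(year_from, 1, 1) if year_from else None
    date_to = datetime.date(year_to, 12, 31) if year_to else None
    return date_from, date_to

date_from, date_to = date_range_from_years(2000, 2018)
print(f'In date range {date_from} to {date_to} there are 1,000 patents')
# In date range 2000-01-01 to 2018-12-31 there are 1,000 patents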

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS
  • Browser: N/A
  • Version: 10.13.6

Table output

Create a table showing changes in feature rankings with focus, time and cite options

README.md buttons not reflecting status

Describe the bug
The build and coverage buttons are in an "unknown" state

To Reproduce
Look at the project home page - scroll down to see README.md displayed; the buttons are not showing pass/fail.

Expected behavior
Buttons should reflect state of master build.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS
  • Browser: Safari
  • Version 11.1.2

Smartphone (please complete the following information):
N/A

Additional context
Clicking through the buttons works - underlying pages are live and showing success

json output config file

The JSON needs to include the following (see the sketch after this list):

absolute_path_pkl (of the pickled file used as input),
absolute_path_tech_report (of tech_report.txt),
cpc,
year_from,
year_to,
pick,
time,
cite,
focus
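A minimal sketch of writing that config with the standard library; the key names follow the list above, while every value shown is a placeholder:

import json

report_config = {
    'absolute_path_pkl': '/abs/path/to/input.pkl.bz2',           # pickled file used as input
    'absolute_path_tech_report': '/abs/path/to/tech_report.txt',
    'cpc': 'Y02',
    'year_from': 2000,
    'year_to': 2018,
    'pick': 'sum',
    'time': False,
    'cite': False,
    'focus': None,
}

with open('report_config.json', 'w') as f:
    json.dump(report_config, f, indent=2)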

output connectivity graph

fdgprep.py creates a graph which it stores as JSON. Use this graph to generate a text output similar to the FDG output, where the tf-idf ranked results (without the score) are the nodes and, to the right of each node, its linked terms are listed ranked by frequency score.
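A rough sketch of that text output, assuming the stored JSON maps each term to a list of (linked term, frequency score) pairs; all names here are illustrative:

import json

def write_connectivity_report(graph_json_path, ranked_terms, out_path, max_links=5):
    """Write one line per tf-idf ranked term (score omitted), followed by its
    linked terms ranked by frequency score."""
    with open(graph_json_path) as f:
        graph = json.load(f)
    with open(out_path, 'w') as out:
        for term in ranked_terms:
            links = sorted(graph.get(term, []), key=lambda link: -link[1])[:max_links]
            out.write(f"{term}: {', '.join(name for name, _freq in links)}\n")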

Custom NLTK download folder

We need to add the following:

import nltk
nltk.data.path.append("<required_path>")  # add a custom location to NLTK's data search path

via a command line arg, so we can add a custom location without needing an environment variable
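A possible wiring, assuming an argparse-based entry point; the --nltk-path flag name is a suggestion, not an existing option:

import argparse
import nltk

parser = argparse.ArgumentParser()
parser.add_argument('--nltk-path', default=None,
                    help='custom folder containing NLTK data downloads')
args = parser.parse_args()

if args.nltk_path:
    nltk.data.path.append(args.nltk_path)  # searched in addition to the default locations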

Streamline TF-IDF API

At the moment there is a method (with tests for it) that does term extraction from a single doc, which is wrong.

We need two public methods: one that gets all terms, and one that gets terms for a list of doc_ids.
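A sketch of the streamlined interface, assuming a scikit-learn TfidfVectorizer underneath (use get_feature_names on older scikit-learn versions); the class and attribute names are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

class TfidfTerms:
    def __init__(self, docs, ngram_range=(1, 3)):
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range)
        self.matrix = self.vectorizer.fit_transform(docs)

    def get_all_terms(self):
        """Every extracted term, in feature (column) order."""
        return list(self.vectorizer.get_feature_names_out())

    def get_terms_for_docs(self, doc_ids):
        """Terms with a non-zero tf-idf score in any of the requested documents."""
        names = self.vectorizer.get_feature_names_out()
        return {names[col] for col in self.matrix[doc_ids].indices}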

Output term counts per week

We need term counts per week rather than a full TF-IDF matrix; that is, term occurrence counts per week, accumulated as how many patents in a given week use the term, computed as the number of patents in that week with a non-zero TF-IDF score for it.
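A sketch of the accumulation, assuming a per-document list of publication dates aligned with the tf-idf rows; the names are assumptions:

from collections import Counter, defaultdict

def weekly_term_counts(tfidf_matrix, feature_names, publication_dates):
    """counts[term][(iso_year, iso_week)] = number of patents published that week
    with a non-zero tf-idf score for the term."""
    counts = defaultdict(Counter)
    csr = tfidf_matrix.tocsr()
    for row, pub_date in enumerate(publication_dates):
        iso_year, iso_week, _ = pub_date.isocalendar()
        for col in csr.indices[csr.indptr[row]:csr.indptr[row + 1]]:
            counts[feature_names[col]][(iso_year, iso_week)] += 1
    return counts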

matplotlib PyCharm bug

Describe the bug
When using the app in PyCharm the following error is given:

Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.

This is related to matplotlib within the multicloudplot.py file.

To Reproduce
Steps to reproduce the behavior:

  1. Run the app on a clean install and fresh environment. Error should occur.

Expected behavior
The app will not run, instead displaying:

Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS Sierra
  • Browser: N/A
  • Version: N/A

Additional context
Will fix in a new branch
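One common workaround (not necessarily what the branch will do) is to select a different backend before pyplot is imported, e.g. at the top of multicloudplot.py:

import matplotlib
matplotlib.use('Agg')            # non-interactive backend; sidesteps the macOS framework check
import matplotlib.pyplot as plt  # must be imported after matplotlib.use()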

requirements.txt not required

Describe the bug
requirements.txt was introduced to get GitHub to produce dependencies; this is now superseded by setup.py

fix graph generation

We need to be able to generate tf-idf scores by row to revive the FDG and other graph applications.

Continuous integration across all platforms

Is your feature request related to a problem? Please describe.
CI is currently just Linux

Describe the solution you'd like
Should test MacOS and Windows

Describe alternatives you've considered
Appveyor may be an option alongside Travis

Auto-generate dependency wheel

Dependencies can be generated by:

python3 setup.py bdist_wheel

And then directly installed with pip install 'wheel file name'

Documentation error - need to acknowledge USPTO

Describe the bug
Need to acknowledge use of USPTO data

To Reproduce
Examine README.md

Expected behavior
README.md should contain a reference to the USPTO data source

Screenshots
N/A

Desktop (please complete the following information):
N/A

Smartphone (please complete the following information):
N/A

Additional context
N/A

filter by doc_set

add a command line option -fb, --filter_by to filter by a document set of choice.

Include CPC classification amount

At the moment the tool doesn't state how many patents are analysed when CPC classification is applied. For example, the console output for python detect.py -cpc=Y02 -ps=USPTO-random-10000 would be:

Patents readied; 10,000 patents loaded
10,000 patents available after publication date sift
Sifting patents for Y02: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9848.67patent/s]
Dropped 0 patents due to empty abstracts

Instead the code should say:

Patents readied; 10,000 patents loaded
10,000 patents available after publication date sift
Sifting patents for Y02: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9848.67patent/s]
Dropped 0 patents due to empty abstracts
XX,XXX patents with Y02 classification analysed **_<- this is the new code needed_**

This is so the user knows if the analysis was performed on a suitable sample size.

Add memory usage to README.md

Test results (using patent_app_detect open source project)

VM running Windows 10 64-bit, 4 GB memory, 2.1 GHz Xeon E5-2620

100 patents: 0:00:07 (insufficient results to enable analysis)
1,000 patents: 0:00:37
10,000 patents: 0:04:45 (285s); 283 MB
100,000 patents: 0:40:10 (2,410s); 810 MB
500,000 patents: 3:22:08 (12,128s); 2,550 MB
all patents (3,152,701): 21:13:09 (76,389s); 13,304 MB

improve n-gram count accuracy

At the moment n-gram scoring is biased by the fact that any n-gram is contained in its (n+1)-grams. Consider the example: "big data analytics", "big data" and "data analytics". The trigram gets the same count as the bigrams it contains, but the bigrams also collect counts from other documents where they occur as bigram terms, and hence get a higher rank. This is why the rankings are always dominated by terms of lower n.

We need to scan the tf-idf matrix row by row and nullify the tf-idf score for n-grams contained within an (n+1)-gram. This way we will achieve a better term distribution across a given n-gram range.
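A rough sketch of that row-wise pass, assuming space-separated n-gram feature names as produced by scikit-learn's TfidfVectorizer; the function name and approach are illustrative:

def unbias_ngrams(tfidf_matrix, feature_names):
    """Zero a row's tf-idf score for any n-gram that also appears inside a
    longer (n+1)-gram present in the same document."""
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        start, end = csr.indptr[row], csr.indptr[row + 1]
        row_terms = {feature_names[col]: offset
                     for offset, col in enumerate(csr.indices[start:end])}
        for term, offset in row_terms.items():
            n = len(term.split())
            contained = any(len(other.split()) == n + 1 and f' {term} ' in f' {other} '
                            for other in row_terms)
            if contained:
                csr.data[start + offset] = 0.0
    csr.eliminate_zeros()
    return csr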

lemmatizer problems

Describe the bug
Lemmatizer often seems to be failing. For example, the tf-idf matrix contains:

abrasion resistance and abrasion resistant
heat_exchange and heat_exchanger

To Reproduce
Create a tf-idf matrix with USPTO-random-500000-term_present.pkl.bz2

Expected behavior
Look for the above terms and try to find out why this is happening. Does the lemmatizer fail to unify a term when it sees, for example, a verb and a noun? Can we improve this?
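A quick illustration of why this may be expected behaviour from NLTK's WordNetLemmatizer: it only strips inflections, not derivational suffixes, so a noun and an adjective (or agent noun) from the same root never collapse to one lemma. Requires the wordnet corpus (nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('resistance', pos='n'))  # 'resistance'
print(lemmatizer.lemmatize('resistant', pos='a'))   # 'resistant' - a different lemma, never merged
print(lemmatizer.lemmatize('exchanger', pos='n'))   # 'exchanger' - not reduced to 'exchange'

Unifying such pairs would need stemming or an explicit derivational mapping rather than lemmatization alone.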

Store TF-IDF matrix as output

Is your feature request related to a problem? Please describe.
Option to store TF-IDF matrix once processing is complete rather than discard it. Useful for further processing.

Describe the solution you'd like
--output tfidf
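A minimal sketch of what --output tfidf could write, following the project's existing .pkl.bz2 convention; the output path and tuple layout are assumptions:

import bz2
import pickle

def write_tfidf(tfidf_matrix, feature_names, out_path='outputs/tfidf/tfidf.pkl.bz2'):
    """Persist the sparse matrix and its feature names for later processing."""
    with bz2.BZ2File(out_path, 'wb') as f:
        pickle.dump((tfidf_matrix, feature_names), f, protocol=pickle.HIGHEST_PROTOCOL)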

Make tfidf object SOLID

Describe the bug
The tfidf object was not re-usable: we were fitting the abstracts in the constructor and performing the actual transformation in the public-facing method, so if that method was called more than once it repeated expensive computations needlessly and also introduced errors.

To Reproduce
run the chi2 or citation tests and inspect the tfidf matrix

Solution
Perform the tf-idf transform in the constructor; this guarantees it is done once per object.
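A minimal sketch of the intended shape (class and attribute names are illustrative): the expensive fit and transform happen exactly once, in the constructor, and callers only read the cached result.

from sklearn.feature_extraction.text import TfidfVectorizer

class TfIdf:
    def __init__(self, abstracts, ngram_range=(1, 3)):
        self._vectorizer = TfidfVectorizer(ngram_range=ngram_range)
        # fit and transform once per object; repeated public calls can no longer
        # redo the expensive computation or drift out of sync with each other
        self._matrix = self._vectorizer.fit_transform(abstracts)

    @property
    def matrix(self):
        return self._matrix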

Restore txt report

The output report looks different and messy at the moment. Need to restore it to its previous state, as a single column sorted list.

streamline UI

Agree on which CLI options stay and which go, with user-friendliness and output quality in mind.

add embeddings support

Use embeddings from fastText or a similar model to (see the sketch after this list):
-> group terms by cosine or WMD similarity, as an option.
-> create a stopword list of words similar to the user's input.
-> filter output using cosine distance from the user's input.
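A hedged sketch of the first two options using gensim's KeyedVectors (fastText .vec files load via load_word2vec_format); the model path, threshold and topn values are placeholders:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('path/to/fasttext-vectors.vec')  # placeholder path

def group_similar_terms(terms, threshold=0.7):
    """Greedily group terms whose cosine similarity to a group's first member exceeds the threshold."""
    groups = []
    for term in terms:
        placed = False
        for group in groups:
            head = group[0]
            if term in vectors and head in vectors and vectors.similarity(term, head) >= threshold:
                group.append(term)
                placed = True
                break
        if not placed:
            groups.append([term])
    return groups

def stopword_candidates(user_term, topn=20):
    """Candidate stopword list: the words closest to the user's input term."""
    return [word for word, _score in vectors.most_similar(user_term, topn=topn)]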

sentences extraction

From the graph, pick the n best-matching sentences, i.e. the ones that best match the top head nodes and their associated linked nodes.

f.js refers to a random website

Describe the bug
f.js should only refer to our website

To Reproduce
Examine outputs\fdg\f.js and notice dataURL

Expected behavior
To refer to web assets we own and control; add a small, valid JSON file in GitHub and refer to that on the master branch (as it will not intentionally be changed)
