
datasciencecampus / pygrams


Extracts key terminology (n-grams) from any large collection of documents (>1000) and forecasts emergence

Home Page: https://datasciencecampus.github.io/pygrams

License: Other

Languages: Python 99.22%, JavaScript 0.64%, CSS 0.06%, HTML 0.06%, Dockerfile 0.02%
Topics: nlp, python, scikit-learn, nltk, natural-language-processing, patents, tf-idf, emergence-calculations, dsc-projects

pygrams's Issues

Force-directed graphs no longer work

Describe the bug
Fails to load page

To Reproduce
Steps to reproduce the behavior:

  1. Open index.html
  2. Blank screen appears rather than FDG

Expected behavior
The force-directed graph (FDG) should load and render.

Screenshots
N/A

Additional context
URL is wrong for empty.json

refactor fdgprep.py

This module started off from some code we acquired from HMRC, which was fit for purpose at the time and has changed a lot since, but it is still based on the same design. I am not even convinced it is bug-free.

We need to re-design it such that:
-> we pass in the text documents together with the tf-idf matrix and the n terms that were output in the report.
-> for each document, we grab all of its terms from the tf-idf matrix and build a graph in which each node lists its neighbouring terms along with a frequency score.

For example: graph['wind_turbine'] = [('rotor_blade', 6), ('power_generator', 4), ..., ('wind_power', 2)]
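A minimal sketch of how such a graph could be built, assuming a scikit-learn tf-idf matrix and its feature names; the neighbour "frequency score" is read here as a document co-occurrence count, and all names (build_term_graph, tfidf_matrix, feature_names) are illustrative rather than the existing fdgprep.py API:

from collections import defaultdict

def build_term_graph(tfidf_matrix, feature_names, top_n=10):
    """Map each term to its neighbours, ranked by how often they co-occur in a document."""
    co_counts = defaultdict(lambda: defaultdict(int))
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        # all terms with a non-zero tf-idf score in this document
        terms = [feature_names[col] for col in csr.indices[csr.indptr[row]:csr.indptr[row + 1]]]
        for term in terms:
            for neighbour in terms:
                if neighbour != term:
                    co_counts[term][neighbour] += 1
    return {term: sorted(neighbours.items(), key=lambda kv: -kv[1])[:top_n]
            for term, neighbours in co_counts.items()}

# graph = build_term_graph(tfidf_matrix, feature_names)
# graph['wind_turbine'] -> [('rotor_blade', 6), ('power_generator', 4), ...]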

Please consider adding some tests too!

improve ngrams-unbiasing performance

Opted for binary search over the feature set, as it is cheaper than the other option of linear search through the unsorted non-zero term-features.

The best option would be to iterate the sparse matrix row-wise with sorted feature names; can we do that?
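It looks doable with scipy's CSR internals: scikit-learn's feature names come out in sorted (column) order, and the indptr/indices arrays give each row's non-zero terms directly, with no per-term search at all. A sketch with illustrative names:

def iterate_nonzero_terms(tfidf_matrix, feature_names):
    """Yield (doc_index, [(term, score), ...]) for each row,
    avoiding any search over the feature set."""
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        start, end = csr.indptr[row], csr.indptr[row + 1]
        yield row, [(feature_names[col], score)
                    for col, score in zip(csr.indices[start:end], csr.data[start:end])]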

Date range not reported

Describe the bug
When -yf or -yt is used, the date range is always reported as 'None'

To Reproduce
Steps to reproduce the behavior:

  1. python detect.py -yf 2000 -yt 2018
  2. Observe output:
Patents readied; 1,000 patents loaded
1,000 patents available after publication date sift
Dropped 0 patents due to empty abstracts
In date range None to None there are 1,000 patents
1. power supply                   3.332515  
...

Expected behavior
Output should be:

Patents readied; 1,000 patents loaded
1,000 patents available after publication date sift
Dropped 0 patents due to empty abstracts
In date range 2000-01-01 to 2018-12-31 there are 1,000 patents
1. power supply                   3.332515  
...
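A hedged sketch of how the -yf/-yt years might be expanded into the reported bounds; the function and argument names are assumptions, not the actual detect.py code:

import datetime

def date_range_from_years(year_from=None, year_to=None):
    """Expand -yf / -yt year arguments into the full dates used in the report line."""
    date_from = datetime.date(year_from, 1, 1) if year_from else None
    date_to = datetime.date(year_to, 12, 31) if year_to else None
    return date_from, date_to

date_from, date_to = date_range_from_years(2000, 2018)
print(f'In date range {date_from} to {date_to} there are 1,000 patents')
# In date range 2000-01-01 to 2018-12-31 there are 1,000 patents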

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS
  • Browser: N/A
  • Version: 10.13.6

Table output

Create a table showing changes in feature rankings with focus, time and cite options

README.md buttons not reflecting status

Describe the bug
The build and coverage buttons are in an "unknown" state

To Reproduce
Look at the project home page - scroll down to see README.md displayed; the buttons are not showing pass/fail.

Expected behavior
Buttons should reflect state of master build.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS
  • Browser: Safari
  • Version 11.1.2

Smartphone (please complete the following information):
N/A

Additional context
Clicking through the buttons works - underlying pages are live and showing success

json output config file

The JSON needs to include the following (see the sketch after this list):

absolute_path_pkl (of the pickled file used as input),
absolute_path_tech_report (of tech_report.txt),
cpc,
year_from,
year_to,
pick,
time,
cite,
focus
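A minimal sketch of writing that config with the standard library; the key names follow the list above, while every value shown is a placeholder:

import json

report_config = {
    'absolute_path_pkl': '/abs/path/to/input.pkl.bz2',           # pickled file used as input
    'absolute_path_tech_report': '/abs/path/to/tech_report.txt',
    'cpc': 'Y02',
    'year_from': 2000,
    'year_to': 2018,
    'pick': 'sum',
    'time': False,
    'cite': False,
    'focus': None,
}

with open('report_config.json', 'w') as f:
    json.dump(report_config, f, indent=2)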

output connectivity graph

fdgprep.py creates a graph which it stores as JSON. Use this graph to generate a text output similar to the FDG output, where the tf-idf ranked results (without the score) are the nodes and, to the right of each node, its linked terms are listed ranked by frequency score.
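A rough sketch of that text output, assuming the stored JSON maps each term to a list of (linked term, frequency score) pairs; all names here are illustrative:

import json

def write_connectivity_report(graph_json_path, ranked_terms, out_path, max_links=5):
    """Write one line per tf-idf ranked term (score omitted), followed by its
    linked terms ranked by frequency score."""
    with open(graph_json_path) as f:
        graph = json.load(f)
    with open(out_path, 'w') as out:
        for term in ranked_terms:
            links = sorted(graph.get(term, []), key=lambda link: -link[1])[:max_links]
            out.write(f"{term}: {', '.join(name for name, _freq in links)}\n")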

Custom NLTK download folder

We need to add the following:

import nltk
nltk.data.path.append("<required_path>")  # add a custom location to NLTK's data search path

via a command line arg, so we can add a custom location without needing an environment variable
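A possible wiring, assuming an argparse-based entry point; the --nltk-path flag name is a suggestion, not an existing option:

import argparse
import nltk

parser = argparse.ArgumentParser()
parser.add_argument('--nltk-path', default=None,
                    help='custom folder containing NLTK data downloads')
args = parser.parse_args()

if args.nltk_path:
    nltk.data.path.append(args.nltk_path)  # searched in addition to the default locations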

Streamline TF-IDF API

At the moment there is a method (with tests for it) that does term extraction from a single doc, which is wrong.

We need two public methods: one that gets all terms, and one that gets terms for a list of doc_ids.
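A sketch of the streamlined interface, assuming a scikit-learn TfidfVectorizer underneath (use get_feature_names on older scikit-learn versions); the class and attribute names are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

class TfidfTerms:
    def __init__(self, docs, ngram_range=(1, 3)):
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range)
        self.matrix = self.vectorizer.fit_transform(docs)

    def get_all_terms(self):
        """Every extracted term, in feature (column) order."""
        return list(self.vectorizer.get_feature_names_out())

    def get_terms_for_docs(self, doc_ids):
        """Terms with a non-zero tf-idf score in any of the requested documents."""
        names = self.vectorizer.get_feature_names_out()
        return {names[col] for col in self.matrix[doc_ids].indices}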

Output term counts per week

We need term counts per week rather than a full TF-IDF matrix; that is, term occurrence counts per week, accumulated as how many patents in a given week use the term, computed as the number of patents in that week with a non-zero TF-IDF score for it.
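A sketch of the accumulation, assuming a per-document list of publication dates aligned with the tf-idf rows; the names are assumptions:

from collections import Counter, defaultdict

def weekly_term_counts(tfidf_matrix, feature_names, publication_dates):
    """counts[term][(iso_year, iso_week)] = number of patents published that week
    with a non-zero tf-idf score for the term."""
    counts = defaultdict(Counter)
    csr = tfidf_matrix.tocsr()
    for row, pub_date in enumerate(publication_dates):
        iso_year, iso_week, _ = pub_date.isocalendar()
        for col in csr.indices[csr.indptr[row]:csr.indptr[row + 1]]:
            counts[feature_names[col]][(iso_year, iso_week)] += 1
    return counts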

matplotlib PyCharm bug

Describe the bug
When using the app in PyCharm the following error is given:

Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.

This is related to matplotlib within the multicloudplot.py file.

To Reproduce
Steps to reproduce the behavior:

  1. Run the app on a clean install and fresh environment. Error should occur.

Expected behavior
The app will not run, instead displaying:

Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends.

Screenshots
N/A

Desktop (please complete the following information):

  • OS: MacOS Sierra
  • Browser: N/A
  • Version: N/A

Additional context
Will fix in a new branch
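One common workaround (not necessarily what the branch will do) is to select a different backend before pyplot is imported, e.g. at the top of multicloudplot.py:

import matplotlib
matplotlib.use('Agg')            # non-interactive backend; sidesteps the macOS framework check
import matplotlib.pyplot as plt  # must be imported after matplotlib.use()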

requirements.txt not required

Describe the bug
requirements.txt was introduced to get GitHub to produce dependencies; this is now superseded by setup.py

fix graph generation

We need to be able to generate tf-idf scores by row to revive the FDG and other graph applications.

Continuous integration across all platforms

Is your feature request related to a problem? Please describe.
CI is currently just Linux

Describe the solution you'd like
Should test MacOS and Windows

Describe alternatives you've considered
Appveyor may be an option alongside Travis

Auto-generate dependency wheel

Dependencies can be generated by:

python3 setup.py bdist_wheel

And then directly installed with pip install 'wheel file name'

Documentation error - need to acknowledge USPTO

Describe the bug
Need to acknowledge use of USPTO data

To Reproduce
Examine README.md

Expected behavior
README.md should contain a reference to the USPTO data source

Screenshots
N/A

Desktop (please complete the following information):
N/A

Smartphone (please complete the following information):
N/A

Additional context
N/A

filter by doc_set

add a command line option -fb, --filter_by to filter by a document set of choice.

Include CPC classification amount

At the moment the tool doesn't state how many patents are analysed when CPC classification is applied. For example, the console output for python detect.py -cpc=Y02 -ps=USPTO-random-10000 would be:

Patents readied; 10,000 patents loaded
10,000 patents available after publication date sift
Sifting patents for Y02: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9848.67patent/s]
Dropped 0 patents due to empty abstracts

Instead the code should say:

Patents readied; 10,000 patents loaded
10,000 patents available after publication date sift
Sifting patents for Y02: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:01<00:00, 9848.67patent/s]
Dropped 0 patents due to empty abstracts
XX,XXX patents with Y02 classification analysed **_<- this is the new code needed_**

This is so the user knows if the analysis was performed on a suitable sample size.

Add memory usage to README.md

Test results (using patent_app_detect open source project)

VM running Windows 10 64-bit, 4 GB memory, 2.1 GHz Xeon E5-2620

100 patents: 0:00:07 (insufficient results to enable analysis)
1,000 patents: 0:00:37
10,000 patents: 0:04:45 (285s); 283 MB
100,000 patents: 0:40:10 (2,410s); 810 MB
500,000 patents: 3:22:08 (12,128s); 2,550 MB
all patents (3,152,701): 21:13:09 (76,389s); 13,304 MB

improve n-gram count accuracy

At the moment n-gram scoring is biased by the fact that any n-gram is contained in its (n+1)-grams. Consider the example: "big data analytics", "big data" and "data analytics". The trigram gets the same count as the bigrams it contains, but the bigrams also collect counts from other documents where they occur as bigram terms, and hence get a higher rank. This is why the rankings are always dominated by terms of lower n.

We need to scan the tf-idf matrix row by row and nullify the tf-idf score for n-grams contained within an (n+1)-gram. This way we will achieve a better term distribution across a given n-gram range.
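A rough sketch of that row-wise pass, assuming space-separated n-gram feature names as produced by scikit-learn's TfidfVectorizer; the function name and approach are illustrative:

def unbias_ngrams(tfidf_matrix, feature_names):
    """Zero a row's tf-idf score for any n-gram that also appears inside a
    longer (n+1)-gram present in the same document."""
    csr = tfidf_matrix.tocsr()
    for row in range(csr.shape[0]):
        start, end = csr.indptr[row], csr.indptr[row + 1]
        row_terms = {feature_names[col]: offset
                     for offset, col in enumerate(csr.indices[start:end])}
        for term, offset in row_terms.items():
            n = len(term.split())
            contained = any(len(other.split()) == n + 1 and f' {term} ' in f' {other} '
                            for other in row_terms)
            if contained:
                csr.data[start + offset] = 0.0
    csr.eliminate_zeros()
    return csr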

lemmatizer problems

Describe the bug
Lemmatizer often seems to be failing. For example, the tf-idf matrix contains:

abrasion resistance and abrasion resistant
heat_exchange and heat_exchanger

To Reproduce
Create a tf-idf matrix with USPTO-random-500000-term_present.pkl.bz2

Expected behavior
Look for the above terms and try to find out why this is happening. Does the lemmatizer fail to unify a term when it sees, for example, a verb and a noun? Can we improve this?
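A quick illustration of why this may be expected behaviour from NLTK's WordNetLemmatizer: it only strips inflections, not derivational suffixes, so a noun and an adjective (or agent noun) from the same root never collapse to one lemma. Requires the wordnet corpus (nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('resistance', pos='n'))  # 'resistance'
print(lemmatizer.lemmatize('resistant', pos='a'))   # 'resistant' - a different lemma, never merged
print(lemmatizer.lemmatize('exchanger', pos='n'))   # 'exchanger' - not reduced to 'exchange'

Unifying such pairs would need stemming or an explicit derivational mapping rather than lemmatization alone.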

Store TF-IDF matrix as output

Is your feature request related to a problem? Please describe.
Option to store TF-IDF matrix once processing is complete rather than discard it. Useful for further processing.

Describe the solution you'd like
--output tfidf
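A minimal sketch of what --output tfidf could write, following the project's existing .pkl.bz2 convention; the output path and tuple layout are assumptions:

import bz2
import pickle

def write_tfidf(tfidf_matrix, feature_names, out_path='outputs/tfidf/tfidf.pkl.bz2'):
    """Persist the sparse matrix and its feature names for later processing."""
    with bz2.BZ2File(out_path, 'wb') as f:
        pickle.dump((tfidf_matrix, feature_names), f, protocol=pickle.HIGHEST_PROTOCOL)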

Make tfidf object SOLID

Describe the bug
The tfidf object was not re-usable: we were fitting the abstracts in the constructor and performing the actual transformation in the public-facing method, so if that method was called more than once it repeated expensive computations needlessly and also introduced errors.

To Reproduce
run the chi2 or citation tests and inspect the tfidf matrix

Solution
Perform the tf-idf transform in the constructor; this guarantees it is done once per object.
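A minimal sketch of the intended shape (class and attribute names are illustrative): the expensive fit and transform happen exactly once, in the constructor, and callers only read the cached result.

from sklearn.feature_extraction.text import TfidfVectorizer

class TfIdf:
    def __init__(self, abstracts, ngram_range=(1, 3)):
        self._vectorizer = TfidfVectorizer(ngram_range=ngram_range)
        # fit and transform once per object; repeated public calls can no longer
        # redo the expensive computation or drift out of sync with each other
        self._matrix = self._vectorizer.fit_transform(abstracts)

    @property
    def matrix(self):
        return self._matrix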

Restore txt report

The output report looks different and messy at the moment. Need to restore it to its previous state, as a single column sorted list.

streamline UI

Agree on which CLI options stay and which go, with user-friendliness and output quality in mind.

add embeddings support

Use embeddings from fastText or a similar model to (see the sketch after this list):
-> group terms by cosine or WMD similarity, as an option.
-> create a stopword list of words similar to the user's input.
-> filter output using cosine distance from the user's input.
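A hedged sketch of the first two options using gensim's KeyedVectors (fastText .vec files load via load_word2vec_format); the model path, threshold and topn values are placeholders:

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('path/to/fasttext-vectors.vec')  # placeholder path

def group_similar_terms(terms, threshold=0.7):
    """Greedily group terms whose cosine similarity to a group's first member exceeds the threshold."""
    groups = []
    for term in terms:
        placed = False
        for group in groups:
            head = group[0]
            if term in vectors and head in vectors and vectors.similarity(term, head) >= threshold:
                group.append(term)
                placed = True
                break
        if not placed:
            groups.append([term])
    return groups

def stopword_candidates(user_term, topn=20):
    """Candidate stopword list: the words closest to the user's input term."""
    return [word for word, _score in vectors.most_similar(user_term, topn=topn)]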

sentences extraction

From the graph, pick the n best-matching sentences, i.e. the ones that best match the top head nodes and their associated linked nodes.

f.js refers to a random website

Describe the bug
f.js should only refer to our website

To Reproduce
Examine outputs\fdg\f.js and notice dataURL

Expected behavior
To refer to web assets we own and control; add a small, valid JSON file in GitHub and refer to that on the master branch (as it will not intentionally be changed)
