

gregory's People

Contributors

anachaba, antoniolopes, brunoamaral, dippas


gregory's Issues

Move API to django rest framework

The endpoints to port are listed below; a minimal DRF sketch follows the list.

  • List all articles.

https://api.gregory-ms.com/articles/all

Example: https://api.gregory-ms.com/articles/all

  • List the article that matches the {ID} number.

https://api.gregory-ms.com/articles/id/{ID}

Example: https://api.gregory-ms.com/articles/id/19

  • List all articles by keyword.

https://api.gregory-ms.com/articles/keyword/{keyword}

Example: https://api.gregory-ms.com/articles/keyword/myelin

  • List related articles by keywords.

POST https://gregory-ms.com/articles/related/

Expects a JSON object of keywords in the POST body.

{ "keywords": ["trials", "gait rehabilitation", "multiple sclerosis"] }

Example: POST https://gregory-ms.com/articles/related/

  • List all relevant articles.

These are articles that we show on the home page because they appear to offer new courses of treatment.

https://api.gregory-ms.com/articles/relevant

Example: https://api.gregory-ms.com/articles/relevant

Articles’ Sources

  • List all articles from the specified {source}.

https://api.gregory-ms.com/articles/source/{source_id}

Example: https://api.gregory-ms.com/articles/source/1

  • List all available sources.

https://api.gregory-ms.com/articles/sources

Example: https://api.gregory-ms.com/articles/sources

Trials

  • List all trials.

https://api.gregory-ms.com/trials/all

Example: https://api.gregory-ms.com/trials/all

  • List all trials by keyword.

https://api.gregory-ms.com/trials/keyword/{keyword}

Example: https://api.gregory-ms.com/trials/keyword/myelin

Trials’ Sources

  • List all trials from the specified {source}.

https://api.gregory-ms.com/trials/source/{source}

Example: https://api.gregory-ms.com/trials/source/pubmed

  • List all available sources.

https://api.gregory-ms.com/trials/sources

Example: https://api.gregory-ms.com/trials/sources
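
Since the proposal is to port these endpoints to Django REST framework, here is a minimal sketch of what the articles endpoint could look like with DRF, reusing the Articles model shown further down this page; the gregory app label, the serializer fields and the router wiring are assumptions, not the final design.

from rest_framework import routers, serializers, viewsets

from gregory.models import Articles  # app label is an assumption


class ArticleSerializer(serializers.ModelSerializer):
	class Meta:
		model = Articles
		fields = ['article_id', 'title', 'summary', 'link', 'published_date', 'relevant']


class ArticleViewSet(viewsets.ReadOnlyModelViewSet):
	# Read-only list/detail endpoints, e.g. /articles/ and /articles/{id}/
	queryset = Articles.objects.all().order_by('-published_date')
	serializer_class = ArticleSerializer


router = routers.DefaultRouter()
router.register(r'articles', ArticleViewSet)
# urlpatterns = router.urls  # wired into the project's urls.py

Keyword, source and relevance lookups could then become query parameters handled by DRF filter backends instead of separate URL paths, though that is a design choice left to this issue.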

add more information about sources to the database

Example:

[
    {
        "source": "CUF",
        "link": "https://www.example.com"
    },
    {
        "source": "ClinicalTrials.gov",
        "link": "https://www.example.com"
    },
    {
        "source": "Novartis",
        "link": "https://www.example.com"
    }
]

Other relevant information to store: the link of the search page and the keywords we use.
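
A minimal sketch of where that extra information could live on the Sources model, assuming a URLField for the search page and a JSONField for the keywords; only the fields relevant to this issue are shown, and the new field names are assumptions.

from django.db import models

class Sources(models.Model):
	name = models.TextField(blank=True, null=True)
	link = models.TextField(blank=True, null=True)
	# assumed additions for this issue:
	search_page = models.URLField(blank=True, null=True, max_length=2000)
	keywords = models.JSONField(blank=True, null=True)  # e.g. ["myelin", "ocrelizumab"]

	class Meta:
		db_table = 'sources'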

error building the container on Ubuntu 21.04

$ sudo docker-compose up

Creating volume "gregory_flows" with local driver
Creating volume "gregory_python" with local driver
Creating node-red ... error

ERROR: for node-red  Cannot create container for service node-red: failed to mount local volume: mount ./docker-python:/var/lib/docker/volumes/gregory_python/_data, flags: 0x1000: no such file or directory

ERROR: for node-red  Cannot create container for service node-red: failed to mount local volume: mount ./docker-python:/var/lib/docker/volumes/gregory_python/_data, flags: 0x1000: no such file or directory
ERROR: Encountered errors while bringing up the project.

move database from SQLite to Postgres

reasons for it:

  • better handling of timestamp data
  • equal integration with Metabase
  • faster (?) response time

The best approach would be psql -d gregory -f ./docker-data/gregory.db, but it results in syntax errors because of the HTML values in some columns.
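
A workaround for those syntax errors is to copy rows with parameterised inserts instead of replaying raw SQL, so HTML stored in text columns never touches the statement syntax. A minimal sketch, assuming a local Postgres database named gregory and copying only a handful of columns (connection settings and column lists are assumptions):

import sqlite3

import psycopg2

# Source: the SQLite file used today; target: the new Postgres database.
sqlite_conn = sqlite3.connect('./docker-data/gregory.db')
pg_conn = psycopg2.connect(dbname='gregory', user='gregory', host='localhost')

def copy_table(table, columns):
	placeholders = ', '.join(['%s'] * len(columns))
	insert_sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
	rows = sqlite_conn.execute(f"SELECT {', '.join(columns)} FROM {table}").fetchall()
	with pg_conn, pg_conn.cursor() as cur:
		cur.executemany(insert_sql, rows)  # parameters keep the HTML intact

copy_table('articles', ['article_id', 'title', 'summary', 'link'])
copy_table('trials', ['trial_id', 'title', 'summary', 'link'])

pgloader is another common route for SQLite-to-Postgres migrations and handles type mapping automatically.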

deleting an article does not delete the relationship with the category(ies)

I must have missed something when I wrote the models; a diagnostic sketch for the category join table follows the models below.

from django.db import models

class Categories(models.Model):
	category_id = models.AutoField(primary_key=True)
	category_name = models.CharField(blank=True, null=True,max_length=200)
	category_description = models.TextField(blank=True, null=True)

	def __str__(self):
		return self.category_name

	class Meta:
		managed = True
		verbose_name_plural = 'categories'
		db_table = 'categories'

class Articles(models.Model):
	article_id = models.AutoField(primary_key=True)
	title = models.TextField(blank=False, null=False, unique=True)
	summary = models.TextField(blank=True, null=True)
	link = models.URLField(blank=False, null=False, max_length=2000)
	published_date = models.DateTimeField(blank=True, null=True)
	discovery_date = models.DateTimeField()
	source = models.ForeignKey('Sources', models.DO_NOTHING, db_column='source', blank=True, null=True)
	relevant = models.BooleanField(blank=True, null=True)
	ml_prediction_gnb = models.BooleanField(blank=True, null=True)
	ml_prediction_lr = models.BooleanField(blank=True, null=True)
	noun_phrases = models.JSONField(blank=True, null=True)
	categories = models.ManyToManyField(Categories)
	entities = models.ManyToManyField('Entities')
	sent_to_admin = models.BooleanField(blank=True, null=True)
	sent_to_subscribers = models.BooleanField(blank=True, null=True)
	sent_to_twitter = models.BooleanField(blank=True, null=True)
	doi = models.CharField(max_length=280, blank=True, null=True)

	def __str__(self):
		return str(self.article_id)

	class Meta:
		managed = True
		# unique_together = (('title', 'link'),)
		verbose_name_plural = 'articles'
		db_table = 'articles'


class Entities(models.Model):
	entity = models.TextField()
	label = models.TextField()


	class Meta:
		managed = True
		verbose_name_plural = 'entities'
		db_table = 'entities'


class Sources(models.Model):
	TABLES = [('articles', 'Articles'),('trials','Trials')]


	source_id = models.AutoField(primary_key=True)
	source_for = models.CharField(choices=TABLES, max_length=50, default='articles')
	name = models.TextField(blank=True, null=True)
	link = models.TextField(blank=True, null=True)
	language = models.TextField()
	subject = models.TextField()
	method = models.TextField()
	

	def __str__(self):
		return self.name

	class Meta:
		managed = True
		verbose_name_plural = 'sources'
		db_table = 'sources'


class Trials(models.Model):
	trial_id = models.AutoField(primary_key=True)
	discovery_date = models.DateTimeField(blank=True, null=True)
	title = models.TextField(blank=False,null=False, unique=True)
	summary = models.TextField(blank=True, null=True)
	link = models.URLField(blank=False, null=False, max_length=2000)
	published_date = models.DateTimeField(blank=True, null=True)
	source = models.ForeignKey('Sources', models.DO_NOTHING, db_column='source', blank=True, null=True)
	relevant = models.BooleanField(blank=True, null=True)
	sent = models.BooleanField(blank=True, null=True)
	sent_to_twitter = models.BooleanField(blank=True, null=True)
	sent_to_subscribers = models.BooleanField(blank=True, null=True)

	def __str__(self):
		return str(self.trial_id) 

	class Meta:
		managed = True
		verbose_name_plural = 'trials'
		db_table = 'trials'
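
As referenced above, a quick way to see whether stale category links are left behind is to query the auto-created join table directly. A minimal diagnostic sketch for the Django shell, assuming the models above live in a gregory app and the through model uses Django's default field names:

from gregory.models import Articles  # app label is an assumption

# Django's auto-created join model for Articles.categories
through = Articles.categories.through

existing_ids = Articles.objects.values_list('article_id', flat=True)
orphaned = through.objects.exclude(articles_id__in=existing_ids)

print(orphaned.count(), 'category links point at articles that no longer exist')
# orphaned.delete()  # uncomment to remove the stale links

If orphaned rows do show up, the deletion probably happened outside the ORM (raw SQL or node-red), since a normal Articles.delete() also removes the join rows.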

include spacy.io in node-red container

We are using https://github.com/explosion/spaCy to detect the noun phrases in the title of articles. This information is then used to list related articles on each page.

Half of the build process is running spaCy, so it should be included in the node-red flows to save that information in the database.

We could run it as a separate script, but I don't want to split the different processing steps between the container and the host server.
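
For reference, the noun-phrase step described above boils down to spaCy's noun_chunks; a minimal sketch, where the en_core_web_sm model and the example title are assumptions:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed English model

title = 'Gait rehabilitation outcomes in relapsing-remitting multiple sclerosis'
doc = nlp(title)
noun_phrases = [chunk.text for chunk in doc.noun_chunks]
print(noun_phrases)  # what would be stored in articles.noun_phrases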

running 3_predict.py returns an error using the scikit branch

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 100, in _predict_binary
    score = np.ravel(estimator.decision_function(X))
AttributeError: 'GaussianNB' object has no attribute 'decision_function'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "3_predict.py", line 120, in <module>
    data = predictor(dataset)
  File "3_predict.py", line 110, in predictor
    prediction = pipelines[model].predict([input])
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/metaestimators.py", line 113, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)  # noqa
  File "/usr/local/lib/python3.7/dist-packages/sklearn/pipeline.py", line 470, in predict
    return self.steps[-1][1].predict(Xt, **predict_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 457, in predict
    indices.extend(np.where(_predict_binary(e, X) > thresh)[0])
  File "/usr/local/lib/python3.7/dist-packages/sklearn/multiclass.py", line 103, in _predict_binary
    score = estimator.predict_proba(X)[:, 1]
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 125, in predict_proba
    return np.exp(self.predict_log_proba(X))
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 104, in predict_log_proba
    jll = self._joint_log_likelihood(X)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 489, in _joint_log_likelihood
    n_ij = -0.5 * np.sum(np.log(2.0 * np.pi * self.var_[i, :]))
AttributeError: 'GaussianNB' object has no attribute 'var_'
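
GaussianNB only gained the var_ attribute in scikit-learn 1.0 (it was sigma_ before), so one plausible cause is that the pickled model was fitted with a different scikit-learn release than the one running 3_predict.py. A minimal check, where the joblib file name is hypothetical:

import sklearn
from joblib import load

print('scikit-learn used for prediction:', sklearn.__version__)

pipeline = load('model_gnb.joblib')  # hypothetical file name
estimator = pipeline
if hasattr(estimator, 'steps'):        # unwrap a Pipeline
	estimator = estimator.steps[-1][1]
if hasattr(estimator, 'estimators_'):  # unwrap OneVsRestClassifier, as in the traceback
	estimator = estimator.estimators_[0]

print('has var_:', hasattr(estimator, 'var_'), 'has sigma_:', hasattr(estimator, 'sigma_'))

If only sigma_ is present, refitting the model with the same scikit-learn version used for prediction should clear both AttributeErrors.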

Automatic categorisation does not take synonyms into account

This is a caveat where the system fails to include articles in the corresponding category when a different noun is used for the same thing, for example Ocrelizumab and Ocrevus, or Natalizumab and Tysabri. Each pair refers to a single medication, but in the current state Gregory can only identify them as separate entities.
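
A minimal sketch of a manual synonym map applied before the category match; the drug pairs come from this issue, while the mapping and function names are illustrative only:

# Map brand names to the generic medication name before matching categories.
SYNONYMS = {
	'ocrevus': 'ocrelizumab',
	'tysabri': 'natalizumab',
}

def normalise(term):
	term = term.lower().strip()
	return SYNONYMS.get(term, term)

# Both spellings now resolve to the same category key.
assert normalise('Ocrevus') == normalise('ocrelizumab')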
