
docker-importer's People

Contributors

aot29, dajuno, eloiferrer, lizzalice, rimmoussa


docker-importer's Issues

create mardi_importer namespace for python modules

Currently, the Python package is called "mardi_importer" and includes the packages "zbmath", "wikidata", etc., which are imported directly via import zbmath, etc.
It seems cleaner to bundle these modules under an "umbrella" package, i.e., use the folder structure

docker-importer/
    src/
        mardi_importer/
            __init__.py
            zbmath/
            wikidata/
            ...

such that the imports work via from mardi_importer import zbmath.
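
A minimal sketch of the new layout, assuming the subpackages are re-exported in the umbrella package's __init__.py (this re-export is optional, since from mardi_importer import zbmath also works without it):

    # src/mardi_importer/__init__.py
    # Optionally re-export the importer subpackages under the umbrella namespace.
    from . import wikidata, zbmath

    # Before the restructuring:
    #     import zbmath
    # After the restructuring:
    #     from mardi_importer import zbmath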

TODOS:

  • create folder
  • adapt all imports in the .py files to the new structure

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

[Epic] Import additional data from zbMATH Open

Issue description:
Additional data from the zbMATH Open API should be imported into the MaRDI-Portal.
Related: #6

Remarks:

  • Import the IDs of the articles, the zbMath classification code and keywords, DOIs, document_title
  • Create authors as items
  • Set MaRDI oai zb preview format
  • Filter out duplicates
  • In Wikibase, add ZBMath classification code and keywords as items

TODO:

  • Create a new project for usage examples of the MaRDI-Portal
  • Setup Jupyter notebook and OpenRefine containers
  • Prototype query to zbMath in Jupyter notebook, document (data source)
  • Prototype and/or document import into Wikibase in Jupyter notebook (data sink)
  • Make a subset of the data for testing
  • #41
  • #62
  • handle duplicates
  • test import with smaller file
  • Make sure external identifiers link
  • Make sure XML-escaped strings are unescaped, e.g. "Computers \& Mathematics with Applications" (see the sketch after this list)
  • Do import of complete data set in portal
  • Document how to do an update
  • write tests
    see also: MaRDI4NFDI/portal-compose#82
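
A minimal sketch of the unescaping step, assuming the escaped strings come straight from the OAI XML response (the helper name is illustrative; the example string from the list above additionally looks TeX-escaped, so that case is handled as well):

    from xml.sax.saxutils import unescape

    def clean_string(raw: str) -> str:
        """Undo XML escaping (e.g. '&amp;' -> '&') and TeX-style escaping (e.g. '\\&' -> '&')."""
        s = unescape(raw, {"&quot;": '"', "&apos;": "'"})
        return s.replace("\\&", "&")

    # clean_string("Computers \\& Mathematics with Applications")
    # -> "Computers & Mathematics with Applications"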

Acceptance-Criteria

  • Data can be downloaded from zbMath
  • Data can be imported into MaRDI-Portal
  • New data can be imported
  • Import data incrementally
  • Duplicates are filtered out
  • Data can be rolled back

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Using Crossref data: "No sign-up is required to use the REST API, and the data can be treated as facts from members. The data is not subject to copyright, and you may use it for any purpose.

Crossref generally provides metadata without restriction; however, some abstracts contained in the metadata may be subject to copyright by publishers or authors." (https://www.crossref.org/documentation/retrieve-metadata/rest-api/)

Overwrite __repr__ methods for MardiEntities

New feature description in words:
The current implementation of __repr__ in BaseEntity.py makes it hard to get a quick overview of an entity.
__repr__ should return:

  • English/German label
  • English/German description
  • Key-value pairs for statements and qualifiers if present.
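
A minimal sketch of what such a __repr__ could look like, assuming the entity exposes labels, descriptions and claims in the WikibaseIntegrator style (attribute names are assumptions and may differ from the actual MardiEntities implementation):

    def __repr__(self) -> str:
        """Compact overview: English/German label and description plus statement key/value pairs."""
        lines = []
        for lang in ("en", "de"):
            label = self.labels.get(lang)
            if label:
                lines.append(f"label ({lang}): {label.value}")
            description = self.descriptions.get(lang)
            if description:
                lines.append(f"description ({lang}): {description.value}")
        for claim in self.claims:
            value = claim.mainsnak.datavalue.get("value") if claim.mainsnak.datavalue else None
            lines.append(f"{claim.mainsnak.property_number}: {value}")
            for qualifier in claim.qualifiers:
                qual_value = qualifier.datavalue.get("value") if qualifier.datavalue else None
                lines.append(f"    qualifier {qualifier.property_number}: {qual_value}")
        return "\n".join(lines)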

TODOS:

  • Rewrite __repr__ in MardiEntities.py to implement the schema above.

Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • The main MaRDI project has been assigned to this issue

Pull last version of WikibaseIntegrator in Dockerfile

Issue description:
Update the Dockerfile to install the latest version of WikibaseIntegrator.
(Check that MardiImporter still works after that.)

TODOS:

  • Delete lines 62-63 in Dockerfile

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

[Epic] ArXiv Importer

Epic description in words:
Requirements for the ArXiv importer

Additional Info:
Currently metadata can be imported through OAI (https://info.arxiv.org/help/oa/index.html). This includes:

  • publication date
  • arxiv ID
  • DOI
  • arXiv classification
  • Mathematics Subject Classification ID
  • author name strings

Corresponding Milestones:

  • corresponding milestone one

Epic issues:

  • issue one github-link
  • ...

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

polyDB updater

Issue description:
Metadata on 21 polyDB collections has been inserted into the KG.
The update functionality, i.e. overwriting existing entities with new information from polyDB.org and creating new collections, still has to be implemented.

TODOS:

  • Implement update() function in polydb/collection.py

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Build functionality to pull updates from wikidata for existing entities

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Import swMath software information

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

identify items that need to be imported for zbMath and set up import

Fields returned by ZBMath query:

  • ZBMath id --> document id --> zbmath_id.split(":")[-1]
  • author --> P50 --> author.split("|"); got mapping to orcid; will be included into the api at a later time (within zbmath: match by author_id)
  • author id --> zbMath author id P1556 --> author_ids.split("|")
  • document title --> title: P1476
  • source --> e.g. journal name --> published in P1433
    - edition: b = item.split(",")[0].split(" ")[-1]; keep only if b.isdigit()
    - page numbers: item.split("|")[0].split(",")[-1].strip().split(" ")[0]
    - year (slightly long): re.search(r"\(([^)]+)\)", item.split("|")[0].split(",")[-1].strip().split(" ")[-1]).group(1)
    (see the parsing sketch after this list)
  • classifications --> Mathematics Subject Classification ID P3285 (?) --> classifications.split("|")
  • language --> language of work or name P407
  • links --> url P2699
  • keywords -> just use strings --> or would it be better to make items for them?
  • doi --> DOI P356
  • publication year --> 'P577': 'publication date'
  • serial --> journal name: item.split("|")[0] ; publisher: item.split("|")[1].split(",")[0]
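
A hedged sketch that bundles the split logic above into one helper (the function name is illustrative, and the field layout is assumed to be exactly as described in this list):

    import re

    def parse_source(source: str, serial: str) -> dict:
        """Extract edition, pages, year, journal and publisher from the zbMATH 'source' and 'serial' fields."""
        first = source.split("|")[0]                       # e.g. "Comput. Math. Appl. 40, 123-145 (2000)"
        last_part = first.split(",")[-1].strip()           # e.g. "123-145 (2000)"

        edition = first.split(",")[0].split(" ")[-1]
        year_match = re.search(r"\(([^)]+)\)", last_part)  # the year sits in parentheses

        serial_parts = serial.split("|")
        return {
            "journal": serial_parts[0],
            "publisher": serial_parts[1].split(",")[0] if len(serial_parts) > 1 else None,
            "edition": edition if edition.isdigit() else None,
            "pages": last_part.split(" ")[0],
            "year": year_match.group(1) if year_match else None,
        }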

Fix python docstrings for sphinx

Issue description:

sphinx assumes that python docstrings are written in restructuredtext format, which differs slightly from markdown. I quite like the google style for docstrings: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings

TODOS:

  • agree on docstring style
  • adapt docstrings accordingly & be consistent in the future

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Add metrics exporter for Grafana

The importers should dump metrics that can be visualized by Prometheus/Grafana.

For instance:

  • number of imported entities (per import)
  • total number of entities
  • date/time of import
  • performance metrics: time, size of data?

@LizzAlice let's discuss

Create duplicate authors when ORCID not provided.

Whether an author already exists can only be checked if the ORCID ID is provided in CRAN.
If no ORCID ID is provided, duplicate entities will be created for each name, i.e. checking whether an author exists only by comparing the names of existing entities is not sufficient.
This results in multiple duplicate entities per author (since several authors have published more than one package).
Explore whether a better solution is possible to avoid duplicates.

[Epic] Integrate TA4 workflows into the portal

Epic description in words:
TA4 workflows are generated as markdown files.
The objective is to process these workflows and to create Wikibase entities and MediaWiki pages from them.
Once the workflow info is in the portal (entities and pages exist), the search functionality for existing workflows should be optimized.

TODOS:

  • #64
  • #65
  • #66
  • Create mediawiki page based on the info that has been inserted into wikibase and on the workflow template.
  • Predefine SPARQL queries to explore/visualize the imported workflows.

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

Insert Wikibase QID/PID on WikibaseImport entities

  • Properties and Items imported using WikibaseImport should be completed with a statement indicating their original ID in Wikidata.
  • This could be done after WikibaseImport finishes with a script that reads the wbs_entity_mapping sql table and creates the corresponding statements.
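
A hedged sketch of such a script, assuming direct database access and WikibaseIntegrator for writing; the endpoint, the credentials and the local property holding the original Wikidata ID (WIKIDATA_QID_PROP) are placeholders, and the wbs_entity_mapping column names should be verified against the actual schema:

    import pymysql
    from wikibaseintegrator import WikibaseIntegrator, wbi_login
    from wikibaseintegrator.datatypes import ExternalID
    from wikibaseintegrator.wbi_config import config as wbi_config

    wbi_config["MEDIAWIKI_API_URL"] = "https://portal.example.org/w/api.php"  # placeholder
    wbi_config["USER_AGENT"] = "MaRDI docker-importer (mapping back-fill)"
    WIKIDATA_QID_PROP = "P2"  # placeholder: local property holding the original Wikidata ID

    login = wbi_login.Login(user="ImporterBot", password="***")  # placeholder credentials
    wbi = WikibaseIntegrator(login=login)

    connection = pymysql.connect(host="db", user="root", password="***", database="my_wiki")
    with connection.cursor() as cursor:
        cursor.execute("SELECT wbs_original_id, wbs_local_id FROM wbs_entity_mapping")
        for original_id, local_id in cursor.fetchall():
            if not local_id.startswith("Q"):
                continue  # handle items here; properties could be treated analogously
            item = wbi.item.get(entity_id=local_id)
            item.claims.add(ExternalID(value=original_id, prop_nr=WIKIDATA_QID_PROP))
            item.write()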

Create regular user in docker container

The Docker container currently runs as root. It seems cleaner to create a regular user in the Dockerfile.

Note that this might interfere with #23 (pip install --user -e .)

Checklist for this issue:
(Some checks for making sure this feature-request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • MaRDI_Project has been assigned as project

Detangle papers for zbmath

In the zbmath upload, some papers were put into the same item because they had the same title and description. This is now fixed, since the DE number is appended to the description, but the cases in which this happened still need to be identified and re-uploaded.

  • formulate query for getting number of items containing more than one paper
  • formulate query to get number of papers it should be
  • formulate query to get and download de numbers for all papers that should be reuploaded
  • formulate query to get ids for all items with more than one paper
  • delete all items with more than one paper
  • upload

1.) Query for getting number of items where this happened:

SELECT (COUNT(?item) as ?count)
WHERE {
  SELECT ?item (COUNT(?hasID) as ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}

--> result: 15382 items

2.) Query for getting the number of papers it should be:

SELECT (SUM(?count) as ?totalCount)
WHERE {
  SELECT ?item (COUNT(?hasID) as ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}

--> result: 38021

3.) Query for downloading all item ids for entities that should be deleted:

SELECT ?item (COUNT(?hasID) as ?count)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)

4.) Query for getting all zbmath de numbers from these papers:

SELECT ?item (COUNT(?hasID) as ?count) (GROUP_CONCAT(?hasID; separator=", ") as ?ids)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)

--> doing this in one SPARQL query broke something, so this will most likely have to be done via the console instead (see the sketch below)
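
One way around the broken combined query is to fetch the plain item/ID pairs and do the grouping in Python; a hedged sketch (the endpoint URL is a placeholder for the portal's query service):

    from collections import defaultdict
    from SPARQLWrapper import JSON, SPARQLWrapper

    sparql = SPARQLWrapper("https://query.example.org/sparql", agent="MaRDI docker-importer (detangle)")
    sparql.setQuery("""
        SELECT ?item ?hasID
        WHERE {
          ?item wdt:P1451 ?hasID.
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Group the zbMATH DE numbers per item in Python instead of using GROUP_CONCAT.
    ids_per_item = defaultdict(list)
    for row in results["results"]["bindings"]:
        ids_per_item[row["item"]["value"]].append(row["hasID"]["value"])

    # Items containing more than one paper, together with the DE numbers to re-upload.
    tangled = {item: ids for item, ids in ids_per_item.items() if len(ids) > 1}
    print(len(tangled), sum(len(ids) for ids in tangled.values()))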

Import polyDB metadata

New feature description in words:
Import polyDB metadata (polydb.org) into the knowledge graph.

TODOS:

  • Import metadata through the Rest API (https://polymake.org/polytopes/paffenholz/polyDB-rest.html)
  • Clean metadata (disambiguate authors through ORCID, retrieve DOIs for publications)
  • Define new properties and items (polyDB collection, contributed by)
  • Import supporting entities from Wikidata
  • Create entities in the KG
  • Import locally the entire polyDB dataset
  • Publish the metadata in production
  • Write exemplary SPARQL queries

Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • The main MaRDI project has been assigned to this issue

Prepare OpenML import

Get Overview about data available via api. This will be documented here:

  • dataset data openml.datasets.list_datasets(output_format="dataframe")
    • did: unique dataset ID
    • name: non unique
    • version: int, the combination of name and version seems to be unique in every case but one
    • uploader: int (maybe this is a user id??)
    • status: "active" for all of them
    • format: one of ARFF, Sparse_ARFF, arff or sparse_arff
    • MajorityClassSize: number or NaN
    • MaxNominalAttDistinctValues: number or NaN
    • MinorityClassSize: number or NaN
    • NumberOfClasses: number or NaN
    • NumberOfFeatures: number or NaN
    • NumberOfInstances: number or NaN
    • NumberOfInstancesWithMissingValues: number or NaN
    • NumberOfMissingValues: number or NaN
    • NumberOfNumericFeatures: number or NaN
    • NumberOfSymbolicFeatures: number or NaN
  • evaluations (have to give evaluation function)
    • run_id: run id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • flow_name: flow name
    • data_id: dataset id?
    • data_name: dataset name?
    • function: evaluation function
    • upload_time: time it was uploaded
    • uploader: uploader number
    • uploader_name: name string
    • value: int
    • values: always None?
    • array_data: always None?
  • flows
    • id: unique id
    • full_name: name with number in parentheses
    • name: name of python class or function
    • version: number
    • external_version: None or package versions with package name in the form 'openml==0.14.1,sklearn==1.3.0'
    • uploader: number
  • runs
    • run_id: unique id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • uploader: number
    • task_type: instance of task type in the following form: TaskType.LEARNING_CURVE
    • upload_time: time in the format of 2014-04-06 23:30:40
    • error_message: string
  • setups:
    • setup_id: unique id
    • flow_id: flow id
    • parameters: dict of things that are given as numbers; the dicts contain information such as flow information, data_type, default_value etc
  • study openml.study.list_studies(output_format="dataframe") (a bit unclear what this is; only two are listed, although the IDs suggest there are more)
    • id: unique id, only 123 and 226
    • main_entity_type: "run"
    • status: "active"
    • creation_date: time in the format of 2019-02-21 19:55:30
    • creator: number
    • alias: NaN or "amlb"
  • tasks openml.tasks.list_tasks(output_format="dataframe")
    • tid: unique task id
    • ttid: String with task type in the form of TaskType.TASK_TYPE_NAME
    • did: dataset id
    • name: should be the task name, but actually looks like the dataset name
    • task_type: task type as in ttid, but in words
    • status: "active" for all of them
    • estimation_procedure: string
    • evaluation_measures: string or NaN
    • source_data: seems to be the same as did
    • target_feature: string
    • MajorityClassSize: number or NaN (is this the value from the dataset?)
    • MaxNominalAttDistinctValues: number or NaN (is this the value from the dataset?)
    • MinorityClassSize: number or NaN (is this the value from the dataset?)
    • NumberOfClasses: number or NaN (is this the value from the dataset?)
    • NumberOfFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfInstances: number or NaN (is this the value from the dataset?)
    • NumberOfInstancesWithMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfNumericFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfSymbolicFeatures: number or NaN (is this the value from the dataset?)
    • number_samples: number or NaN
    • cost_matrix: NaN or matrix in list of lists format or string or number
    • source_data_labeled: NaN or '1227' or '1451'
    • target_feature_event: NaN, or 'event' or 'OS_event'
    • target_feature_left: NaN
    • target_feature_right: NaN or "time" or "OS_years"
    • quality_measure: NaN or string
    • target_value: NaN or string

Dependencies: Task on Dataset; Run on Task, Setup and Flow; Setup on Flow, Evaluation on Run, Task, Setup, Flow, Dataset
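
The listing calls referenced above can be reproduced with the openml Python package; a minimal sketch (the size limits and the 'predictive_accuracy' evaluation function are illustrative):

    import openml

    # Listings as pandas DataFrames; 'size' keeps the calls small for a quick overview.
    datasets = openml.datasets.list_datasets(output_format="dataframe", size=100)
    tasks = openml.tasks.list_tasks(output_format="dataframe", size=100)
    studies = openml.study.list_studies(output_format="dataframe")

    # Evaluations require an evaluation function to be specified.
    evaluations = openml.evaluations.list_evaluations(
        function="predictive_accuracy", output_format="dataframe", size=100
    )

    print(datasets.columns)  # did, name, version, uploader, status, format, ...
    print(tasks.columns)     # tid, ttid, did, name, task_type, status, ...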

[Epic] Import CRAN Packages to MaRDI Portal

Issue description:
Since the findability of CRAN packages for R is a requirement for TA3, we aim to offer search functionality for these packages in the MaRDI Portal.

https://cran.r-project.org/web/packages/available_packages_by_date.html

A first step is to import the existing packages into our knowledge base (Wikibase) to make them interlinkable (?).
In a next step, find a way to keep the packages up to date.

TODOS:

  • import existing packages to wikibase
  • the import should be done in a way that allows the packages to be kept consistently up to date
  • #18
  • #11
  • #12
  • #20
  • #37
  • #38
  • #39
  • #40
  • #46
  • #49
  • #50
  • Replace WikibaseImport functions with WikibaseIntegrator functions
  • Disambiguate packages and authors through Wikidata
  • Deploy new functionality

Acceptance-Criteria

  • packages imported
  • follow up clarified

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue has been estimated

[Epic] Develop new importer strategy

Epic description in words:

Additional Info:

Corresponding Milestones:

  • corresponding milestone one

Epic issues:

  • Test Airflow, setup basic case.
  • Define automatic update workflow.
  • Setup Bot-users for each import source.
  • API for the mapping SQL table.

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

Research SPARQL Import

Issue description:

  • How does synchronization work? Every 10 minutes?
  • Goal: SPARQL through the backend to recognize duplicate entities
  • Are there possible alternatives?

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Author disambiguation

Issue description:
The current importers (CRAN, zbMath, polyDB) create entities for authors using ORCID ID, zbMath ID or no identifier.
For the cases in which an identifier exists, authors might have been created more than once by different importers.
Duplicate authors should be identified, merged and completed with information from Wikidata.
The dataset mentioned here (MaRDI4NFDI/portal-compose#344) can be useful for the task.

TODOS:

  • For each author with a zbMath ID check if the Wikidata QID can be found
  • For each author with an ORCID ID check if the Wikidata QID can be found
  • For each author with a Wikidata QID, import if available zbMath ID, ORCID and arXiv author ID.
  • Try to get more ORCID IDs with zbMath API; see MaRDI4NFDI/portal-compose#487
  • Check arXiv author ID in e.g. https://arxiv.org/a/0000-0002-7970-7855.html
  • Merge duplicate entities.
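
A hedged sketch of the lookup step for the first two TODOs, querying Wikidata by zbMath author ID (P1556, as noted in the field mapping above); the analogous ORCID lookup would use Wikidata property P496 (the function name and user agent are illustrative):

    from SPARQLWrapper import JSON, SPARQLWrapper

    WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

    def find_wikidata_qid(zbmath_author_id: str) -> str | None:
        """Return the Wikidata QID of the author with this zbMath author ID (P1556), or None."""
        sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="MaRDI docker-importer (author disambiguation)")
        sparql.setQuery(f"""
            SELECT ?author WHERE {{
              ?author wdt:P1556 "{zbmath_author_id}" .
            }} LIMIT 1
        """)
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        if not bindings:
            return None
        return bindings[0]["author"]["value"].rsplit("/", 1)[-1]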

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Limited creation of entities by WikibaseImport

Describe the bug
WikibaseImport is now only allowed to import 10 entities per minute. Then the following error is triggered:
"As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes."

It probably has to do with bot permissions in MediaWiki 1.39, see: https://www.mediawiki.org/wiki/Manual:$wgRateLimits

Expected behavior
WikibaseImport should be able to create as many new entities as necessary, without artificial limitations.

To Reproduce
Steps to reproduce the behavior:

  1. Inside docker-importer, execute a script that requires WikibaseImport to import a list of properties/items from Wikidata.

Screenshots
(If applicable, add screenshots to help explain your problem.)

Additional context
Add any other context about the problem here.

  • For example, information about the device used when reproducing the bug

Checklist for this issue:
(Some checks for making sure this issue is completely formulated)

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • MaRDI_Project has been assigned as project

Problems in KG to be fixed

  • several authors with the same zbMath ID were uploaded
  • join authors across several IDs, e.g. zbMath author ID to ORCID
  • zbMath papers like this one, with the same name (probably a book) but several zbMath IDs: https://portal.mardi4nfdi.de/wiki/Item:Q3408402 --> this apparently happened several times; affected items can be found with:

    SELECT ?item (COUNT(?hasID) as ?count)
    WHERE {
      ?item wdt:P1451 ?hasID.
    }
    GROUP BY ?item
    HAVING (COUNT(?hasID) > 1)
  • see if papers can be matched via orcid

Look into using Sphinx for documentation

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Pull data from zbMath

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Add polyDB ID in the polyDB importer

Issue description:
The polyDB ID (https://portal.mardi4nfdi.de/wiki/Property:P1437) has been created manually and also manually added to the already existing polyDB collections.
The polyDB importer has to be modified to create this property during setup() and to add it to new collections automatically.

TODOS:

  • Add 'polyDB ID' with a formatter URL in new_entities.json
  • Add a statement with wdt:'polyDB ID' for each new collection.

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

PHP memory error after creating items

Describe the bug
A while after creating new items, the error "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 512000 bytes) in /var/www/html/includes/libs/rdbms/database/Database.php on line 1389" appears when accessing the item page.

Expected behavior
This error should not happen.

To Reproduce
Steps to reproduce the behavior:

  1. Import items with importer.
  2. Wait an undefined time <= 1 day.
  3. Access item page
  4. See error

Additional context
Add any other context about the problem here.

  • For example, information about the device used when reproducing the bug

Checklist for this issue:
(Some checks for making sure this issue is completely formulated)

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • MaRDI_Project has been assigned as project

Switch from Wikibase Importer to Wikibase Integrator

Epic description in words:
The Wikibase Importer lacks desired functionality, such as importing statements and property values. Thus, the plan is to switch to WikibaseIntegrator.

Issues:

  • fix entity/property.write() not working
    • reproduce error
    • try to find working format using WikibaseIntegrator code
  • get id from insert
  • method for checking if entity already exists
  • method for writing
  • method for updating
  • update database with wikidata id and local id
  • add claims to primary items
  • include wikidata id as claim when importing data from wikidata
  • set user agent when interacting with wikidata
  • permanently install wikibaseintegrator package in docker
  • restructure code
  • write tests
  • Remove WikibaseImport extension from docker-wikibase
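
A minimal sketch of the intended WikibaseIntegrator workflow, covering "method for writing", "get id from insert", "method for updating" and "set user agent" (endpoint URLs and credentials are placeholders):

    from wikibaseintegrator import WikibaseIntegrator, wbi_login
    from wikibaseintegrator.wbi_config import config as wbi_config

    # Placeholders: point these at the MaRDI portal instead of the defaults.
    wbi_config["MEDIAWIKI_API_URL"] = "https://portal.example.org/w/api.php"
    wbi_config["SPARQL_ENDPOINT_URL"] = "https://query.example.org/sparql"
    wbi_config["USER_AGENT"] = "MaRDI docker-importer (importer bot)"

    login = wbi_login.Login(user="ImporterBot", password="***")
    wbi = WikibaseIntegrator(login=login)

    # Write a new item and get its ID back from the insert.
    item = wbi.item.new()
    item.labels.set("en", "Example entity")
    item.descriptions.set("en", "entity created by the importer")
    created = item.write()
    print(created.id)  # e.g. 'Q1234' -- store together with the Wikidata ID in the mapping table

    # Load and update an existing entity.
    existing = wbi.item.get(entity_id=created.id)
    existing.labels.set("de", "Beispielentität")
    existing.write()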
