
docker-importer's People

Contributors

aot29, dajuno, eloiferrer, lizzalice, rimmoussa


docker-importer's Issues

create mardi_importer namespace for python modules

Currently, the Python package is called "mardi_importer" and includes the packages "zbmath", "wikidata", etc., which are imported directly via import zbmath, etc.
It seems cleaner to bundle these modules under an "umbrella" package, i.e., use the folder structure

docker-importer/
    src/
        mardi_importer/
            __init__.py
            zbmath/
            wikidata/
            ...

such that the imports work via from mardi_importer import zbmath.
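
A minimal sketch of the new layout, assuming the subpackages are re-exported in the umbrella package's __init__.py (this re-export is optional, since from mardi_importer import zbmath also works without it):

    # src/mardi_importer/__init__.py
    # Optionally re-export the importer subpackages under the umbrella namespace.
    from . import wikidata, zbmath

    # Before the restructuring:
    #     import zbmath
    # After the restructuring:
    #     from mardi_importer import zbmath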

TODOS:

  • create folder
  • adapt all imports in the .py files to the new structure

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

[Epic] Import additional data from zbMATH Open

Issue description:
Additional data from the zbMATH Open API should be imported into the MaRDI-Portal.
Related: #6

Remarks:

  • Import the IDs of the articles, the zbMath classification code and keywords, DOIs, document_title
  • Create authors as items
  • Set MaRDI oai zb preview format
  • Filter out duplicates
  • In Wikibase, add ZBMath classification code and keywords as items

TODO:

  • Create a new project for usage examples of the MaRDI-Portal
  • Setup Jupyter notebook and OpenRefine containers
  • Prototype query to zbMath in Jupyter notebook, document (data source)
  • Prototype and/or document import into Wikibase in Jupyter notebook (data sink)
  • Make a subset of the data for testing
  • #41
  • #62
  • handle duplicates
  • test import with smaller file
  • Make sure external identifiers link
  • Make sure XML-escaped strings are unescaped, e.g. "Computers \& Mathematics with Applications" (see the sketch after this list)
  • Do import of complete data set in portal
  • Document how to do an update
  • write tests
    see also: MaRDI4NFDI/portal-compose#82
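
A minimal sketch of the unescaping step, assuming the escaped strings come straight from the OAI XML response (the helper name is illustrative; the example string from the list above additionally looks TeX-escaped, so that case is handled as well):

    from xml.sax.saxutils import unescape

    def clean_string(raw: str) -> str:
        """Undo XML escaping (e.g. '&amp;' -> '&') and TeX-style escaping (e.g. '\\&' -> '&')."""
        s = unescape(raw, {"&quot;": '"', "&apos;": "'"})
        return s.replace("\\&", "&")

    # clean_string("Computers \\& Mathematics with Applications")
    # -> "Computers & Mathematics with Applications"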

Acceptance-Criteria

  • Data can be downloaded from zbMath
  • Data can be imported into MaRDI-Portal
  • New data can be imported
  • Import data incrementally
  • Duplicates are filtered out
  • Data can be rolled back

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Using Crossref data: "No sign-up is required to use the REST API, and the data can be treated as facts from members. The data is not subject to copyright, and you may use it for any purpose.

Crossref generally provides metadata without restriction; however, some abstracts contained in the metadata may be subject to copyright by publishers or authors." (https://www.crossref.org/documentation/retrieve-metadata/rest-api/)

Overwrite __repr__ methods for MardiEntities

New feature description in words:
The current implementation of __repr__ in BaseEntity.py makes it hard to get a quick overview of an entity.
__repr__ should return:

  • English/German label
  • English/German description
  • Key-value pairs for statements and qualifiers if present.
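
A minimal sketch of what such a __repr__ could look like, assuming the entity exposes labels, descriptions and claims in the WikibaseIntegrator style (attribute names are assumptions and may differ from the actual MardiEntities implementation):

    def __repr__(self) -> str:
        """Compact overview: English/German label and description plus statement key/value pairs."""
        lines = []
        for lang in ("en", "de"):
            label = self.labels.get(lang)
            if label:
                lines.append(f"label ({lang}): {label.value}")
            description = self.descriptions.get(lang)
            if description:
                lines.append(f"description ({lang}): {description.value}")
        for claim in self.claims:
            value = claim.mainsnak.datavalue.get("value") if claim.mainsnak.datavalue else None
            lines.append(f"{claim.mainsnak.property_number}: {value}")
            for qualifier in claim.qualifiers:
                qual_value = qualifier.datavalue.get("value") if qualifier.datavalue else None
                lines.append(f"    qualifier {qualifier.property_number}: {qual_value}")
        return "\n".join(lines)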

TODOS:

  • Rewrite __repr__ in MardiEntities.py to implement the schema above.

Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • The main MaRDI project has been assigned to this issue

Pull last version of WikibaseIntegrator in Dockerfile

Issue description:
Update the Dockerfile to install the latest version of WikibaseIntegrator.
(Check that MardiImporter still works after that.)

TODOS:

  • Delete lines 62-63 in Dockerfile

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

[Epic] ArXiv Importer

Epic description in words:
Requirements for the ArXiv importer

Additional Info:
Currently metadata can be imported through OAI (https://info.arxiv.org/help/oa/index.html). This includes:

  • publication date
  • arxiv ID
  • DOI
  • arXiv classification
  • Mathematics Subject Classification ID
  • author name strings

Corresponding Milestones:

  • corresponding milestone one

Epic issues:

  • issue one github-link
  • ...

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

polyDB updater

Issue description:
Metadata on 21 polyDB collections has been inserted into the KG.
The update functionality, i.e. overwriting existing entities with new information from polyDB.org and creating new collections, still has to be implemented.

TODOS:

  • Implement update() function in polydb/collection.py

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Build functionality to pull updates from wikidata for existing entities

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Import swMath software information

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

identify items that need to be imported for zbMath and set up import

Fields returned by ZBMath query:

  • ZBMath id --> document id --> zbmath_id.split(":")[-1]
  • author --> P50 --> author.split("|"); got mapping to orcid; will be included into the api at a later time (within zbmath: match by author_id)
  • author id --> zbMath author id P1556 --> author_ids.split("|")
  • document title --> title: P1476
  • source --> e.g. journal name --> published in P1433
    - edition: b = item.split(",")[0].split(" ")[-1]; keep only if b.isdigit()
    - page numbers: item.split("|")[0].split(",")[-1].strip().split(" ")[0]
    - year (slightly long): re.search(r"\(([^)]+)\)", item.split("|")[0].split(",")[-1].strip().split(" ")[-1]).group(1)
    (see the parsing sketch after this list)
  • classifications --> Mathematics Subject Classification ID P3285 (?) --> classifications.split("|")
  • language --> language of work or name P407
  • links --> url P2699
  • keywords -> just use strings --> or would it be better to make items for them?
  • doi --> DOI P356
  • publication year --> 'P577': 'publication date'
  • serial --> journal name: item.split("|")[0] ; publisher: item.split("|")[1].split(",")[0]
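
A hedged sketch that bundles the split logic above into one helper (the function name is illustrative, and the field layout is assumed to be exactly as described in this list):

    import re

    def parse_source(source: str, serial: str) -> dict:
        """Extract edition, pages, year, journal and publisher from the zbMATH 'source' and 'serial' fields."""
        first = source.split("|")[0]                       # e.g. "Comput. Math. Appl. 40, 123-145 (2000)"
        last_part = first.split(",")[-1].strip()           # e.g. "123-145 (2000)"

        edition = first.split(",")[0].split(" ")[-1]
        year_match = re.search(r"\(([^)]+)\)", last_part)  # the year sits in parentheses

        serial_parts = serial.split("|")
        return {
            "journal": serial_parts[0],
            "publisher": serial_parts[1].split(",")[0] if len(serial_parts) > 1 else None,
            "edition": edition if edition.isdigit() else None,
            "pages": last_part.split(" ")[0],
            "year": year_match.group(1) if year_match else None,
        }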

Fix python docstrings for sphinx

Issue description:

sphinx assumes that python docstrings are written in restructuredtext format, which differs slightly from markdown. I quite like the google style for docstrings: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings

TODOS:

  • agree on docstring style
  • adapt docstrings accordingly & be consistent in the future

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Add metrics exporter for Grafana

The importers should dump metrics that can be visualized by Prometheus/Grafana.

For instance:

  • number of imported entities (per import)
  • total number of entities
  • date/time of import
  • performance metrics: time, size of data?

@LizzAlice let's discuss

Create duplicate authors when ORCID not provided.

Whether an author already exists can only be checked if the ORCID ID is provided in CRAN.
If no ORCID ID is provided, duplicate entities will be created for each name, i.e. checking whether an author exists only by comparing the names of existing entities is not sufficient.
This results in multiple duplicate entities per author (since several authors have published more than one package).
Explore whether a better solution is possible to avoid duplicates.

[Epic] Integrate TA4 workflows into the portal

Epic description in words:
TA4 workflows are generated as markdown files.
The objective is to process these workflows and to create Wikibase entities and MediaWiki pages from them.
Once the workflow info is in the portal (entities and pages exist), the search functionality for existing workflows should be optimized.

TODOS:

  • #64
  • #65
  • #66
  • Create mediawiki page based on the info that has been inserted into wikibase and on the workflow template.
  • Predefine SPARQL queries to explore/visualize the imported workflows.

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

Insert Wikibase QID/PID on WikibaseImport entities

  • Properties and Items imported using WikibaseImport should be completed with a statement indicating their original ID in Wikidata.
  • This could be done after WikibaseImport finishes with a script that reads the wbs_entity_mapping sql table and creates the corresponding statements.
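
A hedged sketch of such a script, assuming direct database access and WikibaseIntegrator for writing; the endpoint, the credentials and the local property holding the original Wikidata ID (WIKIDATA_QID_PROP) are placeholders, and the wbs_entity_mapping column names should be verified against the actual schema:

    import pymysql
    from wikibaseintegrator import WikibaseIntegrator, wbi_login
    from wikibaseintegrator.datatypes import ExternalID
    from wikibaseintegrator.wbi_config import config as wbi_config

    wbi_config["MEDIAWIKI_API_URL"] = "https://portal.example.org/w/api.php"  # placeholder
    wbi_config["USER_AGENT"] = "MaRDI docker-importer (mapping back-fill)"
    WIKIDATA_QID_PROP = "P2"  # placeholder: local property holding the original Wikidata ID

    login = wbi_login.Login(user="ImporterBot", password="***")  # placeholder credentials
    wbi = WikibaseIntegrator(login=login)

    connection = pymysql.connect(host="db", user="root", password="***", database="my_wiki")
    with connection.cursor() as cursor:
        cursor.execute("SELECT wbs_original_id, wbs_local_id FROM wbs_entity_mapping")
        for original_id, local_id in cursor.fetchall():
            if not local_id.startswith("Q"):
                continue  # handle items here; properties could be treated analogously
            item = wbi.item.get(entity_id=local_id)
            item.claims.add(ExternalID(value=original_id, prop_nr=WIKIDATA_QID_PROP))
            item.write()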

Create regular user in docker container

The Docker container currently runs as root. It seems cleaner to create a regular user in the Dockerfile.

Note that this might interfere with #23 (pip install --user -e .)

Checklist for this issue:
(Some checks for making sure this feature-request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • MaRDI_Project has been assigned as project

Detangle papers for zbmath

In the zbmath upload, some papers were put into the same item because they had the same title and description. This is now fixed, since the DE number is appended to the description, but the cases in which this happened still need to be identified and re-uploaded.

  • formulate query for getting number of items containing more than one paper
  • formulate query to get number of papers it should be
  • formulate query to get and download de numbers for all papers that should be reuploaded
  • formulate query to get ids for all items with more than one paper
  • delete all items with more than one paper
  • upload

1.) Query for getting number of items where this happened:

SELECT (COUNT(?item) as ?count)
WHERE {
  SELECT ?item (COUNT(?hasID) as ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}

--> result: 15382 items

2.) Query for getting the number of papers it should be:

SELECT (SUM(?count) as ?totalCount)
WHERE {
  SELECT ?item (COUNT(?hasID) as ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}

--> result: 38021

3.) Query for downloading all item ids for entities that should be deleted:

SELECT ?item (COUNT(?hasID) as ?count)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)

4.) Query for getting all zbmath de numbers from these papers:

SELECT ?item (COUNT(?hasID) as ?count) (GROUP_CONCAT(?hasID; separator=", ") as ?ids)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)

--> doing this in one SPARQL query broke something, so this will most likely have to be done via the console instead (see the sketch below)
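
One way around the broken combined query is to fetch the plain item/ID pairs and do the grouping in Python; a hedged sketch (the endpoint URL is a placeholder for the portal's query service):

    from collections import defaultdict
    from SPARQLWrapper import JSON, SPARQLWrapper

    sparql = SPARQLWrapper("https://query.example.org/sparql", agent="MaRDI docker-importer (detangle)")
    sparql.setQuery("""
        SELECT ?item ?hasID
        WHERE {
          ?item wdt:P1451 ?hasID.
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    # Group the zbMATH DE numbers per item in Python instead of using GROUP_CONCAT.
    ids_per_item = defaultdict(list)
    for row in results["results"]["bindings"]:
        ids_per_item[row["item"]["value"]].append(row["hasID"]["value"])

    # Items containing more than one paper, together with the DE numbers to re-upload.
    tangled = {item: ids for item, ids in ids_per_item.items() if len(ids) > 1}
    print(len(tangled), sum(len(ids) for ids in tangled.values()))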

Import polyDB metadata

New feature description in words:
Import polyDB metadata (polydb.org) into the knowledge graph.

TODOS:

  • Import metadata through the Rest API (https://polymake.org/polytopes/paffenholz/polyDB-rest.html)
  • Clean metadata (disambiguate authors through ORCID, retrieve DOIs for publications)
  • Define new properties and items (polyDB collection, contributed by)
  • Import supporting entities from Wikidata
  • Create entities in the KG
  • Import locally the entire polyDB dataset
  • Publish the metadata in production
  • Write exemplary SPARQL queries

Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)

  • Participants in discussion have been invited as assignees
  • All fields of the issue have been filled
  • Example fields have been removed
  • The main MaRDI project has been assigned to this issue

Prepare OpenML import

Get Overview about data available via api. This will be documented here:

  • dataset data openml.datasets.list_datasets(output_format="dataframe")
    • did: unique dataset ID
    • name: non unique
    • version: int, the combination of name and version seems to be unique in every case but one
    • uploader: int (maybe this is a user id??)
    • status: "active" for all of them
    • format: one of ARFF, Sparse_ARFF, arff or sparse_arff
    • MajorityClassSize: number or NaN
    • MaxNominalAttDistinctValues: number or NaN
    • MinorityClassSize: number or NaN
    • NumberOfClasses: number or NaN
    • NumberOfFeatures: number or NaN
    • NumberOfInstances: number or NaN
    • NumberOfInstancesWithMissingValues: number or NaN
    • NumberOfMissingValues: number or NaN
    • NumberOfNumericFeatures: number or NaN
    • NumberOfSymbolicFeatures: number or NaN
  • evaluations (have to give evaluation function)
    • run_id: run id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • flow_name: flow name
    • data_id: dataset id?
    • data_name: dataset name?
    • function: evaluation function
    • upload_time: time it was uploaded
    • uploader: uploader number
    • uploader_name: name string
    • value: int
    • values: always None?
    • array_data: always None?
  • flows
    • id: unique id
    • full_name: name with number in parentheses
    • name: name of python class or function
    • version: number
    • external_version: None or package versions with package name in the form 'openml==0.14.1,sklearn==1.3.0'
    • uploader: number
  • runs
    • run_id: unique id
    • task_id: task id
    • setup_id: setup id
    • flow_id: flow id
    • uploader: number
    • task_type: instance of task type in the following form: TaskType.LEARNING_CURVE
    • upload_time: time in the format of 2014-04-06 23:30:40
    • error_message: string
  • setups:
    • setup_id: unique id
    • flow_id: flow id
    • parameters: dict of things that are given as numbers; the dicts contain information such as flow information, data_type, default_value etc
  • study openml.study.list_studies(output_format="dataframe") (a bit unclear what this is; only two are listed, although the IDs suggest there are more)
    • id: unique id, only 123 and 226
    • main_entity_type: "run"
    • status: "active"
    • creation_date: time in the format of 2019-02-21 19:55:30
    • creator: number
    • alias: NaN or "amlb"
  • tasks openml.tasks.list_tasks(output_format="dataframe")
    • tid: unique task id
    • ttid: String with task type in the form of TaskType.TASK_TYPE_NAME
    • did: dataset id
    • name: should be the task name, but actually looks like the dataset name
    • task_type: task type as in ttid, but in words
    • status: "active" for all of them
    • estimation_procedure: string
    • evaluation_measures: string or NaN
    • source_data: seems to be the same as did
    • target_feature: string
    • MajorityClassSize: number or NaN (is this the value from the dataset?)
    • MaxNominalAttDistinctValues: number or NaN (is this the value from the dataset?)
    • MinorityClassSize: number or NaN (is this the value from the dataset?)
    • NumberOfClasses: number or NaN (is this the value from the dataset?)
    • NumberOfFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfInstances: number or NaN (is this the value from the dataset?)
    • NumberOfInstancesWithMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfMissingValues: number or NaN (is this the value from the dataset?)
    • NumberOfNumericFeatures: number or NaN (is this the value from the dataset?)
    • NumberOfSymbolicFeatures: number or NaN (is this the value from the dataset?)
    • number_samples: number or NaN
    • cost_matrix: NaN or matrix in list of lists format or string or number
    • source_data_labeled: NaN or '1227' or '1451'
    • target_feature_event: NaN, or 'event' or 'OS_event'
    • target_feature_left: NaN
    • target_feature_right: NaN or "time" or "OS_years"
    • quality_measure: NaN or string
    • target_value: NaN or string

Dependencies: Task on Dataset; Run on Task, Setup and Flow; Setup on Flow, Evaluation on Run, Task, Setup, Flow, Dataset
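
The listing calls referenced above can be reproduced with the openml Python package; a minimal sketch (the size limits and the 'predictive_accuracy' evaluation function are illustrative):

    import openml

    # Listings as pandas DataFrames; 'size' keeps the calls small for a quick overview.
    datasets = openml.datasets.list_datasets(output_format="dataframe", size=100)
    tasks = openml.tasks.list_tasks(output_format="dataframe", size=100)
    studies = openml.study.list_studies(output_format="dataframe")

    # Evaluations require an evaluation function to be specified.
    evaluations = openml.evaluations.list_evaluations(
        function="predictive_accuracy", output_format="dataframe", size=100
    )

    print(datasets.columns)  # did, name, version, uploader, status, format, ...
    print(tasks.columns)     # tid, ttid, did, name, task_type, status, ...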

[Epic] Import CRAN Packages to MaRDI Portal

Issue description:
Since the findability of CRAN packages for R is a requirement for TA3, we aim to offer search functionality for these packages in the MaRDI Portal.

https://cran.r-project.org/web/packages/available_packages_by_date.html

A first step is to import the existing packages into our knowledge base (Wikibase) to make them interlinkable (?).
In a next step, find a way to keep the packages up to date.

TODOS:

  • import existing packages to wikibase
  • the import should be done in a way that allows the packages to be kept consistently up to date
  • #18
  • #11
  • #12
  • #20
  • #37
  • #38
  • #39
  • #40
  • #46
  • #49
  • #50
  • Replace WikibaseImport functions with WikibaseIntegrator functions
  • Disambiguate packages and authors through Wikidata
  • Deploy new functionality

Acceptance-Criteria

  • packages imported
  • follow up clarified

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue has been estimated

[Epic] Develop new importer strategy

Epic description in words:

Additional Info:

Corresponding Milestones:

  • corresponding milestone one

Epic issues:

  • Test Airflow, setup basic case.
  • Define automatic update workflow.
  • Setup Bot-users for each import source.
  • API for the mapping SQL table.

Related bugs:

Epic acceptance criteria:

  • first criterion

Checklist for this epic:

  • the main MaRDI project has been assigned as project
  • report has been created

Research SPARQL Import

Issue description:

  • How does synchronization work? Every 10 minutes?
  • Goal: SPARQL through the backend to recognize duplicate entities
  • Are there possible alternatives?

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Author disambiguation

Issue description:
The current importers (CRAN, zbMath, polyDB) create entities for authors using ORCID ID, zbMath ID or no identifier.
For the cases in which an identifier exists, authors might have been created more than once by different importers.
Duplicate authors should be identified, merged and completed with information from Wikidata.
The dataset mentioned here (MaRDI4NFDI/portal-compose#344) can be useful for the task.

TODOS:

  • For each author with a zbMath ID check if the Wikidata QID can be found
  • For each author with an ORCID ID check if the Wikidata QID can be found
  • For each author with a Wikidata QID, import if available zbMath ID, ORCID and arXiv author ID.
  • Try to get more ORCID IDs with zbMath API; see MaRDI4NFDI/portal-compose#487
  • Check arXiv author ID in e.g. https://arxiv.org/a/0000-0002-7970-7855.html
  • Merge duplicate entities.
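
A hedged sketch of the lookup step for the first two TODOs, querying Wikidata by zbMath author ID (P1556, as noted in the field mapping above); the analogous ORCID lookup would use Wikidata property P496 (the function name and user agent are illustrative):

    from SPARQLWrapper import JSON, SPARQLWrapper

    WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

    def find_wikidata_qid(zbmath_author_id: str) -> str | None:
        """Return the Wikidata QID of the author with this zbMath author ID (P1556), or None."""
        sparql = SPARQLWrapper(WIKIDATA_ENDPOINT, agent="MaRDI docker-importer (author disambiguation)")
        sparql.setQuery(f"""
            SELECT ?author WHERE {{
              ?author wdt:P1556 "{zbmath_author_id}" .
            }} LIMIT 1
        """)
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        if not bindings:
            return None
        return bindings[0]["author"]["value"].rsplit("/", 1)[-1]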

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

Limited creation of entities by WikibaseImport

Describe the bug
WikibaseImport is now only allowed to import 10 entities per minute. Then the following error is triggered:
"As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes."

It probably has to do with bot permissions in MediaWiki 1.39, see: https://www.mediawiki.org/wiki/Manual:$wgRateLimits

Expected behavior
WikibaseImport should be able to create as many new entities as necessary, without artificial limitations.

To Reproduce
Steps to reproduce the behavior:

  1. Inside docker-importer, execute a script that requires WikibaseImport to import a list of properties/items from Wikidata.

Screenshots
(If applicable, add screenshots to help explain your problem.)

Additional context
Add any other context about the problem here.

  • For example, information about the device used when reproducing the bug

Checklist for this issue:
(Some checks for making sure this issue is completely formulated)

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • MaRDI_Project has been assigned as project

Problems in KG to be fixed

  • several authors with the same zbMath ID were uploaded
  • join authors across several IDs, e.g. zbMath author ID to ORCID
  • zbMath papers like this one, with the same name (probably a book) but several zbMath IDs: https://portal.mardi4nfdi.de/wiki/Item:Q3408402 --> this apparently happened several times; affected items can be found with:

    SELECT ?item (COUNT(?hasID) as ?count)
    WHERE {
      ?item wdt:P1451 ?hasID.
    }
    GROUP BY ?item
    HAVING (COUNT(?hasID) > 1)
  • see if papers can be matched via orcid

Look into using Sphinx for documentation

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Pull data from zbMath

Issue description:

TODOS:

  • example todo to copy

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed

Add polyDB ID in the polyDB importer

Issue description:
The polyDB ID (https://portal.mardi4nfdi.de/wiki/Property:P1437) has been created manually and also manually added to the already existing polyDB collections.
The polyDB importer has to be modified to create this property during setup() and to add it to new collections automatically.

TODOS:

  • Add 'polyDB ID' with a formatter URL in new_entities.json
  • Add a statement with wdt:'polyDB ID' for each new collection.

Acceptance-Criteria

Checklist for this issue:

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • Issue has been assigned to the main project
  • Code was merged
  • Feature branch has been deleted and issues were updated / closed
  • Issue is tracked by an epic, or the label 'non-epic' is set to the issue.

PHP memory error after creating items

Describe the bug
A while after creating new items, the error "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 512000 bytes) in /var/www/html/includes/libs/rdbms/database/Database.php on line 1389" appears when accessing the item page.

Expected behavior
This error should not happen.

To Reproduce
Steps to reproduce the behavior:

  1. Import items with importer.
  2. Wait an undefined time <= 1 day.
  3. Access item page
  4. See error

Additional context
Add any other context about the problem here.

  • For example, information about the device used when reproducing the bug

Checklist for this issue:
(Some checks for making sure this issue is completely formulated)

  • Assignee has been set for this issue
  • All fields of the issue have been filled
  • MaRDI_Project has been assigned as project

Switch from Wikibase Importer to Wikibase Integrator

Epic description in words:
The Wikibase Importer lacks desired functionality, such as importing statements and property values. Thus, the plan is to switch to WikibaseIntegrator.

Issues:

  • fix entity/property.write() not working
    • reproduce error
    • try to find working format using WikibaseIntegrator code
  • get id from insert
  • method for checking if entity already exists
  • method for writing
  • method for updating
  • update database with wikidata id and local id
  • add claims to primary items
  • include wikidata id as claim when importing data from wikidata
  • set user agent when interacting with wikidata
  • permanently install wikibaseintegrator package in docker
  • restructure code
  • write tests
  • Remove WikibaseImport extension from docker-wikibase
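
A minimal sketch of the intended WikibaseIntegrator workflow, covering "method for writing", "get id from insert", "method for updating" and "set user agent" (endpoint URLs and credentials are placeholders):

    from wikibaseintegrator import WikibaseIntegrator, wbi_login
    from wikibaseintegrator.wbi_config import config as wbi_config

    # Placeholders: point these at the MaRDI portal instead of the defaults.
    wbi_config["MEDIAWIKI_API_URL"] = "https://portal.example.org/w/api.php"
    wbi_config["SPARQL_ENDPOINT_URL"] = "https://query.example.org/sparql"
    wbi_config["USER_AGENT"] = "MaRDI docker-importer (importer bot)"

    login = wbi_login.Login(user="ImporterBot", password="***")
    wbi = WikibaseIntegrator(login=login)

    # Write a new item and get its ID back from the insert.
    item = wbi.item.new()
    item.labels.set("en", "Example entity")
    item.descriptions.set("en", "entity created by the importer")
    created = item.write()
    print(created.id)  # e.g. 'Q1234' -- store together with the Wikidata ID in the mapping table

    # Load and update an existing entity.
    existing = wbi.item.get(entity_id=created.id)
    existing.labels.set("de", "Beispielentität")
    existing.write()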
