mardi4nfdi / docker-importer
Import data from external data sources into the portal
Home Page: https://mardi4nfdi.github.io/docker-importer
Currently, the Python package is called "mardi_importer" and includes the packages "zbmath", "wikidata", etc., which can be imported via "import zbmath", etc.
It seems cleaner to bundle these modules into an "umbrella" module, i.e., folder structure
docker-importer/
    src/
        mardi_importer/
            __init__.py
            zbmath/
            wikidata/
            ...
such that the imports work via "from mardi_importer import zbmath".
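A minimal sketch of how setup.py could support this src/ layout (assuming setuptools; the actual packaging of the repo may differ):

from setuptools import setup, find_packages

setup(
    name="mardi_importer",
    package_dir={"": "src"},        # packages live under src/
    packages=find_packages("src"),  # finds mardi_importer and its subpackages
)

With this layout, an editable install (pip install -e .) makes "from mardi_importer import zbmath" work without manipulating sys.path.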
TODOS:
Checklist for this issue:
Issue description:
Additional data from the zbMATH Open API should be imported into the MaRDI Portal.
Related: #6
Remarks:
TODO:
Acceptance-Criteria
Checklist for this issue:
Using Crossref data: "No sign-up is required to use the REST API, and the data can be treated as facts from members. The data is not subject to copyright, and you may use it for any purpose.
Crossref generally provides metadata without restriction; however, some abstracts contained in the metadata may be subject to copyright by publishers or authors." (https://www.crossref.org/documentation/retrieve-metadata/rest-api/)
E.g. \& in zbmath titles (might also happen in other fields and in data from other sources)
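A minimal sketch of normalizing such escapes, assuming the goal is to turn LaTeX-style escapes like \& into the plain character:

import re

# Matches a backslash followed by one of the common LaTeX special characters.
LATEX_ESCAPES = re.compile(r"\\([&%$#_])")

def unescape_latex(text):
    """Replace LaTeX-escaped special characters with their plain form."""
    return LATEX_ESCAPES.sub(r"\1", text)

# unescape_latex(r"Methods \& Models") -> "Methods & Models"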
New feature description in words:
I find the current implementation of __repr__ in BaseEntity.py confusing when trying to get a quick overview of an entity.
__repr__ should return:
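As a hypothetical illustration (the exact fields to return are not specified here), a compact __repr__ showing ID, label, and description could look like this; the attribute names labels, descriptions, and id are assumptions about BaseEntity:

def __repr__(self):
    # Assumes labels/descriptions behave like dicts keyed by language code.
    label = self.labels.get("en", "<no label>")
    description = self.descriptions.get("en", "<no description>")
    return f"<BaseEntity {self.id}: {label} ({description})>"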
TODOS:
Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)
Issue description:
Update the Dockerfile to install the latest version of WikibaseIntegrator.
(Check that MardiImporter still works after that.)
TODOS:
Acceptance-Criteria
Checklist for this issue:
Epic description in words:
Requirements for the ArXiv importer
Additional Info:
Currently, metadata can be imported through OAI (https://info.arxiv.org/help/oa/index.html). This includes:
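Independent of the exact field list, a hedged sketch of harvesting arXiv metadata over OAI-PMH, assuming the third-party sickle package and arXiv's public OAI endpoint (neither is confirmed as the implementation choice here):

from sickle import Sickle

# arXiv's OAI-PMH endpoint; oai_dc is the plain Dublin Core metadata format.
harvester = Sickle("http://export.arxiv.org/oai2")
records = harvester.ListRecords(metadataPrefix="oai_dc", set="math")
for record in records:
    print(record.header.identifier, record.metadata.get("title"))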
Corresponding Milestones:
Epic issues:
Related bugs:
Epic acceptance criteria:
Checklist for this epic:
Will be filled later
Issue description:
Metadata on 21 polyDB collections has been inserted into the KG.
The update functionality to overwrite existing entities with new information from polyDB.org and to create new collections has to be implemented.
TODOS:
Acceptance-Criteria
Checklist for this issue:
Issue description:
TODOS:
Acceptance-Criteria
Checklist for this issue:
Issue description:
TODOS:
Acceptance-Criteria
Checklist for this issue:
R should be an instance of programming language (as in Q206904) and not an instance of R package.
In EntityCreator.py, implement the SQL connection with SQLAlchemy instead of mysql-connector-python, since the latter is not supported by pandas and results in user warnings.
When done, delete mysql-connector-python from setup.py.
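A minimal sketch of the intended change (connection string, driver, and table are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# pandas officially supports SQLAlchemy connectables, so this avoids the
# user warnings emitted when passing a raw mysql-connector-python connection.
engine = create_engine("mysql+pymysql://user:password@db-host/database")
df = pd.read_sql("SELECT * FROM some_table LIMIT 10", con=engine)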
Fields returned by ZBMath query:
Issue description:
Sphinx assumes that Python docstrings are written in reStructuredText format, which differs slightly from Markdown. I quite like the Google style for docstrings: https://github.com/google/styleguide/blob/gh-pages/pyguide.md#38-comments-and-docstrings
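Sphinx can parse Google-style docstrings with its built-in sphinx.ext.napoleon extension; a sketch of the conf.py change plus a sample docstring (the function is illustrative):

# conf.py
extensions = ["sphinx.ext.napoleon"]

def import_record(de_number):
    """Import a single zbMath record into the portal.

    Args:
        de_number: The zbMath de number of the record.

    Returns:
        The ID of the created or updated entity.
    """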
TODOS:
Checklist for this issue:
Could you document how the zbMATH Open data import can be started?
The importers should dump metrics that can be visualized by Prometheus/Grafana.
For instance:
@LizzAlice let's discuss
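A hypothetical sketch using the prometheus_client package (the metric names are made up):

from prometheus_client import Counter, start_http_server

IMPORTED = Counter("importer_entities_imported_total",
                   "Entities imported into the portal", ["source"])
FAILED = Counter("importer_entities_failed_total",
                 "Entities that failed to import", ["source"])

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
IMPORTED.labels(source="zbmath").inc()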
Whether an author already exists can only be checked if the ORCID ID is provided in CRAN.
If no ORCID ID is provided, a duplicate entity is created for each name, i.e. checking whether an author exists only by comparing the names of existing entities is not sufficient.
This will result in multiple duplicate entities for each author (since several authors have published more than one package).
Explore whether a better solution is possible to avoid duplicates.
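A sketch of the lookup order this implies; the helper functions are hypothetical:

def find_or_create_author(name, orcid=None):
    """Reuse an existing author item only when an ORCID ID allows a safe match."""
    if orcid:
        existing = find_item_by_orcid(orcid)  # hypothetical lookup helper
        if existing:
            return existing
    # Matching on name alone is unsafe (distinct people share names), so
    # without an ORCID ID a new item is created, accepting duplicates for now.
    return create_author_item(name, orcid)  # hypothetical creation helper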
Epic description in words:
TA4 workflows are generated as markdown files.
The objective is to process these workflows and, based on them, create Wikibase entities and MediaWiki pages.
Once the workflow info is in the portal (entities and pages exist), search functionality for existing workflows should be optimized.
TODOS:
Related bugs:
Epic acceptance criteria:
Checklist for this epic:
The docker container uses root. It seems cleaner to create a regular user in the Dockerfile.
Note that this might interfere with #23 (pip install --user -e .).
Checklist for this issue:
(Some checks for making sure this feature-request is completely formulated)
In the zbMath upload, some papers were put into the same item because they had the same title and description. This is now fixed because the de number gets appended to the description, but the cases in which this happened still need to be logged and re-uploaded.
1.) Query for getting the number of items where this happened:
SELECT (COUNT(?item) AS ?count)
WHERE {
  SELECT ?item (COUNT(?hasID) AS ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}
--> result: 15382 items
2.) Query for getting the number of papers there should be:
SELECT (SUM(?count) AS ?totalCount)
WHERE {
  SELECT ?item (COUNT(?hasID) AS ?count)
  WHERE {
    ?item wdt:P1451 ?hasID.
  }
  GROUP BY ?item
  HAVING (COUNT(?hasID) > 1)
}
--> result: 38021
3.) Query for downloading all item ids for entities that should be deleted:
SELECT ?item (COUNT(?hasID) AS ?count)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)
4.) Query for getting all zbmath de numbers from these papers:
SELECT ?item (COUNT(?hasID) AS ?count) (GROUP_CONCAT(?hasID; separator=", ") AS ?ids)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)
--> doing this in one SPARQL query broke something, so I will most likely have to do this via the console
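A sketch of running query 4.) from Python instead, assuming the SPARQLWrapper package (the endpoint URL is an assumption):

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item (GROUP_CONCAT(?hasID; separator=", ") AS ?ids)
WHERE {
  ?item wdt:P1451 ?hasID.
}
GROUP BY ?item
HAVING (COUNT(?hasID) > 1)
"""

sparql = SPARQLWrapper("https://query.portal.mardi4nfdi.de/sparql")  # assumed endpoint
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], row["ids"]["value"])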
New feature description in words:
Import polyDB metadata (polydb.org) into the knowledge graph.
TODOS:
Checklist for this issue:
(Some checks for making sure this feature request is completely formulated)
See title
Get an overview of the data available via the API. This will be documented here:
openml.datasets.list_datasets(output_format="dataframe")
openml.study.list_studies(output_format="dataframe")
(a bit unclear what this is, but there are only two... However, from the IDs, it seems as if there were more)
openml.tasks.list_tasks(output_format="dataframe")
Dependencies: Task depends on Dataset; Run on Task, Setup, and Flow; Setup on Flow; Evaluation on Run, Task, Setup, Flow, and Dataset.
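A runnable sketch that pulls these overviews with the openml package:

import openml

datasets = openml.datasets.list_datasets(output_format="dataframe")
studies = openml.study.list_studies(output_format="dataframe")
tasks = openml.tasks.list_tasks(output_format="dataframe")

for name, df in [("datasets", datasets), ("studies", studies), ("tasks", tasks)]:
    print(name, df.shape)  # quick size overview for the documentation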
Issue description:
Since the findability of CRAN packages for R is a requirement for TA3, we aim to offer a search functionality for these packages in the MaRDI Portal.
https://cran.r-project.org/web/packages/available_packages_by_date.html
A first step is to import the existing packages into our knowledge base (Wikibase) to make them interlinkable.
Then, in a next step, find a way to keep the packages up to date.
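A minimal sketch of pulling the current package list from the page above (assuming pandas.read_html can parse its table; requires lxml or html5lib):

import pandas as pd

URL = "https://cran.r-project.org/web/packages/available_packages_by_date.html"
packages = pd.read_html(URL)[0]  # the page contains a single large table
print(len(packages), "packages")
print(packages.head())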
TODOS:
Acceptance-Criteria
Checklist for this issue:
Epic description in words:
Additional Info:
Corresponding Milestones:
Epic issues:
Related bugs:
Epic acceptance criteria:
Checklist for this epic:
Issue description:
TODOS:
Acceptance-Criteria
Checklist for this issue:
Issue description:
The current importers (CRAN, zbMath, polyDB) create entities for authors using ORCID ID, zbMath ID or no identifier.
For the cases in which an identifier exists, authors might have been created more than once by different importers.
Duplicate authors should be identified, merged and completed with information from Wikidata.
The dataset mentioned here (MaRDI4NFDI/portal-compose#344) can be useful for the task.
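A sketch of detecting candidate duplicates via SPARQL from Python; P496 is the ORCID ID property on Wikidata, and whether the local property number matches is an assumption:

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?orcid (GROUP_CONCAT(?author; separator=", ") AS ?items)
WHERE {
  ?author wdt:P496 ?orcid.
}
GROUP BY ?orcid
HAVING (COUNT(?author) > 1)
"""

sparql = SPARQLWrapper("https://query.portal.mardi4nfdi.de/sparql")  # assumed endpoint
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
duplicates = sparql.query().convert()["results"]["bindings"]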
TODOS:
Acceptance-Criteria
Checklist for this issue:
Describe the bug
WikibaseImport is now only allowed to import 10 entities per minute; beyond that, the following error is triggered:
"As an anti-abuse measure, you are limited from performing this action too many times in a short space of time, and you have exceeded this limit. Please try again in a few minutes."
It probably has to do with bot permissions in MediaWiki 1.39, see: https://www.mediawiki.org/wiki/Manual:$wgRateLimits
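If the rate limits are indeed the cause, a sketch of the LocalSettings.php change (to be verified against the $wgRateLimits manual page linked above):

# LocalSettings.php -- exempt the bot group from rate limiting (sketch)
$wgGroupPermissions['bot']['noratelimit'] = true;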
Expected behavior
WikibaseImport should be able to create as many new entities as necessary, without artificial limitations.
To Reproduce
Steps to reproduce the behavior:
Screenshots
(If applicable, add screenshots to help explain your problem.)
Additional context
Add any other context about the problem here.
Checklist for this issue:
(Some checks for making sure this issue is completely formulated)
When articles are imported from zbMath that were already imported from Crossref (e.g. https://portal.mardi4nfdi.de/wiki/Item:Q149570), there can be problems:
Suggested solutions:
Issue description:
TODOS:
Acceptance-Criteria
Checklist for this issue:
Issue description:
TODOS:
Acceptance-Criteria
Checklist for this issue:
Issue description:
The polyDB ID (https://portal.mardi4nfdi.de/wiki/Property:P1437) has been created manually and also manually added to the already existing polyDB collections.
The polyDB importer has to be modified to create this property during setup() and to add it to new collections automatically.
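A sketch of creating the property during setup() with WikibaseIntegrator (the calls follow the wikibaseintegrator 0.12 API; treat the details as assumptions):

from wikibaseintegrator import WikibaseIntegrator

def setup_polydb_id_property(login):
    """Create the polyDB ID property (sketch; assumes endpoints are configured)."""
    wbi = WikibaseIntegrator(login=login)
    prop = wbi.property.new(datatype="external-id")
    prop.labels.set("en", "polyDB ID")
    prop.descriptions.set("en", "identifier of a collection in polyDB.org")
    return prop.write()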
TODOS:
Acceptance-Criteria
Checklist for this issue:
Describe the bug
A while after creating new items, the error "Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 512000 bytes) in /var/www/html/includes/libs/rdbms/database/Database.php on line 1389" appears when accessing the item page.
Expected behavior
This error should not happen.
To Reproduce
Steps to reproduce the behavior:
Additional context
Add any other context about the problem here.
Checklist for this issue:
(Some checks for making sure this issue is completely formulated)
Epic description in words:
The Wikibase Importer lacks desired functionality, such as importing statements and property values. Thus, the plan is to switch to the Wikibase Integrator.
Issues: