
gsoc's Introduction

DBpedia GSoC projects

This repo is used to coordinate efforts around the Google Summer of Code (GSoC) program in the DBpedia community.

The Summer of Code program sponsors students to work on Open Source projects during the summer. Many students take jobs over the summer holidays to support themselves. Instead, why not exercise your skills for three months while getting paid?

2019

Communication

Join us on Slack and on the mailing list, comment or create an issue on this repo to discuss a specific topic.

Contribute

Becoming a mentor

A mentor's role is to help the student with their work and evaluate them at the end of the program. In exchange, you will get help advancing one of the projects you are interested in.

Becoming a student

As a student, if you get picked, you will be paid for three months during the summer to work on one of the projects in the list. If there are no projects in the list that you want to work on, talk to us and bring your own ideas; we might still be interested in mentoring you.


gsoc's Issues

Improve the Mapping experience, possibly with gamification (use the new RML mapping UI)

Description

Gamify the mapping process

Goals

At the moment, the mapping process is not very appealing to users. The goal is to improve the user experience and make the mapping process easier, more fun, and more engaging than it currently is.
Impact: Increase the coverage and precision of the infobox-to-ontology mappings.

Warm up tasks

Write a script that calculates the most active contributor of the week / month
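
A minimal sketch of such a script, assuming the edit history has already been exported to a CSV of (contributor, timestamp) rows (for example from the mappings wiki's recent changes); the file name and column names are placeholders:

```python
import csv
from collections import Counter
from datetime import datetime, timedelta

def most_active(path, days):
    """Return the contributor with the most edits in the last `days` days."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):  # expects columns: contributor, timestamp
            when = datetime.fromisoformat(row["timestamp"])
            if when >= cutoff:
                counts[row["contributor"]] += 1
    return counts.most_common(1)

print("This week:", most_active("edits.csv", days=7))
print("This month:", most_active("edits.csv", days=30))
```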

Mentors

Keywords

UI, mapping

Automatic schema alignment between DBpedia mappings in different languages

Description

DBpedia has diverse mapping communities, and ambiguity between languages or a lack of coordination can lead to the wrong properties being used in a DBpedia mapping for a given language. A classic example is the elevation of a mountain, which is described and mapped as dbo:elevation in most languages except Spanish, where dbo:height was used because both concepts share the same term in that language. This project aims to identify and, ideally, correct all such mappings in DBpedia. The working approach is described in “Predicting Incorrect Mappings: A Data-Driven Approach Applied to DBpedia”.

Goals

Apply and improve the ideas of the paper on the actual DBpedia mappings. The paper provides a working proof of concept, but it needs to run in the open and adjustments will probably be needed. The goal is to apply the techniques to as many language pairs as possible and identify all misaligned mappings. A great add-on would be a simple interface that presents the identified wrong mappings to the mapping community, lets them vote on whether a mapping is indeed incorrect and, if so, suggest the proper one.
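
As a rough illustration of the data-driven approach, a classifier can be trained on labelled mapping pairs, assuming feature vectors (e.g. from the training set linked in the warm-up tasks) have already been exported; the file name and feature layout below are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Each row describes one (language A property, language B property) mapping pair;
# the last column says whether the pair is correctly aligned (1) or misaligned (0).
data = np.loadtxt("mapping_pairs_features.csv", delimiter=",", skiprows=1)
X, y = data[:, :-1], data[:, -1]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1")
print("10-fold F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# A fitted model can then rank all existing cross-language mappings by the
# predicted probability of being misaligned, for later review by the community.
```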

Impact

A very large impact on DBpedia data quality.

Warm up tasks

  • Read the latest main DBpedia paper to get to know how the framework and the mappings work
  • Write a few actual infobox mappings in your language
  • Read this paper
  • Re-run the experiments described in the paper. Here you may find the training set: https://www.openml.org/s/53
  • Experiment with a few other algorithms

Mentors

Mariano Rico, Nandana Mihindukulasooriya, Dimitris Kontokostas

Keywords

Machine learning, schema alignment

Extending Extraction Framework with Citations, Commons and Lexemes Extractors

Description

DBpedia is a crowd-sourced community effort to extract structured content from the various Wikimedia projects and make it publicly available to everyone on the Web. This project will extend the DBpedia extraction process (https://github.com/dbpedia/extraction-framework), which is continuously developed by the community, with citations, Commons and lexemes information.

Goals

The student will develop the required modules to parse the information from each specific source. The developed modules will be used to extract a wider range of knowledge from the Wikimedia projects, which will be made openly available to communities with different interests and language editions.
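
As a rough, framework-independent illustration of what a citation extractor does, the sketch below pulls {{cite ...}} templates out of wikitext with a regular expression and emits simple triples; the predicate URIs are placeholders, and a real extractor would be written against the extraction framework itself:

```python
import re

WIKITEXT = """
Berlin is the capital of Germany.<ref>{{cite web |url=https://example.org/berlin |title=Berlin facts}}</ref>
"""

CITE_RE = re.compile(r"\{\{\s*cite\s+\w+([^}]*)\}\}", re.IGNORECASE)
PARAM_RE = re.compile(r"\|\s*(\w+)\s*=\s*([^|}]+)")

def extract_citation_triples(page_uri, wikitext):
    """Yield simple N-Triples lines for every {{cite ...}} template on a page."""
    for i, match in enumerate(CITE_RE.finditer(wikitext)):
        params = dict(PARAM_RE.findall(match.group(1)))
        citation = f"{page_uri}/citation/{i}"
        yield f"<{page_uri}> <http://example.org/ontology/citation> <{citation}> ."
        if "url" in params:
            yield f"<{citation}> <http://example.org/ontology/sourceUrl> <{params['url'].strip()}> ."
        if "title" in params:
            yield f'<{citation}> <http://example.org/ontology/sourceTitle> "{params["title"].strip()}" .'

for triple in extract_citation_triples("http://dbpedia.org/resource/Berlin", WIKITEXT):
    print(triple)
```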

Impact

The triples created for each specific type of knowledge will be published for community usage.

Warm up tasks

Gain preliminary experience with the Extraction Framework
#8
#9

Mentors

TBA

Keywords

Extraction framework, text parsing, RDF generation

Mapping Generation from Resource Descriptions

Description

DBpedia currently maintains mappings from Wikipedia infobox template properties to the DBpedia ontology, since several similar templates exist (within a single language as well as across languages) to describe closely related types of infoboxes. The aim of the project is to enrich and possibly correct the existing mappings with a data-driven method that proposes or generates mappings automatically by analyzing instance data from distinct language-specific datasets. This is a follow-up of a previous GSoC project, which mainly mapped classes to infobox templates.
A central goal is also to map Wikidata property identifiers.

Goals

Provide suggestions (e.g. based on statistical probabilities) for which DBpedia ontology and Wikidata properties a given template parameter should be mapped to.
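
A toy sketch of the statistical idea: count, for each infobox template parameter, how often its values coincide with the values of already-mapped ontology properties on the same resources, and suggest the most frequent property (the sample data below is made up):

```python
from collections import Counter, defaultdict

# (resource, template parameter, raw infobox value) taken from instance data
infobox_facts = [
    ("Berlin", "einwohner", "3645000"),
    ("Hamburg", "einwohner", "1841000"),
    ("Berlin", "flaeche", "891.7"),
]

# (resource, ontology property, value) from an already-mapped language edition
mapped_facts = [
    ("Berlin", "dbo:populationTotal", "3645000"),
    ("Hamburg", "dbo:populationTotal", "1841000"),
    ("Berlin", "dbo:areaTotal", "891.7"),
]

by_resource_value = defaultdict(set)
for resource, prop, value in mapped_facts:
    by_resource_value[(resource, value)].add(prop)

suggestions = defaultdict(Counter)
for resource, parameter, value in infobox_facts:
    for prop in by_resource_value.get((resource, value), ()):
        suggestions[parameter][prop] += 1

for parameter, counts in suggestions.items():
    prop, votes = counts.most_common(1)[0]
    confidence = votes / sum(counts.values())
    print(f"{parameter} -> {prop} (confidence {confidence:.2f})")
```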

Impact

Increase the coverage for already mapped languages as well as for not-yet-mapped languages, which ultimately leads to better data quality.

Warm up tasks

Familiarize yourself with and evaluate the results of the previous project's code base (there is no fixed requirement to re-use it).

Mentors

Keywords

mappings, knowledge base completion, data quality

A Neural QA Model for DBpedia

Description

In recent years, the Linked Data Cloud has grown to over 100 billion facts pertaining to a multitude of domains. The DBpedia knowledge base alone describes 4.58 million things. However, accessing this information is challenging for lay users, as they are not able to use SPARQL as a query language without extensive training.
Recently, Deep Learning architectures based on Neural Networks, called seq2seq, have been shown to achieve state-of-the-art results at translating sequences into sequences. In this direction, we suggest a GSoC topic around Neural Networks that translate any natural language expression into sentences encoding SPARQL queries. Our preliminary work on Question Answering with Neural SPARQL Machines (NSpM) shows promising results, but it is restricted to selected DBpedia classes.
In this GSoC project, the candidate will extend the NSpM to cover more classes of DBpedia and to enable high-quality Question Answering.
The source code can be found here; however, we will use this repository as a workspace.
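
To make the template idea concrete, here is a toy sketch that pairs a natural language pattern with a SPARQL pattern and instantiates both with entities of a given class to produce seq2seq training pairs; the exact NSpM template format may differ:

```python
template = {
    "question": "who is the mayor of <A>",
    "query": "SELECT ?x WHERE { <A> dbo:mayor ?x }",
    "placeholder_class": "dbo:City",
}

# Entities of the placeholder class, e.g. retrieved from the DBpedia endpoint.
cities = ["dbr:Berlin", "dbr:Paris", "dbr:Rome"]

# Instantiating the template yields (question, query) training pairs for seq2seq.
pairs = [
    (template["question"].replace("<A>", city.split(":")[1]),
     template["query"].replace("<A>", city))
    for city in cities
]

for question, query in pairs:
    print(question, "\t", query)
```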

Goals

  • Create query templates for DBpedia.
  • Train the NSpM recurrent neural network for complex question answering on DBpedia.
  • (Optional) Evaluate the model against the QALD benchmark.

Impact

The project will allow users to access DBpedia knowledge using natural language.

Warm-up tasks

Mentors

Tommaso Soru, Edgard Marx, Ricardo Usbeck

Keywords

question answering, deep learning, neural networks, sparql, tensorflow, python

A Multilingual Neural RDF Verbalizer

Description:

Natural Language Generation (NLG) is the process of generating coherent natural language text from non-linguistic data (Reiter and Dale, 2000). Despite community agreement on the actual text and speech output of these systems, there is far less consensus on what the input should be (Gatt and Krahmer, 2017). A large number of inputs have been taken for NLG systems, including images (Xu et al., 2015), numeric data (Gkatzia et al., 2014), semantic representations (Theune et al., 2001) and Semantic Web (SW) data (Ngonga Ngomo et al., 2013; Bouayad-Agha et al., 2014). Presently, the generation of natural language from SW, more precisely from RDF data, has gained substantial attention (Bouayad-Agha et al., 2014; Staykova, 2014). Some challenges have been proposed to investigate the quality of automatically generated texts from RDF (Colin et al., 2016). Moreover, RDF has demonstrated a promising ability to support the creation of NLG benchmarks (Gardent et al., 2017). However, English is the only language which has been widely targeted. Even though there are studies which explore the generation of content in languages other than English, to the best of our knowledge, no work has been proposed to train a multilingual neural model for generating texts in different languages from RDF data.

Goals:

In this GSoC project, the candidate will train a multilingual neural model capable of generating natural language sentences from DBpedia RDF triples.
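
A minimal sketch of the usual preprocessing step for such a model: linearizing RDF triples into a source sequence, prefixed with a target-language token so that a single multilingual encoder-decoder can be trained; the tagging scheme below is an assumption, not a prescribed format:

```python
def linearize(triples, target_language):
    """Turn a set of RDF triples into a single source string for a seq2seq model."""
    parts = [f"<{target_language}>"]  # language tag tells the decoder what to generate
    for subject, predicate, obj in triples:
        parts.append(f"<subj> {subject} <pred> {predicate} <obj> {obj}")
    return " ".join(parts)

triples = [
    ("Berlin", "dbo:country", "Germany"),
    ("Berlin", "dbo:populationTotal", "3645000"),
]

# Source/target pairs like these would feed a multilingual encoder-decoder.
print(linearize(triples, "en"), "->", "Berlin is a city in Germany with 3,645,000 inhabitants.")
print(linearize(triples, "de"), "->", "Berlin ist eine Stadt in Deutschland mit 3.645.000 Einwohnern.")
```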

Impact:

The project may allow users to automatically generate, from triples, short summaries for entities that do not have a human-written abstract.

Warm-up tasks:

Mentors

Diego Moussallem

Keywords

NLG, Semantic Web, NLP

Pay-As-You-Go Quality Evaluation of the DBpedia Resources

Description

Extracting triples from unstructured sources causes several quality problems, which in turn lead to wrong results in information retrieval systems. Quality has both a subjective and an objective side: some quality dimensions can be assessed using generic tools (objective), while others need crowd-sourced evaluation of the resource (subjective). This project aims at pay-as-you-go quality evaluation of DBpedia resources in an information retrieval setting, taking both subjective and objective quality dimensions into account. The user issues a query together with a quality threshold for a specific dimension (trust, freshness, etc.). For the given query, the results are shown (as a black box) and feedback is collected from the user. According to that feedback, the quality graph of the resource is updated with respect to the given quality dimension.
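
One possible shape of the incremental update, sketched as an exponential moving average over binary user feedback per resource and quality dimension; the smoothing factor and the in-memory storage are assumptions, not the project's prescribed design:

```python
class QualityGraph:
    """Keeps one score in [0, 1] per (resource, quality dimension)."""

    def __init__(self, alpha=0.1, prior=0.5):
        self.alpha = alpha      # how strongly a single feedback moves the score
        self.prior = prior      # score assumed before any feedback is seen
        self.scores = {}

    def update(self, resource, dimension, feedback_ok):
        key = (resource, dimension)
        old = self.scores.get(key, self.prior)
        signal = 1.0 if feedback_ok else 0.0
        self.scores[key] = (1 - self.alpha) * old + self.alpha * signal
        return self.scores[key]

    def passes(self, resource, dimension, threshold):
        return self.scores.get((resource, dimension), self.prior) >= threshold


graph = QualityGraph()
graph.update("dbr:Berlin", "freshness", feedback_ok=True)
graph.update("dbr:Berlin", "freshness", feedback_ok=False)
print(graph.passes("dbr:Berlin", "freshness", threshold=0.4))
```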

Goals

The goals of the candidate are as follows:

  • implement a web interface that allows users to issue a query together with their quality preferences and retrieve the results;
  • implement an algorithm that incrementally computes the quality of a resource from user feedback;
  • build a system that maps the quality results of the resources onto the quality graph of each source and updates that graph for the given feedback.

Impact

The project will provide a system that computes data quality in a pay-as-you-go manner and provides structured quality graphs for the resources.

Warm up tasks

Mentors

TBD (possible names: Beyza Yaman).

Keywords

Quality, Feedback, Crowd-source, Information retrieval

Workflow for linking external datasets

Description

DBpedia has long been known as a central hub in the Linked Open Data (LOD) cloud: numerous datasets are linked from DBpedia and even more link to DBpedia resources. Nevertheless, additional outgoing links are highly appreciated. In the DBpedia Links repository, people can upload their linksets or scripts to generate them. What is currently missing is a toolset to create links between DBpedia and external datasets automatically and to review them where needed. State-of-the-art tools for link creation between datasets are SILK and LIMES.
The student should create a workflow for automatically generating links, and for curating the automatically generated links, whenever a new dataset should be linked to DBpedia. The workflow should be supported by a web-based GUI.

There are two levels of linking:

  1. schema level: linking external vocabularies to the DBpedia ontology
  2. instance level: linking resources from external datasets to DBpedia resources
    Both levels can be considered in this project.
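
As a taste of what tools like SILK and LIMES do under the hood, here is a naive instance-level linking sketch based on label similarity; a real workflow would use the tools' own link specifications, and the labels below are made up:

```python
from difflib import SequenceMatcher

dbpedia = {
    "dbr:Berlin": "Berlin",
    "dbr:Munich": "Munich",
}
external = {
    "ext:0001": "Berlín",
    "ext:0002": "Muenchen",
}

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.7  # candidate links above this score go to manual review

for db_uri, db_label in dbpedia.items():
    for ext_uri, ext_label in external.items():
        score = similarity(db_label, ext_label)
        if score >= THRESHOLD:
            print(f"<{db_uri}> owl:sameAs <{ext_uri}> .  # score={score:.2f}")
```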

Goals

Impact

Get better links from DBpedia to external datasets.

Warm up tasks

Mentors

Keywords

data quality, linking

Explainable Knowledge Discovery on DBpedia

Description

The latest DBpedia release comprises a knowledge graph with 326,035,765 edges. As the data originates from a human-generated wiki, the graph is far from complete. This project idea aims at the realisation of an algorithm to perform knowledge base completion (or link prediction) on DBpedia. We want to tackle this problem using a rule-based approach in order to meet the requirements of explainable AI.

Goals

The goal of the candidate would be to adapt the code of, and employ, HornConcerto, an algorithm to discover Horn rules and new relationships in large graphs. The method has already been shown to outperform existing approaches in terms of runtime and memory consumption while mining high-quality rules for the completion task.
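
To make the rule-based idea concrete, the toy sketch below applies a single Horn rule of the form head(x, y) <= body(x, y) to predict missing edges; HornConcerto discovers such rules at scale, and the example rule and facts here are invented:

```python
# Known triples, as (subject, predicate, object).
facts = {
    ("Albert_Einstein", "birthPlace", "Ulm"),
    ("Albert_Einstein", "residence", "Ulm"),
    ("Marie_Curie", "residence", "Warsaw"),
}

# A Horn rule "birthPlace(x, y) <= residence(x, y)" with the confidence
# it obtained during mining (how often the body also implies the head).
rule = {"head": "birthPlace", "body": "residence", "confidence": 0.6}

def apply_rule(facts, rule):
    """Yield predicted (triple, confidence) pairs not already in the graph."""
    for s, p, o in facts:
        if p == rule["body"] and (s, rule["head"], o) not in facts:
            yield (s, rule["head"], o), rule["confidence"]

for triple, confidence in apply_rule(facts, rule):
    print(triple, "predicted with confidence", confidence)
```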

Ultimate goals:

  • discover explainable hidden patterns among DBpedia entities and properties;
  • evaluate the algorithm on a knowledge base completion benchmark;
  • write a scientific paper.

Impact

The project will enhance the quality and completeness of DBpedia data.

Warm-up tasks

Mentors

TBD (possible names: Tommaso Soru, Aman Mehta, Amandeep Srivastava).

Keywords

knowledge discovery, knowledge base completion, link prediction, explainable artificial intelligence, association rules

Setup a YASGUI interface for DBpedia

Effort

1 day

Skills

Javascript

Description

Set up and configure a YASGUI interface to work with the DBpedia SPARQL endpoint. You might do this locally or on the web. Try to configure the client (and optionally the server) in order to give the end user a more comfortable SPARQL environment.

Impact

Learn the basics of communicating with the SPARQL endpoint.

A Neural QA Model for DBpedia (GSoC 2019)

Previous projects

This project idea is a follow-up of GSoC 2018 project A Neural QA Model for DBpedia.

Description

In recent years, the Linked Data Cloud has grown to over 100 billion facts pertaining to a multitude of domains. The DBpedia knowledge base alone describes 4.58 million things. However, accessing this information is challenging for lay users, as they are not able to use SPARQL as a query language without extensive training.

Recently, Deep Learning architectures based on Neural Networks, called seq2seq, have been shown to achieve state-of-the-art results at translating sequences into sequences. In this direction, we suggest a GSoC topic around Neural Networks that translate any natural language expression into sentences encoding SPARQL queries. Our preliminary work on Question Answering with Neural SPARQL Machines (NSpM) shows promising results, but the coverage is restricted to manually curated templates.

The most up-to-date source code can be found here. During GSoC, we will use this repository as a workspace.

Goals

In this GSoC project, the candidate can choose between the following research directions:

  1. employ a language model (e.g., Question Generation, Universal Sentence Encoders) to automatically discover query templates;
  2. perform experiments on compositionality for complex QA;

with the following ultimate goals:

  • train one or more NSpM models on DBpedia;
  • evaluate the model against either the QALD benchmark (direction 1) or a new task-oriented dataset (direction 2); a minimal evaluation sketch follows this list.
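
A minimal sketch of a QALD-style evaluation, comparing the answer set returned by each generated query against the gold answer set and reporting macro-averaged precision, recall and F1; the answer sets below are placeholders:

```python
def prf(gold, predicted):
    """Precision, recall and F1 of one predicted answer set against the gold set."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

# One entry per benchmark question: gold answers vs. answers of the generated query.
results = [
    ({"dbr:Berlin"}, {"dbr:Berlin"}),
    ({"dbr:Paris", "dbr:Lyon"}, {"dbr:Paris"}),
    ({"dbr:Rome"}, set()),
]

scores = [prf(gold, predicted) for gold, predicted in results]
macro = [sum(values) / len(scores) for values in zip(*scores)]
print("macro P=%.2f R=%.2f F1=%.2f" % tuple(macro))
```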

Impact

The project will allow users to access DBpedia knowledge using natural language.

Warm-up tasks

Mentors

Rricha Jalota and Nausheen Fatma (backup: Aashay Singhal, Aman Mehta, Tommaso Soru).

Keywords

structured question answering, deep learning, neural networks, sparql, tensorflow, python

Tool to generate RDF from DBpedia abstracts (natural language text)

Description

With the recent advances in the analysis of natural language texts (e.g. SyntaxNet), the conversion of text into RDF triples is becoming a real possibility. This project will apply these ideas to a real use case: DBpedia. We will combine the power of syntactic analyzers with the benefits of named entity identifiers (like Spotlight) to generate highly trustworthy RDF triples from the textual information (the long abstract) about a given DBpedia resource.
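
A rough sketch of the idea with spaCy standing in as the syntactic analyzer: take the dependency parse of an abstract sentence, pick subject-verb-object patterns and emit candidate triples. The predicate URIs are naive placeholders, and in the actual project entities would additionally be resolved with Spotlight:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
abstract = "Berlin is the capital of Germany."

def candidate_triples(text):
    """Yield naive (subject, verb lemma, object) candidates from SVO-like patterns."""
    for sentence in nlp(text).sents:
        root = sentence.root
        subjects = [t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")]
        objects = [t for t in root.children if t.dep_ in ("dobj", "attr", "pobj")]
        for subj in subjects:
            for obj in objects:
                yield (subj.text, root.lemma_, obj.text)

for subject, predicate, obj in candidate_triples(abstract):
    print(f"<dbr:{subject}> <ex:{predicate}> <dbr:{obj}> .")
```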

Goals

The tool will generate a new .nt file with the proposed triples for all DBpedia resources. This tool could be exploited by the DBpedia extraction process to provide a new .nt file in the DBpedia downloads.

Impact

Increase the number of RDF triples for a given DBpedia resource.

Warm up tasks

Gain experience with SyntaxNet or any other NLP tool capable of providing a syntactic analysis of natural language. Here we have to strike a balance between power and the number of supported languages.
Become fluent with RDF and the DBpedia datasets (downloads).

Mentors

Mariano Rico

Keywords

NLP, text parsing, syntactic analysis, RDF generation

Run Extraction Framework

Effort

1-2 days

Skills

basic maven, executing README file

Description

The DBpedia extraction framework can download a set of Wikipedia XML dumps and extract facts. There is a configuration file where you specify the language(s) you want, and then you just run it. Set up your download and extraction configuration files and run a simple dump-based extraction.

Impact

Get to know the way the extraction framework works.

Dataset recommendation system for the Databus; integrating the chatbot with the main website and the DBpedia Databus to get dataset information

Description

Currently, the chatbot is not available on the main website; enhancing it by integrating it with the main website and the DBpedia Databus for dataset information would do a great deal for users. First, the scope of the chatbot will be increased by adding more narrations and dialogues in order to generate more relevant responses. A second piece of functionality is a dataset recommendation system for the Databus.

Goals

The goal of this system is to reduce the effort of managing these activities. Instead of finding a dataset and its information by going through the website and a large number of datasets, users can ask the chatbot to get that job done and receive exactly what they asked for within a few seconds. In addition, a recommendation system will be used in the Databus to recommend datasets to users; these recommendations will be delivered by the chatbot itself.
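
One simple way to bootstrap such dataset recommendations is content-based similarity over dataset descriptions, sketched here with TF-IDF; the descriptions are invented, and the real metadata would come from the Databus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datasets = {
    "mappingbased-objects": "object properties extracted from mapped infoboxes",
    "labels": "rdfs labels for all DBpedia resources in many languages",
    "infobox-properties": "raw unmapped infobox properties of Wikipedia articles",
}

names = list(datasets)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(datasets.values())

def recommend(query, top_n=2):
    """Return the datasets whose descriptions are most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, matrix)[0]
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

print(recommend("which dataset contains infobox properties?"))
```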

Impact

This project will reduce the amount of time spent on several activities and increase interaction with the chatbot integrated into the main website. More interaction with the bot will, in turn, gradually increase its scope.

Keywords

Natural Language Processing, Data Analysis, Recommendation System.

Fusing the List Extractor and the Table Extractor

Description

Currently, there are two different projects for extracting triples from lists and from tables. Both projects aim to extract data from Wikipedia pages and to create a dictionary for mapping the elements found in those pages. The student has to study how these projects work (how they create dictionaries, how they call services, etc.) and merge them into a unified extractor. The student has to restructure both projects so that they use a common dictionary, making it easier to integrate the existing projects into one. The student can also add a GUI so that users with little or no knowledge of the project can add triples more easily; the GUI should include a tool to look up existing classes and properties in the latest DBpedia ontology. Other user-facing facilities (more comments, a demo that shows all the steps, etc.) should also be implemented. In addition, the student should add support for different languages, so that the extractor can process different language editions of Wikipedia, including languages that do not use the Latin alphabet (such as Greek or Hebrew). Finally, a multithreading implementation could run the extractors in several threads to make them faster.
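
One possible shape for the common dictionary both extractors could share, mapping a (language, section or column heading) pair to an ontology property; the entries below are illustrative only:

```python
# Shared mapping dictionary: both the list and the table extractor look up the
# heading (or column header) they encounter and obtain the ontology property to use.
MAPPING_DICTIONARY = {
    ("en", "bibliography"): "dbo:author",    # lists of works under a "Bibliography" section
    ("en", "discography"): "dbo:artist",
    ("it", "filmografia"): "dbo:starring",
    ("en", "born"): "dbo:birthDate",         # a column header in biography tables
}

def lookup(language, heading):
    """Return the ontology property for a section/column heading, if mapped."""
    return MAPPING_DICTIONARY.get((language, heading.strip().lower()))

print(lookup("en", "Discography"))   # -> dbo:artist
print(lookup("de", "Werke"))         # -> None: an unmapped heading still to be added
```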

Goals

There are two main goals to achieve:

  1. Merge the two projects in order to get a single way to analyze Wikipedia structures (lists and tables).
  2. Create a GUI to help users. It will also be helpful to add more comments and tips.

Another aspect that could be studied is how to speed up the analysis process. The entire workload can be reorganized into different threads (this is an additional goal, not an essential one).

Impact

DBpedia will have only one program to extract data from Wikipedia article pages.
Furthermore, users will have new facilities, like a GUI and tips on how to work better with this application.

Warm up tasks

Study the parsers' code and propose a dictionary structure that can be used by both projects.
Create a mockup of the GUI that will organize the user's work (e.g. how users add new rules or view statistics of a domain analysis).

Mentors

Luca Virgili, Krishanu Konar

Keywords

Python, RDF, Java

Extend the Extraction Framework for your language

Effort

1-2 days

Skills

basic maven, scala

Description

The DBpedia extraction framework has a default configuration that is language agnostic. However, language-specific configuration can boost the coverage and precision of the extracted data for that particular language. We keep all language-specific configurations here. Browse through the code and try to see how you can improve existing languages or provide a configuration for a new one.

Impact

Improvements in the data quality & quantity for a particular language

Golden standard and quality tool for DBpedia types

Description

Several works from academia and industry exploit the "type" of DBpedia resources. This "type" is a class in the DBpedia ontology, like Person, Movie or Device. The "type" comes from (1) the Wikipedia infobox of the resource and (2) the mapping created by humans. Therefore, DBpedia extractors cannot assign a type to a resource when (1) the resource has no infobox in Wikipedia, or (2) the resource has an infobox that is not mapped. For many languages this lack of types affects up to 50% of the resources.
Several experimental studies have tried to infer the type of a resource from the "connections" the resource has in the graph it belongs to. For instance, [1] follows a statistical approach, and [2] follows a machine learning approach.
However, these approaches need a validation that is not simple: as DBpedia classes form a hierarchy (Writer is a subclass of Person, Poet is a subclass of Writer, etc.) with up to 7 levels, the deeper levels tend to have fewer resources. Therefore, the precision and recall of the "type predictors" must be validated per class or, at least, per level.

[1] Paulheim, H., Bizer, C.: Type inference on noisy RDF data. ISWC 2013. LNCS, vol. 8218, pp. 510–525. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41335-3_32

[2] Rico M., Santana-Pérez I., Pozo-Jiménez P., Gómez-Pérez A.: Inferring Types on Large Datasets Applying Ontology Class Hierarchy Classifiers: The DBpedia Case. EKAW 2018. LNCS, vol. 11313. Springer. https://doi.org/10.1007/978-3-030-03667-6_21

Goals

In order to achieve this validation we need a "golden standard" in which we have manually ensured the type of several resources for each class of the ontology. This "golden standard" should be built using ad hoc software tools, ideally a web application.

Impact

Enhance the quality of DBpedia. With this golden standard we could more easily evaluate approaches that assign a type to an untyped resource. It could also help us assign alternative types to already typed resources, for example a more specific (deeper) type or perhaps an alternative type in another branch of the DBpedia class hierarchy.

Caveats

  • The DBpedia ontology keeps growing. The tool should be able to generate a golden standard for every new version of the DBpedia ontology. The latest version is here.
  • DBpedia is not only the English DBpedia. There are several "chapters" of DBpedia, each one for a specific language. The list of available DBpedias is here.
  • The class of a resource can depend on the (human) evaluator. Therefore, a multi-evaluator tool is required, and Fleiss' kappa could help us measure the level of agreement between evaluators (a minimal computation sketch follows this list).
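
Since multiple evaluators will type the same resources, Fleiss' kappa could be computed as in the following minimal sketch, where each row of the rating matrix is a resource, each column a candidate class, and each cell counts how many evaluators chose that class (the matrix is a made-up example):

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who assigned resource i to class j."""
    n_resources = len(ratings)
    n_raters = sum(ratings[0])                 # assumed constant per resource
    # Proportion of all assignments that went to each class.
    class_totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_j = [total / (n_resources * n_raters) for total in class_totals]
    # Agreement observed within each resource.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_resources
    p_expected = sum(p * p for p in p_j)
    return (p_bar - p_expected) / (1 - p_expected)

# 4 resources, 3 evaluators each, 3 candidate classes (e.g. Person, Writer, Poet).
ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [1, 2, 0],
    [0, 1, 2],
]
print("Fleiss' kappa: %.3f" % fleiss_kappa(ratings))
```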

Ideal profile

Experience with Linked Data technologies (RDF, SPARQL), development of web applications.

Warm up tasks

Mentors

Mariano Rico

Keywords

golden standard, resource type validation.

Find 10 or 100 errors in DBpedia

Effort

1 day

Skills

curiosity, attention to detail, spreadsheets

Description

There are several classes of errors in DBpedia. Data may be incorrect or missing. Errors may have different causes, for instance: 1) wrong information or a wrong format in Wikipedia or another original source; 2) the DBpedia Extraction Framework (DEF) might be making errors during automatic extraction; 3) there might be errors in the mappings or in the ontology. In this task you will browse through DBpedia entities, read Wikipedia pages, (optionally) run some SPARQL queries via the Web UI and analyze the results that come back. Your objective is to judge whether the information is correct and to try to detect the possible sources of error. You will log your findings in a spreadsheet that will be reviewed with one of the core developers of DBpedia. They will review your analysis and help you determine the source of each error.
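
If you choose to run SPARQL queries, a sketch like the one below (using the SPARQLWrapper package against the public endpoint) surfaces one simple class of suspicious facts, namely people whose death date precedes their birth date; which checks you actually run is up to you:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
LIMIT 20
""")

# Each hit is a candidate error to log in the spreadsheet with a suspected cause.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])
```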

Impact

Data quality is one of the most important challenges in open data sets like DBpedia. By finding and categorizing errors, you will learn more about how DBpedia works and help us draft a plan of action that will efficiently improve our data quality by tackling the largest sources of errors first.

Recurrent Neural Network Embedding for Knowledge-Base Completion

Student Proposal

Description

I want to build this project as a Python wrapper that enhances the knowledge base embeddings in DBpedia to provide more accurate semantic information, using generative models to support link prediction and entity recognition systems. Since we aim to obtain embeddings for the documents in the DBpedia Knowledge Base, I first want to implement the paper by Yuxing Zhang, "Recurrent Neural Network Embedding for Knowledge-base Completion". As knowledge base completion is the task of inferring missing, or out-of-vocabulary, triples from the existing triples in the knowledge base, this will help build a model to solve that problem.

Goals

With this project, I aim to make the DBpedia corpus of KB embeddings more complete through this wrapper, which will further allow extending the knowledge base, adding missing links, etc.

Impact

This project has a direct impact on the DBpedia corpus, as the models used in this task will be used to find missing links in the DBpedia Knowledge Base. It will also allow a better representation of the embeddings.

Mentors

Keywords

Python, NLP, Machine Learning, Deep Learning, Knowledge Graph, Knowledge Base, Knowledge Base Embeddings

DBpedia Embeddings for Out-Of-Vocabulary Resources

Description

A DBpedia Knowledge Base embedding (KB embedding) is a learned high-dimensional representation of a KB symbol. In less cryptic words, we want to build a vector of D dimensions (e.g. D=300) to represent each URI.
In this project, we aim to add code such that the DBpedia extraction framework outputs one real-valued vector of D dimensions for each DBpedia Instance, Class and Property, such that if two resources (I, C or P) are similar in meaning, then their vectors are also close to each other in vector space. These vectors are learned from DBpedia's graph itself (plus Wikipedia text), and evaluations can be done with held-out data, also from DBpedia or Wikipedia, or from existing evaluation datasets.
During last year's GSoC, students implemented a novel algorithm for scalable KB embedding and found that one existing algorithm can scale to the size of DBpedia. However, these approaches have not yet been tested on out-of-vocabulary (OOV) resources, that is, on the ability to return a vector for resources that were not part of the training set.
500-dimensional RVA embeddings
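
One baseline for the OOV case, sketched below, is to back off to the average of the embeddings of the resource's graph neighbours; whether such a back-off is good enough is exactly what the project would investigate (the vectors and the neighbour list are dummies):

```python
import numpy as np

DIMENSIONS = 4  # e.g. 300 or 500 in practice

# Embeddings learned for in-vocabulary resources.
embeddings = {
    "dbr:Germany": np.array([0.9, 0.1, 0.0, 0.2]),
    "dbr:City": np.array([0.1, 0.8, 0.1, 0.0]),
}

def vector_for(resource, neighbours):
    """Return the learned vector, or an average of neighbour vectors for OOV resources."""
    if resource in embeddings:
        return embeddings[resource]
    known = [embeddings[n] for n in neighbours if n in embeddings]
    if not known:
        return np.zeros(DIMENSIONS)   # nothing to back off to
    return np.mean(known, axis=0)

# A resource unseen during training, together with its neighbours in the DBpedia graph.
print(vector_for("dbr:Neu-Ulm", neighbours=["dbr:Germany", "dbr:City"]))
```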

Goals

Devise or adapt a model for KB embedding which can deal with out-of-vocabulary resources.

Impact

Embeddings are widely used in NLP as a way to encode distributional semantics. Distributed representations of DBpedia resources will allow people to use semantic similarity to help with entity linking, relationship extraction, etc. They may be used to extend type coverage, add missing links, etc., making DBpedia more complete as a KB.

Warm up tasks

Mentors

Peng Xu, Thiago Galery, Tommaso Soru

Keywords

knowledge graph embedding, vector space model, distributional semantics

Predicate Detection using Word Embeddings for Question Answering over Linked Data

Description

Question answering over Linked Data can be broadly segmented into three tasks: identifying named entities, identifying predicates (relation extraction) and, finally, generating a precise SPARQL query that can answer the question using the identified entities and predicates.

Relation extraction is one of the hardest steps in this process, if not the hardest. The dominant method is to use custom-built lexicons that match words in a query against a dictionary of phrases mapped to DBpedia predicates. Instead, we suggest using word embeddings to solve the relation extraction task.
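
A toy version of the embedding-based matching, using made-up word vectors and cosine similarity; in the project, pre-trained embeddings (e.g. word2vec or GloVe vectors) would replace the toy vectors:

```python
import numpy as np

# Toy word vectors; real ones would come from a pre-trained embedding model.
vectors = {
    "spouse":  np.array([0.9, 0.1, 0.0]),
    "married": np.array([0.8, 0.2, 0.1]),
    "author":  np.array([0.0, 0.1, 0.9]),
    "wrote":   np.array([0.1, 0.0, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_predicate(question_words, predicate_labels):
    """Return the predicate whose label is closest to any word in the question."""
    scored = []
    for label in predicate_labels:
        score = max(cosine(vectors[label], vectors[w])
                    for w in question_words if w in vectors)
        scored.append((score, label))
    return max(scored)

question = ["who", "is", "married", "to", "Barack", "Obama"]
print(best_predicate(question, ["spouse", "author"]))
```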

Goals

For this project we will be using the LC-QuAD dataset which contains 5000 questions derived from DBpedia along with their corresponding SPARQL query and generic query template. The research problem is as follows:

Given a question and its corresponding SPARQL template:

  1. Identify the DBpedia entities (resources) for each triple in the question.
  2. Using the identified resources, apply word embeddings to each predicate label of those resources and find the closest match among the words in the input question.
  3. Experiment with different similarity metrics for matching predicate labels to the input question.
  4. Evaluate the overall performance of the system compared to existing methods using GERBIL.

Warm-up Tasks

Impact

The project will allow users to access DBpedia knowledge using natural language.

Mentors

TBD (Ram G Athreya, Rricha Jalota and Ricardo Usbeck)

Extracting Table of Contents (TOCs) for Articles

Description

Each Wikipedia article is structured by headings and subheadings. These structures indicate the relevance of certain aspects for the described entity. Extracting such data can help in categorizing entities and in deriving facts about them. E.g. cities usually have paragraphs on History, Geography and Demographics, while soccer clubs have paragraphs on Honours, Players and Stadiums. Obviously, there are pitfalls: e.g. these paragraphs are not uniformly captioned, so an alignment between variations (ideally to DBpedia resources) would be helpful. The newly created dataset should follow Linked Data principles, e.g. a sufficiently expressive vocabulary should be used to describe TOCs (ideally as resources), the order of TOC entries, etc.
Optionally, it would be interesting to apply the dataset for a meaningful application, e.g. generating missing types.
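
A rough sketch of the extraction step, pulling headings out of wikitext with a regular expression and emitting ordered triples; the vocabulary terms used here are placeholders, not a final modelling decision:

```python
import re

WIKITEXT = """
== History ==
...
== Geography ==
=== Climate ===
...
== Demographics ==
"""

HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)

def toc_triples(page_uri, wikitext):
    """Yield N-Triples describing each heading, its depth and its position."""
    for position, match in enumerate(HEADING_RE.finditer(wikitext), start=1):
        depth = len(match.group(1)) - 1        # "==" is a top-level section
        label = match.group(2)
        entry = f"{page_uri}/toc/{position}"
        yield f"<{page_uri}> <http://example.org/toc/hasEntry> <{entry}> ."
        yield f'<{entry}> <http://www.w3.org/2000/01/rdf-schema#label> "{label}" .'
        yield f'<{entry}> <http://example.org/toc/level> "{depth}" .'
        yield f'<{entry}> <http://example.org/toc/position> "{position}" .'

for triple in toc_triples("http://dbpedia.org/resource/Berlin", WIKITEXT):
    print(triple)
```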

Goals

Extract TOCs from article pages and produce an RDF dataset describing the article TOCs in a comprehensive way.

Impact

A new dataset which can be used in various ways. Insights in aspects of DBpedia entities.

Warm up tasks

  • Run Extraction Framework #8

Mentors

Magnus

Keywords

extraction

Data Quality Dashboard for DBpedia

Description

DBpedia offers large quantities of structured data. However, DBpedia partly suffers from insufficient data quality, which originates from different sources, e.g. incorrect extractions and value transformations in the extraction framework, inconsistent mappings, incorrect data in Wikipedia articles, and general incompleteness.

Goals

Visualize a set of metrics in an easy-to-read, interactive UI that facilitates the decision on what should be fixed next in DBpedia.

Impact

The interface will help DBpedia contributors adopt a “data quality first” attitude and enable data-driven prioritization of development tasks.

Warm up tasks

Mentors

Possible names: Bharat Suri

Keywords

data quality, front-end, full-stack, javascript, react js, meteor js, user interface, UI
