dbpedia / extraction-framework Goto Github PK

The software used to extract structured data from Wikipedia

Shell 0.87% Java 9.74% Scala 78.68% PHP 3.09% HTML 0.69% CSS 0.06% JavaScript 2.68% Hack 4.19% Dockerfile 0.01%

extraction-framework's Introduction

DBpedia Information Extraction Framework

Homepage: http://dbpedia.org
Documentation: http://dev.dbpedia.org/Extraction
Get in touch with DBpedia: https://wiki.dbpedia.org/join/get-in-touch
Slack: join the #dev-team slack channel within the the DBpedia Slack workspace - the main point for developement updates and discussions

About DBpedia
Getting Started
The DBpedia Extraction Framework
Contribution Guidelines
- Developer's Certificate of Origin
License

About DBpedia

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.
To check out the projects of DBpedia, visit the official DBpedia website.

Getting Started

The Easy Way - Execution using the MARVIN release bot

Running the extraction framework is a relatively complex task which is in details documented in the advanced QuickStart guide. To run the extraction process same as the DBpedia core team does, you can do using the MARVIN release bot. The MARVIN bot automates the overall extraction process, from downloading the ontology, mappings and Wikipedia dumps, to extraction and post-processing the data.

git clone https://git.informatik.uni-leipzig.de/dbpedia-assoc/marvin-config
cd marvin-config
./setup-or-reset-dief.sh
# test run Romanian extraction, very small
./marvin_extraction_run.sh test
# around 4-7 days
./marvin_extraction_run.sh generic

Standalone Execution

If you plan to work on improving the codebase of the framework you would need to run the extraction framework alone as described in the QuickStart guide. This is highly recommended, since during this process you will learn a lot about the extraction framework.

Extractors represent the core of the extraction framework. So far, many extractors have been developed for extraction of particular information from different Wikimedia projects. To learn more, check the New Extractors guide, which explains the process of writing new extractor.
Check the Debugging Guide and learn how to debug the extraction framework.

Execution using Apache Spark

In order to speed up the extraction process, the extraction framework has been adopted to run on Apache Spark. Currently, more than half of the extractors can be executed using Spark. The extraction process using Spark is a slightly different process and requires different Execution. Check the QuickStart guide on how to run the extraction using Apache Spark.

Note: if possible, new extractors should be implemented using Apache Spark. To learn more, check the New Extractors guide, which explains the process of writing new extractor.

The DBpedia Extraction Framework

The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written using Scala 2.8. The framework is available from the DBpedia Github repository (GNU GPL License). The change log may reveal more recent developments. More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki

The DBpedia extraction framework is structured into different modules

Core Module : Contains the core components of the framework.
Dump extraction Module : Contains the DBpedia dump extraction application.

Core Module

Components

Source : The Source package provides an abstraction over a source of Media Wiki pages.
WikiParser : The Wiki Parser package specifies a parser, which transforms an Media Wiki page source into an Abstract Syntax Tree (AST).
Extractor : An Extractor is a mapping from a page node to a graph of statements about it.
Destination : The Destination package provides an abstraction over a destination of RDF statements.

In addition to the core components, a number of utility packages offers essential functionality to be used by the extraction code:

Ontology Classes used to represent an ontology. Methods for both, reading and writing ontologies are provided. All classes are located in the namespace org.dbpedia.extraction.ontology
DataParser Parsers to extract data from nodes in the abstract syntax tree. All classes are located in the namespace org.dbpedia.extraction.dataparser
Util Various utility classes. All classes are located in the namespace org.dbpedia.extraction.util

Dump extraction Module

More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions.

To know more about the extraction framework, click here

Contribution Guidelines

If you want to work on one of the issues, assign yourself to it or at least leave a comment that you are working on it and how.
If you have an idea for a new feature, make an issue first, assign yourself to it, then start working.
Please make sure you have read the Developer's Certificate of Origin, further down on this page!

Fork the main extraction-framework repository on GitHub.
Clone this fork onto your machine (git clone <your_repo_url_on_github>).
Switch to the dev branch (git checkout dev).
From the latest revision of the dev branch, make a new development branch from the latest revision. Name the branch something meaningful, for example fixRestApiParams (git checkout dev -b fixRestApiParams).
Make changes and commit them to this branch.

Please commit regularly in small batches of things "that go together" (for example, changing a constructor and all the instance creating calls). Putting a huge batch of changes in one commit is bad for code reviews.
In the commit messages, summarize the commit in the first line using not more than 70 characters. Leave one line blank and describe the details in the following lines, preferably in bullet points, like in 7776e31....

When you are done with a bugfix or feature, rebase your branch onto extraction-framework/dev (git pull --rebase git://github.com/dbpedia/extraction-framework.git). Resolve possible conflicts and commit.
Push your branch to GitHub (git push origin fixRestApiParams).
Send a pull request from your branch into extraction-framework/dev via GitHub.

In the description, reference the associated commit (for example, "Fixes #123 by ..." for issue number 123).
Your changes will be reviewed and discussed on GitHub.
In addition, Travis-CI will test if the merged version passes the build.
If there are further changes you need to make, because Travis said the build fails or because somebody caught something you overlooked, go back to item 4. Stay on the same branch (if it is still related to the same issue). GitHub will add the new commits to the same pull request.
When everything is fine, your changes will be merged into extraction-framework/dev, finally the dev together with your improvements will be merged with the master branch.

Please keep in mind:

Try not to modify the indentation. If you want to re-format, use a separate "formatting" commit in which no functionality changes are made.
Never rebase the master onto a development branch (i.e. never call rebase from extraction-framework/master). Only rebase your branch onto the dev branch, if and only if nobody already pulled from the development branch!
If you already pushed a branch to GitHub, later rebased the master onto this branch and then tried to push again, GitHub won't let you saying "To prevent you from losing history, non-fast-forward updates were rejected". If (and only if) you are sure that nobody already pulled from this branch, add --force to the push command.
"Don’t rebase branches you have shared with another developer."
"Rebase is awesome, I use rebase exclusively for everything local. Never for anything that I've already pushed."
"Never ever rebase a branch that you pushed, or that you pulled from another person"
In general, we prefer Scala over Java.

More tips:

Guides to setup your development environment for IntelliJ IDEA or Eclipse.
Get help with the Maven build or another form of installation.
Download some data to work with.
How to run from Scala/Java or from a JAR.
Having different troubles? Check the troubleshooting page or post on https://forum.dbpedia.org.

Important: Developer's Certificate of Origin

By sending a pull request to the extraction-framework repository on GitHub, you implicitly accept the Developer's Certificate of Origin 1.1

License

The source code is under the terms of the GNU General Public License, version 2.

extraction-framework's People

Contributors

Stargazers

Watchers

Forkers

jimkont rikoadi jimregan mmorsey ninniuz ktobah ask4amit gaurav-pant-virus juliencojan za925 ciangit shruti-gupta rsharnag cristiancantoro kkasunperera hadyelsahar egipto87 conanx ziorufus adib2011 omrio fumi zavg accardoso mhm5000 djodjoni imclab peepkungas jeschkies mgns polymathronic lastlegion ali1k aaryabhata sericwong smhaisale hywelmj wencanluo justajoe jayantjain93 descl nilesh-c chongwf gaurav pmeinhardt dfleischhacker alismayilov mihaillomonosov b-rich nono314 aragonopendata normalerweise semihyavuzz silky dineshreddykdp feroshjacob sguignot franikm abhayprakash atomerju kdpv rubensalgado madhushib chile12 enayatullah boyan-simeonov arpitachandra alexandrutodor bianca-pereira importbigdata agarwalprabal memaldi kunalsen ilangostl icedwater e-dorigatti ankit-ks gone-phishing juhijetwani akirato ujjwalwahi hrbazoo rishirdua navinpai devendradesale johnokoriee4 billyzafack fsonntag unnonouno abhishekg2389 eamosse pie2014 mzdu nicoring adrapereira malcolmgreaves andreasknoepfle lukeswissman prateekmathur1991 nmadhire

extraction-framework's Issues

Oddly Incorrect Values

I've come across a few values that are incorrect but oddly close to the correct value in some way.

http://dbpedia.org/page/Nantong gives the dbpedia-owl:populationTotal as "72828350 (xsd:integer)", when, according to Wikipedia, it actually is 7,282,835. So the digits are almost correct, only the additional zero at the end is wrong.

http://dbpedia.org/page/Durg is a similar example: DBpedia gives the populationTotal as 2810436, when, according to Wikipedia, it actually is 281,436. So again really close digit-wise, only the zero in the middle is wrong.

http://dbpedia.org/page/Johnstown,_Colorado seems particularly interesting. The populationTotal given at DBpedia is "38279887 (xsd:integer)" and in the infobox at Wikipedia we find "9,887", i.e. the second part of that number but the first part, 3827, is nowhere to be seen in the infobox. However, in the opening paragraph it says: "The population was 3,827 at the 2000 census." How did that find its way into the parsed value?

Automatically collect disambiguation pages for each language

from @jcsahnwaldt

problem: each Wikipedia edition uses its own set of templates to mark
pages as disambiguation pages.

current solution: manually collect these templates for each language

better solution: automatically download the info from the Wikipedia editions.

details: Each MediaWiki instance knows which templates are disambig
templates. Most Wikipedia editions have a page like
http://en.wikipedia.org/wiki/MediaWiki:Disambiguationspage . It's
fairly simple to parse this. For some editions, the page is empty, but
then there are other ways to get the disambig template list.

I started working on this but didn't have time to finish:

https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/util/WikiDisambigReader.scala

If anyone wants to finish this (estimated time needed: 5 to 50 hours), @jcsahnwaldt would be glad to help.

405 Not Allowed Errors on SPARQL Endpoint

Same issue as described in this ticket, only today. Queries against the SPARQL endpoint were fine last night, and this morning I'm seeing a 405 Nginx "Not Allowed" error.

http://sourceforge.net/tracker/?func=detail&aid=3515926&group_id=190976&atid=935521

Specific response body:

<title>405 Not Allowed</title>

405 Not Allowed

nginx/1.4.1

Please let me know if there's any testing I can do to help resolve this.

handle {{ICD10}} and {{ICD9}} templates in property values

I suspect there's some sort of bug when parsing the semi-structured data on ICD-9 and ICD-10 codes from wikipedia.

When posing this SPARQL query at http://dbpedia.org/sparql, you'll notice the great majority of ICD-9 and ICD-10 codes are just quoted commas, instead of the actual codes available at Wikipedia

SELECT DISTINCT ?label ?icd9_code ?icd10_code ?abstract ?wikipediaLink
WHERE {
    ?s a <http://dbpedia.org/ontology/Disease>.
    ?s rdfs:label ?label .
    ?s <http://dbpedia.org/ontology/abstract> ?abstract .

    ?s <http://dbpedia.org/ontology/icd9> ?icd9_code .
    ?s <http://dbpedia.org/ontology/icd10> ?icd10_code .

    ?wikipediaLink <http://xmlns.com/foaf/0.1/primaryTopic> ?s .
    FILTER (langMatches(lang(?label), "en") && langMatches(lang(?abstract), "en"))
} LIMIT 5

Kind Regards

Bad parse for a property in french cities infoboxes

Hi,

I saw a bug of the extraction for the french cities infoboxes, but only on one, the property "mandat maire". I give you an example, this property has two types of values :

mandat maire = 2008 - 2014

mandat maire = [[2008]] - [[2014]]

And with the extraction process, only "2008" is extracted. The good extraction will be "2008 - 2014".

So which files I have to modify to correct this ?

Best.

Julien.

Extend mapping with inverse property generation

Automatically generate an inverse property for a triple, e.g. parent / child, spouse relationships, etc.
For example: A prop B => B invProp A

This may introduce some redundant information but it may be desirable in some cases

Add license information to images

When images from dbpedia are used in an application e.g. link to a dbpedia-resource and use the thumbnail to illustrate your resource. It would be good if licensing information for the images would be available. After a discussion with Dimitris he suggested the following

"Regarding DBPedia Images, licenses in commons.wikipedia.org are marked with templates that are different for every language. We use this configuration [1] to exclude images that are non-free.

What we could do to handle your case is the exact opposite:

Make a list of free templates and map each one of them to a license vocabulary item.
Then after we extract each image we could add an extra triple in the form
imageURI hasLicense licenseURI"

I'd like to add that as a feature request.

[1] https://github.com/jimkont/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/config/mappings/ImageExtractorConfig.scala

Decimal Separator Integers

With regards to the dbpedia-owl:populationTotal property, a number of resources have population counts that are too small by a factor of about 1000:
http://dbpedia.org/resource/Ciudad_Ayala,_Morelos "7"^^http://www.w3.org/2001/XMLSchema#integer
http://dbpedia.org/resource/Garg%C5%BEdai "17"^^http://www.w3.org/2001/XMLSchema#integer
http://dbpedia.org/resource/Urban_water_management_in_Bogot%C3%A1,_Colombia "7"^^http://www.w3.org/2001/XMLSchema#integer
http://dbpedia.org/resource/Fruitland,_Maryland "5"^^http://www.w3.org/2001/XMLSchema#integer
http://dbpedia.org/resource/Assomada "12"^^http://www.w3.org/2001/XMLSchema#integer
http://dbpedia.org/resource/Bermeo "17"^^http://www.w3.org/2001/XMLSchema#integer

Most of those have in common that in Wikipedia a dot is used as a thousands separator, for example, 16.814 for http://en.wikipedia.org/wiki/Garg%C5%BEdai, which ends up as 17 in DBpedia, so the original value is apparently interpreted as a floating point number and then rounded to create an integer for populationTotal. Arguably that's more of a problem with consistency in Wikipedia but maybe a way could be found to check first, if floating point values make any sense and if not (range of xsd:int) interpret anything as thousands separator by default.

Oddly enough, http://en.wikipedia.org/wiki/Fruitland,_Maryland gives the population as "4,866" and still ends up as 5 in DBpedia.

Create a mediawiki extension for abstract creation

DBepdia abstracts are created using a modified mediawiki instance to properly resolve templates. We used to keep a patched mediawiki clone and now just the modified files [1]

The best approach would be to create an extension out of those files to make it even more portable.
Some recent discussion and explanation (by @jcsahnwaldt ) on this can be found here [2]

[1] https://github.com/dbpedia/extraction-framework/tree/master/dump/src/main/mediawiki
[2] http://sourceforge.net/mailarchive/message.php?msg_id=30569736

Enforce Java 7 through Maven

To have a clearer error message when the framework is compiled with Java version < 7,
the maven-enforcer-plugin could help.

live dbpedia dumps

Hi, I tried to get a recent live dbpedia dump. But the last one in http://live.dbpedia.org/dumps/ is over 2 months ago (dbpedia_2013_03_04.nt.bz2).
When will the next dump be generated ?

Thanks.

Create a Wikidata Extractor

There is already sample code with comments that could help bootstrap this extractor.

A skeleton file can be found here [1], along with a download [2] and an extraction configuration script.

There is also a sample dump file, so, for testing you just have to run
$ ../run extraction extraction extraction.wikidata
from the dump folder

Also note that scala has builtin support for JSON (in package scala.util.parsing.json._)

[1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/WikidataExtractor.scala
[2] https://github.com/dbpedia/extraction-framework/blob/master/dump/download.wikidata
[3] https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.wikidata

Issue with ParserUtilsConfig.scalesMap

Scales in ParserUtilsConfig are used in regexes and should contain escaped special chars, e.g.:

"milj." -> 9 in the nl map should be "milj." -> 9,

or special chars should be escaped before creating a RegEx.

This specific issue with the nl language causes a lot of failures.

Very big property resource

Submitted by Enno Meijers

In http://nl.dbpedia.org/resource/Gornji_Petrovci there is a very big (huge) property resource
prop-nl:gemeenteGornjiPetrovciLigtInHetNoordoo...nnenbbd8ff18f21f63ee6047c12813f46ec3

I am not sure if this is a parser error or we should just check the property name size

Here's the source page http://nl.wikipedia.org/w/index.php?title=Gornji_Petrovci&action=edit

Imperial Conversion, Inches Cut Off

When you look at the results of the query
select ?s where {?s dbpedia-owl:height 1.524}
you will find that most people listed there are in fact not 1.524m / 5ft tall but rather something between 1.524 and 1.82m (=6ft). It seems that DBpedia parses the imperial value (even though metric values appear to be directly available in most cases ) but cuts off the inches.
This issue is not unique to people with heights between 5 and 6 feet (they just make up a large part of the dataset) but can also be observed for other multiples of 1 feet (0.3048m), e.g. http://dbpedia.org/page/South_African_Class_NG10_4-6-2 with 3.048m (=10ft) instead of 10 ft 6 in (3.200 m) or http://dbpedia.org/page/Nuneham_Railway_Bridge with 4.572m (=15ft) instead of 15 feet 9 inches (4.80 m).

Additional number in attribute value

Most of those have in common being in South Australia and having the population count ending in 2006. At Wikipedia, the population is given as something like "240 2006 Census", so apparently that additional number in the value field was parsed together with the main value.

Construct mapping value by adding a prefix / suffix

Extend the mapping syntax by adding the ability to add prefixes / suffixes to mapped values.

This will be very helpfull in cases where we map to an id (like viaf) and we want to construct the proper URI and add an owl:sameAs triple

Wrong latitude - longitude extract

Hi,

Sometimes the latitude and longitude are wrote like that :

| latitude = 20/15/55.28/S
| longitude = 57/28/44.59/E

And the framework extract only "20" and "57". Is-it a bug from the framework or from wikipedia infobox ?

Best.

Julien.

dbpedia extraction errors for links within texts

When the wikipedia value contains texts with links in it, dbpedia
turns to extract the first wiki link.
One example below, "Barack Obama" becomes a occupation for "Stephanie
Cutter" since [[Barack Obama]] appears in the text of occupation.

http://en.wikipedia.org/wiki/Stephanie_Cutter     Occupation    Deputy campaign manager for President [[Barack Obama]]'s 2012 reelection campaign.

http://live.dbpedia.org/page/Stephanie_Cutter    dbpedia-owl:occupation    dbpedia:Barack_Obama

This seems a general extraction errors and also exists for other wiki pages.
Two more examples:

Britney_Spears extracted as notable work instead of her song
http://en.wikipedia.org/wiki/Sheppard_Solomon
[[Britney Spears]] - "[[Touch of My Hand]]"

http://live.dbpedia.org/page/Sheppard_Solomon
dbpedia-owl:notableWork dbpedia:Britney_Spears

Adolf_Hitler extracted an occupation
http://en.wikipedia.org/w/index.php?title=Johann_Baur&action=edit
laterwork =Became [[Adolf Hitler]]'s personal pilot

http://live.dbpedia.org/page/Johann_Baur
dbpedia-owl:occupation dbpedia:Adolf_Hitler

Elevation Unit Misinterpretation

With regards to dbpedia-owl:elevation, we found two resources, http://dbpedia.org/page/Wrights_Lake and http://en.wikipedia.org/wiki/Shadow_Mountain_Lake where the prime symbol ( ' ) that represents ft is ignored, resulting in values that are three times larger than they should be in DBpedia.

Two other resources show weird values in those regards:
http://dbpedia.org/page/Zamar gives elevations of 1518.208800 (xsd:double) and 16345.000000 (xsd:double) where Wikipedia gives them as "4,981 m (16,345 ft)".
The latter value is obviously the ft value interpreted as meters but the first one is equal to dividing the meter value of 4,981 by 3.28084, i.e. converting the meter value to meters again.

http://dbpedia.org/page/Zapatoca is a similar case. Wikipedia gives the elevation as "Elevation 1,720 m (4,000 ft)", which is not consistent because 4,000ft = 1219m but DBpedia renders this as "1219.200000 (xsd:double) and 13123.000000 (xsd:double)". So the first value is correctly converted from the elevation in feet but the second value is far off because it's the ft value again converted to ft (4000*3.28084=13123).

Date as Runtime

When you look at objects for http://dbpedia.org/ontology/Work/runtime, you find quite a few runtimes that look suspiciously like recent years.
For example,
http://dbpedia.org/page/Fake_History dbpedia-owl:Work/runtime 2011.0
http://dbpedia.org/page/The_Curse_(Omen_album) dbpedia-owl:Work/runtime 1996.0
http://dbpedia.org/page/White_Christmas_(song) dbpedia-owl:Work/runtime 1942.0

Inspection of the infobox at Wikipedia shows that in each case the alleged runtime value is in fact a date that appears in the corresponding field such as
Length 3:02 (1942 recording)
3:04 (1947 recording)
for White Christmas.

Though in some cases, this issue already exists in Wikipedia, e.g.http://en.wikipedia.org/wiki/I_Love_John_Frigo...He_Swings

http://dbpedia.org/ontology/runtime has the same issues but not that plain to see because the runtimes are given in seconds.

Clean up release

Hello,

I noticed on the main Dbpedia page that version 3.9 has been released. It doesn't seem to be reflected in the extraction framework repository though.

There is not tag in Git
Version is not updated in pom file

Millimeter - Separators

For some resources where the height is given in mm at Wikipedia, the values appear to be too small by a factor of 1000 at DBpedia.
E.g.:
http://dbpedia.org/resource/Renault_Espace__space_III__1 "0.00169"^^http://www.w3.org/2001/XMLSchema#double
http://dbpedia.org/resource/MaK_G_1204_BB "0.00422"^^http://www.w3.org/2001/XMLSchema#double
http://dbpedia.org/resource/Treno_ad_alta_frequentazione "0.0043"^^http://www.w3.org/2001/XMLSchema#double
http://dbpedia.org/resource/FS_Class_E491/2 "0.00431"^^http://www.w3.org/2001/XMLSchema#double

The problem could be that some (http://en.wikipedia.org/wiki/MaK_G_1204_BB, http://en.wikipedia.org/wiki/Treno_ad_alta_frequentazione) use a dot as a thousands separator and others ( http://en.wikipedia.org/wiki/Renault_Espace, http://en.wikipedia.org/wiki/FS_Class_E491/2) use a comma so the only fix could be not to use the mm value when alternatives are available.

A similar issue can be seen with regards to http://dbpedia.org/ontology/displacement
By looking at the results for
select ?s ?o where {?s http://dbpedia.org/ontology/displacement ?o Filter(?o < 1.0e-04)}
i.e. cars with less than 0.1 liters displacement, we find a number of values that appear to be too small by a factor of 1000, e.g. http://dbpedia.org/page/Peugeot_406 with dbpedia-owl:displacement 0.000002 (xsd:double), where Wikipedia lists a displacement of "1.997 cc" ( http://en.wikipedia.org/wiki/Peugeot_406 )

Extraction problem with a note.

Hi there,

It seems there is a little problem when extracting the full name of a wikipedia page that has an associated note:

Full name of this page has a [fn 1] after the proper name : http://en.wikipedia.org/wiki/Prince_William,_Duke_of_Cambridge

And in the dbpedia page: http://dbpedia.org/page/Prince_William,_Duke_of_Cambridge the note is extracted as part of the fullname:

foaf:name: William Arthur Philip Louisref|As a member of the Royal Family entitled to be called His Royal Highness, William formally has no surname. When one is used, it is Mountbatten-Windsor. In his military career, William uses the surname Wales. According to letters patent of February 1960, his house and family name is Windsor. The middle name Louis is pronounced .|group=fn |name=sur

Have you noticed this issue before?

Many thanks in advance,
Ramón

ChemBox extractor does not copy the identifiers?

If I look at http://live.dbpedia.org/page/Azulene I do not see the extracted InChI, CAS registry number, nor PubChem compound ID. What is the reason here?

(Is it possible to run the extractor on a single wikipedia page, removing the need for a full fledged MW/MySQL installation?)

Issue from list separators

By example this from http://fr.wikipedia.org/wiki/Iron_Man_(comics) :

activité = Directeur de Stark Enterprises{{Clr}}Ancien Secrétaire de la Défense{{Clr}}Ancien Directeur du [[SHIELD]]

Become this :

http://fr.dbpedia.org/property/activité "Directeur de Stark EnterprisesAncien Secrétaire de la DéfenseAncien Directeur du SHIELD"@fr

The issue occur with many other separators. A solution will be to create a map of all these separators.

I can take care of it. But I would like to know which files I have to modify for not spending too much time in seeking them.

activeYearsStart/EndYear - time span vs. date vs. decade

For dbpedia-owl:activeYearsStart/EndYear, there are many dates where the year has less than 4 digits, see
select ?s where {?s http://dbpedia.org/ontology/activeYearsStartYear ?o FILTER ( ?o < "0999-01-01T00:00:00Z"^^xsd:dateTime )}.
E.g. http://dbpedia.org/page/Lions_Gate_Chorus and http://dbpedia.org/page/The_Unband . This is obviously going to be wrong most of the time and the issue seems to be caused in most cases by misinterpreting a time span of XX years as the year XX.
Sometimes it is caused by only stating a decade in Wikipedia though, e.g.: http://en.wikipedia.org/wiki/Depswa

Make extraction framework available in central Maven repository

It would be very convenient for users to have extraction framework available in central Maven repository. Right now it's hard even to build a stable release yourself because of #87.

Update Freebase linking script to the new Freebase format

Freebase is now delivering RDF dumps in [1]
Dumps are using Turtle format.
In order to leverage new dumps the CreateFreebaseLinks script must be updated.

[1] https://developers.google.com/freebase/data

AKSW Maven Repository 503

After a fresh checkout and an mvn install all is fine, except for a few artifacts, that maven is not able to download, since they seem to reside in a special Maven repository, that yields a 503 (Service Temporary Unavailable).

The artefacts in question (formatting is mine, for readability):

Downloading: http://maven.aksw.org/archiva/repository/internal/
    org/oclc/oai/harvester2/0.1.12/harvester2-0.1.12.pom
Downloading: http://maven.aksw.org/archiva/repository/internal/
    com/openlink/virtuoso/virtjdbc4/6.1.6/virtjdbc4-6.1.6.pom
Downloading: http://maven.aksw.org/archiva/repository/internal/
    org/aksw/commons/model/0.4/model-0.4.pom

The maven error is:

Could not transfer artifact 
    org.oclc.oai:harvester2:pom:0.1.12 from/to aksw
    (http://maven.aksw.org/archiva/repository/internal):
    Failed to transfer file:
    http://maven.aksw.org/archiva/repository/internal/
    org/oclc/oai/harvester2/0.1.12/harvester2-0.1.12.pom. 
    Return code is: 503 , 
    ReasonPhrase:Service Temporarily Unavailable.

The repository is referenced in: live/pom.xml#L192

Is it advised to download the artifacts by hand? Will maven.aksw.org be up again? Thanks.

ChemBox extractor incorrectly extracts IUPAC names

For example, for :Azulene it creates this triple:

dbpedia:Azulene
dbpprop:iupacname
"bicyclo5.3.0decapentaene"@en .

But the ChemBox has the correct name with a [ and ] in the name:

| IUPACName = bicyclo[5.3.0]decapentaene

Handle "list templates" as values in templates

When the framework parses a property value from a template it breaks it according to some rules (i.e. new lines,
, comma, etc) to get multiple values, otherwise returns the whole string as a single value.

Recently wikipedia started using some special templates to enumerate different values such as Plainlist template.

The bug here is that we cannot parse these templates with a result to reject the whole string. Ideally we should be able to handle multiple templates for different languages

Relevant code: https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia/extraction/dataparser
Similar configurations: https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia/extraction/config/dataparser

Bug report: http://www.mail-archive.com/[email protected]/msg04290.html

Update http://wiki.dbpedia.org/Documentation

and all other wiki pages to point to GitHub, not SourceForge.

Conversion Metric (cm to m?)

The values at Wikipedia look fine mostly, so my only guess is that something expects the value to be given in cm and then converts to m by dividing by 100.

Map categories to classes

Add this option in the mapping syntax.
Categories are not always "clean" so this needs careful planning (i.e. add exception rules, etc)

missing inferred type in DBpedia Live

http://live.dbpedia.org/page/History_of_Kagoshima_Prefecture has type http://dbpedia.org/ontology/AdministrativeRegion, but not http://dbpedia.org/ontology/Region, although in DBpedia Live (but not at the official endpoint) they are in a subclass relationship:

At http://live.dbpedia.org/ontology/AdministrativeRegion, you can check that it is (currently) subclass of http://live.dbpedia.org/ontology/Region.

Hybrid Infobox & Mappings extractor

Create a combination of infobox & mapping extractor.
Infobox extracts very raw data that in many cases have errors but can be useful if no mapping exists for the extracted template.

The new extractor should extract this raw data only if a mapping does not exists

Refactor core to accept new formats

Wikidata introduced JSON as a new wiki text format. To accommodate this change and be prepared for the next formats that will follow @jcsahnwaldt suggested a number of refactoring actions.

There are described in pull request #35 as well as in this developers thread

Pull request #35 also contains 2 drawings on the current extraction design and the new one we want.

This work is estimated from 1 day (minimum) to 1 week (maximum)

Don't capitalize page name found in dump file

I am currently experiencing a problem with DBpedia resource that on Wikipedia are present with same page names (after the capitalization normalization has been done). To let you understand the problem:

$ zgrep "<http://dbpedia.org/resource/<2c6b>>"  ~/wikipedia/enwiki/20130604/enwiki-20130604-page-ids.ttl.gz
<http://dbpedia.org/resource/Ⱬ> <http://dbpedia.org/ontology/wikiPageID> "16504503"^^<http://www.w3.org/2001/XMLSchema#integer> <http://en.wikipedia.org/wiki/Ⱬ?oldid=542530740> .
<http://dbpedia.org/resource/Ⱬ> <http://dbpedia.org/ontology/wikiPageID> "20363161"^^<http://www.w3.org/2001/XMLSchema#integer> <http://en.wikipedia.org/wiki/Ⱬ?oldid=453440263> .

I have inspected the two page ids here and here

The two resources points to the same page with ID 16504503.

The information is also present in the redirect dataset but, due to capitalization I don't know which one is a redirect and which one is the target page:

$ zgrep <2c6b> ~/wikipedia/enwiki/20130604/enwiki-20130604-redirects.ttl.gz
<http://dbpedia.org/resource/Ⱬ> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/Ⱬ> <http://en.wikipedia.org/wiki/Ⱬ?oldid=453440263> .
<http://dbpedia.org/resource/Ⱬ> <http://dbpedia.org/ontology/wikiPageRedirects> <http://dbpedia.org/resource/Ⱬ> <http://en.wikipedia.org/wiki/Ⱬ?oldid=453440263> .

Mapping and redirected infoboxes

Reported from @ninniuz (Andrea Di Menna) here: http://sourceforge.net/mailarchive/message.php?msg_id=30285781

I noticed there is a problem with redirected infoboxes and the test extraction.
If I create a mapping for infobox A which is redirected from infobox
B, all the entities which actually use infobox B will not get mapped
in the test extraction.

Example:

http://mappings.dbpedia.org/server/mappings/en/extractionSamples/Mapping_en:Infobox_skier

http://en.dbpedia.org/resource/Adam_Małysz does not get any rdf:type
extracted (among all).

Checking the wikipedia article [1] I see that it uses Infobox ski
jumper [2] which redirects to Infobox skier [3]

Would it be possible to correct the test extraction framework?

Moreover, the mapping does not show up in live dbpedia [4].
Not even for those which are mapped in the test extraction, e.g. [5]
Also, on [5] it looks like the information about the
Template:Infobox_skier disappeared, while it is present in default
dbpedia [6].

Do you know what is going on?

Thanks,
Andrea

[1] http://en.wikipedia.org/wiki/Adam_Ma%C5%82ysz
[2] http://en.wikipedia.org/wiki/Template:Infobox_ski_jumper
[3] http://en.wikipedia.org/wiki/Template:Infobox_skier
[4] http://live.dbpedia.org/page/Adam_Ma%C5%82ysz
[5] http://live.dbpedia.org/page/Anne_Heggtveit
[6] http://dbpedia.org/page/Anne_Heggtveit

Metric Parsing cm Cut Off

For some of those, the height attribute looks weird at Wikipedia, for example http://en.wikipedia.org/wiki/Guy_Poitevin "1 m 81, 80 kg" but others look perfectly fine, e.g. http://en.wikipedia.org/wiki/Trecia-Kaye_Smith "1.85 m (6 ft 1 in)"

Periapsis Dot Decimal Separator Without Zero in Front

I'm coming to the end of my bug reporting spree.
With regards to the http://dbpedia.org/ontology/Planet/periapsis property, there are a number of extreme outliers with values of over 1e24, such as http://dbpedia.org/page/4544_Xanthus, http://dbpedia.org/resource/11066_Sigurd and http://dbpedia.org/resource/2135_Aristaeus.
This seems to be caused by the way the decimal point is used without a zero in front of it.
For example:
http://en.wikipedia.org/wiki/2135_Aristaeus gives the following values (in astronomical units, AU):
Ap 2.404654338157649
Peri .7949766633909954
which then gets converted to
dbpedia-owl:Planet/apoapsis
3.597311687362601E8
dbpedia-owl:Planet/periapsis
1.1892681609232877E24

Where the first is the result for 2.4 AU to km (http://www.google.com/search?q=2.404654338157649+AU+to+km) and the second is the result for 7949766633909954 (~=8e15) AU to km (http://www.google.com/search?q=7949766633909954+au+to+km).

Map only first / last value from infobox proeprty

Extend the mapping syntax to facilitate extraction of only the first or last value from a property

Administrative interface for DBpedia-Live

DBpedia-Live processes all wikipedia updates in real-time. However there isn’t any nice administration interface to show the actual system status. What we want here to a small localhost server that will show real-time statistics for the live extraction, start stop live components and add manually items to queue.

Double infobox case

Hi,

Sometimes an article is described by two infoboxes, for example : http://fr.wikipedia.org/wiki/Luis_Fernandez

Which is described by :

{{Infobox Footballeur
| image = [[Fichier:Luis Fernandez.jpg|200px|Luis Fernandez]]
| légende = Luis Fernandez entraîneur du [[Stade de Reims]] en [[2009 en football|2009]].
| nom = Luis Miguel Fernández Toledo
| période pro = [[1978 en football|1978]]-[[1993 en football|1993]]
| pays = {{France}}
| nation sportive = {{France}}
| date_de_naissance = {{date sport|2|10|1959|en football|âge=oui}}
| ville = [Tarifa]
| taille = {{Taille|m=1.81}}
| club_actuel =
| numero_en_club =
| position = [[Milieu de terrain]] puis [[entraîneur]]
| parcours junior = {{parcours junior
|[[1969 en football|1969]]-[[1970 en football|1970]] |{{FRA-d}} [[AS Minguettes Vénissieux|AS Minguettes]]
|[[1970 en football|1970]]-[[1978 en football|1978]]| {{FRA-d}} [[Association Sportive de Saint-Priest|AS Saint-Priest]]
}}
| parcours pro = {{parcours pro
|[[1978]]-[[1986]]|{{FRA-d}} [[Paris Saint-Germain Football Club|Paris Saint-Germain]]| 237 (35)
|[[1986]]-[[1989]]| {{FRA-d}} [[Racing Club de France Football Colombes 92|Matra Racing]]| {{0}}59 {{0}}(3)
|[[1989]]-[[1993]]| {{FRA-d}} [[Association sportive de Cannes Football|AS Cannes]]| {{0}}96 {{0}}(3)
|'''1978-1993|'''Total|'''392 (41)
}}
| sélection nationale = {{parcours national
|[[1982]]-[[1992]]|{{FRA football}}| {{0}}60 {{0}}(6)
}}
| carrière entraineur = {{trois colonnes
|[[1993]]-[[1994]]| {{FRA-d}} [[Association sportive de Cannes Football|AS Cannes]]|
|[[1994]]-[[1996]]| {{FRA-d}} [[Paris Saint-Germain Football Club|Paris Saint-Germain]]|
|[[1996]]-[[2000]]| {{ESP-d}} [[Athletic Bilbao]]|
|[[2000]]-[[2003]]| {{FRA-d}} [[Paris Saint-Germain Football Club|Paris Saint-Germain]]|
|[[2003]]-[[2004]]| {{ESP-d}} [[Espanyol de Barcelone|Espanyol Barcelone]]|
|[[2005]]| {{QAT-d}} [[Al-Rayyan SC]]|
|[[2005]]-[[2006]]| {{ISR-d}} [[Betar Jérusalem]]|
|[[2006]]-[[2007]]| {{ESP-d}} [[Real Betis Balompié|Betis Séville]]|
|[[2009]]| {{FRA-d}} [[Stade de Reims]]|
|[[2010]]-[[2011]]| {{ISR football}}|
}}
}}

And

{{Infobox animateur audiovisuel
| nom = Luis Fernandez
| image =
| image taille = 180
| légende = Luis Fernandez
| date de naissance = {{Date|2|octobre|1959|âge=oui}}
| lieu de naissance = [[Tarifa]]
{{Espagne}}
| date de décès =
| lieu de décès =
| nationalité = [[France|Française]]
| émissions = ''[[Luis Attaque]]''
| chaînes = [[RMC]]
| site web = http://luisattaque.rmc.fr
}}

And only the first one is taken. Is-it normal or not ?

Thanks.

Julien.

multiple Wikipedia templates handling for mapping to class

submitted by Marco Fossati in [1], [2]

Description

Up to now, the first template infobox on a Wikipedia article defines the DBpedia type of this article, while further infobox templates will be extracted as instances of the
corresponding types and own URIs.
Hence, the current behavior is to build new URIs in case of multiple templates in one wiki article.
However, this may lead to the creation of URIs with double underscores (something like "blank nodes").
The problem is big enough and is likely to affect all other chapters.
The objective is to identify and implement extraction strategies that would cover all (or most) cases in Wikipedia.
Strategy ideas

Intuitively, when multiple templates occur in the same wiki article, the extractor should generate a unique entity (i.e. subject) and assign all the mapped types and properties to it, no matter where the templates are in the wiki article. This is not a robust strategy, as it indeed fixes some errors in, but may add some other.

Another idea is to declare class disjunction i.e. owl:disjointWith axiom in the DBpedia ontology and raise an error when 2 templates map to disjoint classes.

Examples

http://it.wikipedia.org/w/index.php?title=Diabolik_(fumetto)&action=edit -> 'personaggio' and 'fumetto e animazione' templates.
'personaggio' is about the fictional character, 'fumetto e animazione' is about the comic books. In this case, it wouldn't make sense to attach all properties to one subject URI. We need two different valid URIs.
http://it.wikipedia.org/w/index.php?title=Alfredo_Binda&action=edit -> 'sportivo' and 'bio' templates.
It's pretty clear that both templates are about the person, and it would make a lot of sense to attach all extracted properties to the same subject URI.
http://es.wikipedia.org/w/index.php?title=Jacques_Chirac&action=edit -> 'Ficha de autoridad' and 'Ficha de criminal' templates.
In this case, it would be nice to use the same subject URI for both templates, but the info from 'Ficha de criminal' seems by far not as important.

links

[1] https://sourceforge.net/mailarchive/message.php?msg_id=30369047
[2] https://sourceforge.net/mailarchive/message.php?msg_id=29907224

fix generate-settings

Reported today by Omri Oren on [email protected] :

http://noc.wikimedia.org/conf/langlist has gone away. generate-settings doesn't work anymore. This file probably contains almost the same data: http://noc.wikimedia.org/conf/wikipedia.dblist . We'll have to adapt GenerateWikiSettings.scala and NonIsoLanguagesMappingTest.scala . We'll also have to add language 'min' to Language.scala.

New wikipedia dumps format

Reported on dbp-spotlight-users by Hans Nägeli:
Wikipedia-Dumps (I checked en, de and fr) have start and end tags of the
form "# started 2012-06-04T09:54:24Z", which cause a parsing error.

Enrich Ontology with owl Axioms

(Suggested by @JensLehmann)

For now the framework supports only the following owl axioms:

owl:equivalentClass
owl:disjointWith
owl:equivalentProperty

with owl:disjointWith not supported when exporting the ontology.

(1) (in page 12) suggests some axioms that could enrich DBpedia.
For now, what is needed is just to be able to read these axioms from the wiki and to write them when exporting the owl file (2).

(1) http://jens-lehmann.org/files/2012/ekaw_enrichment.pdf
(2) https://github.com/dbpedia/extraction-framework/tree/master/core/src/main/scala/org/dbpedia/extraction/ontology/io

Extend conditional mappings with more complex rules

For now, this is what we support for conditional mappings: [1]

we would like to extend the syntax to allow more complex conditions (like &&, ||, !).
A discussion on this can be found in [2].

[1] http://mappings.dbpedia.org/index.php/Template:Condition
[2] https://sourceforge.net/mailarchive/message.php?msg_id=30541656

dbpedia / extraction-framework Goto Github PK

extraction-framework's Introduction

DBpedia Information Extraction Framework

Contents

About DBpedia

Getting Started

The Easy Way - Execution using the MARVIN release bot

Standalone Execution

Execution using Apache Spark

The DBpedia Extraction Framework

Core Module

Dump extraction Module

Contribution Guidelines

Important: Developer's Certificate of Origin

License

extraction-framework's People

Contributors

Stargazers

Watchers

Forkers

extraction-framework's Issues

405 Not Allowed

Description

Examples

links

Recommend Projects

Recommend Topics

Recommend Org