de.setf.wilbur's People

Contributors

lisp, vityok

de.setf.wilbur's Issues

Update README.md

The project is installable with Quicklisp:

(ql:quickload "wilbur")

This should be mentioned in the readme file to let new users know an easy way to try and run the library.

Problem with some Unicode chars

It looks like Wilbur has a problem with certain Unicode chars in certain circumstances.

Code to reproduce:

  1. download RDF/XML data from DBPedia:
wget http://dbpedia.org/data/Semantic_Web.rdf
  2. parse with the external format explicitly defined:
(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input
             :external-format :utf-8))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

This produces an error on both CCL and SBCL:

> Error: Cannot decode this: (#\U+30BB #\U+30DE #\U+30F3 #\U+30C6 #\U+30A3 #\U+30C3 #\U+30AF #\U+30FB #\U+30A6 #\U+30A7 #\U+30D6)
> While executing: (:INTERNAL WILBUR::COLLAPSE WILBUR:COLLAPSE-WHITESPACE), in process listener(1).
debugger invoked on a SIMPLE-ERROR in thread
#<THREAD "main thread" RUNNING {AB2F861}>:
  Cannot decode this: (#\HANGUL_SYLLABLE_U #\HANGUL_SYLLABLE_KEU
                       #\HANGUL_SYLLABLE_RA #\HANGUL_SYLLABLE_I
                       #\HANGUL_SYLLABLE_NA)
(WILBUR:COLLAPSE-WHITESPACE "우크라이나")

But everything works fine if the external format is not specified:

(defvar stream (open #P"Semantic_Web.rdf"
             :direction :input))
(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream "http://dbpedia.org/page/Semantic_Web"))

Produces:

#<TEMPORARY-PARSER-DB size 157 #x1862A5C6>

The resulting database can then be queried successfully.

The problem is even more evident when using flexi-streams.
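
One way to narrow this down, as a minimal sketch: list the non-ASCII characters that the Lisp stream actually delivers when the file is opened with an explicit external format, so they can be compared with the characters named in the error. The function name non-ascii-chars is hypothetical and not part of Wilbur; only standard Common Lisp stream functions are used.

```lisp
;; Hypothetical diagnostic helper (not part of Wilbur).
(defun non-ascii-chars (path &key (external-format :utf-8))
  "Return the distinct characters in PATH with a code above 127,
in order of first appearance, as decoded by EXTERNAL-FORMAT."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    (let ((seen '()))
      (loop for ch = (read-char in nil nil)
            while ch
            when (> (char-code ch) 127)
              do (pushnew ch seen))
      (nreverse seen))))

;; Usage with the file from the reproduction steps:
;; (non-ascii-chars #P"Semantic_Web.rdf")
```

with-open-file also closes the stream on exit, which the defvar-based reproduction above does not do.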

Problem with namespaces when parsing XML files

I got the following error:

XML -- missing NAMESPACE definition "doac:LanguageSkill"
   [Condition of type WILBUR:MISSING-NAMESPACE-DEFINITION]

when running:

(defvar stream1 (open #P"0675365413696898.rdf"
             :direction :input))

(setf wilbur:*db*
      (wilbur:parse-db-from-stream stream1 "0675365413696898.rdf"))

The file is actually correct; the doac prefix is declared on the root element:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bio="http://purl.org/vocab/bio/0.1/" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:doac="http://ramonantonio.net/doac/0.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:fgvterms="http://www.fgv.br/terms/" xmlns:event="http://purl.org/NET/c4dm/event.owl#" xmlns:gn="http://www.geonames.org/ontology#" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:lattes="http://www.cnpq.br/2001/XSL/Lattes" xml:base="http://www.fgv.br/lattes/0675365413696898">
....

Lack of documentation

There is some documentation for the original project: http://wilbur-rdf.sourceforge.net/docs/
but it is not included in this version and there are no links to it.

For somebody new to this project it is very difficult to understand it without helpful documentation.

Please add documentation (docs) to this project.

Performance: Wilbur vs RDFLib

Same task, similar code. To use RDFLib I first had to convert the file to N-Triples, but apart from that, Wilbur took about 1 hour and RDFLib did the same job in about 1 minute.

  1. https://gist.github.com/arademaker/fe6b31d25f12eb307ed6cbea4395a357
  2. https://gist.github.com/arademaker/dcedb56952f5aa014c6729211cdb2540

Any ideas? How can I investigate this difference?

$ rapper -c opennlp/Dissertation.pdf.rdf
rapper: Parsing URI file:///Users/ar/work/papers/opennlp/Dissertation.pdf.rdf with parser rdfxml
rapper: Parsing returned 865058 triples

$ rapper -o ntriples -i rdfxml opennlp/Dissertation.pdf.rdf > lixo.ntriples
$ time python3.7 rdf-to-json.py lixo.ntriples lixo.json

real    0m59.568s
user    0m58.309s
sys    0m0.830s


$ sbcl --noinform --noprint --eval "(load \"rdf-to-json.lisp\")" --eval "(main (nth 1 sb-ext:*posix-argv*) (nth 2 sb-ext:*posix-argv*))" --eval "(sb-ext:quit)" opennlp/Dissertation.pdf.rdf lixo.json

real    54m37.053s
user    54m18.341s
sys    0m7.938s
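
One possible way to investigate on SBCL is its bundled statistical profiler, sb-sprof. The sketch below is an assumption about how it could be applied here, not a diagnosis: profile-call is a hypothetical helper, and the commented usage assumes that main from rdf-to-json.lisp is already loaded and takes the two file names passed on the command line above.

```lisp
;; SBCL only: sb-sprof is the statistical profiler shipped with SBCL.
(require :sb-sprof)

;; Hypothetical helper: run THUNK while sampling the call stack,
;; then print a flat report of where the samples landed.
(defun profile-call (thunk)
  (sb-sprof:with-profiling (:max-samples 40000 :report :flat)
    (funcall thunk)))

;; Assumed usage, with MAIN loaded from rdf-to-json.lisp:
;; (profile-call
;;  (lambda () (main "opennlp/Dissertation.pdf.rdf" "lixo.json")))
```

Running this on a smaller input first keeps the turnaround short; the functions at the top of the flat report are the first candidates to look at.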

Inconsistencies in serialization formats

Since Wilbur has two serializers, one would expect that if it correctly reads or generates a set of triples (or a DB), it would serialize them correctly with both.

However, there seem to be some inconsistencies between the N-Triples and the RDF/XML serializers. Some code that works with the former fails with the latter.

For instance, I have some triples:

10: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:ID #1 {10052D3BC3}>
11: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:FORM #"The" {10052D4313}>
12: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:LEMMA #"the" {10052D4523}>
13: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:UPOSTAG #"DET" {10052D4743}>
14: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:XPOSTAG #"DT" {10052D4963}>
15: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:FEATS !conll:pronTypeArt {10052D4B73}>
16: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:FEATS !conll:definiteDef {10052D4D33}>
17: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:HEAD #3 {10052D4F43}>
18: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:DEPREL #"det" {10052D5163}>
19: #<WILBUR:TRIPLE !NAMESPACE:test-1 !conll:DEPS #"_" {10052D5373}>

They were generated by the following code (I've removed non-relevant parts):

(let* ((token-node (node (format nil "NAMESPACE:~a-~a" sentence-id (token-id token))))
	(slots '(id form lemma upostag xpostag feats head deprel deps))
	(slot-nodes
	 (list
	  'id `(,(wilbur:literal (slot-value token 'id)))
	  'form `(,(wilbur:literal (slot-value token 'form)))
	  'lemma `(,(wilbur:literal (slot-value token 'lemma)))
	  'upostag `(,(wilbur:literal (slot-value token 'upostag)))
	  'xpostag `(,(wilbur:literal (slot-value token 'xpostag)))
	  'feats (convert-features-to-rdf (slot-value token 'feats))
	  'head `(,(wilbur:literal (slot-value token 'head)))
	  'deprel `(,(wilbur:literal (slot-value token 'deprel)))
	  'deps `(,(wilbur:literal (slot-value token 'deps))))))
    
    `(,@(mappend
	  #'(lambda (slot)
	      (mapcar
	       #'(lambda (value-node) 
		   (wilbur:triple
		    token-node
		    (node (format nil "conll:~a" (string-upcase slot)))
		    value-node))
	       (getf slot-nodes slot)))
	  slots)))

The head field is an integer. Serialization as N-Triples works correctly and exports it as a number, but serialization as RDF/XML signals an error:

The value
  3
is not of type
  SEQUENCE
   [Condition of type TYPE-ERROR]

Restarts:
 0: [RETRY] Retry SLIME REPL evaluation request.
 1: [*ABORT] Return to SLIME's top level.
 2: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {1001FAFFA3}>)

Backtrace:
  0: (SB-IMPL::SEQUENCE-TO-LIST 3) [tl,external]
  1: (WILBUR::EXTENDED-STRING->CHAR-CODES 3)
  2: (WILBUR:ESCAPE-XML-STRING 3 T)
  3: ((LABELS WILBUR::DUMP :IN WILBUR::DUMP-AS-RDF/XML) !NAMESPACE:test-1 ((!#1=conll:DEPS . #"_") (!#1#:DEPREL . #"det") (!#1#:HEAD . #3) (#2=!#1#:FEATS . !#1#:definiteDef) (#2# . !#1#:pronTypeArt) (!#1#:..
  4: (WILBUR::DUMP-AS-RDF/XML (#<WILBUR:TRIPLE #1=!#2=NAMESPACE:c6DC441D0-76F3-460E-A332-DC3F66422077 #3=!rdf:type !#4=conll:Corpus {10052D03E3}> #<WILBUR:TRIPLE #1# #5=!rdfs:label #"my-corpus" {10052D0693..
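
The backtrace shows WILBUR:ESCAPE-XML-STRING receiving the integer 3 directly. A possible workaround, sketched under the assumption that the RDF/XML serializer only handles literals whose value is a string: convert non-string values to their printed form before creating the literal. The helper literal-text is hypothetical and not part of Wilbur, and the N-Triples output for these fields may change as a result.

```lisp
;; Hypothetical helper (not part of Wilbur): make sure the value
;; handed to wilbur:literal is a string.
(defun literal-text (value)
  "Return VALUE unchanged if it is a string, otherwise its printed form."
  (if (stringp value)
      value
      (princ-to-string value)))

;; Assumed usage in the slot list above, for the integer-valued slots:
;; 'id   `(,(wilbur:literal (literal-text (slot-value token 'id))))
;; 'head `(,(wilbur:literal (literal-text (slot-value token 'head))))
```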

( @arademaker )
