natlibfi / annif-corpora
Document and subject corpus collection for use in testing Annif subject indexing tool
I can't build the jyu theses corpus, because two documents are missing:
I changed the Makefile so that the build proceeds past this step. Now, if a download fails, both the source (.url) and destination (.pdf) files are deleted.
%.pdf: %.url
- wget -q --no-use-server-timestamps -O $@ -i $< || rm -f $@
+ wget -q --tries=5 --no-use-server-timestamps -O $@ -i $< || rm -f $@ $<
Also, if the conversion from PDF to text fails, the whole process stops. I fixed this with the following change (all conversion errors are ignored):
%.txt: %.pdf
- pdftotext $<
+ pdftotext $< || true
Finally, it may be worth mentioning in the documentation that downloading and converting can be sped up with make -j16 pdf or make -j4 txt.
It looks like the links contained in eng-test point to .txt files, but everything in jyu-theses eng is a .url file. This seems to be true of all the example document corpora. I haven't tested every one, but it looks like all of them would break.
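One way to check how far a local checkout is from having its .txt files is to list every .url file that has no matching .txt next to it. The following is only a sketch: the function name missing_txt and the example path are hypothetical, and it assumes each document's .url and .txt files share a base name in the same directory, as the %.pdf: %.url and %.txt: %.pdf rules above suggest.

```shell
#!/bin/sh
# Sketch: print every .url file under a directory that has no sibling .txt.
# "missing_txt" is a hypothetical helper name, not part of annif-corpora.
missing_txt() {
    dir=$1
    find "$dir" -type f -name '*.url' | sort | while IFS= read -r url; do
        # Swap the .url suffix for .txt and test whether that file exists.
        txt="${url%.url}.txt"
        [ -f "$txt" ] || printf '%s\n' "$url"
    done
}

# Example call (the path is an assumption about the local layout):
# missing_txt jyu-theses/eng-test
```

An empty result would mean every listed document already has been converted; any paths printed are documents that still need to be downloaded and converted, or whose download or conversion failed.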