BIOfid Literature Crawler

This crawler was created as part of the BIOfid project. It is primarily intended to crawl the Biodiversity Heritage Library (BHL) and Zobodat, but it was designed to be easily extensible to other text sources. The crawler stores metadata and the requested text files in separate subdirectories, depending on the file format.

Given a configuration file config/harvesting.yml, the crawler downloads all requested items (e.g. books, monographs, or journal issues) and stores them locally. The configuration file specifies the base output directory. Each included crawler creates its own subdirectory there and, within it, two directories named text and metadata, which hold the text files and the metadata (as XML), respectively.
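The directory layout described above can be sketched in a few lines of Java. This is only an illustration of the resulting paths, not the crawler's own code: the class OutputLayoutSketch, its method names, and the harvester subdirectory name "BHL" are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the crawler: it only mirrors the layout
// <base output directory>/<harvester>/text and .../metadata described above.
final class OutputLayoutSketch {

    // Directory that would hold the downloaded text files of one harvester.
    static Path textDirectory(Path base, String harvesterName) {
        return base.resolve(harvesterName).resolve("text");
    }

    // Directory that would hold the XML metadata files of one harvester.
    static Path metadataDirectory(Path base, String harvesterName) {
        return base.resolve(harvesterName).resolve("metadata");
    }

    // Creates both directories on disk, including any missing parents.
    static void createLayout(Path base, String harvesterName) {
        try {
            Files.createDirectories(textDirectory(base, harvesterName));
            Files.createDirectories(metadataDirectory(base, harvesterName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path base = Path.of("output"); // assumed base output directory
        System.out.println(textDirectory(base, "BHL"));
        System.out.println(metadataDirectory(base, "BHL"));
    }
}
```

The actual subdirectory names are chosen by the harvesters themselves, so check the output of a real run before relying on them.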

Requirements

The project needs OpenJDK 11+ and Maven 3.6+. At least the harvesting of items from the Botanical Garden of Madrid (via the BHLHarvester) will not work with Oracle Java 8, because the cipher suites required for the TLS encryption are not available there.

Building

To build the project, call mvn package -Dcom.sun.security.enableAIAcaIssuers=true. This should produce the file target/LiteratureCrawler.jar, which you can run with

java -jar target/LiteratureCrawler.jar

and the application will run.

Why enable enableAIAcaIssuers

The -Dcom.sun.security.enableAIAcaIssuers=true parameter is necessary to be able to connect to the server of the Botanical Garden of Madrid. They serve their content via BHL. If this parameter is not given, the SSL connection will fail.

If you want to know more about this problem, StackOverflow is your friend.

In the tests, this parameter is set automatically, so you don't have to pass it explicitly via the CLI.
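For setups where passing -D on the command line is inconvenient (for example, when starting the code from an IDE), the same JDK system property can also be set from Java code. The following is a minimal sketch, assuming it runs before the first TLS connection is opened, since the JDK may read the property only once. The class AiaPropertySketch is hypothetical and not part of the crawler.

```java
// Hypothetical helper, not part of the crawler.
final class AiaPropertySketch {

    // Name of the JDK system property discussed above.
    static final String AIA_PROPERTY = "com.sun.security.enableAIAcaIssuers";

    // Sets the property to "true" unless a value was already supplied,
    // e.g. via -D on the command line, and returns the effective value.
    static String enableIfUnset() {
        if (System.getProperty(AIA_PROPERTY) == null) {
            System.setProperty(AIA_PROPERTY, "true");
        }
        return System.getProperty(AIA_PROPERTY);
    }

    public static void main(String[] args) {
        // Call this first, before any HTTPS request is made.
        System.out.println(AIA_PROPERTY + "=" + enableIfUnset());
    }
}
```

Keeping an existing value untouched means an explicit -D setting on the command line still takes precedence.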

Building in Docker

If you want or have to build a Docker image for the BIOfid Literature Crawler, you can do so with:

docker build --tag literature-crawler:latest .

You should configure your harvesters BEFORE building, because the config files are copied into the image. However, you can also map config files from the host into a container by using the -v parameter when calling docker run.

To run the image in a container, call:

mkdir output
mkdir logs
docker run -v "$PWD"/output:/harvesting -v "$PWD"/logs:/usr/src/literature-crawler/logs  -v "$PWD"/config/harvesting.yml:/usr/src/literature-crawler/config/harvesting.yml --user $(id -u):$(id -g) literature-crawler:latest

This command writes all content generated in the container to the folder output in your current directory. It mounts the directories output and logs into the running container, so both persist after the container exits. The configuration file is mounted into the container as well, so you can configure the harvester even after you have built the image.

Testing

To run all unit tests on a UNIX machine, call mvn test. The tests create a temporary directory at /tmp/test. This works fine on UNIX, but the behavior has not been tested on Windows machines.

BHL Harvester

For the BHL Harvester, you must provide a BHL API key, which you can request here. You can either put this key directly in the configuration file or give the path to a file that contains only the BHL key.
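One possible way to support both forms (key given directly, or a path to a key file) is sketched below. This is an assumption about how such a lookup could work, not the crawler's actual logic; the class ApiKeySketch and the method resolveApiKey are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

// Hypothetical helper, not part of the crawler.
final class ApiKeySketch {

    // If the configured value names an existing regular file, the key is read
    // from that file; otherwise the value itself is treated as the key.
    static String resolveApiKey(String configuredValue) {
        String value = configuredValue.trim();
        Path candidate;
        try {
            candidate = Path.of(value);
        } catch (InvalidPathException e) {
            return value; // cannot be a path, so it must be the key itself
        }
        if (Files.isRegularFile(candidate)) {
            try {
                // Trim to drop the trailing newline most editors add.
                return Files.readString(candidate, StandardCharsets.UTF_8).trim();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        // With no file of this name present, the value is returned as the key.
        System.out.println(resolveApiKey("my-bhl-api-key"));
    }
}
```

Keeping the key in a separate file makes it easier to keep the secret out of version control while still sharing the rest of the configuration.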

Configuration

The BHL Harvester differentiates between single items and titles. Both can be given as keywords in the configuration file, each followed by a list (even if the list has only a single element). While items are processed "as is", titles (i.e. a series of books) are first resolved to their items, and then these items are downloaded.
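The resolution step can be pictured as follows. This is a simplified sketch, not the harvester's real code: TitleResolutionSketch, collectItemIds, and the titleLookup function (which stands in for the call to the BHL API) are hypothetical, and removing duplicate item IDs is a design choice of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of the crawler.
final class TitleResolutionSketch {

    // Items are taken as they are; every title is first resolved to its
    // items via titleLookup. Insertion order is kept, duplicates are dropped.
    static List<Long> collectItemIds(List<Long> items,
                                     List<Long> titles,
                                     Function<Long, List<Long>> titleLookup) {
        Set<Long> result = new LinkedHashSet<>(items);
        for (Long titleId : titles) {
            result.addAll(titleLookup.apply(titleId));
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        // Made-up IDs; the lambda fakes a title that consists of two items.
        List<Long> ids = collectItemIds(
                List.of(1L, 2L),
                List.of(10L),
                titleId -> List.of(2L, 3L));
        System.out.println(ids); // prints [1, 2, 3]
    }
}
```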

Custom Harvester

If you want to harvest another source, you can create a custom class that extends the Harvester class and implements its abstract methods. After you also give it a name and a class setting in the configuration file, it should be ready to use.
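The general pattern looks roughly like the sketch below. Note that HarvesterStandIn is a stand-in invented for this example so that it compiles on its own; the abstract methods of the project's real Harvester class may have different names and signatures, so look them up in the source before writing your own subclass. ExampleHarvester and all item IDs are made up as well.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the project's Harvester base class (assumption: the real
// class declares different abstract methods than the ones shown here).
abstract class HarvesterStandIn {

    // Name of the subdirectory this harvester writes to.
    abstract String folderName();

    // IDs of all items that should be downloaded.
    abstract List<String> itemIds();

    // Fetches the text of one item.
    abstract String fetchText(String itemId);

    // Shared driver logic: the base class loops, the subclass fills the gaps.
    List<String> run() {
        List<String> harvested = new ArrayList<>();
        for (String id : itemIds()) {
            harvested.add(folderName() + "/text/" + id + ".txt -> " + fetchText(id));
        }
        return harvested;
    }
}

// A custom harvester only has to implement the abstract methods.
final class ExampleHarvester extends HarvesterStandIn {

    @Override
    String folderName() {
        return "example";
    }

    @Override
    List<String> itemIds() {
        return List.of("a1", "b2"); // made-up item IDs
    }

    @Override
    String fetchText(String itemId) {
        return "text of " + itemId; // a real harvester would download here
    }

    public static void main(String[] args) {
        new ExampleHarvester().run().forEach(System.out::println);
    }
}
```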

Bugs

If you find bugs, please do not hesitate to open an issue!
