
ACHE Focused Crawler

ACHE is a focused web crawler. It collects web pages that satisfy specific criteria, e.g., pages that belong to a given domain or that contain a user-specified pattern. ACHE differs from generic crawlers in the sense that it uses page classifiers to distinguish between relevant and irrelevant pages in a given domain. A page classifier can range from a simple regular expression (that matches every page containing a specific word, for example) to a machine-learning-based classification model. ACHE can also automatically learn how to prioritize links in order to efficiently locate relevant content while avoiding the retrieval of irrelevant content.

ACHE supports many features, such as:

  • Regular crawling of a fixed list of web sites
  • Discovery and crawling of new relevant web sites through automatic link prioritization
  • Configuration of different types of page classifiers (machine learning, regex, etc.)
  • Continuous re-crawling of sitemaps to discover new pages
  • Indexing of crawled pages using Elasticsearch
  • Web interface for searching crawled pages in real-time
  • REST API and web-based user interface for crawler monitoring
  • Crawling of hidden services using TOR proxies

License

From version 0.11.0 onwards, ACHE is licensed under Apache 2.0. Previous versions were licensed under the GNU GPL license.

Documentation

More information is available in the project's documentation.

Installation

You can either build ACHE from the source code, download an executable binary using Conda, or use Docker to build an image and run ACHE in a container.

Build from source with Gradle

Prerequisite: You will need to install a recent version of Java (JDK 8 or later).

To build ACHE from source, you can run the following commands in your terminal:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
./gradlew installDist

which will generate an installation package under ache/build/install/. You can then make the ache command available in the terminal by adding the ACHE binaries to the PATH environment variable:

export ACHE_HOME="{path-to-cloned-ache-repository}/ache/build/install/ache"
export PATH="$ACHE_HOME/bin:$PATH"

Running using Docker

Prerequisite: You will need to install a recent version of Docker. See https://docs.docker.com/engine/installation/ for details on how to install Docker for your platform.

We publish pre-built Docker images on Docker Hub for each released version. You can run the latest image using:

docker run -p 8080:8080 vidanyu/ache:latest

Alternatively, you can build the image yourself and run it:

git clone https://github.com/ViDA-NYU/ache.git
cd ache
docker build -t ache .
docker run -p 8080:8080 ache

The Dockerfile exposes two data volumes so that you can mount a directory with your configuration files (at /config) and preserve the crawler's stored data (at /data) after the container stops.
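For example, a minimal sketch of such an invocation (the host paths are placeholders; adjust them to your setup):

docker run -v /path/to/my-config:/config -v /path/to/my-data:/data -p 8080:8080 vidanyu/ache:latest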

Download with Conda

Prerequisite: You need to have the Conda package manager installed on your system.

If you use Conda, you can install ache from Anaconda Cloud by running:

conda install -c vida-nyu ache

NOTE: Only released (tagged) versions are published to Anaconda Cloud, so the version available through Conda may not be up to date. If you want to try the most recent version, please clone the repository and build from source, or use the Docker version.

Running ACHE

Before starting a crawl, you need to create a configuration file named ache.yml. We provide some configuration samples in the repository's config directory that can help you get started.

You will also need a page classifier configuration file named pageclassifier.yml. For details on how to configure a page classifier, refer to the page classifiers documentation.

After you have configured a classifier, the last thing you will need is a seed file, i.e., a plain text file containing one URL per line. The crawler will use these URLs to bootstrap the crawl.
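For illustration, a seed file is simply a list of starting URLs, one per line (the URLs below are placeholders):

https://example.com/
https://example.org/relevant-topic/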

Finally, you can start the crawler using the following command:

ache startCrawl -o <data-output-path> -c <config-path> -s <seed-file> -m <model-path>

where,

  • <config-path> is the path to the config directory that contains ache.yml.
  • <seed-file> is the seed file that contains the seed URLs.
  • <model-path> is the path to the model directory that contains the file pageclassifier.yml.
  • <data-output-path> is the path to the data output directory.

Example of running ACHE using the sample pre-trained page classifier model and the sample seeds file available in the repository:

ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model

The crawler will run and print its logs to the console. Hit Ctrl+C at any time to stop it (it may take some time). For long crawls, you should run ACHE in the background using a tool like nohup, as shown below.
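For example, a minimal sketch using the sample configuration from the previous command (the log file name is arbitrary):

nohup ache startCrawl -o output -c config/sample_config -s config/sample.seeds -m config/sample_model > crawl.log 2>&1 &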

Data Formats

ACHE can output data in multiple formats. The data formats currently available are:

  • FILES (default) - raw content and metadata are stored in rolling compressed files of fixed size.
  • ELASTICSEARCH - raw content and metadata are indexed in an Elasticsearch index.
  • KAFKA - pushes raw content and metadata to an Apache Kafka topic.
  • WARC - stores data using the standard web archiving format used by the Internet Archive and Common Crawl.
  • FILESYSTEM_HTML - only raw page content is stored in plain text files.
  • FILESYSTEM_JSON - raw content and metadata are stored in JSON format in files.
  • FILESYSTEM_CBOR - raw content and some metadata are stored in CBOR format in files.

For more details on how to configure data formats, see the data formats documentation page.

Bug Reports and Questions

We welcome user feedback. Please submit any suggestions, questions, or bug reports using the GitHub issue tracker.

We also have a chat room on Gitter.

Contributing

Code contributions are welcome. We use a code style derived from the Google Style Guide, but with 4 spaces for indentation. An Eclipse Formatter configuration file is available in the repository.

Contact


ache's Issues

Remove fragment from URL in HTML Parser

Some URLs have a piece called a fragment:

http://en.example.org/index.html#fragment

The HTML parser should remove the fragment, because different URLs with different fragments point to the same page on the web server. This causes ACHE to download duplicate pages.
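A minimal sketch of how the fragment could be stripped using the standard java.net.URI API (illustrative only, not ACHE's actual parser code):

import java.net.URI;
import java.net.URISyntaxException;

public class FragmentStripper {

    // Rebuilds the URI without its fragment component.
    static String removeFragment(String url) throws URISyntaxException {
        URI u = new URI(url);
        return new URI(u.getScheme(), u.getAuthority(), u.getPath(), u.getQuery(), null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Prints "http://en.example.org/index.html" -- the fragment is dropped.
        System.out.println(removeFragment("http://en.example.org/index.html#fragment"));
    }
}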

When I search Chinese I have some issues

When we crawl a Spanish website, the data_target directory appears and its content is not garbled. When we crawl a Chinese website, there is no data_target, only data_negative. Also, when a URL is in the GBK encoding, the crawled URL is garbled; when the encoding is UTF-8, the URL displays normally. How can I solve this?

Store both relevant and irrelevant files in a single repository

Instead of having two separate repositories for "target" and "negative" pages, ACHE should store all pages in a single data repository. In order to distinguish between relevant and irrelevant pages, properties for the page classification output (class and probability) should be added to each page entry.

Elasticsearch 5.x (update the ES client)

Hello,

I ran some tests, and the crawler is not working with version 5.x of Elasticsearch.
The ES client is out of date. Is there any plan to support ES version 5.x?
It would be great to support version 5.x and to have options to use some of the features provided by the ingest node.

Properly handle DNS resolution failures

When a download fails with a java.net.UnknownHostException, it is an indication that the domain doesn't exist anymore, and further requests to other URLs from this domain are not necessary.
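A minimal sketch of one way this could be handled (the class and method names are hypothetical, not ACHE's actual code): remember hosts whose DNS lookup failed and skip further URLs from them.

import java.net.URI;
import java.net.UnknownHostException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class FailedDomainTracker {

    // Hosts for which DNS resolution has already failed.
    private final Set<String> failedHosts = ConcurrentHashMap.newKeySet();

    // Called when a download fails; only DNS failures are recorded.
    public void recordFailure(String url, Exception cause) {
        if (cause instanceof UnknownHostException) {
            failedHosts.add(URI.create(url).getHost());
        }
    }

    // Called before scheduling a download: skip URLs of known-dead hosts.
    public boolean shouldSkip(String url) {
        return failedHosts.contains(URI.create(url).getHost());
    }
}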

Store HTTP response headers

ACHE should store HTTP response headers. HTTP headers help identify the MIME type of the data being downloaded, which will be important to support correct crawling of multimedia objects such as images and videos. They also contain other useful information such as cookies.

In the class focusedCrawler.crawler.CrawlerImpl, these values should be taken from the URLConnection class and stored in the class focusedCrawler.util.Page:

public class Page {
  Map<String, String> responseHeaders;
  Map<String, String> getResponseHeaders() { ... }
}
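The headers themselves can be obtained from URLConnection.getHeaderFields(); a minimal standalone sketch (flattening multi-valued headers into a single string is an assumption made to fit the Map<String, String> field above):

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class ResponseHeaderExample {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.org/").openConnection();
        // The null key holds the status line; real code may want to handle it separately.
        Map<String, List<String>> headers = conn.getHeaderFields();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            System.out.println(e.getKey() + ": " + String.join(", ", e.getValue()));
        }
    }
}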

Embed stopwords list file into JAR resources

To ease ACHE configuration, we should embed the stop-words list file into the JAR resources. This way the user doesn't need to specify a stop-words list if it's not really needed.

TargetStorage fails to store pages because of too long file name

TargetCBORRepository and TargetFileRepository eventually fail to store pages because the file name is too long. A shorter naming scheme should be used to avoid these errors. Example:

[2015-09-17 19:36:23,046]ERROR [dispatcher-22] (TargetCBORRepository.java:63) - Failed to store object in repository.
java.io.FileNotFoundException: ./data/data_target/www.trust.org/http%3A%2F%2Fwww.trust.org%2Fspotlight%2FEbola%2F%3Futm_medium%3Demail%26utm_campaign%3DAlertNet%2520Expresso%252028%2520Jan%25202015%26utm_content%3DAlertNet%2520Expresso%252028%2520Jan%25202015%2520CID_f5d471ee9a77c8bdd3ade118bd02acbe%26utm_source%3DCampaign%2520Monitor%26utm_term%3DEbola%2520outbreak%2520in%2520West%2520Africa (File name too long)
        at java.io.FileOutputStream.open0(Native Method) ~[na:1.8.0_60]
        at java.io.FileOutputStream.open(FileOutputStream.java:270) ~[na:1.8.0_60]
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213) ~[na:1.8.0_60]
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162) ~[na:1.8.0_60]
        at com.fasterxml.jackson.core.JsonFactory.createGenerator(JsonFactory.java:1072) ~[jackson-core-2.5.4.jar:2.5.4]
        at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:2730) ~[jackson-databind-2.5.4.jar:2.5.4]
        at focusedCrawler.target.TargetCBORRepository.insert(TargetCBORRepository.java:60) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:91) [ache-0.3.1.jar:0.3.1]
        at [...]
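One possible shorter naming scheme (an illustrative sketch, not necessarily the fix that was adopted) is to hash the URL so the file name always has a fixed length:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class UrlFileName {

    // SHA-256 of the URL, hex-encoded: always 64 characters long.
    static String fileNameFor(String url) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha256.digest(url.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fileNameFor("http://www.trust.org/spotlight/Ebola/?utm_medium=email"));
    }
}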

Problems connecting to storage

I'm trying to get ACHE started. I've built a model and am using the following command to run it:

./ache startCrawl /tmp/output /home/user/Desktop/topic/topic_config /home/user/Desktop/topic/topic.seeds /home/user/Desktop/topic/topic_model /home/user/Downloads/langdetect-03-03-2014/profiles

I get the same issue with ACHE when I check it out from continuum.io's repo. The issue is not dissimilar to nasa-jpl-memex/memex-explorer#357.

The following issue happens, and keeps happening. Is there anything I need to configure?

I've used the config from your repo, with my seeds and my model.

Any ideas?

Erro de comunicacao: Connection refused
Dormindo 5 mls
[11/MAR/2015:13:56:19] [SocketAdapterFactory] [produce] [localhost:1988]
Erro de comunicacao: Connection refused
crawler_group_2_0:Connection refused
focusedCrawler.crawler.CrawlerException: crawler_group_2_0:Connection refused
    at focusedCrawler.crawler.CrawlerImpl.selectUrl(CrawlerImpl.java:250)
    at focusedCrawler.crawler.Crawler.run(Crawler.java:258)
java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at java.net.Socket.connect(Socket.java:528)
    at java.net.Socket.<init>(Socket.java:425)
    at java.net.Socket.<init>(Socket.java:208)
    at focusedCrawler.util.storage.socket.StorageRemoteAdapter.getSocket(StorageRemoteAdapter.java:78)
    at focusedCrawler.util.storage.socket.StorageRemoteAdapter.defaultMethod(StorageRemoteAdapter.java:180)
    at focusedCrawler.util.storage.socket.StorageRemoteAdapter.select(StorageRemoteAdapter.java:300)
    at focusedCrawler.util.storage.distribution.StorageRemoteAdapterReconnect.select(StorageRemoteAdapterReconnect.java:265)
    at focusedCrawler.crawler.CrawlerImpl.selectUrl(CrawlerImpl.java:213)
    at focusedCrawler.crawler.Crawler.run(Crawler.java:258)
R>crawler_group_2_0>Sleeping 5000 mls.
RM>crawler_group>crawler_group_0_0>Time(3708):Sleeping as a consequence of this problem: 'crawler_group_0_0:Connection refused':selectUrl() linkStorage.
RM>crawler_group>crawler_group_1_0>Time(3694):Sleeping as a consequence of this problem: 'crawler_group_1_0:Connection refused':selectUrl() linkStorage.
RM>crawler_group>crawler_group_2_0>Time(3673):Sleeping as a consequence of this problem: 'crawler_group_2_0:Connection refused':selectUrl() linkStorage.
RM>crawler_group>crawler_group_3_0>Time(3707):Sleeping as a consequence of this problem: 'crawler_group_3_0:Connection refused':selectUrl() linkStorage.
RM>crawler_group>crawler_group_4_0>Time(3704):Sleeping as a consequence of this problem: 'crawler_group_4_0:Connection refused':selectUrl() linkStorage.
R>crawler_group_0_0>Total time is 5007 mls [5000,0,0,0,0,0,0,5000,0]

ACHE crawls stall if given too few seeds

The ACHE crawl appears to be "running" as if it is still doing things, but it is actually just spinning without making progress.

Steps to reproduce:

  • start a new ACHE crawl with 3 seeds and run it.

How many seeds are needed?

HTML parser fails for some HTML files

The class PaginaURL occasionally fails to parse some HTML files and prints exceptions like java.lang.StringIndexOutOfBoundsException: String index out of range: -3 and java.net.MalformedURLException: unknown protocol: httphttp.
One way to reproduce this is to run the buildModel command using the sample training data from the config folder:

build/install/ache/bin/ache buildModel -t config/sample_training_data/ -o /tmp/model -c config/sample_config/stoplist.txt

Refactor RegexBasedDetector into a new type of classifier

RegexBasedDetector is hardcoded in TargetStorage. This class should be refactored out into a new type of classifier, so that classifiers such as WekaTargetClassifier and RegexBasedDetector can be configured through the pageclassifier.yml config file.

Create page repository that stores multiple pages per file

Currently, FileSystemTargetRepository creates one file per downloaded URL, which does not scale. A better option that stores multiple pages per file should be available. Some modules, such as the link classifier, depend on the file structure created by the repository and should be changed to remove this dependency.

  • Implement new page repository
  • Close target storage and repositories on crawler exit
  • Remove online learning dependency from directory structure

Don't assume everything is an HTML page

ACHE assumes that every URL points to an HTML page and tries to parse the HTML and classify every page downloaded. It should look at the HTTP Content-Type header and at the actual content to detect the actual media type of the data.

Write documentation for features

The following features are not documented:

  • Link filters (blacklist/whitelist) using regular expressions
  • Configuration of page classifiers:
    • SVM
    • Text regular expressions
    • URL regular expressions
  • Hard focus vs. Soft focus
  • Configuring link classifiers and online-learning
  • How to use seed finder command
  • Configuration of link selectors

Creation of a Classifier that would have both WEKA and regex-based classifier in consideration

It would be pretty sweet if we could build a Classifier that classifies a certain page using both regex and WEKA. This would help us get more precise output.

We could start off by building one with the "baseline" behaviour and adapt it from there.

@aecio , you said this was somewhat easy to do, but I don't know very well how the weka classification options you have available work exactly. Which ones would work best together? Authority with BS?

Re-train the link classifier using a independent thread

Currently, the periodic training of the link classifier happens in the same threads that process downloaded pages. It should happen in an independent daemon thread so that it doesn't block the page-processing threads (see the sketch after the list below).

  • It should not block other threads
  • It should not prevent the crawler from exiting while it is running (daemon thread)
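A minimal sketch of the daemon-thread approach (hypothetical class and method names, not ACHE's actual implementation):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LinkClassifierTrainer {

    // Single daemon thread: re-training never blocks the page-processing threads
    // and does not keep the JVM alive when the crawler shuts down.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread t = new Thread(runnable, "link-classifier-trainer");
                t.setDaemon(true);
                return t;
            });

    public void start() {
        scheduler.scheduleWithFixedDelay(this::retrain, 5, 5, TimeUnit.MINUTES);
    }

    private void retrain() {
        // Placeholder for the actual link classifier training logic.
        System.out.println("Re-training link classifier...");
    }
}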

Simplify configuration using single YML file

Instead of using three different folders and lots of configuration files, we should use a single YML file containing all configurations. Every configuration should also have a default value that is suitable for most use cases; configurations should only need to be added when the user wants to modify the default behavior.

Add support for robots.txt protocol

  • Download robots.txt file
  • Parse robots.txt rules using the crawler-commons library parsers (see the sketch after this list)
  • Forbid crawling of links blocked by the robots.txt rules.
    • Option 1: add a new link filter based on the robots rules, so blocked links are never added to the frontier
    • Option 2: add a persistent metadata field allowedByRobots to the URLs in the frontier, and never select forbidden URLs to be crawled
    • Option 3: check whether the URL is forbidden by the robots rules right before selection time, using a cache of robots rules for efficiency
  • Add configuration key to enable robots.txt compliance
  • Write unit and integration test cases
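A minimal sketch of how the crawler-commons parser mentioned above could be used (method signatures may differ between crawler-commons versions; this is an illustration, not ACHE's actual implementation):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class RobotsCheck {
    public static void main(String[] args) {
        byte[] robotsTxt = "User-agent: *\nDisallow: /private/\n".getBytes();

        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                "http://example.org/robots.txt", robotsTxt, "text/plain", "ache");

        // Used either as a link filter (option 1) or as a check at selection time (option 3).
        System.out.println(rules.isAllowed("http://example.org/public/page.html"));  // true
        System.out.println(rules.isAllowed("http://example.org/private/page.html")); // false
    }
}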

New regex-based page classifier that matches multiple fields simultaneously

Currently, there are three types of regex-based classifiers, based on the URL, content, and title fields. But there is no way to match all fields at the same time or to combine multiple classifiers.

It would be useful to have a regex classifier that supports a boolean combination (OR or AND) of regexes over multiple fields:

type: regex
parameters:
  combiner: OR|AND
  fields:
    url:
    - pattern1-for-url
    - pattern2-for-url
    title:
    - pattern1-for-title
    content:
    - pattern1-for-content

Crawling backlinks from auth pages not working

ACHE can use an external API (such as lsapi.seomoz.com) to crawl backlinks of authority pages. But, currently, this feature is not working: the code has not been kept up to date and the seomoz API returns an HTTP 401 error.

To reproduce the error, the following options should be configured:

In the file link_storage.cfg, change SAVE_BACKLINKS to true:

SAVE_BACKLINKS TRUE

and the key BACKLINK_CONFIG should be set to the absolute path of the file backlink.cfg:

BACKLINK_CONFIG  /path/to/config/sample_config/link_storage/backlink.cfg

In the file target_storage.cfg, set:

 BIPARTITE TRUE

The stack trace follows:

java.io.IOException: Server returned HTTP response code: 401 for URL: http://lsapi.seomoz.com/linkscape/links/www.businessinsider.com%2Fnew-york-city-ebola-2014-10?AccessID=member-2e52b09aae&Expires=1365280453&Signature=WFcSAnhBG62xmt2f57bGrqCtiOM%3D&Filter=external&Scope=page_to_page&Limit=50&Sort=page_authority&SourceCols=4&TargetCols=4
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1839)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
        at focusedCrawler.link.classifier.builder.BacklinkSurfer.downloadPage(BacklinkSurfer.java:248)
        at focusedCrawler.link.classifier.builder.BacklinkSurfer.downloadBacklinks(BacklinkSurfer.java:137)
        at focusedCrawler.link.classifier.builder.BacklinkSurfer.getLNBacklinks(BacklinkSurfer.java:172)
        at focusedCrawler.link.BipartiteGraphManager.insertBacklinks(BipartiteGraphManager.java:156)
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:168)
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:118)
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:391)
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284)
Generic Exception

[2015-07-24 12:28:16,413] INFO [crawler_group_2_0] (LinkStorage.java:201) - An Exception occurred.
java.lang.NullPointerException: null
        at focusedCrawler.link.classifier.builder.BacklinkSurfer.getLNBacklinks(BacklinkSurfer.java:173) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.link.BipartiteGraphManager.insertBacklinks(BipartiteGraphManager.java:156) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:168) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:118) [ache-0.3.1.jar:0.3.1]
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:391) [ache-0.3.1.jar:0.3.1]
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284) [ache-0.3.1.jar:0.3.1]
[2015-07-24 12:28:16,414]ERROR [crawler_group_2_0] (CrawlerImpl.java:397) - Problem while sending page to storage.
focusedCrawler.util.storage.StorageException: null
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:202) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:118) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:391) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284) [ache-0.3.1.jar:0.3.1]
[2015-07-24 12:28:16,415]ERROR [crawler_group_2_0] (Crawler.java:294) - crawler_group_2_0:null
focusedCrawler.crawler.CrawlerException: crawler_group_2_0:null
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:398) ~[ache-0.3.1.jar:0.3.1]
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284) ~[ache-0.3.1.jar:0.3.1]
java.lang.NullPointerException
        at focusedCrawler.link.classifier.builder.BacklinkSurfer.getLNBacklinks(BacklinkSurfer.java:173)
        at focusedCrawler.link.BipartiteGraphManager.insertBacklinks(BipartiteGraphManager.java:156)
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:168)
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:118)
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:391)
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284)
--
focusedCrawler.util.storage.StorageException
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:202)
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:118)
        at focusedCrawler.crawler.CrawlerImpl.sendData(CrawlerImpl.java:391)
        at focusedCrawler.crawler.Crawler.run(Crawler.java:284)

Use named parameters in command line interface

Currently, ACHE is started using a command-line interface with fixed positional parameters, like this:

ache startCrawl <data output path> <config path> <seed path> <model path> <lang detect profile path>

The command line should be modified to accept named parameters:

ache startCrawl --data-output <data output path> --config-path <config path> --seed-path <seed path> --model-path <model path> --lang-profile <lang detect profile path>

All ACHE commands should accept named parameters, namely:

ache startCrawl
ache addSeeds
ache buildModel
ache startLinkStorage
ache startTargetStorage
ache startCrawlManager

It's also desirable that parameters can have a short form:

  • --data-output or -o
  • --model-path or -m
  • --lang-profile or -l

This feature should be implemented using a library like argparse4j (http://argparse4j.sourceforge.net) or Apache Commons CLI (https://commons.apache.org/proper/commons-cli/).

This feature is important to allow optional parameters to be added in the future without breaking clients that are using the interface. For instance, the ElasticSearch integration will need this feature to set up the index name when ElasticSearch is used as the TargetStorage backend.
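A minimal sketch with Apache Commons CLI, one of the libraries mentioned above (option names are taken from the proposal; this is not the actual implementation):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.Options;

public class StartCrawlCli {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.addOption("o", "data-output", true, "Path to the data output directory");
        options.addOption("c", "config-path", true, "Path to the configuration directory");
        options.addOption("s", "seed-path", true, "Path to the seed file");
        options.addOption("m", "model-path", true, "Path to the page classifier model");

        // Parse the arguments and read values by short or long name.
        CommandLine cmd = new DefaultParser().parse(options, args);
        System.out.println("Output dir: " + cmd.getOptionValue("o"));
        System.out.println("Model path: " + cmd.getOptionValue("m"));
    }
}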

OutOfMemoryError problems during large crawls

ACHE is stopping due to different types of OutOfMemoryErrors being thrown during large crawls.

Examples:

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3664)
        at java.lang.String.<init>(String.java:201)
        at java.lang.String.toLowerCase(String.java:2635)
        at java.lang.String.toLowerCase(String.java:2658)
        at focusedCrawler.util.parser.PaginaURL.separadorTextoCodigo(PaginaURL.java:410)
        at focusedCrawler.util.parser.PaginaURL.<init>(PaginaURL.java:128)
        at focusedCrawler.util.parser.PaginaURL.<init>(PaginaURL.java:110)
        at focusedCrawler.util.vsm.VSMVector.<init>(VSMVector.java:167)
        at focusedCrawler.target.classifier.WekaTargetClassifier.getValues(WekaTargetClassifier.java:104)
        at focusedCrawler.target.classifier.WekaTargetClassifier.distributionForInstance(WekaTargetClassifier.java:90)
        at focusedCrawler.target.classifier.WekaTargetClassifier.classify(WekaTargetClassifier.java:74)
        at focusedCrawler.target.classifier.KeepLinkRelevanceTargetClassifier.classify(KeepLinkRelevanceTargetClassifier.java:22)
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:80)
        at focusedCrawler.crawler.async.FetchedResultHandler.processData(FetchedResultHandler.java:55)
        at focusedCrawler.crawler.async.FetchedResultHandler.completed(FetchedResultHandler.java:29)
        at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.doHandle(HttpDownloader.java:330)
        at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.run(HttpDownloader.java:313)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.String.substring(String.java:1957)
        at java.lang.String.split(String.java:2341)
        at java.lang.String.split(String.java:2410)
        at focusedCrawler.link.BipartiteGraphRepository.getLNs(BipartiteGraphRepository.java:104)
        at focusedCrawler.link.OnlineLearning.forwardClassifier(OnlineLearning.java:233)
        at focusedCrawler.link.OnlineLearning.execute(OnlineLearning.java:64)
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:183)
        at focusedCrawler.link.LinkStorage.insert(LinkStorage.java:124)
        at focusedCrawler.target.TargetStorage.insert(TargetStorage.java:98)
        at focusedCrawler.crawler.async.FetchedResultHandler.processData(FetchedResultHandler.java:55)
        at focusedCrawler.crawler.async.FetchedResultHandler.completed(FetchedResultHandler.java:29)
        at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.doHandle(HttpDownloader.java:330)
        at focusedCrawler.crawler.async.HttpDownloader$FetchFinishedHandler.run(HttpDownloader.java:313)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.esotericsoftware.kryo.serializers.DefaultSerializers$URLSerializer.read(DefaultSerializers.java:869)
        at com.esotericsoftware.kryo.serializers.DefaultSerializers$URLSerializer.read(DefaultSerializers.java:859)
        at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:782)
        at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:132)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:540)
        at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:709)
        at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.unserializeObject(RocksDBHashtable.java:79)
        at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable.access$100(RocksDBHashtable.java:21)
        at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable$RocksDBIterator.next(RocksDBHashtable.java:172)
        at focusedCrawler.util.persistence.rocksdb.RocksDBHashtable$RocksDBIterator.next(RocksDBHashtable.java:125)
        at focusedCrawler.link.frontier.FrontierManager.loadQueue(FrontierManager.java:138)
        at focusedCrawler.link.frontier.FrontierManager.nextURL(FrontierManager.java:231)
        at focusedCrawler.link.LinkStorage.select(LinkStorage.java:213)
        at focusedCrawler.crawler.async.AsyncCrawler.run(AsyncCrawler.java:60)
        at focusedCrawler.Main.startCrawl(Main.java:310)
        at focusedCrawler.Main.main(Main.java:113)

Better handling of HTTP URL redirections

Some URLs return an HTTP redirection to other URLs. Both the original and the redirected URL should be stored.

    public class Page {
      URL redirectedUrl;
      URL getRedirectedUrl() {
         //...
      }
    }

The HTML parser should use the redirected URL as the base URL to resolve relative links. Finally, the redirected URL should be stored in LinkStorage as already visited, and the data stored in TargetStorage should also include the redirected URL.
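A minimal sketch of the base-URL resolution point (standard java.net.URL behavior, illustrative only):

import java.net.URL;

public class RedirectResolution {
    public static void main(String[] args) throws Exception {
        URL requested  = new URL("http://example.org/old/page.html");
        URL redirected = new URL("http://example.org/new/page.html");

        // Resolving against the redirected URL yields the correct absolute link.
        System.out.println(new URL(redirected, "images/logo.png")); // http://example.org/new/images/logo.png

        // Resolving against the originally requested URL yields the wrong one.
        System.out.println(new URL(requested, "images/logo.png"));  // http://example.org/old/images/logo.png
    }
}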
