
===============================
Introduction
===============================

Bixo is an open source Java web mining toolkit that runs as a series of Cascading
pipes. It is designed to be used as a tool for creating customized web mining apps.
By building a customized Cascading pipe assembly, you can quickly create a workflow
using Bixo that fetches web content, parses, analyzes, and publishes the results.

Bixo borrows heavily from the Apache Nutch project, as well as many other open source
projects at Apache and elsewhere.

Bixo is released under the Apache License, Version 2.0.

===============================
Building
===============================

See http://openbixo.org/documentation/building-bixo/ for full details.

You need Apache Ant 1.7 or higher. 

To get a list of valid targets:

% cd <project directory>
% ant -p

To clean and build a jar (which also runs all tests):

% ant clean jar

Note that "ant clean test jar" will currently fail, due to a bug in the maven ant task
plugin used for managing dependencies.

To create Eclipse project files:

% ant eclipse

Then, from Eclipse follow the standard procedure to import an existing Java project into your Workspace.


bixo's People

Contributors

cwensel, devsprint, finack, kkrugler, mgraney, rockwalrus, schmed, sgroschupf, vmagotra


bixo's Issues

build failed due to missing dependencies

I'm trying to build Bixo from the git sources according to the instructions on the GitHub page: https://github.com/bixo/bixo

When I run 'ant clean test jar' it fails with the following output:

$ ant clean test jar
Buildfile: build.xml

clean:
     [echo] cleaning bixo-core

mvn-init:
     [echo] maven.repo.local=/home/ivanhoe/.m2/repository
[artifact:dependencies] [INFO] snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT: checking for updates from Apache Snapshots
[artifact:dependencies] [WARNING] repository metadata for: 'snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT' could not be retrieved from repository: Apache Snapshots due to an error: Error transferring file
[artifact:dependencies] [INFO] Repository 'Apache Snapshots' will be blacklisted
[artifact:dependencies] [INFO] snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT: checking for updates from Apache Releases
[artifact:dependencies] [WARNING] repository metadata for: 'snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT' could not be retrieved from repository: Apache Releases due to an error: Error transferring file
[artifact:dependencies] [INFO] Repository 'Apache Releases' will be blacklisted
[artifact:dependencies] Downloading: org/apache/tika/tika-parsers/0.9-SNAPSHOT/tika-parsers-0.9-SNAPSHOT.pom from Bixo
[artifact:dependencies] Downloading: org/apache/tika/tika-parsers/0.9-SNAPSHOT/tika-parsers-0.9-SNAPSHOT.jar from Bixo
[artifact:dependencies] An error has occurred while processing the Maven artifact tasks.
[artifact:dependencies]  Diagnosis:
[artifact:dependencies] 
[artifact:dependencies] Unable to resolve artifact: Missing:
[artifact:dependencies] ----------
[artifact:dependencies] 1) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies]   Try downloading the file manually from the project website.
[artifact:dependencies] 
[artifact:dependencies]   Then, install it using the command: 
[artifact:dependencies]       mvn install:install-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file
[artifact:dependencies] 
[artifact:dependencies]   Alternatively, if you host your own repository you can deploy the file there: 
[artifact:dependencies]       mvn deploy:deploy-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[artifact:dependencies] 
[artifact:dependencies]   Path to dependency: 
[artifact:dependencies]         1) bixo:bixo-core:jar:1.0-SNAPSHOT
[artifact:dependencies]         2) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies] ----------
[artifact:dependencies] 1 required artifact is missing.
[artifact:dependencies] 
[artifact:dependencies] for artifact: 
[artifact:dependencies]   bixo:bixo-core:jar:1.0-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies] from the specified remote repositories:
[artifact:dependencies]   Bixo (http://bixo.github.com/repo/),
[artifact:dependencies]   Apache Releases (https://repository.apache.org/content/repositories/releases/),
[artifact:dependencies]   central (http://repo1.maven.org/maven2),
[artifact:dependencies]   Apache Snapshots (https://repository.apache.org/content/groups/snapshots-group/)
[artifact:dependencies] 
[artifact:dependencies] 

BUILD FAILED
/usr/home/ivanhoe/work/bixo.git/bixo/build.xml:64: Unable to resolve artifact: Missing:
----------
1) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
        1) bixo:bixo-core:jar:1.0-SNAPSHOT
        2) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT

----------
1 required artifact is missing.

for artifact: 
  bixo:bixo-core:jar:1.0-SNAPSHOT

from the specified remote repositories:
  Bixo (http://bixo.github.com/repo/),
  Apache Releases (https://repository.apache.org/content/repositories/releases/),
  central (http://repo1.maven.org/maven2),
  Apache Snapshots (https://repository.apache.org/content/groups/snapshots-group/)



Total time: 3 seconds

It looks like the tika-parsers artifact is missing from the Bixo repository.

Update Cascading Dependency

Bixo depends on cascading-core version 1.2.5. The most recent version is 2.1.0.
Any chance of upgrading it?

bin/bixo is working incorrectly with Hadoop 0.20.203.0

It looks like the Hadoop jar file naming scheme was changed.

A simple fix:

--- bin/bixo.fix1       2011-09-14 14:40:29.035885594 +0400
+++ bin/bixo    2011-09-14 14:41:40.798931767 +0400
@@ -97,7 +97,7 @@
 done

 # add Hadoop libs to CLASSPATH
-for f in $HADOOP_HOME/hadoop-*-core.jar; do
+for f in $HADOOP_HOME/hadoop-core-*.jar; do
   CLASSPATH=${CLASSPATH}:$f;
 done
 for f in $HADOOP_HOME/lib/*.jar; do
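Until the script is patched, a loop that tolerates both naming schemes is one way out. A minimal sketch, assuming nothing about a real install (the HADOOP_HOME fixture and jar file names below are illustrative):

```shell
# Sketch: accept both the old (hadoop-*-core.jar) and new (hadoop-core-*.jar)
# naming schemes. HADOOP_HOME is a throwaway fixture so this runs anywhere.
HADOOP_HOME=$(mktemp -d)
touch "$HADOOP_HOME/hadoop-0.20.2-core.jar"      # old scheme
touch "$HADOOP_HOME/hadoop-core-0.20.203.0.jar"  # new scheme

CLASSPATH=""
for f in "$HADOOP_HOME"/hadoop-*-core.jar "$HADOOP_HOME"/hadoop-core-*.jar; do
  # skip unexpanded globs when one of the two schemes has no match
  [ -e "$f" ] && CLASSPATH=${CLASSPATH}:$f
done
echo "$CLASSPATH"
```

The `[ -e "$f" ]` guard matters because an unmatched glob is passed through literally by the shell.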

Add Bixo page to Wikipedia

  1. Create page - http://en.wikipedia.org/wiki/Bixo_(web_crawler)
    See http://en.wikipedia.org/wiki/Heritrix for idea of format/content
    Put in http://en.wikipedia.org/wiki/Category:Free_web_crawlers category
  2. Edit Bixo page to disambiguate - http://en.wikipedia.org/w/index.php?title=Bixo&action=edit
    See http://en.wikipedia.org/wiki/Disambiguation_page#Disambiguation_pages for how-to
  3. Add to list of crawlers - http://en.wikipedia.org/wiki/Web_crawler
  4. Add to Hadoop "see also" section on Wikipedia.

Move web site to new location

  1. Get the content from bixo.101tec.com (either manually, or Marko sends us a dump)
  2. Re-post to new WordPress site, or reformat (ugh) for the GitHub wiki
  3. Fix up formatting problems, links (if absolute).

bin/bixo fails when there is more than one bixo-core.*.jar in the dist directory

I've been trying to run bin/bixo to run the SimpleTool after downloading a distribution and building it with "ant dist" - which I then installed elsewhere.

It is failing in the bixo script because there are in fact two bixo-core-*.jar files:

alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ ls
bixo-core-1.0-SNAPSHOT.jar  classes-it  classes-main-eclipse  dfs  lib
bixo-dist-1.0-SNAPSHOT  classes-it-eclipse  classes-test  it  test
bixo-dist-1.0-SNAPSHOT.tgz  classes-main  classes-test-eclipse  java-doc  test-it
alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ pwd
/home/alex/projects/bixo/bixo-bixo-da66523/build
alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ find . -name "bixo-core-*.jar"
./bixo-core-1.0-SNAPSHOT.jar
./bixo-dist-1.0-SNAPSHOT/bixo-core-1.0-SNAPSHOT.jar

Now, maybe the distribution I've built is wrong, but the bixo script could be better prepared.

I've gotten around it for now by changing

BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar"`

to

BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar" | head -1`

but neither directory actually has any useful jars in it.


What is the current policy for fetching and using all the libraries which maven fetches? Do I have to run maven when I am just trying to run the code even if I am not compiling it?
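The head -1 workaround described above can be exercised locally. This sketch fabricates the duplicate-jar layout from the transcript (the paths are illustrative, not a real install):

```shell
# Fabricate the layout: the same bixo-core jar appears both at the top level
# and inside the unpacked dist directory.
BIXO_HOME=$(mktemp -d)
mkdir -p "$BIXO_HOME/bixo-dist-1.0-SNAPSHOT"
touch "$BIXO_HOME/bixo-core-1.0-SNAPSHOT.jar"
touch "$BIXO_HOME/bixo-dist-1.0-SNAPSHOT/bixo-core-1.0-SNAPSHOT.jar"

# Unpatched: two matches, so BIXO_CORE ends up as a two-line string.
find "$BIXO_HOME" -name "bixo-core-*.jar" | wc -l   # 2

# Patched: keep only the first match.
BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar" | head -1`
echo "$BIXO_CORE"
```

Picking the first match silently is a band-aid; a more defensive script would error out when more than one jar matches.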

The build instructions don't work out of the box

I can't get anything to build.

I've run

$ ant clean jar

in both the root and examples directories.

In the root directory I get this error:

test-it:
[mkdir] Created dir: /Users/dane/Development/recommender/bixo/build/it
[junit] Running bixo.fetcher.FetcherTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 122.751 sec
[junit] Running bixo.fetcher.http.SimpleHttpFetcherIntegrationTest
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.724 sec
[junit] Test bixo.fetcher.http.SimpleHttpFetcherIntegrationTest FAILED

BUILD FAILED
/Users/dane/Development/recommender/bixo/build.xml:256: Tests failed!

In the examples directory I get this error:

BUILD FAILED
/Users/dane/Development/recommender/bixo/examples/build.xml:49: Unable to resolve artifact: Missing:

1) bixo:bixo-core:jar:1.0-SNAPSHOT

Try downloading the file manually from the project website.

Then, install it using the command:
mvn install:install-file -DgroupId=bixo -DartifactId=bixo-core -Dversion=1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file

Alternatively, if you host your own repository you can deploy the file there:
mvn deploy:deploy-file -DgroupId=bixo -DartifactId=bixo-core -Dversion=1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

Path to dependency:
1) bixo:bixo-examples:jar:1.0-SNAPSHOT
2) bixo:bixo-core:jar:1.0-SNAPSHOT


1 required artifact is missing.

for artifact:
bixo:bixo-examples:jar:1.0-SNAPSHOT

from the specified remote repositories:
Bixo (http://bixo.github.com/repo/),
Conjars (http://conjars.org/repo),
central (http://repo1.maven.org/maven2)

Total time: 2 seconds

When I run

$ bin/bixodemo crawl -agentname dane -domain www.sotmclub.com -outputdir output -numloops 3

in the example directory, I get this error

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

or this error

Exception in thread "main" java.lang.NoClassDefFoundError: bixo/examples/crawl/DemoCrawlTool
Caused by: java.lang.ClassNotFoundException: bixo.examples.crawl.DemoCrawlTool
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

Update references to Bixo web site everywhere

Once the site has moved, all of the refs need to be updated. This probably isn't a complete list:

  • Yahoo developer mailing list
  • GitHub site (this place)
  • Ohloh project info

And we should see if 101tec.com will set up a permanent redirect to the new location.

tika-parsers dependency version incorrect

The version for artifact "tika-parsers" (in pom.xml) is set to "0.8-SNAPSHOT" in the Bixo 0.5.1 dist -- this causes mvn-init to fail. This version should be "0.8".
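The fix itself is a one-line version change in pom.xml. As an illustration only (the pom fragment below is a minimal stand-in, not Bixo's actual pom):

```shell
# Minimal stand-in for the relevant pom.xml fragment.
pom=$(mktemp)
cat > "$pom" <<'EOF'
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.8-SNAPSHOT</version>
</dependency>
EOF

# Drop the -SNAPSHOT suffix from the tika-parsers version.
sed -i.bak 's|<version>0.8-SNAPSHOT</version>|<version>0.8</version>|' "$pom"
grep '<version>' "$pom"
```

The `-i.bak` form of in-place editing works with both GNU and BSD sed.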

Improve notes on how to run SimpleCrawlTool from Eclipse

Currently we don't say much beyond describing how to create and then import the project into Eclipse.

Specifically, we should talk about which class to use as the basis for creating a new Java "Run Configuration", along with the required parameters (including -Xmx256m for the JVM).

Fix up or remove SimpleStatusTool

This assumes that crawl state is maintained in loop directories, but some of it is now in the SQL database. So either the code should be fixed up, or it should be removed.

Update release procedure

  1. No longer use TeamCity - just push to the GitHub download section.
    Note that we could use the API to push it, as part of the dist build.
  2. Add step to edit release note.
  3. Add step to post to Freshmeat.

dist package not complete - error

I downloaded the latest bixo-dist-0.5.1

I am trying to run

bin/bixo crawl -agentname test -domain www.google.com -outputdir output -numloops 3

I get an error saying HADOOP_HOME is not set, so I set it to my Hadoop directory.

I then copied the Hadoop core jar from Hadoop into the bin directory, and now I get the following error:

Exception running tool: org/apache/commons/configuration/Configuration
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration

Shouldn't the dist archive contain all the dependencies needed to run the command?

Create build target for everything in contrib directory

Currently there's only the "helpful" example. It doesn't get built via ant, just Eclipse, so it's often broken. It would be great to have a target for building all sub-projects found in the contrib/ directory, and also a test-contrib target that depends on compile-contrib. Then the dist target should depend on test-contrib, so we don't do releases if contribs don't compile or pass their tests.

Improve SimpleRobotRules parser

Currently there are a number of directives that it should ignore rather than report as warnings.

It should also handle common typos/formatting errors - check out the robot-errors.zip file in the Downloads directory, which contains about 24K files that had processing problems.

Examples of formatting errors include:

  • Missing the ':' after the directive (user-agent * instead of user-agent: *)
  • Adding a space before the ':' separator (user-agent : * instead of user-agent: *)

Fix up openbixo.org, once it's been moved to BlueHost

  • Get rid of Pages widget, once sub-dirs have explicit links (current issue for Matt)
  • Use custom CSS to tune appearance
  • Set appropriate complementary colors for links and such. Could check out Kuler for ideas
  • Decide about sticking w/green color theme
  • Set up for using Blog as News page

Pick new domain for hosting Bixo web site

bixo.101tec.com is going away. We've got three options (at least):

  • Use bixo.bixolabs.com. But being so tightly associated with a company (Bixolabs) might not be a good thing. And editing would require access to the bixolabs.com site.
  • Use something like bixo-project.org. $40/year for domain name, mapping to wordpress.com, and custom CSS (if we do it via wordpress.com). Or $20/mo for Slicehost.com.
  • Don't worry about it, and just use the wiki here to provide documentation. Free, but wiki is much more limited in formatting and such.

Some input on each of the above would be great.

Modify SimpleParser to allow a ParseContext to be set

With the current design of SimpleParser (and TikaCallable) there is no way to set a different HtmlMapper than the one TikaCallable uses by default. So, if a caller wants to use the IdentityHtmlMapper instead there is no easy way to do that - one has to duplicate the code (SimpleParser and TikaCallable).
It would be nice to have a constructor in SimpleParser that allowed a caller to pass in the ParseContext. TikaCallable would also need to change to allow passing in a ParseContext. If the ParseContext is null, TikaCallable can make the default ParseContext (as it is doing now), otherwise it should just pass along what it received to Tika's parse function.

Update copyright of all files

New copyright holder is Bixo Labs.

Need standard template used for all header files.

Some files are originally from the Nutch project, so they need special info in the header.

Need to work w/Stefan on copyright for files currently assigned to 101tec.

This should wait until we decide whether to change the license to Apache (it's currently MIT).

Improve Bixo web site for findability

  • It would be great to have Google Analytics installed - free if we use wordpress.com
    Though some additional work helps with using Google/Yahoo/Bing webmaster tools (meta tags)
  • Change title to be "Open Source Web Mining Toolkit | Bixo"

Error: JAVA_HOME is not set

Building Bixo from the instructions at http://openbixo.org/documentation/building-bixo/, and then testing the build with "bin/bixo crawl ...." from http://openbixo.org/documentation/getting-started/, throws the error that JAVA_HOME is not set.

However, "echo $JAVA_HOME" works fine, which in my case is /opt/java, and "which java" returns /opt/java/bin/java, which tallies.
I'm using Arch Linux.

Wondering if there's any other location/configuration where I need to set JAVA_HOME?

bin/bixo is working incorrectly with symlinked BIXO_HOME

easy fix:

--- bin/bixo.orig   2011-09-14 14:11:02.837699246 +0400
+++ bin/bixo        2011-09-14 14:11:35.967722777 +0400
@@ -81,7 +81,7 @@
 IFS=

-BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar"`
+BIXO_CORE=`find "$BIXO_HOME/" -name "bixo-core-*.jar"`
 if [ -z BIXO_CORE ]; then
        echo "Unable to find the bixo-core jar"
        exit 1
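The behavior behind this fix is easy to reproduce outside of bin/bixo: by default, find does not follow a symbolic link given as its starting point, but a trailing "/" forces the link to be resolved as a directory. The layout below is a throwaway fixture:

```shell
# Fixture: a jar inside a real directory, reached through a symlink.
tmp=$(mktemp -d)
mkdir "$tmp/real"
touch "$tmp/real/bixo-core-1.0-SNAPSHOT.jar"
ln -s "$tmp/real" "$tmp/link"

without_slash=`find "$tmp/link" -name "bixo-core-*.jar"`
with_slash=`find "$tmp/link/" -name "bixo-core-*.jar"`
echo "without slash: '$without_slash'"   # empty - the jar is not found
echo "with slash:    '$with_slash'"
```

Passing -H (or -L) to find would be an alternative to the trailing slash.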

Create Cascading workflow to convert fetched content into Avro files

This would take as input a crawl directory, and extract the FetchedDatum tuples from the /content subdir, then write out the results as Avro files.

There would need to be control for filtering content. For example, with the PTD project we'd want to exclude any FetchedDatum that has the no archive metadata flag set, or that has a language metadata tag set to anything other than English, or (maybe) uses a charset other than one of the standard English ones (us-ascii, UTF-8, ISO-8859-1, cp1252).

There may be the need to control the number of reducers, to get the desired number of resulting part-XXXXX output files. Each Avro file shouldn't be too big (e.g. 100MB max, for the PTD project).

For the PTD project, there would also be a final processing of the Avro files that might need to be bash scripted:

  1. Use Hadoop to copy from HDFS to local storage on master
  2. Compress as xxx.gz
  3. Use Hadoop to push to S3
  4. Delete from local storage
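The four steps above can be sketched as a shell loop. The hadoop CLI invocations, HDFS paths, and S3 bucket name are assumptions about the target cluster; hadoop is stubbed here so the control flow can run anywhere:

```shell
# Sketch of steps 1-4; "hadoop" is stubbed so this runs without a cluster.
hadoop() { echo "[stub] hadoop $*"; }

work=$(mktemp -d)
# Pretend step 1 has already landed two Avro part files locally.
printf 'avro' > "$work/part-00000"
printf 'avro' > "$work/part-00001"

for part in "$work"/part-*; do
  hadoop fs -get "/crawl/avro/$(basename "$part")" "$part"            # 1. HDFS -> local
  gzip -f "$part"                                                     # 2. compress
  hadoop fs -put "$part.gz" "s3n://ptd-bucket/$(basename "$part").gz" # 3. push to S3
  rm "$part.gz"                                                       # 4. delete locally
done
ls "$work"   # empty - everything compressed, pushed, and removed
```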

Set up Bixo webinar

Several steps to this:

  • Generate list of names (for people not on lists)
    • Kelvin Tan
    • Schmed
    • Rob Wunderlich
    • Stack?
    • Erich Nachbar
    • Who else on crawler-commons project?
    • Invite people on Bixo-dev list (Fuad?)
  • Publicize it
    • Crawler-commons mailing list
    • Bixo-dev mailing list
    • ken-blog
    • Twitter
    • LinkedIn Status
    • ACM data mining SIG
  • What system to use to present it
    • GoToMeeting?
    • Needs to be something free, that works well w/Mac & Windows
  • Pick date/time that works for US & Europe
  • Create outline of material (something for Ken)
  • Make sure we can record it, for posting to Bixo project site.

DomainNames should handle the German country code (and those like it) more carefully

Currently, the code in DomainNames that tries to find the "paid level domain" of the input domain treats ae.com, gq.com, and hm.com as if they were domains within the Arab Emirates, Equatorial Guinea, and Heard & McDonald Islands (respectively). These are instead domains within the United States.

The problem stems from the (over)use of DomainNames.CC_ALWAYS_TLDS. The purpose of this set of country codes is to list those that should be treated as top-level domains after a global top-level domain prefix is added. For example, com.de is considered a paid level domain, because de is a member of CC_ALWAYS_TLDS, whereas com.jp is itself a top-level domain, making honda.com.jp a paid-level domain.

Unfortunately, this set is also (inappropriately) being used to list those country codes that should be treated as top-level domains after a global top-level domain suffix is added. For example, de.com is a German top-level domain (see http://www.de.com/ for details), whereas (as mentioned before) ae.com, gq.com, and hm.com are paid level domains in the United States.

We need to add a new set of country codes that includes de which DomainNames.getPld can check to determine what to do with input domains like de.com, ae.com, gq.com and hm.com. I'm not exactly sure how to determine what other country codes besides de should be put into the new set. Unfortunately, the source document I used to build the constants in DomainNames.java (https://github.com/bixo/bixo/blob/master/doc/effective_tld_names.dat) doesn't specify which CCs should be treated this way.

Ken suggests trying out the equivalent domain processing support that currently exists in crawler-commons.

build/lib is not created by dist target, but is read from

If you try an "ant dist" from the tarball, it reads from the ./build/lib directory. However, nothing in "ant dist" creates that directory, so ant fails.

I am guessing that the directory does exist if you have previously done an "ant job".

I'm not sure whether the problem is simply that the directory needs to exist, or whether there are some libraries inside which also need to exist.

(Tested on the 0.4.8 tarball)

Add support for reading/writing WARC files

I think this means defining a new Cascading Scheme that knows about the WARC file format.

I've uploaded an example WARC file here.

I've added a WarcOutputFormat class (just a start) to Bixo that I think would be what's needed on the Hadoop side of things, based on the Cascading TextLine class. This uses the Heritrix WARCWriter class to handle writing out records.

Side note - the ClueWeb09 project generated some older-format ARC files that were invalid, apparently due to newlines in header fields. This might be something to look out for, when generating WARC files.

Get rid of release branch

We should put releases into the Downloads area. Getting rid of the release branch should save a bunch of space, since I think each of these has a release/xxx tarball file that's different.
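Assuming the remote is named origin (an assumption about the real repo), the mechanics are straightforward; demonstrated here against a throwaway local repository:

```shell
# Throwaway repository standing in for the real one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "init"
git branch release

git branch -d release              # delete the local branch
# git push origin --delete release # ...and then the one on GitHub
git branch --list release          # prints nothing
```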

Measure performance during large-scale crawl using minimal HttpClient and lockless connection manager

Oleg had said:

You might want to have a look at the latest code in SVN trunk (to be
released as 4.3). Several classes such as the scheme registry that
previously had to be synchronized in order to ensure thread safety have
been replaced with immutable equivalents. There is also now a way to
create HttpClient in a minimal configuration without authentication,
state management (cookies), proxy support and other non-essential
functions.

The new API is not yet final and not properly documented. Presently this
can be done with HttpClients#createMinimal

He also said:

I experimented with the idea of lock-less (unlimited) connection manager

This was in response to an issue I'd run into, where the single global lock on the connection pool was causing a lot of contention when many (hundreds) of threads were all fetching at the same time. He's provided source code, which unfortunately I can't attach here - but it's on my disk.

Add support for boilerpipe cleanup of HTML

See http://code.google.com/p/boilerpipe/ - code is under Apache 2.0 license.

Currently relies on NekoHTML to do parsing, but we could modify it (the DefaultHTMLParser class) to just be a ContentHandler that you can optionally install during the parse process, to extract the content (similar to what we do now for language detection).

And we should ping Christian Kohlschütter (the author) when it's done so he can check it out - [email protected]

small fixes

Hi,

I've made a few fixes, patch is below:

From aa3cc26997aec1422472918ba0d0a32c5dfb261a Mon Sep 17 00:00:00 2001
From: ifp5 <>
Date: Mon, 15 Nov 2010 10:42:55 +0000
Subject: [PATCH] small fixes:
 - changed shell path to /bin/sh as /bin/bash doesn't exists in many OSes (e.g. FreeBSD, Solaris)
 - install bixo-core to maven repo before building contrib sources 
 - fixed contrib sources compilation (building contrib required to create distribution and it may fail otherwise)

---
 bin/bixo                                           |    2 +-
 build.xml                                          |    1 +
 .../com/transpac/helpful/tools/AnalyzeEmail.java   |    5 +++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bin/bixo b/bin/bixo
index d11f493..9a707b7 100755
--- a/bin/bixo
+++ b/bin/bixo
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/bin/sh
 #
 # The Bixo command script
 #
diff --git a/build.xml b/build.xml
index 52ca926..e24f827 100644
--- a/build.xml
+++ b/build.xml
@@ -555,6 +555,7 @@ SOFTWARE.
        [the XML added by this hunk was stripped when the patch was posted]
diff --git a/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java b/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
index e4335e3..061a9e2 100644
--- a/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
+++ b/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
@@ -61,6 +61,7 @@ public class AnalyzeEmail {
        private static final int MAX_CONTENT_SIZE = 8 * 1024 * 1024;

        private static final int MAX_THREADS = 1;
+       private static final int NUM_REDUCERS = 1;

        private static final String MBOX_PAGE_STATUS_PIPE_NAME = "mbox page fetch status pipe";
        private static final String SPLITTER_PIPE_NAME = "Split emails pipe";
@@ -175,7 +176,7 @@ public class AnalyzeEmail {
             BaseScoreGenerator scorer = new FixedScoreGenerator();

             BaseFetcher fetcher = new SimpleHttpFetcher(MAX_THREADS, userAgent);
-            FetchPipe fetchPagePipe = new FetchPipe(importPipe, scorer, fetcher);
+            FetchPipe fetchPagePipe = new FetchPipe(importPipe, scorer, fetcher, NUM_REDUCERS);

             // Here's the pipe that will output UrlDatum tuples, by extracting URLs from the mod_mbox-generated page.
                Pipe mboxPagePipe = new Each(fetchPagePipe.getContentTailPipe(), new ParseModMboxPageFunction(), Fields.RESULTS);
@@ -190,7 +191,7 @@ public class AnalyzeEmail {
             fetcher = new SimpleHttpFetcher(MAX_THREADS, defaultPolicy, userAgent);

             // We can create the fetch pipe, and set up our Mbox splitter to run on content.
-            FetchPipe fetchMboxPipe = new FetchPipe(mboxPagePipe, scorer, fetcher);
+            FetchPipe fetchMboxPipe = new FetchPipe(mboxPagePipe, scorer, fetcher, NUM_REDUCERS);
             SplitEmails splitterPipe = new SplitEmails(fetchMboxPipe);

             // Now create the pipe that's going to analyze the emails we get after splitting them up.
-- 
1.7.3.1

Create svg versions of the graffle diagrams

Please can you export the graffle diagrams from the doc section of Bixo's GitHub repo and save them as SVG as well as graffle?

I appreciate that you may not want to store "compiled" things in source control, but I've tried a free graffle2svg converter without success.

Update version of commons-io

Bixo is using version 1.4 of commons-io, while the latest release available on Maven Central is 2.0.1. We should probably update the version that Bixo is using.
