
===============================
Introduction
===============================

Bixo is an open source Java web mining toolkit that runs as a series of Cascading
pipes. It is designed to be used as a tool for creating customized web mining apps.
By building a customized Cascading pipe assembly, you can quickly create a workflow
using Bixo that fetches web content, parses, analyzes, and publishes the results.

Bixo borrows heavily from the Apache Nutch project, as well as many other open source
projects at Apache and elsewhere.

Bixo is released under the Apache License, Version 2.0.

===============================
Building
===============================

See http://openbixo.org/documentation/building-bixo/ for full details.

You need Apache Ant 1.7 or higher. 

To get a list of valid targets:

% cd <project directory>
% ant -p

To clean and build a jar (which also runs all tests):

% ant clean jar

Note that "ant clean test jar" will currently fail, due to a bug in the maven ant task
plugin used for managing dependencies.

To create Eclipse project files:

% ant eclipse

Then, from Eclipse follow the standard procedure to import an existing Java project into your Workspace.


bixo's People

Contributors

cwensel, devsprint, finack, kkrugler, mgraney, rockwalrus, schmed, sgroschupf, vmagotra


bixo's Issues

build failed due to missing dependencies

I'm trying to build Bixo from the git sources according to the instructions on the GitHub page: https://github.com/bixo/bixo

When I run 'ant clean test jar' it fails with the following output:

$ ant clean test jar
Buildfile: build.xml

clean:
     [echo] cleaning bixo-core

mvn-init:
     [echo] maven.repo.local=/home/ivanhoe/.m2/repository
[artifact:dependencies] [INFO] snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT: checking for updates from Apache Snapshots
[artifact:dependencies] [WARNING] repository metadata for: 'snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT' could not be retrieved from repository: Apache Snapshots due to an error: Error transferring file
[artifact:dependencies] [INFO] Repository 'Apache Snapshots' will be blacklisted
[artifact:dependencies] [INFO] snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT: checking for updates from Apache Releases
[artifact:dependencies] [WARNING] repository metadata for: 'snapshot org.apache.tika:tika-parsers:0.9-SNAPSHOT' could not be retrieved from repository: Apache Releases due to an error: Error transferring file
[artifact:dependencies] [INFO] Repository 'Apache Releases' will be blacklisted
[artifact:dependencies] Downloading: org/apache/tika/tika-parsers/0.9-SNAPSHOT/tika-parsers-0.9-SNAPSHOT.pom from Bixo
[artifact:dependencies] Downloading: org/apache/tika/tika-parsers/0.9-SNAPSHOT/tika-parsers-0.9-SNAPSHOT.jar from Bixo
[artifact:dependencies] An error has occurred while processing the Maven artifact tasks.
[artifact:dependencies]  Diagnosis:
[artifact:dependencies] 
[artifact:dependencies] Unable to resolve artifact: Missing:
[artifact:dependencies] ----------
[artifact:dependencies] 1) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies]   Try downloading the file manually from the project website.
[artifact:dependencies] 
[artifact:dependencies]   Then, install it using the command: 
[artifact:dependencies]       mvn install:install-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file
[artifact:dependencies] 
[artifact:dependencies]   Alternatively, if you host your own repository you can deploy the file there: 
[artifact:dependencies]       mvn deploy:deploy-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]
[artifact:dependencies] 
[artifact:dependencies]   Path to dependency: 
[artifact:dependencies]         1) bixo:bixo-core:jar:1.0-SNAPSHOT
[artifact:dependencies]         2) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies] ----------
[artifact:dependencies] 1 required artifact is missing.
[artifact:dependencies] 
[artifact:dependencies] for artifact: 
[artifact:dependencies]   bixo:bixo-core:jar:1.0-SNAPSHOT
[artifact:dependencies] 
[artifact:dependencies] from the specified remote repositories:
[artifact:dependencies]   Bixo (http://bixo.github.com/repo/),
[artifact:dependencies]   Apache Releases (https://repository.apache.org/content/repositories/releases/),
[artifact:dependencies]   central (http://repo1.maven.org/maven2),
[artifact:dependencies]   Apache Snapshots (https://repository.apache.org/content/groups/snapshots-group/)
[artifact:dependencies] 
[artifact:dependencies] 

BUILD FAILED
/usr/home/ivanhoe/work/bixo.git/bixo/build.xml:64: Unable to resolve artifact: Missing:
----------
1) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT

  Try downloading the file manually from the project website.

  Then, install it using the command: 
      mvn install:install-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file

  Alternatively, if you host your own repository you can deploy the file there: 
      mvn deploy:deploy-file -DgroupId=org.apache.tika -DartifactId=tika-parsers -Dversion=0.9-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

  Path to dependency: 
        1) bixo:bixo-core:jar:1.0-SNAPSHOT
        2) org.apache.tika:tika-parsers:jar:0.9-SNAPSHOT

----------
1 required artifact is missing.

for artifact: 
  bixo:bixo-core:jar:1.0-SNAPSHOT

from the specified remote repositories:
  Bixo (http://bixo.github.com/repo/),
  Apache Releases (https://repository.apache.org/content/repositories/releases/),
  central (http://repo1.maven.org/maven2),
  Apache Snapshots (https://repository.apache.org/content/groups/snapshots-group/)



Total time: 3 seconds

It looks like the tika-parsers artifact is missing from the Bixo repository.

Update Cascading Dependency

Bixo depends on cascading-core version 1.2.5. The most recent version is 2.1.0.
Any chance of upgrading it?

bin/bixo is working incorrectly with Hadoop 0.20.203.0

It looks like the Hadoop jar file naming scheme was changed.

A simple fix:

--- bin/bixo.fix1       2011-09-14 14:40:29.035885594 +0400
+++ bin/bixo    2011-09-14 14:41:40.798931767 +0400
@@ -97,7 +97,7 @@
 done

 # add Hadoop libs to CLASSPATH
-for f in $HADOOP_HOME/hadoop-*-core.jar; do
+for f in $HADOOP_HOME/hadoop-core-*.jar; do
   CLASSPATH=${CLASSPATH}:$f;
 done
 for f in $HADOOP_HOME/lib/*.jar; do
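Until the script is patched, a loop that tolerates both naming schemes is one way out. A minimal sketch, assuming nothing about a real install (the HADOOP_HOME fixture and jar file names below are illustrative):

```shell
# Sketch: accept both the old (hadoop-*-core.jar) and new (hadoop-core-*.jar)
# naming schemes. HADOOP_HOME is a throwaway fixture so this runs anywhere.
HADOOP_HOME=$(mktemp -d)
touch "$HADOOP_HOME/hadoop-0.20.2-core.jar"      # old scheme
touch "$HADOOP_HOME/hadoop-core-0.20.203.0.jar"  # new scheme

CLASSPATH=""
for f in "$HADOOP_HOME"/hadoop-*-core.jar "$HADOOP_HOME"/hadoop-core-*.jar; do
  # skip unexpanded globs when one of the two schemes has no match
  [ -e "$f" ] && CLASSPATH=${CLASSPATH}:$f
done
echo "$CLASSPATH"
```

The `[ -e "$f" ]` guard matters because an unmatched glob is passed through literally by the shell.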

Add Bixo page to Wikipedia

  1. Create page - http://en.wikipedia.org/wiki/Bixo_(web_crawler)
    See http://en.wikipedia.org/wiki/Heritrix for idea of format/content
    Put in http://en.wikipedia.org/wiki/Category:Free_web_crawlers category
  2. Edit Bixo page to disambiguate - http://en.wikipedia.org/w/index.php?title=Bixo&action=edit
    See http://en.wikipedia.org/wiki/Disambiguation_page#Disambiguation_pages for how-to
  3. Add to list of crawlers - http://en.wikipedia.org/wiki/Web_crawler
  4. Add to Hadoop "see also" section on Wikipedia.

Move web site to new location

  1. Get the content from bixo.101tec.com (either manually, or Marko sends us a dump)
  2. Re-post to new WordPress site, or reformat (ugh) for the GitHub wiki
  3. Fix up formatting problems, links (if absolute).

bin/bixo fails when there is more than one bixo-core.*.jar in the dist directory

I've been trying to run bin/bixo to run the SimpleTool after downloading a distribution and building it with "ant dist" - which I then installed elsewhere.

It is failing in the bixo script because there are in fact two bixo-core-*.jar files:

alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ ls
bixo-core-1.0-SNAPSHOT.jar  classes-it  classes-main-eclipse  dfs  lib
bixo-dist-1.0-SNAPSHOT  classes-it-eclipse  classes-test  it  test
bixo-dist-1.0-SNAPSHOT.tgz  classes-main  classes-test-eclipse  java-doc  test-it
alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ pwd
/home/alex/projects/bixo/bixo-bixo-da66523/build
alex@reynolds:~/projects/bixo/bixo-bixo-da66523/build$ find . -name "bixo-core-*.jar"
./bixo-core-1.0-SNAPSHOT.jar
./bixo-dist-1.0-SNAPSHOT/bixo-core-1.0-SNAPSHOT.jar

Now, maybe the distribution I've built is wrong, but the bixo script could be better prepared.

I've gotten around it for now by changing

BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar"`

to

BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar" | head -1`

but neither directory actually has any useful jars in it.


What is the current policy for fetching and using all the libraries which maven fetches? Do I have to run maven when I am just trying to run the code even if I am not compiling it?
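The head -1 workaround described above can be exercised locally. This sketch fabricates the duplicate-jar layout from the transcript (the paths are illustrative, not a real install):

```shell
# Fabricate the layout: the same bixo-core jar appears both at the top level
# and inside the unpacked dist directory.
BIXO_HOME=$(mktemp -d)
mkdir -p "$BIXO_HOME/bixo-dist-1.0-SNAPSHOT"
touch "$BIXO_HOME/bixo-core-1.0-SNAPSHOT.jar"
touch "$BIXO_HOME/bixo-dist-1.0-SNAPSHOT/bixo-core-1.0-SNAPSHOT.jar"

# Unpatched: two matches, so BIXO_CORE ends up as a two-line string.
find "$BIXO_HOME" -name "bixo-core-*.jar" | wc -l   # 2

# Patched: keep only the first match.
BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar" | head -1`
echo "$BIXO_CORE"
```

Picking the first match silently is a band-aid; a more defensive script would error out when more than one jar matches.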

The build instructions don't work out of the box

I can't get anything to build.

I've run

$ ant clean jar

in both the root and examples directories.

In the root directory I get this error:

test-it:
[mkdir] Created dir: /Users/dane/Development/recommender/bixo/build/it
[junit] Running bixo.fetcher.FetcherTest
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 122.751 sec
[junit] Running bixo.fetcher.http.SimpleHttpFetcherIntegrationTest
[junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 0.724 sec
[junit] Test bixo.fetcher.http.SimpleHttpFetcherIntegrationTest FAILED

BUILD FAILED
/Users/dane/Development/recommender/bixo/build.xml:256: Tests failed!

In the examples directory I get this error:

BUILD FAILED
/Users/dane/Development/recommender/bixo/examples/build.xml:49: Unable to resolve artifact: Missing:

1) bixo:bixo-core:jar:1.0-SNAPSHOT

Try downloading the file manually from the project website.

Then, install it using the command:
mvn install:install-file -DgroupId=bixo -DartifactId=bixo-core -Dversion=1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file

Alternatively, if you host your own repository you can deploy the file there:
mvn deploy:deploy-file -DgroupId=bixo -DartifactId=bixo-core -Dversion=1.0-SNAPSHOT -Dpackaging=jar -Dfile=/path/to/file -Durl=[url] -DrepositoryId=[id]

Path to dependency:
1) bixo:bixo-examples:jar:1.0-SNAPSHOT
2) bixo:bixo-core:jar:1.0-SNAPSHOT


1 required artifact is missing.

for artifact:
bixo:bixo-examples:jar:1.0-SNAPSHOT

from the specified remote repositories:
Bixo (http://bixo.github.com/repo/),
Conjars (http://conjars.org/repo),
central (http://repo1.maven.org/maven2)

Total time: 2 seconds

When I run

$ bin/bixodemo crawl -agentname dane -domain www.sotmclub.com -outputdir output -numloops 3

in the example directory, I get this error

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

or this error

Exception in thread "main" java.lang.NoClassDefFoundError: bixo/examples/crawl/DemoCrawlTool
Caused by: java.lang.ClassNotFoundException: bixo.examples.crawl.DemoCrawlTool
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)

Update references to Bixo web site everywhere

Once the site has moved, all of the refs need to be updated. This probably isn't a complete list:

  • Yahoo developer mailing list
  • GitHub site (this place)
  • Ohloh project info

And we should see if 101tec.com will set up a permanent redirect to the new location.

tika-parsers dependency version incorrect

The version for artifact "tika-parsers" (in pom.xml) is set to "0.8-SNAPSHOT" in the Bixo 0.5.1 dist -- this causes mvn-init to fail. This version should be "0.8".
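The fix itself is a one-line version change in pom.xml. As an illustration only (the pom fragment below is a minimal stand-in, not Bixo's actual pom):

```shell
# Minimal stand-in for the relevant pom.xml fragment.
pom=$(mktemp)
cat > "$pom" <<'EOF'
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.8-SNAPSHOT</version>
</dependency>
EOF

# Drop the -SNAPSHOT suffix from the tika-parsers version.
sed -i.bak 's|<version>0.8-SNAPSHOT</version>|<version>0.8</version>|' "$pom"
grep '<version>' "$pom"
```

The `-i.bak` form of in-place editing works with both GNU and BSD sed.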

Improve notes on how to run SimpleCrawlTool from Eclipse

Currently we don't say much beyond describing how to create and then import the project into Eclipse.

Specifically, we should talk about which class to use as the basis for creating a new Java "Run Configuration", along with the required parameters (including -Xmx256m for the JVM).

Fix up or remove SimpleStatusTool

This assumes that crawl state is maintained in loop directories, but some of it is now in the SQL database. So either the code should be fixed up, or it should be removed.

Update release procedure

  1. No longer use TeamCity - just push to the GitHub download section.
    Note that we could use the API to push it, as part of the dist build.
  2. Add step to edit release note.
  3. Add step to post to Freshmeat.

dist package not complete - error

I downloaded the latest bixo-dist-0.5.1

I am trying to run

bin/bixo crawl -agentname test -domain www.google.com -outputdir output -numloops 3

I get an error saying HADOOP_HOME is not set, so I set it to my Hadoop directory.

I then copied the Hadoop core jar from Hadoop into the bin directory, and now I get the following error:

Exception running tool: org/apache/commons/configuration/Configuration
java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration

Shouldn't the dist archive contain all the dependencies needed to run the command?

Create build target for everything in contrib directory

Currently there's only the "helpful" example. It doesn't get built via ant, just Eclipse, so it's often broken. It would be great to have a target for building all sub-projects found in the contrib/ directory, and also a test-contrib target that depends on compile-contrib. Then the dist target should depend on test-contrib, so we don't do releases if contribs don't compile or pass their tests.

Improve SimpleRobotRules parser

Currently there are a number of directives that it should ignore rather than report as warnings.

It should also handle common typos/formatting errors - check out the robot-errors.zip file in the Downloads directory, which contains about 24K files that had processing problems.

Examples of formatting errors include:

  • Missing the ':' after the directive (user-agent * instead of user-agent: *)
  • Adding a space before the ':' separator (user-agent : * instead of user-agent: *)

Fix up openbixo.org, once it's been moved to BlueHost

  • Get rid of Pages widget, once sub-dirs have explicit links (current issue for Matt)
  • Use custom CSS to tune appearance
  • Set appropriate complementary colors for links and such. Could check out Kuler for ideas
  • Decide about sticking w/green color theme
  • Set up for using Blog as News page

Pick new domain for hosting Bixo web site

bixo.101tec.com is going away. We've got three options (at least):

  • Use bixo.bixolabs.com. But being so tightly associated with a company (Bixolabs) might not be a good thing. And editing would require access to the bixolabs.com site.
  • Use something like bixo-project.org. $40/year for domain name, mapping to wordpress.com, and custom CSS (if we do it via wordpress.com). Or $20/mo for Slicehost.com.
  • Don't worry about it, and just use the wiki here to provide documentation. Free, but wiki is much more limited in formatting and such.

Some input on each of the above would be great.

Modify SimpleParser to allow a ParseContext to be set

With the current design of SimpleParser (and TikaCallable) there is no way to set a different HtmlMapper than the one TikaCallable uses by default. So, if a caller wants to use the IdentityHtmlMapper instead there is no easy way to do that - one has to duplicate the code (SimpleParser and TikaCallable).
It would be nice to have a constructor in SimpleParser that allowed a caller to pass in the ParseContext. TikaCallable would also need to change to allow passing in a ParseContext. If the ParseContext is null, TikaCallable can make the default ParseContext (as it is doing now), otherwise it should just pass along what it received to Tika's parse function.

Update copyright of all files

New copyright holder is Bixo Labs.

Need standard template used for all header files.

Some files are originally from the Nutch project, so they need special info in the header.

Need to work w/Stefan on copyright for files currently assigned to 101tec.

This should wait until we decide whether to change the license to Apache (it's currently MIT).

Improve Bixo web site for findability

  • It would be great to have Google Analytics installed - free if we use wordpress.com
    Though some additional work helps with using Google/Yahoo/Bing webmaster tools (meta tags)
  • Change title to be "Open Source Web Mining Toolkit | Bixo"

Error: JAVA_HOME is not set

Building Bixo from the instructions at http://openbixo.org/documentation/building-bixo/, and then testing the build with "bin/bixo crawl ...." from http://openbixo.org/documentation/getting-started/, throws the error that JAVA_HOME is not set.

However, "echo $JAVA_HOME" works fine, which in my case is /opt/java, and "which java" returns /opt/java/bin/java, which tallies.
I'm using Arch Linux.

Wondering if there's any other location/configuration where I need to set JAVA_HOME?

bin/bixo is working incorrectly with symlinked BIXO_HOME

easy fix:

--- bin/bixo.orig   2011-09-14 14:11:02.837699246 +0400
+++ bin/bixo        2011-09-14 14:11:35.967722777 +0400
@@ -81,7 +81,7 @@
 IFS=

-BIXO_CORE=`find "$BIXO_HOME" -name "bixo-core-*.jar"`
+BIXO_CORE=`find "$BIXO_HOME/" -name "bixo-core-*.jar"`
 if [ -z BIXO_CORE ]; then
        echo "Unable to find the bixo-core jar"
        exit 1
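The behavior behind this fix is easy to reproduce outside of bin/bixo: by default, find does not follow a symbolic link given as its starting point, but a trailing "/" forces the link to be resolved as a directory. The layout below is a throwaway fixture:

```shell
# Fixture: a jar inside a real directory, reached through a symlink.
tmp=$(mktemp -d)
mkdir "$tmp/real"
touch "$tmp/real/bixo-core-1.0-SNAPSHOT.jar"
ln -s "$tmp/real" "$tmp/link"

without_slash=`find "$tmp/link" -name "bixo-core-*.jar"`
with_slash=`find "$tmp/link/" -name "bixo-core-*.jar"`
echo "without slash: '$without_slash'"   # empty - the jar is not found
echo "with slash:    '$with_slash'"
```

Passing -H (or -L) to find would be an alternative to the trailing slash.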

Create Cascading workflow to convert fetched content into Avro files

This would take as input a crawl directory, and extract the FetchedDatum tuples from the /content subdir, then write out the results as Avro files.

There would need to be control for filtering content. For example, with the PTD project we'd want to exclude any FetchedDatum that has the no archive metadata flag set, or that has a language metadata tag set to anything other than English, or (maybe) uses a charset other than one of the standard English ones (us-ascii, UTF-8, ISO-8859-1, cp1252).

There may be the need to control the number of reducers, to get the desired number of resulting part-XXXXX output files. Each Avro file shouldn't be too big (e.g. 100MB max, for the PTD project).

For the PTD project, there would also be a final processing of the Avro files that might need to be bash scripted:

  1. Use Hadoop to copy from HDFS to local storage on master
  2. Compress as xxx.gz
  3. Use Hadoop to push to S3
  4. Delete from local storage
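The four steps above can be sketched as a shell loop. The hadoop CLI invocations, HDFS paths, and S3 bucket name are assumptions about the target cluster; hadoop is stubbed here so the control flow can run anywhere:

```shell
# Sketch of steps 1-4; "hadoop" is stubbed so this runs without a cluster.
hadoop() { echo "[stub] hadoop $*"; }

work=$(mktemp -d)
# Pretend step 1 has already landed two Avro part files locally.
printf 'avro' > "$work/part-00000"
printf 'avro' > "$work/part-00001"

for part in "$work"/part-*; do
  hadoop fs -get "/crawl/avro/$(basename "$part")" "$part"            # 1. HDFS -> local
  gzip -f "$part"                                                     # 2. compress
  hadoop fs -put "$part.gz" "s3n://ptd-bucket/$(basename "$part").gz" # 3. push to S3
  rm "$part.gz"                                                       # 4. delete locally
done
ls "$work"   # empty - everything compressed, pushed, and removed
```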

Set up Bixo webinar

Several steps to this:

  • Generate list of names (for people not on lists)
    • Kelvin Tan
    • Schmed
    • Rob Wunderlich
    • Stack?
    • Erich Nachbar
    • Who else on crawler-commons project?
    • Invite people on Bixo-dev list (Fuad?)
  • Publicize it
    • Crawler-commons mailing list
    • Bixo-dev mailing list
    • ken-blog
    • Twitter
    • LinkedIn Status
    • ACM data mining SIG
  • What system to use to present it
    • GoToMeeting?
    • Needs to be something free, that works well w/Mac & Windows
  • Pick date/time that works for US & Europe
  • Create outline of material (something for Ken)
  • Make sure we can record it, for posting to Bixo project site.

DomainNames should handle the German country code (and those like it) more carefully

Currently, the code in DomainNames that tries to find the "paid level domain" of the input domain treats ae.com, gq.com, and hm.com as if they were domains within the Arab Emirates, Equatorial Guinea, and Heard & McDonald Islands (respectively). These are instead domains within the United States.

The problem stems from the (over)use of DomainNames.CC_ALWAYS_TLDS. The purpose of this set of country codes is to list those that should be treated as top-level domains after a global top-level domain prefix is added. For example, com.de is considered a paid level domain, because de is a member of CC_ALWAYS_TLDS, whereas com.jp is itself a top-level domain, making honda.com.jp a paid-level domain.

Unfortunately, this set is also (inappropriately) being used to list those country codes that should be treated as top-level domains after a global top-level domain suffix is added. For example, de.com is a German top-level domain (see http://www.de.com/ for details), whereas (as mentioned before) ae.com, gq.com, and hm.com are paid level domains in the United States.

We need to add a new set of country codes that includes de which DomainNames.getPld can check to determine what to do with input domains like de.com, ae.com, gq.com and hm.com. I'm not exactly sure how to determine what other country codes besides de should be put into the new set. Unfortunately, the source document I used to build the constants in DomainNames.java (https://github.com/bixo/bixo/blob/master/doc/effective_tld_names.dat) doesn't specify which CCs should be treated this way.

Ken suggests trying out the equivalent domain processing support that currently exists in crawler-commons.

build/lib is not created by dist target, but is read from

If you try an "ant dist" from the tarball, it reads from the ./build/lib directory. However, nothing in "ant dist" creates that directory, so ant fails.

I am guessing that the directory does exist if you have previously done an "ant job".

I'm not sure whether the problem is simply that the directory needs to exist, or whether there are some libraries inside which also need to exist.

(Tested on the 0.4.8 tarball)

Add support for reading/writing WARC files

I think this means defining a new Cascading Scheme that knows about the WARC file format.

I've uploaded an example WARC file here.

I've added a WarcOutputFormat class (just a start) to Bixo that I think would be what's needed on the Hadoop side of things, based on the Cascading TextLine class. This uses the Heritrix WARCWriter class to handle writing out records.

Side note - the ClueWeb09 project generated some older-format ARC files that were invalid, apparently due to newlines in header fields. This might be something to look out for, when generating WARC files.

Get rid of release branch

We should put releases into the Downloads area. Getting rid of the release branch should save a bunch of space, since I think each of these has a release/xxx tarball file that's different.
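Assuming the remote is named origin (an assumption about the real repo), the mechanics are straightforward; demonstrated here against a throwaway local repository:

```shell
# Throwaway repository standing in for the real one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m "init"
git branch release

git branch -d release              # delete the local branch
# git push origin --delete release # ...and then the one on GitHub
git branch --list release          # prints nothing
```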

Measure performance during large-scale crawl using minimal HttpClient and lockless connection manager

Oleg had said:

You might want to have a look at the latest code in SVN trunk (to be
released as 4.3). Several classes such as the scheme registry that
previously had to be synchronized in order to ensure thread safety have
been replaced with immutable equivalents. There is also now a way to
create HttpClient in a minimal configuration without authentication,
state management (cookies), proxy support and other non-essential
functions.

The new API is not yet final and not properly documented. Presently this
can be done with HttpClients#createMinimal

He also said:

I experimented with the idea of lock-less (unlimited) connection manager

This was in response to an issue I'd run into, where the single global lock on the connection pool was causing a lot of contention when many (hundreds) of threads were all fetching at the same time. He's provided source code, which unfortunately I can't attach here - but it's on my disk.

Add support for boilerpipe cleanup of HTML

See http://code.google.com/p/boilerpipe/ - code is under Apache 2.0 license.

Currently relies on NekoHTML to do parsing, but we could modify it (the DefaultHTMLParser class) to just be a ContentHandler that you can optionally install during the parse process, to extract the content (similar to what we do now for language detection).

And we should ping Christian Kohlschütter (the author) when it's done so he can check it out - [email protected]

small fixes

Hi,

I've made a few fixes, patch is below:

From aa3cc26997aec1422472918ba0d0a32c5dfb261a Mon Sep 17 00:00:00 2001
From: ifp5 <>
Date: Mon, 15 Nov 2010 10:42:55 +0000
Subject: [PATCH] small fixes:
 - changed shell path to /bin/sh as /bin/bash doesn't exists in many OSes (e.g. FreeBSD, Solaris)
 - install bixo-core to maven repo before building contrib sources 
 - fixed contrib sources compilation (building contrib required to create distribution and it may fail otherwise)

---
 bin/bixo                                           |    2 +-
 build.xml                                          |    1 +
 .../com/transpac/helpful/tools/AnalyzeEmail.java   |    5 +++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/bin/bixo b/bin/bixo
index d11f493..9a707b7 100755
--- a/bin/bixo
+++ b/bin/bixo
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/bin/sh
 #
 # The Bixo command script
 #
diff --git a/build.xml b/build.xml
index 52ca926..e24f827 100644
--- a/build.xml
+++ b/build.xml
@@ -555,6 +555,7 @@ SOFTWARE.
        [the XML added by this hunk was stripped when the patch was posted]
diff --git a/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java b/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
index e4335e3..061a9e2 100644
--- a/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
+++ b/contrib/helpful/src/main/java/com/transpac/helpful/tools/AnalyzeEmail.java
@@ -61,6 +61,7 @@ public class AnalyzeEmail {
        private static final int MAX_CONTENT_SIZE = 8 * 1024 * 1024;

        private static final int MAX_THREADS = 1;
+       private static final int NUM_REDUCERS = 1;

        private static final String MBOX_PAGE_STATUS_PIPE_NAME = "mbox page fetch status pipe";
        private static final String SPLITTER_PIPE_NAME = "Split emails pipe";
@@ -175,7 +176,7 @@ public class AnalyzeEmail {
             BaseScoreGenerator scorer = new FixedScoreGenerator();

             BaseFetcher fetcher = new SimpleHttpFetcher(MAX_THREADS, userAgent);
-            FetchPipe fetchPagePipe = new FetchPipe(importPipe, scorer, fetcher);
+            FetchPipe fetchPagePipe = new FetchPipe(importPipe, scorer, fetcher, NUM_REDUCERS);

             // Here's the pipe that will output UrlDatum tuples, by extracting URLs from the mod_mbox-generated page.
                Pipe mboxPagePipe = new Each(fetchPagePipe.getContentTailPipe(), new ParseModMboxPageFunction(), Fields.RESULTS);
@@ -190,7 +191,7 @@ public class AnalyzeEmail {
             fetcher = new SimpleHttpFetcher(MAX_THREADS, defaultPolicy, userAgent);

             // We can create the fetch pipe, and set up our Mbox splitter to run on content.
-            FetchPipe fetchMboxPipe = new FetchPipe(mboxPagePipe, scorer, fetcher);
+            FetchPipe fetchMboxPipe = new FetchPipe(mboxPagePipe, scorer, fetcher, NUM_REDUCERS);
             SplitEmails splitterPipe = new SplitEmails(fetchMboxPipe);

             // Now create the pipe that's going to analyze the emails we get after splitting them up.
-- 
1.7.3.1

Create svg versions of the graffle diagrams

Please can you export the graffle diagrams from the doc section of Bixo's GitHub repo and save them as SVG as well as graffle?

I appreciate that you may not want to store "compiled" things in source control, but I've tried a free graffle2svg converter without success.

Update version of commons-io

Bixo is using version 1.4 of commons-io, while the latest release available on Maven Central is 2.0.1. We should probably update the version that Bixo is using.
