wri / forma-clj

The Forest Monitoring for Action (FORMA) project provides forest clearing alerts derived from MODIS satellite imagery every 16 days beginning in December 2005. FORMA is a project of World Resources Institute and was developed by the Center for Global Development.

License: Eclipse Public License 1.0

Thrift 0.40% Shell 2.63% Clojure 95.64% Java 1.33%

gfw forma

forma-clj's Introduction

What is FORMA?

The Forest Monitoring for Action (FORMA) project provides free and open forest clearing alert data derived from MODIS satellite imagery every 16 days beginning in 2006. FORMA is a project of World Resources Institute and was developed by the Center for Global Development.

Global Forest Watch

FORMA data can be seen in action through the Global Forest Watch initiative, a collaboration between the World Resources Institute, Center for Global Development, Vizzuality, Google, University of Maryland, and Imazon. It's a project that brings transparency to forest conservation through early-warning alerting and long-term monitoring systems.

About the software

The FORMA software is written in the Clojure programming language and rides on Cascading and Cascalog for processing "Big Data" on top of Hadoop using MapReduce.

Contributors

Dan Hammer @danhammer
Robin Kraft @robinkraft
Sam Ritchie @sritchie
Aaron Steele @eightysteele
Dave Petrovics @dpetrovics

forma-clj's Issues

hdf/modis-chunks shouldn't call thrift/pack

The hdf/modis-chunks in pull request #84 is calling thrift/pack, and it shouldn't be.

lein plugins (sporadically) cause compilation errors

Currently, they are commented out in project.clj. This is a long-standing issue, now opened on GH.

DataChunk* should handle nil dates coming in from Cascalog

Currently, as seen in issue #71, the !date field cannot be nullable if we're doing thrift packing within a query. Yet the precondition seems to handle dates correctly. The workaround we're using in #141 simply removes the !date field completely from the query, but we should get a better handle on why we can't have a nullable date field.

This query illustrates the problem:

(??- (let [src [["ndvi" (thrift/ModisPixelLocation* "500" 28 8 0 0) 1 "16" nil]]]
       (<- [?dc]
           (src ?name ?loc ?val ?t-res !date)
           (thrift/DataChunk* ?name ?loc ?val ?t-res !date :> ?dc))))

Caused by: java.lang.AssertionError: Assert failed: (or (not date) (string? (first date)))
    at forma.thrift$DataChunk_STAR_.doInvoke(thrift.clj:289)
    at clojure.lang.RestFn.invoke(RestFn.java:497)
    at clojure.lang.Var.invoke(Var.java:431)
    at clojure.lang.AFn.applyToHelper(AFn.java:178)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
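
One reading of this trace, offered as a sketch rather than a confirmed diagnosis: the RestFn frames suggest that DataChunk* receives the date as a rest argument. If that is the case, a nil passed explicitly from Cascalog arrives as the one-element list (nil), which is truthy, and (string? nil) is false, so the precondition fails even though an omitted date passes. The functions below are hypothetical stand-ins with an assumed argument list, not the code in forma.thrift.

```clojure
;; Hypothetical stand-in for thrift/DataChunk*: only the date handling is
;; modeled, and the argument list is an assumption.
(defn chunk-sketch
  [dataset val & date]
  {:pre [(or (not date) (string? (first date)))]}
  {:dataset dataset :val val :date (first date)})

(chunk-sketch "ndvi" 1)              ; date is nil, precondition passes
(chunk-sketch "ndvi" 1 "2006-01-01") ; date is ("2006-01-01"), passes

;; An explicit nil is wrapped as (nil), so the same precondition fails.
(def explicit-nil-result
  (try
    (chunk-sketch "ndvi" 1 nil)
    (catch AssertionError _ :assertion-failed)))

;; One possible nil-tolerant variant: destructure the rest argument first,
;; then test the date itself instead of the wrapping list.
(defn chunk-sketch-nil-safe
  [dataset val & [date]]
  {:pre [(or (nil? date) (string? date))]}
  {:dataset dataset :val val :date date})
```

With the second variant, an omitted date and an explicit nil are treated the same way, which is the behavior a nullable !date field would need.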

Playground full of Thrift memory taps

Might be useful to have a playground.clj full of Thrift memory taps for testing and playing at the REPL.

removing forma.matrix.jblas?

What is this namespace used for? Can we eliminate it? The functions don't seem to be used much -- if at all -- in the rest of the code base.

gadmiso not using resources correctly

At the repl on a cluster (after uberjaring), (use 'forma.hadoop.jobs.cdm) raises this exception:

CompilerException java.io.FileNotFoundException: /home/hadoop/resources/admin-map.csv (No such file or directory), compiling:(gadmiso.clj:12)

Looks like this pull request may not have been quite ready for prime time.

mk-data-value having trouble with forma.schema.IntArray

Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: No method in multimethod 'mk-data-value' for dispatch value: class forma.schema.IntArray
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
    ... 131 more
Caused by: java.lang.IllegalArgumentException: No method in multimethod 'mk-data-value' for dispatch value: class forma.schema.IntArray
    at clojure.lang.MultiFn.getFn(MultiFn.java:121)
    at clojure.lang.MultiFn.invoke(MultiFn.java:163)
    at forma.thrift$DataChunk_STAR_.doInvoke(thrift.clj:283)
    at clojure.lang.RestFn.invoke(RestFn.java:497)
    at clojure.lang.Var.invoke(Var.java:431)
    at clojure.lang.AFn.applyToHelper(AFn.java:178)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)

replace mys-*.csv test data with local IDN test data

Only called within logistic_test.clj namespace, starting at this function. Too many extra data sets floating around.

possible rain-ndvi TS length mismatch

TL;DR: the rain ts is likely to be one value shorter than the NDVI ts. We can't ignore this, so we should just duplicate the previous value for the current month if there's no data. The implication is that we would also have to rerun FORMA for t and t-1.

This may be mostly a non-issue, but a rain ts and a 16-day MODIS ts will probably have different lengths. Dan says rain isn't hugely important, but his functions do assume the timeseries are the same length. But if there's no data yet for the most recent month, even ts-expander can't handle that.

For example, if we're running FORMA through the Jan. 17 period (which spills into February to include 2/1), the rain value for February won't be available until March. So there won't be an expanded ts until sometime in early March, when the rain file will be updated. So the rain ts will be one element shorter than the ndvi.

So we could either not estimate for the most recent period, or we could just duplicate the previous value - we're only looking for broad trends of rainfall to filter out droughts and the like.

The (possibly) complicating factor is that if we just duplicate the previous value, eventually that duplicated value will be duplicated again as needed to make the ts the right length. This would give us a timeseries with the same value at the end, staying the same indefinitely. Eventually rain would become useless.

So, it may be that each FORMA run will also have to be for the now-2nd most recent month, so that once we have good data for the 2nd most recent month, we can update the probabilities for that period.
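
The duplication idea discussed above can be sketched as follows. This is a minimal illustration in plain Clojure; pad-with-last is a hypothetical helper name, not a function in the FORMA code base, and the numbers are made up.

```clojure
(defn pad-with-last
  "Extend ts to length n by repeating its final value. Returns ts as a
  vector, unchanged, when it is already long enough or is empty."
  [ts n]
  (let [v (vec ts)
        missing (- n (count v))]
    (if (and (pos? missing) (seq v))
      (into v (repeat missing (peek v)))
      v)))

;; A rain series one element shorter than a 4-period NDVI series.
(pad-with-last [10 12 8] 4) ;; => [10 12 8 8]
```

Padding an already padded series again repeats the same value, which is the flat-tail problem described above. That is why rerunning the second most recent period once real data arrives matters.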

gadmiso namespace throwing strange errors with large map

Sourcing from a CSV file won't work on the cluster; but storing the map within the project (371KB, ~39,000 entries) screws up during compilation.

Replace the forma-val vector with a FormaValue. object

In certain parts of the code, a forma-val is not a thrift object, but instead a PersistentVector, e.g., when the forma-val is called by unpack-feature-vec. (Note that you cannot destructure a thrift object.) This issue amounts to changing this line by calling FormaValue* on the values, rather than returning a vector -- and then following the change all the way through unpack-feature-vec in the forma.ops.classify namespace.

modis preprocessing fails due to thrift packing issue

hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.modis "s3n://modisfiles/MOD13A1/" "s3n://pailbucket/blzmasterpail/" "{2000,2001,2002,2003,2004}*" :BLZ

Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: No implementation of method: :pack of protocol: #'forma.thrift/IPackable found for class: forma.schema.IntArray
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
    ... 97 more
Caused by: java.lang.IllegalArgumentException: No implementation of method: :pack of protocol: #'forma.thrift/IPackable found for class: forma.schema.IntArray
    at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:527)
    at forma.thrift$eval8067$fn__8068$G__8058__8073.invoke(thrift.clj:104)
    at clojure.lang.Var.invoke(Var.java:415)
    at clojure.lang.AFn.applyToHelper(AFn.java:161)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    ... 99 more

problem with joins using data from a pail

Using the local pail dataset we've concocted, we're having trouble joining static datasets a la static-src.

Extracting values from the thrift objects in the pail works fine, but the join fails completely.

Sample data and code here: https://gist.github.com/3035928

Represent t-res and s-res as Thrift enum

Throughout the code, we're using string values to represent temporal and spatial resolutions. As @sritchie pointed out, we do this because they're strings in the schema, and because they're not really numeric values -- we're never going to perform any operations on them -- they're really just identifiers. We could represent these values as keywords, but you can't put a keyword in a Thrift object. The solution is to have validation on the values. The right way to do this is to use a Thrift enum:

enum Source {
  SRES_500 = 1,
  TRES_16 = 2
}

Background on this issue: #33
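
Until an enum exists in the schema, the same validation can be approximated on the Clojure side. The sketch below is only an illustration: the sets of allowed identifiers are assumptions (only "500" and "16" appear in these issues), and check-res is a hypothetical name.

```clojure
;; Assumed identifier sets. Only "500" (spatial) and "16" (temporal) are
;; taken from the issues; the other members are placeholders.
(def valid-s-res #{"250" "500" "1000"})
(def valid-t-res #{"16" "32"})

(defn check-res
  "Return [s-res t-res] if both are known identifiers; otherwise the
  precondition throws an AssertionError."
  [s-res t-res]
  {:pre [(contains? valid-s-res s-res)
         (contains? valid-t-res t-res)]}
  [s-res t-res])

(check-res "500" "16") ;; => ["500" "16"]
```

Passing the arguments in the wrong order, such as (check-res "16" "500"), fails the precondition, which is the kind of mistake plain strings otherwise let through.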

cascalog-lzo - output doesn't appear on S3

A query with hfs-lzo-textline runs to completion, but despite writing files to the _temporary directory on S3, nothing actually ends up in the output location. That is, the temporary, staging part files never get moved out of staging into the main directory.

A simple query last night seemed to work, so I'm at a loss for why this isn't working today. Here's an example:

(use 'cascalog.lzo)
(use 'cascalog.api)

;; this copies data from one location to another, applying LZO compression:
(?- (hfs-lzo-textline "s3n://formaresults/test/LZOnew")
      (hfs-seqfile "s3n://formaresults/test/cleannilseries"))

;; this does the same thing but without compression:

(?- (hfs-seqfile "s3n://formaresults/test/LZOnew")
      (hfs-seqfile "s3n://formaresults/test/cleannilseries"))

Thrift API unpack function

We've surfaced a new Thrift API in feature/thrift-api which includes an unpack function for unpacking any Thrift object. There are a few concerns about this.

Basically Thrift objects don't really have ordering by default, so what should you get back when unpacking? Consider an IDL struct like this:

struct TimeValue {
  1: i32 timestamp
  2: optional string title
  5: i64 value
}

We need to make sure unpack responds consistently for populated and un-populated fields. Also instead of hardcoding unpack implementations, we should reflect on the class to get field names. It should "just work" for any Thrift object.
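
One way to make the result consistent is to order values by field ID and return nil for unset fields. The sketch below models the TimeValue struct as a plain Clojure map so it can run without Thrift. The names time-value-fields and unpack-sketch are hypothetical, and a real implementation would derive the field table from the generated class instead of hardcoding it.

```clojure
;; Hypothetical field table for the TimeValue struct, keyed by the numeric
;; field IDs from the IDL.
(def time-value-fields
  {1 :timestamp
   2 :title
   5 :value})

(defn unpack-sketch
  "Return field values in ascending field-ID order, using nil for any
  field that is not set on obj (modeled here as a map)."
  [fields obj]
  (mapv #(get obj (fields %)) (sort (keys fields))))

;; The optional title is unset, so its slot is nil rather than missing.
(unpack-sketch time-value-fields {:timestamp 1325376000 :value 42})
;; => [1325376000 nil 42]
```

Because the output always has one slot per declared field, callers can destructure it positionally whether or not optional fields are populated.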

classify/logistic-beta-wrap

In classify/logistic-beta-wrap replace the maps with a list comprehension and just generally simplify this function.

problem with GDAL build

In order to run the HDF tests locally, I need to build GDAL. But I get an error in the following step:

Almost there! Just make it:
$ cd gdal-1.9.1/swig/java/
$ make

I get the following error when calling make:

dan@hammer-statistic:~/gdal-1.9.1/swig/java$ emacs -nw java.opt 
dan@hammer-statistic:~/gdal-1.9.1/swig/java$ make
mkdir -p org/gdal/gdal
mkdir -p org/gdal/gdalconst
mkdir -p org/gdal/ogr
mkdir -p org/gdal/osr
swig -Wall -I../include -I../include/java -I../include/java/docs -outdir "org/gdal/gdal" -package "org.gdal.gdal"  -I/home/dan/gdal-1.9.1 -c++ -java -o gdal_wrap.cpp ../include/gdal.i
/bin/bash: swig: command not found
make: *** [gdal_wrap.cpp] Error 127

retrieve commit history from feature/deliver

We have momentum! The project is working! And, unfortunately, we are late on a number of features for WRI. We've got to move quickly. If a lost commit history is the only casualty in this massive, messy merge process, then we are actually in good shape. It does suck, however, to lose any commit history. I am submitting this as an Issue to be solved later. We should also wait and see a little bit, since I am about to send a series of pull requests to accommodate incremental updates to the FORMA data that will greatly impact the forma.clj and scatter.clj namespaces.

delete static-modis-chunks, it's never used.

static-modis-chunks

remove fire-series function from project

Replace this function with thrift/TimeSeries*. Waiting for tests to be written for forma.hadoop.jobs.timeseries before we can do this.

takes ages to clone repo - we need to prune branches (and rebase?)

It's taking several minutes to clone the repo on the cluster, probably because it now takes up 220mb. We certainly need to prune some branches, but if that isn't sufficient I propose that we rebase develop to squash the commits that added all those sequence files.

mucked up field names in fix for #151

That's what I get for trying to get fancy with rebase, etc. I've got an easy fix coming.

Exception in thread "main" java.lang.RuntimeException: Could not apply all operations [{:type :operation, :id "c2e5763e-c593-4ebf-8cae-225e379e5a87", :assembly #<workflow$compose_straight_assemblies$fn__4411 cascalog.workflow$compose_straight_assemblies$fn__4411@2008a5d2>, :infields ("?val"), :outfields ("!__gen80"), :allow-on-genfilter? false} {:type :operation, :id "19315025-571b-4d2a-9e25-20e5354b7a7d", :assembly #<workflow$compose_straight_assemblies$fn__4411 cascalog.workflow$compose_straight_assemblies$fn__4411@17be8e6c>, :infields ("?val"), :outfields ("?val"), :allow-on-genfilter? false}]
        at jackknife.core$throw_runtime.doInvoke(core.clj:104)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at cascalog.rules$build_query.invoke(rules.clj:564)
        at cascalog.rules$build_rule.invoke(rules.clj:647)
        at forma.source.rain$resample_rain.doInvoke(rain.clj:189)
        at clojure.lang.RestFn.invoke(RestFn.java:470)
        at forma.source.rain$rain_chunks.invoke(rain.clj:197)
        at forma.hadoop.jobs.preprocess$rain_chunker.invoke(preprocess.clj:21)
        at forma.hadoop.jobs.preprocess$PreprocessRain_main.doInvoke(preprocess.clj:31)
        at clojure.lang.RestFn.applyTo(RestFn.java:146)
        at forma.hadoop.jobs.preprocess.PreprocessRain.main(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

mk-data-value error in static processing - we've seen this before ...

This should be a quick fix. We're packing data using thrift one too many times.

Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: No method in multimethod 'mk-data-value' for dispatch value: class forma.schema.IntArray
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:71)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
    ... 112 more
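
The failure mode described here, packing a value that has already been packed, can be reproduced in miniature with a class-dispatched multimethod. Everything below is a hypothetical stand-in (DataValueSketch and mk-data-value-sketch are not the FORMA definitions); it only shows why a second pass produces a "No method in multimethod" error.

```clojure
;; Stand-in for a packed Thrift value.
(defrecord DataValueSketch [payload])

;; Dispatch on the class of the raw input, as the error message suggests
;; mk-data-value does.
(defmulti mk-data-value-sketch class)

(defmethod mk-data-value-sketch Long
  [x]
  (->DataValueSketch x))

(defmethod mk-data-value-sketch clojure.lang.PersistentVector
  [xs]
  (->DataValueSketch xs))

;; Packing raw data once works.
(def packed-once (mk-data-value-sketch [1 2 3]))

;; Packing the result again has no matching method, so the multimethod
;; throws IllegalArgumentException.
(def packed-twice
  (try
    (mk-data-value-sketch packed-once)
    (catch IllegalArgumentException e (.getMessage e))))
```

Here packed-twice holds the exception message instead of a value, mirroring the stack trace above, where the dispatch value is an already constructed forma.schema.IntArray.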

should undo workaround for blossom-chunk and fix

There are some hoops to jump through to get around an issue with blossom-chunk. Unfortunately I don't have a stacktrace for this. But it had something to do with how the old-school version worked as a tap in other queries.

I've submitted this issue because blossom-chunk was a really useful predicate macro when it worked. It would be nice to have it back.

For reference:

See mention in pull request.

Original blossom-chunk

Current workaround

A previous version before workaround for pail tap issue

Usage prior to workaround

humid vs. dry tropics

At one time we used the boundary of the humid tropical biome to screen out pixels. Currently I only see screening by VCF value. So I have to assume that we are hitting some of the non-humid tropical forests in the countries we're looking at. I don't want to add yet another step to the workflow just yet, but since we're proposing to focus just on the humid tropics, for the paper at least we might want to consider postprocessing that would eliminate pixels with ecoids that aren't subsets of the humid tropical biome.

We should at least look at the difference in accuracy inside and outside the biome to see whether accuracy assessment of the paper will be skewed by this.

rain preprocessing failing due to thrift packing

Still investigating ... here's the stacktrace:

Caused by: java.lang.AssertionError: Assert failed: (DataValue? val)
    at forma.thrift$DataChunk_STAR_.doInvoke(thrift.clj:273)
    at clojure.lang.RestFn.invoke(RestFn.java:470)
    at clojure.lang.Var.invoke(Var.java:427)
    at clojure.lang.AFn.applyToHelper(AFn.java:172)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)

DataChunk* precondition failing because of malformed(?) date

This happens on feature/deliver when running preprocessing:

Caused by: java.lang.AssertionError: Assert failed: (or (not date) (string? (first date)))
    at forma.thrift$DataChunk_STAR_.doInvoke(thrift.clj:278)
    at clojure.lang.RestFn.invoke(RestFn.java:497)
    at clojure.lang.Var.invoke(Var.java:431)
    at clojure.lang.AFn.applyToHelper(AFn.java:178)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)
    ... 112 more

Thrift is getting called in predicate/chunkfier:

https://github.com/reddmetrics/forma-clj/blob/feature/deliver/src/clj/forma/hadoop/predicate.clj#L157

I'm running this command:

hadoop jar forma-0.2.0-SNAPSHOT-standalone.jar forma.hadoop.jobs.preprocess.PreprocessStatic "gadm" "/user/hadoop/gadm.txt" "s3n://pailbucket/cmr/" "500" :CMR

update fire filters to >= 330 kelvin and >= 50 confidence

Currently fires are filtered by > 330 K and/or > 50 confidence. These should be >=.

revisit multimethods vs protocols in hoptree namespace

See discussion of multimethods and protocols here in the hoptree namespace for calculating pixel neighbors, particularly Sam's comment and Dan's reply.

fire counts should be by day

The fires data are combined from Terra and Aqua. For a given pixel and fire, the fire could be included twice or more in the same day if both sensors picked it up. Indeed, in some places the same sensor might pass over twice - once during the day, once at night. Additional detections don't necessarily represent additional fires, so it seems best to have booleans for the different fire filter rules for a given day.

Corner case: a fire is picked up by Terra with 335k and 35 confidence. It's picked up by Aqua with 329k and 54 confidence. So we need to decide if we should count ANY detection above the thresholds as a detection for that day, or whether we only count a fire day if there is a fire that meets the threshold of interest. In the above case, for one fire we'd get a hit on the >=330k filter, and one for the >= 50 confidence filter, but no hit for the >=330k AND >= 50 conf case.

I vote for having a boolean be flipped to true for a given day as each criterion is met, even if by "different" fire detections. But I don't have a theoretical basis for thinking that, it just feels more right.
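
The corner case above can be worked through in code. The sketch below assumes each detection is a map with :kelvin and :conf keys (a made-up representation) and computes both policies: per-criterion flags that any detection may satisfy, and a stricter flag that requires a single detection to meet both thresholds.

```clojure
(defn day-flags
  "Summarize one pixel-day of fire detections as booleans."
  [detections]
  {:temp-330? (boolean (some #(>= (:kelvin %) 330) detections))
   :conf-50?  (boolean (some #(>= (:conf %) 50) detections))
   ;; true only if one detection meets both thresholds at once
   :both?     (boolean (some #(and (>= (:kelvin %) 330)
                                   (>= (:conf %) 50))
                             detections))})

;; Terra: 335 K at 35 confidence. Aqua: 329 K at 54 confidence.
(day-flags [{:kelvin 335 :conf 35}
            {:kelvin 329 :conf 54}])
;; => {:temp-330? true, :conf-50? true, :both? false}
```

Under the "flip each boolean independently" proposal, a combined flag would be (and temp-330? conf-50?), which is true for this day, whereas :both? stays false. The two policies differ exactly on days like this one.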

modis preprocessing fails due to precondition in ModisChunkLocation

java.lang.AssertionError: Assert failed: (every? (fn* [p1__8241#] (instance? java.lang.Long p1__8241#)) [h v id size])
    at forma.thrift$ModisChunkLocation_STAR_.invoke(thrift.clj:231)
    at clojure.lang.Var.invoke(Var.java:431)
    at clojure.lang.AFn.applyToHelper(AFn.java:178)
    at clojure.lang.Var.applyTo(Var.java:532)
    at cascalog.ClojureCascadingBase.applyFunction(ClojureCascadingBase.java:68)
    at cascalog.ClojureMap.operate(ClojureMap.java:34)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:86)

test forma.clj using small-sample data set

This will help with many of the other issues listed; as big things change, this set of tests will help identify whether anything screwed up at a relatively high level.

issue in stretch-test namespace

Something was noted here. Has it been resolved?

Fix hansen-latlon->cdm test in forma.hadoop.jobs.cdm-test

Related to #148, the hansen-latlon->cdm throws a Cascalog exception when running lein midje.

should we be able to thrift/pack nil values?

Using thrift/pack to pack a nil value raises the exception below. nil also causes an assertion error when it is the data value when calling DataChunk*, as described in #151.

Are nil values valid as datavalues in our thrift schema? If so, shouldn't thrift/pack handle them correctly? If not, there shouldn't be nil values floating around in the rain processing workflow as in #151.

(pack nil)

No implementation of method: :pack of protocol: #'forma.thrift/IPackable found for class: nil
  [Thrown class java.lang.IllegalArgumentException]
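
If nil turns out to be a valid data value, a Clojure protocol can be extended to nil directly. The sketch below uses a hypothetical protocol, PackableSketch, in place of forma.thrift/IPackable, and the packed representations are placeholders rather than real Thrift objects.

```clojure
(defprotocol PackableSketch
  (pack-sketch [x] "Wrap x for storage; stand-in for thrift/pack."))

(extend-protocol PackableSketch
  ;; Extending to nil removes the "No implementation of method" error.
  nil
  (pack-sketch [_] nil)

  Long
  (pack-sketch [x] {:long-val x})

  String
  (pack-sketch [x] {:string-val x}))

(pack-sketch nil) ;; => nil
(pack-sketch 3)   ;; => {:long-val 3}
```

Whether returning nil is the right behavior depends on the schema question raised above. If nil is not a valid data value, the better fix is upstream, keeping nil out of the workflow.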

local step failure in lein test

Not sure what's going on here. Each test namespace compiles fine individually, but when I run lein test I get the following error:

dan@hammer-statistic:~/Dropbox/github/reddmetrics/forma-clj$ lein test
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/mnt/hgfs/Dropbox/github/reddmetrics/forma-clj/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/mnt/hgfs/Dropbox/github/reddmetrics/forma-clj/lib/dev/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Exception in thread "main" cascading.flow.FlowException: local step failed
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:191)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

I am not even sure where else to look for more information on why something is failing.

jobs in cluster mode fail using with-job-conf & mapred.child.java.opts

In cluster mode (using EMR at least), for any query that uses with-job-conf to set the mapred.child.java.opts property, the job never actually starts. It tries to start a number of times, but ultimately fails. Changes to the setting directly in mapred-site.xml don't get picked up for some reason, so this setting has to be modified in hadoop-site.xml:

<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>

Sample query to reproduce error

This only appears to happen in cluster mode, but it does happen even on a single-instance EMR cluster. After uberjaring, run (use 'cascalog.api) from the REPL, then run this:

(with-job-conf {"mapred.child.java.opts" "-Xmx512"}
  (let [src [[1 2]]
        out-loc (hfs-seqfile "s3n://formaexperiments/test-with-job-conf" :sinkmode :replace)]
    (?<- out-loc [?a]
         (src ?a ?b))))

Things I've tried

The query I really want to run (forma/beta-gen) starts if you don't use with-job-conf, but eventually fails b/c the reducers run out of memory for big ecoregions in Brazil and Indonesia. For a smaller country like Malaysia, we don't need to modify the memory configuration, but we must be able to control the memory configuration in order to calculate the beta vectors.
The simple sample query above works without using with-job-conf
It works with (with-job-conf {"mapred.map.tasks" 10} ...
It fails using Cascalog 1.9 AND Cascalog 1.9-wip with (with-job-conf {"mapred.child.java.opts" "-Xmx512"} ... and for several other memory configurations
It works if conf/hadoop-site.xml is modified, in this case so that the max child process memory allocation is 1025m:

<property><name>mapred.child.java.opts</name><value>-Djava.library.path=/home/hadoop/native -Xms1024m -Xmx1025m</value></property>

As far as workarounds go it's not too bad, but it's definitely a pain.
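
One detail that may be worth ruling out, offered as a hypothesis and not a confirmed diagnosis: the JVM reads a heap size with no unit suffix as a number of bytes, so "-Xmx512" requests 512 bytes, while the setting that works uses "-Xmx1025m". The helper below, heap-bytes, is a hypothetical function for checking what a heap option string asks for.

```clojure
(require '[clojure.string :as str])

(defn heap-bytes
  "Parse a -Xms/-Xmx option into a byte count. Returns nil if the string
  is not a heap-size option. A missing unit suffix means bytes."
  [opt]
  (when-let [[_ n unit] (re-matches #"-Xm[sx](\d+)([kKmMgG]?)" opt)]
    (* (Long/parseLong n)
       (case (str/lower-case unit)
         ""  1
         "k" 1024
         "m" (* 1024 1024)
         "g" (* 1024 1024 1024)))))

(heap-bytes "-Xmx512")   ;; => 512
(heap-bytes "-Xmx512m")  ;; => 536870912
(heap-bytes "-Xmx1025m") ;; => 1074790400
```

If the child JVMs refuse to start with a heap that small, that would be consistent with the "Task process exit with nonzero status of 1" messages in the logs, but the logs shown here are not enough to confirm it.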

Sample error messages from the logs

(JOB_SETUP) 'attempt_201207101735_0006_m_000013_8' to tip task_201207101735_0006_m_000013, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 18:04:44,941 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 23 on 9001): Removing task 'attempt_201207101735_0006_m_000013_7'
2012-07-10 18:04:47,946 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 43 on 9001): Error from attempt_201207101735_0006_m_000013_8: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

(JOB_CLEANUP) 'attempt_201207101735_0003_m_000010_17' to tip task_201207101735_0003_m_000010, for tracker 'tracker_10.96.174.59:localhost/127.0.0.1:39641'
2012-07-10 17:53:00,970 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 61 on 9001): Removing task 'attempt_201207101735_0003_m_000010_16'
2012-07-10 17:53:03,973 INFO org.apache.hadoop.mapred.TaskInProgress (IPC Server handler 42 on 9001): Error from attempt_201207101735_0003_m_000010_17: java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
2012-07-10 17:53:06,977 INFO org.apache.hadoop.mapred.JobTracker (IPC Server handler 2 on 9001): Adding task

include information on "recent-ness" of fires

We are neglecting the potential power of highlighting recent fires in the model. Currently, we keep a running total of fires for each pixel through a given period, but have no information on fires in recent periods.
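
The difference between a running total and a recent-period count can be sketched as follows. Both helpers are hypothetical, the window length is an arbitrary choice, and the per-period fire counts are invented for illustration.

```clojure
(defn running-total
  "Cumulative fire count through each period."
  [counts]
  (vec (reductions + counts)))

(defn trailing-sum
  "Fire count over the n most recent periods, for each period."
  [n counts]
  (let [v (vec counts)]
    (mapv (fn [i]
            (reduce + (subvec v (max 0 (- (inc i) n)) (inc i))))
          (range (count v)))))

(def fires [0 2 0 0 1])

(running-total fires)  ;; => [0 2 2 2 3]
(trailing-sum 2 fires) ;; => [0 2 2 0 1]
```

In the fourth period the running total is still 2, while the trailing sum has dropped to 0, so only the second series shows that the fires are no longer recent.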

See discussion here.

Refactor thrift namespace to use :keys :or args

Right now functions are just using optional params, which is awkward to work with in preconditions.
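
A minimal sketch of the proposed style, using a hypothetical constructor instead of the real functions in the thrift namespace. The field names and the default value for t-res are assumptions made for illustration.

```clojure
(defn chunk-with-keys
  "Hypothetical constructor taking optional fields as keyword arguments."
  [dataset val & {:keys [date t-res] :or {t-res "16"}}]
  {:pre [(or (nil? date) (string? date))
         (string? t-res)]}
  {:dataset dataset :val val :t-res t-res :date date})

(chunk-with-keys "ndvi" 1)
;; => {:dataset "ndvi", :val 1, :t-res "16", :date nil}

(chunk-with-keys "ndvi" 1 :date "2006-01-01" :t-res "32")
;; => {:dataset "ndvi", :val 1, :t-res "32", :date "2006-01-01"}
```

Because each optional value is bound to its own name, the preconditions can test date and t-res directly instead of picking them out of a rest-argument list.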

should agg-chunks check number of pixels in chunk?

In static dataset preprocessing, (= ?count chunk-size) appears in the agg-chunks query, and drops any chunks that don't have the correct number of pixels (24k). Are there chunks that could have fewer than 24k pixels? If so, where would they appear? If not, we should be able to drop that line, no?
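
A quick arithmetic check may help answer this, under two assumptions: that chunk-size is 24,000 (read from the "24k" above) and that chunks are cut from complete 500 m MODIS tiles, which are 2400 by 2400 pixels.

```clojure
(def tile-side 2400)   ; pixels per side of a 500 m MODIS tile
(def chunk-size 24000) ; assumed from the "24k" mentioned in the text

(def pixels-per-tile (* tile-side tile-side))           ; 5760000
(def chunks-per-tile (quot pixels-per-tile chunk-size)) ; 240
(def leftover-pixels (rem pixels-per-tile chunk-size))  ; 0
```

Under those assumptions a complete tile divides into exactly 240 full chunks with nothing left over, so a short chunk would only arise if some pixels were missing from the tile before aggregation.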

unpack-feature-vector could be cleaner

See discussion.

Aaron suggests modifying the function so we can replace the current mess with something akin to this:

(map thrift/unpack seq)

As structured now, that wouldn't work in this function since we're only taking bits and pieces of the various inputs to create the feature vector, and the feature vector has a very specific ordering.

forma.WholeFile

WholeFile.java is included in classes -- looking at it now -- but forma.clj fails to compile with this message:

Unknown location:
error: java.lang.RuntimeException: java.lang.ClassNotFoundException: forma.WholeFile

Unknown location:
error: java.lang.ClassNotFoundException: forma.WholeFile

Compilation failed.

revisit pre/post conditions in utils namespaces

See pull request #82. This was necessary to get data out, but we should revisit whether these pre and post conditions are needed. @danhammer knows more about why these changes were made.

:plugins messing with my swank

On the develop branch, I'm seeing some unexpected behavior with lein swank:

https://gist.github.com/806dff51766d586b1f35

The workaround is removing the :plugins from project.clj:

:plugins [[lein-midje "1.0.8"]
          [swank-clojure "1.4.0-SNAPSHOT"]]

To reproduce:

$ git clone https://github.com/reddmetrics/forma-clj.git
$ cd forma-clj
$ lein swank

classify/unpack-feature-vec

In the classify/unpack-feature-vec, add back preconditions and refactor so that it takes a Thrift object as a forma-val.

Dogging.

sdfsdf

use precondition to check that weights are >= 0

When generating a weighted average, the weights should be >= 0. If they're not, we currently throw an exception. Better to use a precondition.
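
A minimal sketch of what such a precondition could look like. The function name weighted-mean and the extra checks on length and total weight are assumptions for illustration, not the existing FORMA code.

```clojure
(defn weighted-mean
  "Weighted average of xs. Weights must be non-negative, match xs in
  length, and sum to a positive number."
  [xs weights]
  {:pre [(= (count xs) (count weights))
         (every? #(>= % 0) weights)
         (pos? (reduce + weights))]}
  (/ (reduce + (map * xs weights))
     (reduce + weights)))

(weighted-mean [1 2 3] [1 1 2]) ;; => 9/4
```

A negative weight now fails with an AssertionError that names the violated condition, so no hand-written exception is needed.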

Discussion here

Code here

pull requests weren't merged in

The GH history of closed pull requests reflects what we thought had happened -- namely, that each of my small pull requests was merged into develop from reddmetrics/feature/... . However, the current BIG pull request indicates that, in fact, the smaller pull requests were not merged in. The BIG pull request is big, sure, but it's not nearly as big as it looks right now -- none of the preceding merges are reflected. We should probably figure out why...

Clean time series

Research how to use the reliability measure to better separate signal from noise in the NDVI timeseries, eventually adding back the cleaning functions in tele-clean. Related to the comment found here.
