geopython / stetl Goto Github PK

View Code? Open in Web Editor NEW

83.0 83.0 35.0 7.12 MB

Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.

Home Page: https://www.stetl.org

License: GNU General Public License v3.0

Python 98.01% Shell 0.09% XSLT 0.75% Dockerfile 1.16%

data-conversion etl etl-framework gis gml inspire osgeo pipeline python streaming-etl transformations

stetl's Introduction

geopython

Vanity package for geopython projects

pip install geopython

>>> import geopython

stetl's People

Contributors

Stargazers

Watchers

stetl's Issues

Authentication support for HttpInput Component

HTTP APIs may require authentication. The HttpInput Component should at least support:

Basic Authentication (username, password)
Bearer (Token) Authentication: single token (see https://tools.ietf.org/html/rfc6750)

The configuration of HttpInput should support at least these auth schemes, but also be prepared for future schemes.

Add option to always apply LCO

When using a ZIP file as input (for example the dutch dataset Bestuurlijke Grenzen, which is a ZIP file containing GML files), the layer creation options are only applied for the first GML file in this ZIP file.

I've made a patch with an option, named always_apply_lco, added to OgrOutput and Ogr2OgrExecOutput. When this option is set to true, the LCO's are always added to the ogr2ogr command string. I will create a PR when the Python 3 migration is done.

Add Stetl to Python Package index

Add Stetl to http://pypi.python.org/pypi

Allow command-line overriding of config settings in .ini file

Config settings are now in .cfg ini file. It would be nice to have a mechanism to override/substitute settings on the command line. Typical settings are database names, users, passwords etc.

e.g. main.py -c my.cfg -host=myhost.com -dbname=mydb etc

Strange behaviour XmlAssembler?

While writing unit tests for XmlAssembler, I ran into a couple of issues. At first I've set up a chain reading only one GML file with three FeatureMember elements. In my config I wanted to write an etree doc for every two elements. I'm expecting two documents in this case, one with two elements, and one with only one element (the last one). I was surprised that no doc was written (to stdout). Here is my config:

# Config file for unit testing XmlAssembler.

[etl]
chains = input_glob_file|parse_xml_file|xml_assembler|output_std

[input_glob_file]
class = inputs.fileinput.GlobFileInput
file_path = tests/data/dummy.gml

# The source input file producing XML elements
[parse_xml_file]
class = filters.xmlelementreader.XmlElementReader
element_tags = FeatureMember

# Assembles etree docs gml:featureMember elements, each with "max_elements" elements
[xml_assembler]
class = filters.xmlassembler.XmlAssembler
max_elements = 2
container_doc = <?xml version="1.0" encoding="UTF-8"?>
   <gml:FeatureCollectionT10NL
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:top10nl="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2"
    xmlns:brt="http://www.kadaster.nl/schemas/imbrt/brt-alg/1.0"
    xmlns:gml="http://www.opengis.net/gml/3.2"
    xsi:schemaLocation="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2 http://www.kadaster.nl/schemas/top10nl/vyyyymmdd/TOP10NL_1_2.xsd">
    </gml:FeatureCollectionT10NL >
element_container_tag = FeatureCollectionT10NL

[output_std]
class = outputs.standardoutput.StandardOutput

I was suspecting this check in XmlAssembler.consume_element:
if element is None or packet.is_end_of_stream() is True:
(Note that the is True is redundant, but that doesn't matter.)
It turned indeed out that packet.is_end_of_stream was true. I think it is already caused by the GlobFileInput. I've just added this input class yesterday. It could be the case that I'm not understanding properly when is_end_of_stream should be set to true, but I'm wondering whether a filter which can return multiple packets based on one input packet (for example when an XML file is being parsed using XmlElementReader) should actually reset is_end_of_stream or is_end_of_doc.

When I skip this check, so I'm only checking for element is None, then a new XML document is generatedfor every XML element, so I was getting 3 documents, instead of the expected 2.

When I'm reading all GML files in my test data directory (currently 3 files), by setting file_path to tests/data/*.gml in input_glob_file, I'm getting either 6 documents (while checking for packet.is_end_of_stream()) or 9 documents. With 3 files I'm actually expecting 6 documents (3 x 2), namely a doc with 2 elements followed by a doc with 1 element, three times. However, each document contains only one element, only of the first 2 GML files. When disabling the aforementioned check I'm getting 9 docs, each with one element.

So, my question is how packet.is_end_of_stream and packet.is_end_of_doc should actually behave. Should they be reset when one input packet result in multiple output packets for the particular component? Or is there more to it?

I've attached my unit test file. The method test_execute is just a work-in-progress.
test_xml_assembler.zip

Python 3 support

Currently the Stetl module doesn't appear to support Python 3, building with Python 3 results in SyntaxErrors:

  File "/usr/lib/python3.4/dist-packages/stetl/postgis.py", line 34
    except psycopg2.DatabaseError, e:
                                 ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/main.py", line 84
    print name, data
             ^
SyntaxError: Missing parentheses in call to 'print'

  File "/usr/lib/python3.4/dist-packages/stetl/factory.py", line 26
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/etl.py", line 73
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/utils/apachelog.py", line 183
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/inputs/fileinput.py", line 183
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/inputs/deegreeinput.py", line 158
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/filters/templatingfilter.py", line 173
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/filters/xmlassembler.py", line 68
    except Exception, e:
                    ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.4/dist-packages/stetl/filters/gmlsplitter.py", line 130
    except Exception, e:
                    ^
SyntaxError: invalid syntax

Transformation Filter using Python-templating languages

Templating languages are used extensively in Python web-frameworks like Django and Pylons.
There is an enormous choice in templating technologies, see https://wiki.python.org/moin/Templating,
from very simple parameter substitution to full-Python control. In many cases Templating may be much simpler than XSLT Filtering. Think of INSPIRE GML where 90% of the GML is just "boilerplate" GML with a few variables and constants to be substituted. This is also an experiment but I have good hope this can work for many (INSPIRE-) cases.

Via this issue a foundation is laid to support some very simple templating like Python built-in string.Template and the popular Jinja2 templatin (http://jinja.pocoo.org/).

Basic idea is a TemplatingFilter with a template file or string with input (Jinja2 context) structured passed in from an Input. Output is typically a document, like a GML file but other setups are possible.

Ogr2OgrExecOutput issues

I've noticed a couple of issues which make Ogr2OgrExecOutput a bit less flexible than necessary. I'm writing them here, because of pending changes in execoutput in PR #75, so these changes can be done once that PR has been closed (either accepted or dismissed). I'm prepared to do these changes myself. For now it would be wise to have a discussion about the proposed changes.

Handling of position arguments. The position arguments of ogr2ogr are dst_datasource_name src_datasource_name [layer [layer ...]]. The rest are keyword arguments and they can be ignored for this matter. So you'll need at least two position arguments for ogr2ogr to work. They can be expanded indefinitely by adding layers. So, I'd like to propose that, when composing the ogr2ogr command string the keyword arguments are all put in front of the position arguments. This means that self.dest_data_source should only be added to the command in the execute method. We're not handling layers yet, but they could be added to in the future.
The dest_data_source should be optional. For my current use case I'm exporting data from PostgreSQL to a couple of different files. I'm using a combination of LineStreamerFileInput (list of data to be exported) + FormatConverter (split list into separate strings) + StringSubstitutionFilter (handling temp dir) + Ogr2OgrExecOutput. So, ogr2ogr is invoked multiple times. In my command file I'm specifying the dest source, as well as an -sql option, which is also different for every invocation. In my config I've defined an empty dest_data_source.
Add an optional parameter for the source data source. In my use case (see above) this is always the same, namely the Postgres connection string.
Make it configurable whether the lco options should be only called once or during every invocation of ogr2ogr. Currently I've put a -lco argument in the options string.

As you might guess, I'm looking for a solution where not only the source can change after each invocation, but also the destination and parameters. Ideally they should be passed in through a record. While this is possible, I think this is a next step in the evolution of this output object.

Handle file/compressed/directory structures and file-chunking in FileInput classes

FileInput and derived classes like StringFileInput can handle lists of files from directory and glob.glob parameters. Still all file content is read/passed as a single Packet. Also .zip files are handled by a dedicated class ZipFileInput.

It should be possible to generalize FileInput to have derived classes read from files no matter if files came from directory structures, glob.glob expanded file lists or .zip files. Even a mixture of these should be handled. For example within NLExtract https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.

A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput
open/parse a file but pass file-content (lines, parsed elements) in
fine-grained chunks on each read(). Currently these classes implement this fully
within their read() function, but the generic pattern is that they
maintain a "context" for the open/parsed file.

So all in all this issue addresses two general aspects:

handle any file-specs: directories, maps, Globbing, zip-files and any mix of these
handle fine-grained file-chunking: on each invoke()/read() may supply part of a file: a line an XML element etc.

See also issue #49 for additional discussion which lead to this issue.
The Strategy Design Pattern may be applied (many refs on the web).

Provide compact and reusable Dockerfile

The current Dockerfile for Stetl needs several improvements:

its Docker Image is "bulky", about 1.4GB, due to use of Ubuntu as base Image
hard to reuse: single ENTRYPOINT, harder to add e.g. user-plugins
not in basedir of git-repo, harder to automate in e.g. Docker Hub/Cloud
provide Docker Compose example(s)

Ad 1) better is to use a small-sized Python base image like python-alpine Linux.

Are LCO options always executed when expected?

While working on Stetl I was wondering whether the LCO options are always executed when necessary, as expected. This appears not to be the case in my current version (see PR #28), using the Ogr2OgrExecOutput. The LCO options are only passed once. The problem is that you don't exactly when these options need to be executed, and I think this is also the issue with the current Ogr2OgrOutput.

For example, when the BRT is loaded (using file chunks), a temporary GML file is being created, which is then being loaded by ogr2ogr. This GML file doesn't necessarily contain all the feature types which can be found in BRT. Ogr2ogr only creates the tables (when loading in PostGIS) for the features occurring in the temporary GML file. So, on subsequent runs of ogr2ogr, new tables can be created, but in those cases the LCO options are not applied anymore.

Add (better) raster support to Stetl

Integrate https://github.com/mapbox/rasterio or something similar?

Ignore parse errors in Apache Log file input

Do not issue fatal exception (this stops ETL process), just skip log-record with warning.

Use XProc

XProc in a 2010 W3C recommendation http://www.w3.org/TR/xproc/

It might interesting considering using XProc in your context

GUI for Stetl ETL Execution

Several requests for "a GUI for Stetl" were made. There are two GUIs to consider:

a "construction GUI" to create Stetl Chains, i.e. a Stetl config file
an "execution GUI" to execute the actual ETL (based on Stetl config file)

The first would require a tool "FME-like" to draw Inputs, Filters and Outputs and connect and paramterize them. The second is easier: manage the execution for a single Stetl config file. This issue adresses only the second/execution GUI.

Requirements

An initial set of requirements:

cross-platform (Win, Mac, Lin)
easily (un)installable (installer)
minimal scenario: choose ETL config, parameterize (e.g. inputs and output parameters), execute

Secondary requirements:

prepare for becoming a Cloud service

Implementation options

cross-platform Python GUI framework like PyQT, WxPython etc
Webtech cross-platform: Electron or lighter: PyWebView: https://github.com/r0x0r/pywebview
Jupyter-based: https://github.com/r0x0r/pywebview ? (GUI via websockets)

Allow Environment vars to substitute/override config template arg-variables

Stetl supports reusable ETL configurations via symbolic/substitutable variables like host-names, database credentials etc.

These variables can be substituted via env (arg) files or -a parameters when using the stetl command.

In several Stetl deployments, in particular where Docker (Compose) and Kubernetes (K8s) is used, there is a need to configure these variables via the "Environment". For example the "Secret" store in K8s may store DB-credentials. These variables are usually passed as (Unix/Linux) environment variables. Actual use-cases are currently within the https://github.com/smartemission K8s project.
This is part of the SE migration from Docker plain to Compose and K8s.

Stetl should be able to either substitute and override template arg-values from environment variables. This may require some convention in naming, as we don't want to break existing Stetl configs. For example, a Stetl config may use the var-name {hostname} internally. We don't want to substitute non-related env-vars by accident. So possibly Stetl-related env-vars should be prefixed with STETL_ or alike.

Stetl should not output passwords and other particular data in its log

For example, when loading the BGT, the following output is generated:

2018-01-24 21:30:05,904 ETL INFO Substituting 15 args in config file from args_dict: OrderedDict([('__name__', 'asection'), ('input_dir', '/var/nlextract/data/bgt/leveringen/latest'), ('zip_files_pattern', '*.[zZ][iI][pP]'), ('filename_match', '*.gml'), ('temp_dir', 'temp'), ('gfs_template', 'gfs/imgeo-v2.1.1.gfs'), ('host', '****'), ('port', '5432'), ('user', '****'), ('password', '****'), ('database', 'bgt'), ('schema', 'latest'), ('multi_opts', '-fieldTypeToString StringList'), ('spatial_extent', ''), ('max_features', '20000')])

and

2018-01-24 21:30:06,211 output INFO cfg = {'database': 'bgt', 'class': 'outputs.dboutput.PostgresDbOutput', 'host': '****', 'user': '****', 'password': '****', 'port': '5432', 'schema': 'latest'}

I've also masked the host and username.

The reason that this behaviour is undesired, is that the logging can be parsed and stored by other tools. For example, with Docker it is common practice to output log data to stdout, and then the output can be processed and stored by components you don't own.

Graphical UI to create workflows

Something based on http://nodered.org/ or similar? Note that Node-RED depends on Node.js!

ZipFileInput: provide "glob.glob"-like filename-filter

The current version of ZipFileInput provides a list of all filenames within each .zip archive. In some cases we like to have a subset or single name from these lists. For example the Dutch Kadastral Parcels dataset contains .zips with both the polyline and polygon versions of the parcels. Some would like to extract only one of these.

Proposed is a configuration option like filename_match to which a regular expression can be provided. Possibly the Python library utility fnmatch can be of help: https://docs.python.org/2/library/fnmatch.html

Spurious lxml error in later lxml versions

From lxml iterparse in later versions e.g. the standard lxml version in Ubuntu 13.10 throw sudden parse exceptions where before a StopIteration exception was thrown and XML was valid...Known lxml issue, see:
https://bugs.launchpad.net/lxml/+bug/1185701

For now solved by catching the etree.XMLSyntaxError exception in XmlElementStreamerFileInput.read() together with existing StopIteration.

Use streaming XML parsing for GML splitting

Now GML splitting is line-based. This is tricky as it relies on EOL after each element. Better is to use (lxml) streaming parsing.

Document per-Component config attributes

Components (Inputs, Outputs, Filters) are configured in the .ini files with specific attributes. However it is not documented:

the posssible attributes
is the attribute optional or mandatory
the type (int, real, string etc)
the default value
some documentation

It seems hard to do this via docstrings and Sphinx autodoc. An idea is to supply this info via the commandline: stetl --doc stetl.inputs.fileinput.StringFileInput.

Generic ogr2ogr Input Component

Currently there is only OgrPostgisInput, but like the OgrOutput we should have a more generic OgrInput Component.

GeoJSON to GML Jinja2 Filter

Allow GeoJSON geometries to be converted to GML geometries. Uses ogr Python bindings. also example:

  {% for feature in features %}
    <gml:featureMember>
        <cities:City>
            <cities:name>{{ feature.properties.CITY_NAME }}</cities:name>
            <cities:geometry>
                {{ feature.geometry | geojson2gml(crs=crs, gml_format='GML3', gml_longsrs='YES') }}
            </cities:geometry>
        </cities:City>
    </gml:featureMember>
  {% endfor %}

Fix flake8 errors

Flake8 flake8 is a command-line utility for enforcing style consistency across Python projects.

There are still quite some errors in Stetl that need to be fixed:

./stetl/util.py:137:5: C901 'Util.elem_to_dict' is too complex (37)
./stetl/util.py:197:41: E721 do not compare types, use 'isinstance()'
./stetl/util.py:336:9: E722 do not use bare except'
./stetl/util.py:349:1: E722 do not use bare except'
./stetl/util.py:350:5: F401 'StringIO.StringIO' imported but unused
./stetl/util.py:354:1: C901 'TryExcept 354' is too complex (11)
./stetl/util.py:414:32: W601 .has_key() is deprecated, use 'in'
./stetl/filters/templatingfilter.py:166:5: C901 'Jinja2TemplatingFilter.create_template' is too complex (11)
./stetl/filters/xmlassembler.py:73:29: F841 local variable 'e' is assigned to but never used
./stetl/filters/xmlelementreader.py:78:5: C901 'XmlElementReader.process_xml' is too complex (11)
./stetl/filters/xmlelementreader.py:79:15: E714 test for object identity should be 'is not'
./stetl/filters/zipfileextractor.py:40:9: F841 local variable 'event' is assigned to but never used
./stetl/filters/zipfileextractor.py:46:9: F401 'os' imported but unused
./stetl/inputs/deegreeinput.py:49:5: C901 'DeegreeBlobstoreInput.read' is too complex (15)
./stetl/inputs/fileinput.py:153:29: F841 local variable 'e' is assigned to but never used
./stetl/inputs/fileinput.py:195:5: C901 'XmlElementStreamerFileInput.read' is too complex (12)
./stetl/inputs/fileinput.py:354:161: E501 line too long (175 > 160 characters)
./stetl/inputs/fileinput.py:382:29: F841 local variable 'e' is assigned to but never used
./stetl/inputs/fileinput.py:436:33: E251 unexpected spaces around keyword / parameter equals
./stetl/inputs/fileinput.py:437:5: E128 continuation line under-indented for visual indent
./stetl/inputs/fileinput.py:438:91: E203 whitespace before ','
./stetl/inputs/fileinput.py:562:1: W293 blank line contains whitespace
./stetl/inputs/httpinput.py:106:1: W293 blank line contains whitespace
./stetl/inputs/httpinput.py:276:1: W293 blank line contains whitespace
./stetl/outputs/dboutput.py:147:9: E122 continuation line missing indentation or outdented
./stetl/outputs/deegreeoutput.py:64:5: C901 'DeegreeBlobstoreOutput.write' is too complex (11)
./stetl/outputs/deegreeoutput.py:80:35: W601 .has_key() is deprecated, use 'in'
./stetl/outputs/deegreeoutput.py:83:39: W601 .has_key() is deprecated, use 'in'
./stetl/outputs/deegreeoutput.py:91:13: F841 local variable 'ogrGeomWKT' is assigned to but never used
./stetl/outputs/deegreeoutput.py:94:161: E501 line too long (174 > 160 characters)
./stetl/outputs/deegreeoutput.py:158:9: F841 local variable 'gml_doc' is assigned to but never used
./stetl/outputs/deegreeoutput.py:176:9: F841 local variable 'result' is assigned to but never used
./stetl/outputs/ogroutput.py:120:5: C901 'OgrOutput.init' is too complex (16)
./stetl/utils/apachelog.py:156:42: E701 multiple statements on one line (colon)
./stetl/utils/apachelog.py:186:5: C901 'parser.parse' is too complex (20)
./stetl/utils/apachelog.py:204:25: E722 do not use bare except'
./stetl/utils/apachelog.py:209:25: E722 do not use bare except'
./stetl/utils/apachelog.py:225:25: E722 do not use bare except'
./stetl/utils/apachelog.py:331:1: C901 'If 331' is too complex (11)
./stetl/utils/apachelog.py:335:5: E303 too many blank lines (2)
./stetl/utils/apachelog.py:434:84: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:435:84: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:443:79: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:449:78: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:470:5: E303 too many blank lines (2)

Once fixed, the flake8 command can be added to the .travis file.

Performance stats for Stetl Components

Currently when a Stetl ETL Chain is invoked it prints out the number of "rounds" (number of times a Stetl Chain is invoked) and the total processing (ETL) time.

Often more detailed performance metrics are required in order to track down performance bottlenecks.

This issue proposes a very simple and minimal stats to be printed per Stetl Component object:

number of times the Component is invoked invokes
processing time (in seconds): total, minimum, maximum, average (total/invokes)

As Stetl always has control of Component invokation, component.py seems to be the best option to collect timing stats.

A simple print line will do like:

2018-06-21 13:41:13,488 component INFO RefineFilter invokes=144 time(total, min, max, avg) = 0.150 0.001 0.072 0.001

Add SQLite DB Input

Provide an Input similar to PostgresDbInput: fetch data as record_array's from SQLite DB via configured query. Specialized cases may query something like last N inserted records.

May refactor common functionality in PostgresDbInput and SqliteDbInput into a common base class SqlDbInput.

Apache Log File input

There is a need for structured/record-based data of Apache logfiles. All kinds of analysis and statistics can be performed, for example when log-records are stored in a (spatial) database. Think of statistics for tiling services: which areas are requested the most? Then these areas could be pre-tiled in more resolutions. But also user-statistics like IP-adresses, HTTP-referers. Performance degradation could be monitored etc.

For these kinds of ETL an ApacheLogFileInput Stetl Component should be developed. As Apache logfiles can have multiple formats driven by an expression like:

  '%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'

This should be taken into account. If possible existing Open Source GPL implementations should be used like: https://apachelog.googlecode.com.

Dockerize Stetl

Docker is currently one of the easiest ways to deploy any service or program. This issue should Dockerize Stetl: provide Docker support for deploying Stetl.

What needs to be done:

provide a Dockerfile
may publish Stetl Docker image in the DockerHub

Successful experience with a Dockerized Stetl was gained in the SmartEmission project:
https://github.com/Geonovum/smartemission/tree/master/docker/stetl and usage
https://github.com/Geonovum/smartemission/tree/master/etl

The SE project could in time consume the official Stetl Docker image coming out of the Stetl project.

Generic OGR Output component

Like the OGR Input a generic output component that uses GDAL/OGR Python SWIG wrappers to Open/Create and write to any OGR datasource. Quite some parameters required. Puzzling what the input format should be. ogr_feature?

Unit testing Stetl

Provide Stetl with good unit test coverage.
Also see nlextract/NLExtract#193

Provide for Merging (Combining) Components

In addition to Splitting implemented via issue #35, there is a need for Combining/Merging at least Inputs. A use-case is within the Smart Emission Project: here we need to collect (harvest) data from multiple remote HTTP REST APIs, see smartemission/smartemission#61.

This could be implemented by allowing an Input to collect from multiple HTTP endpoints, but this would require specific implementations for each Input type.

Basic idea is to use the notation also used for Splitting via issue #35, for example to merge two inputs input1 and input2 into single filter and output, the following Chain would be defined:

(input1)(input2) | filter | output

This would be the most common use-case. Additional cases could be applied with sub-Chaining, for example:

(input1 | filter1) (input2 | filter2) | filter | output

Dependent on the ease of implementation, the latter cases may be included or else be deferred to a separate issue.

Add LineStreamerFileInput

LineStreamerFileInput will stream Packets, line by line from a text file. Used in Geonovum Sensors platform to read Records spread over multiple lines in the Smart Emission project. See also
Geonovum/sospilot#21

For example a single record could be like this:

07/24/2015 07:26:12,P.UnitSerialnumber,1
07/24/2015 07:26:12,S.Longitude,5914103
07/24/2015 07:26:12,S.Latitude,53949942
07/24/2015 07:26:12,S.SatInfo,90889
07/24/2015 07:26:12,S.O3,161
07/24/2015 07:26:12,S.BottomSwitches,0
07/24/2015 07:26:12,S.RGBColor,16772501
07/24/2015 07:26:12,S.LightsensorBlue,91
07/24/2015 07:26:12,S.LightsensorGreen,144
07/24/2015 07:26:12,S.LightsensorRed,155
07/24/2015 07:26:12,S.AcceleroZ,755
07/24/2015 07:26:12,S.AcceleroY,510
07/24/2015 07:26:12,S.AcceleroX,512
07/24/2015 07:26:12,S.NO2,91
07/24/2015 07:26:12,S.CO,32392
07/24/2015 07:26:12,S.Altimeter,118
07/24/2015 07:26:12,S.Barometer,101096
07/24/2015 07:26:12,S.LightsensorBottom,26
07/24/2015 07:26:12,S.LightsensorTop,224
07/24/2015 07:26:12,S.Humidity,48526
07/24/2015 07:26:12,S.TemperatureAmbient,299425
07/24/2015 07:26:12,S.TemperatureUnit,305400
07/24/2015 07:26:12,S.SecondOfDay,34016
07/24/2015 07:26:12,S.RtcDate,1012101
07/24/2015 07:26:12,S.RtcTime,596536
07/24/2015 07:26:12,P.SessionUptime,60811
07/24/2015 07:26:12,P.BaseTimer,9
07/24/2015 07:26:12,P.ErrorStatus,0
07/24/2015 07:26:12,P.Powerstate,79
07/24/2015 07:26:12,P.UnitSerialnumber,1  # new record etc
etc

Travis for Stetl

This issue is to get Travis working for Stetl. An initial travis.yml has been constructed but needs expansion:

tests (nose2 (how) is result failure notified?)
code coverage
Python style/syntax: flake8
different Python versions, in particular Python3

And possibly more. This issue is to identify Travis config work. The (nose2) tests themselves are in other issues like #50 and #52.

Also: GDAL2 support could not be realized easily: UbuntuGIS PPA seems to be blocked by Travis....

Provide for ETL Chain Splitting

There are cases where we would like to split a Stetl ETL Chain: for example to publish converted data to both a database and a CSV file or some or multiple web API(s). Most ETL frameworks provide a Splitter (and a Combiner or Merger, which is also handy but trickier and less required).

Implementation considerations

This could be built into to internal Stetl base classes. This would require a change in the .ini-file notation, examples:

Simplest case is Output splitting using the () notation as we may want to split into sub-Chains:

input | filter | (output1) (output2)

Splitting into sub-Chains at Filter-level:

input1 |filter1 | (filter2a | output1a) (filter2b | output1b)

to split the output of filter1 into sub-Chains filter2a | output1a and filter2b | output1b.

Combining could use similar notation:

(input1)(input2) | filter | output

Or even splitting + combining:

(input1)(input2) | filter | (filter2a | output1a) (filter2b | output1b)

Another option is a specialized Filter or Output. In the latter case like a CompositeOutput which is parameterized with a list of Outputs that it calls upon. Disadvantage is that the Chain configuration is hidden in the Composite Component's config.

In first instance, simple Chain splitting will be provided with this issue. A Combining implementation will be done in a separate issue.

Allow multiple -a args for Stetl main prog

Currently only one -a argument can be passed to set either a list of options or a single options (.args) file. Allowing multiple -a arguments allows for more simpler overriding of for example default options. For example:

stetl -c my.cfg -a default.args -a my.args

or

stetl -c my.cfg -a default.args -a db_host=host -a db_user=me -a db_password=xyz

This allows to keep all default args and in my.args or explicit settings just have a few options like passwords.
The order of the -a args will determine overriding order. Args passed via the Environment like stetl_db_password will still prevail over any commandline args.

Allow Format conversion

Connections between Stetl components need to have compatible inputs/outputs FORMATs. Often we would like to be able to convert in order to connect and reuse. The easiest seems to be a generic FormatConverterFilter Component that can be placed inbeween incompatible i/o's. The FCF can call upon specific converters, and be extensible for custom user-defined FORMATs as in issue #12.

All HTTP clients should use "requests" package

In particular httpoutput uses httplib which relies on urllib and only supports HTTP/1.0 which causes problems in some setups.

Better is to use the requests package, also has a simpler programming model.

Use Fiona input and output

Fiona https://github.com/Toblerity/Fiona is a simple but powerful library to access (read/write) OGR sources. Fiona will fit very nicely into Stetl at least in three aspects:

Fiona input modules
Fiona output modules
the Fiona Feature data structure as Stetl packet

Moving Stetl git repo from justb4 to geopython organization

On July 13, 2016 the GH repo https://github.com/justb4/stetl will be moved to https://github.com/geopython/stetl the GeoPython organization on GH. This issue is a checklist of TODOs for this transfer. Many thanks to Tom Kralidis for support. Items in Italic are done.

justb4 member of https://github.com/geopython
execute transfer to https://github.com/geopython
exec git remote set-url origin https://github.com/geopython/stetl in local project
all refs in Sphinx documentation and setup.py (PyPi) to new repo
check PyPi entry
Docker auto-image build on https://hub.docker.com/r/justb4/stetl
submodule Stetl in NLExtract GH project: https://github.com/opengeogroep/NLExtract/tree/master/externals thanks: http://pa1gitsolutions.blogspot.nl/2015/07/changing-git-submodules-urlbranch-to.html
where possible reupload presentations to Slideshare with new repo-link
update https://github.com/geopython/geopython.github.io/blob/master/index.html
notify contributors and other interested parties (NLExtract mailing list, Dutch OSGeo)
notify https://github.com/sebastic for Debian packaging
update Docker file in project https://github.com/Geonovum/smartemission/tree/master/docker/stetl
update Stetl git in project https://github.com/Geonovum/sospilot
update Stetl Gitter at https://gitter.im/justb4/stetl

Feature request: Pass options in config file

Stetl is having its configuration delivered in two different ways currently. A config file is delivered through the command line, but also many options are passed at the command line, through the -a parameter.

For example with TOP10extract (NLExtract):
python $STETL_HOME/stetl/main.py -c etl-top10nl.cfg -a "$pg_options temp_dir=temp max_features=$max_features gml_files=$gml_files $multi $spatial_extent"

The disadvantage of this approach is that the options are generated with shell scripting. This has a drawback on the portability of Stetl to other platforms, like Windows. (As far as I can tell, there are no other issues, although I've currently run Stetl only through MingW MSYS.)

Of course there is still a use case to have Stetl accept the -a command line option, but in many cases a the options which are passed through -a do not change.

Generic OGR Input Component

A native OgrInput that will use the Python SWIG wrappers from GDAL/OGR to Open (and Close) any OGR data source. Produced FORMAT can be ogr_feature and/or possibly ogr_layer or ogr_feature_array. Converters should be able to convert to a Fiona-like GeoJSON Python data structure.

Add new vsizip filter

When importing data from a ZIP file, for example the dutch dataset Bestuurlijke Grenzen, which is a ZIP file containing GML files, the current approach is to use a ZipFileInput in combination with a ZipFileExtractor. The latter extracts the ZIP file to a temporary directory. However, GDAL/OGR also has support for "virtual file systems". One of them is a filter named vsizip. When the string "/vsizip/" is prepended to the input path, OGR can directly read data from a ZIP file without unzipping. When a dataset is large (for example the dutch BGT) and you have not that much disk space left, then this way you don'tneed to unzip the individual files (even albeit the unzipping can be done one by one).

For this reason I've added a new filter named VsiZipFilter, and an abstract base class called VsiFilter, so other virtual file system filters can eventually be added in the future (for example vsicurl). These filters can even be chained, which is also true for their Stetl counterparts.

I will submit a PR when the Python 3 migration is done.

One note though: I've manually disabled the creation of GFS files when importing GML files through a VSI filter. The current approach, which is generating a GFS file next to the GML file (when it is unzipped), should be redesigned. There was an issue that the provided GFS template is ignored. This should be solved. Perhaps by passing GML_GFS_TEMPLATE in a different way (either as -lco, -config, -oo), or make a copy with a newer timestamp as the GML file (which is why the current approach works) and pass the name by GML_GFS_TEMPLATE, instead of generating it in the same location as the (unzipped) GML file.

PostgresInsertOutput: use UPDATE i.s.o. DELETE/INSERT to replace records

The current implementation of outputs.dboutput.PostgresInsertOutput uses DELETE (by key) followed by INSERT to optionally replace existing records (with same key). This is not really a true replacement, as e.g. a new gid may be created and auto-incremented. Also in cases the sequence of gid's may be containing "holes". Better is to use UPDATE (or even UPSERT in PG10). In Smart Emission ETL we were successful with te following addition of UPDATE, lazily creating a template UPDATE first, similar to INSERT:

    def create_update_query(self, record):
        # We assume that all records do the same UPDATE key/values
        # https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/6527838#6527838
        
        # e.g. UPDATE table SET field='C', field2='Z' WHERE id=3;
        query = "UPDATE %s SET (%s) = (%s) WHERE  %s = %s" % (
            self.cfg.get('table'), ",".join(['%s ' % k for k in record]), ",".join(["%s", ] * len(record.keys())), self.key, "%s")
        log.info('update query is %s', query)
        return query

    def insert(self, record):
        res = 0
        if self.replace and self.key and self.key in record:

            # Replace option: try UPDATE if existing
            # https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/6527838#6527838
            values = record.values()
            values.append(record[self.key])
            res = self.db.execute(self.update_query, values)
            # del_query = "DELETE FROM %s WHERE %s = '%s'" % (self.cfg.get('table'), self.key, record[self.key])
            # res = self.db.execute(del_query)

        if res < 1:
            # Do insert with values from the record dict
            # only if we did not do an UPDATE on existing record.
            self.db.execute(self.query, record.values())
        self.db.commit(close=False)

Make Packet FORMAT extensible

Now only specific formats are supported. It should be possibly to programmatically extend formats. e.g. FORMAT.add()

GDAL/OGR version 2.0?

Since GDAL/OGR version 2.0 is out for a while, is it a good idea to switch to this as the minimum version of Stetl? I haven't looked into it yet, but will soon. I'm just starting this discussion, because the handling of GFS files while reading GML seems to be somewhat buggy in version 1.11. See pull request #31. I'm also not very happy with how layer creation options are working right now (see issue #30), but I've no idea whether improvements have been made regarding this in version 2.0.

One of the minimal criteria to switch to the new version is that it is easily available on the most important platforms, i.e. Linux, Mac and Windows.

See http://www.osgeo.org/node/1591 for more information.

Make Stetl multithreaded

Stetl is an ideal application to be made multithreaded. Most of the time it is processing datasets which consists of multiple files, and it is run in (server or desktop) environment where multiple processors or cores are available.

Support Fiona Input and Output

Fiona https://github.com/Toblerity/Fiona is a really nice framework to interact with OGR inputs and outputs. It also takes a Pythonistic approach in creating lightweight datastructures for Features (Python built-in types).

Integrating Fiona into Stetl should not be too hard:

Input: create a Fiona Input Component that can Open an OGR source and supply a stream or list of Features
FORMAT: add a new format fiona_record or something similar
Output: create a Fiona Output Component that can Open an OGR source and write a stream or list of Features

Within Filters we may apply any other programming, like with Shapely.

Provide a Sieve Filter

Several use cases arose where we need to filter out, i.e. "sieve" or "passthrough" data (Packets) based on content of their data. For example particular Records only need to be passthrough (or discarded) based on the value(s) of an attribute.

A particular case is within the Smart Emission ETL where within a Refiner we need to write all transformed timeseries data records to PostgreSQL, but only a subset (for gases CO,CO2,NO2,O3) to InfluxDB while still using all generic Stetl Components. One way to achieve this is to split the RefineFilter results and prefix the InfluxDBOutput with a RecordSieve Filter that let only records pass with a component attribute value matching these gases.

One can also think of geospatial sieving scenario's where one filters out a particular geospatial area, or as far as to have WFS/SLD Filter-like expressions.

Depth_search not working in file input

When setting depth_search=False to a file input, files which are in a subdirectory and match the glob pattern are still being found. The actual issue is in Util.make_file_list.
I have a fix ready. I will do a PR when the Python 3 migration is done.