Vanity package for geopython projects
pip install geopython
>>> import geopython
Stetl, Streaming ETL, is a lightweight geospatial processing and ETL framework written in Python.
Home Page: https://www.stetl.org
License: GNU General Public License v3.0
HTTP APIs may require authentication. The configuration of the HttpInput Component should support at least the common auth schemes, but also be prepared for future schemes.
When using a ZIP file as input (for example the Dutch dataset Bestuurlijke Grenzen, which is a ZIP file containing GML files), the layer creation options are only applied to the first GML file in this ZIP file.
I've made a patch with an option named always_apply_lco, added to OgrOutput and Ogr2OgrExecOutput. When this option is set to true, the LCOs are always added to the ogr2ogr command string. I will create a PR when the Python 3 migration is done.
Add Stetl to http://pypi.python.org/pypi
Config settings are now in a .cfg ini file. It would be nice to have a mechanism to override/substitute settings on the command line. Typical settings are database names, users, passwords etc.
e.g. main.py -c my.cfg -host=myhost.com -dbname=mydb etc
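A minimal sketch of how such command-line overrides could be merged into settings read from a .cfg ini file, using the stdlib configparser (the helper name apply_overrides is hypothetical, not actual Stetl API):

```python
# Sketch (hypothetical, not the actual Stetl CLI): merge "key=value"
# command-line overrides into settings read from a .cfg ini file.
from configparser import ConfigParser

def apply_overrides(config, section, overrides):
    """Override config values in `section` with 'key=value' strings."""
    for item in overrides:
        key, _, value = item.partition("=")
        config.set(section, key, value)
    return config

cfg = ConfigParser()
cfg.read_string("[db]\nhost = localhost\ndbname = test\n")
apply_overrides(cfg, "db", ["host=myhost.com", "dbname=mydb"])
print(cfg.get("db", "host"))    # myhost.com
print(cfg.get("db", "dbname"))  # mydb
```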
While writing unit tests for XmlAssembler, I ran into a couple of issues. At first I set up a chain reading only one GML file with three FeatureMember elements. In my config I wanted to write an etree doc for every two elements. I'm expecting two documents in this case: one with two elements, and one with only one element (the last one). I was surprised that no doc was written (to stdout). Here is my config:
# Config file for unit testing XmlAssembler.
[etl]
chains = input_glob_file|parse_xml_file|xml_assembler|output_std
[input_glob_file]
class = inputs.fileinput.GlobFileInput
file_path = tests/data/dummy.gml
# The source input file producing XML elements
[parse_xml_file]
class = filters.xmlelementreader.XmlElementReader
element_tags = FeatureMember
# Assembles etree docs from gml:featureMember elements, each with "max_elements" elements
[xml_assembler]
class = filters.xmlassembler.XmlAssembler
max_elements = 2
container_doc = <?xml version="1.0" encoding="UTF-8"?>
<gml:FeatureCollectionT10NL
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:top10nl="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2"
xmlns:brt="http://www.kadaster.nl/schemas/imbrt/brt-alg/1.0"
xmlns:gml="http://www.opengis.net/gml/3.2"
xsi:schemaLocation="http://www.kadaster.nl/schemas/imbrt/top10nl/1.2 http://www.kadaster.nl/schemas/top10nl/vyyyymmdd/TOP10NL_1_2.xsd">
</gml:FeatureCollectionT10NL >
element_container_tag = FeatureCollectionT10NL
[output_std]
class = outputs.standardoutput.StandardOutput
I was suspecting this check in XmlAssembler.consume_element:
if element is None or packet.is_end_of_stream() is True:
(Note that the is True is redundant, but that doesn't matter.)
It indeed turned out that packet.is_end_of_stream was true. I think it is already caused by the GlobFileInput; I just added this input class yesterday. It could be that I'm not understanding properly when is_end_of_stream should be set to true, but I'm wondering whether a filter which can return multiple packets based on one input packet (for example when an XML file is being parsed using XmlElementReader) should actually reset is_end_of_stream or is_end_of_doc.
When I skip this check, so I'm only checking for element is None, then a new XML document is generated for every XML element, so I was getting 3 documents instead of the expected 2.
When I'm reading all GML files in my test data directory (currently 3 files), by setting file_path to tests/data/*.gml in input_glob_file, I'm getting either 6 documents (while checking for packet.is_end_of_stream()) or 9 documents. With 3 files I'm actually expecting 6 documents (3 x 2): a doc with 2 elements followed by a doc with 1 element, three times. However, each document contains only one element, and only for the first 2 GML files. When disabling the aforementioned check, I'm getting 9 docs, each with one element.
So, my question is how packet.is_end_of_stream and packet.is_end_of_doc should actually behave. Should they be reset when one input packet results in multiple output packets for the particular component? Or is there more to it?
I've attached my unit test file. The method test_execute is just a work-in-progress.
test_xml_assembler.zip
Currently the Stetl module doesn't appear to support Python 3; building with Python 3 results in SyntaxErrors:
File "/usr/lib/python3.4/dist-packages/stetl/postgis.py", line 34
except psycopg2.DatabaseError, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/main.py", line 84
print name, data
^
SyntaxError: Missing parentheses in call to 'print'
File "/usr/lib/python3.4/dist-packages/stetl/factory.py", line 26
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/etl.py", line 73
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/utils/apachelog.py", line 183
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/inputs/fileinput.py", line 183
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/inputs/deegreeinput.py", line 158
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/filters/templatingfilter.py", line 173
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/filters/xmlassembler.py", line 68
except Exception, e:
^
SyntaxError: invalid syntax
File "/usr/lib/python3.4/dist-packages/stetl/filters/gmlsplitter.py", line 130
except Exception, e:
^
SyntaxError: invalid syntax
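The fixes for these errors are mechanical: the Python 2 comma-style except clause becomes `except ... as ...`, and the print statement becomes a function call. A small runnable illustration of both changes (the function here is just an example, not Stetl code):

```python
# Python 2 syntax that triggers the SyntaxErrors above:
#     except psycopg2.DatabaseError, e:
#     print name, data
# Python 3 (also valid in Python 2.6+) equivalents:
def parse_int(value):
    try:
        return int(value)
    except ValueError as e:   # 'except X as e' instead of 'except X, e'
        print("cannot parse", value, "-", e)  # print() as a function
        return None

print(parse_int("42"))   # 42
print(parse_int("abc"))  # None (after the warning line)
```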
Templating languages are used extensively in Python web-frameworks like Django and Pylons.
There is an enormous choice in templating technologies, see https://wiki.python.org/moin/Templating,
from very simple parameter substitution to full-Python control. In many cases Templating may be much simpler than XSLT Filtering. Think of INSPIRE GML where 90% of the GML is just "boilerplate" GML with a few variables and constants to be substituted. This is also an experiment but I have good hope this can work for many (INSPIRE-) cases.
Via this issue a foundation is laid to support some very simple templating like the Python built-in string.Template and the popular Jinja2 templating (http://jinja.pocoo.org/).
The basic idea is a TemplatingFilter with a template file or string, with an input (a Jinja2 context) structure passed in from an Input. Output is typically a document, like a GML file, but other setups are possible.
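As an illustration of the simple end of that spectrum, a sketch using the Python built-in string.Template (the GML element content here is made up, and the actual TemplatingFilter API may differ):

```python
from string import Template

# Boilerplate GML with a few variables to substitute; a toy stand-in
# for an INSPIRE-style template (element names are illustrative).
gml_template = Template(
    '<gml:Point srsName="EPSG:$epsg">'
    "<gml:pos>$x $y</gml:pos>"
    "</gml:Point>"
)

doc = gml_template.substitute(epsg=4326, x=5.2, y=52.1)
print(doc)
```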
I've noticed a couple of issues which make Ogr2OgrExecOutput a bit less flexible than necessary. I'm writing them here, because of pending changes in execoutput in PR #75, so these changes can be done once that PR has been closed (either accepted or dismissed). I'm prepared to do these changes myself. For now it would be wise to have a discussion about the proposed changes.
As you might guess, I'm looking for a solution where not only the source can change after each invocation, but also the destination and parameters. Ideally they should be passed in through a record. While this is possible, I think this is a next step in the evolution of this output object.
FileInput and derived classes like StringFileInput can handle lists of files from directory and glob.glob parameters. Still, all file content is read/passed as a single Packet. Also .zip files are handled by a dedicated class ZipFileInput.
It should be possible to generalize FileInput to have derived classes read from files no matter if the files came from directory structures, glob.glob-expanded file lists or .zip files. Even a mixture of these should be handled. For example, within NLExtract https://github.com/nlextract/NLExtract/blob/master/bag/src/bagfilereader.py can handle any file structure provided.
A second aspect is file chunking: a FileInput may split up a single file into Packets containing data structures extracted from that file. For example, FileInputs like XmlElementStreamerFileInput and LineStreamerFileInput open/parse a file but pass file-content (lines, parsed elements) in fine-grained chunks on each read(). Currently these classes implement this fully within their read() function, but the generic pattern is that they maintain a "context" for the open/parsed file.
So all in all this issue addresses two general aspects:
- file-specs: directories, maps, globbing, zip-files and any mix of these
- file chunking: splitting a single file into multiple Packets, as described above
See also issue #49 for additional discussion which led to this issue.
The Strategy Design Pattern may be applied (many refs on the web).
The current Dockerfile for Stetl needs several improvements:
Ad 1) Better is to use a small-sized Python base image, like a python-alpine Linux image.
While working on Stetl I was wondering whether the LCO options are always executed when necessary, as expected. This appears not to be the case in my current version (see PR #28), using the Ogr2OgrExecOutput. The LCO options are only passed once. The problem is that you don't know exactly when these options need to be executed, and I think this is also the issue with the current Ogr2OgrOutput.
For example, when the BRT is loaded (using file chunks), a temporary GML file is being created, which is then being loaded by ogr2ogr. This GML file doesn't necessarily contain all the feature types which can be found in BRT. Ogr2ogr only creates the tables (when loading in PostGIS) for the features occurring in the temporary GML file. So, on subsequent runs of ogr2ogr, new tables can be created, but in those cases the LCO options are not applied anymore.
Integrate https://github.com/mapbox/rasterio or something similar?
Do not issue a fatal exception (this stops the ETL process); just skip the log-record with a warning.
XProc is a 2010 W3C Recommendation: http://www.w3.org/TR/xproc/
It might be interesting to consider using XProc in this context.
Several requests for "a GUI for Stetl" were made. There are two GUIs to consider:
The first would require an FME-like tool to draw Inputs, Filters and Outputs and to connect and parameterize them. The second is easier: manage the execution of a single Stetl config file. This issue addresses only the second/execution GUI.
An initial set of requirements:
Secondary requirements:
Stetl supports reusable ETL configurations via symbolic/substitutable variables like host-names, database credentials etc.
These variables can be substituted via env (arg) files or -a parameters when using the stetl command.
In several Stetl deployments, in particular where Docker (Compose) and Kubernetes (K8s) are used, there is a need to configure these variables via the "Environment". For example the "Secret" store in K8s may store DB-credentials. These variables are usually passed as (Unix/Linux) environment variables. Actual use-cases are currently within the https://github.com/smartemission K8s project.
This is part of the SE migration from Docker plain to Compose and K8s.
Stetl should be able to substitute and override template arg-values from environment variables. This may require some convention in naming, as we don't want to break existing Stetl configs. For example, a Stetl config may use the var-name {hostname} internally; we don't want to substitute non-related env-vars by accident. So possibly Stetl-related env-vars should be prefixed with STETL_ or alike.
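A sketch of how such a prefix convention could work; the helper name and the exact STETL_ handling are assumptions, not implemented behaviour:

```python
import os

def args_from_env(environ, prefix="STETL_"):
    """Collect only prefixed env-vars as substitution args (sketch of the
    proposed convention; the real naming rules are still to be decided)."""
    return {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }

# Only STETL_-prefixed vars become substitutable args; unrelated
# env-vars like PATH are deliberately ignored.
env = {"STETL_HOSTNAME": "db.example.com", "PATH": "/usr/bin"}
args = args_from_env(env)
print(args)  # {'hostname': 'db.example.com'}
```

A config using the var-name {hostname} could then be substituted from args without ever touching non-Stetl environment variables.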
For example, when loading the BGT, the following output is generated:
2018-01-24 21:30:05,904 ETL INFO Substituting 15 args in config file from args_dict: OrderedDict([('__name__', 'asection'), ('input_dir', '/var/nlextract/data/bgt/leveringen/latest'), ('zip_files_pattern', '*.[zZ][iI][pP]'), ('filename_match', '*.gml'), ('temp_dir', 'temp'), ('gfs_template', 'gfs/imgeo-v2.1.1.gfs'), ('host', '****'), ('port', '5432'), ('user', '****'), ('password', '****'), ('database', 'bgt'), ('schema', 'latest'), ('multi_opts', '-fieldTypeToString StringList'), ('spatial_extent', ''), ('max_features', '20000')])
and
2018-01-24 21:30:06,211 output INFO cfg = {'database': 'bgt', 'class': 'outputs.dboutput.PostgresDbOutput', 'host': '****', 'user': '****', 'password': '****', 'port': '5432', 'schema': 'latest'}
I've also masked the host and username.
The reason that this behaviour is undesired is that the logging can be parsed and stored by other tools. For example, with Docker it is common practice to output log data to stdout, where the output can then be processed and stored by components you don't own.
Something based on http://nodered.org/ or similar? Note that Node-RED depends on Node.js!
The current version of ZipFileInput provides a list of all filenames within each .zip archive. In some cases we would like to have a subset or a single name from these lists. For example, the Dutch cadastral parcels dataset contains .zips with both the polyline and polygon versions of the parcels. Some would like to extract only one of these.
Proposed is a configuration option like filename_match to which a regular expression can be provided. Possibly the Python library utility fnmatch can be of help: https://docs.python.org/2/library/fnmatch.html
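A sketch of how such a filename_match option could be applied to a ZIP archive's name list using fnmatch (glob-style patterns; the option name is the proposed one, not yet implemented):

```python
import fnmatch

def select_names(names, filename_match):
    """Keep only archive entries matching the (proposed, hypothetical)
    filename_match pattern, as suggested for ZipFileInput."""
    return fnmatch.filter(names, filename_match)

# Example: keep only the polygon version of the parcels.
names = ["parcels_polygon.gml", "parcels_line.gml", "readme.txt"]
print(select_names(names, "*_polygon.gml"))  # ['parcels_polygon.gml']
```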
lxml iterparse in later versions (e.g. the standard lxml version in Ubuntu 13.10) throws sudden parse exceptions where before a StopIteration exception was thrown and the XML was valid. Known lxml issue, see:
https://bugs.launchpad.net/lxml/+bug/1185701
For now solved by catching the etree.XMLSyntaxError exception in XmlElementStreamerFileInput.read() together with existing StopIteration.
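A sketch of that catch-both pattern, using the stdlib xml.etree.ElementTree as a stand-in for lxml (lxml's equivalent exception is etree.XMLSyntaxError; function names here are illustrative, not the actual Stetl code):

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

def stream_elements(fileobj, tag):
    """Yield matching elements; treat a late parser error the same as
    a normal end-of-stream, mirroring the fix described above."""
    context = ET.iterparse(fileobj, events=("end",))
    while True:
        try:
            _event, elem = next(context)
        except StopIteration:
            break  # normal end of document
        except ET.ParseError:
            break  # newer parsers may raise here instead: treat as EOF
        if elem.tag == tag:
            yield elem

doc = io.BytesIO(b"<root><a/><a/></root>")
print(len(list(stream_elements(doc, "a"))))  # 2
```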
Currently GML splitting is line-based. This is tricky as it relies on an EOL after each element. Better is to use (lxml) streaming parsing.
Components (Inputs, Outputs, Filters) are configured in the .ini files with specific attributes. However, these attributes are not documented.
It seems hard to do this via docstrings and Sphinx autodoc. An idea is to supply this info via the command line: stetl --doc stetl.inputs.fileinput.StringFileInput.
Currently there is only OgrPostgisInput, but like the OgrOutput we should have a more generic OgrInput Component.
Allow GeoJSON geometries to be converted to GML geometries. Uses the ogr Python bindings. An example template:
{% for feature in features %}
<gml:featureMember>
<cities:City>
<cities:name>{{ feature.properties.CITY_NAME }}</cities:name>
<cities:geometry>
{{ feature.geometry | geojson2gml(crs=crs, gml_format='GML3', gml_longsrs='YES') }}
</cities:geometry>
</cities:City>
</gml:featureMember>
{% endfor %}
flake8 is a command-line utility for enforcing style consistency across Python projects.
There are still quite some errors in Stetl that need to be fixed:
./stetl/util.py:137:5: C901 'Util.elem_to_dict' is too complex (37)
./stetl/util.py:197:41: E721 do not compare types, use 'isinstance()'
./stetl/util.py:336:9: E722 do not use bare 'except'
./stetl/util.py:349:1: E722 do not use bare 'except'
./stetl/util.py:350:5: F401 'StringIO.StringIO' imported but unused
./stetl/util.py:354:1: C901 'TryExcept 354' is too complex (11)
./stetl/util.py:414:32: W601 .has_key() is deprecated, use 'in'
./stetl/filters/templatingfilter.py:166:5: C901 'Jinja2TemplatingFilter.create_template' is too complex (11)
./stetl/filters/xmlassembler.py:73:29: F841 local variable 'e' is assigned to but never used
./stetl/filters/xmlelementreader.py:78:5: C901 'XmlElementReader.process_xml' is too complex (11)
./stetl/filters/xmlelementreader.py:79:15: E714 test for object identity should be 'is not'
./stetl/filters/zipfileextractor.py:40:9: F841 local variable 'event' is assigned to but never used
./stetl/filters/zipfileextractor.py:46:9: F401 'os' imported but unused
./stetl/inputs/deegreeinput.py:49:5: C901 'DeegreeBlobstoreInput.read' is too complex (15)
./stetl/inputs/fileinput.py:153:29: F841 local variable 'e' is assigned to but never used
./stetl/inputs/fileinput.py:195:5: C901 'XmlElementStreamerFileInput.read' is too complex (12)
./stetl/inputs/fileinput.py:354:161: E501 line too long (175 > 160 characters)
./stetl/inputs/fileinput.py:382:29: F841 local variable 'e' is assigned to but never used
./stetl/inputs/fileinput.py:436:33: E251 unexpected spaces around keyword / parameter equals
./stetl/inputs/fileinput.py:437:5: E128 continuation line under-indented for visual indent
./stetl/inputs/fileinput.py:438:91: E203 whitespace before ','
./stetl/inputs/fileinput.py:562:1: W293 blank line contains whitespace
./stetl/inputs/httpinput.py:106:1: W293 blank line contains whitespace
./stetl/inputs/httpinput.py:276:1: W293 blank line contains whitespace
./stetl/outputs/dboutput.py:147:9: E122 continuation line missing indentation or outdented
./stetl/outputs/deegreeoutput.py:64:5: C901 'DeegreeBlobstoreOutput.write' is too complex (11)
./stetl/outputs/deegreeoutput.py:80:35: W601 .has_key() is deprecated, use 'in'
./stetl/outputs/deegreeoutput.py:83:39: W601 .has_key() is deprecated, use 'in'
./stetl/outputs/deegreeoutput.py:91:13: F841 local variable 'ogrGeomWKT' is assigned to but never used
./stetl/outputs/deegreeoutput.py:94:161: E501 line too long (174 > 160 characters)
./stetl/outputs/deegreeoutput.py:158:9: F841 local variable 'gml_doc' is assigned to but never used
./stetl/outputs/deegreeoutput.py:176:9: F841 local variable 'result' is assigned to but never used
./stetl/outputs/ogroutput.py:120:5: C901 'OgrOutput.init' is too complex (16)
./stetl/utils/apachelog.py:156:42: E701 multiple statements on one line (colon)
./stetl/utils/apachelog.py:186:5: C901 'parser.parse' is too complex (20)
./stetl/utils/apachelog.py:204:25: E722 do not use bare 'except'
./stetl/utils/apachelog.py:209:25: E722 do not use bare 'except'
./stetl/utils/apachelog.py:225:25: E722 do not use bare 'except'
./stetl/utils/apachelog.py:331:1: C901 'If 331' is too complex (11)
./stetl/utils/apachelog.py:335:5: E303 too many blank lines (2)
./stetl/utils/apachelog.py:434:84: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:435:84: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:443:79: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:449:78: E502 the backslash is redundant between brackets
./stetl/utils/apachelog.py:470:5: E303 too many blank lines (2)
Once fixed, the flake8 command can be added to the .travis file.
Currently when a Stetl ETL Chain is invoked it prints out the number of "rounds" (number of times a Stetl Chain is invoked) and the total processing (ETL) time.
Often more detailed performance metrics are required in order to track down performance bottlenecks.
This issue proposes very simple and minimal stats to be printed per Stetl Component object:
- the number of times the Component is invoked (invokes)
- total time spent, plus min/max/average time per invoke (total/invokes)
As Stetl always has control of Component invocation, component.py seems to be the best option to collect timing stats.
A simple print line will do like:
2018-06-21 13:41:13,488 component INFO RefineFilter invokes=144 time(total, min, max, avg) = 0.150 0.001 0.072 0.001
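A minimal sketch of how such per-Component stats could be collected and printed (class and method names are illustrative, not the actual component.py code):

```python
import time

class TimingStats:
    """Minimal per-Component timing collector (sketch only)."""
    def __init__(self, name):
        self.name = name
        self.invokes = 0
        self.total = 0.0
        self.min = float("inf")
        self.max = 0.0

    def record(self, elapsed):
        self.invokes += 1
        self.total += elapsed
        self.min = min(self.min, elapsed)
        self.max = max(self.max, elapsed)

    def report(self):
        avg = self.total / self.invokes if self.invokes else 0.0
        return "%s invokes=%d time(total, min, max, avg) = %.3f %.3f %.3f %.3f" % (
            self.name, self.invokes, self.total, self.min, self.max, avg)

stats = TimingStats("RefineFilter")
for _ in range(3):
    start = time.perf_counter()
    sum(range(1000))  # stand-in for the component's invoke() call
    stats.record(time.perf_counter() - start)
print(stats.report())
```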
Provide an Input similar to PostgresDbInput: fetch data as record_arrays from a SQLite DB via a configured query. Specialized cases may query something like the last N inserted records.
May refactor common functionality of PostgresDbInput and SqliteDbInput into a common base class SqlDbInput.
There is a need for structured/record-based data of Apache logfiles. All kinds of analysis and statistics can be performed, for example when log-records are stored in a (spatial) database. Think of statistics for tiling services: which areas are requested the most? Then these areas could be pre-tiled in more resolutions. But also user-statistics like IP-addresses and HTTP-referrers. Performance degradation could be monitored, etc.
For these kinds of ETL an ApacheLogFileInput Stetl Component should be developed. Apache logfiles can have multiple formats, driven by an expression like:
'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
This should be taken into account. If possible, existing Open Source GPL implementations should be used, like: https://apachelog.googlecode.com.
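For illustration, a simplified regex-based parse of the Combined Log Format shown above (apachelog itself generates such a regex from the format string; this hand-written one is a sketch, not the library's code):

```python
import re

# Regex for the Combined Log Format shown above (simplified sketch).
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /tiles/3/4/2.png HTTP/1.0" 200 2326 '
        '"http://example.com/map" "Mozilla/5.0"')

record = COMBINED.match(line).groupdict()
print(record["host"], record["status"], record["request"])
```

Such records can then be loaded into a (spatial) database for the tiling and user statistics mentioned above.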
Docker is currently one of the easiest ways to deploy any service or program. This issue should Dockerize Stetl: provide Docker support for deploying Stetl.
What needs to be done:
Successful experience with a Dockerized Stetl was gained in the SmartEmission project:
https://github.com/Geonovum/smartemission/tree/master/docker/stetl and usage
https://github.com/Geonovum/smartemission/tree/master/etl
The SE project could in time consume the official Stetl Docker image coming out of the Stetl project.
Like the OGR Input, a generic output component that uses the GDAL/OGR Python SWIG wrappers to Open/Create and write to any OGR datasource. Quite some parameters are required. It is puzzling what the input format should be: ogr_feature?
Provide Stetl with good unit test coverage.
Also see nlextract/NLExtract#193
In addition to Splitting, implemented via issue #35, there is a need for Combining/Merging at least Inputs. A use-case is within the Smart Emission project: here we need to collect (harvest) data from multiple remote HTTP REST APIs, see smartemission/smartemission#61.
This could be implemented by allowing an Input to collect from multiple HTTP endpoints, but this would require specific implementations for each Input type.
The basic idea is to use the notation also used for Splitting via issue #35; for example, to merge two inputs input1 and input2 into a single filter and output, the following Chain would be defined:
(input1)(input2) | filter | output
This would be the most common use-case. Additional cases could be applied with sub-Chaining, for example:
(input1 | filter1) (input2 | filter2) | filter | output
Dependent on the ease of implementation, the latter cases may be included or else be deferred to a separate issue.
LineStreamerFileInput will stream Packets, line by line, from a text file. Used in the Geonovum Sensors platform to read Records spread over multiple lines in the Smart Emission project. See also Geonovum/sospilot#21.
For example a single record could be like this:
07/24/2015 07:26:12,P.UnitSerialnumber,1
07/24/2015 07:26:12,S.Longitude,5914103
07/24/2015 07:26:12,S.Latitude,53949942
07/24/2015 07:26:12,S.SatInfo,90889
07/24/2015 07:26:12,S.O3,161
07/24/2015 07:26:12,S.BottomSwitches,0
07/24/2015 07:26:12,S.RGBColor,16772501
07/24/2015 07:26:12,S.LightsensorBlue,91
07/24/2015 07:26:12,S.LightsensorGreen,144
07/24/2015 07:26:12,S.LightsensorRed,155
07/24/2015 07:26:12,S.AcceleroZ,755
07/24/2015 07:26:12,S.AcceleroY,510
07/24/2015 07:26:12,S.AcceleroX,512
07/24/2015 07:26:12,S.NO2,91
07/24/2015 07:26:12,S.CO,32392
07/24/2015 07:26:12,S.Altimeter,118
07/24/2015 07:26:12,S.Barometer,101096
07/24/2015 07:26:12,S.LightsensorBottom,26
07/24/2015 07:26:12,S.LightsensorTop,224
07/24/2015 07:26:12,S.Humidity,48526
07/24/2015 07:26:12,S.TemperatureAmbient,299425
07/24/2015 07:26:12,S.TemperatureUnit,305400
07/24/2015 07:26:12,S.SecondOfDay,34016
07/24/2015 07:26:12,S.RtcDate,1012101
07/24/2015 07:26:12,S.RtcTime,596536
07/24/2015 07:26:12,P.SessionUptime,60811
07/24/2015 07:26:12,P.BaseTimer,9
07/24/2015 07:26:12,P.ErrorStatus,0
07/24/2015 07:26:12,P.Powerstate,79
07/24/2015 07:26:12,P.UnitSerialnumber,1 # new record etc
etc
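Downstream of such a line streamer, the multi-line records could be reassembled roughly like this (a sketch; the start-of-record marker P.UnitSerialnumber is taken from the sample above):

```python
def assemble_records(lines, start_field="P.UnitSerialnumber"):
    """Group 'timestamp,name,value' lines into record dicts; a new
    record starts whenever the start_field name comes by."""
    records, current = [], None
    for line in lines:
        timestamp, name, value = line.split(",")
        if name == start_field:
            if current is not None:
                records.append(current)
            current = {"time": timestamp}
        if current is not None:
            current[name] = value
    if current is not None:
        records.append(current)
    return records

lines = [
    "07/24/2015 07:26:12,P.UnitSerialnumber,1",
    "07/24/2015 07:26:12,S.O3,161",
    "07/24/2015 07:26:12,S.NO2,91",
    "07/24/2015 07:27:12,P.UnitSerialnumber,1",
    "07/24/2015 07:27:12,S.O3,163",
]
records = assemble_records(lines)
print(len(records))        # 2
print(records[0]["S.O3"])  # 161
```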
This issue is to get Travis working for Stetl. An initial travis.yml has been constructed but needs expansion:
- run the nose2 tests (and how is a test failure notified?)
- run flake8
And possibly more. This issue is to identify Travis config work. The (nose2) tests themselves are in other issues like #50 and #52.
Also: GDAL2 support could not be realized easily: the UbuntuGIS PPA seems to be blocked by Travis...
There are cases where we would like to split a Stetl ETL Chain: for example to publish converted data to both a database and a CSV file, or to one or multiple web API(s). Most ETL frameworks provide a Splitter (and a Combiner or Merger, which is also handy but trickier and less required).
Implementation considerations
This could be built into the internal Stetl base classes. This would require a change in the .ini-file notation, examples:
The simplest case is Output splitting, using the () notation as we may want to split into sub-Chains:
input | filter | (output1) (output2)
Splitting into sub-Chains at Filter-level:
input1 | filter1 | (filter2a | output1a) (filter2b | output1b)
to split the output of filter1 into sub-Chains filter2a | output1a and filter2b | output1b.
Combining could use similar notation:
(input1)(input2) | filter | output
Or even splitting + combining:
(input1)(input2) | filter | (filter2a | output1a) (filter2b | output1b)
Another option is a specialized Filter or Output, in the latter case something like a CompositeOutput which is parameterized with a list of Outputs that it calls upon. Disadvantage is that the Chain configuration is hidden in the Composite Component's config.
In first instance, simple Chain splitting will be provided with this issue. A Combining implementation will be done in a separate issue.
Currently only one -a argument can be passed, to set either a list of options or a single options (.args) file. Allowing multiple -a arguments allows for simpler overriding of, for example, default options. For example:
stetl -c my.cfg -a default.args -a my.args
or
stetl -c my.cfg -a default.args -a db_host=host -a db_user=me -a db_password=xyz
This allows keeping all default args in default.args and overriding just a few options, like passwords, in my.args or via explicit settings.
The order of the -a args will determine the overriding order. Args passed via the Environment, like stetl_db_password, will still prevail over any command-line args.
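The intended override order can be sketched as a simple ordered merge (illustrative only; the real stetl argument handling may differ):

```python
def merge_args(arg_sources, env_args=None):
    """Merge -a option dicts left-to-right (later overrides earlier);
    environment-supplied args prevail over all of them. Sketch only."""
    merged = {}
    for source in arg_sources:
        merged.update(source)
    merged.update(env_args or {})
    return merged

# Mimics: stetl -c my.cfg -a default.args -a my.args
default_args = {"db_host": "localhost", "db_user": "stetl"}
my_args = {"db_host": "db.example.com"}
result = merge_args([default_args, my_args],
                    env_args={"db_password": "xyz"})
print(result)
```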
Connections between Stetl components need to have compatible input/output FORMATs. Often we would like to be able to convert in order to connect and reuse. The easiest seems to be a generic FormatConverterFilter Component that can be placed in between incompatible i/o's. The FCF can call upon specific converters, and be extensible for custom user-defined FORMATs as in issue #12.
In particular httpoutput uses httplib, which relies on urllib and only supports HTTP/1.0, which causes problems in some setups.
Better is to use the requests package, which also has a simpler programming model.
Fiona https://github.com/Toblerity/Fiona is a simple but powerful library to access (read/write) OGR sources. Fiona will fit very nicely into Stetl in at least three aspects:
On July 13, 2016 the GH repo https://github.com/justb4/stetl will be moved to https://github.com/geopython/stetl the GeoPython organization on GH. This issue is a checklist of TODOs for this transfer. Many thanks to Tom Kralidis for support. Items in Italic are done.
git remote set-url origin https://github.com/geopython/stetl in the local project
Stetl currently has its configuration delivered in two different ways. A config file is delivered through the command line, but many options are also passed at the command line, through the -a parameter.
For example with TOP10extract (NLExtract):
python $STETL_HOME/stetl/main.py -c etl-top10nl.cfg -a "$pg_options temp_dir=temp max_features=$max_features gml_files=$gml_files $multi $spatial_extent"
The disadvantage of this approach is that the options are generated with shell scripting. This hampers the portability of Stetl to other platforms, like Windows. (As far as I can tell, there are no other issues, although I've currently run Stetl only through MinGW MSYS.)
Of course there is still a use case for having Stetl accept the -a command line option, but in many cases the options which are passed through -a do not change.
A native OgrInput that will use the Python SWIG wrappers from GDAL/OGR to Open (and Close) any OGR data source. Produced FORMAT can be ogr_feature and/or possibly ogr_layer or ogr_feature_array. Converters should be able to convert to a Fiona-like GeoJSON Python data structure.
When importing data from a ZIP file, for example the Dutch dataset Bestuurlijke Grenzen, which is a ZIP file containing GML files, the current approach is to use a ZipFileInput in combination with a ZipFileExtractor. The latter extracts the ZIP file to a temporary directory. However, GDAL/OGR also has support for "virtual file systems". One of them is a filter named vsizip. When the string "/vsizip/" is prepended to the input path, OGR can directly read data from a ZIP file without unzipping. When a dataset is large (for example the Dutch BGT) and you don't have much disk space left, this way you don't need to unzip the individual files (even though the unzipping can be done one by one).
For this reason I've added a new filter named VsiZipFilter, and an abstract base class called VsiFilter, so other virtual file system filters can eventually be added in the future (for example vsicurl). These filters can even be chained, which is also true for their Stetl counterparts.
I will submit a PR when the Python 3 migration is done.
One note though: I've manually disabled the creation of GFS files when importing GML files through a VSI filter. The current approach, which is generating a GFS file next to the GML file (when it is unzipped), should be redesigned. There was an issue that the provided GFS template is ignored; this should be solved. Perhaps by passing GML_GFS_TEMPLATE in a different way (either as -lco, -config, or -oo), or by making a copy with a newer timestamp than the GML file (which is why the current approach works) and passing the name via GML_GFS_TEMPLATE, instead of generating it in the same location as the (unzipped) GML file.
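For reference, the /vsizip/ path layout OGR expects can be built like this (a trivial helper for illustration, not part of Stetl; the file names are made up):

```python
def vsizip_path(zip_path, inner_path=None):
    """Build a GDAL/OGR virtual path for reading inside a .zip archive
    without unzipping (path layout per GDAL's /vsizip/ convention)."""
    path = "/vsizip/" + zip_path
    if inner_path:
        path += "/" + inner_path
    return path

print(vsizip_path("data/bestuurlijke_grenzen.zip", "gemeenten.gml"))
# /vsizip/data/bestuurlijke_grenzen.zip/gemeenten.gml
```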
The current implementation of outputs.dboutput.PostgresInsertOutput uses DELETE (by key) followed by INSERT to optionally replace existing records (with the same key). This is not really a true replacement, as e.g. a new gid may be created and auto-incremented. Also in some cases the sequence of gids may contain "holes". Better is to use UPDATE (or even UPSERT in PG10). In the Smart Emission ETL we were successful with the following addition of UPDATE, lazily creating a template UPDATE first, similar to INSERT:
def create_update_query(self, record):
    # We assume that all records do the same UPDATE key/values
    # https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/6527838#6527838
    # e.g. UPDATE table SET field='C', field2='Z' WHERE id=3;
    query = "UPDATE %s SET (%s) = (%s) WHERE %s = %s" % (
        self.cfg.get('table'),
        ",".join(['%s ' % k for k in record]),
        ",".join(["%s"] * len(record.keys())),
        self.key,
        "%s")
    log.info('update query is %s', query)
    return query

def insert(self, record):
    res = 0
    if self.replace and self.key and self.key in record:
        # Replace option: try UPDATE if existing
        # https://stackoverflow.com/questions/1109061/insert-on-duplicate-update-in-postgresql/6527838#6527838
        values = list(record.values())  # list() needed under Python 3
        values.append(record[self.key])
        res = self.db.execute(self.update_query, values)
        # del_query = "DELETE FROM %s WHERE %s = '%s'" % (self.cfg.get('table'), self.key, record[self.key])
        # res = self.db.execute(del_query)

    if res < 1:
        # Do insert with values from the record dict,
        # only if we did not UPDATE an existing record.
        self.db.execute(self.query, list(record.values()))
    self.db.commit(close=False)
Now only specific formats are supported. It should be possible to programmatically extend formats, e.g. via FORMAT.add().
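A sketch of what a programmatically extensible FORMAT registry could look like (hypothetical API; the built-in format names listed here are illustrative, not the actual Stetl FORMAT constants):

```python
class FORMAT:
    """Sketch of a programmatically extensible format registry."""
    _formats = {"etree_doc", "record", "string"}  # names illustrative

    @classmethod
    def add(cls, name):
        """Register a custom, user-defined format name."""
        cls._formats.add(name)

    @classmethod
    def is_known(cls, name):
        return name in cls._formats

FORMAT.add("my_custom_format")
print(FORMAT.is_known("my_custom_format"))  # True
```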
Since GDAL/OGR version 2.0 is out for a while, is it a good idea to switch to this as the minimum version of Stetl? I haven't looked into it yet, but will soon. I'm just starting this discussion, because the handling of GFS files while reading GML seems to be somewhat buggy in version 1.11. See pull request #31. I'm also not very happy with how layer creation options are working right now (see issue #30), but I've no idea whether improvements have been made regarding this in version 2.0.
One of the minimal criteria to switch to the new version is that it is easily available on the most important platforms, i.e. Linux, Mac and Windows.
See http://www.osgeo.org/node/1591 for more information.
Stetl is an ideal application to be made multithreaded. Most of the time it is processing datasets which consist of multiple files, and it is run in a (server or desktop) environment where multiple processors or cores are available.
See also nlextract/NLExtract#194
Fiona https://github.com/Toblerity/Fiona is a really nice framework to interact with OGR inputs and outputs. It also takes a Pythonistic approach in creating lightweight datastructures for Features (Python built-in types).
Integrating Fiona into Stetl should not be too hard:
Within Filters we may apply any other programming, like with Shapely.
Several use cases arose where we need to filter out, i.e. "sieve" or "pass through", data (Packets) based on the content of their data. For example, particular Records only need to be passed through (or discarded) based on the value(s) of an attribute.
A particular case is within the Smart Emission ETL, where within a Refiner we need to write all transformed timeseries data records to PostgreSQL, but only a subset (for the gases CO, CO2, NO2, O3) to InfluxDB, while still using all generic Stetl Components. One way to achieve this is to split the RefineFilter results and prefix the InfluxDBOutput with a RecordSieve Filter that lets only records pass whose component attribute value matches these gases.
One can also think of geospatial sieving scenarios where one filters out a particular geospatial area, or go as far as having WFS/SLD Filter-like expressions.
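The proposed RecordSieve behaviour boils down to something like this (a sketch; the function and attribute names are hypothetical, taken from the use-case above):

```python
GASES = {"CO", "CO2", "NO2", "O3"}

def record_sieve(records, attr, allowed):
    """Pass through only records whose attr value is in allowed
    (sketch of the proposed RecordSieve Filter)."""
    return [r for r in records if r.get(attr) in allowed]

records = [
    {"component": "NO2", "value": 41},
    {"component": "Temperature", "value": 19},
    {"component": "O3", "value": 88},
]
# Only the gas records would reach the InfluxDBOutput.
print(record_sieve(records, "component", GASES))
```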
When setting depth_search=False for a file input, files which are in a subdirectory and match the glob pattern are still being found. The actual issue is in Util.make_file_list.
I have a fix ready. I will do a PR when the Python 3 migration is done.
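The intended behaviour can be sketched with glob's recursive flag (this illustrates the desired semantics only, not the actual fix in Util.make_file_list):

```python
import glob
import os
import tempfile

def make_file_list(path_pattern, depth_search=False):
    """Sketch of the intended behaviour: with depth_search=False,
    subdirectory matches must be excluded. (Descending with
    depth_search=True would additionally need a '**' pattern.)"""
    return sorted(glob.glob(path_pattern, recursive=depth_search))

# Demonstrate with a throwaway layout: a.gml in the root,
# b.gml in a subdirectory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for name in ("a.gml", os.path.join("sub", "b.gml")):
    open(os.path.join(root, name), "w").close()

flat = make_file_list(os.path.join(root, "*.gml"))
print([os.path.basename(p) for p in flat])  # ['a.gml'] -- no sub/b.gml
```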