Tool for extracting tabular data from files on the web.
Supports fetching data through HTTP or FTP.
The program reads the file specified in the configuration, filters its rows by the given column values, keeps only the columns listed in the configuration, and writes the extracted data as a CSV file to the output folder.
You can download the latest release here.
- Provide a configuration in the config.json file. Refer to the configuration section for information on how to configure the program.
- Run the executable.
- Use the arrow keys to choose the action.
- The result can be found in the ./output folder.
A config.json file is required to run the extractor. Multiple configurations may be included in the config file in the following form:
[
  {
    "title": "Example configuration",
    "url": "http://www.example.com"
  }
]
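As a rough illustration of the list-of-configurations layout, the following Python sketch parses such a file and lists the titles a user could choose from. The function name `load_configurations` and the second sample entry are hypothetical and not taken from the extractor itself.

```python
import json


def load_configurations(text):
    """Parse the contents of config.json and return the list of configurations."""
    configs = json.loads(text)
    # The file is expected to hold a JSON array, one object per configuration.
    if not isinstance(configs, list):
        raise ValueError("config.json must contain a list of configurations")
    return configs


# Hypothetical file contents with two configurations.
sample = """
[
  {"title": "Example configuration", "url": "http://www.example.com"},
  {"title": "Second configuration", "url": "http://www.example.com/other.csv"}
]
"""

titles = [config["title"] for config in load_configurations(sample)]
print(titles)
```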
Full list of available configuration options below.
This is the minimal information that has to be provided in order for the program to work. Depending on the source of the data, you may have to provide authentication and/or encoding details.
The title of the configuration.
Example:
"title": "Sample Title"
The url (http) or host (ftp) of the configuration.
For http: use the whole path, including file name and extension.
Example:
"url": "http://www.example.com/example.csv"
You can provide a placeholder for the current date in the url between curly braces. This will be replaced by today's date in the format you specify in the dateFormat option.
Example:
"url": "http://www.example.com/file_from_today_{date}.csv"
For ftp: use only the hostname, without file name and extension. The file name is provided in a separate option.
Example:
"host": "192.168.0.1"
When using the FTP protocol, the name of the file to be extracted is also required.
Example:
"filename": "example.csv"
These settings are optional, but authentication and/or encoding details may have to be provided for certain sources.
Format of the date to be embedded in the url (HTTP only). dateFormat has to be provided using Python's strftime directives. Refer to this list of available directives.
Example:
"dateFormat": "%Y%m%d"
In this example, the date 25 March 2019 will be formatted as 20190325.
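The placeholder substitution can be sketched in a few lines of Python. The helper name `expand_url` and its `today` parameter are hypothetical; the sketch assumes the placeholder is written literally as `{date}`, as in the url example above, and may differ from how the extractor does it internally.

```python
from datetime import date


def expand_url(url, date_format, today=None):
    """Replace the {date} placeholder with a date formatted via strftime."""
    # Fall back to the current date when no explicit date is passed in.
    today = today or date.today()
    return url.replace("{date}", today.strftime(date_format))


# A fixed date is used here so the result is reproducible.
expanded = expand_url(
    "http://www.example.com/file_from_today_{date}.csv",
    "%Y%m%d",
    today=date(2020, 1, 5),
)
print(expanded)  # http://www.example.com/file_from_today_20200105.csv
```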
Basic authentication details including username and password.
Example:
"authentication": {
  "username": "exampleUser",
  "password": "examplePassword"
}
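For an HTTP source, basic authentication normally means sending an Authorization header built from these two values. The sketch below shows that standard encoding using only the Python standard library. The helper name `basic_auth_header` is hypothetical, and whether the extractor builds the header this way or delegates to an HTTP library is not stated here.

```python
import base64


def basic_auth_header(authentication):
    """Build an HTTP Basic Authorization header from an "authentication" object."""
    credentials = f"{authentication['username']}:{authentication['password']}"
    # HTTP Basic auth sends "username:password" encoded as base64.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header(
    {"username": "exampleUser", "password": "examplePassword"}
)
print(header["Authorization"].startswith("Basic "))
```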
The text encoding of the file to be extracted. Defaults to utf8.
Example:
"encoding": "iso-8859-1"
The character used as separator in the file. Use \t for tabs. Defaults to ;.
Example:
"separator": "\t"
The columns and values by which to filter in the form of an object with the name of the column as the key and a list of strings to filter by as value.
Example:
"filters": {
  "column1": [
    "value1",
    "value2"
  ],
  "column2": [
    "value1",
    "value2"
  ]
}
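One way to read the filters object is sketched below. It assumes a row is kept only when every filtered column holds one of its listed values, so the columns are combined with AND. The text does not state how multiple columns are combined, so treat that as an assumption; the helper name `apply_filters` is hypothetical as well.

```python
def apply_filters(rows, filters):
    """Keep rows whose value in every filtered column is one of the allowed values."""
    return [
        row
        for row in rows
        # Assumption: all filtered columns must match (logical AND).
        if all(row.get(column) in allowed for column, allowed in filters.items())
    ]


rows = [
    {"column1": "value1", "column2": "value2"},
    {"column1": "value1", "column2": "other"},
    {"column1": "other", "column2": "value1"},
]
filters = {"column1": ["value1", "value2"], "column2": ["value1", "value2"]}

filtered = apply_filters(rows, filters)
print(filtered)  # [{'column1': 'value1', 'column2': 'value2'}]
```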
A list of columns to be included in the result. All columns will be included if left blank.
Example:
"columns": ["column1", "column3"]
A list of aliases for the included columns, in the same order as the columns. If no aliases are provided, the columns will keep the names from the original file.
Example:
"aliases": ["renamedColumn1", "renamedColumn2"]