web_extractor's Introduction

Web Data Extractor

Tool for extracting tabular data from files on the web.

Supports fetching data through HTTP or FTP.

The program reads the file specified in the configuration, filters its rows by the values of given columns, keeps only the columns listed in the configuration, and writes the extracted data as a CSV file to the output folder.

Download

You can download the latest release here.

Running the program

  1. Provide a configuration in the config.json file. Refer to the configuration section for information on how to configure the program.
  2. Run the executable.
  3. Use the arrow keys to choose the action.
  4. The result can be found in the ./output folder.

Configuration

A config.json file is required to run the extractor. Multiple configurations may be included in the config file in the following form:

[
    {
        "title": "Example configuration",
        "url": "http://www.example.com"
    }
]
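
As an illustration of this layout, the following minimal Python sketch parses a config text holding two configurations. The function name load_configurations, the inline sample text and the second configuration are assumptions made for the example; they are not taken from the program's source.

```python
import json

# Hypothetical content of a config.json file with two configurations.
CONFIG_TEXT = """
[
    {"title": "Example configuration", "url": "http://www.example.com/a.csv"},
    {"title": "Second configuration", "host": "192.168.0.1", "filename": "b.csv"}
]
"""


def load_configurations(text):
    """Parse the config text and return the list of configuration objects."""
    configs = json.loads(text)
    if not isinstance(configs, list):
        raise ValueError("config.json must contain a list of configurations")
    return configs


configs = load_configurations(CONFIG_TEXT)
titles = [config["title"] for config in configs]
```

Each element of the list is an independent configuration, so a single file can describe several sources.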

The full list of available configuration options is given below.


Required Settings

This is the minimal information that has to be provided for the program to work. Depending on the source of the data, you may also have to provide authentication and/or encoding details.

title

The title of the configuration.

Example:

"title": "Sample Title"

url or host

The url (HTTP) or host (FTP) of the data source.

For HTTP: use the whole path, including file name and extension.

Example:

"url": "http://www.example.com/example.csv"

You can put a placeholder for the current date, written between curly braces, in the url. It is replaced by today's date in the format you specify in the dateFormat option.

Example:

"url": "http://www.example.com/file_from_today_{date}.csv"

For FTP: use only the hostname, without file name and extension. The file name is provided in a separate option.

Example:

"host": "192.168.0.1"

filename (FTP)

When using the FTP protocol, the name of the file to be extracted is also required.

Example:

"filename": "example.csv"

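The required settings can be summarised as a small check. This is a minimal sketch, assuming a configuration is valid when it has a title and either a url (HTTP) or a host together with a filename (FTP). The function name missing_required is hypothetical and the real program may validate differently.

```python
def missing_required(config):
    """Return the names of required settings that are missing (empty if none)."""
    problems = []
    if not config.get("title"):
        problems.append("title")
    if config.get("url"):
        # HTTP source: the url already contains the file name.
        return problems
    if not config.get("host"):
        problems.append("url or host")
    elif not config.get("filename"):
        # FTP source: the file name is a separate, required option.
        problems.append("filename")
    return problems


http_ok = missing_required({"title": "t", "url": "http://www.example.com/example.csv"})
ftp_incomplete = missing_required({"title": "t", "host": "192.168.0.1"})
```

Here http_ok is an empty list, while ftp_incomplete reports the missing filename.
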
Optional Settings

These settings are optional, but authentication and/or encoding details may have to be provided for certain sources.

dateFormat

Format of the date to be embedded in the url (HTTP only). dateFormat has to be given using Python's strftime directives. Refer to the list of available directives in the Python documentation.

Example:

"dateFormat": "%Y%m%d"

In this example, the date 25 March 2019 will be formatted as 20190325.
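
The substitution can be sketched in Python as follows. This assumes the placeholder is the literal text {date}, as in the url example. The function name expand_date and the today parameter are hypothetical and exist only to make the example reproducible.

```python
from datetime import date


def expand_date(url, date_format, today=None):
    """Replace the {date} placeholder with today's date in the given format."""
    today = today or date.today()
    return url.replace("{date}", today.strftime(date_format))


expanded = expand_date(
    "http://www.example.com/file_from_today_{date}.csv",
    "%Y%m%d",
    today=date(2019, 3, 25),  # fixed date so the result is predictable
)
# expanded == "http://www.example.com/file_from_today_20190325.csv"
```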


authentication

Basic authentication details including username and password.

Example:

"authentication": {
    "username": "exampleUser",
    "password": "examplePassword"
}
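
For an HTTP source, credentials like these are typically sent as a Basic Authorization header. The sketch below shows how such a header is built from the authentication object. It is an assumption about how the credentials are used, not a description of the program's internals, and the function name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(authentication):
    """Build an HTTP Basic Authorization header from username and password."""
    credentials = f'{authentication["username"]}:{authentication["password"]}'
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


header = basic_auth_header(
    {"username": "exampleUser", "password": "examplePassword"}
)
```

Basic authentication only encodes the credentials and does not encrypt them, so an https url is preferable where the source offers one.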

encoding

The text encoding of the file to be extracted. Defaults to utf8.

Example:

"encoding": "iso-8859-1"

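A minimal Python sketch of how the option could be applied to downloaded bytes, falling back to utf8 when no encoding is configured. The function name decode_content and the sample data are assumptions made for the example.

```python
DEFAULT_ENCODING = "utf8"


def decode_content(raw, config):
    """Decode raw file bytes using the configured encoding, or utf8 by default."""
    return raw.decode(config.get("encoding", DEFAULT_ENCODING))


# Sample bytes as they would arrive from an iso-8859-1 encoded file.
raw = "Straße;1".encode("iso-8859-1")
text = decode_content(raw, {"encoding": "iso-8859-1"})
```

Decoding the same bytes as utf8 would raise a UnicodeDecodeError, which is the usual sign that this option needs to be set.
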
separator

The character used as separator in the file. Use \t for tabs. Defaults to ;.

Example:

"separator": "\t"

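To show the effect of the option, here is a short sketch that parses tab-separated text with Python's csv module and falls back to ; when no separator is set. The function name read_rows and the sample data are assumptions; the program itself may parse files differently.

```python
import csv
import io


def read_rows(text, config):
    """Parse delimited text into a list of dicts keyed by the header row."""
    separator = config.get("separator", ";")
    return list(csv.DictReader(io.StringIO(text), delimiter=separator))


# In JSON, "\t" denotes a tab character, just as in this Python string.
rows = read_rows("name\tcity\nAnna\tOslo\n", {"separator": "\t"})
```
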
filters

The columns and values to filter by, given as an object in which each key is a column name and each value is a list of strings to filter by.

Example:

"filters": {
    "column1": [
        "value1",
        "value2"
    ],
    "column2": [
        "value1",
        "value2"
    ]
}
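
The following sketch shows one plausible reading of this option: a row is kept only if, for every filtered column, its value exactly equals one of the listed strings. Both the exact matching and the combination of columns with a logical AND are assumptions, as is the function name apply_filters.

```python
def apply_filters(rows, filters):
    """Keep rows whose value in every filtered column is one of the allowed strings."""
    return [
        row
        for row in rows
        if all(row.get(column) in allowed for column, allowed in filters.items())
    ]


rows = [
    {"column1": "value1", "column2": "value2"},
    {"column1": "value1", "column2": "other"},
    {"column1": "other", "column2": "value1"},
]
filters = {
    "column1": ["value1", "value2"],
    "column2": ["value1", "value2"],
}
kept = apply_filters(rows, filters)
```

Under these assumptions only the first row is kept, because the other two fail one of the column filters.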

columns

A list of columns to be included in the result. All columns will be included if left blank.

Example:

"columns": ["column1", "column3"]

aliases

A list of aliases for the included columns, in the same order as the columns. If no aliases are provided, the columns keep their names from the original file.

Example:

"aliases": ["renamedColumn1", "renamedColumn2"]

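The interplay of columns and aliases can be sketched as follows, assuming aliases are matched to the included columns by position. The function name select_columns and the length check are assumptions made for the example.

```python
def select_columns(rows, columns=None, aliases=None):
    """Keep only the requested columns and rename them to the given aliases."""
    if not rows:
        return []
    names = columns or list(rows[0].keys())  # no columns given: keep all
    headers = aliases or names               # no aliases given: keep names
    if len(headers) != len(names):
        raise ValueError("aliases must match the number of included columns")
    return [
        {header: row[name] for header, name in zip(headers, names)}
        for row in rows
    ]


rows = [{"column1": "a", "column2": "b", "column3": "c"}]
result = select_columns(
    rows,
    columns=["column1", "column3"],
    aliases=["renamedColumn1", "renamedColumn2"],
)
```

Here column1 becomes renamedColumn1 and column3 becomes renamedColumn2, while column2 is dropped.
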
License

MIT

web_extractor's People

Contributors

mareksl
