
tap-spreadsheets-anywhere's Introduction

tap-spreadsheets-anywhere

This is a Singer tap that reads data from spreadsheet files (CSVs, Excel, JSONs, custom-delimited) accessible from any smart_open supported transport and produces JSON-formatted data following the Singer spec. This tap is developed for compatibility with Meltano.

How to use it

tap-spreadsheets-anywhere works together with any other Singer Target to move data from any smart_open supported transport to any target destination. smart_open supports a wide range of transport options out of the box, including:

  • S3
  • local directories (file://)
    • NOTE: absolute paths look like this "file:///root/child/target", with three forward slashes
    • NOTE: on Windows, an absolute path will not work for pointing to a separate drive letter; if you'd like to point to a folder on your D drive, use "file://d:/data/subfolder"
  • HTTP, HTTPS (read-only)
  • SSH, SCP and SFTP
  • WebHDFS
  • GCS
  • Azure Blob Storage

Multiple individual files with the same schema can be configured & ingested into the same "Table" for processing.

Compression

smart_open allows reading and writing gzip and bzip2 files. They are transparently handled over HTTP, S3, and other protocols, too, based on the extension of the file being opened.
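
For example, a gzipped CSV needs no extra configuration. A quick sketch using smart_open directly (the bucket and key here are made up):

from smart_open import open

# The .gz extension triggers transparent gzip decompression,
# regardless of the transport (local file, S3, HTTP, ...).
with open("s3://my-example-bucket/exports/data.csv.gz") as fin:
    for line in fin:
        print(line.rstrip())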

Configuration

The Meltano configuration for this tap must contain the key 'tables' which holds an array of json objects describing each set of targeted source files.

config:
  extractors:
  - name: tap-spreadsheets-anywhere
    namespace: tap_spreadsheets_anywhere
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    executable: tap-spreadsheets-anywhere
    capabilities:
    - catalog
    - discover
    - state
    config:
      tables: []

To run this tap directly from the CLI, a config.json file must be supplied which holds the 'tables' array. A sample config file is available at sample_config.json, and a description of the required/optional fields declared within it follows. The configuration is also captured in tables_config_util.py as a voluptuous-based schema for validation purposes.

{
    "tables": [
        {
            "path": "s3://my-s3-bucket",
            "name": "target_table_name",
            "pattern": "subfolder/common_prefix.*",
            "start_date": "2017-05-01T00:00:00Z",
            "key_properties": [],
            "format": "csv",
            "delimiter": "|",
            "quotechar": '"',
            "universal_newlines": false,
            "skip_initial": 0,
            "sample_rate": 10,
            "max_sampling_read": 2000,
            "max_sampled_files": 3,
            "prefer_number_vs_integer": true,
            "prefer_schema_as_string": true,
            "selected": true,

            // for any field in the table, you can hardcode the JSON Schema datatype to override
            // the schema inferred through discovery mode.
            // *Note for Meltano users* - the schema override support delivered in Meltano v1.41.1 is more robust
            // and should be preferred to this tap-specific override functionality.
            "schema_overrides": {
                "id": {
                    "type": ["null", "integer"],
                },
                // if you want the tap to enforce that a field is not nullable, you can do it like so:
                "first_name": {
                    "type": "string",
                }
            }
        },
        {
            "path": "sftp://username:password@host//path/file",
            "name": "another_table_name",
            "pattern": "subdir/.*User.*",
            "start_date": "2017-05-01T00:00:00Z",
            "key_properties": ["id"],
            "format": "excel", 
            // you must specify the worksheet name to pull from in your xls(x) file.
            "worksheet_name": "Names"
        }
    ]
}

Each object in the 'tables' array describes one or more CSV or Excel spreadsheet files that adhere to the same schema and are meant to be tapped as the source for a Singer-based data flow.

  • path: A string describing the transport and bucket/root directory holding the targeted source files.
  • name: A string describing the "table" (aka Singer stream) into which the source data should be loaded.
  • search_prefix: (optional) This is an optional prefix to apply after the bucket that will be used to filter files in the listing request from the targeted system. This prefix potentially reduces the number of files returned from the listing request.
  • pattern: This is an escaped regular expression that the tap will use to filter the listing result set returned from the listing request. This pattern potentially reduces the number of listed files that are considered as sources for the declared table. Note that since this is an escaped string inside an escaped string, any backslashes in the regex must be double-escaped (see the example after this list).
  • start_date: This is the datetime that the tap will use to filter files, based on the modified timestamp of the file.
  • key_properties: These are the "primary keys" of the CSV files, to be used by the target for deduplication and primary key definitions downstream in the destination.
  • format: Must be either 'csv', 'json', 'jsonl' (JSON Lines), 'excel', or 'detect'. Note that csv can be further customized with delimiter and quotechar variables below.
  • invalid_format_action: (optional) By default, the tap will raise an exception if a source file cannot be read. Set this key to "ignore" to skip such source files and continue the run.
  • field_names: (optional) An array holding the names of the columns in the targeted files. If not supplied, the first row of each file must hold the desired values.
  • encoding: (optional) The file encoding to use when reading text files (e.g., "utf-8" (default), "latin1", "windows-1252")
  • universal_newlines: (optional) Should the source file parsers honor universal newlines? Setting this to false will instruct the parser to only consider '\n' as a valid newline identifier.
  • skip_initial: (optional) How many lines should be skipped at the start of each file. The default is 0.
  • sample_rate: (optional) The sampling rate to apply when reading a source file for sampling in discovery mode. A sampling rate of 1 will sample every line. A sampling rate of 10 (the default) will sample every 10th line.
  • max_sampling_read: (optional) How many lines of the source file should be sampled when in discovery mode attempting to infer a schema. The default is 1000 samples.
  • max_sampled_files: (optional) The maximum number of files in the targeted set that will be sampled. The default is 5.
  • max_records_per_run: (optional) The maximum number of records that should be written to this stream in a single sync run. The default is unlimited.
  • prefer_number_vs_integer: (optional) If the discovery mode sampling process sees only integer values for a field, should number be used anyway so that floats are not considered errors? The default is false but true can help in situations where floats only appear rarely in sources and may not be detected through discovery sampling.
  • prefer_schema_as_string: (optional) Boolean, either true or false (default). Should all fields in the schema be read as strings by default?
  • selected: (optional) Should this table be synced. Defaults to true. Setting to false will skip this table on a sync run.
  • worksheet_name: (optional) The worksheet name to pull from in the targeted xls(x) file(s). Only required when format is excel.
  • delimiter: (optional) the delimiter to use when format is 'csv'. Defaults to a comma ',' but you can set delimiter to 'detect' to leverage the csv "Sniffer" for auto-detecting delimiter.
  • quotechar: (optional) the character used to surround values that may contain delimiters - defaults to a double quote '"'
  • json_path: (optional) the JSON key under which the list of objects to use is located. Defaults to None, corresponding to an array at the top level of the JSON tree.
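
For example, a table whose files look like reports/2023-01.csv needs its regex backslashes double-escaped inside the config JSON (the bucket and stream names here are made up):

{
    "tables": [
        {
            "path": "s3://my-example-bucket",
            "name": "monthly_reports",
            "pattern": "reports/\\d{4}-\\d{2}\\.csv$",
            "start_date": "2017-05-01T00:00:00Z",
            "key_properties": [],
            "format": "csv"
        }
    ]
}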

Automatic Config Generation

This is an experimental feature used to crawl a path and generate a config block for every file encountered. An intended use-case is where source files are organized in subdirectories by intended target table. This mode will generate a config block for each subdirectory and for each file format within it. The following example config file will crawl the s3 bucket my-example-bucket and produce config blocks for each folder under it where source files are detected.

{
    "tables": [
        {
            "crawl_config": true,
            "path": "s3://my-example-bucket",
            "pattern": ".*"
        }
    ]
}

Typically this mode will be used when there are many streams to be configured and processed. Therefore, generating the catalog independently is generally helpful.

meltano invoke --dump=catalog tap-spreadsheets-anywhere > my-catalog.json
meltano elt --catalog=my-catalog.json --job_id=my-job-state tap-spreadsheets-anywhere any-loader

JSON support

JSON files are expected to parse as a root-level array of objects where each object is a set of flat key-value pairs.

[
    { "name": "row one", "key": 42},
    { "name": "row two", "key": 43}
]

JSONL (JSON Lines) support

JSONL files are expected to parse as one object per line, where each row in a file is a set of key-value pairs.

{ "name": "row one", "key": 42}
{ "name": "row two", "key": 43}

Authentication and Credentials

This tap authenticates with target systems as described in the smart_open documentation.

State

This tap is designed to continually poll a configured directory for any unprocessed files that match a table configuration and to process any that are found. On the first syncing run, the declared start_date will be used to filter the set of files that match the search_prefix and pattern expressions. The last modified date of the most recently synced file will then be written to state and used in place of start_date on the next syncing run.

While state is maintained, only new files will be processed from subsequent runs.

Install and Run outside of Meltano

First, make sure Python 3 is installed on your system, then execute create_virtualenv.sh to create a local venv and install the necessary dependencies. If you are executing this tap outside of Meltano, you will need to supply the config.json file yourself. A sample configuration is available in sample_config.json. You can invoke this tap directly with:

python -m tap_spreadsheets_anywhere --config config.json

History:

  • This project borrowed heavily from tap-s3-csv. That project was modified to use smart_open for support beyond S3 and then migrated to the cookiecutter-based templates for taps.
  • Support for --discover was added so that target files could be sampled independent from sync runs
  • CSV parsing was made more robust and support for configurable typing & sampling added
  • The github commit log holds history from that point forward

Copyright © 2020 Eric Simmerman


tap-spreadsheets-anywhere's Issues

Ability to set granularity of replication key

I am using this tap to extract s3 inventory reports from an s3 bucket and trying to fine tune the replication method. I have set the replication in the meltano.yml as follows:

metadata:
      inventory_reporting:
        replication-method: INCREMENTAL
        replication-key: last_modified_date

However, it seems that because the s3 inventory reports are written daily and have their own timestamp at the file level, the tap is defaulting to the overall file modified date instead of the column-level modified date. My desired outcome is to only load records from the latest report that have a last_modified_date > the last ELT run, but right now it's loading all records/rows from the latest report whenever the report modified date is > the last ELT run. Has anyone run into this? Am I missing a setting to override the file-level date and use the more granular row-level modified date for replication?

re.compile(pattern).search matches substrings too and hence picks up files outside the mentioned subfolder

if matcher.search(key) and (modified_since is None or modified_since < last_modified):

>>> import re
>>> pattern = 'popularity_index/.*.csv'
>>> file = 'new_popularity_index/test.csv'
>>> m= re.compile(pattern)
>>> m.search(file)
<re.Match object; span=(4, 29), match='popularity_index/test.csv'>

match would be a better option here:

>>> file = 'popularity_index/test.csv'
>>> m.match(file)
<re.Match object; span=(0, 25), match='popularity_index/test.csv'>
>>> file_2='new_popularity_index/test.csv'
>>> m.match(file_2)
>>> 
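
Until (or unless) the tap switches to match, a workaround is to anchor the pattern yourself:

>>> anchored = re.compile(r'^popularity_index/.*\.csv$')
>>> anchored.search('new_popularity_index/test.csv') is None
True
>>> anchored.search('popularity_index/test.csv')
<re.Match object; span=(0, 25), match='popularity_index/test.csv'>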

Tap not adding metadata to the records

I got an error using this tap to load data (I have faced the issue with both the PostgreSQL and Snowflake loaders). Basically, the table is created in those databases assuming the metadata fields (e.g. _smart_source_bucket) cannot be null, but those fields are not actually passed as part of the record, so the load fails because the database assumes those columns can't be null.

@DouweM Took a look and discovered the following:

It looks like

transformed_record = transformer.transform(record, schema, metadata)

inserts that metadata into record, just like the metadata schema is inserted into the record schema here:

'properties': merge_dicts(data_schema, metadata_schema)

But, that's not actually what transformer.transform does with its metadata argument:
https://github.com/singer-io/singer-python/blob/2f618c6194f154be7ba37f288d456f6abd4d78ec/singer/transform.py#L148

which is only used to find out if a specific column should be selected or not:
https://github.com/singer-io/singer-python/blob/2f618c6194f154be7ba37f288d456f6abd4d78ec/singer/transform.py#L128

So the tap is currently claiming that each row will have non-null smart fields, but they never actually end up in the RECORD messages

This bug was introduced by a59fdfd

To fix this, we should merge metadata into record again; the stream-specific metadata that transformer.transform uses to determine which fields to include should instead come from the catalog here:

def sync(config, state, catalog):
and passed to write_file here:
records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
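
A minimal sketch of that merge (hypothetical variable names; the real change would live in the tap's sync path):

# Merge the _smart_* metadata into the record itself before transforming,
# since transformer.transform() does not inject its metadata argument.
record_with_metadata = {
    **record,
    '_smart_source_bucket': bucket,
    '_smart_source_file': file_path,
    '_smart_source_lineno': line_number,
}
transformed_record = transformer.transform(record_with_metadata, schema, stream_metadata)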

Silent failure during sampling of an Excel spreadsheet with blank rows before the data

Issue:

When using this tap to read an Excel spreadsheet with a blank row between non-data rows and the table, it fails to parse the file during the sampling discovery stage. This results in: INFO Processing 0 selected streams from Catalog.

Mentioned in the Slack thread.

Reproduction:

  • Spreadsheet with the following format:

    Title
    Description
    <blank_row>
    <table_header>
    <table_data>
    ...
    Totals
    <blank_row>
    Notes
    
  • meltano.yml snippet (NOTE: hand altered config & output to remove sensitive info):

    ...
    plugins:
      extractors:
      - name: tap-spreadsheets-anywhere
        variant: ets
        pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
        config:
          tables:
          - path: file:///path/to/spreadsheets/
            name: my_table
            pattern: ".*"
            key_properties: []
            format: excel
            worksheet_name: <sheet_name>
            start_date: '1970-01-01T00:00:00+00:00'
            field_names:
            - "Date"
            - "Description"
            - ...
            skip_initial: 3
  • Install meltano + plugins into a python venv.

  • meltano invoke tap-spreadsheets-anywhere --dev

    2023-05-15T13:47:56.821676Z [info     ] Environment 'dev' is active
    INFO Generating catalog through sampling.
    INFO Walking /path/to/spreadsheets/.
    INFO Found 1 files.
    INFO Found broken_spreadsheet.xlsx and /path/to/spreadsheets/broken_spreadsheet.xlsx exists True
    INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
    INFO Sampling broken_spreadsheet.xlsx (1000 records, every 5th record).
    ERROR Unable to write Catalog entry for 'my_table' - it will be skipped due to error tuple index out of range
    INFO Processing 0 selected streams from Catalog

EXPECTATION: skip_initial and/or field_names would allow the file to be sampled/read from the line where the table data exists in the Excel file.

Debugging:

  • excel_handler.generator_wrapper is throwing an IndexError during the sampling phase when it hits a blank row (see the sketch after this list).
  • It doesn't look like field_names is used during the sampling phase (Just double checked and it's only used in the csv_handler.py).
  • The original skip_initial changes (#37) supported skipping over rows with data in them.
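
A possible guard for the blank-row case (a sketch only; the row shape in openpyxl's read-only mode and the header normalization are assumptions, not the tap's current code):

def generator_wrapper(reader):
    """Yield one dict per data row, skipping fully blank rows."""
    header_row = None
    for row in reader:
        # Blank rows come through as empty tuples or all-None cells;
        # skip them instead of indexing into them (the IndexError above).
        if not row or all(cell.value is None for cell in row):
            continue
        if header_row is None:
            header_row = [str(cell.value).lower().replace(' ', '_') for cell in row]
            continue
        yield dict(zip(header_row, (cell.value for cell in row)))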

Stream contains all rows in `.xlsx` sheet instead of only data rows.

I'm playing around with: tap-spreadsheets-anywhere locally with an externally generated .xlsx file.

  • Using skip_initial to get to skip some human readable text and start on the table header.
  • Using field_names to list each of the expected header titles.
  • File contains ~100 lines of table data.
  • There is a Totals row at the bottom of the table data.
  • A few blank lines, then a human readable notes footer and then blank lines until the end of the sheet at row 1000.
  • meltano invoke tap-spreadsheets-anywhere has the line: ... INFO Wrote 995 records for stream "<table_name>" .....

Expected to only see my ~100 rows of data in the stream.

Doing a meltano run tap-spreadsheets-anywhere target-postgres results in 995 rows written to the table instead of ~100.

Mentioned on Meltano Slack.


Also noticed that field_names has not been implemented in the excel_handler.py file. From the description, is the expectation that if it were added, it would stop at blank rows? Or just be an alternative to skip_initial=<integer_to_table_field_names_row>?

CI is failing on missing PDM lock file

The CI has been failing for a while on the lack of a PDM lock file committed to the repo.

Suggest:

Walking a non-existent local file directory doesn't fail

visch@DESKTOP-9BDPA9T:~/git/meltano-projects/spreadsheets$ dir
README.md  analyze  config.json  extract  load  meltano.yml  notebook  orchestrate  output  plugins  requirements.txt  transform
visch@DESKTOP-9BDPA9T:~/git/meltano-projects/spreadsheets$ cat meltano.yml
version: 1
default_environment: dev
project_id: 676da67f-9ef4-44b9-8993-fcf9513de2af
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    config:
      tables:
      - path: file://extractttt/
        name: d11_test
        pattern: "/*.csv"
        start_date: '2023-01-01T00:00:00Z'
        key_properties: [Unique Record ID]
        format: csv
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl
visch@DESKTOP-9BDPA9T:~/git/meltano-projects/spreadsheets$ meltano inv^C
visch@DESKTOP-9BDPA9T:~/git/meltano-projects/spreadsheets$ meltano invoke tap-spreadsheets-anywhere
2023-03-29T19:19:47.436312Z [info     ] Environment 'dev' is active
INFO Using supplied catalog /home/visch/git/meltano-projects/spreadsheets/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json.
INFO Processing 1 selected streams from Catalog
INFO Syncing stream:d11_test
{"type": "SCHEMA", "stream": "d11_test", "schema": {"properties": {"_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": ["Unique Record ID"]}
INFO Walking extractttt/.
INFO Found 0 files.
INFO Checking 0 resolved objects for any that match regular expression "/*.csv" and were modified since 2023-01-01 00:00:00+00:00
INFO Processing 0 resolved objects that met our criteria. Enable debug verbosity logging for more details.
INFO Wrote 0 records for stream "d11_test".
visch@DESKTOP-9BDPA9T:~/git/meltano-projects/spreadsheets$

This should fail, in my opinion, instead of running and not telling you something is wrong. I'm trying to pull data from a drive on Windows and it's really hard to tell why I can't see the files in the directory.
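
One possible guard that would make this fail fast (a hypothetical helper, not the tap's current code):

import os

def resolve_local_directory(path):
    # "file://extractttt/" -> "extractttt/"
    local_dir = path.replace("file://", "", 1)
    # Raise instead of silently walking nothing and finding 0 files.
    if not os.path.isdir(local_dir):
        raise ValueError(f"Local path '{local_dir}' does not exist or is not a directory")
    return local_dir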

Azure sync process logs quite noisy

Is there any way to reduce the http request/responses being printed to the logs? This was for a single table with two rows.

2023-05-18T12:32:39.201092Z [info     ] Environment 'dev' is active
2023-05-18T12:32:42.127043Z [warning  ] No state was found, complete import.
2023-05-18T12:32:42.528555Z [info     ] INFO Using supplied catalog /home/andycarter/vscode_projects/empiric_elt/empiric_meltano/.meltano/run/tap-spreadsheets-anywhere/tap.properties.json. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.528749Z [info     ] INFO Processing 1 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.528866Z [info     ] INFO Syncing stream:test       cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.528960Z [info     ] INFO Request URL: 'https://devstorefmzyvn6gotqas.blob.core.windows.net/blobcontainer?restype=REDACTED&comp=REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529053Z [info     ] Request method: 'GET'          cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529154Z [info     ] Request headers:               cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529258Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529364Z [info     ]     'Accept': 'application/xml' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529473Z [info     ]     'User-Agent': 'azsdk-python-storage-blob/12.16.0 Python/3.10.6 (Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35)' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529576Z [info     ]     'x-ms-date': 'REDACTED'    cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529690Z [info     ]     'x-ms-client-request-id': '15924b0a-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.529840Z [info     ]     'Authorization': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.530106Z [info     ] No body was attached to the request cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.714693Z [info     ] time=2023-05-18 13:32:42 name=target_postgres level=INFO message=Schema 'raw__tap_spreadsheets_anywhere' does not exist. Creating... CREATE SCHEMA IF NOT EXISTS raw__tap_spreadsheets_anywhere cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
2023-05-18T12:32:42.717406Z [info     ] INFO Response status: 200      cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.717761Z [info     ] Response headers:              cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.718086Z [info     ]     'Transfer-Encoding': 'chunked' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.718241Z [info     ]     'Content-Type': 'application/xml' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.718404Z [info     ]     'Server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.719016Z [info     ]     'x-ms-request-id': 'fc5c41ae-701e-0014-2d84-89af86000000' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.719263Z [info     ]     'x-ms-client-request-id': '15924b0a-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.719416Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.719569Z [info     ]     'Date': 'Thu, 18 May 2023 12:32:38 GMT' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.719833Z [info     ] INFO Checking 2 resolved objects for any that match regular expression "in/emp.txt" and were modified since 2017-05-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.720014Z [info     ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.720178Z [info     ] INFO Syncing file "in/emp.txt". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.721358Z [info     ] INFO Request URL: 'https://devstorefmzyvn6gotqas.blob.core.windows.net/blobcontainer/in/emp.txt' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722021Z [info     ] Request method: 'HEAD'         cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722222Z [info     ] Request headers:               cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722348Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722473Z [info     ]     'Accept': 'application/xml' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722585Z [info     ]     'User-Agent': 'azsdk-python-storage-blob/12.16.0 Python/3.10.6 (Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35)' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722705Z [info     ]     'x-ms-date': 'REDACTED'    cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.722902Z [info     ]     'x-ms-client-request-id': '15b1cf70-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.723121Z [info     ]     'Authorization': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.723812Z [info     ] No body was attached to the request cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.776114Z [info     ] time=2023-05-18 13:32:42 name=target_postgres level=INFO message=Table '"test"' does not exist. Creating... CREATE TABLE IF NOT EXISTS raw__tap_spreadsheets_anywhere."test" ("_smart_source_bucket" character varying, "_smart_source_file" character varying, "_smart_source_lineno" numeric, "personidlastnamefirstnameaddresscity" character varying) cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
2023-05-18T12:32:42.884202Z [info     ] INFO Response status: 200      cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.884708Z [info     ] Response headers:              cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.885120Z [info     ]     'Content-Length': '86'     cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.885423Z [info     ]     'Content-Type': 'text/plain' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.885700Z [info     ]     'Content-MD5': 'REDACTED'  cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.885974Z [info     ]     'Last-Modified': 'Fri, 12 May 2023 14:36:14 GMT' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.886300Z [info     ]     'Accept-Ranges': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.886755Z [info     ]     'ETag': '"0x8DB52F63D10EF6A"' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.887060Z [info     ]     'Server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.887367Z [info     ]     'x-ms-request-id': '5d901676-301e-003a-0b84-89fd91000000' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.887883Z [info     ]     'x-ms-client-request-id': '15b1cf70-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.888275Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.888610Z [info     ]     'x-ms-creation-time': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.889107Z [info     ]     'x-ms-lease-status': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.890025Z [info     ]     'x-ms-lease-state': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.890645Z [info     ]     'x-ms-blob-type': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.891199Z [info     ]     'x-ms-server-encrypted': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.891704Z [info     ]     'x-ms-access-tier': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.892250Z [info     ]     'x-ms-access-tier-inferred': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.892839Z [info     ]     'Date': 'Thu, 18 May 2023 12:32:39 GMT' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.893863Z [info     ] INFO Request URL: 'https://devstorefmzyvn6gotqas.blob.core.windows.net/blobcontainer/in/emp.txt' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.894584Z [info     ] Request method: 'GET'          cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.895417Z [info     ] Request headers:               cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.898002Z [info     ]     'x-ms-range': 'REDACTED'   cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.899292Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.899755Z [info     ]     'Accept': 'application/xml' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.900309Z [info     ]     'User-Agent': 'azsdk-python-storage-blob/12.16.0 Python/3.10.6 (Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35)' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.901188Z [info     ]     'x-ms-date': 'REDACTED'    cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.903005Z [info     ]     'x-ms-client-request-id': '15cae50a-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.903593Z [info     ]     'Authorization': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.904158Z [info     ] No body was attached to the request cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.917702Z [info     ] INFO Response status: 206      cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.918325Z [info     ] Response headers:              cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.925455Z [info     ]     'Content-Length': '86'     cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.926205Z [info     ]     'Content-Type': 'text/plain' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.926697Z [info     ]     'Content-Range': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.927164Z [info     ]     'Last-Modified': 'Fri, 12 May 2023 14:36:14 GMT' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.927599Z [info     ]     'Accept-Ranges': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.928037Z [info     ]     'ETag': '"0x8DB52F63D10EF6A"' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.928536Z [info     ]     'Server': 'Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.929027Z [info     ]     'x-ms-request-id': '5d901692-301e-003a-2384-89fd91000000' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.929394Z [info     ]     'x-ms-client-request-id': '15cae50a-f578-11ed-a56a-00155d6836b4' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.929803Z [info     ]     'x-ms-version': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.930234Z [info     ]     'x-ms-creation-time': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.930665Z [info     ]     'x-ms-blob-content-md5': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.933171Z [info     ]     'x-ms-lease-status': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.934547Z [info     ]     'x-ms-lease-state': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.935142Z [info     ]     'x-ms-blob-type': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.935642Z [info     ]     'x-ms-server-encrypted': 'REDACTED' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.936053Z [info     ]     'Date': 'Thu, 18 May 2023 12:32:39 GMT' cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:42.936905Z [info     ] INFO Wrote 2 records for stream "test". cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-05-18T12:32:43.049936Z [info     ] time=2023-05-18 13:32:43 name=target_postgres level=INFO message=Loading 2 rows into 'raw__tap_spreadsheets_anywhere."test"' cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
2023-05-18T12:32:43.080390Z [info     ] time=2023-05-18 13:32:43 name=target_postgres level=INFO message=Loading into raw__tap_spreadsheets_anywhere."test": {"inserts": 2, "updates": 0, "size_bytes": 126} cmd_type=elb consumer=True name=target-postgres producer=False stdio=stderr string_id=target-postgres
2023-05-18T12:32:43.139285Z [info     ] Incremental state has been updated at 2023-05-18 12:32:43.139165.
2023-05-18T12:32:43.165229Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True
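
One way to quiet this, assuming the noise comes from the Azure SDK's HTTP logging policy (its logger can be raised above INFO without touching the tap's own logging):

import logging

# Keep warnings/errors, drop the per-request/response header dumps.
logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.WARNING)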

`zipfile.BadZipFile: File is not a zip file` when loading an `.xlsx` file

Not sure if this is a misconfiguration on my side, but I get the error zipfile.BadZipFile: File is not a zip file, both when reading an .xlsx from local/S3 and in a pytest testcase saving and reloading an openpyxl-created workbook.

Debugging/Workaround

My current workaround is to modify: https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/excel_handler.py#L64-L65 from:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)

to:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle.name, read_only=True)

Reproduction:

  • Created an example testcase:
    from openpyxl import Workbook
    from tap_spreadsheets_anywhere.format_handler import get_row_iterator
    
    
    def get_worksheet():
        """Create a basic workbook that can be manipulated for tests.
        See: https://openpyxl.readthedocs.io/en/stable/usage.html.
        """
        wb = Workbook()
        ws = wb.active
        tree_data = [
            ["Type", "Leaf Color", "Height"],
            ["Maple", "Red", 549],
            ["Oak", "Green", 783],
            ["Pine", "Green", 1204]
        ]
        exp_tree_data = [
            {'type': 'Maple', 'leaf_color': 'Red', 'height': 549},
            {'type': 'Oak', 'leaf_color': 'Green', 'height': 783},
            {'type': 'Pine', 'leaf_color': 'Green', 'height': 1204},
        ]
        [ws.append(row) for row in tree_data]
        return ws, wb, tree_data, exp_tree_data
    
    
    class TestFormatHandlerExcelSkipInitial:
        """pytests to validate Skip Initial for Excel files works as expected."""
        def test_parse_data(self, tmpdir):
            xlsx = tmpdir / "fake_test.xlsx"
            uri = f"file://{xlsx}"
            worksheet, workbook, _, exp = get_worksheet()
            workbook.save(xlsx)
    
            iterator = get_row_iterator({"format": "excel"}, uri)
            assert next(iterator) == exp[0]
  • Current output:
    cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
    ============================================= test session starts ==============================================
    platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
    cachedir: .pytest_cache
    rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
    collected 30 items / 29 deselected / 1 selected                                                                
    
    tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data FAILED [100%]
    
    =================================================== FAILURES ===================================================
    ______________________________ TestFormatHandlerExcelSkipInitial.test_parse_data _______________________________
    
    self = <tap_spreadsheets_anywhere.test.test_format.TestFormatHandlerExcelSkipInitial object at 0x1055e9e50>
    tmpdir = local('/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0')
    
        def test_parse_data(self, tmpdir):
            xlsx = tmpdir / "fake_test.xlsx"
            uri = f"file://{xlsx}"
            worksheet, workbook, _, exp = get_worksheet()
            workbook.save(xlsx)
    
    >       iterator = get_row_iterator({"format": "excel"}, uri)
    
    tap_spreadsheets_anywhere/test/test_format.py:218: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tap_spreadsheets_anywhere/format_handler.py:164: in get_row_iterator
        iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
    tap_spreadsheets_anywhere/excel_handler.py:66: in get_row_iterator
        workbook = openpyxl.load_workbook(file_handle, read_only=True)
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:344: in load_workbook
        reader = ExcelReader(filename, read_only, keep_vba,
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:123: in __init__
        self.archive = _validate_archive(fn)
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:95: in _validate_archive
        archive = ZipFile(filename, 'r')
    /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1301: in __init__
        self._RealGetContents()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <zipfile.ZipFile [closed]>
    
        def _RealGetContents(self):
            """Read in the table of contents for the ZIP file."""
            fp = self.fp
            try:
                endrec = _EndRecData(fp)
            except OSError:
                raise BadZipFile("File is not a zip file")
            if not endrec:
    >           raise BadZipFile("File is not a zip file")
    E           zipfile.BadZipFile: File is not a zip file
    
    /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1368: BadZipFile
    --------------------------------------------- Captured stdout call ---------------------------------------------
    <_io.TextIOWrapper name='/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0/fake_test.xlsx' mode='r' encoding='utf-8'>
    =========================================== short test summary info ============================================
    FAILED tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data - zi...
    ======================================= 1 failed, 29 deselected in 1.22s =======================================

Workaround:

Adding the above-mentioned workaround results in success, but I've not tested this in anger.

cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
============================= test session starts ==============================
platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
collected 30 items / 29 deselected / 1 selected                                

tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data PASSED [100%]

====================== 1 passed, 29 deselected in 23.37s =======================

Create a way to extract spreadsheets with no header row

Implement a feature that can handle spreadsheets where the header is not provided in the file. We want to be able to define the expected schema and then set a flag to indicate that there is no header. In testing, you can define a schema, but the first row of a spreadsheet is always assumed to be the header, so the extracted records use the values in the first row as keys instead of the schema defined in the Meltano configuration.
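
For what it's worth, the README's field_names option is meant to cover part of this for CSV sources. A config sketch (column names made up; per the debugging notes in the issues above, field_names is currently only honored by the csv handler, not the excel handler):

{
    "path": "file:///path/to/data",
    "name": "headerless_example",
    "pattern": ".*\\.csv$",
    "start_date": "1970-01-01T00:00:00Z",
    "key_properties": [],
    "format": "csv",
    "field_names": ["id", "first_name", "last_name"]
}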

Azure support

I'm jumping into this tap on short notice, but I needed to add Azure Blob support and made a bare-bones implementation here: https://github.com/radbrt/tap-spreadsheets-anywhere/tree/azure

Before making a PR, I'd like to discuss some of the issues/choices involved here.

First off, I'm reading files based on the smart_open example for reading Azure, where we depend on a storage account connection string with a particular formatting. For better or worse, Azure has a myriad of ways to connect to blobs, and this implementation does not support access methods such as Managed Identity (granted, Azure hardly supports that either). The particular formatting, although found in the Azure Portal, may also be a little unintuitive for users compared to other formats. This formatting issue can probably be fixed through documentation, though.

The way the credential is stored is also grabbed from the smart_open example - I simply assume an env variable is present (https://github.com/radbrt/tap-spreadsheets-anywhere/blob/53d8429657be1cca363a7c2a2a23eb0f7d4a97bd/tap_spreadsheets_anywhere/file_utils.py#L253). In Meltano, this works well by adding it to the .env file. But if anyone wants to read from multiple Azure storage accounts, this will be hard to configure. There are a few possible ways around this, with different tradeoffs.

Reading files from Azure requires an additional parameter to smart_open's open function, and I if-elsed that one very crudely. Any ideas for cleaner implementations are welcome: https://github.com/radbrt/tap-spreadsheets-anywhere/blob/53d8429657be1cca363a7c2a2a23eb0f7d4a97bd/tap_spreadsheets_anywhere/format_handler.py#L22

Lastly, I'm simply importing azure.storage.blob at the top, while I see some libraries are imported in try-blocks only when the specified protocol requires them.

Extend "json_path" config option with JSONPath parser for deep nested data

Followup to #16

It would be quite helpful to parse more complex JSON-files, especially if it's http-responses from some API.

With an actual JSONPath parser it could be done quickly and cleanly with a simple expression like response.data[*] applied to

{
  "response": {
    "data": [
      { "name": "row one", "key": 42 },
      { "name": "row two", "key": 43 }
    ]
  }
}

would return

[
  { "name": "row one", "key": 42 },
  { "name": "row two", "key": 43 }
]

which is the form of the data expected as input for the json format.

The same library is used by Meltano Singer SDK.
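
A sketch of how that could look with the jsonpath-ng library:

import json
from jsonpath_ng import parse

doc = json.loads('''
{
  "response": {
    "data": [
      { "name": "row one", "key": 42 },
      { "name": "row two", "key": 43 }
    ]
  }
}
''')

# "response.data[*]" yields one match per element of the nested array.
rows = [match.value for match in parse("response.data[*]").find(doc)]
assert rows == [{"name": "row one", "key": 42}, {"name": "row two", "key": 43}]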

Bug when reading `.xlsx` files. Excel files not properly tapped and no output with `ERROR Unable to write Catalog entry for 'filexlsx' - it will be skipped due to error File is not a zip file`

I'm running into unexpected behaviour when trying to tap into an excel file.

It is being detected, however it's not actually read and nothing ends up in the output dir, seemingly due to a File is not a zip file error.

I'm running Python 3.10 on an M1 Mac (OS X 11.6.4, Big Sur) with the following packages:

โฏ pip freeze
aiodocker==0.21.0
aiohttp==3.8.6
aiosignal==1.3.1
alembic==1.12.1
async-timeout==4.0.3
atomicwrites==1.4.1
attrs==23.1.0
certifi==2023.7.22
charset-normalizer==3.3.2
check-jsonschema==0.22.0
click==8.1.7
click-default-group==1.2.4
click-didyoumean==0.3.0
croniter==1.4.1
distlib==0.3.7
et-xmlfile==1.1.0
fasteners==0.19
filelock==3.13.1
flatten-dict==0.4.2
frozenlist==1.4.0
idna==3.4
importlib-resources==6.1.0
jdcal==1.4.1
Jinja2==3.1.2
jsonschema==4.19.2
jsonschema-specifications==2023.7.1
Mako==1.2.4
markdown-it-py==3.0.0
MarkupSafe==2.1.3
mdurl==0.1.2
meltano==3.1.0
multidict==6.0.4
openpyxl==3.1.2
packaging==23.2
platformdirs==3.11.0
prompt-toolkit==3.0.36
psutil==5.9.6
Pygments==2.16.1
PyJWT==2.8.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-slugify==8.0.1
python-ulid==1.1.0
PyYAML==6.0.1
questionary==2.0.1
referencing==0.30.2
requests==2.31.0
rich==13.6.0
rpds-py==0.12.0
ruamel.yaml==0.17.21
ruamel.yaml.clib==0.2.8
six==1.16.0
smart-open==6.4.0
snowplow-tracker==1.0.1
SQLAlchemy==2.0.23
structlog==23.2.0
tabulate==0.9.0
termcolor==2.3.0
text-unidecode==1.3
typing_extensions==4.8.0
tzlocal==5.2
urllib3==2.0.7
virtualenv==20.24.6
wcwidth==0.2.9
yarl==1.9.2
yaspin==2.5.0

meltano.yml

version: 1
default_environment: dev
project_id: c37dc81d-e555-4e23-9f9f-0b6acbdfff86
environments:
- name: dev
- name: staging
- name: prod
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    capabilities:
    - discover
    config:
      tables: [
        {
          "path": "file:///Users/alexis.vialaret/vscode_projects/EDA_Accelerator/data",
          "name": "billionaires_excelxlsx",
          "pattern": ".xlsx",
          "key_properties": [],
          "format": "excel",
          "worksheet_name": "Feuille 1",
          "start_date": "1970-01-01T00:00:00+00:00"
        }
      ]
  loaders:
  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl

Here is the file I'm testing with. It is a valid excel file:
billionaires_excel.xlsx

Shell output:

โฏ meltano run tap-spreadsheets-anywhere target-jsonl


2023-11-07T12:18:26.121508Z [info     ] Environment 'dev' is active
2023-11-07T12:18:26.214962Z [warning  ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:26.925482Z [warning  ] A catalog file was found, but it will be ignored as the extractor does not advertise the `catalog` or `properties` capability
2023-11-07T12:18:26.925679Z [warning  ] A state file was found, but it will be ignored as the extractor does not advertise the `state` capability
2023-11-07T12:18:27.382167Z [info     ] INFO Generating catalog through sampling. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382454Z [info     ] INFO Walking /Users/alexis.vialaret/vscode_projects/EDA_Accelerator/data. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382596Z [info     ] INFO Found 2 files.            cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382683Z [info     ] INFO Checking 2 resolved objects for any that match regular expression ".xlsx" and were modified since 1970-01-01 00:00:00+00:00 cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.382901Z [info     ] INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details. cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383281Z [info     ] INFO Sampling billionaires_excel.xlsx (1000 records, every 5th record). cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383774Z [info     ] ERROR Unable to write Catalog entry for 'billionaires_excelxlsx' - it will be skipped due to error File is not a zip file cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.383932Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=elb consumer=False name=tap-spreadsheets-anywhere producer=True stdio=stderr string_id=tap-spreadsheets-anywhere
2023-11-07T12:18:27.446698Z [info     ] Block run completed.           block_type=ExtractLoadBlocks err=None set_number=0 success=True

I've investigated a bit, and it seems like the issue might be coming from the way the excel file is passed to openpyxl in excel_handler.py:

[...]
def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)
[...]

file_handle is an _io.TextIOWrapper, which openpyxl.load_workbook does not seem to accept.

Here is a minimal reproducible example:

```python3
import smart_open
from openpyxl import load_workbook

data = smart_open.open(
    'file:///path/to/file.xlsx',
    'rb',
    newline=None,
    errors='surrogateescape',
    encoding='utf-8'
)
print(data)

# This fails with: zipfile.BadZipFile: File is not a zip file
workbook = load_workbook(data, read_only=True)

# This works
# workbook = load_workbook(data.buffer, read_only=True)

# This also works, but openpyxl will re-open the file to read it
# workbook = load_workbook(data.name, read_only=True)
```

It looks like a fix would be to pass `data.buffer` rather than just `data`. That works in my minimal example and solves the problem of the file not being tapped, but I'm lacking the context to be sure this is a good idea.

What do you think?
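For reference, a minimal sketch of what that fix might look like in `excel_handler.py` (the sheet selection here is a simplified placeholder, not the tap's actual logic):

```python3
import openpyxl

def get_row_iterator(table_spec, file_handle):
    # smart_open may hand us a text-mode TextIOWrapper; openpyxl needs the
    # underlying binary stream, so fall back to .buffer when it is present.
    binary_handle = getattr(file_handle, "buffer", file_handle)
    workbook = openpyxl.load_workbook(binary_handle, read_only=True)
    worksheet = workbook.active  # simplified: real code would honor table_spec
    return worksheet.iter_rows(values_only=True)
```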

Parquet, Avro, and ORC support

Hi,

It looks like tap-spreadsheets-anywhere supports file formats like CSV, JSON, and Excel. I was wondering if there are any plans to support more database-like formats, e.g. Parquet, Avro, and ORC. This would be very useful for ingesting data from AWS S3 when a provider has uploaded data in one of these alternative formats.
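As a rough illustration only (not the tap's API; this assumes a row-iterator contract similar to the existing handlers), a Parquet handler built on pyarrow might look like:

```python3
import pyarrow.parquet as pq

def get_row_iterator(file_handle):
    # Read the Parquet file and yield each row as a plain dict,
    # mirroring how the other format handlers feed records to the tap.
    table = pq.read_table(file_handle)
    for row in table.to_pylist():
        yield row
```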

I would be interested in your thoughts on this.

Thanks
Steve

Add option to set encoding

Add an option to set the encoding for text formats (e.g., csv, json); it would be helpful when the source file is not encoded as UTF-8.
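To illustrate the request (the file path is a placeholder): smart_open already accepts an encoding argument, so the tap could plausibly pass a per-table setting through:

```python3
import smart_open

# A Latin-1 encoded CSV decodes incorrectly (or raises) under the UTF-8
# default; an explicit encoding fixes it.
with smart_open.open("file:///path/to/latin1.csv", "r", encoding="latin-1") as handle:
    print(handle.readline())
```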

*.csv not working as RegEx in pattern (but .csv$ does work)

When using `*.csv` as the pattern, one would (at least I did) expect all files ending in .csv to be added to the catalog. That's not the case: `pattern` is a regular expression, not a glob, and as a regex `*.csv` is invalid because the leading `*` has nothing to repeat. Using `.csv$` does work. Adding some documentation on the correct use of RegEx would help, as would listing the mandatory fields (like `pattern`).
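A quick demonstration of the difference:

```python3
import re

print(re.search(r".csv$", "data.csv"))  # matches: '.' means "any character" in a regex
try:
    re.search(r"*.csv", "data.csv")
except re.error as exc:
    print(exc)  # nothing to repeat at position 0
```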

Azure: Use DefaultAzureCredential over storage key access to blob container

Would you consider supporting DefaultAzureCredential for authentication to Azure Blob Storage, perhaps falling back to the current storage-account-key approach? I'd love to be able to turn off storage account key access for additional security and just use managed identity (or the logged-in user when running locally).

```python3
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account_url = "https://<mystore>.blob.core.windows.net"
default_credential = DefaultAzureCredential()

# Create the BlobServiceClient object
blob_service_client = BlobServiceClient(account_url, credential=default_credential)
```

https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/identity/azure-identity/README.md#defaultazurecredential

SyntaxError -- parens not closed

Just tried testing my configuration and found a syntax error:

```
% meltano config tap-spreadsheets-anywhere test
2022-12-19T22:03:05.353237Z [info     ] Environment 'dev' is active
Need help fixing this problem? Visit http://melta.no/ for troubleshooting steps, or to
join our friendly Slack community.

Plugin configuration is invalid
Catalog discovery failed: command ['/Users/erik/Dropbox/home/git/meltano-projects/s3-inventory/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere', '--config', '/Users/erik/Dropbox/home/git/meltano-projects/s3-inventory/.meltano/run/tap-spreadsheets-anywhere/tap.29fe46e1-3e31-443c-bcc4-8baf31f11e37.config.json', '--discover'] returned 1 with stderr:
 Traceback (most recent call last):
  File "/Users/erik/Dropbox/home/git/meltano-projects/s3-inventory/.meltano/extractors/tap-spreadsheets-anywhere/venv/bin/tap-spreadsheets-anywhere", line 5, in <module>
    from tap_spreadsheets_anywhere import main
  File "/Users/erik/Dropbox/home/git/meltano-projects/s3-inventory/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.10/site-packages/tap_spreadsheets_anywhere/__init__.py", line 12, in <module>
    import tap_spreadsheets_anywhere.conversion as conversion
  File "/Users/erik/Dropbox/home/git/meltano-projects/s3-inventory/.meltano/extractors/tap-spreadsheets-anywhere/venv/lib/python3.10/site-packages/tap_spreadsheets_anywhere/conversion.py", line 10
    t_schema = pickle.loads(pickle.dumps((schema))
                           ^
SyntaxError: '(' was never closed
```

https://github.com/ets/tap-spreadsheets-anywhere/blob/master/tap_spreadsheets_anywhere/conversion.py#L10
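The fix is presumably just the missing closing parenthesis on that line:

```python3
t_schema = pickle.loads(pickle.dumps(schema))
```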

HTTP/S?

Hi there. The README says this tap should work with HTTPS, but looking at the code, it doesn't appear to?

SFTP error

Hello!

I'm trying to use the tap within meltano to get a csv file from an sftp server.
I set up one locally with docker compose.

Config is what I understood from your examples:

```yaml
      - path: sftp://testuser:testpwd@localhost//data
        name: orgs
        pattern: "organizations-100.csv"
        start_date: '2010-01-01T00:00:00Z'
        key_properties: []
        format: csv
        delimiter: ","
```

However, when I run `meltano elt ...`, the output is something like this:

```
2023-07-10T08:28:43.707800Z [info     ] INFO Processing 0 selected streams from Catalog cmd_type=extractor name=tap-spreadsheets-anywhere run_id=a74db321-0bd1-443d-acf9-795731dc0543 state_id=2023-07-10T082839--tap-spreadsheets-anywhere--target-duckdb stdio=stderr
```

For further testing, I ran `meltano invoke -d`, and the result is:

```
meltano invoke tap-spreadsheets-anywhere -d
2023-07-10T08:37:04.440135Z [info     ] Environment 'dev' is active
ERROR Unable to write Catalog entry for 'orgs' - it will be skipped due to error 500 Unknown command.
ERROR Unable to write Catalog entry for 'orgs' - it will be skipped due to error module 'smart_open.ssh' has no attribute '_connect'
{
  "streams": []
}
```

NB: using the sftp command line, I have no issues connecting and downloading the file.

What am I doing wrong?

Also, is the syntax for reading from ftp instead of sftp the same, apart from the protocol name?

Thank you!

Use private key with SFTP

I was wondering if you had considered supporting private key authentication with SFTP? It looks like smart_open changes its authentication based on the presence of `key_filename` in `connect_kwargs`.
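A minimal sketch of the mechanism (host, user, and key path are placeholders), assuming a smart_open version that forwards `connect_kwargs` to paramiko via `transport_params`:

```python3
import smart_open

with smart_open.open(
    "sftp://user@host//path/file.csv",
    transport_params={"connect_kwargs": {"key_filename": "/home/user/.ssh/id_rsa"}},
) as handle:
    print(handle.readline())
```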

Add new output type `object`

Suppose we have a .json file with the following structure:

```json
[
    {
        "key1": "value1",
        "key2": {
            "key3": "value2",
            "key4": "value3"
        }
    },
    {
        "key1": "value4",
        "key2": {
            "key3": "value5",
            "key4": "value6"
        }
    }
]
```

the tap outputs the object as a serialized Python dict string, which is not directly parseable as JSON by PostgreSQL (in my case): `{'key3': 'value2', 'key4': 'value3'}`.

The tap outputs only string, integer, number and date-time types, so no option to use object.
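To illustrate why the current output is awkward downstream:

```python3
import json

record = {"key3": "value2", "key4": "value3"}
print(str(record))         # {'key3': 'value2', 'key4': 'value3'}  -> not valid JSON
print(json.dumps(record))  # {"key3": "value2", "key4": "value3"}  -> valid JSON
```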

Slack discussion: https://meltano.slack.com/archives/C01TCRBBJD7/p1680119382351929

The solution is to add a new output type, object. The only problem I see is discovery mode: if the object type were detected by default, it would cause trouble for everyone who parses/uses dict-formatted strings downstream.

So there are two options:

  1. Auto-detect objects by default
  2. Explicitly set the object type using the schema or schema_overrides options in config

Column names with uppercase characters break key_properties

I'm not 100% sure whose bug this is, but I think it belongs here:
If I load a CSV with a column name containing uppercase characters (e.g. MyKeyCol) and want to use it as the key column, I run into a problem when also using target-postgres as the loader (the datamill-co variant). Somewhere in the process the column name is lowercased, but the key property isn't, which results in the KeyError below.
If I change key_properties to one of the columns with a lowercase name, it works fine.
I looked at the extras in Meltano, but couldn't figure out a solution. It almost feels like key_properties should accept an alias, something like MyKeyCol as mykeycol; a toy illustration of the mismatch follows below.
I'm guessing the field_names config value could be used to force the column names, but it doesn't look like I can skip the first row, so I'm back to unhappiness again :) (this could actually be an extra setting, like skip_rows (integer, default 0), to force the loader to ignore the first N rows of the file).
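A toy illustration (not the loader's actual code) of why the lookup fails when the schema's property names are lowercased but key_properties keeps the original casing:

```python3
# Hypothetical, simplified view of the state inside target-postgres.
schema = {"properties": {"mykeycol": {"type": ["null", "string"]}}}
key_properties = ["MyKeyCol"]

try:
    key_prop_schemas = {key: schema["properties"][key] for key in key_properties}
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'MyKeyCol'
```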

Here are the logs I'm getting with an example that fails:

```
tap-spreadsheets-anywhere | INFO Using supplied catalog .meltano/run/elt/load_csv_base_data/3bce9632-9df0-4752-b19d-1705bab4d11d/tap.properties.json.
tap-spreadsheets-anywhere | INFO Syncing stream:us_sec_country_codes
tap-spreadsheets-anywhere | INFO Walking /static_data/.
tap-spreadsheets-anywhere | INFO Found 13 files.
tap-spreadsheets-anywhere | INFO Checking 9 resolved objects for any that match regular expression "myfile.*" and were modified since 2011-01-01 00:00:00+00:00
tap-spreadsheets-anywhere | INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
tap-spreadsheets-anywhere | INFO Syncing file "myfile.csv".
tap-spreadsheets-anywhere | INFO Wrote 309 records for stream "mystream".
target-postgres           | DEBUG MillisLoggingConnection: 0 millis spent executing: b'BEGIN;'
target-postgres           | DEBUG MillisLoggingConnection: 0 millis spent executing: b"\n            SELECT c.relname, obj_description(c.oid, 'pg_class')\n            FROM pg_namespace AS n\n                INNER JOIN pg_class AS c ON n.oid = c.relnamespace\n            WHERE n.nspname = 'meltano';\n        "
target-postgres           | INFO Mapping: mytable to ['mytable']
target-postgres           | INFO Mapping: tp_mytable_mycol__sdc_sequence_idx to None
target-postgres           | DEBUG MillisLoggingConnection: 7 millis spent executing: b"\n            SELECT column_name, data_type, is_nullable FROM information_schema.columns\n            WHERE table_schema = 'meltano' and table_name = 'mytable';\n        "
target-postgres           | DEBUG MillisLoggingConnection: 1 millis spent executing: b"\n            SELECT EXISTS (\n                SELECT 1 FROM pg_tables\n                WHERE schemaname = 'meltano' AND\n                      tablename = 'mytable');\n        "
target-postgres           | INFO Stream mytable (mytable) with max_version None targetting None
target-postgres           | INFO Root table name mytable
target-postgres           | INFO Writing batch with 309 records for `mytable` with `key_properties`: `['MyKeyCol']`
target-postgres           | INFO METRIC: {"type": "counter", "metric": "record_count", "value": 0, "tags": {"count_type": "batch_rows_persisted", "path": ["mytable"], "database": "mydb", "schema": "meltano"}}
target-postgres           | INFO METRIC: {"type": "timer", "metric": "job_duration", "value": 0.0010831356048583984, "tags": {"job_type": "batch", "path": ["mytable"], "database": "mydb", "schema": "meltano", "status": "failed"}}
target-postgres           | ERROR Exception writing records
target-postgres           | Traceback (most recent call last):
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/postgres.py", line 295, in write_batch
target-postgres           |     written_batches_details = self.write_batch_helper(cur,
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/sql_base.py", line 808, in write_batch_helper
target-postgres           |     for table_batch in denest.to_table_batches(schema, key_properties, records):
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/denest.py", line 20, in to_table_batches
target-postgres           |     table_schemas = _get_streamed_table_schemas(schema,
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/denest.py", line 47, in _get_streamed_table_schemas
target-postgres           |     key_prop_schemas[key] = schema['properties'][key]
target-postgres           | KeyError: 'MyKeyCol'
target-postgres           | CRITICAL ('Exception writing records', KeyError('MyKeyCol'))
target-postgres           | Traceback (most recent call last):
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/postgres.py", line 295, in write_batch
target-postgres           |     written_batches_details = self.write_batch_helper(cur,
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/sql_base.py", line 808, in write_batch_helper
target-postgres           |     for table_batch in denest.to_table_batches(schema, key_properties, records):
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/denest.py", line 20, in to_table_batches
target-postgres           |     table_schemas = _get_streamed_table_schemas(schema,
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/denest.py", line 47, in _get_streamed_table_schemas
target-postgres           |     key_prop_schemas[key] = schema['properties'][key]
target-postgres           | KeyError: 'MyKeyCol'
target-postgres           |
target-postgres           | During handling of the above exception, another exception occurred:
target-postgres           |
target-postgres           | Traceback (most recent call last):
target-postgres           |   File ".meltano/loaders/target-postgres/venv/bin/target-postgres", line 8, in <module>
target-postgres           |     sys.exit(cli())
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/__init__.py", line 45, in cli
target-postgres           |     main(args.config)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/__init__.py", line 39, in main
target-postgres           |     target_tools.main(postgres_target)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/target_tools.py", line 28, in main
target-postgres           |     stream_to_target(input_stream, target, config=config)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/target_tools.py", line 77, in stream_to_target
target-postgres           |     raise e
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/target_tools.py", line 70, in stream_to_target
target-postgres           |     state_tracker.flush_streams(force=True)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/stream_tracker.py", line 47, in flush_streams
target-postgres           |     self._write_batch_and_update_watermarks(stream)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/stream_tracker.py", line 67, in _write_batch_and_update_watermarks
target-postgres           |     self.target.write_batch(stream_buffer)
target-postgres           |   File ".meltano/loaders/target-postgres/venv/lib/python3.8/site-packages/target_postgres/postgres.py", line 309, in write_batch
target-postgres           |     raise PostgresError(message, ex)
target-postgres           | target_postgres.exceptions.PostgresError: ('Exception writing records', KeyError('MyKeyCol'))
meltano                   | Loading failed (1): target_postgres.exceptions.PostgresError: ('Exception writing records', KeyError('MyKeyCol'))
meltano                   | ELT could not be completed: Loader failed
```

Error during discovery doesn't fail job

```
ERROR Unable to write Catalog entry for 'stream_name' - it will be skipped due to error nothing to repeat at position 0
```

This job should also fail with a non-zero exit code to alert us that there is an issue. Instead, we generate an empty catalog file and it becomes really hard to debug what the heck is going on, as Meltano caches the catalog file. (Incidentally, "nothing to repeat at position 0" is the error an invalid regex pattern such as `*.csv` produces.)
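A hedged sketch of the desired behavior (the variable name is hypothetical, not the tap's current code): collect discovery failures and exit non-zero instead of silently emitting an empty catalog:

```python3
import sys

# Populated wherever "Unable to write Catalog entry" is logged today.
failed_streams = ["stream_name"]

if failed_streams:
    sys.stderr.write(f"Discovery failed for: {', '.join(failed_streams)}\n")
    sys.exit(1)  # non-zero exit lets Meltano surface the failure
```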

Can't pull data from s3 bucket using Meltano with the only change between s3 and local being the path

Spent quite a few hours on this. I'm trying to pull down the dbt Jaffle Shop CSV data directly from the S3 bucket. These settings keep pulling nothing:

```yaml
version: 1
default_environment: dev
project_id: faa34bcc-b471-4943-a928-8104b0330bbd
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    config:
      tables:
        - format: detect
          name: customers
          path: s3://dbt-tutorial-public/
          pattern: "jaffle_shop_customers*.csv"
          start_date: "2017-05-01T00:00:00Z"
          key_properties: [id]
```

However, when I change it to a local path, it pulls the data without issue. Is there something else I'm supposed to do with the S3 settings? I didn't see anything else I needed to change.

```yaml
version: 1
default_environment: dev
project_id: faa34bcc-b471-4943-a928-8104b0330bbd
plugins:
  extractors:
  - name: tap-spreadsheets-anywhere
    variant: ets
    pip_url: git+https://github.com/ets/tap-spreadsheets-anywhere.git
    config:
      tables:
        - format: detect
          name: customers
          path: file:///home/me/dbt_data
          pattern: "jaffle_shop_customers*.csv"
          start_date: "2017-05-01T00:00:00Z"
          key_properties: [id]
```

I have even forced the format to csv, but no luck. Thanks for your help!

Add "json_path" config option for nested data

I was trying to load a JSON file that looks like this (the actual file is here):

```json
{
  "3166-1": [
    {
      "alpha_2": "AW",
      "alpha_3": "ABW",
      "name": "Aruba",
      "numeric": "533"
    },
    {
      "alpha_2": "AF",
      "alpha_3": "AFG",
      "name": "Afghanistan",
      "numeric": "004",
      "official_name": "Islamic Republic of Afghanistan"
    }
]}
```

which led to an error because the list of useful objects is nested under the 3166-1 key.
I was able to load the data by adding `json_array = json_array['3166-1']` just after the point where the handler parses the file,

which tells me it should be possible to add a config option for the JSON handler (similar to delimiter for the CSV handler, for instance) that tells it where to look for the data. At a minimum, a string naming the key under which to find the list of items (so in my example it could be `data_path: "3166-1"`), but I imagine it should be possible to declare a full path, something like `data_path: "section.subheading.one_more_level"`, so that it can grab items from arbitrarily deep in the tree.

I might give it a try in a PR if it makes sense to add something like this (I think so, I have a couple of use cases for it already ๐Ÿ˜‰ )
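A rough sketch of how such an option could work (the function and option names are hypothetical, not the tap's API): resolve a dotted path into the parsed document before iterating rows:

```python3
import json

def resolve_data_path(document, data_path=None):
    # Walk a dotted path like "section.subheading.one_more_level"
    # into a parsed JSON document; no path means use the document as-is.
    if not data_path:
        return document
    node = document
    for key in data_path.split("."):
        node = node[key]
    return node

# Usage with the example above (the file name is hypothetical):
with open("iso_3166.json") as f:
    rows = resolve_data_path(json.load(f), data_path="3166-1")
print(rows[0]["name"])  # Aruba
```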

TAP_SPREADSHEETS_ANYWHERE_TABLES environment variable is not seen by the tap

I have multiple data sources in S3 that have the same format and the same (>20) tables but delivered from separate S3 buckets. I am running from Meltano.

There are several possible solutions:

  1. create a separate Meltano environment for each dataset (separate S3 buckets)
    Cumbersome, bulky and very inflexible. Error-prone.

  2. parameterise the s3_bucket and s3_prefix settings inside meltano.yml (for each table) using environment variables
    I have tried this and it just doesn't work. It may be related to meltano/meltano#8268, which will (I hope) fix the issue.

  3. create a different JSON config file for each dataset and load it into TAP_SPREADSHEETS_ANYWHERE_TABLES
    I have attempted to load the JSON tables array into TAP_SPREADSHEETS_ANYWHERE_TABLES, but it only works when the string is stored in .env and does not work with an exported variable.

I also tried setting TAP_SPREADSHEETS_ANYWHERE_TABLES to the filename of the tables config, but the tap only accepts a JSON string.

Are there any other solutions to dealing with multiple environments, with this tap, that I could explore?

Issue discovering local files

With the following config:

```yaml
      - path: file:///some/absolute/path/that/ends/with/slash/
        pattern: ".*"
```

the tap fails to discover local files in the directory mentioned.
This seems to be because the line that derives the relative path cuts it one character too far when the configured path includes a trailing slash. Replacing that line with:

```python3
relpath = os.path.split(abspath)[1]
```

seems to do the trick.

[Documentation] How to tap from s3ninja

I use s3ninja to serve my files; it emulates S3. s3ninja has a host/port and username/password.

It is not obvious how to use it with this tap.

You mention smart_open, and smart_open supports this URL format (no surprises here):

`s3://my_key:my_secret@my_server:my_port@my_bucket/my_key`

but tap-spreadsheets-anywhere takes "my_key:my_secret@my_server:my_port@my_bucket" as the bucket name.

I think the documentation is non-existent here.
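For what it's worth, a sketch of how smart_open can be pointed at an S3-compatible endpoint such as s3ninja by passing a preconfigured boto3 client (the endpoint URL and credentials are placeholders, and this assumes a smart_open version that accepts a client in transport_params):

```python3
import boto3
import smart_open

client = boto3.client(
    "s3",
    endpoint_url="http://localhost:9444/s3",  # placeholder s3ninja endpoint
    aws_access_key_id="my_key",
    aws_secret_access_key="my_secret",
)

with smart_open.open("s3://my_bucket/my_key", transport_params={"client": client}) as f:
    print(f.readline())
```

The documentation gap is that the tap would also need a way to pass these transport params through its config.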

Add capability of turning off discovery mode

Currently there is no provision to turn off discovery mode. Several use cases would require reading the data sources as plain strings and then cleaning the data in a transformation step.
