
Google Cloud Datacast Solution

Ingest high-performance exchange feeds into Google Cloud

This is not an official Google product or service

Introduction

The Datacast transcoder is a schema-driven, message-oriented utility to simplify the lossless ingestion of common high-performance electronic trading data formats to Google Cloud.

Electronic trading venues have specialized data representation and distribution needs. In particular, efficient message representation is a high priority due to the massive volume of transactions a venue processes. Cloud-native APIs often use JSON for message payloads, but the extra bytes required to represent messages using high-context encodings have cost implications in metered computing environments.

Unlike JSON, YAML, or even CSV, binary-encoded data is low-context and not self-describing -- the instructions for interpreting binary messages must be explicitly provided by producers separately and in advance, and followed by interpreters.

The architecture of the transcoder relies on several principal abstractions, detailed below:

Schema

A schema (also known as a data dictionary) is similar to an API specification, but instead of describing API endpoint contracts, it describes the representative format of binary messages that flow between systems. The closest comparison might be drawn with table definitions supported by SQL Data Definition Language, but these schemas are used for data in-motion as well as data at-rest.

The transcoder's current input schema support is for Simple Binary Encoding (SBE) XML as well as QuickFIX-styled FIX protocol schema representations (also in XML).

Target schema and data elements are rendered based on the specified output_type. With no output type specified, the transcoder defaults to displaying the YAML representation of transcoded messages to the console, and does not perform persistent schema transformations. For Avro and JSON, the transcoded schema and data files are encapsulated in POSIX files locally. Direct transcoding to BigQuery and Pub/Sub targets is supported, with the transcoded schemas being applied prior to message ingestion or publishing. Terraform configurations for BigQuery and Pub/Sub resources can also be derived from a specified input schema. The Terraform options only render the configurations locally and do not execute terraform apply. The --create_schemas_only option transcodes schemas in isolation for the other output types.
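
As an illustration, here are hedged sketches (the schema and capture file names are placeholders) of transcoding to local Avro files and of rendering only the Terraform configuration for Pub/Sub resources:

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --output_type avro --output_path avroOut

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --output_type pubsub_terraform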

The names of the output resources will individually correspond to the names of the message types defined in the input schema. For example, the transcoder will create and use a Pub/Sub topic named "NewOrderSingle" for publishing FIX NewOrderSingle messages found in source data. Similarly, if an output type of bigquery is selected, the transcoder will create a NewOrderSingle table in the dataset specified by --destination_dataset_id. By default, Avro and JSON encoded output will be saved to a file named <message type> with the respective extensions in a directory specified using the --output_path parameter.
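
For example, a hedged invocation (the file names, project ID, and dataset ID are placeholders) that would create one BigQuery table per message type defined in the schema:

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --output_type bigquery --destination_project_id my-project --destination_dataset_id market_data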

Message

A message represents a discrete interaction between two systems sharing a schema. Each message will conform to a single message type as defined in the schema. Specific message types can be included or excluded for processing by passing a comma-delimited string of message type names to the --message_type_exclusions and --message_type_inclusions parameters.
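
For example, a hedged sketch (file names are placeholders) that restricts processing to two ITCH message types:

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --message_type_inclusions stock_directory,stock_trading_action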

Encoding

Encodings describe how the contents of a message payload are represented to systems. Many familiar encodings, such as JSON, YAML or CSV, are self-describing and do not strictly require that applications use a separate schema definition. However, binary encodings such as SBE, Avro and Protocol Buffers require that applications employ the associated schema in order to properly interpret messages.

The transcoder's supported inbound encodings are SBE binary and ASCII-encoded (tag=value) FIX. Outbound encodings for Pub/Sub message payloads can be Avro binary or Avro JSON. Local files can be generated in either Avro or JSON.

The transcoder supports base64 decoding of messages using the --base64 and --base64_urlsafe options.
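
For instance, a hedged sketch (the source file name is a placeholder) that decodes base64-encoded, line-delimited FIX messages before transcoding:

txcode --schema_file FIX50SP2.CME.xml --source_file messages.b64 --source_file_format_type line_delimited --factory fix --base64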

Transport

A message transport describes the mechanism for transferring messages between systems. This can be data-in-motion, such as an Ethernet network, or data-at-rest, such as a file living on a POSIX filesystem or an object residing within cloud storage. Raw message bytes must be unframed from a particular transport, such as length-delimited files or packet capture files.

The transcoder's currently supported inbound message source transports are PCAP files, length-delimited binary files, and newline-delimited ASCII files. Multicast UDP and Pub/Sub inbound transports are on the roadmap.

Outbound transport options are locally stored Avro and JSON POSIX files, and Pub/Sub topics or BigQuery tables. If no output_type is specified, the transcoded messages are output to the console encoded in YAML and not persisted automatically. Additionally, Google Cloud resource definitions for specified schemas can be encapsulated in Terraform configurations.
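
As an illustration, a hedged sketch of reading a PCAP capture and publishing the transcoded messages to Pub/Sub (the schema file, capture file, factory choice, and project ID are all placeholders):

txcode --schema_file schema.xml --source_file capture.pcap --source_file_format_type pcap --factory cme --output_type pubsub --output_encoding binary --destination_project_id my-project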

Message factory

A message factory takes a message payload read from the input source, determines the associated message type from the schema to apply, and performs any adjustments to the message data prior to transcoding. For example, a message producer may use non-standard SBE headers or metadata that you would like to remove or transform. For standard FIX tag/value input sources, the included fix message factory may be used.
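
For example, a minimal sketch (file names are placeholders) that applies the included fix message factory to a newline-delimited tag=value FIX file:

txcode --schema_file FIX50SP2.CME.xml --source_file secdef.dat --source_file_format_type line_delimited --factory fix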

CLI usage

usage: txcode  [-h] [--factory {cme,itch,memx,fix}]
               [--schema_file SCHEMA_FILE] [--source_file SOURCE_FILE]
               [--source_file_encoding SOURCE_FILE_ENCODING]
               --source_file_format_type
               {pcap,length_delimited,line_delimited,cme_binary_packet}
               [--base64 | --base64_urlsafe]
               [--fix_header_tags FIX_HEADER_TAGS]
               [--fix_separator FIX_SEPARATOR]
               [--message_handlers MESSAGE_HANDLERS]
               [--message_skip_bytes MESSAGE_SKIP_BYTES]
               [--prefix_length PREFIX_LENGTH]
               [--message_type_exclusions MESSAGE_TYPE_EXCLUSIONS | --message_type_inclusions MESSAGE_TYPE_INCLUSIONS]
               [--sampling_count SAMPLING_COUNT] [--skip_bytes SKIP_BYTES]
               [--skip_lines SKIP_LINES] [--source_file_endian {big,little}]
               [--output_path OUTPUT_PATH]
               [--output_type {diag,avro,fastavro,bigquery,pubsub,bigquery_terraform,pubsub_terraform,jsonl,length_delimited}]
               [--error_output_path ERROR_OUTPUT_PATH]
               [--lazy_create_resources] [--frame_only] [--stats_only]
               [--create_schemas_only]
               [--destination_project_id DESTINATION_PROJECT_ID]
               [--destination_dataset_id DESTINATION_DATASET_ID]
               [--output_encoding {binary,json}]
               [--create_schema_enforcing_topics | --no-create_schema_enforcing_topics]
               [--continue_on_error]
               [--log {notset,debug,info,warning,error,critical}] [-q] [-v]

Datacast Transcoder process input arguments

options:
  -h, --help            show this help message and exit
  --continue_on_error   Indicates if an exception file should be created, and
                        records continued to be processed upon message level
                        exceptions
  --log {notset,debug,info,warning,error,critical}
                        The default logging level
  -q, --quiet           Suppress message output to console
  -v, --version         show program's version number and exit

Input source arguments:
  --factory {cme,itch,memx,fix}
                        Message factory for decoding
  --schema_file SCHEMA_FILE
                        Path to the schema file
  --source_file SOURCE_FILE
                        Path to the source file
  --source_file_encoding SOURCE_FILE_ENCODING
                        The source file character encoding
  --source_file_format_type {pcap,length_delimited,line_delimited,cme_binary_packet}
                        The source file format
  --base64              Indicates if each individual message extracted from
                        the source is base 64 encoded
  --base64_urlsafe      Indicates if each individual message extracted from
                        the source is base 64 url safe encoded
  --fix_header_tags FIX_HEADER_TAGS
                        Comma delimited list of fix header tags
  --fix_separator FIX_SEPARATOR
                        The unicode int representing the fix message separator
  --message_handlers MESSAGE_HANDLERS
                        Comma delimited list of message handlers in priority
                        order
  --message_skip_bytes MESSAGE_SKIP_BYTES
                        Number of bytes to skip before processing individual
                        messages within a repeated length delimited file
                        message source
  --prefix_length PREFIX_LENGTH
                        How many bytes to use for the length prefix of length-
                        delimited binary sources
  --message_type_exclusions MESSAGE_TYPE_EXCLUSIONS
                        Comma-delimited list of message types to exclude when
                        processing
  --message_type_inclusions MESSAGE_TYPE_INCLUSIONS
                        Comma-delimited list of message types to include when
                        processing
  --sampling_count SAMPLING_COUNT
                        Halt processing after reaching this number of
                        messages. Applied after all Handlers are executed per
                        message
  --skip_bytes SKIP_BYTES
                        Number of bytes to skip before processing the file.
                        Useful for skipping file-level headers
  --skip_lines SKIP_LINES
                        Number of lines to skip before processing the file
  --source_file_endian {big,little}
                        Source file endianness

Output arguments:
  --output_path OUTPUT_PATH
                        Output file path. Defaults to avroOut
  --output_type {diag,avro,fastavro,bigquery,pubsub,bigquery_terraform,pubsub_terraform,jsonl,length_delimited}
                        Output format type
  --error_output_path ERROR_OUTPUT_PATH
                        Error output file path if --continue_on_error flag
                        enabled. Defaults to errorOut
  --lazy_create_resources
                        Flag indicating that output resources for message
                        types should be only created as messages of each type
                        are encountered in the source data. Default behavior
                        is to create resources for each message type before
                        messages are processed. Particularly useful when
                        working with FIX but only processing a limited set of
                        message types in the source data
  --frame_only          Flag indicating that transcoder should only frame
                        messages to an output source
  --stats_only          Flag indicating that transcoder should only report on
                        message type counts without parsing messages further
  --create_schemas_only
                        Flag indicating that transcoder should only create
                        output resource schemas and not output message data

Google Cloud arguments:
  --destination_project_id DESTINATION_PROJECT_ID
                        The Google Cloud project ID for the destination
                        resource

BigQuery arguments:
  --destination_dataset_id DESTINATION_DATASET_ID
                        The BigQuery dataset for the destination. If it does
                        not exist, it will be created

Pub/Sub arguments:
  --output_encoding {binary,json}
                        The encoding of the output
  --create_schema_enforcing_topics, --no-create_schema_enforcing_topics
                        Indicates if Pub/Sub schemas should be created and
                        used to validate messages sent to a topic

Message handlers

txcode supports the execution of message handler classes that can be used to statefully mutate in-flight streams and messages. For example, TimestampPullForwardHandler looks for a seconds-styled ITCH message (which informs the stream of the prevailing epoch second to apply to subsequent messages) and appends the latest value from it to every subsequent message, until the next seconds message appears. This helps individual messages be persisted with absolute timestamps that require less context to interpret (i.e., outbound messages contain more than just "nanoseconds past midnight" for a timestamp).
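
A hedged sketch of applying this handler to an ITCH capture (the file names mirror the examples below):

txcode --source_file 12302019.NASDAQ_ITCH50 --schema_file totalview-itch-50.xml --source_file_format_type length_delimited --factory itch --message_handlers TimestampPullForwardHandler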

Another handler is SequencerHandler, which appends a sequence number to all outbound messages. This is useful when processing bulk messages in length-delimited storage formats where the IP packet headers containing the original sequence numbers have been stripped.

FilterHandler lets you filter output based upon a specific property of a message. A common use for this is to filter messages pertaining only to a particular security identifier or symbol.

Here is a combination of transcoding invocations that can be used to shard a message universe by trading symbol. First, the mnemonic trading symbol identifier (stock) must be used to find its associated integer security identifier (stock_locate) from the stock_directory message. stock_locate is the identifier included in every relevant message (as opposed to stock, which is absent from certain message types):


txcode --source_file 12302019.NASDAQ_ITCH50 --schema_file totalview-itch-50.xml --message_type_inclusions stock_directory --source_file_format_type length_delimited --factory itch --message_handlers FilterHandler:stock=SPY --sampling_count 1

authenticity: P
etp_flag: Y
etp_leverage_factor: null
financial_status_indicator: ' '
inverse_indicator: null
ipo_flag: ' '
issue_classification: Q
issue_subtype: E
luld_reference_price_tier: '1'
market_category: P
round_lot_size: 100
round_lots_only: N
short_sale_threshold_indicator: N
stock: SPY
stock_locate: 7451
timestamp: 11354508113636
tracking_number: 0

INFO:root:Sampled messages: 1
INFO:root:Message type inclusions: ['stock_directory']
INFO:root:Source message count: 7466
INFO:root:Processed message count: 7451
INFO:root:Transcoded message count: 1
INFO:root:Processed schema count: 1
INFO:root:Summary of message counts: {'stock_directory': 7451}
INFO:root:Summary of error message counts: {}
INFO:root:Message rate: 53260.474108 per second
INFO:root:Total runtime in seconds: 0.140179
INFO:root:Total runtime in minutes: 0.002336

Taking the value of the field stock_locate from the above message allows us to filter all messages for that field/value combination. In addition, we can append a sequence number to all transcoded messages that are output. The below combination returns the original stock_directory message we used to look up the stock_locate code, as well as the next two messages in the stream that have the same value for stock_locate:


txcode --source_file 12302019.NASDAQ_ITCH50 --schema_file totalview-itch-50.xml --source_file_format_type length_delimited --factory itch --message_handlers FilterHandler:stock_locate=7451,SequencerHandler --sampling_count 3 

authenticity: P
etp_flag: Y
etp_leverage_factor: null
financial_status_indicator: ' '
inverse_indicator: null
ipo_flag: ' '
issue_classification: Q
issue_subtype: E
luld_reference_price_tier: '1'
market_category: P
round_lot_size: 100
round_lots_only: N
sequence_number: 1
short_sale_threshold_indicator: N
stock: SPY
stock_locate: 7451
timestamp: 11354508113636
tracking_number: 0

reason: ''
reserved: ' '
sequence_number: 2
stock: SPY
stock_locate: 7451
timestamp: 11355134575401
tracking_number: 0
trading_state: T

reg_sho_action: '0'
sequence_number: 3
stock: SPY
stock_locate: 7451
timestamp: 11355134599149
tracking_number: 0

INFO:root:Sampled messages: 3
INFO:root:Source message count: 23781
INFO:root:Processed message count: 23781
INFO:root:Transcoded message count: 3
INFO:root:Processed schema count: 21
INFO:root:Summary of message counts: {'system_event': 1, 'stock_directory': 8906, 'stock_trading_action': 7437, 'reg_sho_restriction': 7437, 'market_participant_position': 0, 'mwcb_decline_level': 0, 'ipo_quoting_period_update': 0, 'luld_auction_collar': 0, 'operational_halt': 0, 'add_order_no_attribution': 0, 'add_order_attribution': 0, 'order_executed': 0, 'order_executed_price': 0, 'order_cancelled': 0, 'order_deleted': 0, 'order_replaced': 0, 'trade': 0, 'cross_trade': 0, 'broken_trade': 0, 'net_order_imbalance': 0, 'retail_price_improvement_indicator': 0}
INFO:root:Summary of error message counts: {}
INFO:root:Message rate: 80950.257512 per second
INFO:root:Total runtime in seconds: 0.293773
INFO:root:Total runtime in minutes: 0.004896


The syntax for handler specifications is:

<Handler1>:<Handler1Parameter>=<Handler1Value>,<Handler2>

Message handlers are deployed in transcoder/message/handler/.

Installation

If you are a user looking to use the CLI or library without making changes, you can install the Market Data Transcoder from PyPI using pip:

pip install market-data-transcoder

After the pip installation, you can validate that the transcoder is available by the following command:

txcode --help

Developers

If you are looking to extend the functionality of the Market Data Transcoder:

cd market-data-transcoder
pip install -r requirements.txt

After installing the required dependencies, you can run the transcoder with the following:

export PYTHONPATH=`pwd`
python ./transcoder/main.py --help


market-data-transcoder's Issues

Employ sane defaults around JSON and console logging

Ideas around revising console logging of JSON messages.

Currently, if neither --quiet nor output_type is specified, the transcoder will default to writing Avro files to --output_path or avroOut in the local directory. In parallel, JSON representations of the transcoded messages are written to standard output.

The behavior should be adjusted to:

  • If no output_type is specified to the CLI, default to streaming JSON to standard out. This means that writing Avro files no longer is the default and must be explicitly chosen, like bigquery and pubsub are now.
  • The design should now treat stdout as the equivalent of an output path
  • The default encoding if stdout is chosen should be JSON vs. any binary format (enforce this?)
  • If the minimum parameters to run the transcoder now are passed (schema file, source file, factory), the result would be JSON streamed to console. This would allow one to use the transcoder even without a disk to write to (see the sketch below)
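
Under the proposed behavior, a minimal invocation might look like the following sketch (file names are placeholders; --source_file_format_type is shown for completeness):

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch

With no --output_type specified, JSON-encoded messages would stream to standard output.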

Support pcap-ng format

Current pcap source input format type only supports classic PCAP and not PCAP-NG. This may entail a move to scapy from dpkt within the PcapFileMessageSource class.

`--message_skip_chars`

Similarly to how --message_skip_bytes is used for each message found in a LengthDelimitedFileMessageSource to ignore things like UDP header information, --message_skip_chars would trim the first n characters from messages arriving via LineDelimitedFileMessageSource.
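
A hedged sketch of how the proposed option might be invoked (the --message_skip_chars flag does not exist yet; file names are placeholders):

txcode --schema_file FIX50SP2.CME.xml --source_file messages.txt --source_file_format_type line_delimited --factory fix --message_skip_chars 8

This would drop the first 8 characters of each line before the message is handed to the decoder.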

Print interim statistics to STDERR on demand

One could run the transcoder with debug options, but this slows down processing since the output is a function of a file. In some instances, one might want to see where a txcode process is with its current batch, non-destructively. One way to do this might be trapping the kill -s INFO signal in the main thread and then dumping a snapshot of the progress up to that point. The dd command has a similar feature:

Sending an `INFO' signal to a running `dd' process makes it print
I/O statistics to standard error and then resume copying.  In the
example below, `dd' is run in the background to copy 10 million blocks.
The `kill' command makes it output intermediate I/O statistics, and
when `dd' completes, it outputs the final statistics.

     $ dd if=/dev/zero of=/dev/null count=10MB & pid=$!
     $ kill -s INFO $pid; wait $pid
     3385223+0 records in
     3385223+0 records out
     1733234176 bytes (1.7 GB) copied, 6.42173 seconds, 270 MB/s
     10000000+0 records in
     10000000+0 records out
     5120000000 bytes (5.1 GB) copied, 18.913 seconds, 271 MB/s

When `source_file` is absent or `-`, read bytes from STDIN

This would enable the transcoder to be run in streaming mode for POSIX inputs, for example:

http -pb 'https://www.cmegroup.com/ftp/SBEFix/Production/secdef.dat.gz' \
| gunzip - \
| txcode --schema_file ../datacast/transcoder/test/FIX50SP2.CME.xml --factory fix --source_file_format_type line_delimited

ApplID: '310'
Asset: ES
CFICode: FFIXSX
Currency: USD
DisplayFactor: 0.01
HighLimitPrice: 456775.0
LastUpdateTime: '20221120170356000585'
LowLimitPrice: 400425.0
MarketSegmentID: '64'
MatchAlgorithm: F
MatchEventIndicator: '00000000'
MaturityMonthYear: '202409'
MaxPriceVariation: 600.0
MaxTradeVol: 3000.0
MinPriceIncrement: 25.0
MinPriceIncrementAmount: 12.5
MinTradeVol: 1.0
MsgType: SECURITYDEFINITION
NoEvents:
- EventTime: '20220617133000000000'
  EventType: ACTIVATION
- EventTime: '20240920133000000000'
  EventType: LAST_ELIGIBLE_TRADE_DATE
NoInstrAttrib:
- InstrAttribType: TRADE_TYPE_ELIGIBILITY_DETAILS_FOR_SECURITY
  InstrAttribValue: '00000000000001000010000000001111'
NoLotTypeRules: '0'
NoMDFeedTypes:
- MDFeedType: GBX
  MarketDepth: 10
SecurityExchange: XCME
SecurityGroup: ES
SecurityID: '118'
SecurityIDSource: EXCHANGE_SYMBOL
SecurityType: FUTURE
SecurityUpdateAction: ADD
SettlPriceType: Final Actual Settlement at Clearing Tick
Symbol: ESU4
TradingReferenceDate: '20221123'
TradingReferencePrice: 428600.0
UnderlyingProduct: 5
UnitOfMeasure: Index Points
UnitOfMeasureQty: 50.0
UserDefinedInstrument: N

Propagate `required` fields to JSON schema from source schema

JsonOutputManager would need to keep track of required fields as defined in the source / dest schema and transcode appropriately. For JSON Schema, each object supports an array property named required with the values being the defined required properties in the current object definition.

Additional CLI options for length-prefixed message output and endianness

LengthDelimitedOutputManager takes --prefix_length to determine the size in bytes of the length prefix to output, but endianness is currently hard-coded. Additionally, the prefix length read by LengthDelimitedFileMessageSource should be optioned separately from the prefix length written out. Hence --source_prefix_length should become an option for LengthDelimitedFileMessageSource and --output_prefix_length should become an option instrumenting LengthDelimitedOutputManager.
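
A hedged sketch contrasting today's single option with the proposed split (the --source_prefix_length and --output_prefix_length flags do not exist yet; file names are placeholders). The first invocation uses the current --prefix_length for both reading and writing; the second shows the proposed separate source and output prefix lengths:

txcode --schema_file schema.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --prefix_length 2 --output_type length_delimited

txcode --schema_file schema.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --source_prefix_length 2 --output_prefix_length 4 --output_type length_delimited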

Support protocol buffers for output encoding

Would require similar mapping that's been done for output Avro encodings. This should enable us to support all available output transports and upgrade to the BigQuery Storage Write API for enhanced performance.

Determine how to handle SBE fields that are defined as anonymous byte arrays

Some binary feed specifications have field lengths that do not correspond to the existing SBE primitive types - for example, a 6 byte integer with a field definition resembling:

<type name="timestamp_nanos_midnight" primitiveType="uint8" length="6"/>

To handle these, one can use a uint8 with a length of 6 as above, but the codec doesn't know how to interpret this properly as an integer, and I've not seen a way to instruct it to do so. It seems as if these are interpreted as UTF-8 by the codec.

For field definitions of this kind, there should be a way to override or send a hint to the codec on how the field should be interpreted.

Automatically handle b64url decoding

The current --base64 decoding capability doesn't automatically handle b64url decoding of payloads. Some systems, such as Pub/Sub, return message payloads as base64url-encoded strings (which are safe for use in URLs and filenames). The transcoder should detect the encoding from the payload contents and use the proper codec.

Currently, in order to properly decode a Pub/Sub message payload delivered by gcloud CLI, the b64url encoding needs to be explicitly replaced with b64 classic equivalents, for example:

gcloud pubsub subscriptions pull ${SUBSCRIPTION_NAME}  --format=json | json -a .message.data | tr '_' '/' | tr '-' '+'

Without the tr on the message.data string returned from Pub/Sub, the transcoder will fail to base64 decode the original message bytes.

Refrain from transcoding SBE field names in snake_case

I believe there is an option in the SBE decoding library that snake-cases all of the decoded field names; this should be suppressed, defaulting to verbatim transcoding of the field name as specified in the schema.

Add option to interleave messages to an output destination

In lieu of sharding output destinations by message type, support interleaved output of all message types to a single destination file, topic, etc.

This has some implications for schema generation and most obviously would not work for BigQuery output types. It would also require that Pub/Sub topic schemas or Avro outputs be handled differently in this mode, perhaps using the Union option in Avro.

Add a handler that appends a static date field to all outbound messages

Some feeds only provide timestamps as duration past midnight, assuming prior context of what day the processed messages are from. Since the messages themselves don't have this context, it would be useful to be able to append a statically specified date, convert that to Unix milliseconds, and append it as a manufactured int field to outbound messages. This would reduce friction for downstream SQL analytics on messages ingested from these feeds.

e.g. --message_handlers AppendDateHandler:date=20291102

would turn that date into its UNIX seconds equivalent and append it as a column to all outbound messages.

Schemas with repeating values at different levels in the message graph are incompatible with Avro

When running with the following set of options:

wget -q -O - ftp://ftp.cmegroup.com/SBEFix/Production/secdef.dat.gz | gunzip - | txcode \
 --schema_file ~/src/datacast/transcoder/test/FIX50SP2.CME.xml \
 --factory fix \
 --source_file_format_type line_delimited \
 --message_type_inclusions=SecurityDefinition \
 --fix_header_tags 8,9,35,1128,49,56,34,52,10 \
 --destination_project_id $(gcloud config get-value project) \
 --output_type pubsub \
 --output_encoding json \
 --lazy_create_resources

The following error is thrown:

google.api_core.exceptions.InvalidArgument: 400 AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema. [detail: "[ORIGINAL ERROR] generic::invalid_argument: AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema. [google.rpc.error_details_ext] { message: \"AVRO schema definition is not valid: sbeMessage.NoLotTypeRules exists twice in schema.\" }"

This is likely because of the complex schema defined in FIX50SP2.CME.xml, where the same field name may be present at different levels of the entity hierarchy. A graph such as Object.Property1 and Object.Property2.Property1 appears to be incompatible with Avro, but is commonly encountered within legitimate FIX schema definitions.

It's notable that BigQuery output types do not exhibit this behavior, but the fastavro output type to local POSIX files does as well:

fastavro._schema_common.SchemaParseException: redefined named type: sbeMessage.NoLotTypeRules

Configurable sequence number field name in `SequencerHandler`

The manufactured field name for sequence numbers in SequencerHandler is hard-coded to sequence_number. For FIX presentation this is not optimal. User should be able to pass this value through either via CLI (meta?)option or otherwise through the environment.

This is related to issue #76 and should probably be the illustrative use case for that implementation.

Length threshold for PCAP processing is fixed to an arbitrary value

Decide whether to:

  • Remove the notion of any threshold whatsoever and handle all byte arrays without filtering on size.
  • Use a heuristic like "larger than the number of bytes in the message header definition for the particular encoding" (e.g. sum of the bytes in the fields comprising the messageHeader component for SBE). This seems like the best case long term.
  • Allow for length_threshold to flow through from the user surface

Filing this as a bug since 50 bytes is likely too high of a fixed default.

Add YAML as a console output diagnostic notation option

An option named --human or similar would be used to visually check transcoding from sources by writing YAML to the console.

no_md_entries:
- md_entry_px:
    exponent: -9
    mantissa: 2191000000000
  md_entry_size: 4
  md_entry_type: Offer
  md_price_level: 1
  md_update_action: New
  number_of_orders: 1
  rpt_seq: 2328011
  security_id: 123890
  tradeable_size: 0
no_order_id_entries:
- md_display_qty: 4
  md_order_priority: 77606680730
  order_id: 8040595458543
  order_update_action: New
  reference_id: 1
transact_time: 1663632017199755749

Add capability to shard message output based on values found in messages

Currently, the transcoder emits messages to output sources based on the message type specified in the schema. This would add a capability to further filter messages based on input dimensions observed, sharding them to separate outputs. For example, if you specify a list of --shard_keys fully qualified with the parent message type, it would manufacture a file/table/topic for each distinct value found in that dimension. So

--shard_keys SnapshotFullRefreshTcpLongQty.SecurityID

would produce and populate a separate persistent resource for each unique value of that dimension found in the input data.

Add support for runtime parameter values when using Message Handlers

Currently, MessageHandler specific configurations must be hard-coded within the handler module itself. There should be a syntax for specifying user parameters at runtime through the command line syntax. Something like:

message_handlers=TimestampPullForwardHandler:timestamp_field=transact_time

Or potentially a <MessageHandlerSubclassName>.json file could be automatically checked and the dictionary passed into the MessageHandler constructor? This might give the most flexibility without rendering the CLI syntax too unwieldy.

Add option to `--replace` cloud resources if they already exist

With the --replace option, the transcoder will not try to use existing cloud resources that correspond to message types in the schema (schemas, topics, BigQuery tables), rather it will delete and replace them with new ones derived from the schema represented in --schema_file. This would be a destructive action, so would have to be explicit and used with care.
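
A hedged sketch of the proposed usage (the --replace flag does not exist yet; file names, project ID, and dataset ID are placeholders):

txcode --schema_file totalview-itch-50.xml --source_file capture.bin --source_file_format_type length_delimited --factory itch --output_type bigquery --destination_project_id my-project --destination_dataset_id market_data --replace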

Support dynamic handler registration through CLI options

Currently, custom handlers need to be authored and put into the source tree under transcoder/message/handler, in addition to modifying the __init__.py in the directory to register the new handler. It would be convenient to be able to store handler modules in arbitrary places addressed by a CLI option in order to dynamically load handler componentry

Add full URL support for the `--source_file` and `--schema_file` options

The --source_file and --schema_file options currently imply file:// if using fully qualified URL syntax. Other protocols, such as http://, ftp:// and ws:// should be possible to resolve in this way, although ws:// would only be applicable as a protocol under --source_file and not --schema_file and would need to be implemented in a manner similar to the recent stdin support addition.

Example:

txcode --schema_file https://raw.githubusercontent.com/SunGard-Labs/fix2json/master/dict/FIX50SP2.CME.xml --source_file ftp://ftp.cmegroup.com/SBEFix/Production/TradingSessionList.dat --factory fix --source_file_format_type line_delimited

Add support for SBE `ref`

The transcoder's SBE schema interpretation does not support ref attributes for embedding composite types or enums as fields in a message type definition.
