CATs: Content-Addressable Transformers (Pre-Alpha)
CATs is a unified distributed processing framework and web application back-end, orchestrated with Kubernetes, to be deployed on a peer-to-peer mesh network client acting as a cluster head node. CATs enables the creation of Web3 data products as decentralized services with data process verification, using existing Web2 centralized cloud service technologies (SaaS, PaaS, IaaS) in AWS, GCP, Azure, etc., without the need for a smart-contract language. CATs nurtures collaboration across domains between cross-functional / multi-disciplinary teams and organizations by Content-Addressing the means of processing (input, transformation / process, output, infrastructure [as Code (IaC)]) and using Content-Addresses as the means of data transport between services.
- Execution:
- CATs execute Distributed Processes which are distributed as tasks for Concurrent and/or Parallelized execution on Web2 infrastructure
- Data Verification:
- Content-Addresses can be used to verify data processing (input, transformation / process, output, infrastructure)
- Enables data process re-execution via retrieval of said means using IPFS CIDs as Content-Addresses
- Data (& Process) Lineage & Provenance:
- Certifies the accuracy of data processing on data products and pipelines by enabling maintenance & reporting of data and process lineage & provenance as chains of evidence
- Collaboration:
- Enables cross-functional teams & organizations to collaborate across domains on verifiable data processes via a UI that accepts Data Provenance records as Input, which are also CAT Output
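The Execution and Data Verification points above can be illustrated with a minimal sketch. It assumes a hex SHA-256 digest as a stand-in content address (not a real IPFS CID), a thread pool in place of a cluster, and a hypothetical `transform` function; none of these names come from the CATs code base.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def address(data: bytes) -> str:
    """Stand-in content address: hex SHA-256 digest (assumption, not a real IPFS CID)."""
    return hashlib.sha256(data).hexdigest()


def transform(partition: bytes) -> bytes:
    """Hypothetical transformation: upper-case the bytes of one partition."""
    return partition.upper()


def run(partitions: list[bytes]) -> dict:
    """Run the transformation on every partition concurrently and record addresses."""
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(transform, partitions))  # map preserves input order
    return {
        "inputs": [address(p) for p in partitions],
        "outputs": [address(o) for o in outputs],
    }


def verify(partitions: list[bytes], record: dict) -> bool:
    """Re-execute the process and check that it reproduces the recorded addresses."""
    return run(partitions) == record


record = run([b"alpha", b"beta", b"gamma"])
print(verify([b"alpha", b"beta", b"gamma"], record))   # True: same inputs, same process
print(verify([b"alpha", b"BETA!", b"gamma"], record))  # False: an input was altered
```

Because the record holds only addresses, anyone who can retrieve the inputs and the transformation can repeat the check without trusting the party that ran the process first.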
Content-Addressing Data Processing with IPFS:
- IPFS CIDs (Content Identifiers) are used as content addresses that provide the means of verifying data transformation accuracy.
- Given CIDs, an IPFS client is used to identify and retrieve inputs, transformations, outputs, and infrastructure (as code [IaC]) for verifying transformation accuracy
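To make the idea of a CID as a verifiable address concrete, here is a small sketch that encodes a CIDv1 for a single raw block (version byte, raw codec, sha2-256 multihash, base32 multibase). It is an illustration under stated assumptions, not how CATs computes CIDs: an IPFS node chunks files and may wrap them in other encodings, so the CID it reports for the same file can differ. The function names are hypothetical.

```python
import base64
import hashlib


def cid_v1_raw(data: bytes) -> str:
    """Encode a CIDv1 (raw codec, sha2-256) for a single block of bytes."""
    digest = hashlib.sha256(data).digest()
    multihash = bytes([0x12, len(digest)]) + digest   # 0x12 = sha2-256, then digest length
    cid_bytes = bytes([0x01, 0x55]) + multihash       # 0x01 = CID version 1, 0x55 = raw codec
    # Multibase base32: lower-case, unpadded, prefixed with "b".
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def matches(data: bytes, expected_cid: str) -> bool:
    """Check that retrieved bytes hash to the CID they were requested by."""
    return cid_v1_raw(data) == expected_cid


cid = cid_v1_raw(b"hello CATs")
print(cid[:7])                       # bafkrei (shared prefix of raw sha2-256 CIDv1 strings)
print(matches(b"hello CATs", cid))   # True
print(matches(b"hello cats", cid))   # False
```

The same comparison applies to any of the four means of processing: whatever is retrieved by CID can be re-hashed locally before it is used.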
CATs (Data) Pipeline inputs (I/O Data & Transformations) produce a sequence of Bills of Content-Addressed Materials (catBOMs) that enable Data Provenance and cross-organization participation in (big) data processing using Distributed (Data) Processing frameworks
- Fundamental:
- Content-Addressable Storage - a way to store information such that it can be retrieved based on its content rather than its location
- Data Verification - a process in which data is checked for accuracy and inconsistencies before it is processed
- Data (& Process) Lineage & Provenance
- Data Lineage - reporting of the data lifecycle from source to destination
- Data Provenance - a means of proving data lineage using historical records that provide the means of pipeline re-execution and data validation
- Distributed Computing - typically the concurrent and/or parallel execution of job tasks distributed to networked computers processing data
- Bill of Materials (BOM) - an extensive list of raw materials, components, and instructions required to construct, manufacture, or repair a product or service
- catBOM - a collection of CID & URI metadata for establishing provenance that enables (re-)execution of CAT processes
- catBOM values are modifiable I/O for CATs
- CIDs are used to retrieve CAT Input from IPFS and transfer it between CATs, including CATs on separate CATclusters
- URIs identify the CATclusters' Distributed File System (FS) used as the Distributed DataFrame transformation cache of a Content-Addressed Dataset (CAD)
- Current & Input BOM CIDs & BOM I/O URIs
- Illustration of catBOM as I/O surface of CAT
- catBOM Contents:
- Invoice / Content-Addressed Dataset (CAD) - a data format for Content-Addressed Data, generated by IPFS CIDing events as a collection of IPFS CIDs of Dataset Partitions and partition URIs. Partitions are generated by CAT DataFrame Partition Shuffling across Worker Nodes of CATclusters
- Content-Addressed Input (CAI) URI & CIDs - an input dataset for a CAT that has been content-addressed as an Invoice/CAD
- Content-Addressed Output (CAO) URI & CIDs - an output dataset of a CAT that has been content-addressed as an Invoice/CAD
- Invoice URI (contains CADs as CAT I/O)
- Transformer URI & CID (CAT Object Configuration & CAT input)
- Transformer URIs (of DataFrame Transformation cache)
- BOMchain - Linked List of catBOMs used to create & execute CATpipes (Data Pipelines of CATs)
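The catBOM and BOMchain definitions above can be sketched as a small data structure. Everything here is an assumption made for illustration: the `CatBOM` field names, the placeholder URIs and CID strings, the use of a SHA-256 digest of canonical JSON as the BOM's address, and the `input_bom` link used to form the chain. The actual catBOM schema may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class CatBOM:
    """Hypothetical catBOM layout; field names are illustrative only."""
    invoice_uri: str
    input_cids: tuple                  # CAI partition CIDs
    output_cids: tuple                 # CAO partition CIDs
    transformer_cid: str
    input_bom: Optional[str] = None    # address of the previous catBOM, if any

    def address(self) -> str:
        """Stand-in content address: SHA-256 of a canonical JSON encoding."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def walk(head: str, store: dict) -> list:
    """Follow input_bom links from the newest catBOM back to the first."""
    chain = []
    current = head
    while current is not None:
        bom = store[current]
        chain.append(bom)
        current = bom.input_bom
    return chain


# Placeholder URIs and CIDs; the second BOM consumes the first BOM's output.
first = CatBOM("s3://example-bucket/invoices/0", ("cid-in-0",), ("cid-out-0",),
               "cid-transformer-a")
second = CatBOM("s3://example-bucket/invoices/1", first.output_cids, ("cid-out-1",),
                "cid-transformer-b", input_bom=first.address())
store = {bom.address(): bom for bom in (first, second)}

print([bom.transformer_cid for bom in walk(second.address(), store)])
# ['cid-transformer-b', 'cid-transformer-a']
```

Since the address is derived from the BOM's contents, two catBOMs with identical values share one address, and changing any CID or URI yields a new address and therefore a new link in the chain.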
- Replace S3 with IPFS Cluster and Filebase for Content-Addressable Storage in order to use the cluster workers' IPFS client
- Alternative: for the IPFS server bug (`ipfs init --profile server`) - loop IPFS initialization until provided a public IP Address
- Implement CATnode MVP to remove the need for users to install dependencies:
A. Options:
- CATsVM Disk Image (Ubuntu)
- CATsContainer
B. Add dependencies to Terraform once CATnode exists
- Unit Test: BOM CID equivalence
- Distributed debugger for Plant(s) [SaaS(s)]
- Unit & Integration Tests
- Produce new SaaS Plants with CAT Factory
- CI/CD
- Provenance catBOM
- Options to Content-Address Everything
- CATs software is intended to be deployed on a peer-to-peer (p2p) mesh network client that enables products implemented with the entire Cloud Service Model to be Decentralized Cloud Services by Content-Addressing the entire service.
- IPFS Compute will be used as a WebAssembly (WASM) module task server, broadcaster, & executor, leveraging IPFS Lite on cluster nodes for distributed processing
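For the IPFS initialization workaround listed earlier (looping initialization until a public IP address is provided), one possible shape is a bounded retry loop. This is a sketch under assumptions: the nature of the bug is not described here, so initialization and address lookup are injected as hypothetical callables (`init`, `list_addrs`), and "public" is interpreted as a globally routable address found in a multiaddr string.

```python
import ipaddress
import time
from typing import Callable, Iterable, Optional


def first_public_ip(multiaddrs: Iterable[str]) -> Optional[str]:
    """Return the first globally routable IP found in multiaddr strings."""
    for addr in multiaddrs:
        parts = addr.split("/")   # e.g. ['', 'ip4', '8.8.8.8', 'tcp', '4001']
        if len(parts) > 2 and parts[1] in ("ip4", "ip6"):
            try:
                if ipaddress.ip_address(parts[2]).is_global:
                    return parts[2]
            except ValueError:
                continue          # not a parseable IP; skip it
    return None


def init_until_public(init: Callable[[], None],
                      list_addrs: Callable[[], Iterable[str]],
                      attempts: int = 5,
                      delay: float = 1.0) -> Optional[str]:
    """Call init() repeatedly until list_addrs() reports a public IP or attempts run out."""
    for _ in range(attempts):
        init()
        ip = first_public_ip(list_addrs())
        if ip is not None:
            return ip
        time.sleep(delay)
    return None


# Demo with fake callables: the third attempt reports a public address.
responses = iter([
    ["/ip4/127.0.0.1/tcp/4001"],
    ["/ip4/10.0.0.5/tcp/4001"],
    ["/ip4/8.8.8.8/tcp/4001"],
])
calls = []
ip = init_until_public(lambda: calls.append("init"), lambda: next(responses), delay=0)
print(ip, len(calls))  # 8.8.8.8 3
```

In a real deployment `init` would wrap the `ipfs init --profile server` command and `list_addrs` would read the node's advertised addresses; both are left as parameters so the loop can be tested without an IPFS installation.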