CATs: Content-Addressable Transformers (Pre-Alpha)

Description:

CATs is a unified distributed processing framework and web application back-end, orchestrated with Kubernetes, to be deployed on a peer-to-peer mesh network client acting as a cluster head node. CATs enables the creation of Web3 data products as decentralized services with data process verification, using existing Web2 centralized cloud service technologies (SaaS, PaaS, IaaS) in AWS, GCP, Azure, etc., without the need for a smart-contract language. CATs nurtures collaboration on products across domains, between cross-functional / multi-disciplinary teams and organizations, by Content-Addressing the means of processing (input, transformation / process, output, and infrastructure as Code [IaC]) and by using Content-Addresses as the means of data transport between services.

[Figure: Illustrated CAT]

Why CATs are useful:

  • Execution:
    • CATs execute Distributed Processes, which are split into tasks for Concurrent and/or Parallelized execution on Web2 infrastructure
  • Data Verification:
    • Content-Addresses can be used to verify data processing (input, transformation / process, output, infrastructure)
      • Enables data process re-execution via retrieval of said means using IPFS CIDs as Content-Addresses
  • Data (& Process) Lineage & Provenance:
    • Certifies the accuracy of data processing on data products and pipelines by enabling maintenance & reporting of data and process lineage & provenance as chains of evidence
  • Collaboration:
    • Cross-functional teams & organizations collaborate across domains on verifiable data processes via a UI that accepts Data Provenance records as Input; these records are also CAT Output

Content-Addressing Data Processing with IPFS:

  • IPFS CIDs (Content Identifiers) are used as content addresses that provide the means of verifying data transformation accuracy.
  • An IPFS client is used to identify and retrieve inputs, transformations, outputs, and infrastructure (as code [IaC]) by their CIDs in order to verify transformation accuracy.
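
The verification idea above can be sketched in a few lines of Python. This is a minimal illustration, not the CATs implementation: it uses a plain SHA-256 hex digest as a stand-in for an IPFS CID (real CIDs are multihash-encoded and are produced by an IPFS client), and the `transform`, `serialize`, and `verify` functions are hypothetical names introduced only for this example.

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    """Return a SHA-256 hex digest (assumption: a stand-in for an IPFS CID)."""
    return hashlib.sha256(data).hexdigest()


def transform(rows: list) -> list:
    """Hypothetical transformation: keep rows with a positive amount."""
    return [row for row in rows if row["amount"] > 0]


def serialize(rows: list) -> bytes:
    """Serialize deterministically so equal data always yields equal bytes."""
    return json.dumps(rows, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Original run: record the addresses of the input and the output.
input_rows = [{"id": 1, "amount": 5}, {"id": 2, "amount": -3}]
record = {
    "input": content_address(serialize(input_rows)),
    "output": content_address(serialize(transform(input_rows))),
}


def verify(rows: list, record: dict) -> bool:
    """Re-execute the transformation and compare addresses with the record."""
    same_input = content_address(serialize(rows)) == record["input"]
    same_output = content_address(serialize(transform(rows))) == record["output"]
    return same_input and same_output


verified = verify(input_rows, record)
tampered = verify([{"id": 1, "amount": 5}, {"id": 2, "amount": 7}], record)
```

With these definitions, `verified` holds for the original input, while `tampered` does not, because the changed row alters both the input and output addresses.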

[Figure: CATkernel Architectural Quantum]

CAT Concepts:

A CAT (Data) Pipeline's inputs (I/O Data & Transformations) produce a sequence of Bills of Content-Addressed Materials (catBOMs) that enable Data Provenance and cross-organization participation in (big) data processing using Distributed (Data) Processing frameworks.

  • Fundamental:
    • Content-Addressable Storage - a way to store information such that it can be retrieved based on its content rather than its location
    • Data Verification - a process in which data is checked for accuracy and inconsistencies before it is processed
    • Data (& Process) Lineage & Provenance
      • Data Lineage - reporting of the data lifecycle from source to destination
      • Data Provenance - a means of proving data lineage using historical records that provide the means of pipeline re-execution and data validation
    • Distributed Computing - typically the concurrent and/or parallel execution of job tasks distributed to networked computers processing data
    • Bill of Materials (BOM) - an extensive list of raw materials, components, and instructions required to construct, manufacture, or repair a product or service
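
To make the Content-Addressable Storage definition concrete, here is a minimal in-memory sketch. It assumes a SHA-256 hex digest as the address (IPFS uses CIDs instead), and the `ContentStore` class is a hypothetical name for illustration, not part of CATs or IPFS.

```python
import hashlib


class ContentStore:
    """In-memory content-addressable store keyed by SHA-256 digest (illustrative)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store data and return its address, which is derived from the content."""
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content maps to the same key
        return address

    def get(self, address: str) -> bytes:
        """Retrieve data by address and verify it still hashes to that address."""
        data = self._blobs[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content does not match its address")
        return data


store = ContentStore()
first = store.put(b"partition-0 rows")
second = store.put(b"partition-0 rows")  # same content, same address
```

Because the address is a function of the content alone, storing the same bytes twice yields one entry, and any reader can check retrieved data against the address it asked for.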

CATs Data Provenance as CAT I/O: catBOM as Provenance Record

  • catBOM - a collection of CID & URI metadata for establishing provenance that enables (re-)execution of CAT processes
    • catBOM values are modifiable I/O for CATs
      • CIDs are used to retrieve CAT Input from IPFS and to transfer it between CATs, including CATs on separate CATclusters
      • URIs identify a CATcluster's Distributed File System (FS), which is used as the Distributed DataFrame transformation cache of a Content-Addressed Dataset (CAD)
      • Current & Input BOM CIDs & BOM I/O URIs
    • [Figure: catBOM as I/O surface of CAT]
      • catBOM Contents:
        • Invoice / Content-Addressed Dataset (CAD) - a data format for Content-Addressed Data, generated by IPFS CIDing events as a collection of IPFS CIDs of Dataset Partitions and partition URIs. Partitions are generated by CAT DataFrame Partition Shuffling across the Worker Nodes of CATclusters
          • Content-Addressed Input (CAI) URI & CIDs - an input dataset for a CAT that has been content-addressed as an Invoice/CAD
          • Content-Addressed Output (CAO) URI & CIDs - an output dataset of a CAT that has been content-addressed as an Invoice/CAD
          • Invoice URI (contains CADs as CAT I/O)
        • Transformer URI & CID (CAT Object Configuration & CAT input)
          • Transformer URIs (of DataFrame Transformation cache)
  • BOMchain - Linked List of catBOMs used to create & execute CATpipes (Data Pipelines of CATs)
    • BOMchain is modifiable I/O for CATpipe
      • Can be used for data pipeline verification
    • [Figure: BOMchain as I/O surface of CATpipe]
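
The catBOM and BOMchain concepts above can be sketched as a small data structure. Everything here is an assumption made for illustration: the `CatBOM` field names, the example URIs and CID strings, and the use of a SHA-256 digest of canonical JSON in place of a real IPFS CID. The chain is modeled as a list in which each catBOM stores the address of its predecessor, which is one way to realize a linked list of catBOMs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


def address_of(payload: dict) -> str:
    """SHA-256 hex digest of canonical JSON (assumption: stand-in for a CID)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CatBOM:
    """Hypothetical catBOM layout; field names are illustrative only."""
    input_bom_cid: Optional[str]  # address of the previous catBOM, None for the first
    invoice_uri: str
    invoice_cid: str
    transformer_uri: str
    transformer_cid: str

    def cid(self) -> str:
        return address_of(asdict(self))


def build_chain(steps: list) -> list:
    """Link catBOMs so each one records the address of its predecessor."""
    chain, previous = [], None
    for step in steps:
        bom = CatBOM(input_bom_cid=previous, **step)
        chain.append(bom)
        previous = bom.cid()
    return chain


def verify_chain(chain: list) -> bool:
    """Check that every catBOM points at the address of the one before it."""
    previous = None
    for bom in chain:
        if bom.input_bom_cid != previous:
            return False
        previous = bom.cid()
    return True


# Placeholder URIs and CIDs, invented for this example.
steps = [
    {"invoice_uri": "s3://example-bucket/invoices/0", "invoice_cid": "cid-invoice-0",
     "transformer_uri": "s3://example-bucket/transformers/0", "transformer_cid": "cid-transformer-0"},
    {"invoice_uri": "s3://example-bucket/invoices/1", "invoice_cid": "cid-invoice-1",
     "transformer_uri": "s3://example-bucket/transformers/1", "transformer_cid": "cid-transformer-1"},
]
chain = build_chain(steps)
chain_ok = verify_chain(chain)

# Changing any field of an earlier catBOM breaks the link held by its successor.
tampered_chain = [replace(chain[0], invoice_cid="cid-changed"), chain[1]]
tampered_ok = verify_chain(tampered_chain)
```

The design choice to hash the whole record, including the predecessor's address, means a change anywhere upstream is visible at every later link, which is what makes the chain usable for pipeline verification.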

Next Steps:

  1. Replace S3 with IPFS Cluster and Filebase for Content-Addressable Storage in order to use the cluster workers' IPFS client

    1. Alternative, for the IPFS server bug (ipfs init --profile server): loop IPFS initialization until a public IP address is provided
  2. Implement CATnode MVP to remove the need for users to install dependencies:

    A. Options:

    • CATsVM Disk Image (Ubuntu)
    • CATsContainer

    B. Add dependencies to Terraform once CATnode exists

  3. Unit Test: BOM CID equivalence

  4. Distributed debugger for Plant(s) [SaaS(s)]

  5. Unit & Integration Tests

  6. Produce new SaaS Plants with CAT Factory

  7. CI/CD

  8. Provenance catBOM

  9. Options to Content-Address Everything
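
As one possible shape for the planned "BOM CID equivalence" unit test (step 3), the sketch below checks that two BOMs with the same content get the same address and that differing content does not. It is an assumption-laden illustration: `bom_cid` is a hypothetical helper that hashes canonical JSON with SHA-256 rather than computing a real IPFS CID, and the BOM keys are placeholders.

```python
import hashlib
import json
import unittest


def bom_cid(bom: dict) -> str:
    """Hypothetical helper: SHA-256 of canonical JSON as a stand-in for a CID."""
    canonical = json.dumps(bom, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class BomCidEquivalenceTest(unittest.TestCase):
    def test_same_content_same_cid(self):
        # Key order differs, content does not: the addresses must match.
        a = {"invoice_cid": "cid-invoice-0", "transformer_cid": "cid-transformer-0"}
        b = {"transformer_cid": "cid-transformer-0", "invoice_cid": "cid-invoice-0"}
        self.assertEqual(bom_cid(a), bom_cid(b))

    def test_different_content_different_cid(self):
        a = {"invoice_cid": "cid-invoice-0", "transformer_cid": "cid-transformer-0"}
        b = {"invoice_cid": "cid-invoice-1", "transformer_cid": "cid-transformer-0"}
        self.assertNotEqual(bom_cid(a), bom_cid(b))


# Run the tests programmatically so the example works outside a test runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BomCidEquivalenceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Canonical serialization (sorted keys, fixed separators) is what makes equivalence testable here; without it, two logically equal BOMs could serialize to different bytes and receive different addresses.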

Long-Term Vision:

  • CATs software is intended to be deployed on a peer-to-peer (p2p) mesh network client that enables products implemented with the entire Cloud Service Model to become Decentralized Cloud Services by Content-Addressing the entire service.
  • IPFS Compute will be used as a WebAssembly (WASM) module task server, broadcaster, & executor, leveraging IPFS Lite on cluster nodes for distributed processing.
