dumbo-drop

Massive Parallel Graph Builder

Status

This repository is in a frozen state. It is not being maintained or kept in sync with the tools and libraries it builds on. Even though work on this repository has been shelved, anyone interested in updating or maintaining this project should express their interest in one of the Filecoin community conversation mediums: https://github.com/filecoin-project/community#join-the-community.


When Filecoin stores CAR (Content Addressed Archive) files of UnixFS data, there are a number of retrieval features that can be used. When you pass a file to lotus to create a storage deal, that file is processed into a unixfs graph and stored in a CAR file for you.

For very large datasets it could take months or even years to process the data for storage using this method. This project allows you to process a very large amount of data with a very high degree of concurrency for storage in Filecoin using the "offline dealflow."

This project uses AWS Lambda to create massive IPLD graphs and then break them down into many .car files for storage in Filecoin.

This project also uses Dynamo tables (currently hard coded) to store intermediary information about the graph as it is built.

Setup

  • Create block bucket
    • Allow public access by unchecking "Block all public access"
  • Make sure the source data bucket allows public data access
  • Create a Dynamo table named dumbo-v2-{source-bucket-name} (a hedged CLI sketch follows this list)
    • The partition key must be "url"
    • Configure the table's capacity to be "on-demand"
  • Run arc deploy
    • Note the generated lambda function names
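
A minimal sketch of this setup with the AWS CLI (the bucket names, table name, and region below are placeholders; adjust them and the public-access settings for your own account):

aws s3api create-bucket --bucket my-dumbo-block-store --region us-west-2 \
    --create-bucket-configuration LocationConstraint=us-west-2
aws s3api put-public-access-block --bucket my-dumbo-block-store \
    --public-access-block-configuration BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false
aws dynamodb create-table --table-name dumbo-v2-my-source-bucket \
    --attribute-definitions AttributeName=url,AttributeType=S \
    --key-schema AttributeName=url,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST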

Staging

The graphs we're building are so large that we need to do them in stages.

  1. Pull Source Bucket
    • Chunk each source file, slicing it into parts if necessary.
    • Store every file part in the block store
    • Store the hash of every block in Dynamo
  2. Create CAR Files
    • Look at the dynamo table and iterate over every entry that hasn't been put in a CAR file.
    • Create the unixfs block data for the file.
    • Create a root block.
    • Write the root block and the unixfs data into the CAR file.
    • Pull every referenced Block into the CAR file.
    • Write the CAR file into a bucket for only CAR files and update every dynamo entry with a URL to the CAR file containing its data.
  3. Calculate CommP
    • Look through a bucket of CAR files and see if our CommP Dynamo Table has an entry.
    • Calculate CommP for every CAR file and insert an entry into the table.

Note that this is all designed so that you can create CAR files by other means and still use Stage 3 to calculate and store CommP for them.
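
For illustration only, an entry in the per-bucket Dynamo table might look roughly like the following. Every field other than "url" (the partition key) is a hypothetical placeholder, not the project's actual schema:

{
    "url": "https://my-source-bucket.s3.amazonaws.com/path/to/file.bin",
    "size": 10485760,
    "blocks": [
        "bafkreihypotheticalblockcid1",
        "bafkreihypotheticalblockcid2"
    ],
    "carUrl": "https://my-car-bucket.s3.amazonaws.com/bafyhypotheticalcarcid/part.car"
}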

CAR Schema

The CAR files generated here differ substantially from those created by the default workflows in lotus. These CAR files are designed to contain more than one file.

A typical CAR file created by lotus contains just a single unixfs file (using the default import settings). The root block of the CAR file is the root block of the unixfs DAG, which ends up being visible to users as the "Payload CID."

A CAR file generated by Dumbo Drop contains as many as a few thousand files. The root block of the CAR file is a dag-cbor block containing a list of links to every unixfs file. The unixfs files are also not generated using the default importer settings and instead reference raw blocks directly.

These CAR files are also not guaranteed to match a deterministic selector on the root block (although they occasionally may), which is what the default dealflow in lotus relies on for arriving at a common CommP. That's why the final stage is to create CommP for every CAR file.
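
As a rough sketch of that root-block layout (not the exact shape Dumbo Drop writes), a dag-cbor block listing the files contained in a CAR could be produced with the multiformats and @ipld/dag-cbor libraries like this:

// sketch only: the real root block produced by Dumbo Drop may use a different shape
import * as dagCbor from '@ipld/dag-cbor'
import * as raw from 'multiformats/codecs/raw'
import { sha256 } from 'multiformats/hashes/sha2'
import { CID } from 'multiformats/cid'

// stand-in for the root CID of one unixfs file already written into the CAR
const fileBytes = new TextEncoder().encode('example file contents')
const fileCid = CID.create(1, raw.code, await sha256.digest(fileBytes))

// the CAR's root block is a dag-cbor value holding a link to every file it contains
const rootValue = [{ name: 'example.bin', cid: fileCid }] // hypothetical entry shape
const rootBytes = dagCbor.encode(rootValue)
const rootCid = CID.create(1, dagCbor.code, await sha256.digest(rootBytes))
console.log('CAR root CID:', rootCid.toString())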

Pipeline

  1. S3 Bucket With Existing Files
  2. Run pull-bucket-v2
    • Gets list of files in S3 bucket from step 1
    • Chunks each file into IPLD blocks, which are stored in S3
    • Saves information about each file and generated IPLD Block CIDs in dynamo
  3. Run create-parts-v2
    • Reads list of files and IPLD Block CIDs from dynamo created in step 2
    • Generates CAR files from the IPLD blocks and writes them to S3

Environment Variables

  • DUMBO_COMMP_TABLE - dynamo table name used to store CommP results
  • DUMBO_CREATE_PART_LAMBDA - lambda function name used to create parts from files
  • DUMBO_PARSE_FILE_LAMBDA - lambda function name used to parse a file
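
For example (the values below are placeholders for your own resources; DUMBO_BLOCK_STORE, used in the Getting Started section below, is set the same way):

export DUMBO_COMMP_TABLE=<your-commp-table-name>
export DUMBO_CREATE_PART_LAMBDA=<your-create-parts-lambda-name>
export DUMBO_PARSE_FILE_LAMBDA=<your-parse-file-lambda-name>
export DUMBO_BLOCK_STORE=<your-block-store-bucket>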

Getting Started

This project includes support for Visual Studio Code Remote Containers, which provides everything needed for development and running in a Docker container. You can use this by installing:

  • Visual Studio Code
    • Remote Containers Extension
  • Docker Desktop (for Mac or Windows) or Docker CE (for Linux)

Once these are installed, open the folder of this project in Visual Studio Code, click the green Remote indicator in the bottom-left corner of the status bar, and choose "Reopen in Container". The first time you do this, a docker container will be built with all prerequisites - this may take a few minutes. Once the build is done, you can open a shell in the container via Terminal->New Terminal or (Control+Shift+`)

Configuration needed

Setup your aws cli

aws configure

Enable AWS SDK support

export AWS_SDK_LOAD_CONFIG=1

Deploy your lambdas

arc deploy

Take the lambda function name for GetParseFileV2 from the AWS Lambda console

export DUMBO_PARSE_FILE_LAMBDA=InitStaging-GetParseFileV2-110HY43XW9LJ

Export S3 bucket to use for IPLD Block Store

export DUMBO_BLOCK_STORE=chafey-dump-drop-test-block2

Import your data:

./cli.js pull-bucket-v2 chafey-dumbo-drop-test2 --concurrency 1 --checkHead
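
After the pull has finished, building the CAR files is the analogous step. Assuming the CLI exposes the create-parts-v2 stage under the same name as the pipeline step above (check the project's cli.js for the exact command and flags), it would look something like:

./cli.js create-parts-v2 chafey-dumbo-drop-test2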

IAM Policy for lambda functions

AWS requires granting your lambda functions access to s3 and dynamodb. You can grant full access using the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ExampleStmt",
            "Action": [
                "s3:*",
                "dynamodb:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        }
    ]
}
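
If you save the policy above as dumbo-policy.json, one way to apply it is as an inline policy on the IAM role your lambdas execute under (the role name below is a placeholder; find the real one in the IAM or Lambda console):

aws iam put-role-policy --role-name <your-lambda-role> --policy-name dumbo-drop-access --policy-document file://dumbo-policy.json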

Design notes

  • S3 currently has an upper limit on requests per second for a given prefix (or directory). To avoid these limits, we store each block and CAR file in a folder named after the CID for that block/CAR file. This strategy ensures we don't hit the upper limits, as CIDs are essentially random numbers since they mainly contain the output of the block hash function.
  • We have an upper limit of 912MB or 2000 files on a CAR file, whichever is hit first (see the sketch after these notes).
    • The 2000 file limit is to ensure we have enough space in one block to store all CIDs for the CAR file. In the case of many small files, we would exceed the max block size of 1GB just storing the CIDs for the blocks.
    • The 912MB limit ensures that we are under 1GB in size when including CAR file overhead, unixfs overhead, and some padding needed for commp generation.
  • DynamoDB is designed for relatively small items (~10k). For large files, our entries get much larger than this (approaching the 400k item limit).
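
As a simple illustration of how these limits might be applied when filling a CAR file (the constants come from the notes above; the helper itself is a hypothetical sketch, not the project's code):

// limits described in the design notes above
const MAX_PART_SIZE = 912 * 1024 * 1024 // 912MB of data per CAR file
const MAX_FILES_PER_PART = 2000         // keeps the root block's CID list within the max block size

// hypothetical helper: decide whether the CAR part being built is full
const partIsFull = (byteSize, fileCount) =>
  byteSize >= MAX_PART_SIZE || fileCount >= MAX_FILES_PER_PART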

Contributors

  • mikeal
  • chafey
  • rvagg
