tl;dr: After discussing some of this with @diasdavid and @vmx, they suggested we post an issue here for further conversation. I'm hoping to get some feedback/guidance on our approach in the comments! Thanks in advance to anyone who can help us out 😃
## Nori's IPLD Use Case

### High level
We are creating a commodity token called a "Carbon Removal Credit" (CRC) on the
Ethereum blockchain which represents a certain amount of carbon having been
removed. The data collected during the removal process serves as proof of
removal, so each CRC will be linked to the data that proves it's
legit. Storing all that data in the Ethereum blockchain is cost prohibitive, so
IPFS, IPLD, and eventually FileCoin become interesting mechanisms for
decentralizing access to that public data.
### Example Data

The data we wish to distribute can be represented as a graph of immutable
structured nodes and binary file data (images and such). As a simple example,
let's say we have various node types including `project`, `plot`, `crc`,
`data`, and `validation`. Here are examples of what some of those might look like.
#### project node

This is a node representing a project to remove carbon dioxide from the
atmosphere. We'll call it `project-1`.
```json
{
  "node": {
    "name": "Paul's Farm",
    "location": {
      "city": "Seattle",
      "state": "WA"
    },
    "createdOn": "2018-01-14",
    "url": "http://paulsfarm.com"
  },
  "edges": {
    "owner": ["user-1"]
  }
}
```
#### land plot node

This is a node representing a plot of land belonging to the project owner. We'll
call it `plot-1`.
```json
{
  "node": {
    "name": "East Plot",
    "area": [
      {"lat": 45.2342, "long": -122.4342},
      {"lat": 45.3342, "long": -122.5342},
      {"lat": 45.3342, "long": -122.5342},
      {"lat": 45.2342, "long": -122.4342}
    ]
  },
  "edges": {
    "project": ["project-1"],
    "createdBy": ["user-1"],
    "historicalData": ["data-1", "data-2", "data-3", "data-4"]
  }
}
```
#### CRC Data node

This is a node with all the info, or links to the info, needed to validate a
certain amount of carbon dioxide having been removed.
```json
{
  "node": {
    "generationDate": "2018-02-14",
    "carbonRemoved": 1.23,
    "removalMethod": "soil",
    "measurements": {
      "complicated": "science stuff"
    }
  },
  "edges": {
    "validationRecords": ["validation-1", "validation-2"],
    "project": ["project-1"],
    "createdBy": ["user-1"],
    "plots": ["plot-1", "plot-2"]
  }
}
```
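When these documents are stored as IPLD, the human-readable identifiers in `edges` (`project-1`, `user-1`, etc.) would be replaced by CIDs of the linked nodes; in dag-json form an IPLD link is encoded as an object with a single `"/"` key. A minimal sketch of that substitution, assuming a lookup table from placeholder name to CID (the CID strings below are made up):

```javascript
// Illustrative name → CID table; these are NOT real CIDs for this data.
const cidFor = {
  "project-1": "zdpuFakeCidForProject1",
  "user-1": "zdpuFakeCidForUser1",
};

// Replace placeholder identifiers in an "edges" map with IPLD-style links,
// i.e. objects of the form {"/": "<cid>"} as used by dag-json.
function linkEdges(edges) {
  const linked = {};
  for (const [label, targets] of Object.entries(edges)) {
    linked[label] = targets.map((name) => ({ "/": cidFor[name] }));
  }
  return linked;
}

const linked = linkEdges({ project: ["project-1"], createdBy: ["user-1"] });
console.log(linked.project[0]); // { '/': 'zdpuFakeCidForProject1' }
```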
### How the data will be used

- **Linked from a non-fungible Ethereum token**
  We'll link each newly minted CRC token (just a struct in Ethereum) to one of
  the CRC data nodes, for example `crc-1`. A CID is exactly what we need for this.
- **Fetched from client-side JavaScript code**
  Browser-based Ethereum clients (MetaMask/web3) will need to look up the
  `crc-1` data by the identifier stored with the Ethereum token, so it can
  display that data on a webpage. It will probably need to fetch many nodes
  linked to from `crc-1`, such as `project-1`, `plot-1`, and `user-1`, at
  the same time and in an efficient way.
- **Queried with a wide range of query parameters from client-side JavaScript**
  Some example queries include:
  - The 100 most recently created `crc` nodes linked to `project` nodes with
    `state="WA"`.
  - The sum of all the `carbonRemoved` fields of all `crc` nodes linked to
    `project` nodes with `state="WA"` where `generationDate` is in the year
    2017.
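Until decentralized querying exists, queries like these would run against an index. Purely as a shape reference, here is the first query (and a sum over its results) expressed over an in-memory array; the field names follow the examples above, and the data and join to `projectState` are made up:

```javascript
// Hypothetical denormalized index: crc nodes joined to their project's state.
// In production this would be a Cloud DataStore / GraphQL query instead.
const crcs = [
  { id: "crc-1", createdOn: "2018-02-14", carbonRemoved: 1.23, projectState: "WA" },
  { id: "crc-2", createdOn: "2018-03-01", carbonRemoved: 0.5,  projectState: "WA" },
  { id: "crc-3", createdOn: "2018-01-02", carbonRemoved: 2.0,  projectState: "OR" },
];

// "The 100 most recently created crc nodes linked to project nodes with state=WA"
const recentWA = crcs
  .filter((c) => c.projectState === "WA")
  .sort((a, b) => b.createdOn.localeCompare(a.createdOn)) // newest first
  .slice(0, 100);

// "The sum of all the carbonRemoved fields" of those nodes
const totalWA = recentWA.reduce((sum, c) => sum + c.carbonRemoved, 0);

console.log(recentWA.map((c) => c.id)); // [ 'crc-2', 'crc-1' ]
console.log(totalWA); // 1.73
```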
### Scaling

- Near term: 10k-99k nodes
- Long term: billions of nodes
### How we are doing this now
Centralized cloud database solutions solve all of the above problems except for
the "decentralized" part. So we're just using Google's Cloud DataStore for
everything, but trying to do it in a way that is "compatible" with decentralized
technologies like IPFS:
- We write individual nodes to Cloud DataStore (basically mongoDB).
- We serialize the same nodes using `ipld-dag-cbor` and use the CID as the
  primary key when writing to Cloud DataStore.
- We also store the `dag-cbor` serialized data in a special attribute on the
  Cloud DataStore object.
- We store binary data (images and such) into Google's Cloud Storage service
  (same as s3), also keyed off the CID of that data.
- We make the data stored in Cloud DataStore available to client-side
  JavaScript via a GraphQL API.
This allows us to scale to at least hundreds of millions of nodes (assuming
Cloud DataStore actually works), perform very fast lookups and queries,
easily back up the data into cold storage, and toss it into data warehouses for
more complex analysis.
But of course people will have to rely on us continuing to pay the bills and
operate this centralized service in order to maintain access to this data. And
they will have to trust us to deliver accurate query responses. So we want to
distribute the data and the query indexes to as many people as possible as a way
to secure the data "forever". We could distribute the data by providing links to
gigantic petabyte database backup files which people can theoretically download
and do something with, but IPFS, IPLD, and FileCoin sound like better options :)
### Pathway to decentralization
Ultimately we want people to be able to use the platform we are building without
having to go through us. While that is probably quite a ways off, here are the
steps we could take in that direction:
- Make the IPLD-encoded data in our database available through a public API
  (graphql/rest/whatever) hosted on our servers.
  - Still centralized, but the data is public. CIDs ensure that the
    data we are serving is accurate, so we can't cheat.
- Mirror/pin the IPLD-encoded data in our database to a cluster of IPFS nodes that
  we host.
  - Still centralized, but now it is "easy" for others to pin the data on their
    own IPFS nodes if they want, and make it available to everybody else
    through standard IPFS APIs. Now there is an alternative to our public
    GraphQL API.
  - If our IPFS cluster goes down, we can repopulate it from the database.
  - It would cost us a lot of extra time and money to manage an IPFS cluster on
    our own. Doable, but not fun until someone comes along with IPFS
    cluster-as-a-service. Then it would just cost us money, but not time.
  - Maybe we can write our own IPFS resolver that points to Google Cloud DataStore directly?
- Switch our own client-side JavaScript over to IPFS (instead of our public
  API) for fetching IPLD nodes.
  - If public IPFS latency stays the same as it is today, our code would have
    to fall back to our own public GraphQL API when IPFS doesn't respond fast
    enough.
- Release an open-source JavaScript library for interacting with our data over
  IPFS.
  - Theoretically decentralized, assuming other people choose to run their own
    IPFS-based mirrors.
  - Now other folks can theoretically build things that use data from our
    platform without relying on our GraphQL API being available.
- Release an open-source server application that facilitates the mirroring of
  subsets of CRC data.
  - Presumably the people who own CRCs or supply CRCs would be incentivized to
    mirror at least the data associated with CRCs that they own or supplied.
  - Maybe this is enough to be considered fully decentralized, except you still
    don't have decentralized guarantees of data permanence.
- Use transaction fees or some other decentralized funding mechanism to pay for
  FileCoin to store this data "permanently".
  - Probably about as close as you can get to guaranteed decentralized
    permanence and availability. Sounds like there are still a lot of hard
    problems to solve before this could be a reality.
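The latency fallback mentioned above (falling back to our GraphQL API when IPFS doesn't respond fast enough) could look roughly like this: race the IPFS lookup against a timeout. Both resolvers here are mocks, and the function names and timeout value are illustrative, not a real API.

```javascript
// Mock resolvers: a slow "IPFS" lookup and a fast centralized API.
function fetchFromIpfs(cid) {
  return new Promise((resolve) =>
    setTimeout(() => resolve({ cid, via: "ipfs" }), 500)
  );
}
function fetchFromApi(cid) {
  return Promise.resolve({ cid, via: "graphql-api" });
}

// Try IPFS first, but fall back to our public API if it is too slow.
async function fetchNode(cid, timeoutMs = 100) {
  const timeout = new Promise((resolve) =>
    setTimeout(() => resolve(null), timeoutMs)
  );
  const viaIpfs = await Promise.race([fetchFromIpfs(cid), timeout]);
  return viaIpfs !== null ? viaIpfs : fetchFromApi(cid);
}

fetchNode("crc-1").then((node) => console.log(node.via)); // "graphql-api" (IPFS too slow here)
```

A real client would also want to pin or cache whatever IPFS does return, so the fallback path is exercised less over time.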
These steps don't really include anything about decentralized queryability since
I don't really have a good idea of how that would work in practice. But I sure
would like it to happen!
Also, each of these steps could trigger very long discussions about the various
ways to implement it. We would certainly like solutions that are low
cost, low latency, high performance, etc., and will look to the community
for guidance on achieving that.