deps's People

Contributors

bcomnes, bnewbold, decentral1se, joehand, martinheidegger, ninabreznik, pfrazee, rangermauve, ryanramage, serapath, yoshuawuyts

deps's Issues

Discussion: Versions should use hashes

Currently in hyperdrive, hypercore, Beaker Browser (and probably a few other tools), versions are specified as the length of the append-only log (a number). However, that is not a safe way to specify a version.

Problem case: a researcher wants to specify exactly which version of a Dat is used, and specifies it like dat://ab...ef+234. The researcher then notices that the data set doesn't fit the output, reverts to version 1, and builds up a new history with exactly 234 versions that does fit the output. The same version specifier now points to different data, so the researcher has just managed to back a false claim.

How do we make sure this never happens? Each version of a hypercore has a hash, which makes one version of a hyperdrive a combination of various hypercore versions.

Specifying a dat version explicitly would then look like this:

dat://<channel:64-hex-chars>+<metadata:64-hex-chars>+<content:64-hex-chars>

... for a single-writer dat, and it would become even more of a hassle with a multi-writer dat (1 key for the channel + 2 hashes per writer). Note: I know that it could be okay to use only the first 8 characters as a version identifier, but that would probably not be good enough for a researcher.

Thinking about this for a little while, I came up with the following solution, which might be a good idea for a new DEP:

(Single-writer for the sake of simplicity)

We could add another hypercore to a hyperdrive that keeps an index of the versions and their hashes:

message Version {
  string hash = 1; // Hash of the version (calculated by hashing all the other hashes in here)
  repeated string tags = 2; // Names to find this version by
  int32 metadataLength = 3; // Length of the metadata core
  string metadataHash = 4; // Hash for this version of the metadata core
  int32 contentLength = 5; // Length of the content core
  string contentHash = 6; // Hash for this version of the content core
}

This way a version checkout could download all versions of the version hypercore, create a lookup-table and select the version based on that lookup-table.
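
For illustration, here is a minimal sketch of how such a checkout could build that lookup table, assuming the entries of the version core are the Version messages above, decoded with a matching protobuf codec (the Version codec and the version core itself are part of this proposal, not existing APIs):

// Sketch: scan the version core and index every entry by hash and by tag.
function buildVersionLookup (versionCore, Version, cb) {
  const byHashOrTag = new Map()
  versionCore.createReadStream()                 // stream every entry of the version core
    .on('data', (entry) => {
      const version = Version.decode(entry)      // decode the protobuf message
      byHashOrTag.set(version.hash, version)
      for (const tag of version.tags) byHashOrTag.set(tag, version)
    })
    .on('error', cb)
    .on('end', () => cb(null, byHashOrTag))
}

// A checkout would then look up a hash or a tag and use the stored
// metadataLength / contentLength to check out the corresponding hyperdrive version.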

My questions now are:

  • Is this a reasonable approach? Do you know a better way to get that done?
  • How could a multi-writer version look like?
  • Should this be turned into a DEP?

Proposal: Don't use Google DNS

Hello!

I have a hard time surfing the decentralized "next generation web" with Beaker Browser while being forced to use Google's DNS; the two feel contradictory to me.
I would love for you to drop it completely and offer people the possibility to choose their preferred DNS service.

Thanks!

Proposal: Dat mounts / symlinks (revisited)

This idea isn't new, but I've recently realized there's a potential optimization that might make this worth prioritizing.

Let's call this proposal a "mount." It's conceptually simple, like a symlink but for dats. It could apply to both hyperdb and to hyperdrive. It is a pointer which maps a prefix/folder to a prefix/folder of another hyperdb or hyperdrive.

var a = hyperdrive(...)
var b = hyperdrive(...)
a.mount('/foo', b, {path: '/bar'}, (err) => {
  // now: a /foo -> b /bar
  a.readdir('/foo', (err, fooListing) => {
    b.readdir('/bar', (err, barListing) => {
      // fooListing === barListing
    })
  })
})

When a hyperdb/drive is replicated, the client would request any mounted dbs/drives using the same connection & swarm as the parent db/drive. This means that a mount does not have to incur additional swarming overhead. (This is the optimization I was referring to.)

Mounting is generally useful to applications. It has the following uses:

  1. Mapping upstream dependencies. For instance, a /vendor directory could be populated with mounts to libraries.
  2. Collaboration. This does not enable users to modify the same data (that's multi-writer) but it does make it possible to mount "user-owned" directories which are easy to discover and leverage. For instance, a /users directory could be populated with mounts to a site's active users; /users/bob could point to bob's own drive.
  3. Data-clustering. Because mounted dats can be shared over the parent's swarm, the number of total active swarms can be reduced.

Discussion: compatibility/gateway layer with IPFS

Hi, I am a user of IPFS who would like to know more about Beaker/Dat
Example: https://github.com/hydrusnetwork/hydrus (anime collection platform with IPFS)
Example 2: https://github.com/Siderus/Orion and https://github.com/Siderus/Orion/issues/133
Here are some questions:

  1. Does Dat require Beaker to run, and if not, is there a standalone client like Orion/qBitTorrent?
  2. How easy would it be for a Python library like Hydrus to use Dat?
  3. Does Dat have content-defined chunking and de-duplication like IPFS?
  4. Does Dat use content-based addressing?

Proposal & Discussion: Hypercore "major version" pointer

One of the current issues with Hypercore is that a fork in the history is a fatal corruption of the data. This means that people's datasets can be destroyed by a botched "key move" between computers.

Another issue is that, because history cannot be rewritten, it's not currently possible to upgrade a datastructure on Hypercore (such as Hyperdb or Hyperdrive). If a breaking change has to be made to the data structure, then the old hypercore has to be replaced with an entirely new hypercore.

To counter-act this issue, @mafintosh and I have been talking about a meta "pointer structure" which provides a level of indirection between the "public URL" and the "internal identifier" of the hypercore. This would make it possible to replace a Dat dataset's internal data structures without changing the publicly-facing URL/key.

Such a data structure might look something like this:

message HypercorePointer {
  required bytes key = 1;  // ID of the hypercore being pointed to
  required uint64 seq = 2; // monotonically increasing counter
}

The key would provide the ID of a hypercore, while seq would be a monotonically-increasing value. To update the pointer, the owner of the public-facing URL would publish a new signed HypercorePointer with a seq equal to the previous pointer's seq plus one.

During the exchange for a hypercore, peers will share their latest HypercorePointer and resolve to sync the pointer with the highest seq number. (They could continue to sync previous feeds.) The hypercore pointed to would then be synced within the existing swarm & connection.

Implications for apps & consuming clients

The HypercorePointer makes it possible to change the internal dataset without changing the URL.

When this occurs, the hypercore's data would essentially be reset, and all history could be altered. This is not a trivial event; from the perspective of any consuming application, the hypercore's previous state has been completely invalidated.

If the pointer is updated to fix a fork-corruption, it's likely that the application doing the fix would then try to recreate the last state on the new log. However, a pointer update will have to be viewed by applications as a total reset, since the destination state can change.

To manage this, we would most likely need to surface the HypercorePointer to the APIs and UIs in some way. @mafintosh explored the idea of calling the seq of the pointer a "major version" while the seq of an individual log is the "revision" or perhaps "minor version." This would mean that hypercore-based data structures are addressed by a major/minor version, such as 5.3.

The semantics of a major-version change, under this scheme, would be "this is basically a whole new dat, so clear any current indexes on it and reindex from scratch."
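
As a rough illustration (clearIndexes and reindexFromScratch are hypothetical placeholders for whatever a consuming application does with its derived state), a client reacting to pointer updates might look like this:

// Sketch: higher seq wins, and a seq bump is treated as a brand new dat.
function onPointerUpdate (current, incoming) {
  if (incoming.seq <= current.seq) return current   // stale or duplicate pointer, ignore it
  clearIndexes()                                    // "major version" bump: drop all derived state
  reindexFromScratch(incoming.key)                  // rebuild from the newly referenced hypercore
  return incoming
}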


Thoughts and discussion open!

Half-Baked: Peer-to-Peer tunneling (and/or RPC)

I think there are two important low-level features we might need to deliver to enable a broader set of applications and use cases: app-independent identity ("personas"), and robust, secure, authenticated peer-to-peer data channels. Both are Hard Problems. A lot of work has been put into the former, and I think there are feasible solutions using some combination of public-key crypto, web of trust, trust-on-first-use, device key signing, etc. Maybe some decentralized Keybase-like thing. For the latter, XMPP, Matrix, and Tor hidden services are good starts. Here, I'm going to propose two new half-baked patterns, as alternatives to the above, that might fit into the hyper-world.

The first idea isn't very solid (but sets up the second), and I think it's been raised previously: store a public identity key in a multi-writer hyperdb, where each of the user's end devices has a separate writer key. Other keys (or at least fingerprints), a profile image, and "impressum"-type contact/profile details could be stored here; most importantly, it would be a way to revoke/update a primary key pair (by updating a public key entry in the db). Multi-writer doesn't really solve the device-key revocation problem well, though, so this is a hand-wavey solution. Let's call the key in the hyperdb the "public persona key".

The second idea is to have a mechanism for discovering and initiating a connection to one of a persona's devices. The way this would work is that the "sender" (the peer initiating the connection) would need prior knowledge of the "recipient" peer's public persona keys. They would look up the currently active key from the identity hyperdb (note: distinct from the "original"/"source" key of the hyperdb itself, which would be longer-term stable), then generate a discovery key using a protocol name ("hyperpeer" or something, as opposed to "hypercore") and, optionally, a "code word" passed out-of-band from the "recipient".

The code word would be used to control initiation access, similar to how unlisted phone numbers or private email addresses work today (in particular, the "[email protected]" pattern): an individual may or may not want to let just anybody initiate connections, so they might only listen on a set of (rotating?) code words that might be printed on business cards or whatever. Others might want to allow public connections; "recipients" would control which discovery keys to actually broadcast and listen on. Anonymous/ephemeral connections can be made by generating a key pair with no associated "identity" hyperdb. Any trusted recipient device (holding the public/private key pair) can receive connections (making the chance of being online higher); "sending" peers would need to retry until they succeed. Only "receivers" would broadcast to be discoverable; the code-word mechanism above could make DDoS/sybil attacks harder (aka censoring access to a persona; not sure how to prevent this for wide-open personas).

Once a peer has been discovered, authentication would be established using the key pair on the "recipient" end. The sender could likely supply their own persona key during the handshake for mutual authentication. The connection would be encrypted similarly to hypercore, though the message format may be entirely different beyond that. Both generic TCP-like streams (multiplexed a la websockets?) and protobuf-like RPC (request/response) could be sent down the pipe; either generalizes to the other.
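
For concreteness, here is a rough sketch of how that discovery-key derivation could look. Hypercore derives its discovery keys with a keyed BLAKE2b; HMAC-SHA256 stands in here to keep the example dependency-free, and the "hyperpeer" label and codeWord parameter are the assumptions described above, not existing APIs:

const crypto = require('crypto')

// Derive a discovery key for the proposed "hyperpeer" channel from the public
// persona key plus an optional out-of-band code word.
function hyperpeerDiscoveryKey (publicPersonaKey, codeWord = '') {
  return crypto.createHmac('sha256', publicPersonaKey)  // keyed hash, keyed by the persona key
    .update('hyperpeer')                                // protocol name, so it can't collide with "hypercore"
    .update(codeWord)                                   // optional access-control code word
    .digest()
}

// A "recipient" broadcasts and listens on hyperpeerDiscoveryKey(key, codeWord);
// a "sender" who knows both values derives the same key, looks it up in the
// swarm, and then authenticates against the recipient's persona key pair.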

Regular hypercore could be tunneled over such a channel. In particular, for RPC or other connections that want to be robust to network disconnects, the peers could immediately establish two private hypercores (one for each direction of transfer) and send messages by appending them. In the other direction, we could try to define a datagram-like transport over UDP (eg, for voice/video applications).

scaling to indefinite and changing groups of files

Hey all,

I've been in conversations with Internet Archive to integrate Dat into their dweb portal. One of their requirements is to serve data from Dat over the peer-to-peer network without having to scan all of the files ahead of time (upwards of 50 petabytes of archives...). Opening this on behalf of @mitra42 who is project lead on the Archive's dweb portal.

The way they have approached this with other protocols such as IPFS, Gun, and Webtorrent is to have a peer that pretends that it contains everything, and then once an archive is requested, to go fetch it from Internet Archive storage and serve it on the network.

For example, they want to be able to handle requests for the string /foo/bar/baz.mp3 over the network. Quoting @mitra42:

Looking at that doc [How Dat Works, by vtduncan], I'm struggling a bit with the route from a path, dat://12334/foo/bar/baz.mp3, to the data.

It seems, if I read it correctly, that this requires all the files to have previously been put into the metadata "file", so it doesn't scale to indefinite and changing groups of files, i.e. there doesn't appear to be a point (unlike in GUN, for example) where the string /foo/bar/baz.mp3 is sent from one peer to the other, for that peer to then figure out what to send back.

Wondering if this is a use case that this WG is interested in supporting directly in the protocol (e.g., file-aware transfers vs. only block transfers)?

Thanks,

Discussion: options for hypercore feed-level metadata

Motivation: have a way to annotate the "type" of feed contents. For example, determine if you're looking at a hyperdb key/value feed, a hyperdrive, or some other thing. A requirement is that code/libraries be able to replicate the feed and discover the content type (and schema version) without necessarily understanding the schema itself. A related motivation is to discover related ("content") feeds in a protocol-agnostic manner, but this isn't a requirement.

Question: should this blob be strictly immutable? Being able to change some metadata might be nice (eg, paired feeds), but keeping it immutable is simple for, eg, hosting platforms and archives.

Option 1: a protobuf message as a special first entry in the feed. This is basically what hyperdrive does currently to point from the metadata to the content feed, only we would want to use an (extensible) protobuf instead of bare bytes. We could potentially select a small fixed number of fields for this protobuf schema (eg, "repeated bytes relatedFeeds", "optional string protobufSchema", "optional string contentType"; strings could be mimetype-like), which applications could extend.

Option 2: add a metadata/header blob out-of-band to hypercore feeds. @mafintosh mentioned a scheme where an immutable blob is transmitted during feed handshakes, and the hash of that blob is used as a key for internal hypercore hashing. It would be stored as a new stub file in SLEEP directories (like the feed key is currently).

There are probably more options if we get creative!

Proposal: Scan directory before pull

As far as I understand it, when a user pulls a dat, if some of the files exist already, they are simply overwritten. It should be possible to hash the data in the directory and compare it to the dat before pulling to avoid duplicate downloads.

Discussion: Follow redirects in DNS?

Currently the proposal says:

If the server responds with anything other than a 200 OK status, return a failed lookup.

A lookup for dat://www.biserkov.com will hit https://www.biserkov.com/.well-known/dat and everything will work.

But a lookup for dat://biserkov.com will hit https://biserkov.com/.well-known/dat which returns HTTP/1.0 301 Moved Permanently so the lookup will fail.

I believe this situation is quite common, given the current wisdom against apex domains and such.
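
A possible amendment, sketched below with a Node-style fetch (e.g. node-fetch, where redirect: 'manual' exposes the Location header; the 3-hop budget is just an example), would be to follow a bounded number of redirects before declaring the lookup failed:

async function wellKnownDatLookup (hostname, maxRedirects = 3) {
  let url = `https://${hostname}/.well-known/dat`
  for (let hops = 0; hops <= maxRedirects; hops++) {
    const res = await fetch(url, { redirect: 'manual' })
    if (res.status === 200) return (await res.text()).trim()    // success: parse the dat key as today
    const location = res.headers.get('location')
    if (res.status >= 301 && res.status <= 308 && location) {
      url = new URL(location, url).toString()                   // follow one redirect hop
      continue
    }
    break                                                       // any other status: give up
  }
  throw new Error('failed lookup')                              // same failure mode as the current DEP
}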

Proposal: Extension Message to notify about hypercore keys

Recently, there's been work around hypercore-protocol-proxy which is making use of hypercore-protocol to replicate feeds from a gateway and for multicasting data.

It works great for feeds that gateways already know about, but the protocol is limited in that you can't start replicating keys that both peers don't already know about.

One use-case that I'm really interested in is public gateways that allow peers to connect to them and replicate any key with the gateway automatically connecting to the discovery-swarm to do replication.

I propose a hypercore extension message that will notify the other party about a "related" key.

This will be used by clients to notify the gateway about a key before attempting to send the "feed" message.

I think we'll need to bikeshed a bunch about the actual details, but I think it would look something like:

  • The extension name would be "related-feed"
  • The message contents would be a protobuf that looks like:
enum RelatedFeedAction {
  REQUEST = 1; // Ask for the key to be loaded on the remote
  READY = 2;   // Answer to REQUEST, means that it's ready for the "feed" message
  REFUSE = 3;  // Answer to REQUEST, means that the gateway won't replicate that key for some reason
}

message RelatedFeed {
  required RelatedFeedAction action = 1;
  optional bytes key = 2;
}
  • The client would connect to the gateway using websockets, but it should work with any duplex stream
  • The client will have a high-level API, relateFeed(key, cb), for requesting feeds by key (see the sketch after this list)
  • The steps for getting a feed are:
    1. Send a RelatedFeed REQUEST action for the key to the gateway
    2. Wait for a RelatedFeed event back from the gateway for the same key
    3. If the action is REFUSE, call the cb with an error
    4. If the action is READY, call the cb without an error
    5. The client should then send a "feed" event using whatever means (probably by replicating the hypercore? Not sure how this part should work, actually)
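
A rough sketch of that client flow (the sendExtension/onExtension hooks and the encodeRelatedFeed/decodeRelatedFeed codec are hypothetical placeholders for however the hypercore-protocol stream ends up exposing extension messages; only the message flow above is being illustrated):

const REQUEST = 1
const READY = 2
const REFUSE = 3

function relateFeed (gateway, key, cb) {
  // Step 1: ask the gateway to load the key.
  gateway.sendExtension('related-feed', encodeRelatedFeed({ action: REQUEST, key }))
  // Step 2: wait for the gateway's answer for the same key.
  gateway.onExtension('related-feed', function onmessage (buf) {
    const msg = decodeRelatedFeed(buf)
    if (!msg.key || !msg.key.equals(key)) return                            // answer for some other feed
    if (msg.action === REFUSE) return cb(new Error('gateway refused the key'))  // step 3
    if (msg.action === READY) return cb(null)                               // step 4: safe to send the "feed" message
  })
}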

I think that having a standard DEP for this will make it easier to have people deploy gateways that can be reused between applications.

Zero-knowledge remote attestation (for sensitive institutional/archival dumps)

This may be most interesting to institutional/organizational/repository/archival folks, though it might also be of interest to anybody operating or relying on a hosted "pinning" service (like hashbase.io).

Yesterday, in a conversation, I was talking about how a network monitor might want to "trust but verify" the archival status of a Dat/hypercore peer by fetching random chunks of a feed to ensure that the peer actually has them. Some context/motivation for this use case:

  • Imagine the USA EPA is hosting a large dataset in Dat. Their peer would advertise that it Has all the chunks of data; a skeptical peer (eg, a repository accreditation agency, journalists, etc) might want to verify that they actually do still have all the data on disk.
  • An individual user paying for really cheap, cut-rate backups of all their cat photos: a nervous user might trust the host but want to verify that it does indeed still have all the data and that it hasn't become corrupted, particularly in the moments before deleting their own local copies (eg, to make space on their laptop disk).
  • A more malicious example would be a well-funded peer trying to censor a public dat archive by creating thousands of fake/dummy peers claiming to have the full archive, but then stalling before actually returning results, resulting in most clients connecting to these "sybil" nodes and timing out over and over (and thus failing to sync the content).
  • As a final corner case, consider a hospital storing sensitive private data and synchronizing it to other hospitals for backup (or when a patient moves). A third party might want to verify that the data is all there (and hasn't suffered, eg, bulk disk corruption, which a hospital might not even notice if it isn't continuously re-hashing its contents), but not want to actually transfer any of the data (because that would require HIPAA compliance in the USA). There are similar circumstances (financial data, personal private data, etc) where an observer might want to verify integrity without ever seeing the data.

These are not hypothetical concerns: accredited data repositories need to have a workflow in place for this sort of third-party verification, the LOCKSS network has an entire protocol for secure peer verification, and IPFS's Filecoin protocol depends on this kind of verification.

In all these cases, a naive mechanism to "check" the remote's status would be to download random metadata and content chunks, rehash them, and verify the signatures. By having the checker choose chunks at random, it isn't possible for the remote to fake the results; it would effectively need to retain the full content.

The person I was talking to reminded me that it's possible to do even better: the entire chunks don't need to be transferred over the network if some clever crypto is used. In a simple case, if the "monitoring" peer also has a complete copy of the content, it can create a random number, use it as a "salt" when hashing a random local chunk (or even the entire content feed!), send the random number to the peer being tested so it can do the same operation, and then compare just the resulting hashes. In this case no actual content has to travel over the wire. I think there might be an even more clever way to do remote checks where the "monitoring" peer doesn't have a complete copy of the content locally (only the metadata), but I would need to research this.
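
A minimal sketch of that salted-hash spot check, assuming both sides can read a chunk of the feed locally (getLocalChunk is a hypothetical helper, and HMAC-SHA256 stands in for whatever keyed hash a real extension would pick):

const crypto = require('crypto')

// Monitor side: pick a random chunk index and a random salt.
function makeChallenge (feedLength) {
  return {
    index: crypto.randomInt(feedLength),
    salt: crypto.randomBytes(32)
  }
}

// Both sides compute the same keyed digest over their local copy of the chunk.
async function answerChallenge ({ index, salt }, getLocalChunk) {
  const chunk = await getLocalChunk(index)                 // never sent over the wire
  return crypto.createHmac('sha256', salt).update(chunk).digest('hex')
}

// The monitor compares its own answer with the remote's; a mismatch (or a
// timeout) suggests the remote does not actually hold that chunk intact.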

This might be a useful extension to the hypercore protocol some day; would probably be two new message types (one to send a verification request, one for the response). I (Bryan) don't have any plans to work on this in the near future, but wanted to put this down in words.

Dat Has No Client Identifier // Include a `client_id` Field in Handshake

As a developer, user and service operator, I'd like to be able to block leeching, misbehaving and outdated clients to ensure the best service for my users. Currently, I don't think this is possible, as Dat has no defined mechanism for client identification: all clients look the same.

The equivalent in the BitTorrent universe is BEP0020: peer_id Conventions. This is routinely used by both trackers and peers to block malicious torrent clients like BitTyrant/BitVomit/XunLei Thunder, as well as insecure clients like older versions of Azureus and Transmission. The HTTP equivalent is obviously the User-Agent string.

I see that there is already a "User data" field, although the documentation is unclear about what this field should actually be used for; it just says "any purpose", whereas BT's peer_id field is more specific. (If it is meant to be used for client identification, then this ticket is simply a documentation issue, not a feature proposal.)

So, I propose that the dat handshake should include a 20-byte client_id field that could include client version information, e.g. BEAK-1-0-1 or DATCLI-2-8-0, etc.
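
Purely to illustrate the shape of such a field (the helper and the zero-padding convention are made up for this example, not part of any spec):

// Build a fixed-width, 20-byte client identifier, e.g. makeClientId('BEAK', '1.0.1').
function makeClientId (name, version) {
  const id = Buffer.alloc(20)                          // zero-padded to 20 bytes, like BitTorrent's peer_id
  id.write(`${name}-${version.split('.').join('-')}`)  // e.g. "BEAK-1-0-1"
  return id
}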

Alternately, we could formally define a convention such that the ID handshake field is prefixed with client version information, a la BEP20, but I think that's a hack we should avoid if at all possible.

Cheers,
Rich

Discussion: Let's settle on a common abstract-dat interface

abstract-dat

A common interface for hypercore-based data structures

There are a lot of hypercore-based data structures. When working on higher-level tools, it often does not really matter whether you're dealing with a hyperdrive, a hypercore, a hypertrie, a hyperdb, a multifeed instance, or even a kappa-core or a cabal.

Most of these need, in their constructor:

  • A random-access-storage instance
  • A key, or a key pair, when wanting to sync an existing archive
  • An opts object

They also all have a sufficiently similar public API that usually, apart from the structure-specific methods, includes:

  • key: The public key (Buffer)
  • discoveryKey: The discovery key (Buffer)
  • ready (cb): Async constructor
  • replicate (opts): Create a hypercore-protocol stream for replication (or use one passed as opts.stream)

Structures that are composed of more than one hypercore often also expose a feeds(cb) method that invokes cb with a list of hypercores. There should maybe be a second argument to the callback function that contains type-specific metadata for the feeds (e.g. content vs. metadata feeds in hyperdrive).

Some of the more recent data structures can accept a hypercore factory/constructor, either as argument or option. If that is passed, a storage instance is not needed anymore.

There are also a lot of common options, mostly derived from hypercore: sparse, live, valueEncoding.
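
Spelled out as code (purely illustrative; the property and method names follow the existing hypercore/hyperdrive conventions listed above, not any agreed standard), the minimal contract could be checked like this:

// Duck-type check for the proposed abstract-dat interface.
function isAbstractDat (structure) {
  return !!structure &&
    Buffer.isBuffer(structure.key) &&            // public key
    Buffer.isBuffer(structure.discoveryKey) &&   // discovery key
    typeof structure.ready === 'function' &&     // async constructor: ready(cb)
    typeof structure.replicate === 'function'    // replicate(opts) -> hypercore-protocol stream
}

// A higher-level tool (a multi-dat manager, dat-doctor, a future dat-sdk) could
// accept anything that passes this check, regardless of the concrete structure.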

If we turn this abstract-dat interface into a standard (maybe like in the random-access-storage ecosystem), higher-level tools can easily work with different data structures. Examples of higher-level tools are libraries/managers of multiple dats, debug tools like dat-doctor, and hopefully soon something like a dat-sdk.

Additionally, higher-level tools like cabal could easily also adhere to such an interface, and thus be managed with the same tools as hyperdrives etc.

Very little of this isn't already common. One open question is hypercore factory vs. storage instance for structures composed of hypercores. I'd propose staying with the storage instance as the default, but always also supporting a hypercore option that takes a hypercore factory (but then the storage arg would be null?). This is pretty much the only difference in signature that I could find (multifeed takes a hypercore constructor as its first arg, while all the others take a storage instance or path).

I'm not completely sure what the best process for such a standardization is; it would likely involve a few parts:

  • settle on a common interface: would need maybe a little more research, and then a DEP with the documentation, I guess
  • settle on naming: I quite like abstract-dat as a label for hypercore-based data structures, but please give other suggestions if you have some
  • adopt it across the ecosystem: might need new major versions for some tools

Anyway, I'm creating this issue first to gather some feedback before writing up a DEP :-)

Discussion: Full streaming in Hyperdrive.

Using Hyperdrive: when creating a write-stream in Dat, the logic will create a new version of a file for every chunk added. This makes it possible to implement a distributed read-stream, but the following questions arise:

  • How much data is going to be written to the stream? (Some streams - such as uploading a big zip file - can predict how much data will be part of the stream)
  • Is it worth keeping the versions between the start of the stream and the end of the stream? (does metadata need to be sent for every block?)
  • When is the write stream finished?
  • What happens when a write stream is corrupted?

Currently, the reading of streams is implemented by looking at the stat; as soon as all the data for the stream has arrived, the stream is finished (hyperdrive/index.js#L510).

The writing of streams, on the other hand, appends a lot of data to the content log (hyperdrive/index.js#L578) but adds only a single entry to the tree after finishing (hyperdrive/index.js#L598).

This means (to my understanding) that currently hyperdrive only starts a read-stream on a peer after a write-stream on the creator's machine has entirely finished.

My straightforward idea to fix this would be that, upon creation of a write-stream, hyperdrive immediately adds a put message to the tree stating the stream's final size if known (or 0) and adding an "open" flag to the Stat. Upon finish, there would be another tree.put with the final stats that no longer contain the "open" flag.
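
A hand-wavy sketch of that idea (tree.put, Stat and contentWriteStream only loosely mirror hyperdrive's internals here; none of these are exact APIs):

function createWriteStreamWithOpenFlag (tree, name, expectedSize) {
  // Publish an "open" stat right away so peers can start a read-stream early.
  tree.put(name, Stat({ size: expectedSize || 0, open: true }))
  const stream = contentWriteStream(name)                 // appends chunks to the content feed as usual
  stream.on('finish', function () {
    // Replace it with the final stat; the "open" flag is gone, so readers know the file is complete.
    tree.put(name, Stat({ size: stream.bytesWritten, open: false }))
  })
  return stream
}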

Proposal: Only connect to writers

Following up from some twitter discussion a couple weeks back. I propose changing the default discovery algorithm in three ways in order to improve default privacy:

    1. be able to verify known trusted hosts
    2. default to downloading only from trusted hosts (similar to HTTPS privacy)
    3. default to download-only (opt in to seeding)

In other words, turn off p2p mode by default, except for a set of 'trusted hosts'. To simplify things, I propose we define the 'trusted hosts' as any writer. This is a simple default that can be overridden by settings (e.g. to specify a set of trusted non-writer hosts).

The way I envision discovery working in this new scheme is something like:

  • Discover initial peers based on first key (same behavior as now)
  • Additionally, subscribe to discovery channel for every key in writers
  • Instead of connecting to any IP:PORT peers that are discovered, only connect to IP:PORT peers signed by writer keyholders

This changes the privacy expectation to match HTTPS: users need only trust the 'owner' of the content they are requesting to keep their server logs secure. The key difference is that instead of one DNS record being considered the owner, the entire set of Dat writers (and their corresponding IP:PORT pairs) is considered trustworthy.

Again, this is just a proposed default for any Dat client. An option to run in 'unrestricted p2p mode' is easily added.

This would probably be a breaking change, since there could be existing dat schemes out there that rely on non-writers re-seeding content.

DEP-wise, there would need to be a mechanism added to sign DHT payloads.
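
For illustration, verification of such a signed announcement might look roughly like this (the announcement shape { host, port, writerKey, signature } and the signed "host:port" message are assumptions made for the sketch, not a proposed wire format):

const sodium = require('sodium-universal')

// Only connect to peers whose announcement is signed by one of the dat's writers.
function isTrustedPeer (announcement, writerKeys) {
  const { host, port, writerKey, signature } = announcement
  if (!writerKeys.some(k => k.equals(writerKey))) return false              // unknown key: not a writer
  const message = Buffer.from(host + ':' + port)
  return sodium.crypto_sign_verify_detached(signature, message, writerKey)  // announcement really made by that writer
}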

DEP-0005: Move DNS TXT record to a dedicated subdomain

The current DEP-0005 proposal is unusable with CNAME records. Imagine the following zone file:

www.example.com. 3600 IN CNAME cdn-service-or-something.example.net.

It's impossible to add additional TXT records to the www subdomain in the above setup. You can't add any other records to a name that has been "redirected"/outsourced with a CNAME. IPFS has solved this by using a dedicated subdomain for DNS-based discovery of IPFS hashes, e.g.

_dnslink.www.example.com. 3600 IN TXT "dnslink=/ipfs/{hash}"

I propose a small change to DEP-0005 to address this and align the mechanism with RFC 6763: DNS-Based Service Discovery. The change would be to deprecate the current draft proposal of adding TXT records to the named zone, and instead add it to a DNS-SD subdomain (which is compatible with CNAME deployments):

_dweb._udp.www.example.com. 3600 IN TXT "datkey={key}"

And I do mean that the current method should be deprecated, to avoid a future where multiple DNS lookups are required to discover Dat keys. IPFS supports both, which leads to unnecessary DNS lookups. References to it should be removed from all documentation, and support dropped in Beaker Browser after a year or so.

Why _dweb._udp.? It's conceivable that other protocols would want to use DNS to auto-discover distributed-web tools. _dweb is short and generic enough to allow for other uses, thus increasing the likelihood that the record will be cached somewhere nearer to the end-user in the DNS hierarchy. E.g. IPFS could use the same subdomain with an ipfskey record. Using that argument, Dat should consider using _dnslink. like IPFS does. I'm against that because it's not compliant with the well-established and well-supported RFC 6763. Also, the name dnslink is redundant and meaningless.

The second label, _udp, is a common name for DNS-SD defined in RFC 6763 that allows all service-discovery records to be delegated to a secondary DNS service. The name _udp should have been _srv ("service"), but it's _udp for legacy reasons. See RFC 6763 section 7 for the details.

Technically, SRV (service discovery) records should be used instead of TXT records.

_dweb._udp.www.example.com. 3600 IN SRV "datkey={key}"

However, I don't have any data on how widely they're supported in managed DNS solutions. I believe they should be supported everywhere except the most outdated and insecure cPanel instances and legacy systems. Using SRV instead of TXT has a potential performance benefit (see RFC 6763 section 12.2) for any discovery mechanism that requires further DNS requests to use a discovered service (such as IPFS+IPNS+DNSLink). It would potentially be beneficial to other dweb solutions to stick with SRV-type records even though Dat doesn't use them itself at this time.

Include heartbeat rate in the hypercore-protocol handshake

It turns out that peers need to agree on the heartbeat rate, otherwise they'll incorrectly detect a dead connection. For instance, if peer A expects a rate of 1 every 1s and peer B expects a rate of 1 every 5s, then peer A will believe the connection is dead because too many heartbeat "ticks" will pass with nothing from peer B.

This means the peers need to agree on heartbeat rates in order to maintain good connections. We have two options:

  • Option 1, hardcode the heartbeat rate in the spec
  • Option 2, include a heartbeat negotiation in the handshake

@mafintosh suggested option 2. In that case, we need to decide what rate to use if the peers suggest different values. Perhaps an average, or a lower limit, or an upper limit. Also, some limit range should be hardcoded, e.g. 5s < x < 120s.
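
As one possible shape for Option 2 (the clamp range and the "lower value wins" rule are just examples of a deterministic choice, not a decided policy):

const MIN_HEARTBEAT_MS = 5 * 1000
const MAX_HEARTBEAT_MS = 120 * 1000

// Each peer proposes an interval in the handshake; both sides run the same
// function on the pair of proposals and therefore agree on the same result.
function negotiateHeartbeat (localMs, remoteMs) {
  const clamp = ms => Math.min(MAX_HEARTBEAT_MS, Math.max(MIN_HEARTBEAT_MS, ms))
  return Math.min(clamp(localMs), clamp(remoteMs))
}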

If we want to move forward with that, we'll need to update the handshake schema.

DEP-0007 - limited size header

When reading DEP-0007 I noticed that the dataStructureType can be any string (its varint-encoded size can be arbitrarily big). However, most file signatures are fixed in size, which makes me wonder whether it wouldn't be a good idea to amend the DEP with a limit on how big the header can be, in order to prevent downloading a lot of data before checking whether the data even fits the structure.
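
An illustrative check of the kind being suggested: however DEP-0007 actually lays out the header, the idea is just to read the declared size before fetching the body. The 4 KiB cap is an arbitrary example, and the varint module here is the npm varint package, not part of the DEP:

const varint = require('varint')

const MAX_HEADER_BYTES = 4096  // example cap, not part of the DEP

// Inspect only the varint length prefix of the header entry and bail out
// early if the declared size is implausibly large for a file signature.
function headerSizeOk (firstBytes) {
  const declaredSize = varint.decode(firstBytes)
  return declaredSize <= MAX_HEADER_BYTES
}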

Discussion: Hiding data stored on registries

Related to #14

Some data should be private and not exposed to any third parties, but people will still need it backed up somewhere (pinned) so that they can be sure they have access to it across devices.

Dat should provide a mechanism for this out of the box.

Some ideas to consider:

  • Easiest approach would be to encrypt the contents of the archive at the application layer
    • Metadata could still be exposed about what files exist and how frequently it gets updated
    • Encrypt the file paths, too?
  • Should encryption be at the hyperlog level or the archive level?
    • How would hosts advertise and share the data without knowing the contents if it's at the hyperlog level?
  • How would the flow of sharing this data work?
    • Should dat URLs allow for a "key" field to have two levels of URL to share?
    • dat://encryptionstrategy:encryptionkeyhere@daturlhere/mysecretdata.html

Wanted to have some feedback on others' ideas before working on anything.

For a start I'm going to play around with a wrapper for the DatArchive API for encrypting contents and files. Will look into using WebCrypto for the actual functionality.
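
As a starting point for that wrapper, here is a toy sketch of the application-layer approach (AES-GCM via WebCrypto; the DatArchive call is illustrative, and key management, the hard part, is not shown):

// Encrypt file contents before writing them into the archive; the IV is
// prepended so that readers can decrypt after readFile.
async function writeEncrypted (archive, path, plaintextBuffer, key) {
  const iv = crypto.getRandomValues(new Uint8Array(12))
  const ciphertext = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, plaintextBuffer)
  const blob = new Uint8Array(iv.length + ciphertext.byteLength)
  blob.set(iv, 0)
  blob.set(new Uint8Array(ciphertext), iv.length)
  await archive.writeFile(path, blob.buffer, 'binary')
}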

Allow for other key=value pairs in a dat TXT record

Currently, the implementation of the dat TXT record lookup has a very greedy regex that only allows for a datkey=value pair.

I am working on an app that uses dat but also needs to store extra key=value pairs in the same TXT record. The proposal is to relax the regex slightly, so that a datkey can be mixed with other key=value pairs in the same TXT record.

In my case the TXT record would look like:

datkey=3390cfcc601a97174f1221aa7277e0e5b49e5f3e973e4edecac0c66791c882c4;l=53.52294522577663,-113.30244317650795

But I am not proposing what the delimiters should be in any way, and I would not recommend mandating any specific delimiter. Leave that part flexible.

So I have put together a pull request that relaxes the regex slightly and would allow for other use cases of the same TXT record. No delimiters are specified, and all current TXT records would continue to function fine.
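
Just to illustrate the kind of relaxation being proposed (this is not the actual dat-dns pattern from the pull request):

// Pull a 64-hex-character dat key out of a TXT record even when other
// key=value pairs share the same record.
const DAT_KEY_RE = /datkey=([0-9a-f]{64})/i

function extractDatKey (txtRecord) {
  const match = DAT_KEY_RE.exec(txtRecord)
  return match ? match[1] : null
}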

dat-ecosystem-archive/dat-dns#19 (comment)

Discussion: dat publish & registry API DEP

simple registry HTTP API DEP - informative (@pfrazee)

We discussed the API around dat publish briefly in the last WG meeting. @pfrazee will be drafting a DEP around this, but I wanted to open an issue for any related discussion. The first DEP will aim to be simple. In the meeting we also brought up needs around:

  • registry vs. pinning
  • it might be nice for such an API to have a way to accept "nominations" without authentication (eg, how archive.org has "save page now", or how EDGI accepts nominations for archiving); most registries wouldn't implement this
  • a simple API for telling a REST server to archive any dat or hypercore is definitely useful by itself
  • writing to dat.json to prove ownership for a given registry
