(Mostly) Data Sync Research Log

While this is true for the current setup, I think assuming we’ll run on TCP forever is too strong. UDP was specifically chosen in e.g. Dispersy due to NAT traversal, and similar considerations probably apply to Internet-less usage and running in the browser. The libp2p folks might have some discussion about this in their forum, but I haven’t looked into it.

No assumption on running on TCP forever was made :slight_smile:

My point is that, as I understand it, this layer will not be directly connected to peers (i.e. you won’t have to do any NAT traversal at this layer), and will piggyback on the transport below for these kinds of things (which will have to provide a list of peers and a way to communicate with them).

So unless you are planning to have this running directly on top of UDP/TCP, I don’t think we should take MTU much into account. As mentioned, we should of course keep the payload as small as possible, but that goes without saying.


May 13

First, started the paper over with a more problem-solution oriented structure and less of the huge comparison. First section draft done, related work: p2p data sync take two - related work section · status-im/bigbrother-specs@219918e · GitHub. Tomorrow, continue incorporating existing MVDS etc. requirements into the problem and solution sections. Aim is to get a solid draft this week (~4-5000 words).

Also looked into Swarm feeds a bunch:

Adaptive frequency: https://github.com/ethersphere/go-ethereum/pull/881 and https://github.com/ethereum/go-ethereum/pull/17559

Parallel feed lookups: Parallel feed lookups by jpeletier · Pull Request #1332 · ethersphere/swarm · GitHub

What’s the purpose of the adaptive-frequency lookup algorithm FluzCapacitor in Swarm for Feeds?
To find the latest update in a feed, i.e. the latest chunk.

What problem does the adaptive frequency MRU(Feed) in Swarm solve?
Previous implementation required users to agree upon frequency and start time beforehand to publish updates.

What does the adaptive frequency MRU algorithm in Swarm allow users to easily guess?
Where the next update ought to be.

What is an Epoch in Swarm Feeds?
A section of time.

What pieces of data does an epoch consist of in Swarm?
A start time and a level.

What does a level correspond to in Swarm feed epochs?
A duration of time, 2^n.

What is epoch id in Swarm Feeds?
A number derived from a binary operation on the start time and the level.
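Putting the last few answers together, here is a small sketch (purely illustrative; the actual byte layout in Swarm’s feed lookup code may differ) of how a start time and a level could combine into an epoch:

```python
# Hedged sketch of Swarm Feed epochs as described above.
# An epoch at `level` spans 2**level seconds; its base (start) time is
# the update time with the lowest `level` bits dropped. The epoch id
# here packs the level into the high byte of a 64-bit base time --
# one plausible "binary operation", not necessarily Swarm's exact one.

def epoch_base_time(t: int, level: int) -> int:
    """Start time of the epoch containing time t at this level."""
    return t & ~((1 << level) - 1)

def epoch_id(t: int, level: int) -> int:
    """Combine base time and level into a single 64-bit identifier."""
    return epoch_base_time(t, level) | (level << 56)

t = 1_557_705_600                   # a Unix timestamp (May 13, 2019)
print(hex(epoch_base_time(t, 25)))  # -> 0x5c000000 (a ~1-year-wide epoch)
```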

What’s the harvesting algorithm for Swarm Feeds?
It describes how to find the latest update of a resource by walking back in time.

How does the harvesting algorithm for Swarm Feeds roughly work?
By dividing the grid of time (start time x duration) into smaller and smaller sections for each lookup step.

For a Swarm feeds update, if you have a period covered at level n with some duration (start_time to start_time + 2^n), what is always true for epochs at levels below n?
That they are covered by higher levels; each epoch provides ‘shade’ for the layers below.
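A heavily simplified sketch of that walking-back idea (not the real FluzCapacitor implementation; `lookup` is a hypothetical callable standing in for a network query, assumed to return the latest update within an epoch or None):

```python
# Simplified harvesting walk: start at the widest epoch covering `now`
# and narrow downward, stepping to just before the current epoch when a
# lookup comes back empty. Each step shrinks the region of the
# (start time x duration) grid under consideration.

def _base(t: int, level: int) -> int:
    # epoch start time: drop the lowest `level` bits of t
    return t & ~((1 << level) - 1)

def find_latest(lookup, now: int, max_level: int = 25):
    """`lookup(base, level)` is assumed to return the latest update in
    [base, base + 2**level), or None if that epoch is empty."""
    latest = None
    t, level = now, max_level
    while level >= 0:
        hit = lookup(_base(t, level), level)
        if hit is not None:
            # remember it, then keep looking in narrower epochs
            latest = hit if latest is None else max(latest, hit)
        else:
            b = _base(t, level)
            if b > 0:
                t = b - 1  # nothing here: step to just before this epoch
        level -= 1
    return latest
```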

What does the parallel feed lookup function do in Swarm?
Explores keyspace more efficiently by launching queries in parallel.

The parallel feed lookup algorithm (LongEarth) solved a timeout issue: before, each lookup to a remote node might wait 100ms, but this could time out, resulting in false lookup failures. Using a longer timeout, say 1s, would solve that, but then each lookup might consist of 5 round trips, leading to ~5s per lookup (and even 20s for the first one).

What does the parallel feed lookup function (LongEarth) in Swarm basically do?
Binary search, divides epoch grid into lookback area and lookahead area.
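One step of that division could be sketched like this (illustrative only; the names and structure are assumptions, not the actual LongEarth code):

```python
# Query the lookahead (at/after the pivot) and lookback (before it)
# halves of the epoch grid concurrently. A lookahead hit wins, since
# anything found there is necessarily more recent.
from concurrent.futures import ThreadPoolExecutor

def lookahead_lookback(query, lo: int, mid: int, hi: int):
    """`query(lo, hi)` is an assumed callable returning the latest
    update in [lo, hi), or None if nothing is found there."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ahead = pool.submit(query, mid, hi)
        back = pool.submit(query, lo, mid)
        a = ahead.result()
        return a if a is not None else back.result()
```

Because both halves are in flight at once, a slow or timing-out lookback query no longer serializes behind the lookahead one, which is roughly why longer timeouts stop multiplying into multi-second total lookups.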

Once this is merged upstream: Parallel feed lookups by jpeletier · Pull Request #1332 · ethersphere/swarm · GitHub, false timeouts should be less of an issue for remote node testing.


Louis proposal, some questions: Status.app reply


Want to revisit video Viktor made on Swarm vision, and also (new?) Swarm paper. See if there’s time to expand on Staples PoC a bit.

May 14

Brief update.

Draft now up to 4000 words: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md

Almost all of the system model and enhancements done. A bit left on DSP. Next is adding the wire protocol and protobuf as well. Also add SSB, and isolate the Feeds list and getting the actual historical data. Then example clients, desired simulations ~, then some references, and we have a solid first draft. Is latency/UX a first-class design requirement, perhaps? Also multidevice sync at some point.

Question on message ID compatibility: How are chunk IDs in Swarm calculated? How do they differ from our message IDs, and can they be consistently mapped? What about IPFS IDs? It would be useful if the same piece of content had the same address, even with ~multicodec. If it isn’t possible, how would we do the mapping?


By chunk id (swarm) do you mean the 32 byte address that the chunk has on the network, used for routing etc?

That’d be useful! I don’t know if it is possible, but having the same or similar address for content stored/calculated locally, in Swarm and IPFS would be great. Perhaps using some multicodec stream. I didn’t look into this yet so maybe it is already possible.

Also fyi @yenda I think you asked about this. I ended up hacking this for Staples PoC by reuploading content to get hash, hackathon style :stuck_out_tongue:

I doubt the hashes will ever be the same because of the BMT hasher scheme for swarm designed to enable “word-size proofs” of 32 bytes. References would have to be wrapped by some other object, but of course that would be possible, I presume. The multicodec would only solve how to literally reference the location under the different schemes, no?

Indeed, the problem is how to go from something like datasync/<hash1> to swarm/<hash2> when they refer to the same content. You can imagine references forming a DAG inside a data sync payload, but that doesn’t tell you how to find it in e.g. Swarm. You could overload it and say ‘here are some names for this’, but that seems ugly, and it wouldn’t allow nodes to later on replicate to Swarm. It might be doable by considering the wrapping, but I haven’t thought through the consequences properly yet.

May 16

Up to 5700 words for draft. https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md

Added a brief SSB summary. Expanded on the stateless sync enhancement and bloom filter subset considerations. Added a subsection on content-addressed storage and update pointers. Added current specification types and protobuf. Minor references. Added a Status-specific section on caveats for a minimal version (mode of operation, role of mailservers, parent IDs, current implementation).

Next:

  • (Ongoing) add more details to spec and wire protocol
  • Example client section
  • Content addressed storage: We can use multihash/codec to refer to the same piece of content, but how and where do we communicate these so it is uniform? I.e. Swarm, IPFS, Data sync (and Whisper?). I.e. go from data sync ID to swarm chunk id
  • Proof-evaluation-simulation section desired simulations and variables (churn, latency etc)
  • Future work section (move enhancements into this?)
  • Crystallize system model with components and specific names (CAS, node etc)
  • Understand to what extent latency/UX is a requirement
  • Briefly mention multidevice sync
  • Ensure written in a generalized messaging way, i.e. not just human-to-human chat
  • Outline more generalized messaging use cases (group chat, transactions, state-channels, data feeds, ~m2m, more?)
  • Add references for systems compared
  • Ensure challenged networks captured adequately, i.e. DTN mesh networks too
  • Write conclusion
  • Turn into LaTeX
  • Review - read through whole and make sure structure makes sense; tighten up
  • Get feedback from a few people
  • Read over similar types of papers and see what is missing in terms of rigor/analysis
  • Get feedback/corrections from larger community (ethresearch, specific projects mentioned, etc)
  • Find good place to submit preprint (permissionless)
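On the content-addressed storage item above: multihash at least makes the hash function self-describing. A minimal sketch (sha2-256 has code 0x12 in the multiformats table; the varint encoding collapses to a single byte for values under 128):

```python
import hashlib

def multihash_sha256(data: bytes) -> bytes:
    """Frame a sha2-256 digest as <fn-code><digest-length><digest>."""
    digest = hashlib.sha256(data).digest()
    return bytes([0x12, len(digest)]) + digest

mh = multihash_sha256(b"hello")
print(mh.hex()[:4])  # -> 1220: fn code 0x12, length 0x20 (32 bytes)
```

This only names the digest uniformly, though; it doesn’t by itself map a data sync ID to a Swarm chunk ID, since Swarm’s BMT hash is computed differently over the same content.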

here are some names for this

I’m not sure I understand. Example?

Where exactly can I read about your message IDs? And are those IDs what you here call e.g. datasync/ ?

Maybe a specific sync DAG can commit to one specific storage layer, then store the disambiguations there?

I mean this message id: bigbrother-specs/data_sync/p2p-data-sync-mobile.md at master · status-im/bigbrother-specs · GitHub or protocols/BSP.md · master · briar / briar-spec · GitLab in BSP.

By datasync/<hash1> and swarm/<hash2> I mean whatever it would look like using multiformats, a la

Maybe a specific sync DAG can commit to one specific storage layer, then store the disambiguations there?

Possibly, it’s not clear to me exactly what this would look like though

Couldn’t they just be concatenated multiaddrs in ascending order?

Theoretically they could, but that seems pretty messy in my opinion. For one, it tightly couples two different concerns. It also doesn’t allow for a decision to replicate remotely after a message has been created, which seems like a problem. Right now I’m leaning more towards some form of wrapping, where these concerns are kept separate. I.e. references to chunks, and then inside a chunk you have messages with their own message IDs. This seems like it’d make compatibility easier (raw data sync messages inside Whisper envelopes / Swarm chunks / IPFS thingies). Perhaps there are some analogies with filesystems and how compatibility is kept between FAT/NTFS/Ext3/ZFS etc.?
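The wrapping idea could be sketched roughly like this (all names hypothetical; sha256 stands in for both the message ID scheme and Swarm’s BMT hash):

```python
import hashlib

def message_id(payload: bytes) -> bytes:
    """Data-sync-level ID: derived from the raw message bytes only."""
    return hashlib.sha256(payload).digest()

def swarm_reference(payload: bytes) -> bytes:
    """Storage-level reference: computed over the wrapped bytes, so it
    necessarily differs from the message ID."""
    return hashlib.sha256(b"swarm-chunk:" + payload).digest()

# Because the two IDs are independent, the message -> chunk mapping can
# be created (or changed) after the message exists -- which is exactly
# what concatenated multiaddrs fixed at creation time would not allow.
payload = b"raw data sync message"
mapping = {message_id(payload): swarm_reference(payload)}
```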

Brief update:

Paper https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-mobile.md up to 6800 words. Briefly on system components. Mailserver upgrade path and disambiguation. Brief section on Swarm/chunk IDs. Sketch data sync clients examples. Sketch simulations required.


@oskarth I am not sure about the part on Swarm feeds. Feeds have topics, so you can have more than one feed, and there is access control, so while you can have public feeds, you can also have private ones.

Also, the feed itself can’t store much content; it is more intended to store a manifest than the actual data. You can implement Status feeds as some sort of linked list so that others only have to fetch as far as they need to. And finally, as you mention, PSS can be used for live updates, so you only use feeds as an alternative to a mailserver, for instance; as long as you are connected you can get updates directly with PSS.

I am not sure about the part on Swarm feeds. Feeds have topics, so you can have more than one feed, and there is access control, so while you can have public feeds, you can also have private ones.

Which part? I agree with all the things you said here. Maybe something is badly written and can be explained better, please let me know where.

In your document here bigbrother-specs/data_sync/p2p-data-sync-mobile.md at master · status-im/bigbrother-specs · GitHub