(Mostly) Data Sync Research Log

Inspired by Pedro’s daily updates for the road to Nix, I thought I’d start a daily research log for data sync. It is becoming increasingly obvious that we need to make this a priority and work on getting it into the app as soon as possible. This will only become more apparent in a few weeks’ time when we have Chaos Unicorn Day.

In order to get this done right we need to do our homework. This means looking at multiple existing solutions and comparing them, so we can make an informed decision on what trade-offs we are making and why.

Right now, we have a rough PoC spec based on BSP, a simulation, and a Whisper-based end-to-end PoC (video). One mistake I made was settling too quickly on Bramble Synchronization without understanding the underlying principles or design decisions behind it in more detail. It might be the case that we’ll end up with something similar at the end of this, but the rationale should be stronger.

Previous posts

Why does this matter?

If Status is to succeed in its mission to have widespread adoption of Ethereum, having reliable, decentralized messaging is a requirement. Currently, we are relying on a centralized setup to provide reliable messaging.

The challenges in providing reliable messaging for decentralized systems are plentiful: you don’t have a single source of truth, and because nodes are autonomous they can join and leave the network at any time, which impacts availability of individual nodes.

A common approach for dealing with these problems in a distributed setting is data replication, i.e. storing the same data in multiple places. A natural way to do this in a pure p2p setting is to let the participants in a group chat store each other’s data (messages). Additionally, since we are building for mobile, it is desirable to have other nodes take part in this replication process. This relates to the work being done on network incentivization.

How does this relate to the other protocol work?

Data sync is part of it, and the most immediate need from our point of view. Collaboration and research for transport privacy and conversational security layer will continue, but this takes priority now.

Additionally, having a clear understanding of what the data sync layer will look like is important for the other layers as well, as it informs design decisions and constraints.

What’s next?

1) Research and hacking.
a) Continue studying relevant literature and existing solutions
b) Document findings and create a juxtaposition of different approaches
c) Informed by above, resume PoC hacking and precise specification

Once there’s confidence in PoC and spec, make a plan for getting it into the app

2) Finding great people.
a) Keep track of researchers and people with relevant expertise
b) Triage based on their relevance and likely availability
c) As budget and time frame allows, reach out for collaboration

Tracked in a private spreadsheet to respect people’s privacy (PM me if you need access)

Posting first entry in a separate post.


March 15, 2019

Papers

Survey of Data Replication in P2P Systems

Main focus.

Citation: Martins, Vidal, Esther Pacitti, and Patrick Valduriez. Survey of data replication in P2P systems. Diss. INRIA, 2005.

Summary: This paper goes through various dimensions of data replication in a p2p context, breaking the problem apart dimension by dimension. It is focused on large-scale collaborative documents. As a literature review, it starts by breaking down basic concepts such as multi-master, asynchronous, optimistic replication, etc.

Evaluation: Partially read and focusing on basic concepts. Might revisit later. In terms of basic dimensions, to place other data replication systems in, we can create the following table:

| Attribute | Space |
|---|---|
| Read-only? | Read-only or read & write |
| Updates: where? | Single-master or multi-master |
| Replication distribution | Full or partial replication |
| Updates: when? | Eager (sync) or lazy (async) |
| Optimistic replication? | Optimistic or non-optimistic (asynchronous updates only) |

An example of read-only p2p data replication is BitTorrent, where we replicate a file consisting of chunks, but the file never updates. For write support the main choice is whether we want single-master or multi-master.

Question: Since the minimal unit of replication for messaging is a message object, these are immutable and never change. A device would then write a message to its own log, sync with other nodes, and this message would never change; it’d be immutable. Does this mean single-master is more relevant to us, or am I looking at this problem wrong? (See the sketch below.) Also see The Dangers of Replication and a Solution below.
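To make the single-writer intuition concrete, here is a minimal sketch (hypothetical names, not the PoC code): each device is the sole writer of its own append-only log, messages are immutable and content-addressed, and replication is just copying missing entries.

```python
import hashlib
import json

class DeviceLog:
    """Append-only log owned by a single device (the only writer)."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.entries = {}   # message id -> message
        self.order = []     # local append order

    def append(self, body):
        # Content-addressed, immutable message: the id is a hash of the
        # content, so a message can never change, only new ones get appended.
        msg = {"author": self.device_id, "seq": len(self.order), "body": body}
        msg_id = hashlib.sha256(json.dumps(msg, sort_keys=True).encode()).hexdigest()
        self.entries[msg_id] = msg
        self.order.append(msg_id)
        return msg_id

    def sync_from(self, other):
        # Replication = copying entries we are missing. No write conflicts
        # are possible, since only other.device_id ever writes to other's log.
        for msg_id in other.order:
            self.entries.setdefault(msg_id, other.entries[msg_id])

alice, bob = DeviceLog("alice"), DeviceLog("bob")
alice.append("hello")
bob.sync_from(alice)   # bob now replicates alice's single-master log
```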

The paper then goes further into P2P network overlays etc., which seems worth looking into more. They also reference a bunch of common technologies and how they fit in, such as DNS, CVS, etc. However, the end of the paper seems to be about their own protocol, which I’ve never heard of before. A lot of the focus is on the collaborative aspect, where you deal with updates to the same piece of (complex) object, which seems less relevant to us.

Relevant keywords: data replication, reconciliation, eventual consistency, P2P Networks, optimistic replication.

Selection of resources encountered

Bramble vs Matrix

I started a comparison between the Bramble and Matrix replication protocols, but I realized I didn’t have a good enough understanding of the underlying components/dimensions. Thus the shift to going over survey literature. Super rough notes can be found here: Data sync protocol comparison - CodiMD. Once I understand the commonly agreed upon dimensions, I’ll re-do this table, and probably include a few other protocols.

People

I used to have a spreadsheet of people, and I have notes on this spread out in various places. Going to collect these and add them to the document. To get started, I added a few people with their relevant area and group affiliation. Didn’t look into them in more detail.

Next

Keep looking into data replication protocol dimensions via survey literature: the Replication book, and some more on immutable logs.


March 16, 2019

Catching up with some previous notes and also looking into consistency models. The paper on high-risk users and their disconnect with developers is relevant for understanding who we are actually building for.

Papers

Halpin, Harry, Ksenia Ermoshina, and Francesca Musiani. “Co-ordinating Developers and High-Risk Users of Privacy-Enhanced Secure Messaging Protocols.” International Conference on Research in Security Standardisation. Springer, Cham, 2018.

Slightly different track, but I came across this paper again and thought I’d write up my thoughts around it, as it might be useful to other people at Status. Mostly read bits and pieces, didn’t deconstruct or read in detail.

Summary: Survey of developers, low-risk and high-risk users and what they want in a secure messenger. Explores two hypotheses: (1) developer-user disconnect (they see needs and priorities differently); (2) high-risk user problem (high-risk users have different needs). Based on interviews and field work. Looks into dimensions such as: main threat; security (key verification, ephemeral messaging); privacy (metadata collection, pseudonymity); group support; decentralization; standardization; open licensing.

Evaluation: Largest-scale user study of real-world high-risk users and the developers of these messaging apps. Definitely worthwhile to skim and think more about as it relates to what we are making, even if their conclusions are not always correct (“they would’ve said they wanted a faster horse”). Example takeaways: High-risk users don’t care about decentralization, standardization, and open licensing. They do very much care about metadata collection, pseudonymity, ephemeral messaging and group support. Their main threat is on the client side.

Example to illustrate this: their threat model might be the local government spying on them, where a perfectly fine solution is to use Google Drive for photos, and the motto “Use Tor, Use Signal” doesn’t make sense for such a user. They just want to be protected from the local government. There are more stories of this nature.

The biggest difference for high-risk users, as I see it, is that they generally have a very specific threat model, and they will use whatever tool solves that problem. The stakes for them are real, so they can’t afford the luxury of waiting for something perfect, nor do they care much about “developer tribalism”, as they are primarily acting in a different arena.

Further reading: See below for decentralization and privacy survey.

Troncoso, Carmela, et al. “Systematizing decentralization and privacy: Lessons from 15 years of research and deployments.” Proceedings on Privacy Enhancing Technologies 2017.4 (2017): 404-426.

Also by the same author as above, and relevant, so I might as well include it here.

This looks at the problem space of decentralization w.r.t. privacy, and compares various types of systems along attributes such as Infrastructure, Topology, and Authority. Only skimmed so far, see below for excerpt.

Excerpt:

Charron-Bost, Bernadette, Fernando Pedone, and André Schiper. “Replication.” LNCS 5959 (2010): 19-40.

Large book, based on 30 years of research and summarizing what we know, which is a lot. Worried about it being a bit too academic and not applied enough, especially for things like p2p.

Chapter 1, Consistency Models for Replicated Data by Alan D. Fekete and Krithi Ramamritham, available online.

Summary: Overview of consistency models and how to think about them. A consistency model is the set of properties a replication algorithm provides, and the guarantees this gives the user. This is important for comparing different systems, and especially important for application developers, because it helps them understand the behavior they see when interacting with a replicated storage system. Provides an overview of sequential data types, strong consistency, weak consistency and transactions.

A sequential data type is a formalization of a set of operations and how they impact some state. The simplest model is a single-bit read-write register supporting operations like write(0) => ok, write(1) => ok, and read() => 0 or 1. Through this we can then look at a sequence of such operations and make sure things make sense at each step. This leads to the notions of linearizability and serializability.

Strong consistency: all accesses are seen by all nodes in the same order (sequentially). For replication, this means you are likely to have problems with e.g. latency and availability. In practice this usually requires a single sequencer node to guarantee that nodes receive messages in the same order.

There are a bunch of different consistency models. The ones most likely relevant to us are eventual consistency and causal consistency. Weak and strong consistency appear to be more of a continuum, e.g. see Wikipedia and Jepsen below.

Eventual consistency: Update then propagate; puts a premium on availability and disconnected operation. It’s a liveness property, not a safety property. In the absence of updates, things settle down to the same state. They also include a more formal definition.

Session properties: to make the model more convenient to work with, we add constraints on what a session can observe. These are things like Read Your Writes and Monotonic Reads.

Causal consistency: roughly a subset of eventual consistency (in terms of allowed behaviors), which additionally gives us causal ordering.
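Causal ordering is commonly tracked with vector clocks (or, similarly, message dependencies as in Bramble). A minimal sketch of Lamport’s potential causality, illustrative rather than from the chapter:

```python
def tick(vc, node):
    # Advance the local component on every local event/send.
    vc = dict(vc)
    vc[node] = vc.get(node, 0) + 1
    return vc

def merge(a, b):
    # On receive: element-wise max of both clocks.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def happened_before(a, b):
    # a -> b iff a <= b element-wise and a != b (potential causality).
    return all(a.get(k, 0) <= b.get(k, 0) for k in a) and a != b

lunch = tick({}, "A")              # A asks "lunch?"
yes = tick(merge({}, lunch), "B")  # B answers after seeing it
no = tick(merge({}, lunch), "C")   # C answers independently

assert happened_before(lunch, yes) and happened_before(lunch, no)
# "yes" and "no" are causally independent (concurrent): no order either way.
assert not happened_before(yes, no) and not happened_before(no, yes)
```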

There’s also more on transactions and things like snapshot isolation and one-copy serializability that I didn’t look into.

Evaluation: Good for placing some concepts in context and putting names on things. A bit too academic though. It leads to some other good resources and further reading, as hinted below. And it points to the importance of evaluating consistency models, as a consistency model provides a form of interface and UX to whoever is using the replication layer.

Interesting to note the focus on shared memory / update in place. It isn’t clear to me to what extent this is relevant for the messages themselves. Looking at the original Linearizability paper (hinted below), they say that it’s useful not just for shared-memory models, but also for things like message queues. Some questions/notes on shared memory:

  • Immutability for chunks of messages? (strong bias to say yes)
  • Participants in a group chat? (Matrix has a separate state thingy they sync IIRC)
  • Message queues relevant for propagation, and if so how to think about mutability?
  • Canonical message history? (update pointer to merkle root)
  • Similar to above, link to chat i.e. foochate.stateofus.eth (updating shared state/pointer)
  • Blockchain as consensus engines / data availability layer, when relevant/leverage?

Other notes/questions:

  • For eventual consistency we need simulation to get reasonable bounds for latency.
  • TODO: Update above comparison diagram with consistency model
  • Are there any other consistency models people use apart from causal consistency in p2p replication?
  • Session properties that are relevant and how it impacts how the layer is used?
  • Better understanding of safety and liveness would be useful
  • The paper mentions group communication protocols (ch. 3.2, 5); I’ve seen this a few times and it seems like a relevant part of the literature.

I also don’t feel like I fully grok linearizability and serializability. Linearizability is about the correctness of a single operation on a single object behaving as if you had a single-site unreplicated system; writes appear to be instantaneous. There is some difference in usage depending on whether researchers come from a distributed systems or a database background.

Further links:

Keywords: consistency models, strong consistency, linearizability, serializability, replicated data, session properties, eventual consistency, causal consistency, liveness, safety

People

Updated the doc with ~5 more possibly relevant people. Still no triage. Mix of known individuals and some names of people I came across.

General heuristic: should have relevance in more than one field, i.e. replication and p2p, not just database research.

Next

Some paths forward.

  • Immutability: Didn’t look into this more; want to go through “The Log: What every software engineer should know about real-time data’s unifying abstraction” and “Immutability Changes Everything”

  • P2P consistency models, causal consistency session guarantees; also document Matrix/Bramble differences/similarities

  • Sort out single-master/multi-master relevance, not clear what best lit or mental model for this is. Related to immutability and questions on shared memory above.

  • Group communication protocol literature? Useful for N participant sync, especially if there’s p2p overlap

  • P2P Overlay relevance and discovery of relevant nodes, some further reading in March 15 paper

Misc thoughts/notes

Data access is done through transactions, which are a series of reads and writes followed by a commit.

The minimal objects we want to replicate are chunks of data, aka messages or events.

Two ways of looking at Matrix: Server 2 Server API only or the federated reality with Client-Server API. Different properties.


This is an awesome initiative @oskarth good stuff, wdyt about setting up Zotero and a group so we can create a library of papers/research articles, categorize and comment on them etc?

Also curious if you intend to increase the scope of your comparative research with other projects like Scuttlebutt?

That’s a good idea, I’ll look into it. Collaborative research ftw.

Scope: Yes, initially limited but that is the intention. E.g. grow both in dimensions (column) and what is being compared (rows).

March 18, 2019

Abbreviated day due to meetings.

Papers

The Log: What every software engineer should know about real-time data’s unifying abstraction, Jay Kreps, 2013

This isn’t controversial, and I read it a few years ago when working on event sourced architectures with distributed message queues, but it’s useful to re-read.

Summary: At LinkedIn, many problems reduced to writing a log. Aka write-ahead log, commit log, transaction log. This article outlines what it is, why it’s a fundamental data structure, where it is used and how you can extend and scale it. It also links to various related distributed systems literature.

Misc notes:

What: Append-only, totally ordered sequence of records ordered by time. Discrete log time. Record what happened and when. Immediate authoritative source; then use projections on top of it. Machine-readable, often internal in DBs etc.

PostgreSQL etc. have a log shipping protocol to transmit parts of the log to a replica database, which acts as a slave (NB: indicates master-slave).

Problems it solves: (a) Ordering changes (b) Distributing data.

State Machine Replication Principle: If two identical deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

A replica can then simply be described by the maximum log entry it has processed.

Two models: the state machine replication model and the primary-backup model.

What’s the difference? In the state machine replication model we write transformations/events to the log, which is then fed to each peer. Vs primary-backup: go via a primary node, then log the results.

Example: an arithmetic service starts at 0, then applies +1, *3. We either log the transformations (+1, *3) or the results (1, 3).
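A tiny sketch of that example, contrasting the two models as I understand the article (illustrative only):

```python
def apply_op(state, op):
    kind, arg = op
    return state + arg if kind == "add" else state * arg

# State machine replication: log the *operations*; every replica applies
# them deterministically in the same order.
ops_log = [("add", 1), ("mul", 3)]
replica_1 = replica_2 = 0
for op in ops_log:
    replica_1 = apply_op(replica_1, op)
    replica_2 = apply_op(replica_2, op)
assert replica_1 == replica_2 == 3      # (0 + 1) * 3

# Primary-backup: the primary applies operations and logs the *results*;
# backups just take the latest result.
primary, results_log = 0, []
for op in [("add", 1), ("mul", 3)]:
    primary = apply_op(primary, op)
    results_log.append(primary)         # results: [1, 3]
backup = results_log[-1]
assert backup == primary == 3
```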

A log can be seen as a series of decisions about the next value to append. Relevant for a distributed, consistent, shared log. E.g. see Raft, Paxos, Viewstamped Replication (?), [and blockchains].

Theory came before practice in distributed systems research, which might explain why there’s a bunch of talk about single-value registers as opposed to logs as abstractions.

Duality: a log of changes and a table. Log: list of credits and debits; table: account balances. The log is more fundamental because you can re-derive the table from it.

Changelogs give near-real-time replicas. Similar to version control: pull down patches and apply to a snapshot.

It then talks about why it is useful for things like data integration, real-time data processing, and distributed system design. Skimming this part.

With logs we have a logical clock, so subscribers can measure their position against it (e.g. block height). Allows for reasoning about each subscriber’s state.

Data production is asynchronous from data consumption. Subscribers can subscribe at different rates.

Hints at atomic broadcast, aka total order broadcast, as a log for a messaging system, where multiple processes receive events in the same order.

Isolate consumers from the source of data for many different services. Led to Kafka (and later, AWS Kinesis).

[ETL stuff, not interesting.]

Scalability is hard. Tricks: partition the log; better throughput with batched r/w; avoiding too many data copies.
[-- aka implied async optimistic; partial replication]

Each partition is a totally ordered log, but there is no global ordering between partitions. No coordination between shards.
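A small sketch of that partitioning trick (illustrative; a real system would use a stable hash function rather than Python’s salted hash()):

```python
from collections import defaultdict

N_PARTITIONS = 4
partitions = defaultdict(list)   # partition id -> totally ordered log

def append(key, record):
    # Records with the same key land in the same partition, so they are
    # totally ordered relative to each other. Records in different
    # partitions have no global order and need no coordination.
    p = hash(key) % N_PARTITIONS
    partitions[p].append(record)
    return p

append("chat-1", "msg a")
append("chat-1", "msg b")   # ordered after "msg a" within its partition
append("chat-2", "msg c")   # no defined order relative to chat-1's messages
```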

Log easy to optimize for read and write patterns.

Kafka binary format: same in-memory, on disk and over network. Can do zero-copy data transfers.

Together this means you can r/w data at the rate supported by the disk/network.

Also some real-time stream processing; skipping.

Data Flow graphs: feeds that result from many feeds. E.g. DAG of logs and jobs impacting other logs.

Briefly on log compaction.

Summary in paper - architecture: handle data consistency (eventual and immediate) for concurrent updates to nodes. Data replication. Handling rebalancing.

Example for system with two logical parts: log and serving layer. Client writes to log (or proxied), then log writes to serving node, which is what client reads from.

Has some links to related material like replication algorithms, Paxos/Raft, and the Replication book.

Event sourcing as being same as state machine replication.

Evaluation: Tool for distributed system design. Useful overview and has some good pointers. A lot of it seems to be implicitly consensus-based though (e.g. distributed cluster vs p2p and local).

How Kafka achieves fault tolerance: partitioned data; leaders and followers; at any time a partition has one leader and N followers. Election process.


Similar to Ethereum in that you have state and STF, STF(s0, e0) => s1. Events array [e0, e1,…]

(NB: Receive events in same order - probably false! This should be more DAG, where it might look different.)

A lot of this implicitly has one single distributed log and then replicas.

  • Question: When is it desirable to have a single log that we have consensus about? Vs multiple individual logs.
  • Think more about exactly how a DAG generalizes a log, and what the trade-offs are (e.g. wrt ordering properties and lack of coordination/consensus), as well as to what extent they are the same thing.

Misc resources

  • Thorough Introduction to Apache Kafka™ | HackerNoon - just to see how a distributed Kafka cluster works (N participants, one broker at a time that can persist and die, election-based).

  • Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial http://www.cs.cornell.edu/fbs/publications/smsurvey.pdf - linked from “The Log” article.

  • Primary Backup approach http://www.cs.cornell.edu/fbs/publications/DSbook.c8.pdf - linked from “The Log” article.

  • Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab https://arxiv.org/pdf/1309.5671.pdf - If you are “shopping for a replication algorithm”, read this, according to “The Log” article. Not clear to me how relevant this is wrt consensus, but perhaps worth a skim.

  • Atomic broadcast - Wikipedia. I don’t understand how this works yet, but total order broadcast seems useful to understand somewhat.

  • http://www.pmg.csail.mit.edu/papers/vr-to-bft.pdf From Viewstamped Replication to Byzantine Fault Tolerance, by Liskov. From the Replication book. Seems like there are two branches of replication algorithms, where BFT is a continuation that deals with arbitrary failures. Both came from MIT apparently, and both deal with ‘state machine replication’. Still unclear to me to what extent we need something like BFT/consensus for replication, but worth digging into the basics more and understanding trade-offs. Also see: basis of blockchains.

Immutability Changes Everything

Summary: We need immutability to coordinate at a distance, and we can now afford the storage requirements. A change in tech and how we do computing. How updates are layered on top of LSF (Log-Structured File systems), COW (Copy on Write), LSM (Log-Structured Merge trees). Also implications for e.g. replication [what I'm interested in]. Then talks about hardware and finally trade-offs.

Append-only computing. Accountants don’t use erasers or they go to jail.

Single-master computing - order changes, either through a centralized master or some consensus-like protocol.

Append-only: record facts then derive
Immutable files - Replication of files w/o update anomalies.

Versions and history:

  • linear version history - strongly consistent.
  • alternatively directed acyclic graph - many parents/children, eventually consistent.

Versions as immutable objects, a book is published.

GFS and HDFS provide immutable files, chunks replicated across data nodes and identified by GUID. [Similar to Swarm?].

“master-less” file store.

Consistent hash ring: Rebalancing under failure and additional capacity. “Cannot get stale versions of data”

Trade-offs: Denormalization, flexible but fat. Write amplification vs read perspiration.

Clean replication: data being immutable with unique ids means fewer challenges. No stale versions, so less picky about where replicas are; also fewer bugs.

Evaluation: General overview of immutability across the stack. Either nothing really new or not super relevant. Useful to think about linear history as strongly consistent vs a DAG as eventually consistent though.

Question: Is it possible to have a linear version of history in p2p? Probably yes (well, consensus and using a chain of blocks…), but is it desirable? Why / why not.

  • Look more into consistent hashing; it is useful for data partitioning (you need to remap fewer keys). A DHT is one example of this. See the sketch below.

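A minimal consistent hash ring sketch (no virtual nodes; names illustrative). Keys map to the first node clockwise from their hash, so adding a node only remaps the keys that fall in its arc:

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self.points = sorted((h(n), n) for n in nodes)

    def lookup(self, key):
        # First node clockwise from the key's position (wrapping around).
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in "uvwxyz"}
ring = Ring(["node-a", "node-b", "node-c", "node-d"])   # add capacity
after = {k: ring.lookup(k) for k in "uvwxyz"}
# Only keys in node-d's arc move; all other keys keep their assignment.
moved = [k for k in before if before[k] != after[k]]
```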

Meta notes

Too wordy, better to read through then jot down summary/notes.

Next

  • See above “nexts”, minus immutability
  • Think more about and formulate more precise statements on linear versions vs DAG-based (consistency model, coordination, consensus requirement, semi-trusted/transparent ordering impact (also see group communication protocols?))
  • Also look into setting up Zotero
  • Setup call for Web3 of Messaging, where data sync is communicated as phase 0 from our POV

March 19, 2019

Feeling a bit under the weather today, but I spent some time with the excellent Jepsen consistency models overview (Consistency Models), as well as trying to understand the subtleties of various consistency models and session guarantees.

Papers read

Jepsen - Consistency Models

Summary:


Starts with a good overview of the fundamental concepts that consistency models and distributed systems builds on. Operations, processes, invocation/completion times, imaginary real-time, concurrency, history, consistency models.

Notes

Diagram of various consistency models and how they relate. Several mini-articles on the various concepts. Zooming in on the most relevant ones for this exposition.

Linearizability: Operations occur in (imaginary) real-time clock order. Can’t be totally or sticky available; Useful if real-time ordering is a requirement.

Sequential consistency: Operations appear in some total order that reflects each process’s order of events. Individual nodes may lag behind reality. Doesn’t provide availability.

Causal consistency: Captures the notion that causally-related events should appear in the same order. Causally-independent events don’t have to. E.g. A: lunch? B: yes C: no; here no one will see yes or no before lunch?, but yes and no might be seen in different orders.

For convergent causal consistency, eventually yes and no would be ordered (e.g. based on hash order).

There are also other forms of causal consistency, such as Real-Time Causal consistency, the strongest consistency model in an available system. If you want total availability you need to give up the read-your-writes property.

Causal models come from happens-before semantics by Lamport (potential causality).

Session properties for causal consistency: writes follow reads; monotonic reads and writes.

Mentions a formal treatment that decomposes causal consistency into: CausalVisibility, CausalArbitration and RVal (see below).

Session properties - Writes Follow Reads (Wikipedia): Within a process p, if r(x) then w(x), the write then takes place on the same or a more recent version of x. You can’t change the past.

Session properties - Monotonic writes: If a process does w1 then w2, then all processes observe w1 before w2.

Session properties - Monotonic reads: If a process p reads r1 then r2, r2 can’t reflect a state written before r1. You can’t read backwards.

Session properties - Read Your Writes: This one is interesting, because it is not totally available, only sticky available. If you write to a process, you should be able to read that value after. This is also why Causal Consistency isn’t totally available.

Sticky availability: If there’s a network partition, all nodes can progress, as long as client is connected to the same ‘server’.
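The client-server framing can be shown in a few lines (a toy sketch, not from the article): read-your-writes holds only while the client sticks to the replica that accepted its write.

```python
# Two replicas separated by a partition; neither has synced with the other.
replica_s = {}   # the replica the client writes to
replica_t = {}   # a replica on the other side of the partition

replica_s["x"] = 1                     # client write, accepted by S

assert replica_s.get("x") == 1         # sticky to S: read-your-writes holds
assert replica_t.get("x") is None      # rerouted to T: own write is invisible

# Hence RYW is only *sticky* available: it survives a partition only if the
# client can keep talking to the same replica that took its write.
```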

There’s also PRAM (pipelined random access memory), which is weaker. As well as a bunch of serializability models that deal with read committed/uncommitted, monotonic atomic view, snapshot isolation, cursor stability, and repeatable reads.

Evaluation:

Really useful to understand design space of consistency models, what guarantees they provide and why you might want one or the other.

Didn’t look into the Serializability/Transaction side of the tree too much (read committed/uncommitted, monotonic atomic view).

  • Claim: Conversations are generally causal, i.e. you either start a conversation or respond

  • Claim: Network availability is vital in permissionless p2p and for censorship resistance

  • If we value availability over total order, then causal consistency is the strongest consistency model we can provide

  • This implies the UI needs to accommodate the rendering of causally-independent events

  • Question: Are there use cases when a total ordering is a requirement, and we can sacrifice availability (or network partitioning)?
    – E.g.: authentication; canonical chat history; canonical name resolution

  • Question - why is causal consistency not totally available but only sticky available? I think I understand if your model is client-server, and you write and then try to read. But it’s not 100% intuitive, especially how it differs vs other session models.

  • Question: What consistency model does Bitcoin/Ethereum provide for? See below for skimmed resources.

Documenting session properties seems useful, and it impacts how the protocol is consumed.

Comment on Monotonic Writes: In the Bramble model, you don’t deliver writes until you have their message dependencies. This ensures this session property is provided for.

  • Question: But how does that work if you do have the w2 message but for some reason can’t find w1 due to that data not being available? That seems like the same as not having gotten w1 or w2 in the first place, so perhaps it reduces to what we mean by availability and eventual consistency in the first place.

  • I’m a bit worried people won’t fundamentally grok causal consistency, so how do we render this in a reasonable way? What would two parents look like in the UI?

  • Does convergent causal consistency make sense for us? How would it be rendered?
    Does a notion of finalization make sense? (à la Eth2)

  • Is there something similar to Bitcoin’s Omega factor w.r.t. the probability of a message persisting? Swarm and the data availability problem.

  • It turns out causal consistency is a family of models. What do other versions of causal consistency provide, and why might you want to use them or not?

(I wonder if using Jepsen would be interesting at some point GitHub - jepsen-io/jepsen: A framework for distributed systems verification, with fault injection)

Further reading:

Also see below for some skimmed resources.

Question: What consistency model does Bitcoin/Ethereum provide for? (skimmed)

  • Bitcoin Guarantees Strong, not Eventual, Consistency - Emin Gün Sirer, 2016 - provides a nice way of thinking about it. Brief summary: Bitcoin (and Ethereum) provides strong consistency through serializability. You need to look at Bitcoin as a read protocol and a write protocol, considering its outputs, not the internal state. It guarantees serializability with a probability that is exponentially decreasing with latency. You can’t say it is inconsistent just because you can have re-orgs; that’s like saying it works like MongoDB as a DB. Instead: for the read protocol you can discard Omega blocks, and you can analyze what this Omega should be to get the desired probability of anomaly. Using a low Omega (0 confirmations) just means you are not leveraging the consistency that Bitcoin provides. This parameterization property is indeed very nice, since it depends on use case (0-conf for coffee vs 6-conf for bigger transactions, etc). [It should be noted that there appears to be disagreement about how to classify this in academia; most important is being able to rigorously argue for what properties it has, incl. probability of failure etc]. See the sketch after this list.

  • Additionally: Eventual Consistency & Bitcoin

  • Probably more, found some formal one but lost it
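For intuition on “exponentially decreasing with latency”, here is the whitepaper’s attacker catch-up calculation as a sketch (my transcription of Nakamoto 2008, section 11; z plays the role of the Omega above):

```python
import math

def catch_up_probability(q, z):
    """Probability that an attacker with fraction q of the hashrate ever
    catches up from z blocks behind (Nakamoto 2008, section 11)."""
    p = 1.0 - q
    if q >= p:
        return 1.0
    lam = z * (q / p)
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1 - (q / p) ** (z - k))
    return total

for z in (0, 1, 2, 6):
    print(z, catch_up_probability(0.10, z))
# Drops roughly exponentially in z: waiting for more confirmations (a larger
# Omega) buys exponentially more consistency, at the cost of latency.
```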

Regarding Read-Your-Writes (RYW) (skimmed)

The way to make sense of it seems to be that it’s about client-view consistency, not data per se. E.g. client+server as node slightly different (since they fail together).

From Stickiness and Client-Server Session Guarantees:

Via a classic partitioning argument, we can see that RYW is not achievable under the stringent CAP availability model. We can partition a client C away from all but one server S and require C to perform a write. If our implementation is available, S should eventually acknowledge the write as successful. If C reads from S, it’ll achieve RYW. But, what if we partition C away from S and allow it to only communicate with server T? If we require C to perform a read, T will have to respond, and C will not read its prior write. This demonstrates that it’s not possible to guarantee RYW for arbitrary read/write operations in an available manner.

Also see:

Misc

Unrelated to reading:

  • Nadim Kobeissi, who wrote Cryptocat and recently formally verified Signal and TLS, teaches the following course: https://computersecurity.paris/. It includes things like: threat modeling, excerpts on secure messaging, crypto, network security, handshake protocols etc. Seems like an up-to-date and exciting curriculum with relevance to Secure Messaging in general. Lots of pointers too. Practical course.

Admin

People: Added minor.

Zotero:

Next

  • Understand the design space of causal consistency better
  • Play around with Zotero workflow some more
  • See previous TODOs and questions

March 20, 2019

Fever, but the research continues, albeit abbreviated. Let’s look briefly at causal consistency and fork-join consistency models.

Papers read

Consistency in Non-Transactional Distributed Storage Systems

Summary:

Consistency used to be a vaguer notion, e.g. what does “strong consistency” really mean? This survey paper looks at more than 50 different notions of consistency in distributed systems, especially with a focus on storage systems. It then decomposes them into specific formal properties that hold, and creates a graph that shows how different consistency models are related. It relates both to the database literature and the distributed systems literature. Uses graph theory and first-order logic to describe and classify consistency semantics.

Notes:

Mostly interested in getting a rough idea of the fork-based models and the causal consistency family. See picture posted yesterday.

Preliminaries:

Classify consistency semantics into ten families.

Describe an operation by a process on a shared object formally as a tuple of:

  • proc (process id),
  • type (operation type),
  • obj (id of object),
  • ival (input value),
  • oval (output value),
  • stime (invocation time),
  • rtime (return time).

A history is then a set of operations. They then define more formal notions using various logic and equivalence classes etc. Skipping details of a lot of this for now. With this you get relations on elements in a history such as:

  • rb (returns before) - natural order based on real-time; a.rtime < b.stime
  • ss (same session) - invoked by same process; a.proc = b.proc
  • so (session order) - partial order of rb INTERSECT ss
  • ob (same object) - a.obj = b.obj
  • concur - pair of realtime concurrent ops; ob \ rb

Abstract execution: a multigraph A = (H, vis, ar). vis and ar capture message delivery order and e.g. conflict resolution. Looking at relations between operations in a history and their justification, e.g. vis (visibility), which accounts for propagation of writes, and ar (arbitration), a total order that deals with conflict resolution.

A lot of this went over my head and would require more time to get into in detail. (Update: went back to clarify some things; it actually seems largely like straightforward set logic, with some additional unfamiliar notation for things like transitive closure, plus some unknowns.)

Update: visibility is an acyclic relation that accounts for write propagation. If a is visible to b (a vis-> b), then the effects of a are visible to b.

Also defines things like:

  • hb (happens before) - transitive closure (which one? a bunch) of so UNION vis
  • rval (return value consistency)
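To make the preliminaries concrete, a toy sketch computing some of these relations over a three-operation history (ival/oval omitted for brevity; this is just the set definitions above written out):

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Op:
    proc: str    # process id
    type: str    # operation type
    obj: str     # object id
    stime: int   # invocation time
    rtime: int   # return time

history = [
    Op("p1", "wr", "x", 0, 1),
    Op("p1", "rd", "x", 2, 3),
    Op("p2", "wr", "x", 2, 4),   # overlaps p1's read in real time
]

pairs = list(permutations(history, 2))
rb = {(a, b) for a, b in pairs if a.rtime < b.stime}   # returns-before
ss = {(a, b) for a, b in pairs if a.proc == b.proc}    # same session
so = rb & ss                                           # session order
ob = {(a, b) for a, b in pairs if a.obj == b.obj}      # same object
concur = ob - rb                                       # ob \ rb

assert (history[0], history[1]) in so
assert (history[1], history[2]) in concur
```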

Meat, chapter 3: non-transactional consistency semantics:

[Not clear why “non-transactional”; seems like it is because each operation is a single operation rather than a transaction?]

Causal models:

Happens-before semantics (Lamport 1978) was originally for message passing systems, but it seems to have turned out to be equivalent to a consistency model for shared memory.

  • Causality = CausalVisibility + CausalArbitration + RVal
  • Causal+ aka Convergent Causal = Causality and strong convergence (independently agree on conflict resolution)
  • Real-time causal = Causality and RealTime (causally concurrent writes that don’t overlap in real time must be applied according to real-time order)

Fork-based models:

Trust limitations for storage and computations led to dealing with byzantine faults. What consistency guarantees can you get in an untrusted environment?

Fork-linearizable models guarantee that if the visible histories of two processes differ for a single operation, they may never observe each other’s writes.

Other forms of fork models: fork-sequential, fork-join causal consistency. This seems eerily similar to things like CBC Casper. That’s largely part of the (separate?) consensus literature, which I don’t want to go into right now.

Related reading:

Other causal consistency models:
Could look into this, but it seems like too much of a rabbit hole; probably a better use of time to go in the p2p direction.

“Causal consistency as an optimal trade-off between user-perceived correctness and coordination overhead, especially in mobile or geo-replicated applications”:

  • Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen. 2011. Don’t settle for eventual: scalable causal consistency for wide-area storage with COPS. In ACM Symposium on Operating Systems Principles (SOSP). 401–416.
  • Peter Bailis, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 2013. Bolt-on causal consistency. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2013. 761–772.
  • Marek Zawirski, Nuno Preguiça, Sérgio Duarte, Annette Bieniusa, Valter Balegas, and Marc Shapiro. 2015. Write Fast, Read in the Past: Causal Consistency for Client-side Applications. ACM/IFIP/USENIX Middleware Conference, 2015. (Inria again! P2P overlap?)

Real time consistency:

  • Mahajan, Prince, Lorenzo Alvisi, and Mike Dahlin. “Consistency, availability, and convergence.” University of Texas at Austin Tech Report 11 (2011): 158.

Linked to other papers to understand subtle implications for eventual consistency:

  • Bermbach, David, and Stefan Tai. “Eventual consistency: How soon is eventual? An evaluation of Amazon S3’s consistency behavior.” Proceedings of the 6th Workshop on Middleware for Service Oriented Computing. ACM, 2011.

  • Bernstein, Philip A., and Sudipto Das. “Rethinking eventual consistency.” Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.

  • Bailis, Peter, and Ali Ghodsi. “Eventual consistency today: Limitations, extensions, and beyond.” Queue 11.3 (2013): 20.

  • Bailis, Peter, et al. “Quantifying eventual consistency with PBS.” The VLDB Journal 23.2 (2014): 279-302.
    Unclear if worthwhile, but especially p2p would be interesting.

  • Something on hashes in p2p and how it relates to byzantine cases?

  • Balegas, Valter, et al. “Geo-replication: Fast if possible, consistent if necessary.” IEEE Data Engineering Bulletin 39.1 (2016): 12. https://hal.inria.fr/hal-01350652/document - hybrid RedBlue consistency that attempts both strong and causal consistency.

  • Rodrigo Rodrigues (author) seems to look into p2p, routing, and consistency: https://scholar.google.com.tw/citations?user=mxdX7FYAAAAJ&hl=en&oi=sra

Misc stuff / inbox

  • data layer availability problem; L2 protocols using Eth2 as a data layer; zk-rollups - complete noob on these, so I need to wrap my head around them, and whether they’d be useful at all

  • IPFS (draft) https://arxiv.org/pdf/1407.3561.pdf vs Swarm as well

  • Look at Zotero file save with closed group

  • Formulate more precise research questions

Might take tomorrow off depending on how I feel.

March 21, 2019

Papers read

Re-read Survey of data replication in p2p systems

Most of the consistency models literature has been in the context of traditional distributed database systems. Now it’s time to explore more of the literature for p2p, with the goal of finding out how nodes join, link up, etc., in the context of data replication.

Summary

P2p scaling examples:

  • Computation (SETI@home / Genome@Home)
  • Communication (ICQ/Jabber)
  • P2p Multicast systems, security
  • Data (Gnutella, Kazaa etc)

P2P overlay networks types:

Unstructured: ad hoc, data locality unrelated to topology. Each peer knows its neighbors but not what resources they have.

Search method: The most basic and expensive is flooding. You can do perf optimizations though (see further reading):

  1. Multiple random walks
  2. Forward queries based on previous history
  3. Select a subset based on performance in recent queries
  4. Dynamically adapt network so that nodes are short distance from high capacity nodes
  5. Use routing indices to get closer “in the direction” of content

They then talk about things like query expressiveness (key lookup/sql-like query).

Examples: Gnutella, Kazaa, FreeHaven.

Structured networks: Emerged to solve the scalability problem. Done by controlling overlay topology and data placement (data or pointers to data). The DHT (Distributed Hash Table) is the main example.

Each peer has some neighbors with a routing table. Usually: lookup to find peer with desired data, then communicate with peer directly. Can find key in O(log N).

Downside: autonomy limited, since node is responsible for some piece of data. Also: usually for exact key searches, but possible to generalize somewhat.

Examples: Chord, CAN, Tapestry, Pastry, Freenet (semi-structured), etc.

Super-peer: Unlike unstructured and structured networks, where each peer provides equal functionality, super-peer networks aren’t pure. Super-peers are hybrid client-server; if you just use one, it is the same as client-server. These super-peers provide services such as: indexing, querying, access control, metadata, etc.

Advantages: efficiency and quality of service. Takes advantage of different resource consumption. Also access control [NB Bad boy!]. Restricts autonomy (can’t log in to any node) and has worse fault tolerance. Possibly dynamic replacement [mailserver round robin hurr-hurr].

Comparison: They compare on a few dimensions:

  • Autonomy (join and leave without restriction; control what store and who else can store)
  • Query expressiveness (e.g. not just key lookup but keyword ranking and SQL-like languages)
  • Efficiency (BW/power/compute/storage), higher throughput
  • Quality of service QoS (user-perceived efficiency; completeness of results, consistency, availability, response time)
  • Fault tolerance (provide efficiency and QoS despite peer failing - since peers are dynamic we need to rely on data replication)
  • Security (can’t rely on trusted servers in open p2p networks; dealing with access control)

They then compare the three approaches. Essentially super-peer has moderate autonomy and low fault-tolerance, but is high on everything else. Unstructured has low efficiency, QoS and security, but is high on other aspects. Structured has low autonomy, query expressiveness and security, and is high on everything else.

It then goes into details of various replication solutions in p2p systems. Won’t go into too much detail right now but:

Napster: Relies on central service to mediate, super-peer. Replication occurs passively. Super peers as single points of failure. Only stores static data.

Gnutella: To obtain a shared file, a node/requestor must perform three tasks: join the network, search for the desired file, and download it. Join via bootstrap nodes, then search with flooding. Passive replication. Active replication strategies have been proposed (see further reading, Q. Lv). Only deals with static files.

Chord: a routing system implementing a DHT. Uses consistent hashing.

CAN, Tapestry, Freenet etc. are covered too, but I didn’t look into them here.

Evaluation / Comments

  • Active replication seems similar to semi-trusted nodes

  • What about hybrid approaches between these p2p overlays to get the best of both worlds? Also, which aspects do we want to put more of a premium on?

  • [Mental model check: Assuming a message size of 1KB, we have 100 users who send an average of 100 messages a day. Probably all on the high end, perhaps by a factor of 2 or so, but that’d be 10 MB/day in messages. Then we’ve been live for say a year, so messages should take up something like 3GB. Now let’s ask and wait in #status-core. No answer yet]

  • QUESTION: Are super-peers desirable in some contexts? What are the main invariants we want to preserve? Something like: each round is fresh, i.e. it is arbitrary which nodes we use and we can select K of them. No privileged read. Compare with e.g. Matrix federation and our mailservers.

  • We do want to take advantage of differing resource consumption, but it’s about it not being a special relationship. Also see miners etc. There has to be a lot more research on this.

  • This also leads me to all the validator designs people have in Ethereum; a lot of them seem way overkill and like a bad alternative to basic operations. It needs to be integrated or very simple, probably.

  • [Question: Super-peer hybrid, but if everyone has OPTION to be that player, is this a problem? Vs centralized decision]

  • Unstructured optimization: Is this roughly what was initially intended for Whisper?

  • Also, routing indices seem similar to PSS

  • Why is Gnutella so well studied?

  • How does PSS deal with nodes storing whatever vs IPFS and DHT (autonomy but also censorship resistance)

  • Whisper as an (ephemeral) DHT: the key is a topic, with expiring values that are replicated everywhere

They seem more interested in collaborative semantically rich data like SQL.

Due to dynamic behavior (nodes join and leave whenever), p2p systems rely on massive data duplication.

Back-of-the-envelope calculation for message volume:

  • Assumption: 100 messages a day, 1KB message size, 1M people
  • => 10^2 * 10^6 * 1 KB = 100 GB of data per day network-wide; 100 KB per day per person, or ~10 MB per 3 months
  • => How many logs can you realistically keep on mobile?
  • Assumption: 10 chats with average 10 participants (or 30 chats with N=3)
  • => 10 MB per day or 1GB per 3 months
  • => Need to be a bit careful / do some pruning/deduplication
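The same estimate in a few lines of arithmetic (same assumptions as above):

```python
MSG_SIZE = 1_000        # bytes, ~1 KB per message
MSGS_PER_DAY = 100      # per person
PEOPLE = 1_000_000

# Network-wide volume.
network_per_day = PEOPLE * MSGS_PER_DAY * MSG_SIZE
print(network_per_day / 1e9, "GB/day network-wide")      # 100.0

# A single device following 10 chats of ~10 participants each.
device_per_day = 10 * 10 * MSGS_PER_DAY * MSG_SIZE
print(device_per_day / 1e6, "MB/day on device")          # 10.0
print(device_per_day * 90 / 1e9, "GB per 3 months")      # 0.9
```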

Further reading:

Question: How can we do efficient search and replication in p2p networks?

Structured

Briefly on dimensions

Collecting for evaluation. Challenge: right table (problem-fit, relevance) and minimal set of consistent dimensions and comparables.

Outside of replication attributes above, consistency model, we also have P2P overlay topology type, and probably ~join|find|pull method.

Other tables from Martins, et al. 2006:

  • Single-master vs multi-master
  • Full replication vs partial replication (distinguishing feature, load balancing, availability, storage space, communication costs)
  • Synchronous propagation vs asynchronous propagation (distinguishing feature, synonymous, consistency criterion, local reads, distributed concurrency control, scalability, environment)
  • Comparing optimistic replication solutions (System, Obj, Operation, Relation, Propagation, Conf, Reconciliation, Consistency)
  • Comparison of P2P networks (autonomy, query expressiveness, efficiency, QoS, Fault-tolerance, Security)

And final table in Martins:

Replication Systems in P2P Systems:

  • Source: Martins, et al. 2006, Table 8
  • P2P System (Napster, Gnutella, Chord, Freenet, etc…)
  • P2P Network (Super-peer/Unstructured/Structured)
  • Data Type (File/Any/Tuple)
  • Autonomy (Low/Moderate/High)
  • Replication Type (-/Static-data/Single-master/Multi-master)
  • Conflict Detection (-/Concurrency/None/Semantic)
  • Consistency (-/Probabilistic/No guarantees/Eventual)
  • Network Assumptions (strong/weak)

Zotero

  • Made the group public-closed to get file sharing working; also got auto import working
  • Imported all mentioned papers in this thread, a few others and tagged a bunch
  • Here’s what it looks like now:

Link to library: Zotero | Your personal research assistant


March 25, 2019

Plan of action: go through specific applications and describe them along relevant dimensions. Timeboxed. At the end, do brief comparison. E.g. something like: Briar, Matrix, Scuttlebutt, Tox, Swarm.

Desired artifacts:

  • Rough descriptions of similar solutions
  • Rough comparison of above
  • Better formulated research questions

Briar - BSP

Briar synchronizes data in a group context among a subset of devices called peers.

  • Since you only connect to devices you did key exchange with, it is a friend-to-friend network

  • Each message is immutable and (depending on the client) includes message dependencies. This means we build up a DAG of history, and it thus provides causal consistency

  • It’s not an unstructured network nor a super-peer network. Since you have direct connections with each other you might classify it as a direct p2p network, or possibly a structured p2p network, since A talking to B means A knows B has its own log of data. It is not anything like a DHT though, so perhaps this notion makes less sense for f2f networks.

  • BSP requires a transport security protocol to communicate between devices.

  • Peer discovery: to sync up initially requires meeting up in person. To know how to reach a device Briar keeps track of transport properties. For short-distance it uses BT/LAN discovery beacons, and for long distance it uses Tor hidden services.

  • Look into Tor hidden services interface more, both in general and in Briar. E.g. see onion addresses.

  • Assumptions: A Briar node runs on a mobile device, but it runs in the background. This means currently mainly Android is supported, and there are some battery considerations. Though it is less than in an unstructured network.

  • Briar and problems with battery consumption. Additionally, there are some resources on Tor-specific battery work that I can’t find right now.

  • Briar and problems with running on iOS due to background restrictions

  • In Briar, asynchronous, optimistic replication is practiced. This means you append to your own log, and then try to sync up with other nodes in some group context. Since each message is immutable and hashed, conflicts are likely to be rare.

  • Conflict resolution: Unlikely to be a problem due to hashes, but in case it is, it is presumed validation will take care of this in a straightforward manner.

  • Briar has weak network assumptions, and it can sync over pretty much any medium, including USB sticks, BT, and Tor. It only requires the ability to securely transport something from device A to B.

  • Replication data: For individual objects, it is single-master and ‘static data’, since only the device writing a message can write it, and it doesn’t change after that. No other node can change a message. This is the minimum object being replicated, and it is essentially a simple file. However, if we look at the whole thing being replicated in a group context, it is really a DAG. This DAG can be updated by all the members, and since it is a DAG it is a form of CRDT. This means it might be more apt to classify it as a dynamic-data, multi-master setup.

  • Exactly how resolution works in case of group membership conflicts needs to be studied more carefully. The semantics are fairly coarse, but well specified through a finite state machine. E.g. see the private group sharing client (Private Group Sharing Client · Wiki · briar / briar · GitLab).

  • Since all devices participating in a group sync context are interested in messages of that group, it is a form of passive replication. No other helper nodes are active by default in Briar, e.g. there’s no active replication involved. If there were, these nodes would have to choose between seeing the encrypted messages or see the shape of the DAG, since the DAG is encoded inside a message through message dependencies. This is different from, say, a linear log where the contents could be encrypted but the sequence number would be transparent.

  • There are some plans to have a “relayer” node that actively replicates for you, to get better latency guarantees for offline inboxing.

  • Full vs partial replication: within a group, the set of peers for a single node is a subset of the devices within that group. Messages are thus partially replicated by default, but it is likely to settle on full replication, as an individual node is interested in all the messages of a group context. But the contract is a partial copy of the graph, so it is a form of partial replication.

  • Pull/push calls: there are different synchronization modes, i.e. interactive and batch mode. These are coordination-less and up to the client. I.e. you can either request specific messages or offer messages to someone else. This can either be in one go (send all messages) or more efficient by offering message ids, then waiting for a specific request. There doesn’t appear to be any specific “request all most recent messages” API call.

  • To ensure causal consistency, a message is only delivered once all its message dependencies have been delivered. This ensures Monotonic Writes (MW). See the sketch after this list.

  • Additionally: it is possible to delete/unshare messages (while retaining some information to fill in the partial graph).

  • TBD: How exactly it ensures WFR, MR, RYW, MW (above) session properties.
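A toy sketch of that dependency-gated delivery (hypothetical structure, not Briar’s actual code): a message is held back until all of its declared dependencies have been delivered, and each delivery may unblock others.

```python
class SyncClient:
    def __init__(self):
        self.delivered = set()   # ids of delivered messages
        self.pending = {}        # msg id -> (deps, body) awaiting dependencies

    def receive(self, msg_id, deps, body):
        self.pending[msg_id] = (deps, body)
        self._drain()

    def _drain(self):
        # Deliver every pending message whose dependencies are all delivered;
        # repeat until no progress, since one delivery can unblock others.
        progress = True
        while progress:
            progress = False
            for msg_id, (deps, body) in list(self.pending.items()):
                if all(d in self.delivered for d in deps):
                    del self.pending[msg_id]
                    self.delivered.add(msg_id)
                    print("delivered", msg_id, body)
                    progress = True

c = SyncClient()
c.receive("m2", deps=["m1"], body="re: hi")   # held back: m1 not yet seen
c.receive("m1", deps=[], body="hi")           # delivers m1, then unblocks m2
```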

Matrix

Matrix can be looked at in a few different ways. They provide a suite of protocols; by default an end user uses the client-server API, and each server uses the server-to-server API for data synchronization.

  • This is a form of federated or super-peer architecture, since a homeserver has special capabilities. This means individual endpoints don’t need high availability.

  • It is also possible to do pure P2P by hosting your own homeserver. We won’t look into this here, as it isn’t currently widely practiced and has its own challenges.

  • Federation: Homeservers can send multiple types of messages, not all of which need to be synced. This is a form of framing. They distinguish between Persisted Data Units (PDUs), Ephemeral Data Units (EDUs) and queries. PDUs record the history of messages and the state of a room. EDUs don’t necessarily need to be replied to. Queries are simple req/resp to get a snapshot of state.

  • Like Briar, Matrix uses DAGs in PDUs. Their equivalent of a group context is a room.

  • Additionally, Matrix separates prev_event from auth_events. Auth events are special events that contain previous room state etc. In Briar, this separation is a bit more general and the semantics are encoded in a state machine for each specific data sync client.

  • PDU validation: checks if events are valid, signatures, hash and various auth rules match.

  • Soft failure / ban evasion prevention: this exists to prevent a user from evading bans by attaching events to an older part of the DAG. These events may be valid, but a federating homeserver checks whether such an event passes the current state auth checks. If it doesn’t, the homeserver doesn’t propagate it. A similar construct does not appear in Briar, since there’s no such federation construct and no global notion of the current state.

  • EDUs are used for things such as presence notifications.

  • There’s a room state that contains e.g. the room name, and they have techniques for room state resolution, i.e. a form of conflict resolution. This may look different depending on which room version is used (cf. the client id for a data sync client in Briar).

  • Once a homeserver joins a room it gets new events by other homeservers in that room. Users might request previous history before homeserver was part of this room. Homeservers can get previous history from each other with the /backfill API or /get_missing_events.

  • The above implies homeservers practice partial replication, i.e. a node doesn’t need to have all state to be functioning.

  • Since Matrix is a form of federated/super-peer architecture, we can also look at it as a client server architecture. A client authenticates with a server in some fashion.

  • Client discovery of homeservers happens out-of-band, for example by a known URL.

  • TBD: How does failure work if a user’s homeserver fails? Can you seamlessly be connected to several homeservers?

  • TBD: Homeserver discovery? It appears that you need to know a hostname for homeservers to connect. It is not clear to me if this can happen in-band, e.g. through some form of propagation.

  • Conversation history: the eventually (causally) consistent event graph is ‘linearized’ into an event stream for the end user.

  • There are two types of room events: state events and message events. State events describe metadata and overwrite each other. Message events are transient one-off events. Applications can also add their own type of events.

  • To me it seems like this mostly puts Briar’s data sync client into the data sync layer itself. E.g. message/state events are different event types with different semantics. The main difference is the soft ban / intermediation that servers can do to check current auth state. It isn’t clear what this would look like if each user runs their own homeserver, since this enforcement is up to each node. This makes moderation and group membership an interesting topic, and the state can possibly look different for different people. But assuming there’s a well-encoded rule set and clients are well-behaved, this should be fine.

  • To sync, a client simply calls GET /sync. This returns the most recent message events and state for each room; it is thus a pull API. It also gives you pointers for fetching only the most recent state, as well as for going further back in the event stream, i.e. the prev_batch and next_batch fields.

  • To get new events, Matrix uses HTTP long polling. There’s thus a persistent connection with a homeserver, which is assumed to have high uptime.
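
For illustration, a rough sketch of that pull loop against the client-server API (the homeserver URL and access token are placeholders):

```python
# Long-poll GET /sync and advance the next_batch cursor; the client, not the
# server, decides how far it has read. Illustrative sketch, not a full client.
import requests

HOMESERVER = "https://matrix.example.org"  # placeholder
ACCESS_TOKEN = "..."                       # placeholder

since = None
while True:
    params = {"timeout": 30000}            # long poll for up to 30s
    if since:
        params["since"] = since            # only return events after this token
    r = requests.get(
        f"{HOMESERVER}/_matrix/client/r0/sync",
        params=params,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )
    body = r.json()
    for room_id, room in body.get("rooms", {}).get("join", {}).items():
        for event in room["timeline"]["events"]:
            print(room_id, event["type"])
    since = body["next_batch"]             # pull API: client advances its cursor
```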

  • TBD: How is the partial graph compacted into an event stream?

  • For a client to send events, it sends a PUT request to its homeserver. This means an event is acknowledged once the homeserver receives it, and the homeserver assumes responsibility for delivering it. This is true both for state events and message events.

  • TBD: If a homeserver goes down, is it possible to do message resends to a different homeserver?

  • TBD: Why exactly do they make such a big distinction between these event types? Is it simply because of the soft failure / ban evasion case?

  • They also support redaction of events, which strips off a lot of common keys.

  • For room creation in client-server API they use POST requests.

  • TBD: In case of redactions/deletes, how does the hashing not change? Doesn’t it break the immutability invariant? Same question for Briar.

  • Homeservers can query each other’s public rooms through a public room directory, which is a form of discovery mechanism.

  • In the client-server API, end-to-end encryption is optional. Key exchange happens out of band, and a device’s public key is uploaded to the homeserver.

  • Multi-device: A user has a user ID which can be queried through the homeserver, and this returns a list of all device identity keys.

  • For E2EE, from what I can tell it is only the content of the message that is encrypted, not event types or who it is sent to. This means the DAG is transparent to the homeserver, even though the homeserver isn’t an end-user endpoint.

  • TBD / out of scope: They use the Olm crypto ratchet for E2EE. It isn’t clear to me how it differs from Double Ratchet.

  • TBD / out of scope: For group chat they use Megolm variant, which is more scalable for group chat. It isn’t clear to me exactly how it is more scalable in terms of e.g. algorithmic complexity.

  • It is multi-master, since many people can write to a room. Depending on whether it is a message event or a room state change, different conflict resolution might be required.

  • From a client’s point of view, interaction with a homeserver is synchronous (the PUT request acts as a confirmation). Between homeservers it is optimistic asynchronous.

  • Like Briar, it provides causal consistency through a DAG.

  • TBD: Exact semantics of the conflict resolution algorithm for room state.

  • TBD: Treating each user as running their own homeserver, how does it change things? Probably requires looking into more detailed GitHub issues / planning docs / code, as right now the details are a bit hazy.

  • Note: Clients can send messages to a room without knowing who is participating in it, this is a form of pubsub.

  • Partial replication, since a homeserver doesn’t need all events to start being used.

  • Server discovery: Ad hoc, e.g. see https://matrix.to/#/#matrix-spec:matrix.org

  • TBD: look into relevant existing WIP proposals here: Spec Change Proposals | Matrix Specification

  • Example: Websockets as alternative transport: https://github.com/matrix-org/matrix-doc/blob/master/drafts/websockets.rst

  • Transactions: homeservers sync PDUs/EDUs in transactions of some limited size (~50/100). These are synced in an HTTP PUT, which gives ACK semantics (200 OK). Additionally, errors can be returned with the 200, such as not being allowed to send messages to a room, wrong version, etc.

  • Room Version 3 is the latest spec for rooms. In Briar this would be a specific data sync client. It specifies things like room state resolution (actually in the room v2 spec). It uses things like reverse topological power ordering.

  • List of super peers aka home servers: Unofficial selection of public Matrix servers

Secure Scuttlebutt

SSB.

Decentralized secure gossip platform.

  • Discovery: Locally it broadcasts UDP beacons for discovery.

  • They have a concept of pubs: an SSB peer that is publicly available and that you can be invited to. This acts as a form of super-peer (?).

  • Each user has a feed, which is a list of all messages posted by an identity. This is an append-only log. Each message has a pointer to the previous one. This points to them using a form of causal consistency.

  • Each message has a hash link to the previous message, an author (the public key of the feed it should appear in), and a sequence number for its position in the feed. Also a timestamp, hash and content. Fields are in a specific order.

  • A message ID is the hash of the message including its signature.
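
A toy sketch of this feed structure (real SSB signs with ed25519 over a canonical JSON encoding with fields in a fixed order; the signature here is stubbed):

```python
# Append-only log: each message hash-links to the previous one, carries the
# author's public key and a sequence number, and is hashed with its signature.
import hashlib, json, time

def msg_id(msg):
    # Message ID is the hash of the message including its signature.
    return hashlib.sha256(json.dumps(msg, sort_keys=True).encode()).hexdigest()

def append(feed, author, content, sign):
    prev = feed[-1] if feed else None
    msg = {
        "previous": msg_id(prev) if prev else None,  # hash link to previous
        "author": author,                            # feed owner's public key
        "sequence": len(feed) + 1,                   # position in the feed
        "timestamp": int(time.time() * 1000),
        "content": content,
    }
    msg["signature"] = sign(msg)                     # stub for ed25519 signing
    feed.append(msg)

feed = []
append(feed, "@alice.ed25519", {"type": "post", "text": "hello"}, lambda m: "sig")
append(feed, "@alice.ed25519", {"type": "post", "text": "world"}, lambda m: "sig")
assert feed[1]["previous"] == msg_id(feed[0])        # single-writer hash chain
```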

  • A SSB client maintains a set of feeds they are interested in.

  • When peers connect, they ask each other if the feeds they care about have any new messages.

  • There’s a concept of blobs for attachments etc. They can be linked from messages. Peers signal with ‘want’ and ‘have’ (request and offer in Briar).

  • A feed can follow another feed, meaning they are interested in messages from that feed. To follow, you post a special message on your own feed. This message is signed by the author, and its content includes the identity of the contact to follow.

  • Each feed announces which feed they are following publicly. This means clients can arrange feeds into a graph of who follows who.

  • By doing this, we have several layers of a social graph: a user’s own feed; the feeds it explicitly follows (visible in the UI); feeds 2 hops away, which the client fetches and stores (but doesn’t display); and feeds 3 hops away, which clients see mentioned but don’t store. Layer 0: write; layer 1: read; layer 2: store/fetch; layer 3: aware.

  • Clients can choose how they want to display these messages.

  • A pub is a publicly accessible SSB node; it serves a social and a technical purpose: socially as a gathering point for new users, and technically it has a stable IP and allows incoming TCP connections. Joining a pub means you follow it and it follows you back.

  • A lot of emphasis is on content discovery, which happens automatically when you join a pub since you are two hops away from other users of that pub. After following a pub a user has discovered new feeds and doesn’t need to keep following the pub for feed visibility, though the pub helps with replication and accepting incoming TCP.

  • Any user can follow a pub. But to get a pub to follow you back, you need an invite code. Invite codes can work in different ways (e.g. single use, etc).

  • Similar to Briar, SSB is offline by default, and deals with short-distance discovery well through LAN beacons. Similar to Bittorrent local peer discovery (BEP 14) or UDP-based SSDP.

  • It is a form of private p2p, but group based, as you can see more than one hop.

  • TBD: What happens if a pub doesn’t follow you back?

  • TBD: Can this be a setting? I.e. access control up to each node how visible they want to be.

  • Private messages are encrypted and posted to one’s own feed like normal messages, but end-to-end encrypted.

  • Note: I like their protocol guide, which links to specific parts of various implementations.

  • TBD: Privacy preservation seems unclear. Threat model? Options?

  • TBD: How do replication and gossiping work in more detail?

  • Peer connections: once an SSB peer has discovered the IP/port/pubkey of a peer, it can connect via TCP, ask for updates and exchange messages.

  • Handshake- and connection-based, to create an encrypted channel. JSON-RPC protocol, e.g. createHistoryStream, which asks a peer for the list of messages for a specific feed.

  • Scuttlebot behaves just like a Kappa Architecture DB. In the background, it syncs with known peers. Peers do not have to be trusted, and can share logs and files on behalf of other peers, as each log is an unforgeable append-only message feed. This means Scuttlebots comprise a global gossip-protocol mesh without any host dependencies.

  • A key difference with other p2p architectures is that it doesn’t use singletons such as a DHT. Instead it operates on a human user network, which is a form of group-based private p2p. Also see SSB on sybil attacks: Design Challenge: Sybil Attacks - Articles - SSBC.

  • A pub is just a normal helper client, and anyone can run it. Thus it doesn’t act so much as a super peer, but it helps with active replication for other peers, if you have a static IP.

  • All data is eventually consistent, and causally consistent within one log (total order); globally there’s a partial order, similar to Briar and Matrix.

  • TBD. “There’s a proposal to use signed pings to measure the ‘freshness’ of a feed, but this could only be used in small groups of interested peers.” Not clear where this proposal lives.

  • TBD. Documentation for the SSB replication API (Documentation - SSBC) is a dead link; Replicate API - Scuttlebot doesn’t contain much. Can ask in their chat. Also see: Easily replicating between two scuttlebutts? · Issue #148 · ssbc/ssb-db · GitHub

  • Privacy-preservation: End-to-end Encryption - Scuttlebot - SSBC. “Private-box” proposal (?). For hidden recipients.

  • It appears to be partially replicated since you can choose how much of the partial graph you want to sync.

  • Unlike Briar and Matrix, there’s no strict notion of a room with specific semantics. Instead everything is your log and other logs you sync, with fluid discovery/gossiping of interests.

  • It is optimistic asynchronous, since you write to your own log locally and sync later.

  • It is single-master since you only write to your own growing log, and sync other people’s logs separately.

  • It is either a form of direct or structured network since data locality is in a specific peer, or possibly two hops away.

  • TBD: Exactly how is the policy for sharing decided? I.e. why do my messages end up at a pub, and how do they get to another pub and then to recipient B? Gossip-based, but more details on this type of active replication would be useful.

  • The log/feed structure is not encrypted, instead only content is. The main idea here is that it’s a private network and you trust your friends (and your friend’s friends?).


Next

  • Continue with other real-world applications, such as Tox, Swarm, IPFS, (Whisper, Status), (Bittorrent, Git, Tribler).

  • Create table of comparison of these.
    Collecting descriptions per stack first instead, since (a) there’s usually very little agreement on what terms to use, and (b) these tools are generally too new to have been studied in any academic surveys.

Also look into more:

  • Local peer discovery: how to generalize it and make it work for us, and what’s needed (RFP). This relates to the transport properties component, and would be amazing for conferences.

  • Bittorrent hidden services.

  • Spend more dedicated time looking into protocol upgradability

  • /specs repo outline and rough TODOs to get this work started

2 Likes

Hi Oskar!
This looks super interesting! I forwarded this to Twitter.com/etzrodtmartin, who is pursuing collaborative research efforts at Akasha.

29 March, 2019

Combines work from the 27th and 28th too.

Plan of action: resume work and look into Tox, Swarm, Tribler etc. Sketch comparison.

Should timebox this and focus on data sync more specifically.

Tox / Tok tok (spec)

Tox is a p2p instant messenger and protocol with a focus on security and privacy. They have many clients and they support audio and video capabilities.

  • There is a Tok Tok project as well that seems to be essentially the same (?). Current focus is on things like a formal spec and executable model.

  • Goals: authentication, e2ee, forward secrecy, privacy, resilience

  • Privacy: attempts to avoid leaking public keys to third parties, and to prevent determining the IP address of a given user

  • Resilience:
    – no central servers; bootstrap nodes are permissionless and have no specific privileges
    – connecting to peers behind NAT/FW: UDP hole-punching, UPnP, NAT-PMP, untrusted nodes as relays, DNS tunnels
    – resistance to basic DoS attacks via short timeouts ~

  • Non goal: Anonymity. Instead be compatible with solutions that provide it

  • By default they provide direct connections between peers, since this is needed for real-time multimedia comms

  • Crypto numbers used, based on NaCl/libsodium: public key / secret key / combined key / nonce

(- Crypto box differentiates between plaintext and ciphertext)

  • The transport protocol sits below the Tox protocol. Supports UDP and TCP; in the binary representation a single bit, 0/1.

  • Socket address: Host address+port number + transport protocol

  • Node info: transport protocol, socket address and a public key

  • Protocol packets have different message types: ping req/resp, nodes, crypto handshake, DHT req, onion req, announce, bootstrap info, etc.

  • DHT-based, so it is a structured, public p2p network: a self-organizing swarm of all Tox nodes in the network.

  • DHT keypair acts as node address - renewed every time Tox closes/restarts (!)

  • The DHT public key of a friend is found through the onion model

  • XOR distance metric; the client list holds at most k nodes closest to a base key. When the list is full, viable nodes can replace entries if their distance is lower

  • I.e. it doesn’t use Kademlia’s least-recently-seen eviction policy
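
A sketch of the XOR metric and this “closer node replaces the farthest” policy (keys are small integers here rather than NaCl public keys):

```python
# Keep at most k nodes closest to base_key; a candidate only gets in if there
# is room or it is closer than the current farthest entry. Illustrative only.
def xor_distance(a: int, b: int) -> int:
    return a ^ b

def maybe_add(client_list, base_key, candidate, k=8):
    if candidate in client_list:
        return
    if len(client_list) < k:
        client_list.append(candidate)
        return
    farthest = max(client_list, key=lambda n: xor_distance(n, base_key))
    if xor_distance(candidate, base_key) < xor_distance(farthest, base_key):
        client_list.remove(farthest)
        client_list.append(candidate)

nodes = []
for key in [0b1010, 0b0011, 0b1111, 0b0001]:
    maybe_add(nodes, base_key=0b0000, candidate=key, k=2)
print(sorted(nodes))  # the two keys closest (by XOR) to 0: [1, 3]
```

Note that this is also what makes the “generate a closer key to force inclusion” concern below cheap to attempt.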

  • DHT node state: keypair; close list (node info of neighbor nodes); search list (public keys looking for ~)

  • DHT search list [don’t quite follow, but more interested in data replication right now - skipping a lot from here]

  • Check for responsiveness every 60s with Node Requests

  • Sections on various forms of NAT traversal, hole punching; LAN discovery

  • contact is public key+nospam bytes+checksum [how does nospam work? => friend_request doc]

  • Messenger on top, different forms of packets: online/offline/nickname, message, file_sendreq etc

  • TCP server and client (so you get ACKs that way); if packets are dropped the TCP client should save them and prioritize re-sending them; out-of-band (OOB) TCP relay

  • Connect to TCP relay to send onion packets

  • …if the number of TCP connections is limited, how does routing work? The TCP server acts as a relay for non-direct connections; it can be run on any Tox node but more commonly on an actual server

  • TCP server handshake to get secure comm channel, client has to send some encrypted data to confirm the connection to avoid handshake+timeout attack

  • When TCP client connects it tells TCP server who it wants to connect to; TCP server only allows if both have indicated they allow; does not give any feedback by default if packets arrived or not

  • …I find it difficult to read the Tok Tok spec. Partly because it has a different architecture, and partly because it isn’t very well structured: it mixes interface and implementation details a bunch. It is also 100 pages long… let’s ctrl-f some keywords

  • ctrl-f: no mention of replication, retransmission, log, consistency [hmm]. Some of “sync” and “offline”.

  • Sync: state and group syncing, if receiving higher version number then search DHT ~.

  • Friend connection takes care of establishing the connection to the friend and gives the upper messenger layer a simple interface to receive and send messages, add and remove friends and know if a friend is connected (online) or not connected (offline).

  • Does this mean only online messaging works?

  • “Messenger” (with all the various message types) is on top of friend_connection, which sits above DHT/onion/net_crypto.

  • Create connection over relayer, use onion to search.

  • Can establish direct TCP or UDP connection but it isn’t clear how req/res works, or message history, etc.

  • There’s a separate Tox multi-device proposal doc. In it, data sync seems tentative. Oh well, let’s move on for now.

Thoughts

  • Interesting to note how Tox and Briar both punt on anonymity to Tor. This seems like it’d be useful to allow other projects to integrate this mixnet, e.g. allowing Tox and Briar to hook into it.

  • The DHT keypair is renewed on each restart

  • Does the DHT client list eviction policy mean you can force inclusion by generating a closer public key? Seems cheap; what does that enable? Indeed, someone mentions this concern down the line:

TODO: this is different from kademlia’s least-recently-seen eviction policy; why the existing solution was chosen, how does it affect security, performance and resistance to poisoning? original paper claims that preference of old live nodes results in better persistence and resistance to basic DDoS attacks;

  • If the DHT keypair is rotating, how does identity work?

  • I find it difficult to read and it seems rather ad hoc; perhaps I just need to spend more time with it to decipher it

Swarm [readthedocs](https://buildmedia.readthedocs.org/media/pdf/swarm-guide/latest/swarm-guide.pdf) and architecture

  • Distributed storage platform and content distribution network. Base layer for web3 stack.

  • Provides infrastructure for: messaging, data streaming, peer-to-peer accounting, mutable resource updates, storage insurance, proof of custody, scan and repair, payment channels and database services

  • Swarm is a collection of nodes on devp2p network

  • Swarm allows for upload-and-disappear; it is a persistent data structure

  • “Uploading” and “downloading”: uploading means adding to the local node and syncing; downloading means querying peers for relevant data, then reassembling it locally

  • Thus: optimistic asynchronous replication

  • swarm up foo.md => hash of a swarm manifest; a manifest is a JSON file; the manifest ensures you download the file with the right MIME type, handles multiple entries, etc.

  • Can also upload to a swarm gateway without running a node

  • Can modify swarm manifest [how does manifest hash change?]

  • Dealing with mutability of resources through ENS (blockchain) and Feeds in Swarm.

  • With ENS can get something like bzz://foobar.eth/quux.txt resolution

  • Swarm hashes are content addressed so changes to data changes hash; feeds provide single, persistent identifier for sequential data

  • Why use feeds instead of ENS? Updates cost gas; can’t change data fast enough; resolution requires a synced blockchain

  • feed = user + topic; can give illusion of subtopics by e.g. taking hash of photo and adding “-comment” string.

  • What is the consistency model of Feeds? It updates, but I assume there can be stale reads. How does it work in case of concurrent updates?

  • Access control for content: password, by public key for one participant, or with ACT (multiple grantees).

  • ACT strategy - where can I find more info on this?

  • PSS: Destination can be specific node or neighborhood; specify destination in overlay address space independent of payload

  • Relay using devp2p and forwarding kademlia, semi-permanent TCP connections between relaying nodes; within neighborhood it uses message gossip

  • PSS is best-effort and does not guarantee ordering or delivery

  • Due to forwarding kademlia, pss offers sender anonymity (same packet though? ~)

  • Optional DH exchange is supported

Architecture docs:

  • Chunks (basic unit of storage, <4K), references (cryptographic hash, and possibly decryption key), and manifests (data structure for collections, can be basic kv-store)

  • A Swarm hash aka bzzhash is a Merkle tree, a ~drop-in replacement for e.g. SHA3. It can verify the integrity of partial content without having to transmit all of it, thus allowing random access.
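
A very rough sketch of such a chunk tree, only to show the recursive structure (the real bzzhash carries length metadata and uses different parameters):

```python
# Hash fixed-size leaf chunks, then repeatedly hash concatenated child hashes
# until a single root remains. Partial content can be verified per branch.
import hashlib

CHUNK = 4096     # max chunk size in bytes
BRANCHES = 128   # child hashes per intermediate chunk (128 * 32B = 4096B)

def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()

def chunk_tree_root(data: bytes) -> bytes:
    level = [h(data[i:i + CHUNK]) for i in range(0, max(len(data), 1), CHUNK)]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + BRANCHES]))
                 for i in range(0, len(level), BRANCHES)]
    return level[0]

root = chunk_tree_root(b"x" * 10_000)  # 3 leaf chunks -> 1 intermediate -> root
print(root.hex())
```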

  • Since ENS is a smart contract its consistency models follows that of the underlying chain; Feeds is off-chain, so what consistency model does it have?

  • Swarm node address = sha3(eth acc pubkey)

  • Swarm implements a “distributed preimage archive” where nodes closest to address of chunk stores the data

  • Reachability from node (uploader/requester) to storer node is guaranteed by structured network topology that Kademlia provides, with guarantees of max number of forwarding hops

  • Auto scaling: nodes cache content that is passed on at retrieval, this leads to maximum resource utilisation and a garbage collection process removes least-accessed.

  • Storage insurance for unpopular content will be provided in future versions to ensure important content isn’t deleted

  • It thus offers partial replication, and caching+GC

  • Proximity metric: XOR of two addresses, taken as a binary number; take the log and the difference from the max distance to get an integer where higher represents closer
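
A sketch of that computation, assuming the proximity order is simply the number of leading bits two addresses share:

```python
# XOR the addresses and count shared leading bits; higher means closer.
def proximity(a: bytes, b: bytes) -> int:
    max_po = len(a) * 8
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    if x == 0:
        return max_po               # identical addresses
    return max_po - x.bit_length()  # leading zero bits of the XOR

print(proximity(b"\x80", b"\x00"))  # 0: the first bit already differs
print(proximity(b"\x81", b"\x80"))  # 7: only the last bit differs
```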

  • Uses devp2p as the transport layer, which allows for semi-stable peer connections over TCP, with authenticated/encrypted/synced data streams

  • Defines Kademlia connectivity more formally: ~connected to nodes at all the different proximity orders (intuitively: you can call someone in your family, neighborhood, city, country, continent)

  • If each point in a connected subgraph has Kademlia connectivity then there’s a Kademlia topology, and thus a path between any two nodes exists, with a max number of hops, needing only local information

  • Bootstrap nodes, then each node tries to reach saturation to fill bins with different proximities; once saturated it can increase depth (more peers at each proximity level?)

  • DHT as k-v store: for content-addressed storage this usually means nodes are seeders and can serve that content (IPFS?); not just information/fingerprints but the content itself. This is what Swarm calls the “Distributed Preimage Archive” (DPA). [This is useful for coercion resistance]

  • The DPA splits data into chunks, which are synced to the nodes whose bins they fall into

  • Once a request reaches the storer node it relays back to the requestor; this assumes forwarding nodes remember requestors [what if relay nodes go offline?]

  • Success of retrieval depends on routes and the availability of nodes syncing chunks nearby

  • Redundantly retrievable to degree n if it can deal with n-1 failures; set of nearest neighbors hold replicas, this is called area of responsibility

  • Later on in POC4 things like erasure codes for improved reliability might be implemented

  • Briefly on caching and purging storage, synchronization by caching (along the retrieval path)

  • This means it is a form of active replication

  • When two nodes connect they offer locally available chunks ~ per bin [some subtlety wrt proximity bins and heuristics]

  • A downstream node has completed history syncing if it has acknowledged chunks from the upstream peer from the beginning up until the time the session started; it has completed session syncing if it has synced everything since the session started

  • I.e. essentially heuristic for why a downstream node should have certain chunks based on their proximity and bins etc (though need to look into details more)

  • To reduce bandwidth and redundant syncs from multiple sources, the source sends an offeredHashes / wantedHashes handshake (similar to Briar)

  • Data units: message, chunk, file(mime type, integrity), collection

  • Storage consists of two layers: the local store (memory and persistent) and the netstore (DPA)

Thoughts

  • Upload and disappear very powerful, as is coercion resistance (vs IPFS pinning IIRC)

  • …I don’t follow how manifest updates change the hash. I assume the manifest is hashed, but didn’t see this stated explicitly

  • Feed update consistency: where does an update live, and who can update?

  • What does upload mean exactly; when is the transaction finished? Sync locally, then with the most immediate peers, but need more details

  • How does replication factor work? [briefly mentioned above]

  • Who can update a feed?

  • Look into PSS requirements more

  • …Forwarding kademlia and anonymity, not clear about how rigorous this is, no mix format, timing attacks etc

  • More details on ACT?

  • Best interface for using DPA as extension of data sync?


  • Later: IPFS, Tribler

Misc resources

So to call most E2E chat systems TOFU is far too generous. It’s more like TADA — Trust After Device Additions.

Other updates

  • First Web3 Messaging call
  • Welcoming Dean, Big Brother spec discussion
  • Talk with Adam about documenting protocol
  • Draft of comparison (see below)
  • Played around with swarm a bit



DRAFT: Different approaches to data sync source

WARNING: This is an early draft, and likely contains errors.

This document compares various forms of data sync protocols and applications along various dimensions, with the goal of making it easier for you to choose which one is right for you.

Let’s start with some definitions.

Table of contents

  • TODO: Introduction
  • Definitions
    • Further work
  • Methodology
    • Compared dimensions
      • Further work
    • Compared technologies
      • Further work
  • TODO: Consider adding description here
  • Comparison
    • Further work
  • Summary
  • References

Introduction

TODO

Definitions

Node: Some process that is able to store data, do processing and communicate with other nodes.

Peer: Another node that a given node is connected to.

Peer-to-peer (P2P): Protocols where resources are divided among multiple peers, without the need of central coordination.

Device: A node capable of storing some data locally.

User: A (human) end-user that may have multiple devices.

Data replication: Storing the same piece of data in multiple locations, in order to improve availability and performance.

Data sync: Achieving consistency among a set of nodes storing data

Mobile-friendly: Multiple factors that together make a solution suitable for mobile. These are things such as dealing with mostly-offline scenarios, network churn, limited resources (such as bandwidth, battery, storage or compute).

Replication object: Also known as the minimal unit of replication, the thing that we are replicating.

Friend-to-friend network (F2F): A private P2P network where there are only connections between mutually trusted peers.

Content addressable storage (CAS): Storing information such that it can be retrieved by its content, not its location. Commonly performed by the use of cryptographic hash functions.

Consistency model: …
Mostly-offline:
Network churn:
Light node:

Private P2P: …
Structured P2P network: …
Unstructured P2P network: …
Super-peer P2P network: …
Group-based P2P network: …

Cryptographic hash function: …

Further work

  • Is minimal unit of replication necessarily the right abstraction? E.g. messages but you care about conversation, chunks vs files, etc.

Methodology

We look at generally established dimensions in the literature [TODO] [TODO], and evaluate protocols and applications based on these. Specifically the focus is on p2p applications that perform some form of data synchronization, with a bias towards secure messaging applications.

All notes are tentative and are based on the provided documentation and specification. Code is generally not looked into, nor has any empirical simulations been performed. These results have yet to be reviewed so take them with a grain of salt.

Compared dimensions

These dimensions are largely taken from the survey paper [TODO].

  • Minimal unit of replication
  • Read-only or read and write
  • Single-master vs multi-master
  • Synchronous vs asynchronous
  • Asynchronous optimistic or not
  • Eager or lazy updates
  • Full vs partial replication
  • Consistency model
  • Active vs passive replication
  • P2P type/topology

Notes on single-master vs multi-master

For single-master, there’s only a single node that writes to a specific piece of data. The other peers purely replicate it and don’t change it.

For many of the studied systems, content addressable storage is used. This means the replication object is usually immutable, and it is only written to once. As a side effect, many systems are naturally single-master from this point of view.

This is in comparison with the survey paper, where they are more interested in update-in-place programming through replicating SQL DBs and other “rich” data structures.

However, if we look at what is semantically interesting for the user, this is usually not an individual message or chunk of a file. Instead it is usually a conversation or a file. Seen from that point of view, we often employ some form of linear or DAG-based version history. In this case, many participants might update the relevant sync scope. Thus, the system is better seen as a multi-master one. To capture this notion I’ve modified single-master dimension to be w.r.t. replication object or user scope point of view. I suspect the latter is more informative for most cases.

TODO: Tighten up above paragraph with better definitions.

Further breakdown

TODO: Add definitions of below
TODO: Add more relevant dimension
TODO: Possibly decompose into further subproblems
TODO: Possibly re-structure as who-what-when-where-how-withwhat

  • linear/dag version history?
  • framing - support for nosync messages?
  • privacy-preservation (~wrt p2p layer?)

Compared technologies

This includes both applications and specifications. The level of rigor and nomenclature varies between projects.

  • Briar and its associated Bramble protocol stack [TODO]

  • Matrix and its specification [TODO]

  • Secure Scuttlebutt [TODO]

(- Tox and its Tok Tok specification [TODO])

  • Swarm [TODO]

Further research

  • Apples to apples: Does it really make sense to compare Matrix with something like Swarm directly? This also depends on how decomplected protocols are and to what extent they can be studied in isolation.

  • More technologies: Bittorrent, Git, Tribler, IPFS. Whisper, Status.

  • Similarities and differences: Since a lot of protocols make the same choices, it might make sense to break similarities apart and focus on differences. Similar to what is done in Whisper vs PSS [TODO].

Comparison

Dimension | Bramble | Matrix | SSB | Swarm
--- | --- | --- | --- | ---
Replication object | Message (DAG) | Message* (DAG/*) | Messages (?) (Log) | Chunks (File/Manifest)
Single-master? (object) | Single-master | Depends* | Single-master | ?
Single-master? (scope) | Multi-master | Multi-master | Single-master | ?
Asynchronous? | Yes | Partial | Yes | Yes
Asynchronous optimistic? | Yes | Yes | Yes | Yes
Full vs partial replication? | Partial | Partial | Partial? | Partial
Consistency model | Causal consistency | Causal | Eventual / Causal | ?
Active replication? | No | Yes | Yes | Yes
P2P Topology | F2F network | Super-peer | Group-based | Structured

  • Read-only or read and write
  • Single-master vs multi-master
  • Synchronous/eager vs asynchronous/lazy
  • Asynchronous optimistic or not
  • Full vs partial replication
  • Consistency model
  • Active vs passive replication
  • P2P type/topology

Further work

  • This dimension is a bit hairy: | Read-only? | Read&Write | Read&Write | Read&Write | Read&Write | - it depends on what we see as the minimal object of replication, and I’m not satisfied with the current separation

  • Maybe it makes more sense to have two replication objects, e.g. chunks and DAG or log, as opposed to changing notion of single-master etc

  • Matrix depends: should capture auth state semantics

  • How to deal with other things that are kind of replicated, but differently? E.g. metadata, etc. Auth state, ephemeral, manifests, feeds, etc.

Summary

TODO

References

TODO

2 Likes

April 1, 2019

Continued with data sync comparison: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-comparison.md

Diff: Update p2p data sync comparison doc · status-im/status-research@7e8d886 · GitHub

Protocol weekly notes Protocol weekly notes - CodiMD

1 Like

2019-04-02

Continued with data sync p2p comparison: https://github.com/status-im/bigbrother-specs/blob/master/data_sync/p2p-data-sync-comparison.md

Not happy with the dimensions break down. Tweaking a bit:

Compared dimensions

Request for comments: What’s a better way to decompose these properties? Ideally something like 2-5 dimensions that matter most. Right now it reads a bit like a laundry list, or like an ad hoc classification.

These dimensions are largely taken from the survey paper by Martins as well, with some small tweaks. To make it easier to survey, we divide up the dimensions into rough sections.

1. Why and what are we syncing?

  • Problem domain. Why are we syncing stuff in the first place?
  • Minimal unit of replication. The minimal entity that we can replicate. The actual unit of replication is the data structure that we are interested in, usually a collection of entities. Related: version history abstraction (linear vs DAG).
  • Read-only or read and write. Is the (actual unit of replication) data static or not?

2. Who is participating?

  • Active vs passive replication. Are participants replicating data that they are not interested in themselves?

3. When and where are we syncing?

Replication control mechanisms.

  • Single-master vs multi-master. Both entity and collection.
  • Synchronous (eager) vs asynchronous (lazy).
  • If asynchronous, optimistic or not.
  • Replica placement. Full or partial replication.

4. How are they syncing?

  • P2P Topology. Public or private P2P? Unstructured/structured/super-peer or friend-to-friend/group-based?
  • Peer Discovery. How do peers discover and connect with each other?
  • Transport requirements. Any specific requirements on the underlying transport layer?

5. What guarantees does it provide?

  • Consistency-model. What are the guarantees provided?
  • Privacy-preservation. Assuming underlying layers provide privacy, are these guarantees upheld?

6. Engineering and application, how does it work in practice?

  • Actively used. Is an instance of the protocol used in the real-world?
  • Mobile friendliness. Any specific considerations for mobile?
  • Well-defined spec. Is there a well-defined and documented spec to refer to?
  • Protocol upgradability. Are there any mechanisms for upgrading the protocol?
  • Framing considerations. Does it support messages that don’t need replication?

More diff

As well as filling in and comparing various properties.

Diff: Update p2p data sync comparison · status-im/bigbrother-specs@19ecae4 · GitHub

Also

Edited some phases based on Status protocol stack - CodiMD and setup static site BigBrother Specs | bigbrother-specs

Further reading

1 Like

Been a while since last post. Let’s post a brief update, starting with the two goals that were posted in OP a month ago.

Looking back

  1. Research and hacking.
    a) Continue studying relevant literature and existing solutions
    b) Document findings and create juxtaposition of different approaches
    c) Informed by above, resume PoC hacking and precise specification

Once there’s confidence in PoC and spec, make a plan for getting it into the app

  2. Finding great people.
    a) Keep track of researchers and people with relevant expertise
    b) Triage based on their relevance and likely availability
    c) As budget and time frame allows, reach out for collaboration

1 (a) and (b) have been done enough for (c) to be resumed. It allowed us to get a deeper understanding of things such as consistency models, how people look at replication in the literature, and to survey various approaches to this problem.

While a and b could continue for longer (much longer), we have enough to get a minimal viable version going (more on this later). As long as it supports upgradability we can improve things as we go along. This is important as it allows us to stay on the critical path.

2 Dean joined recently and has been hacking away, both on the general stack (GitHub - status-im/bigbrother-specs: Research and specification for Big Brother protocol) and minimal viable data sync (https://discuss.status.im/t/minimal-viable-data-sync-research-log/1185?u=oskarth).

We’ve also started having more regular calls (GitHub - status-im/messaging-pm: Messaging for Web3 PM) with other people under the Web3 Messaging banner. This includes closer collaboration with mixnet researchers such as Nym/Panoramix, and also recently the Swarm team.

Briefly on survey/research paper and utility thereof

For the continued research and juxtaposition, while slightly deprioritized the last few weeks, it is also something important that we should keep working on. There are about ~10 bigger-ish issues that need more thinking, as well as some feedback and misc TODOs that should be addressed. All of these are in-line here: bigbrother-specs/data_sync/p2p-data-sync-comparison.md at master · status-im/bigbrother-specs · GitHub

The proposed next milestone of the p2p data sync survey paper is to create a solid draft survey/research paper. This should be something of sufficient quality that we can put it out to the larger community, including academia. The idea being that this allows us to be evaluated seriously by security/ethereum/dist-sys community, and this can have other positive side effects such as:

  • snowball effect in terms of research, a la Gnutella research (lots of papers on this)
  • more eyeballs and evaluation of our protocol stack
  • more qualified people taking us seriously, recommending our work
  • more qualified people interested in wanting to participate with us

The latter is already partly happening through e.g. Web3 Messaging calls, and by being public with research we have had multiple groups reach out to us, been featured in Week in Ethereum, etc. We can reach wider than the Ethereum community though.

More on this in another update.

Misc PoCs

Some PoCs that have been written recently that relate to data sync:

Briefly on specs of Status protocol stack

Slightly out of scope of data sync per se, but there are some dependencies and interface points. We’ve made a bunch of progress on documenting on our current protocol stack, largely thanks to Adam.

This also includes discussions around bandwidth saving changes (while keeping compatibility), and the dependencies this has on things like moving to device-based comms, multidevices, and so on.

You can see the umbrella issue of things that have to be done here: Status protocol stack specifications, umbrella issue · Issue #10 · status-im/specs · GitHub

Minimal viable data sync spec and requirements

The goal with this work is to get a minimal viable data sync spec and PoC out as soon as possible so the app can use it. This can later be enhanced and upgraded. You can read Dean’s log here Status.app

The current description is here: Minimal Viable Data Sync - CodiMD

As part of this we have a set of requirements that we have been going back and forth on, in an attempt to lock down the things we know now. More on these specific requirements in the next section. This is still an open discussion, and to read more about this please consult above hackmd or ping me or Dean.

Towards minimal requirements

Here are the proposed requirements as of today.

  1. MUST sync data reliably between devices.
    By reliably we mean having the ability to deal with messages being out of order, dropped, duplicated, or delayed.

  2. MUST NOT rely on any centralized services for reliability.
    By centralized services we mean any single point of failure that isn’t one of the endpoint devices.

  3. MUST allow for mobile-friendly usage.
    By mobile-friendly we mean devices that are resource restricted, mostly-offline and often changing network.

  4. MAY use helper services in order to be more mobile-friendly.
    Examples of helper services are decentralized file storage solutions such as IPFS and Swarm. These help with availability and latency of data for mostly-offline devices.

  5. MUST have the ability to provide causal consistency.
    By causal consistency we mean the commonly accepted definition in distributed systems literature. This means messages that are causally related can achieve a partial ordering.

  6. MUST support ephemeral messages that don’t need replication.
    That is, allow for messages that don’t need to be reliably transmitted but still need to be transmitted between devices.

  7. MUST allow for privacy-preserving messages and extreme data loss.
    By privacy-preserving we mean things such as exploding messages. By extreme data loss we mean the ability for two trusted devices to recover from a, deliberate or accidental, removal of data.

  8. MUST be agnostic to whatever transport it is running on.
    It should not rely on specific semantics of the transport it is running on, nor be tightly coupled with it. This means a transport can be swapped out without loss of reliability between devices.

This brings down the total list of requirements from 11 conditions to 8, and 5 SHOULDS to 0. This is still a work in progress and feedback or questions on these requirements are welcome.

Misc notes

Just some points that might be worth mentioning.

It was brought to our attention that Teller Network would be interested in using data sync for state channels. This is one of several alternatives being evaluated, and its requirements would roughly be reliability (why mailservers don’t cut it for now) and scalability (why Ethereum PoW chain might not be the best choice). Other alternatives being evaluated are POA/POS chain, etc. One idea here is to choose a short-term solution, black box it, and work on something more desirable over long term, such as generalized plasma solution, etc.

From a data sync point of view, this is interesting as a specific use case that fits the requirements. A concern here is in terms of timeline.

Another point worth emphasizing is making sure we see data sync as generalized messaging, including machine-to-machine communication, and don’t get blindsided by our focus on supporting the app. There are probably more areas where we aren’t taking this general view sufficiently seriously (e.g. assuming human users everywhere).

Next steps

Main strands of work.

  1. Refine minimal requirements above, associated spec with wire protocol and protobuf, and get a working PoC that we can (later) use through console-client, and possibly other means. An aggressive deadline for the initial version of this has been set at end of April, but this might be revised.

  2. Work on getting a solid draft for all the associated specs in the spec repo. This is ongoing work, and with Adam going on vacation this might take a tad longer than expected. Old ETA for this was mid May but may be revised.

Associated with the spec effort is also the critical path for getting Whisper partition-topic bandwidth consumption, with compatibility, into the app. This is a must for a public launch, and interfaces with the existing spec, the app, multi-device pairing, and data sync.

  3. As bandwidth allows, continue the data sync p2p comparison survey/research paper. No ETA set. Might collab with Jacek on this later in person.

As well as continue with general collaboration with Messaging for Web3 teams. Potentially something more with Swarm team as there appears to be some opportunities there in the medium term.


I probably forgot some things, but that’s the (long) gist of it in terms of updates.

Until next time!

2 Likes

May 6

Compilation of notes from previous week. Today most focus was on Bloom filters. Specifically as they pertain to dealing with data loss and privacy preservation (see minimal requirements above), as well as to what extent they are a necessary optimization for bw/latency, and how they relate to sync with untrusted nodes. See research questions at end for more details.

Briefly on survey paper

Realized one thing that was bugging me with the existing survey work: it largely compares these different sync protocols in a vacuum. This means things that shouldn’t be captured there are, and there’s no explicit focus on the constraints of the problem, nor on proposing a solution.

While the desired output here is code into the app (this remains the critical path), we also want to keep doing this paper in parallel, being more problem and solution oriented as opposed to detached survey. To be continued.

Previous

Some notes and misc resources I didn’t get around to post before. More of a dump/reference.

Also revisited SoK and conversational security a while ago:

SoK small note: they say self-destructing messages are impossible, which is true, but they are possibly missing the local threat model as seen from real-world issues.

Three small pieces: (Notes captured elsewhere)


because pictures are pretty

<Notes from Git internals and object model, tags,references, remotes etc missing>

Bloom filters and their applications to data sync.

Main paper: Tarkoma, Sasu, Christian Esteve Rothenberg, and Eemil Lagerspetz. “Theory and practice of bloom filters for distributed systems.” IEEE Communications Surveys & Tutorials 14.1 (2012): 131-155.

In this paper they go over the basics of Bloom filters, and survey multiple variations of Bloom filters with different characteristics.

They also go through various applications, from general search, caching and network routing to p2p data sync.
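
To make the mechanics concrete, a minimal Bloom filter sketch (parameters illustrative): k hash functions set k bits per element, and lookups can false-positive but never false-negative, which is what makes them useful for offer/want-style sync exchanges.

```python
# Bit array of m bits; each element sets k positions derived from salted hashes.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(m_bits=2048, k=7)  # ~10 bits/element for ~200 items => ~1% FP
bf.add(b"message-1")
print(b"message-1" in bf)  # True
print(b"message-2" in bf)  # False (with high probability)
```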

Bloom filters, merkle trees and blockchains

Achieving Blockchain Scalability with Sparse Merkle Trees and Bloom Filters. The main idea in this blog post is to combine Sparse Merkle Trees (SMTs) and Bloom filters to increase Bitcoin scalability. SMTs are a sparser version of binary Merkle trees, and leverage default hash values for empty leaves. The SMT is used to prove membership of a transaction, whereas the Bloom filter is used to prove non-membership of a transaction, that is, to prove that an unspent transaction hasn’t been spent in a previous block. This is useful as you get a compact representation instead of going over all block headers to prove a transaction hasn’t been spent.

Similar ideas have been proposed for e.g. Plasma in Ethereum. See Plasma Cash with Sparse Merkle Trees, Bloom filters, and Probabilistic Transfers - Plasma - Ethereum Research. I believe there’s a difference here if you use something UTXO-like vs account-based. There are some concerns on the performance of SMTs, but this is largely irrelevant at the interface level. See the post by Vitalik for some ideas to optimize this: Optimizing sparse Merkle trees - Data Structure - Ethereum Research

Bloom filters and data sync

See papers above, things to read more.

  • Efficient PDA sync. Using Bloom filters for heterogeneous networks and synchronization protocols.

Only skimmed. Most sync uses “slow sync”, which is bad for bandwidth and latency. Their idea: synchronize hosts using one message in each direction with length |A-B| + |B-A|. Their approach, CPISync, is supposedly more efficient than even standard Bloom filter approaches, using some kind of characteristic polynomial (didn’t dive into the math details).

  • Informed Content Delivery Across Adaptive Overlay Networks. Optimizing throughput through better set reconciliation and collaborative downloads (p2p?). Also erasure coding.

  • Efficient Peer-to-Peer Keyword Searching. Quick search with low bandwidth using Bloom filters and DHTs. Inverted index over a DHT. Some mathy calculations and experimental results, including for CPU, latency and various forms of caching.

  • TRIBLER: a social-based peer-to-peer system. Exploiting similar interests and friendship/community to increase usability/perf in p2p. A set of extensions to BitTorrent. Talks about 5 challenges: achieving decentralization, availability, system integrity/trust, providing proper incentives and finally network transparency (NAT traversal, etc). Taste-buddy-based content discovery. Buddycast. Bloom filters are used for their megacaches. Using empirical data and a crawler they see Friendster average friends (~243) and f2f connections (9147), within an order of magnitude the same for others. This means ~260 bytes to discover friends-of-friends, enabling thousands of peers to exchange a lot of data quickly. Buddycast: taste buddy exploration (random peer) vs exploitation (taste buddies), and a ratio thereof.

  • How they solve the availability issue and how this relates to replication between untrusted nodes, Swarm etc.

  • Need to look into this in more detail

  • If you randomly sync with nodes, wouldn’t this impact sync scope? For a topic, say.

  • https://pdfs.semanticscholar.org/94e6/26b7ff25b07d23fd345aad254abbdd0a574b.pdf. Data bundle synchronization. Bloom filters, NAT traversal, data bundle selection algorithms. Good for high churn. Node from candidate list (probabilistic to get good behavior, 1%/45%/50% iirc, tracker/friend/f2f/random? ish). You select a range of bundles to sync, but which are these nodes? Only a range. They mention 100k bundles, so that’s all their content?

Flow Model: Node selection → Bundle Selection → Bloomfilter creation → NAT traversal → Waiting for reply. Need to read and think about this more - my mental model of replication space is different now from last time.

Churn resilience: 1000 nodes, 30m to 30s experiment.

Also, this paper is better than their wiki. (Dispersy Read the doc system overview).

Misc

(Delft, Niels Zeilemaker etc? Where are these people now? Reach out?)

Research questions

  1. Can Bloom Filters be used to do more efficient data sync?
    Unquestionably yes.

  2. Are Bloom Filters necessary for a minimal version of data sync, or can they be added after the fact?
    Leaning towards adding after the fact as enhancement being a desirable possibility.

TODO: Still here - the main unknown is in terms of how to deal with sync scope and sync with untrusted nodes.

Can we generalize the node selection and bundle selection step? This might make it more elegant (make it work->right->fast).

  3. Can Bloom filters be used/relied on for privacy?
    See paper (http://netgroup.uniroma2.it/wp-content/uploads/2015/04/BBL-PSD-cameraready.pdf) for analysis. Also how does Tribler deal with this? Vs private overlay.

  4. Can Bloom filters’ ability to prove non-membership be used to deal with exploding messages and data loss?
    To be explored.

  5. How big is the perf difference for reasonable parameters and Bloom filter sizes?
    I.e. bandwidth saving, to calculate.

TODO: Still here - need to set some assumptions and do some math, or compile existing empirical data. Can probably KISS to get order of magnitude intuition.

Next

Look into above questions more in detail, especially as blocking critical path. Study Tribler in more detail.

iirc it doesn’t, by sharing a bloom filter you allow a node to discern information about you, it’s similar to the sync with untrusted nodes issue

May 8

Looked into Dispersy bundle synchronization and Tribler in more detail. Efficient sync using subset Bloom filter selection, as well as some good ideas on dealing with challenging networks of various types.

Briefly on large group chats etc

Just to capture briefly ideas from physical notebook. Terse version.

  1. Leverage multicast when underlying protocol allows it, i.e. use | topic | message | for sending offers/acks. This assumes you send to many IDs and these are all listening to that topic.

  2. Random selection for large group chats. I.e. pick K out of N to offer/ack to, this allows not having quadratic scaling for naive syncs.

  3. Log collator idea. f(topic, logs) => topic, one log. Problem: Censorship resistance (proof of data availability~)? No denial of service.

Tribler

The main idea is to exploit the social connections between people who share similar interests, and use this to increase usability and performance in p2p networks.

Content discovery and recommendations is based on taste buddies, i.e. users with similar interests.

Talks about decentralization/availability/integrity/incentives/network transparency as being fundamental challenges in p2p networks.

Architecturally, it is a set of extensions to Bittorrent. It has a module for social networking, and ‘megacaches’ to locally query metadata. This metadata includes things like a friends list (information on the social network), a peer cache (info on super peers) and a preference cache (preference lists of other peers). All of these are under 10MB in size.

The problem domain is Bitorrent so static files. In UI there’s downloads/friends/files liked and taste buddies.

They have a buddycast algorithm (epidemic, based on the social graph) and a ‘peer similarity evaluator’.

Bootstrapping uses a random super peer to get more nodes, then uses those.

Session boundary problem - using stable peer ids, i.e. keypairs.

Bloom filters are used for storing and testing set membership in these megacaches, e.g. filtering out peers from messages when they are already known to the destination.

The size of these filters is a function of expected connections. Empirical data: Friendster averages 240 friends and ~9100 friends-of-friends [FB ~200 median, 340 average]. (This makes sense: ~100 friends, then 100*100 with some overlap.)

To get a bloom filter with 1% false positive rate you need <10 bits per element, so that’s 260 bytes to test and figure out common friends of friends.
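
A back-of-envelope check of those figures using the standard sizing formula (my own arithmetic, not numbers from the paper):

```python
# Optimal bits per element for false-positive rate p is -ln(p)/(ln 2)^2,
# independent of n; total size then scales linearly with n.
import math

def bloom_bits(n_elements: int, p: float) -> float:
    return -n_elements * math.log(p) / (math.log(2) ** 2)

print(-math.log(0.01) / math.log(2) ** 2)  # ~9.6 bits per element at 1% FP
print(bloom_bits(217, 0.01) / 8)           # ~260 bytes for ~217 elements
print(bloom_bits(240, 0.01) / 8)           # ~288 bytes for ~240 friends
```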

Buddycast: Each peer has a number of taste buddies in their megacache with preference lists (what content they want) and random peers (i.e. no known preference list). Then every 15s, each peer can either exploit or explore. Exploit = connect to a taste buddy; explore = connect to a random peer. This ratio can be titrated (see the sketch after the next note).

A buddycast message also has taste buddies, top 10 preference list and random peers. Buddycast can be disabled for privacy. Peers are listed either by freshness (random peers) or similarity (~overlapping preference lists?).
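
A sketch of that exploit/explore selection step, using naive preference-list overlap as the similarity measure (all names illustrative):

```python
# Every round: explore (random peer) with some probability, otherwise exploit
# (the most similar taste buddy). The ratio is the tunable knob.
import random

def similarity(prefs_a, prefs_b):
    return len(set(prefs_a) & set(prefs_b))

def pick_peer(my_prefs, taste_buddies, random_peers, explore_prob=0.5):
    if random_peers and (not taste_buddies or random.random() < explore_prob):
        return random.choice(random_peers)            # explore
    return max(taste_buddies,                         # exploit
               key=lambda p: similarity(my_prefs, p["prefs"]))

buddies = [{"id": "A", "prefs": ["x", "y"]}, {"id": "B", "prefs": ["x", "y", "z"]}]
best = pick_peer(["x", "y", "z"], buddies, ["R1", "R2"], explore_prob=0.0)
print(best["id"])  # B: largest preference overlap
```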

Briefly talking about BT and rarest-first, tit for tat, and BTSlave using repeaters to increase content replication.

  • Curious: does the megacache concept scale for messages? Assume a message is 1KB and the average user sends 100 messages a day (Slack’s average sent is 70). If you are talking to 100 other people (~200 friends in some constellation of N chats), that’s 10k messages per day, or 10 MB per day already. The key thing here is that Tribler stores metadata, not the messages themselves, which (a) allows it to be a cache, since data integrity isn’t a hard requirement, and (b) keeps space roughly constant (prune preference lists at top N).

  • What would happen if we used this on top for most popular/recent content? Still allowing for other means.

Main paper May 8: Dispersy Bundle Synchronization

This is their academic paper equivalent to Dispersy — Dispersy 0.0.1 documentation - not clear exactly what the overlap looks like. I find it interesting that in their docs they call it a Distributed Permissions System, but in the paper they emphasize data dissemination in challenged networks. EDIT: Oh, because this is about the bundle synchronization in particular, not Dispersy or how they use it.

Introduction

Requires minimal assumptions about the reliability of communication/network components. Decentralized and pure p2p. Specifically for challenged networks.

What are examples of challenged networks?
Mainly: Vehicular ad hoc networks (VANETs), Delay Tolerant Networks (DTNs).
Additionally: P2P Networks and smartphones.

What’s challenging about VANETs?
Connection windows are small.

What’s challenging about DTNs?
Connections are only possible when network isn’t very well-utilized.

According to Dispersy, what’s challenging about pure p2p networks?

  • Churn - node lifetime measured in minutes
  • NAT traversal
  • Malicious nodes: frequent interactions

In the past, these were treated in isolation. In Dispersy, these concerns are combined.

What’s the main idea in Dispersy?
Nodes advertise locally available data in one message, and repeat random interactions lead to quick spreading of data.

NOTE: It appears that the goal is to spread all bundles to all nodes, i.e. a form of full replication. It is also active, since all nodes are spreading all data.

Individual bundles can appear out of order, so it’s different from Delivery in Bramble. They suggest not having dependencies on bundles.

Key things in paper:

  • 1:N and M:N data dissemination with minimal complexity/state info
  • Easy repeated random interactions between nodes, wrt churn, failure and NATs
  • Allow for 100k+ bundles and M+ nodes
  • Experimental data and real-world usage

Related work:

What are the three forms of distributing updates Demers proposed?
direct mail, anti-entropy and rumor mongering.

According to Demers, how does direct mail data distribution work?
Update is sent from one node to all other nodes.

According to Demers, how does anti-entropy data distribution work?
Node chooses random node and resolve difference in state.

According to Demers, how does rumor mongering data distribution work?
A node receives a new update (hot rumor), then forwards the update until it goes cold (received from many nodes).

Demers initially thought anti-entropy too expensive; the optimization is to use checksums etc. (NOTE: Before Bloom filters were popular?)

VANET can be seen as P2P due to churn and many nodes w/o central component.

What is the network churn in VANETs due to?
Movement of the vehicles.

Lee had urban vehicles with event summaries such as license plates, and then a Bloom filter to detect missing summaries. Bloom filters are very efficient for membership testing; in Lee’s case using 1024 Bytes. (NOTE: for a 1% false positive rate that’d be ~1000 cars). They also coupled this with a push to one-hop neighbors. The Bloom filter here is thus only used to detect missing summaries. They call this a form of harvesting.

System Overview

‘Bundle synchronization middleware’ for challenged networks. Extends the anti-entropy method using a compact hash representation (Bloom filters). Allows two nodes to compare data sets w/o sending complete copies.

TODO: Insert overview of sync and receive

In Dispersy, what do the five synchronization steps look like?

  1. Node selection
  2. Bundle selection (some range)
  3. Bloom Filter creation
  4. Send BloomFilter (NAT traversal)
  5. Wait for reply (Pause and goto start)
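A minimal sketch of that loop in Python. The five steps are from the paper; the function signatures, the injected helpers (build_bloom_filter, send, select_node), and the pause length are assumptions for illustration:

```python
import time

PAUSE_SECONDS = 5.0  # illustrative pause between sync rounds

def sync_loop(candidate_list, local_bundles, select_node, build_bloom_filter, send):
    """One Dispersy-style sync round per iteration (sketch; helpers are stand-ins)."""
    while True:
        node = select_node(candidate_list)          # 1. node selection
        capacity = 1000                             # ~what a 1KB filter holds at ~1% FPR
        subset = sorted(local_bundles)[:capacity]   # 2. bundle selection (some range)
        bloom = build_bloom_filter(subset)          # 3. Bloom filter creation
        send(node, bloom)                           # 4. send filter (NAT traversal)
        time.sleep(PAUSE_SECONDS)                   # 5. wait for reply, go to start
```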

Synchronization extends harvesting by Lee. They only sync a range of bundles at a time, which keeps the false positive rate low w/o a big Bloom filter.

NOTE/QUESTION: How does that follow? Bloom filter size is determined by bits-per-element (m/n), but if the element is the set of possible bundles, what does it matter what the specific range is? Does this imply the recipient knows which range it is supposed to receive, in order to limit the universe of possible elements?

They say “in contrast Lee used a fixed size w/o selecting bundles”. This seems to imply the recipient knows what the range is.

System Components

To allow for different types of challenged networks, Dispersy is modular. It has a few different components:

  • Network Construction
  • Node Discovery
  • Robust Node Selection
  • NAT puncturing
  • Synchronization
    (- Synchronization performance)

Network construction

All nodes communicate in one or more overlays.

What’s an overlay in p2p?
An additional network built on top of another one, e.g. the Internet.

A public key is used as identifier of an overlay, which has to be known to nodes joining.

NOTE: Is this sloppily written or am I misunderstanding? Is this just NodeID or do they mean some kind of bootstrap node / Network ID? EDIT: I guess if we look at each node as having their own overlay.

Why did Dispersy choose to run on UDP over TCP?
Because it allows for better NAT firewall puncturing.

Each overlay has a candidate list.

What is the candidate list that each Dispersy node has in each overlay?
List of active connections within that overlay.

When we connect to an overlay, the candidate list is empty, thus requiring node discovery.

Node Discovery

What’s the most basic form of node discovery if you have an Internet connection?
Using trackers.

What characteristics does a tracker have for node discovery?
(1) Known to all nodes (2) Must be connectable (not behind NAT-firewall).

What does a tracker do in terms of node discovery?
(1) Maintains list of nodes for several overlays (2) Returns these upon request.

After populating candidate list with initial nodes, for example by asking a tracker, these can be used for new connections.

Node A asks node B for an introduction to a new node; B sends a reply with node C. Then A can puncture the NAT.

Description of NAT puncture mechanism, more in later section.

Robust Node Selection

We have to select a node from candidate list to send intro request to, but there might be attacker nodes in the list.

Why is node selection in p2p non-trivial?
You might connect to a malicious node, leading to e.g. eclipse attack.

What is it called when the candidate list of a p2p node is manipulated to include attacker nodes?
Eclipse attack.

What can a malicious node do after eclipsing a node?
Control all data received by victim.

How does Dispersy create a more robust node selection algorithm?
By dividing candidate list into three categories, one with two subcategories.

What are the 3-4 selection list categories in Dispersy?

  1. Trusted nodes
  2. Nodes we contacted before (successfully)
  3. Nodes that contacted us before
    3a) Received introduction request
    3b) Nodes introduction

When selecting a node from the candidate list, category 1 is picked 1% of the time; the remaining 99% is split evenly between categories 2 and 3 (49.5% each), with each subcategory of 3 getting 24.75%.

NOTE/TODO: This probably generalizes to how Bitcoin and Ethereum deals with Eclipse attacks, in terms of buckets etc. Might be worth comparing these.

For each step in Dispersy node selection, what’s the probability of choosing each category?
1% trusted, ~50% nodes we contacted before, and ~25% for each subcategory of nodes that contacted us before.
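To make those weights concrete, a minimal sketch in Python. The weights are the ones from the paper; the bucket names are invented:

```python
import random

# Dispersy's selection weights: 1% trusted, 49.5% nodes we contacted before,
# and 24.75% for each subcategory of nodes that contacted us before.
CATEGORIES = ["trusted", "we_contacted", "sent_us_intro_request", "was_introduced_to_us"]
WEIGHTS = [0.01, 0.495, 0.2475, 0.2475]

def select_category():
    """Pick which candidate-list bucket to draw the next node from."""
    return random.choices(CATEGORIES, weights=WEIGHTS, k=1)[0]
```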

How does Dispersy’s node selection algorithm guard against eclipse attacks?
Attacker nodes will (usually!) end up in the third bucket, which has a cap on the selection probability.

Which node is selected within each category is elaborated on later. Roughly: the oldest entry (due to NAT timeouts), as well as the most similar friend (IIRC).

If an attacker has a lot of resources they can still eclipse node, hence the use of trusted nodes.

NOTE/TODO: Eclipse video worth rewatching with better notes on mitigations. Also: How does 1% impact eclipse possibility to do harm? If you have limited amount of connections you might still not reach 1% case, so seems to depend on what harm individual/set of nodes can do.

Trusted nodes (i.e. trackers): Contacting a trusted node will completely reset the candidate list (!). They did experiments with manipulating category 2 nodes.

Why are trusted nodes (trackers) less suspectible to eclipse attacks?
They are contacted constantly by many honest nodes, which makes it harder to pollute the candidate list. (N.B. Sybil attacks).

NOTE/TODO: Read up more on Sybil attack and Sybil resistance.

NOTE: They seem to claim 4 million concurrent nodes in P2P empirically (cite 18, Gnutella), and implicitly that the risk of a Sybil attack at that scale is low.

NOTE/TODO: How does recency impact the sybil attack resistance? IIRC in Eclipse presentation there’s an argument that old nodes are safer, I guess this depends on bucket design? I.e. if % allocation changes.

NAT puncturing


How many nodes connected to the Internet are behind a NAT, according to Halkes et al 2011?
About two thirds (64%).

NOTE/TODO: this number might be even higher nowadays with smartphones? Read up on current state of NAT traversal, and how this changes with different protocols, i.e. QUIC etc.

As a consequence of this, each introduction step is combined with a distributed UDP NAT puncturing mechanism.

How does Dispersy get around the fact that 64% of nodes are behind a NAT?
By integrating each node discovery step with a distributed UDP NAT puncturing mechanism.

How does the UDP NAT puncturing mechanism work in Dispersy? Three steps.

  1. When A asks B for an introduction, B sends an introduction-reply to A AND performs a puncture-request to C (with A’s info)
  2. C then sends puncture message to Node A
  3. C’s message will be blocked by A, but when A selects C they can connect to them
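A toy sketch of that message flow in Python; the Node class and send() (which just prints, standing in for a UDP socket) are invented, and the real mechanism involves address/port bookkeeping this glosses over:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Toy node; send() prints instead of sending a UDP packet (sketch)."""
    name: str
    candidates: list = field(default_factory=list)

    def send(self, to, message):
        print(f"{self.name} -> {to.name}: {message}")

def introduce(A, B):
    """B introduces A to some node C, triggering the NAT puncture dance."""
    C = B.candidates[0]                        # B picks a node to introduce
    B.send(A, ("introduction-reply", C.name))  # step 1: tell A about C...
    B.send(C, ("puncture-request", A.name))    # ...and ask C to puncture for A
    C.send(A, ("puncture",))                   # step 2: opens a mapping in C's NAT;
                                               # step 3: A's NAT drops it, but A can
                                               # now reach C when it selects C later

# Usage:
a, c = Node("A"), Node("C")
b = Node("B", candidates=[c])
introduce(a, b)
```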

NOTE/TODO: A bit fuzzy on the details of how the ports are matched up, worth revisiting basics. Same re NAT timeouts.

Due to NAT timeouts, nodes are invalidated after some time (~30s-1m). See Halkes (cite 9) for more.

Synchronization

Bundle synchronization through Bloom filters allows nodes to check for missing bundles. Additionally, it piggybacks on the introduction mechanism so that just one message needs to be sent.

Why is it extra important to limit size of bloom filter in Dispersy?
Due to UDP + piggybacking on introduction message, so that the MTU of link isn’t exceeded.

What happens if the MTU of the link is exceeded for UDP?
If the UDP datagram consists of multiple fragments, there’s a higher probability it will be lost.

What’s a typical MTU size used on the Internet, according to RFC 3580?
1500 Bytes (2304B for 802.11).

NOTE: Might be more up to date resources for this / gotchas for various links.

NOTE/TODO: How big would e.g. Sphinx packet format be? Same for various types of standard cryptographic payloads, e…g Handshake. If they are written with TCP in mind, perhaps this is a gotcha.

<1000 bundles can be synchronized with a 1% false positive rate using 1024 Bytes; <2000 with a 10% false positive rate.

Assuming bloom filter of 1024B, how many messages can we synchronize with 10% false positive rate?
<2000

What’s a flaw in Lee’s MobEyes bloom filter design that uses 1024 Bytes filter?
There’s no limit on the number of bundles being synced, which means the false positive rate shoots up as elements are added (e.g. ~10% at 2000 vehicles, double the design capacity).

How does Dispersy get around the flaw in Lee’s MobEyes bloom filter parameter choices AND MTU limitations?
By fixing the false positive rate at 1% and letting the MTU limit the number of bits available (m), the Bloom filter gets a fixed capacity, so we select a subset of bundles to synchronize (n).
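As a sanity check on those numbers, the standard Bloom filter capacity formula n = -m (ln 2)² / ln p (assuming the optimal number of hash functions) reproduces the figures quoted above:

```python
import math

def bloom_capacity(m_bits, p):
    """Max elements n an m-bit Bloom filter holds at false positive rate p,
    assuming the optimal number of hash functions: n = -m * (ln 2)^2 / ln p."""
    return int(-m_bits * (math.log(2) ** 2) / math.log(p))

m = 1024 * 8  # 1024-byte filter, as in Dispersy / Lee's MobEyes
print(bloom_capacity(m, 0.01))  # ~850  -> "<1000 bundles at 1%"
print(bloom_capacity(m, 0.10))  # ~1700 -> "<2000 bundles at 10%"
```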

Subset selection

Based on the global time property of each bundle, which is an implementation of Lamport’s logical clock.

What property is subset selection in Dispersy based on?
Global-time property of each bundle, which is using Lamport’s logical clock.

Global time is the highest global time received by the node + 1 when creating a bundle. NOT unique, but provides a partial ordering in the overlay. The distribution is assumed to be roughly uniform.
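A minimal sketch of that clock in Python (class and method names invented; the highest-seen + 1 rule is the one described above):

```python
class GlobalTime:
    """Lamport-style logical clock for bundle creation (sketch)."""
    def __init__(self):
        self.highest_seen = 0

    def on_receive(self, bundle_global_time):
        # Track the highest global time observed in the overlay.
        self.highest_seen = max(self.highest_seen, bundle_global_time)

    def next_for_new_bundle(self):
        # A new bundle gets highest-seen + 1. Not unique across nodes,
        # but it yields a partial ordering.
        self.highest_seen += 1
        return self.highest_seen
```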

Two heuristics for defining which subset to sync.

How does global time property help selecting what to sync in Dispersy?
By specifying a subset range of this (non-unique) global time to sync.

What heuristics are used in Dispersy to select subset to sync in Dispersy and when are they used?
Modulo heuristic when a node is lagging behind, and pivot heuristic when it’s almost up to date.

Modulo heuristic: count the number of bundles, divide by the Bloom filter capacity. E.g. 10k bundles with a Bloom filter capacity of 1000 gives modulo 10. Then use a random offset, e.g. select every 10th bundle + offset n: 1, 11, 21, etc.
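A minimal sketch of that subset selection, assuming bundles are keyed by (not necessarily unique) global time; the function name and capacity default are invented:

```python
def modulo_subset(global_times, capacity=1000, offset=1):
    """Select the bundles to put in the Bloom filter using the modulo
    heuristic (sketch): every (total/capacity)-th bundle, shifted by offset."""
    modulo = max(1, len(global_times) // capacity)  # e.g. 10k bundles / 1000 -> 10
    return [gt for gt in global_times if gt % modulo == offset % modulo]

# Usage: 10k bundles, offset 1 -> global times 1, 11, 21, ...
subset = modulo_subset(list(range(1, 10001)), capacity=1000, offset=1)
print(subset[:3])  # [1, 11, 21]
```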

What does modulo heuristic in Dispersy resemble and what does it not require?
Linear download (1, 11, 21 etc) and doesn’t require state.

What’s the downside of modulo heuristic in Dispersy?
If almost caught up, it’d still take ~modulo steps to sync the last bundles, stepping through each offset (+1/+2 etc.).

For pivot sync, an exponential distribution between 0 and the locally known highest global time is used. Then Bloom filters to the left and right of the pivot are selected, and one is picked. (NOTE: A bit hazy on the exact details here, would need to revisit; the exponential distribution means a bias towards newer bundles).
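Hedging heavily here, since the note above is hazy on the details: one plausible reading is that the pivot is drawn with an exponential bias towards recent global times, roughly like the sketch below. The distribution parameter and the subtract-from-max construction are guesses, not the paper's exact method:

```python
import random

def pick_pivot(max_global_time, bias=5.0):
    """Draw a pivot global time, exponentially biased towards recent bundles (guess).
    Higher bias favors the newest end of the range more strongly."""
    # expovariate yields small values most often; subtracting from the max
    # makes recent (high) global times the most likely pivots.
    sample = random.expovariate(bias / max_global_time)
    return max(0, max_global_time - int(sample))
```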

NOTE/TODO: How would we allow for these types of sync modes? It’d require the global-time partial ordering to be visible to all nodes, AFAIUI.

What does a node do when receiving Bloom filter in Dispersy?
It checks whether it has bundles the sender is missing, by testing its local subset of bundles against the Bloom filter.

What does the receiving node in Dispersy need in order to find out whether it has missing bundles to send?

  1. Bloom filter
  2. Subset defined by: a) range of global-time b) modulo and c) offset value.
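A sketch of the receiving side, reusing the hypothetical modulo-style subset description from above; the function name and the dict layout are invented, and any Bloom filter object supporting an `in` membership test would do:

```python
def find_bundles_to_send(local_bundles, bloom, gt_low, gt_high, modulo, offset):
    """Return local bundles the sender appears to be missing (sketch).

    local_bundles: dict mapping global time -> bundle payload
    bloom: the received Bloom filter, supporting `x in bloom` membership tests
    gt_low/gt_high/modulo/offset: the subset description sent with the filter
    """
    missing = []
    for gt, bundle in local_bundles.items():
        in_subset = gt_low <= gt <= gt_high and gt % modulo == offset % modulo
        if in_subset and gt not in bloom:
            # A false positive means we occasionally skip a bundle the sender
            # actually lacks; it gets picked up in a later sync round.
            missing.append(bundle)
    return missing
```

For testing, a plain set can stand in for the Bloom filter (a set is just a filter with a 0% false positive rate).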

Synchronization performance

Nice math based on false positive rate, MTU, number of bundles etc., depending on synchronization level.

NOTE: Interesting to note re the false positive rate: AFAIUI it is inversely related to bandwidth performance.

Evaluation and deployment

Actual simulation and empirical study of things described above. Including NAT experiments, random node selection etc. Skimming for now.

Churn resilience based on session time. Propagation speed. Bandwidth consumption. High workloads. Internet deployment (integrated into Tribler).

NOTE: Cell phone performance is much worse than WiFi connections. I.e. carrier-grade routers with address- and port-dependent mapping (APDM) are hard to puncture. Also UDP concerns.

NOTE/TODO: Look into smartphone NAT traversal state of the art.

To sync 100k bundles Dispersy connected to ~1000 nodes and consumed 30 MBytes of BW per run. (NOTE: Assuming a bundle is ~1KB, 100k bundles would be ~100 MB of raw data, so 30 MB is about 1/3 of that? Eh, a unit is off somewhere).

Success rate of 77% for NAT traversal across 15 different NAT combinations (N.B. smartphones worse). Can handle excessive churn (session times of 30s).

Further reading and conclusion

Awesome paper, much better and clearer than the Tribler Dispersy docs. Wish I’d read it in detail earlier. Lots of beautiful ideas, backed up by real-world experience and math. Still need to think about the untrusted-node stuff, but at least the dependencies and mechanics are clearer.

Some citations from paper:

  • Bloom, Burton H. “Space/time trade-offs in hash coding with allowable errors.” Communications of the ACM 13.7 (1970): 422-426.

  • Demers, Alan, et al. “Epidemic algorithms for replicated database maintenance.” ACM SIGOPS Operating Systems Review 22.1 (1988): 8-32.

  • Halkes, Gertjan, and Johan Pouwelse. “UDP NAT and Firewall Puncturing in the Wild.” International Conference on Research in Networking. Springer, Berlin, Heidelberg, 2011.

  • Huang, Hong-Yu, et al. “Performance evaluation of SUVnet with real-time traffic data.” IEEE Transactions on Vehicular Technology 56.6 (2007): 3381-3396.

  • Laoutaris, Nikolaos, et al. “Delay tolerant bulk data transfers on the internet.” ACM SIGMETRICS Performance Evaluation Review. Vol. 37. No. 1. ACM, 2009.

  • Lee, Uichin, et al. “Dissemination and harvesting of urban data using vehicular sensing platforms.” IEEE transactions on vehicular technology 58.2 (2009): 882-901.

  • Singh, Atul, et al. “Defending against eclipse attacks on overlay networks.” Proceedings of the 11th workshop on ACM SIGOPS European workshop. ACM, 2004.

Ran out of time for timebox, and still have some synthesizing to do, but thought I’d post it here for e.g. Dean et al. TODO: understand in more detail the following points:

  1. What exactly would need to be added to our current setup to enable this (e.g. transparent global time and subset indicators)
  2. How this compares with the current/Briar-ish approach wrt untrusted node assumptions and latency (sync delay is quite big) - how can we get the best of both worlds? Also Bloom filter privacy.

Extending MVDS with stateless synchronization

Background

Existing spec for MVDS: Minimal Viable Data Sync - CodiMD

Data Sync log research including Dispersy Bundle Sync: Status.app

Bundles and messages are used interchangeably.

Rationale

While the currently suggested implementation takes care of the requirements, it has a few things that can probably be improved:

  • performance in large group chats - minimize chatter
  • leveraging untrusted nodes better for churny nodes

Churny nodes means mostly-offline, smartphones, etc.

One way of doing this is to extend the current protocol proposal with some ideas from Dispersy Bundle Sync algorithm (DBS hereafter).

Overview

DBS uses stateless sync with Bloom filters over subsets of bundle ranges. Essentially it goes something like this:

  1. Node selection
  2. Bundle selection, subset
  3. Bloom Filter construction
  4. Send Bloom filter (they use NAT traversal, but this can be separate component)
  5. Wait for reply

They assume you want to sync all data with all nodes. Essentially you have a candidate list and then you pick a (possibly untrusted) node from it, using a specific selection algorithm.

For us, we can do this per sync scope, and let nodes be the participants by default. As an extension, other nodes (i.e. any node) can be included, but there are some privacy considerations to work out to make this workable.

Even if we leave node selection alone it would already be an improvement, as you can leverage existing nodes in a group chat, with no difference from current MVDS setup.

The tricky part, and the one that requires some modification, is how to do bundle selection. To make the Bloom filter efficient, we need to be able to select a subset range of bundles/messages to sync. This has some requirements: the selection should be doable by any node, i.e. it shouldn’t rely on something like a DAG inside the message.

For Bloom filter creation, what it gives you is a way to check whether someone is missing messages. I.e. A sends B a Bloom filter of a range of bundles, and B can test against this range. If B has messages A doesn’t have, it can send them, just like we do in MVDS right now. And vice versa when B sends its own filter.

Bundle selection

To do bundle selection we need a few things. The main property that helps us cut bundles up is the global time property. This is a Lamport logical timestamp which provides a partial ordering of bundles. When a bundle is created, it takes the highest global time it has seen + 1. The range of these timestamps is assumed to be roughly uniform, and they are not assumed to be unique.

We probably want this property to be outside the payload, as this allows any node to participate in the sync. I imagine this would be in the header. I imagine this will be per sync scope, but it might also be global. We need to figure out the trade-offs and how much of this needs to be specified.

In addition to this global time property, we need a few more things. These are more like mechanisms and things for the Bloom Filter Offer payload.

Essentially we want a way to take a specific subset of all bundles available. Then depending on the sync mode (old: lots of missing bundles; new: just give me the latest) we use different heuristics. These are the modulo heuristic and the pivot heuristic. Without going into too much detail, the main things we need to communicate here are:

  • global time range
  • some modulo number
  • some offset

In the modulo heuristic case, if we have 10k bundles, we might pick modulo 10, then offset 1…n, and in each sync send a Bloom filter with metadata: [lamport low, high], 10, 1. The receiver would then use the Bloom filter to check whether the sender is missing something among bundles 1, 11, 21, etc. Then repeat the process for each offset until we have synced across the whole bundle range.
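One way the Bloom Filter Offer payload could look, as a hedged sketch: the field names and types are invented, and in practice this would presumably be a protobuf message next to the existing MVDS payloads rather than a Python class:

```python
from dataclasses import dataclass

@dataclass
class BloomFilterOffer:
    """Hypothetical MVDS 'Bloom Filter Offer' payload (sketch, names invented)."""
    global_time_low: int   # start of the Lamport/global-time range
    global_time_high: int  # end of the range
    modulo: int            # modulo heuristic parameter
    offset: int            # which residue class of the range is covered
    bloom_filter: bytes    # the filter itself, sized to fit the MTU budget

# e.g. the modulo example above: range [1, 10000], modulo 10, offset 1
offer = BloomFilterOffer(1, 10000, 10, 1, bloom_filter=b"...")
```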

NOTE: Need to confirm this is how they structure global time range

For pivot sync, as far as I can tell, it’s a global time range chosen probabilistically based on an exponential curve. This has a bias towards the most recent messages.

Changes required

As far as I can tell the only thing we really need here to enable this is to:

  1. Logic: Make node selection step explicit
  2. Logic: Make bundle/message selection step explicit
  3. Logic/Wire: Add Lamport clock to header

As well as later on:
4) Writing modulo and pivotal heuristics
5) Adding Bloom Filter creation, offer and handling logic
- Additional payload with global time, modulo, offset

Sanity checks and comments welcome.


Briefly on MTU budget

To allow for more efficient transports, it’s useful to be mindful of the MTU. From my current reading, 1500 Bytes seems like a safe standard. This is 1.5KB, or essentially 1KB with some wrapper on it. Some questions we should answer:

  1. How much should we care about this at this stage? I lack intuition to know how big of a PITA this can be, other than knowing it can lead to huge packet loss for UDP with NAT puncturing.

  2. Is this MTU budget correct? Are there other limits in UDP / smart phone routing / mesh networks etc we should care about?

  3. Are there any aspects of our design where this MTU might be hit? Examples are: big cryptographic handshakes; header overhead; excessive wrapping; Sphinx packet format; Bloom filter size; message payload.

  4. Is there some general heuristic for budgeting these things? I.e. how big of a payload budget do we have, given that we want to encrypt it, put some header, then routing info, etc.
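To make question 4 concrete, a back-of-envelope sketch of such a budget. Every overhead number below except the MTU and the IPv4/UDP headers is a rough guess to be replaced with measured values:

```python
# Hypothetical MTU budget (overheads are guesses, not measurements):
MTU = 1500               # typical Ethernet-path MTU, per the note above
IP_UDP_HEADERS = 28      # 20B IPv4 + 8B UDP (IPv6 would eat 12 more)
ROUTING_OVERHEAD = 100   # guess: discovery/overlay framing
CRYPTO_OVERHEAD = 150    # guess: nonce, MAC, key material, padding
SYNC_HEADER = 50         # guess: global time range, modulo, offset, type

payload_budget = MTU - IP_UDP_HEADERS - ROUTING_OVERHEAD - CRYPTO_OVERHEAD - SYNC_HEADER
print(payload_budget)    # ~1172 bytes left for Bloom filter or message payload
```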

How much should we care about this at this stage?

Not sure you need to care about MTU at all for a protocol at this layer. I think it makes sense if you are designing one that you know will run directly on UDP (for example a discovery protocol over UDP), but that’s unlikely to be the case here, as it will probably have a few layers underneath; in our specific case it will most likely run over TCP (devp2p).

That’s not to say that we shouldn’t keep the payload small and lean; we might try to keep it under 1.5K, and then it just gets padded by the encryption layer and uses TCP as a transport, where MTU is less of an issue.