Topic negotiation and device-to-device communication proposal

cammellos · May 7, 2019, 11:42am

Problem

Currently all the traffic from 1-to-1 messages / group chats / device syncing goes into a single whisper topic.
This makes this topic fairly heavy and effectively prevents us from retrieving more than 24 hours of data.

We have been talking about moving to a partitioned topic, but we have faced the issue of how to keep compatibility once we do.

During the discussion, we reasoned that moving to device-to-device communication will give us a better way to upgrade our protocol without breaking compatibility, since we will essentially be able to know who is listening on the other side.

Proposed solution

We already use device-to-device communication for group chats. The proposed solution extends this method to 1-to-1 chats and adds some form of topic negotiation on top of it.

Information about devices is published periodically by each device and can be looked up by other devices when sending a message, so they know which devices to target.

Given that user A has 3 devices A1, A2, A3 and user B has 3 devices B1, B2, B3, we would like to negotiate a topic T and have all devices able to listen to that topic.
We also want to minimize the amount of chatting between devices, as we should consider them mostly offline.

Topic seeding

The fastest way to negotiate a topic would be for whoever starts the conversation to send a random topic to all the other devices, wait for the other devices to acknowledge and then start using it to publish.

Although this method works, we don’t want entropy from a single source, as that might open some attack vectors (essentially we would be giving A the ability to make B listen to any topic of their choice). Instead, we can leverage the fact that we use key pairs to calculate a shared topic, for example using Diffie-Hellman:

On A’s side:

DH(Aprk, Bpk) -> append some constant bytes -> SHA3 -> take the first 4 bytes

On B’s side:

DH(Apk, Bprk) -> append some constant bytes -> SHA3 -> take the first 4 bytes

The key initially generated can be used for symmetric encryption, and the 4 bytes as topic.

In this case there’s no need to add any extra information to the existing message, as all the information is already known.

So, in short, on receiving a message from A, B will simply calculate the topic using A’s public key (Apk) and listen for messages on the new topic.

Converge

The problem is that we can stop publishing on the shared topic and move to the new topic only when all the devices participating in the conversation have acknowledged the new topic. Otherwise some of the devices will be left behind.

To do this, we can ask devices to send acknowledgement messages on the new topic. Devices can also carry that information with each message (optionally, to save bandwidth, we can compute paddings to check against, etc.). That way latecomers or new devices, which might have missed the acknowledgements, will be brought up to date.

For example, say we have A1, A2, B1, B2:

A1 sends a message to B1 and B2. It will also send a message to A2, if paired.

B1 receives the message and acknowledges on topic T, including {:acknowledgements => ['A1']}

B2 sees B1’s message and acknowledges on topic T, including {:acknowledgements => ['A1', 'B1']}

And so on.
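The acknowledgement flow in the example can be sketched as a set that each device grows and then piggybacks on its next message. This is an illustration only. `ackSet`, `observe` and `payload` are hypothetical names. The sketch also assumes that a device is listening on T if we have seen a message from it, either directly or through a carried acknowledgement.

```go
package main

import (
	"fmt"
	"sort"
)

// ackSet holds the IDs of devices known to be listening on topic T.
type ackSet map[string]bool

// observe records a message: the sender is on T, and so is every device
// listed in the acknowledgements the message carried.
func (s ackSet) observe(sender string, carried []string) {
	s[sender] = true
	for _, id := range carried {
		s[id] = true
	}
}

// payload returns the sorted acknowledgement list to piggyback on our next message.
func (s ackSet) payload() []string {
	out := make([]string, 0, len(s))
	for id := range s {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	b1 := ackSet{}
	b1.observe("A1", nil) // B1 receives A1's message
	fromB1 := b1.payload()
	fmt.Println(fromB1) // [A1]

	b2 := ackSet{}
	b2.observe("B1", fromB1)  // B2 sees B1's acknowledgement on T
	fmt.Println(b2.payload()) // [A1 B1]
}
```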
Once all the devices are on the new topic, we can publish only on the new topic. We can also exploit the fact that we use a symmetric key to send a single message (the content will still be encrypted differently for each device, so as not to invalidate PFS).
If not every device has moved, we fall back to the shared topic for that user. For example, if A2 has not acknowledged T, a message to A will be sent on the shared topic, targeting A1 & A2, as it is now. The pairing message to B2, however, can be sent on the new topic, provided that B2 has acknowledged.
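The per-user fallback rule can be sketched as follows. `topicFor` and the topic strings are hypothetical, and `devices` is assumed to already exclude stale devices. When no devices are known at all, the sketch also falls back to the shared topic, which is today’s behaviour.

```go
package main

import "fmt"

// topicFor picks where to publish for one user's devices. It returns the
// negotiated topic only when at least one device is known and every known
// (non-stale) device has acknowledged it. Otherwise it returns the shared topic.
func topicFor(devices []string, acked map[string]bool, negotiated, shared string) string {
	if len(devices) == 0 {
		return shared // no device info: behave as today
	}
	for _, d := range devices {
		if !acked[d] {
			return shared // at least one device would be left behind
		}
	}
	return negotiated
}

func main() {
	// B1's view: A1, B1 and B2 have acknowledged T, A2 has not.
	acked := map[string]bool{"A1": true, "B1": true, "B2": true}
	fmt.Println(topicFor([]string{"A1", "A2"}, acked, "T", "shared")) // shared (message to A)
	fmt.Println(topicFor([]string{"B2"}, acked, "T", "shared"))       // T (pairing message to B2)
}
```

The same check covers a newly detected, unacknowledged device: it pushes that user back to the shared topic until the device acknowledges.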

If a new device joins in (say B3), there are two cases:

User B pairs the device and syncs the chat.
In that case B3 will generate the topic for the chat and fetch historical messages. It may then see some of the messages/acknowledgements and be brought up to date.
User B does not pair the device.
Eventually user A will detect the new device. From then on, messages will be sent on the shared topic, with acknowledgement information included, until B3 acknowledges on topic T. After that, only topic T will be used.

There are quite a few variations on this theme (having new devices ask for acknowledgements, different ways to calculate the topic, etc.), but the basic idea is:

We move to device-to-device communication
We have a deterministic secret topic based on the two users’ key pairs
We wait for all the devices to acknowledge the new channel until we fully move to the new topic
We have devices acknowledge automatically on the new topic to speed up the process, and we possibly carry that information on other messages in order for new devices to catch up.

Because moving to a new topic can take quite a long time (we need all the devices to acknowledge, or to be considered stale, which will happen after we haven’t seen that device for x days), we might also want to move to a partitioned topic.

The method is essentially the same. We version the bundles containing device information (https://github.com/status-im/status-go/blob/21f9f09586bb36cbe55f2895aeff820aa4ffd705/services/shhext/chat/encryption.proto#L11), which tells us which devices have moved to the new topic so we can target them accordingly. In this case we don’t need acknowledgements, as we already publish this information periodically.

Eventually this should decrease the bandwidth on the shared topic as more people upgrade, and give us a better way to handle compatibility issues.

It is worth mentioning, though, that this is in a way a breaking change. We are fundamentally changing the way we send messages, and we can only target devices from version 0.9.32. We can still avoid breaking compatibility when we don’t detect any device. So it will break if a user is running a mix of >= 0.9.32 and < 0.9.32, but not if they are running only < 0.9.32.

Let me know what you think; feedback is welcome. Also let me know if something needs to be described better (these are not specs, just a description of a possible solution).

roman · May 7, 2019, 1:30pm

Suppose A2 was created and paired with A1, then communicated with B1 and B2 (so that T is established). Then A2 was removed and a new instance A3 was created on the same device. How do we handle a paired instance that will never acknowledge a new topic?

it probably should be B2 instead of B3

cammellos · May 7, 2019, 1:44pm

Suppose A2 was created and paired with A1, then communicated with B1 and B2 (so that T is established). Then A2 was removed and a new instance A3 was created on the same device. How do we handle a paired instance that will never acknowledge a new topic?

In this case A2 has never acknowledged, correct?

If so, they will not be able to move to the new topic completely.
Messages from A → B will go on the new topic; if A2 is still paired, a pairing message will go on the shared topic.
Messages from B → A will go on the shared topic, until one of these conditions is true:

A2 is marked as stale. We haven’t yet pulled the trigger on how many days of inactivity make a device stale, but we can cap that at a reasonable number (e.g. 30 days). If there’s no activity from a device in 30 days, it will be considered stale and we won’t be sending messages to it. Mind that devices periodically advertise themselves, so this would mean that the device was not online for 30 days. If a device paired with A2 has any activity, we count it as A2 activity as well. Say you don’t unpair A2: to B1, activity from A1 will then count as activity of both A1 and A2.
Another device takes the place of A2. Currently we keep only 3 devices in sync, so if the user adds A4, A1 & A3 & A4 will be kept in sync, but not A2. This might, of course, never happen.

Not sure there’s much we can do in this case, as we won’t know for sure whether the device is gone or is just offline for a while.

it probably should be B2 instead of B3

Indeed it should have been. I initially had 3 devices for the example, but that was too ambitious so I chickened out, and didn’t cover my tracks very carefully.

roman · May 8, 2019, 5:21am

@cammellos what is our upgrade path in case we want to have more than 4096 topics later?

cammellos · May 8, 2019, 6:27am

So, say that we have pushed out a version which listens on the partitioned topic with 4096 partitions, H1(Pk), and we want to move to H2(Pk).

A new device (A) will therefore listen on 3 topics:

H(discovery-topic) -> current discovery topic (unversioned)
H1(Pk) -> 4096 topic (version 1)
H2(Pk) -> 8192 topic (version 2)

Its bundle (https://github.com/status-im/status-go/blob/21f9f09586bb36cbe55f2895aeff820aa4ffd705/services/shhext/chat/encryption.proto#L11) will have a version of 2.

B running version 2

B is a new device from a different user. When sending a message to A, it will first check for A’s bundle under A’s personal topic PH(A), as it does now:

No bundle found: B will send a message on H(discovery-topic) to any device, including its own bundle so A will know where to reply (topic negotiation as described above might occur)
A bundle is found: B will send a message on H2(Pk), since it can deduce from the bundle’s version that A is listening on that topic

B running version 1

Identical to the step above, but the message will be sent on H1(Pk). It will include version 1 (and possibly topic negotiation), so A will know where to reply.

Version

We can be explicit in the bundle about which topic we listen to, literally specifying the topic someone is listening on. Changing topics is then just a matter of changing that value. This is more flexible, as the value can be changed to anything at any point (even by the user). Or we can just work by convention and use version: 2, with the code knowing that version 2 listens on this particular topic. It’s up to us; I don’t have a strong opinion on the matter.
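A sketch of how B could pick the topic from A’s bundle, covering both options: an explicit topic in the bundle, or a version-to-topic convention. Everything here is hypothetical. The `bundle` struct is not the real `encryption.proto` message. SHA-256 modulo the partition count stands in for H1/H2. Topics are strings rather than 4-byte values. Taking the lower of the two versions reflects the "B running version 1" case.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// bundle is a hypothetical, trimmed-down view of a device bundle.
type bundle struct {
	Version       int    // protocol version advertised by the device
	ExplicitTopic string // optional: topic the device says it listens on
}

// partitionsByVersion is the convention option: version 1 means the
// 4096-way partitioned topic H1(Pk), version 2 the 8192-way H2(Pk).
var partitionsByVersion = map[int]uint32{1: 4096, 2: 8192}

// partitionedTopic maps a public key to one of n partitions (placeholder hash and format).
func partitionedTopic(pk []byte, n uint32) string {
	sum := sha256.Sum256(pk)
	return fmt.Sprintf("partition-%d/%d", binary.BigEndian.Uint32(sum[:4])%n, n)
}

// topicForPeer decides where B sends its message to A, given A's bundle
// (nil if none was found under PH(A)) and the highest version B runs.
func topicForPeer(b *bundle, ownVersion int, pk []byte) string {
	if b == nil {
		return "discovery" // unversioned H(discovery-topic)
	}
	if b.ExplicitTopic != "" {
		return b.ExplicitTopic // explicit option: the bundle names the topic
	}
	v := b.Version
	if ownVersion < v {
		v = ownVersion // both sides must understand the chosen version
	}
	if n, ok := partitionsByVersion[v]; ok {
		return partitionedTopic(pk, n)
	}
	return "discovery"
}

func main() {
	pkA := []byte("placeholder public key of A")
	a := &bundle{Version: 2}
	fmt.Println(topicForPeer(nil, 2, pkA)) // discovery
	fmt.Println(topicForPeer(a, 2, pkA))   // H2(Pk), 8192 partitions
	fmt.Println(topicForPeer(a, 1, pkA))   // H1(Pk), 4096 partitions
}
```

The explicit field makes the lookup trivial and lets the value change at any time. The convention table keeps bundles smaller, but every new scheme needs a code change.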

igor · May 8, 2019, 8:49am

Thanks for the writeup! It looks reasonable to me, at least it is definitely much better than not preserving backward compatibility.

Targeting the app starting from 0.9.32 is good; I don’t think we’ll have many installations older than that, especially in the mixed setup where the devices aren’t stale.

I’d wait until the protocol has moved to the status-go side, so we can simulate different scenarios more easily. Is it possible to use the current console client to run some simulations?

cammellos · May 8, 2019, 8:52am

I’d wait until the protocol has moved to the status-go side, so we can simulate different scenarios more easily. Is it possible to use the current console client to run some simulations?

Sure. The code would already be all in status-go, though, as it builds on the existing PFS code. Possibly we don’t even have to touch status-react for this (not so sure at the moment, but we might get away without it).

igor · May 8, 2019, 8:56am

Yeah, then I’d prepare a test harness for different scenarios, covering the happy case and some edge cases, so we can simulate them. Otherwise, I’m (as usual) concerned that we might take a hit on reliability there.

These are my thoughts.

Again, let’s wait for @oskarth and some more feedback before proceeding.

oskarth · May 8, 2019, 8:56am

This is awesome @cammellos, thanks for putting this together. In general I like it. A few questions.

User B does not pair the device.
Eventually user A will detect the new device. From then on, messages will be sent on the shared topic, with acknowledgement information included, until B3 acknowledges on topic T. After that, only topic T will be used.

How will user A detect the new device exactly?

Because moving to a new topic can take quite a long time (we need all the devices to acknowledge, or to be considered stale, which will happen after we haven’t seen that device for x days), we might also want to move to a partitioned topic.

I’m not sure how this interacts with the first proposal. Is this meant as an alternative or as a complement? How would it fare with compatibility?

It is worth mentioning though that this is a breaking change in a way, as we are fundamentally changing the way we currently send messages, and we can only target devices from version 0.9.32 (although we can avoid breaking compatibility if we don’t detect any device, so it will break if a user is running a mix of >= 0.9.32 and < 0.9.32, but not if it’s running only < 0.9.32).

I agree we should only target devices from version 0.9.32. To increase compatibility further, would it make sense to make the move opt-in? I.e. you’d get a popup saying “We want to move you to a more efficient transport that’ll save bandwidth by XX! However, this means you won’t be able to ~sync messages if you have an old unpaired device. Are you OK with this?”. If we phrase it right with UXR help, I think we’d get rid of 90%+ of the BW issues.

cammellos · May 8, 2019, 9:10am

How will user A detect the new device exactly?

Devices advertise themselves periodically (once a day) on their own topic H(pk). If you are in a conversation with B, you will listen for device updates from any of B’s devices, so there’s a good chance that A will eventually receive this information.

I’m not sure how this interacts with the first proposal. Is this meant as an alternative or as a complement? How would it fare with compatibility?

It’s up to us; they are compatible. I think eventually we want to have multiple discovery topics (possibly falling back on the old discovery topic for < 0.9.32 compatibility). We also want single-use topics for a particular 1-to-1 (i.e. we meet on discoveryN, but we jump on randomTopicB) to reduce bandwidth usage. I feel the partitioned topic alone would still leave the original discovery topic fairly heavy in terms of bandwidth.
Compatibility-wise it is identical: everything I mentioned here would apply to a partitioned topic. As you are targeting a specific device, you can make an informed decision on whether it is using the new partitioned topic or not.

I agree we should only target devices from version 0.9.32. To increase compatibility further, would it make sense to make the move opt-in? I.e. you’d get a popup saying “We want to move you to a more efficient transport that’ll save bandwidth by XX! However, this means you won’t be able to ~sync messages if you have an old unpaired device. Are you OK with this?”. If we phrase it right with UXR help, I think we’d get rid of 90%+ of the BW issues.

Yes, we can do that, once you opt in on one device there’s no way back, but otherwise it would work.

oskarth · May 8, 2019, 9:15am

once you opt in on one device there’s no way back, but otherwise it would work.

I think this is fine, because it’s part of being sovereign - if you own a keypair you also have ownership/responsibility for these types of decisions.

Let’s see what @decanus and @arnetheduck say, but tentatively I’d say we can go ahead with this general approach. I guess the next step(s) would go something like:

bring up the proposal at the next Core Dev call to give people a chance to veto it
write up more specifics for specs repo
figure out steps needed to get this working end to end and where we want to do it (console client, etc)

decanus · May 8, 2019, 3:21pm

So I like this approach. The one thing I am not a major fan of, though I currently cannot come up with a better solution, is the way acknowledgements are handled: it seems like a high-overhead solution.

cammellos · May 22, 2019, 7:47am

Update 2019/05/22

After a discussion in the dev meeting, we decided to go ahead with the proposed plan, the steps will be:

Add basic negotiation of topic for devices
Add partitioned topic
Add Acks
Add piggyback of Acks

Steps 3 and 4 will be done separately as it’s where we have more leverage, so getting 1 & 2 done first might be beneficial to refine those 2 steps and potentially come up with different strategies.

We can go live with this for pairing & group chats first as it’s a non-breaking change, which will give us some feedback and help iron out potential issues, and eventually bring to one to one chats.

As part of this effort I will also attempt to move some code into status-go (management of Whisper filters and topics), which should simplify the status-react code.

Progress so far: step 1 is implemented in status-go (https://github.com/status-im/status-go/pull/1466, still WIP). Locally I have status-react using the negotiated topic, but currently it’s not fetching messages from the mailserver, so I’m working on that.

ricardo3 · May 23, 2019, 12:07am

cammellos · May 23, 2019, 3:50pm

Update 2019/05/23

I managed to move all the code for loading filters into status-go and plugged it into status-react. status-react now makes a single call with a list of chats [{:chat-id "id" :one-to-one true :identity}, {:chat-id "status"}]. It gets back a list of Whisper filter ids, including the negotiated topic with other devices and the partitioned topic for those accounts.
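To illustrate the shape of the single call described above (chat specs in, filter ids out), here is a hypothetical Go sketch. `chatSpec`, `filter`, `loadFilters` and the `topicsFor` hook are illustrative names, not the actual status-go API, and the filter ids are placeholders.

```go
package main

import "fmt"

// chatSpec is a hypothetical Go mirror of the specs sent from status-react,
// e.g. {:chat-id "status"} or {:chat-id "id" :one-to-one true ...}.
type chatSpec struct {
	ChatID   string
	OneToOne bool
}

// filter is a hypothetical record of one Whisper filter installed for a chat.
type filter struct {
	ID     string
	ChatID string
	Topic  string
}

// loadFilters installs one filter per topic each chat needs and returns them
// all, so the caller gets every filter id back from a single call. topicsFor
// is a hypothetical hook: a public chat yields one topic, while a one-to-one
// chat may yield the negotiated topic plus the partitioned topic.
func loadFilters(specs []chatSpec, topicsFor func(chatSpec) []string) []filter {
	var out []filter
	for _, s := range specs {
		for _, topic := range topicsFor(s) {
			out = append(out, filter{
				ID:     fmt.Sprintf("filter-%d", len(out)), // placeholder ids
				ChatID: s.ChatID,
				Topic:  topic,
			})
		}
	}
	return out
}

func main() {
	topicsFor := func(s chatSpec) []string {
		if s.OneToOne {
			return []string{"negotiated-" + s.ChatID, "partitioned-" + s.ChatID}
		}
		return []string{"public-" + s.ChatID}
	}
	specs := []chatSpec{{ChatID: "id", OneToOne: true}, {ChatID: "status"}}
	for _, f := range loadFilters(specs, topicsFor) {
		fmt.Println(f.ID, f.ChatID, f.Topic)
	}
}
```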

Next step is to address the mailserver issue to make sure all the topics are fetched, remove the obsolete code, and then it would be time to polish up the PR, but there is still quite a lot of work to do.

@ricardo3 thanks for the post. We won’t be ratcheting over topics for now, but this would be a stepping stone for something along those lines, if we think it is necessary.

yenda · May 23, 2019, 10:08pm

@cammellos does it mean point 2 of Status.app is done? That will make point 1 much easier, because the filter code was very scary

cammellos · May 24, 2019, 7:02am

@yenda partially:

first we will move chats from status-react realm db to status-go sql db

This won’t be done in the scope of this PR, as it’s probably the biggest change, and will require some thinking

this will allow filter and key management on status-go side

This will be done

with these changes status-go will start the whisper filters and create the symmetric keys for the chats when user logs in

This will be done

this will remove polling loops for whisper filters as it will be replaced by signal based pushes for new messages as implemented in 1.

I haven’t removed the polling loop, we still create a web3 filter, but we make a single request to status-go:

Get chat ids & type on status-react
Call status-go with the chat specs
Status-go creates a bunch of whisper filters and returns a list of filter ids to status-react
Status-react creates a bunch of web3 polling filters using those filter ids (this is not an asynchronous step, as no request is made to status-go)

So effectively it is just an fx ::load-filters [chat-specs], and then in the callback we create the filters and store them in the re-frame db.

Changing to signals does not seem to be a big change, though, and should be fairly easy to implement. I’m not sure I will do it in the scope of the first PR, but I can take a look at it; worst case, it will go in a follow-up PR.

an api will allow the user to make chats queries with pagination

This won’t be done as messages will be still stored in realm for now.

Overall I think changing to signals is a small change once the filters are on status-go, I will definitely play with it, but not sure I will include it in the first PR, as it’s quite a change already and with less moving parts it will be easier to identify bugs etc.

yenda · May 24, 2019, 12:04pm

Yes, regarding changing to signals, I can have a look once this PR is implemented, as I’ve already done it for the wallet. This is the last remnant of web3 at this point.

cammellos · June 5, 2019, 1:40pm

Update 2019/06/05

The status-go code has been written and is under review, and I’ve started working on the status-react PR. Quite a few changes are needed there, because one-to-one and group chats now have multiple topics, so gaps and mailserver requests need to reflect that.