Post-Mortem: Breaking Compatibility due to Partitioned Topic
Date: 22-03-2019
Authors: igorm
Status: Complete, action items in progress
Summary
We have to break compatibility of 1-1 chats and private group chats with
version 0.9.33 and earlier.
Impact
Users on version 0.9.33 and earlier won’t be able to communicate with
users on version 0.11.0 and higher.
Root Causes
Bad architectural decision to use a single Whisper topic for all 1-1 and group
chats of all users of Status. The decision is deeply rooted in the
architecture, and it is extremely hard to change without breaking compatibility.
What is exactly bad about using a single Whisper topic?
Every user downloads private messages from everyone in the network (because they are all mixed into the same topic).
There is no security or privacy issue, because everything is asymmetrically encrypted.
But it leads to a significant waste of traffic.
(Currently one day of data is ~100 MB; 10x the users would mean ~1000 MB of traffic per day.)
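A back-of-the-envelope sketch of that scaling problem, using the figures from the estimate above (the linear growth model is an assumption, not a measurement):

```javascript
// With a single shared topic, each client downloads every private
// message in the network, so per-user traffic grows roughly linearly
// with total message volume, and thus with the number of users.
const currentDailyMB = 100; // rough per-user download today (from the text)
const userGrowth = 10;      // hypothetical 10x growth in users

// Linear model: 10x the users means 10x the messages every client must fetch.
const projectedDailyMB = currentDailyMB * userGrowth;
console.log(projectedDailyMB); // 1000 MB per user per day
```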
Trigger
Users had their data plans wasted during popular events (ETHDenver, ETHBerlin).
Resolution
Bite the bullet and play our “beta” card to break compatibility.
Detection
Complaints from users about wasted data.
Action Items
Action Item                            | Type     | Owner | Bug | Status
Discuss the approaches to this problem | mitigate | -     | n/a | DONE
Document the protocol                  | prevent  | adamb |     | IN PROGRESS
Audit the protocol implementation      | prevent  | igorm |     | TODO
Lessons Learned
What went well
First releasing a version that supports partitioned private-chat topics, and
then releasing a version that actually sends data there (breaking
compatibility).
What went wrong
We don’t have great documentation of our protocol, and we didn’t audit it
from a technical perspective.
Where we got lucky
We are still in beta, so traffic didn’t grow too much (1 GB per user per day
would be very bad, but 100 MB in the modern world isn’t too big of a deal).
At least on iOS we have the ability to “expire” incompatible builds, so we are
sure that no iOS user is using them.
Timeline
Date       | Description
2018-09-07 | ETHBerlin: complaints about wasting traffic on limited data plans
Also, it’s probably worth mentioning why we used a single topic in the first place. I’m not sure, as I wasn’t involved in the choice, but it offers the best darkness (it is also the simplest approach, so I’m not sure what drove the decision).
I would also express some concerns that the current idea for the partitioned topic is not well architected and could have been better communicated. In particular:
It is not completely language agnostic; I was told it uses the Mersenne Twister pseudorandom number generator, but it’s hard to find good implementations that are compatible across languages. Most importantly, I don’t see why a pseudorandom number generator must be used here at all, because we need just one number for a given seed (unless I missed something). In my opinion, a better and much simpler approach is to take, e.g., the X coordinate of the public key and compute X mod N, where N is the number of topic partitions.
I haven’t found any discussion on discuss.status.im with a proposal first (searched for Status.app). If our protocol were published and the community maintained another client, they would have been unable to prepare a fix before the release.
I played a bit with the node.js bignumber.js library, and it can produce the same result as Go using the X mod N approach. All it requires is some string/byte-array manipulation.
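For illustration, here is a minimal sketch of the proposed X mod N scheme using JavaScript’s native BigInt (so no bignumber.js is needed). The partition count and the key value are made-up examples, not the shipped parameters:

```javascript
// Sketch of the proposed "X mod N" partitioning (not the shipped
// Mersenne Twister scheme). NUM_PARTITIONS is a hypothetical value.
const NUM_PARTITIONS = 5000n;

function topicPartition(pubKeyXHex) {
  // Native BigInt handles a 256-bit coordinate without any external
  // library, and the result is trivially reproducible in any language.
  const x = BigInt(pubKeyXHex);
  return x % NUM_PARTITIONS;
}

// Illustrative value, not a real public key coordinate:
console.log(topicPartition("0xabcabcabc")); // 3124n
```

Since the whole computation is just hex parsing and one modulo, any language with big-integer support produces the same partition for the same key.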
Thanks for the write up Igor, this is a good start. Some comments:
Bad architectural decision to use a single Whisper topic for all 1-1 and group chats of all users of Status. The decision is very root and it is extremely hard to change it without breaking compatibility.
I’m not sure it is accurate to say this is a root cause, there’s probably more digging to be done here.
The main things I’d like to see explored in more detail are:
Process: We have agreed multiple times that we want to maintain compatibility with previous versions, and why we think this is important. Given that, regardless of the actual decision to break compatibility or not, it seems prudent to ensure buy-in and consensus from the rest of Status before making such a change. How come this wasn’t made a bigger deal and brought up as a decision point, e.g. on Core dev calls, where alternatives and consequences could be proposed? Right now it seems too much like the flippant attitude we had in the past, which makes me personally uneasy. These are the kinds of questions I’d like to see answered in a post-mortem. Perhaps this is all wrong and everyone was aware and in consensus on breaking compatibility, but that’s not my impression. If so, there ought to be a clear decision record of it.
Specific issue: I’m not convinced we did everything we could to maintain compatibility here, and other alternatives don’t seem to have been explored much. For example, the timeline mentions events that were a long time ago; this wasn’t an emergency, and we had time for a more graceful upgrade. Off the top of my head: read/write on both the partition and discovery topics, then keep track of peers that respect the partition topic and gracefully stop using the discovery topic (edit: not sure if this would work naively, since public key != device, and devices can be on different versions, oh well). This way the discovery-topic remnants would be equivalent to someone spamming on it, which would make it part of general DoS protection. I’m not convinced that this, or similar alternatives, were considered in depth.
The fact that it was even possible to break compatibility is another aspect that hasn’t been discussed much either; this type of change should require changes at the spec level and strong consensus. That should ideally be true regardless of the specific cultural process and specific issue.
Future compatibility: it’d be nice to see more detail on how the stated root cause is addressed, i.e. how partitioned topics will grow as a scaling solution, and generally to clarify the difference between interface and implementation (i.e. is it an optimization for darkness/bandwidth, or core?).
Do you think these links (a) make it clear it is a breaking change (b) had any kind of larger consensus in Status, vs local decision making within a swarm (which is great for 99% of cases)?
We don’t have a protocol spec, and no one has audited the protocol, which is noted in the post-mortem’s action items.
This seems fine from a technical perspective; it might be useful to make that clear. The cultural aspect is more important, though. I’d also caution against relying too heavily on future work (Big Protocol Spec with Security Audit) versus just a plaintext “roughly like this” separate file that captures our knowledge of what compatibility means. We already have a few clients (well, ish), so that would work as a litmus test, at least as context for discussions and decisions.
Nope, I agree that it should be worded differently; poor job on my part on this point.
Just to make it clear: We aren’t plainly breaking compatibility, people who are on 0.10.0 and 0.10.1 aren’t affected by that, their app will work just fine.
Yeah, I 100% agree. The problem is that we don’t have even that plaintext thing. We have something related to what the Clojure-based message format means, but we don’t have any documentation for anything across the stack. And all we do (as @adamb pointed out) is very JS-specific, maybe because of that.
UPD: I’ll do an iteration on this post-mortem based on this discussion, thanks for the feedback.
We had discussions about how to keep compatibility; these were the possible solutions proposed:
Wait until users of an incompatible version deplete (this is just delaying breaking compatibility until it has low impact)
Upgrade to the partitioned topic only once a positive acknowledgment is received that the other end has a compatible version (this can be carried in contact requests / messages). This approach is currently limited by the fact that multiple devices might be running different versions (e.g. on desktop most users still use the released version of the app).
Protocol negotiation does not work well in our case given the asynchronous, mostly offline nature of communications, not to mention that communications are really multicast (keyed on the public key) rather than device-to-device.
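The acknowledgment-based alternative above could be sketched roughly like this; all names and the version format are illustrative, and the multi-device limitation is exactly where it bites:

```javascript
// Sketch of "upgrade on acknowledgment": keep using the old shared topic
// until every known device of a contact has acked a compatible version.
const peerVersions = new Map(); // pubKey -> Map(deviceId -> version string)

function versionAtLeast(v, min) {
  const a = v.split(".").map(Number);
  const b = min.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    if ((a[i] || 0) > (b[i] || 0)) return true;
    if ((a[i] || 0) < (b[i] || 0)) return false;
  }
  return true; // equal versions are compatible
}

function recordAck(pubKey, deviceId, version) {
  if (!peerVersions.has(pubKey)) peerVersions.set(pubKey, new Map());
  peerVersions.get(pubKey).set(deviceId, version);
}

function canUsePartitionedTopic(pubKey, minVersion) {
  const devices = peerVersions.get(pubKey);
  if (!devices || devices.size === 0) return false; // no ack yet
  // One public key can map to several devices on different versions,
  // so ALL of them must have acked a compatible version.
  return [...devices.values()].every(v => versionAtLeast(v, minVersion));
}
```

A single outdated desktop client is enough to keep the whole contact pinned to the old topic, which illustrates the limitation described above.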
As a matter of fact, we haven’t released this version, so we are still in time to backtrack (it’s just a flag in the config), and take a deeper look.
In general, I think we should also note that the first PR received little to no review after a week of bothering people, and the discuss post had only two contributors, although it was mentioned in the dev meetings and in the status and core-chat stand-ups.
When we did have a discussion, most people didn’t go further than saying “don’t do it” or providing not-so-relevant examples (e.g. synchronous online apps without multi-device support).
None of these help us move towards a solution; they are worth mentioning, but more constructive engagement would be appreciated.
If we want to grow as a team, we should also start holding people accountable for lack of engagement at critical moments (e.g. the discussions about the partitioned topic and compatibility, which in my opinion were fairly visible), instead of waiting for someone to act and then criticizing the post-mortem (which is not really a post-mortem, as we haven’t quite pulled the trigger yet, so come forward!).
In general, I think we should also note that the first PR received little to no reviews after a week of bothering people, and the discuss post had only two contributors, although that was mentioned in the dev meetings, on status and core-chat stands up.
I don’t think a PR with a specific implementation is the place to discuss a particular protocol change. I know we have been doing it this way forever, but it’s high time we changed this process.
It’s true that it was mentioned a few times on various calls, but for this kind of matter I think the best and only available place we have is discuss.status.im.
If we want to grow as a team, we should also start making people accountable for the lack of engagement in critical moments (i.e. discussion about partition topic/compatibility, which in my opinion were fairly visible), instead of waiting for someone to act and criticize post-mortem (which is not a post-mortem, as we haven’t quite pulled the trigger yet, so come forward!)
The high traffic due to the single topic is a serious issue and will only get worse as time goes by and more users come along. We should fix it as soon as possible, because the impact only grows the longer we wait.
The reason we are still in beta is precisely that we still have major issues like these to hammer out, which is why we should take advantage of it while we can; breaking compatibility is part of that. We can mitigate by waiting a bit until people upgrade, but eventually we have to get this done. (I guess this raises again one of our main issues in debugging user problems: lack of information on what is happening on our clients. Understandable due to privacy concerns, but maybe we could at least collect the app versions being used. But that’s a separate discussion.)
Lack of documentation is definitely an issue, but considering how much we are still changing things, that is understandable. A breaking change is a great point at which to write new docs.
Though I don’t get your comment about “JS-specific” things; are you referring to the same thing as @adam in:
That does make alternative implementations difficult.
Either we make an example client library the source of truth and documentation of the protocol, or we make a document the source of truth for the current state of how things work. As a dev I’d rather have a heavily commented library that explains the protocol and can serve as an example to re-implement in other languages than a doc, but that’s my bias.
Oh, so cool to see this discussion is more active and constructive than any previous discussions around this topic. Kudos guys!
Yep, that was my idea (and it looks like many people support it): instead of doing the obviously very hard thing to keep compatibility here, we just play the “we are in beta” card and softly break compatibility with old enough versions.
The major problem that I see, and that we need to fix, is that no one has looked at our current protocol holistically (transport layer, presentation layer, abstractions), and no one has done any risk analysis of how it will scale in the future. We should do that, so we are as aware as we can be of risks like this.
I do refer to exactly that RNG, yes. Though it uses a standard algorithm, it is a bit tricky to feed it the same seed that JS does.
So in JS it is something like this:

```javascript
var seed = generateNumericHash('0xabcabcabc....'); // this is a hell of JS-specific code
var rng = new MersenneTwister(seed);               // this is standard
var number = rng.getDouble();                      // this is standard
```
So in, say, Go (or C++), you have to do something like this:

```go
seed := emulateJSMathToGetNumericHash("0xabcabcabc...") // you don't want to see the implementation, believe me
rng := MersenneTwister{seed: seed}
number := rng.getDouble()
```
Can we verify that we can actually drop this pseudorandom bullshit without actually reducing security or teenage edginess(also known as “darkness”)? If so, we should definitely do so.
Can we verify that we can actually drop this pseudorandom bullshit without actually reducing security or teenage edginess(also known as “darkness”)? If so, we should definitely do so.
that’s funny
We could drop it, and I think it’s worth considering, as @adam’s method is much better; I wish I had thought of it before. The issue is that it would delay going live (essentially 0.10.1 would no longer be compatible, 0.11.0 would still use the shared topic, and we’d break compatibility in a release after that).
I don’t have a strong opinion myself, but definitely an option on the table.
It’s not more future-proof, as it only differs in how the topic is calculated, but it is definitely better, simpler, and much easier to implement in other languages.
I think it’s worth it; we definitely want to include the change in 0.11.0 though, so that we could potentially go live in >= 0.12. It should be a simple change (hopefully; migrating nightlies might be a bit more complex).
In my opinion, a better and much simpler approach is to get e.g. X point from the public key and do X mod N where N is a number of topic partitions.
Beyond the obscurity and the difficulty of reimplementation, I tend to agree the Mersenne Twister doesn’t add much. For this reason, I believe we may find it hard to get others to adopt it as some sort of community UX-friendliness standard.
If I were making up a new scheme, I’d probably (securely) hash the public key before using it for fingerprinting, then use the bits from that hash directly in the lookup, as is usually done. I’d maybe try to make the number of bits per word match some specific part of the fingerprint so they can be translated back and forth, for better verification, for those who desire to do so.
See also Status.app for the difficulty of choosing somebody else’s three-word name. Long story short: how tedious it is to reproduce the twister is the biggest obstacle.