Fixing Whisper for great profit

oskarth · October 17, 2019, 8:01am

tldr: Whisper currently can’t scale. This post shows why, and how to fix it.

Background

We have very few users for the Status app. Despite this, we have issues with bandwidth usage. One of the most common complaints I hear about Status, and the reason core contributors often don’t use it at events for group coordination, is that it consumes too much bandwidth. People often have a limited data plan, and especially at big events we’ve seen community members have their whole data plan drained just by using Status.

For more precise user reports and some rough numbers, see e.g.:

We have made some improvements in this regard, both in the past and for the v1 release, most recently by moving to a partitioned topic instead of a single discovery topic. There have also been improvements to mailserver performance.

Still, this isn’t enough. At a fundamental level, the confidence that Whisper will scale to any reasonable level is very low, and for good reasons. However, this is more of a rough intuition, and we haven’t done any real studies on this or how to fix it. Right now it’s more like a pebble in our shoe that we keep walking around with, hoping it’ll go away.

There are a few reasons we haven’t made progress on making Whisper more scalable:

Lack of adoption. Few users means the problem hasn’t hit us in any serious way, and the “scalability” issues we’ve solved have mostly been relevant for ~100-1k users. The issues we have seen have not been taken seriously enough, because people don’t depend on Status to function.
Church of Darkness. One of our core principles is privacy, and this, coupled with a lack of rigorous understanding of the protocols we use and their properties, has led us to put an irrationally high premium on the metadata protection capabilities that Whisper provides.
Timeline expectations. There are longer-term plans for replacing Whisper. This is the work that is happening with Vac and together with entities like Block.Science, Swarm and Nym. This means we’ve historically not seen fixing Whisper ourselves as a big priority in the short to medium term.

Going forward

With v1 of the app soon being out of the door (amazing job everyone!), we are going to start pushing for more adoption. For people to use Status, we need reasonable performance, on par with alternative solutions.

On metadata protection and a reality check

Considering the financial constraints, we need to push for traction and make Status a joy to use sooner rather than later. This means we can’t have people burn up their data plan and uninstall the app. Later on, we can enhance it with more rigorous guarantees around things like metadata protection, for example through mixnets such as the one Nym is working on.

As an end user, most people care more about being able to use the thing at all than theoretical (and somewhat unrigorous) metadata protection guarantees. Additionally, the proposed solutions will still enable hardcore users to get stronger receiver-anonymity guarantees if they so wish.

It is also worth pointing out that, unlike apps like Signal, we don’t tie users to their identity by a phone number or email address. This is already huge when it comes to privacy. Other apps like Briar also outsource the metadata protection to running on Tor. Now, this comes with issues regarding spam resistance, but that’s a topic for another time.

In terms of the Anonymity Trilemma, we are likely in a suboptimal spot.

Why try to fix Whisper if we are going to replace it?

Technically, this argument is correct. However, reality disagrees. If we are going to start pushing user acquisition, we need to retain users. This needs to happen soon, on the order of a few weeks, not several months coupled with more uncertainty and compatibility issues.

It doesn’t make sense to replace Whisper with a semi-half assed medium term thing if it’ll take months to get in production, and then replace that thing with a generalized, scalable, decentralized, incentivized network.

Theoretical model

Caveats

First, some caveats: this model likely contains bugs, has wrong assumptions, or completely misses certain dimensions. However, it acts as a form of existence proof for unscalability, with clear reasons.

If certain assumptions are wrong, then we can challenge them and reason about them in isolation. It doesn’t mean things will definitely work as the model predicts, or that there aren’t unknown unknowns.

The model also only deals with receiving bandwidth for end nodes, uses mostly static assumptions of averages, and doesn’t deal with spam resistance, privacy guarantees, accounting, intermediate node or network wide failures.

On the model and its goals

The theoretical model for Whisper attempts to encode its characteristics.

Goals:

Ensure network scales by being user or usage bound, as opposed to bandwidth growing in proportion to network size.
Stay within a reasonable bandwidth limit for limited data plans.
Do the above without materially impacting existing nodes.

It proceeds through various cases with clear assumptions behind them, starting from the most naive assumptions. It prints results for 100 users, 10k users and 1m users.

The colorized report assumes <10MB/day (300MB/month) is good, <30MB/day (1GB/month) is ok, <100MB/day (3GB/month) is bad, and anything above that is a complete failure. See bandwidth usage too high for comparative numbers with other apps.
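The bands above can be written as a small classifier. This is a minimal sketch, not the code behind the colorized report: the function name `rate` is made up for illustration, and it assumes binary megabytes (1MB = 1024 * 1024 bytes).

```python
MB = 1024 * 1024  # assumption: binary megabytes


def rate(bytes_per_day: float) -> str:
    """Map daily receiving bandwidth to one of the four bands."""
    if bytes_per_day < 10 * MB:
        return "good"
    if bytes_per_day < 30 * MB:
        return "ok"
    if bytes_per_day < 100 * MB:
        return "bad"
    return "failure"


# A few sample daily volumes, in MB/day.
for mb_per_day in (1, 20, 98, 1000):
    print(f"{mb_per_day:>5} MB/day -> {rate(mb_per_day * MB)}")
```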

Results

A colorized report can be found here and source code is here.

The colorized report is easier to scan, but for completeness the report is also embedded below.

Whisper theoretical model. Attempts to encode characteristics of it.

Goals:
1. Ensure network scales by being user or usage bound, as opposed to bandwidth growing in proportion to network size.
2. Staying with in a reasonable bandwidth limit for limited data plans.
3. Do the above without materially impacting existing nodes.

Case 1. Only receiving messages meant for you [naive case]

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A4. Only receiving messages meant for you.

For 100 users, receiving bandwidth is 1000.0KB/day
For 10k users, receiving bandwidth is 1000.0KB/day
For  1m users, receiving bandwidth is 1000.0KB/day

------------------------------------------------------------
Case 2. Receiving messages for everyone [naive case]

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A5. Received messages for everyone.

For 100 users, receiving bandwidth is   97.7MB/day
For 10k users, receiving bandwidth is    9.5GB/day
For  1m users, receiving bandwidth is  953.7GB/day

------------------------------------------------------------
Case 3. All private messages go over one discovery topic

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A8. All private messages are received by everyone (same topic) (static).

For 100 users, receiving bandwidth is   49.3MB/day
For 10k users, receiving bandwidth is    4.8GB/day
For  1m users, receiving bandwidth is  476.8GB/day

------------------------------------------------------------
Case 4. All private messages are partitioned into shards [naive case]

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A9. Private messages are partitioned evenly across partition shards (static), n=5000

For 100 users, receiving bandwidth is 1000.0KB/day
For 10k users, receiving bandwidth is    1.5MB/day
For  1m users, receiving bandwidth is   98.1MB/day

------------------------------------------------------------
Case 5. Case 4 + All messages are passed through bloom filter with false positive rate

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A9. Private messages are partitioned evenly across partition shards (static), n=5000
- A10. Bloom filter size (m) (static): 512
- A11. Bloom filter hash functions (k) (static): 3
- A12. Bloom filter elements, i.e. topics, (n) (static): 100
- A13. Bloom filter assuming optimal k choice (sensitive to m, n).
- A14. Bloom filter false positive proportion of full traffic, p=0.1

For 100 users, receiving bandwidth is   10.7MB/day
For 10k users, receiving bandwidth is  978.0MB/day
For  1m users, receiving bandwidth is   95.5GB/day

NOTE: Traffic extremely sensitive to bloom false positives
This completely dominates network traffic at scale.
With p=1% we get 10k users ~100MB/day and 1m users ~10GB/day.
------------------------------------------------------------
Case 6. Case 5 + Benign duplicate receives

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A9. Private messages are partitioned evenly across partition shards (static), n=5000
- A10. Bloom filter size (m) (static): 512
- A11. Bloom filter hash functions (k) (static): 3
- A12. Bloom filter elements, i.e. topics, (n) (static): 100
- A13. Bloom filter assuming optimal k choice (sensitive to m, n).
- A14. Bloom filter false positive proportion of full traffic, p=0.1
- A15. Benign duplicate receives factor (static): 2
- A16. No bad envelopes, bad PoW, expired, etc (static).

For 100 users, receiving bandwidth is   21.5MB/day
For 10k users, receiving bandwidth is    1.9GB/day
For  1m users, receiving bandwidth is  190.9GB/day

------------------------------------------------------------
Case 7. Case 6 + Mailserver case under good conditions with smaller bloom false positive and mostly offline

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A9. Private messages are partitioned evenly across partition shards (static), n=5000
- A10. Bloom filter size (m) (static): 512
- A11. Bloom filter hash functions (k) (static): 3
- A12. Bloom filter elements, i.e. topics, (n) (static): 100
- A13. Bloom filter assuming optimal k choice (sensitive to m, n).
- A14. Bloom filter false positive proportion of full traffic, p=0.1
- A15. Benign duplicate receives factor (static): 2
- A16. No bad envelopes, bad PoW, expired, etc (static).
- A17. User is offline p% of the time (static) p=0.9
- A18. No bad requests, duplicate messages for mailservers, and overlap/retries are perfect (static).
- A19. Mailserver requests can change false positive rate to be p=0.01

For 100 users, receiving bandwidth is    3.9MB/day
For 10k users, receiving bandwidth is  284.8MB/day
For  1m users, receiving bandwidth is   27.8GB/day

------------------------------------------------------------
Case 8. Waku mode - no metadata protection with bloom filter and one node connected; still static shard

Next step up is to either only use contact code, or shard more aggressively.
Note that this requires change of other nodes behavior, not just local node.

Assumptions:
- A1. Envelope size (static): 1024 bytes
- A2. Envelopes / message (static): 10
- A3. Received messages / day (static): 100
- A6. Proportion of private messages (static): 0.5
- A7. Public messages only received by relevant recipients (static).
- A9. Private messages are partitioned evenly across partition shards (static), n=5000

For 100 users, receiving bandwidth is 1000.0KB/day
For 10k users, receiving bandwidth is    1.5MB/day
For  1m users, receiving bandwidth is   98.1MB/day

------------------------------------------------------------
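
The arithmetic behind the report above can be reconstructed from its printed figures. The sketch below is that reconstruction, not the linked source code: the formulas are inferred from the numbers, the function and constant names are made up, and it takes the envelope size to be 1024 bytes because that is the value that reproduces the printed totals.

```python
# Constants follow the report's assumption labels (A1, A2, ...).
ENVELOPE_SIZE = 1024               # bytes (A1), inferred from the totals
ENVELOPES_PER_MESSAGE = 10         # A2
MESSAGES_PER_DAY = 100             # A3
PRIVATE_PROPORTION = 0.5           # A6
N_PARTITIONS = 5000                # A9
BLOOM_FALSE_POSITIVE = 0.1         # A14
DUPLICATE_FACTOR = 2               # A15
OFFLINE_PROPORTION = 0.9           # A17
MAILSERVER_FALSE_POSITIVE = 0.01   # A19

# Bytes per day of traffic addressed to a single user.
BASE = ENVELOPE_SIZE * ENVELOPES_PER_MESSAGE * MESSAGES_PER_DAY


def case1(n):  # only your own messages
    return BASE


def case2(n):  # everyone's messages
    return BASE * n


def case3(n):  # private traffic shared on one discovery topic
    return BASE * (1 - PRIVATE_PROPORTION) + BASE * PRIVATE_PROPORTION * n


def case4(n):  # private traffic spread evenly over the partition shards
    load = max(1, n / N_PARTITIONS)
    return BASE * (1 - PRIVATE_PROPORTION) + BASE * PRIVATE_PROPORTION * load


def case5(n):  # plus bloom false positives as a share of all traffic
    return case4(n) + BLOOM_FALSE_POSITIVE * case2(n)


def case6(n):  # plus benign duplicate receives
    return DUPLICATE_FACTOR * case5(n)


def case7(n):  # mostly offline, fetching from a mailserver with a lower p
    online = (1 - OFFLINE_PROPORTION) * case6(n)
    offline = OFFLINE_PROPORTION * (case4(n) + MAILSERVER_FALSE_POSITIVE * case2(n))
    return online + offline


def case8(n):  # Waku mode: no bloom filter, still static shards
    return case4(n)


def human(num_bytes):
    """Format bytes the way the report does (binary units, one decimal)."""
    for unit, size in (("GB", 1024 ** 3), ("MB", 1024 ** 2)):
        if num_bytes >= size:
            return f"{num_bytes / size:.1f}{unit}"
    return f"{num_bytes / 1024:.1f}KB"


for name, case in [("Case 5", case5), ("Case 7", case7), ("Case 8", case8)]:
    for users in (100, 10_000, 1_000_000):
        print(f"{name}, {users:>9} users: {human(case(users))}/day")
```

Written this way, the only terms that grow with network size are the shard load in case 4 and the false-positive share of total traffic in cases 5 to 7, which is why the false-positive rate dominates at scale.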

Takeaways

Whisper as it currently works doesn’t scale, and we quickly run into unacceptable bandwidth usage.
There are a few factors behind this, but it largely boils down to noisy topic usage and the use of bloom filters.
- Duplicate (e.g. see Whisper vs PSS) and bad envelopes are also fundamental factors, but this depends a bit more on specific deployment configurations.
Waku mode (case 8) is a proposed solution that doesn’t require other nodes to change, and extends capabilities for nodes that put a premium on performance. Essentially it’s a form of Infura for chat.
The next bottleneck after this is the partitioned topics, which either need to grow gracefully (and potentially quickly), or an alternative way of consuming those messages needs to be devised.
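
For a sense of where a false-positive rate near the model's p=0.1 comes from, the standard approximation for an idealized Bloom filter can be evaluated with the model's own parameters (m=512 bits, k=3 hash functions, n=100 topics). This is a rough sketch under that textbook formula; how Whisper actually maps topics to bits may behave differently, and the function names are made up.

```python
import math


def bloom_false_positive(m_bits: int, k_hashes: int, n_items: int) -> float:
    """Textbook approximation: p = (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_k(m_bits: int, n_items: int) -> float:
    """Number of hash functions that minimizes p for given m and n."""
    return (m_bits / n_items) * math.log(2)


# How the rate moves as a node subscribes to more topics (m and k fixed).
for n_topics in (10, 50, 100, 200):
    p = bloom_false_positive(512, 3, n_topics)
    print(f"n={n_topics:>3} topics -> p ~ {p:.3f}")

print(f"optimal k for m=512, n=100: {optimal_k(512, 100):.2f}")
```

With m and k fixed, the rate climbs quickly as n grows, and in the model every point of p is a share of the whole network's traffic.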

Waku mode

Doesn’t impact existing clients, it’s just a separate node and capability. A bit like Infura for chat.
Other nodes can still use Whisper as is, like a full node.
Sacrifices metadata protection and incurs higher connectivity/availability requirements for scalability

Requirements:

Exposes API to get messages for a list of topics (no bloom filter)
Way of being identified as a Waku node (e.g. through version string)
Option to statically encode this node in app, e.g. similar to custom bootnodes/mailserver
Only node that needs to be connected to, possibly as Whisper relay / mailserver hybrid
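
The first requirement above, fetching by an explicit topic list rather than a bloom filter, can be pictured as an exact-match lookup. This is a minimal sketch, not the Waku spec: `Envelope`, `TopicIndex` and `messages_for` are hypothetical names, and an in-memory dict stands in for whatever storage a real node would use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    topic: bytes    # 4-byte Whisper topic
    payload: bytes  # encrypted payload, opaque to the node


class TopicIndex:
    """Hypothetical store that serves envelopes by exact topic match."""

    def __init__(self):
        self._by_topic = defaultdict(list)

    def add(self, envelope: Envelope) -> None:
        self._by_topic[envelope.topic].append(envelope)

    def messages_for(self, topics):
        """Return envelopes for the requested topics only, so no false positives."""
        wanted = dict.fromkeys(topics)  # de-duplicate, keep request order
        return [e for t in wanted for e in self._by_topic.get(t, [])]


index = TopicIndex()
index.add(Envelope(topic=b"\x01\x02\x03\x04", payload=b"hello"))
index.add(Envelope(topic=b"\xaa\xbb\xcc\xdd", payload=b"other chat"))

for envelope in index.messages_for([b"\x01\x02\x03\x04"]):
    print(envelope.topic.hex(), envelope.payload)
```

The tradeoff is the one stated above: the node learns exactly which topics a client cares about, in exchange for the client receiving nothing it did not ask for.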

Provides:

likely provides scalability of up to 10k users and beyond
with some enhancements to partition topic logic, can possibly scale up to 1m users

Caveats:

hasn’t been tested in a large-scale simulation
other network and intermediate node bottlenecks might become apparent (e.g. full bloom filter and cluster capacity; can likely be dealt with in isolation using known techniques, e.g. load balancing)

Next steps

The proposed Waku mode can be implemented as a proof of concept on the order of a few weeks, which works well with current marketing plans and, if successful, could be used in a 1.1 app release.

A spec proposal is in early draft mode with associated issue. This will be enhanced as discussions here progress.

The main steps / requirements at this stage are:

(a) Buy-in from Core to give users the option to use this mode

(b) One or possibly two people to implement Waku mode as a proof of concept mode that can be used end to end

As well as any refinements to the assumptions and model necessary.

Additionally, for performance improvements, there’s a more engineering-focused effort on optimizing mailserver retries/locality/queries, which is out of scope for this post.

To tie this to more long term research work, we also want to use these nodes to do accounting of resources (i.e. bandwidth). This will inform more incentive modelling work.

Fin.

yenda · October 17, 2019, 10:31am

I support the idea of a Waku node over mailservers as a temporary solution before vacp2p is ready.

Which protocol are they going to use to communicate with clients? The client MUST trust the Waku node in regards to metadata shared, but it should use some sort of e2e communications because transport can’t be trusted.

andrey · October 17, 2019, 10:55am

What about avatars ? What if people will actively be setting their avatars after v1, we still send it as base64 string in whisper message, maybe we could disable avatars for v1 and add them to ens after v1

hester · October 17, 2019, 11:04am

Can you elaborate on what the risks are that you see with this @andrey?

andrey · October 17, 2019, 11:12am

no risks, just high traffic, so my idea is that instead of propagating the image through the whisper network, we could store the image in ipfs for example and save the hash in an ens contract, so users could get the hash and load the image

hester · October 17, 2019, 11:21am

Is that something that we could migrate to after people have set an avatar in V1? Or will those avatars continue to float around the Whisper network?

andrey · October 17, 2019, 11:38am

yes we can migrate later, they won’t float, but i still think the current implementation is more confusing than useful, because only contacts can see your avatar and they store it on their devices when they receive a message with an avatar from you

barry · October 17, 2019, 11:44am

When connected to wifi bandwidth is not a big consideration but message ordering is, receiving messages in random order means waiting for the sync to complete before being able to use Status. I would posit traditional messengers consume a good amount of bandwidth as well, however they load the last messages first which gives the user the impression of instant load time while the out of view messages continue to load in the background.

yenda · October 17, 2019, 11:49am

Additionally the public chats could be queried by name rather than topic, which would reduce the false positives even more. That could also be useful for the request from Hester and Maciej to know about active public chats.

On top of that paid Waku nodes could maintain a list of filtered addresses for a given user to limit spam in public chats at the “source”.

hester · October 17, 2019, 12:33pm

Agree that it makes more sense to tie avatar display to ENS name. Meaning that if ENS is set and avatar is your ‘ENS name avatar’, it is displayed whenever you set ENS name as display name in chat.

For the time being though, I think it does add to the experience to have a way to set a custom avatar, especially for friend connections in 1:1 chats where it’s more likely for both parties to have added each other. I’d consider it an acceptable tradeoff for now.

cammellos · October 17, 2019, 12:34pm

What about avatars ? What if people will actively be setting their avatars after v1, we still send it as base64 string

One thing we should definitely stop doing is sending the default avatar base64 encoded; currently I believe it’s being sent even if it’s the default. With the latest change of avatars the identicons have a much smaller size (4/5 times smaller), but still, there’s no reason to send them over the wire. User-set photos are still heavy, although less than the maximum packet size.

@barry

Currently we also send messages in reverse chronological order, although we can’t make any guarantee on the actual receiving order, but newer messages will definitely appear first

barry · October 17, 2019, 2:26pm

Just doing another look to see what makes it feel off. Status loads messages on all channels rather than the channel I’m on which makes it feel clunky.

Comparing to Telegram which I just inspected TG api calls, Telegram seems to make an api call to get some basic info about each room such as number of messages, but then separately requests a batch of messages for the room you are in and paginates, requesting more as one scrolls down a room. This seems to create the “fast” experience even though the network transfer was not small, nearly 20MB and growing as I scroll.

So it seems figuring out a way to

Just obtain info on chat with auxiliary data
Paginate chats

Would bring the messages UX to parity with most messengers. Of course doing this in a way that still preserves privacy would be the breakthrough.

I think ZeroDB which is an end-to-end encrypted database that enables clients to operate on (search, sort, query, and share) encrypted data without exposing encryption keys or cleartext data to the database server can enable such functionality. The main issue(s) with ZeroDB is it requires a central operator and is no longer maintained. Even so, the operator can only censor the clients but never read the plaintext.

There might be something fruitful in examining how ZeroDB works and figuring out a way to construct those operations as a distributed system, eventually layering on an incentive mechanism.

cammellos · October 17, 2019, 2:32pm

Comparing to Telegram which I just inspected TG api calls, Telegram seems to make an api call to get some basic info about each room such as number of messages, but then separately requests a batch of messages for the room you are in and paginates

Yes, we had a brief chat before about this. With the current setup it is a bit hard, as we pass a bloom filter, so we don’t really know whether the messages match any chat, but it’s not impossible (you return a count for each topic that matches and let the client do the math). One-to-ones are harder, as we can’t tell from the topic whether there are messages for us, but that’s probably less of an issue.

We have some form of pagination, kind of, although it’s manually triggered by the user instead of firing automatically. The idea was to eventually have it on scroll as you described, but we never got around to doing so.

oskarth · October 18, 2019, 4:34am

Thanks everyone for the comments!

Re which protocol to use:
That’s a good point. In light of getting a proof of concept out as soon as possible it probably makes sense to start with minimal changes. For draft spec diff with EIP627, see https://github.com/status-im/specs/pull/54.

Long term, we are moving to libp2p. But this PoC mode won’t be on libp2p, at least not initially. This likely means using what we’ve got (devp2p) and noting the attack vector.

Re avatars:
To see how much overhead they bring, it’d be good to know things like:

how big are the envelopes
how many envelopes are generated for one update
how often does this update happen
on which topic

Re public chats queried by name to reduce false positives:
This could be done for UX reasons, but in terms of bandwidth usage it is not a bottleneck. Until you hit 2^(32)-1 public channels/topics the average public channel will have a unique topic.

What will be a bottleneck though, at 5k+ users/keys, is the partitioned topic. Ideas for resolving that bottleneck would be great.

Re TG and fetching:
Absolutely. As you say, this is both correlated with bandwidth usage and a slightly separate concern. See issue #9185 in status-im/status-mobile (“Message fetching after being offline takes way too long”) for the non-bandwidth part. It’d be great to keep that conversation there, and hopefully Core can improve on this for v1.1.

cammellos · October 18, 2019, 6:48am

To see how much overhead they bring, it’d be good to know things like:

how big are the envelopes

how many envelopes are generated for one update

how often does this update happen

on which topic

Envelope size is variable and depends on your profile picture; from experience I can tell it is likely at least half the maximum size of a whisper envelope (when syncing profile pictures, contacts need to be synced one-by-one or you start dropping envelopes).

how many envelopes are generated for one update

You generate one envelope per contact you added, every 48 hours, it’s basically a one-to-one message, so the topic is whatever topic the other end is listening to.
This is basically an identical problem to group chats, and works roughly in the same way, as currently we use pairwise encryption, with all its strengths (strong encryption) and limitations (bandwidth consumption).

If we want to address this issue, we have (at least) 3 options:

Make this information public. Currently the data contained in a contact request is:

ENS name, which is basically public
Your avatar
Your token for push notifications (can’t be revoked once sent, it’s bad that we share it to basically non-trusted contacts, as that’s supposed to be secret, and we can’t assume contacts will keep it that way), for all your devices (we might be dropping this?)
Your address (currently not used, but I believe it might be used eventually for disclosing your wallet accounts, but we could opt for a different solution)

If we decide to make this information public, then we can just publish a single message (that’s conditional on dropping notifications, or at least finding a better way. I’d much rather have a server run by us to do that than send my fcm token to everyone, which is far less safe, and if the server is down, or you don’t trust us, we degrade gracefully by not having pn).

Look at what essentially are other group chat protocols, solving the problem in a generic way, and potentially use it as a testing ground for adoption. The options are somewhat limited in this field (most use variants of pairwise encryption, with some optimizations built on top), but there are others worth exploring (tree based, mls, simple hash ratchet etc)
Solve it with a specific solution, as it is a group chat problem, but with some differences, namely, there’s only one admin, and members can’t leave the group unless “kicked out” by the admin. Given this we can devise a much simpler version, and trade off on strong encryption (no pfs for example).

Regarding the protocol for waku nodes, the only thing I would add is that we should make it synchronous (request-response, ideally sync), if we are going that way it would make many things easier.

Also I think we should not consider this a temporary solution, to me it just sounds like a Linus security blanket.
That’s not to say that we won’t change it in the future and that we won’t be putting effort into it, but the argument “We do this, but it’s only temporary” (how many times have we heard this? our own fleet? temporary! mailservers? temporary!) is just counterproductive and to me sounds a bit hypocritical.

Considering this stuff as second class citizen puts us in the same position as we are now, where we neglected mailservers for such a long time just because they are “temporary” (with no realistic solution on the horizon), until it became problematic, and now we scramble to fix it pre-v1.

I think we should own our decisions (which I agree with), whether we like them or not, and make the best of what we have now, so if we go for waku mode, let it be the best waku mode we can build, as we know that “temporary” in agile terms is meaningless: what you put out today is how your product might look for the rest of its lifetime, whether you like it or not.

yenda · October 25, 2019, 10:56am

Yes it is dropped, I’m deleting notifications code today. No more fcm tokens

oskarth · November 25, 2019, 4:42am

Update: Waku project has now kicked off. Progress tracked in Vac forum. See Waku project and progress - Vac for more.

oskarth · December 10, 2019, 10:10am

oskarth · January 14, 2020, 10:52am