Thoughts on software quality and knowledge-building

A couple of recent, related bugs prompted me to write this post: Messaging/transacting take too long to be processed if there is a 'Pending...' tx message · Issue #5293 · status-im/status-mobile · GitHub, and another that manifested as messages sent in rapid succession not leaving the client at all; they were just stuck with the spinning-wheel animation.
Both bugs trace back to our move, some months ago, from spawning new threads to leveraging a thread pool on Android. That thread pool was capped at 4 threads, and although that is admittedly a very conservative value, it usually shouldn’t be a problem as long as tasks executing in the pool don’t hang around for too long (or at least yield regularly). After all, if that low limit were an immediate problem, we would have noticed it quickly, right?
Although one can argue that the #5293 bug is very specific and we probably just hadn’t come across that scenario before, I find it hard to believe that the conditions for reproducing the message-sending issue hadn’t been met at some point in the intervening months. I find it more plausible that some other, unrelated change caused tasks to execute for longer in the thread pool (sometimes for as long as 2 minutes), exhausting the free threads and leaving the app unresponsive to new requests for parallel work.
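To make the failure mode concrete, here is a minimal, self-contained Java sketch (not the actual status-mobile code; the pool size and task durations are just stand-ins) of how a fixed pool of 4 threads stops responding to new work once four long-running tasks are in flight:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy demonstration: a fixed pool of 4 threads stops responding to new work
// as soon as four long-running tasks are in flight.
public class PoolExhaustionDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Four tasks that each hold a worker thread for about two minutes,
        // standing in for whatever long-running work ended up in the real pool.
        for (int i = 0; i < 4; i++) {
            final int id = i;
            pool.submit(() -> {
                System.out.println("long task " + id + " started");
                try {
                    TimeUnit.MINUTES.sleep(2);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // This "send message" task is accepted without error, but it just sits
        // in the queue until one of the four workers frees up.
        pool.submit(() -> System.out.println("message send task finally ran"));

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```

Note that the fifth task is accepted without any error, so nothing shows up in logs; it simply waits in the queue, which to the user looks exactly like a stuck spinner.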

In the end, this issue was fixed by community contributor @pacamara by raising the maximum number of threads in the pool to 100 times the number of cores on the device. That made the symptom disappear, but I believe we’re missing out on a potentially valuable lesson/new best practice if we just close the issue and move on without taking the time to investigate where exactly the problems started to manifest, for example by checking previous builds or doing a bisect. There is a potential long-running task in the codebase right now, and we risk repeating the same mistake in the future if we haven’t taken the time to study the problem and distil it into a best practice. I’ve voiced these concerns a few times on the respective Slack threads but was met with silence (probably because everyone was focused on getting nightly builds back to a working state), so maybe this long-form post is a more adequate way to raise awareness of the issue.
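For context, the shape of the fix is roughly the following (a hedged sketch; the names here are illustrative and the real change in the codebase may be structured differently):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hedged sketch of the pool-sizing change described above; the names are
// illustrative and the real code in status-mobile may look different.
public class WorkerPool {
    // 100 threads per core, the value that made the symptom disappear.
    private static final int THREADS_PER_CORE = 100;

    public static ExecutorService create() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(cores * THREADS_PER_CORE);
    }
}
```

A larger pool only raises the ceiling, though: a task that blocks a worker for two minutes still blocks it for two minutes; we just need many more of them before the app stops responding, which is why I think the underlying long-running task is still worth hunting down.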

I think that, as a team, taking these opportunities to ruthlessly hunt down root causes and share the resulting learnings is what will keep us progressing technically and inevitably becoming experts in the areas we touch (even if they’re not our current core areas of expertise). However, this requires consciously prioritizing these efforts and taking an all-hands-on-deck approach to nail the root cause as quickly as possible, knowing that the longer you wait, the harder it can become to recreate the conditions needed to reproduce a bug: code moves on, infrastructure moves on, etc.
One approach that has worked well for me in the past is to have two experienced people on the team pair up to investigate the issue and challenge each other’s findings, so that possible root causes get explored quickly.

This situation raised a few questions for me:

  • How easy is it to bisect to find the commit where a behavior was introduced?
    • Using git bisect is likely impractical, since building each commit to test it would take too long.
    • Installing previous nightlies doesn’t seem to work either (even assuming all the nightlies are on the same react schema): Android refuses to install an older build, which might be a problem if we need to keep the current app data to test the issue from build to build.
  • Are old nightly builds easily accessible somewhere, sorted by date, e.g. in Artifactory?
  • As a team, how high a priority are we willing to assign to an investigation? I.e., do we consider it a nice-to-have, or a vital need?

It also demonstrated the need to be able to change the log level of geth.log without rebuilding the app from source: if we’re going to step back through a list of old builds to find out where a behavior started, we need to be able to turn up logging on each of them.

In any case, I just wanted to share my point of view, and perhaps get a healthy discussion started to see if anyone else agrees/disagrees, or sees problems in our current approach.


Did anything come of this? I’d love it if we were more rigorous about root cause analysis; it’s definitely more effective in the long run. A specific investigation would make a great post on Discuss/Our/Town Hall/a Core Dev call, and it’s probably something the larger community would find interesting.

If there are specific individuals interested in this (or other RCAs) that seems like a great use of time to me, especially if it can be written up and shared with others. Perhaps this is something QA can also help with?

More generally, is there anything else that would help make this something we do regularly?

A few days ago I spent some time testing a hypothesis that the code in the Runnables that calls Statusgo might be locking up a thread for too long and quickly exhausting the number of threads in the pool. However, after timing each Runnable, I didn’t see anything that would confirm that hypothesis: most Runnables executed in a fraction of a second, and only a very few took more than 1 second (e.g. at login), which doesn’t seem enough to explain the behavior we witnessed.
I’d be willing to pair up with someone else to explore hypotheses and methodically test the scenarios.
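For anyone who wants to repeat or extend the measurement, here is a minimal sketch of the kind of per-Runnable timing I have in mind, assuming a plain ExecutorService wrapper (the pool size, class, and method names are assumptions, not the actual hook point in the app):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative wrapper for per-Runnable timing; the pool size and names
// are assumptions, not the actual status-mobile code.
public class TimedExecutor {
    private final ExecutorService delegate = Executors.newFixedThreadPool(4);

    public void submitTimed(String label, Runnable task) {
        delegate.submit(() -> {
            long start = System.nanoTime();
            try {
                task.run();
            } finally {
                long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                // With only a handful of workers, anything much above a second is suspicious.
                System.out.println(label + " took " + elapsedMs + " ms");
            }
        });
    }
}
```

A wrapper like this can be dropped in wherever Runnables are currently submitted, and anything consistently above a second or so would be a good candidate for further investigation.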

By the way, I love the idea of writing this down in the Discuss thread as we go along; that’s a perfect use case for Discuss, as people can chime in with ideas and knowledge as data comes in!