A couple of recent related bugs prompted me to write up this post (those being Messaging/transacting take too long to be processed if there is a 'Pending...' tx message · Issue #5293 · status-im/status-mobile · GitHub and one other that manifested itself in the fact that messages sent in rapid succession did not leave the client - they were just stuck with the spinning wheel animation).
Those two bugs happened due to the fact that some months ago we moved from spawning new threads to leveraging a thread pool on Android. That thread pool was limited to 4 threads, and although that is admittedly a very conservative value, it usually shouldn’t be a problem as long as tasks executing in the thread pool don’t hang around for too long (or at least if they yield regularly). I mean, if that low limit was an immediate problem, we should have quickly noticed that problem, right?
Although one can argue that the #5293 bug is very specific and we probably just haven’t come across that scenario before, I find it hard to believe that the conditions for reproducing the message sending issue hadn’t yet been met in the intervening months. I find it more plausible that some other unrelated change caused tasks to execute for longer in the thread pool (sometimes for as long as 2 minutes), and consequently exhaust the number of free threads, and make the app unresponsive to new parallel execution requests.
In the end, this issue was fixed by community contributor @pacamara
by raising the maximum number of threads in the pool to 100 times the number of cores on the device. Although this caused the symptom to disappear, I believe that we’re missing out on a potentially valuable lesson/new best practice by just closing the issue and moving on, and not taking the time to investigate where exactly the problems started to manifest themselves by checking previous builds or doing a bisect. There is a potential long-running task now in the codebase, and we risk repeating the same mistake in the future if we haven’t taken the time to study the problem and come up with a best practice. I’ve voiced my concerns a few times on the respective Slack threads but was met with silence (probably because everyone was focused on getting nightly builds back to working condition), so maybe this long-form post is a more adequate way to raise awareness to the issue.
I think that as a team, taking these opportunities and ruthlessly hunting down for root causes and sharing the resulting learnings is what will keep us progressing technically and inevitably becoming experts in the areas we touch (even if they’re not our current core areas of expertise). However, this requires consciously prioritizing these efforts and having an all-hands-on-deck approach to nail the root cause as quickly as possible (knowing that the longer you wait, potentially the harder it’ll be to have the right conditions to reproduce a bug - code moves on, infrastructure moves on, etc.)
One approach which has worked well for me in the past is to have two experienced people in the team pair up to investigate the issue and challenge each other on findings to quickly explore possible root causes.
This situation raised a few questions for me:
- How easy is it to bisect to find the commit where a behavior was introduced?
- Using
git bisect
is likely going to take too long to build each commit to be practical. - Installing previous nightlies doesn’t seem to work (even assuming all the nightlies are on the same react schema), as Android refuses to install an older build. (Might be a problem if we need to keep current app data to test the issue from build to build.)
- Using
- Are old nightly builds easily accessible somewhere, sorted by date, in a place like artifactory?
- As a team, how high a priority are we willing to assign to an investigation? I.e., do we consider it a nice-to-have, or a vital need?
It also demonstrated the need to be able to change the log level of geth.log
without needing to rebuild the app from sources, if we are to be able to go back through a list of old builds to find out where a behavior started.
In any case, I just wanted to share my point of view, and perhaps get a healthy discussion started to see if anyone else agrees/disagrees, or sees problems in our current approach.