How to track the non reproducible bug in version 2.1

So, version 2.1 was crashing for quite a few users on startup. Obviously the app wasn’t crashing for me in any of my tests.

First line of defence when crashes happens is to ask the user what they were doing and try to replicate the same action to see if I can recreate the crash. But here they weren’t doing anything it wasn’t starting.

Second is to look for patterns in the crash. The first few users reporting the issue were all using an iPad. There are a few differences between the iPhone and the iPad, mainly that the iPad displays an activity detail at the same time as the activity list. This can result in some difference in logic and makes the startup on iPad marginally more prone to issues. This can lead to hard to reproduce bugs too as it is a function of the type of activity the user would load on startup (the last one downloaded)

Another help is the crash report from apple. In this case, they were very few and reporting crash in location that are not triggered at start up or inside apple libraries. So not helpful to pinpoint the problem.

Some users affected by the issue kindly sent me their activity data so I could try to see if the problem was linked to their specific activities data. It wasn’t, I could start the app with their data fine.

This leaves one other possibility, a thread concurrency issue. This can be really random, and a function of both the speed of the hardware and the exact activities you have. ConnectStats is quite multi-threaded to try to keep the app responsive as many calculations are performed in parallel. So it is possible that some of this parallel activity result in a collision or some instructions run too early (when required data is not ready).

I try to be very careful on collision conditions and to make the code robust to data not being ready. In version 2.1, in order to optimise the startup, I had done a clean up of all the notifications between events in the app to limit them to the minimum needed. So likely the problem was linked to that: issue affecting random users, not a logic issue (users data didn’t let me reproduce it), but the issue was systematic for some users: so this gives another hint. The one issue that can exhibit these symptoms is if you trigger a User Interface event not from the thread dedicated to the UI (main thread).

I did a review of all the events that happen on start up and found one that, in the clean up of 2.1, wasn’t not forcefully directed to that main thread…

So while I couldn’t reproduce, I feel strongly that this must be the issue:

  • linked to some change I made in version 2.1,
  • random: depends of when precisely event get triggered
  • systematic for a given user: for a set of hardware speed and size of data, always happens or not

I submitted a fix for that event issue to apple, let’s cross fingers until version 2.1.1 is approved by apple and hope the issue is indeed resolved. I requested an expedited review, but it’s not always granted.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.