Garmin's servers were out from Thursday July 23rd to Monday July 27th. I won't go over what happened at Garmin, as it's well covered in the press and on other sites, but I wanted to share how it impacted me and ConnectStats. While the impact on ConnectStats wasn't massive, the episode resulted in two small fixes and improvements…
The discovery
First, the outage started on a Thursday when I didn't go out running, so I didn't realise anything was wrong until Friday, when the Garmin Connect app didn't upload my latest morning run. The app was reporting some maintenance, which I didn't worry about too much. I then opened ConnectStats to see how it was handling the maintenance: it reported a Garmin error, and an invalid name or password if you tried to use the Garmin website connection. Not great, but it's sometimes hard to interpret the web page errors. I made a mental note to check again later and see if I could interpret the message better, and went to work.
When I checked again later in the evening to see my morning activity, it was still down, which was unusual. A quick check on social media showed a general Garmin outage, so I figured people would be understanding and postponed looking further until the weekend, when I'd have more time.
The race
The weekend was mostly spent looking through the web for information on what was happening at Garmin, and figuring out how to upload the files manually to Strava. But that highlighted a new issue… Uploading a fit file to ConnectStats wasn't working properly, for two reasons:
- The good reason: by design, I had implemented the opening of a fit file in ConnectStats to be temporary only; the app wasn't saving the activity to its internal database, which was not so useful here (the idea of the fix is sketched just after this list).
- The bad reason: opening a fit file being a little-used feature, it also had a bug that resulted in the file not always being opened…
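For the first point, the fix was essentially to persist the imported activity rather than keep it as a temporary view. ConnectStats isn't written in Python and this is not its actual code; the sketch below just illustrates the idea with placeholder names (parse_fit_file, an activities table) and a throwaway in-memory database.

```python
import sqlite3
from dataclasses import dataclass

# Placeholder summary of a parsed fit file; a real parser extracts far more fields.
@dataclass
class ActivitySummary:
    activity_id: str
    start_time: str        # ISO 8601, for simplicity
    distance_meters: float
    duration_seconds: float

def parse_fit_file(path: str) -> ActivitySummary:
    """Hypothetical stand-in for the app's fit file parser."""
    # A real implementation decodes the binary fit format; here we just fake it.
    return ActivitySummary("fit-import-1", "2020-07-24T07:00:00Z", 10000.0, 3000.0)

def import_fit_file(db: sqlite3.Connection, path: str) -> None:
    """Open a fit file and, crucially, save it to the internal database
    so it is still there after the viewing session (the original behaviour
    was display-only)."""
    summary = parse_fit_file(path)
    db.execute(
        "INSERT OR REPLACE INTO activities "
        "(activity_id, start_time, distance_meters, duration_seconds) "
        "VALUES (?, ?, ?, ?)",
        (summary.activity_id, summary.start_time,
         summary.distance_meters, summary.duration_seconds),
    )
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE activities ("
        "activity_id TEXT PRIMARY KEY, start_time TEXT, "
        "distance_meters REAL, duration_seconds REAL)"
    )
    import_fit_file(db, "morning_run.fit")
    print(db.execute("SELECT * FROM activities").fetchall())
```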
So I fixed both issues and rushed a new release on Sunday. It would be a race between Apple and Garmin: would Apple approve the new version before Garmin fixed the outage? It turned out to be a tough race to judge, because on Monday morning when I woke up, the app was approved and Garmin was sending activities again to my server. I never looked too closely at which one technically happened first overnight.
The day after
When I saw activities coming in again, I looked at my server and saw that the queue processing the activities fed by the Garmin API was close to 10,000 requests behind and growing. The server was getting activities faster than it could process them… Given the fairly low number of ConnectStats users, my server runs on a single machine in the cloud, so it has limited scaling ability… It works great with the regular trickle of activities from the app's users, but here Garmin was sending close to 5 days of activities all at once… You can see in the table the number of distinct active users sending an activity and the number of activities the server processed each day over the last few weeks… You can see the spike on the 27th. What the table does not show is that all these activities were sent in a very short amount of time, while the usual number comes in as a trickle during the day…
Monitoring the large queue being processed clearly showed that one aspect of the processing was quite the bottleneck. A little bit of thinking and investigation later, I found another place where my database update wasn't properly optimised. I hadn't noticed it during the normal "trickle" operating mode, but this special situation showed it clearly (for those technically inclined, one of my queries was on a column without an index, even though I had tried to make sure that wasn't the case…). After a quick fix, the queue started to process much faster, and this will also help going forward should the number of users ever grow a lot!
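For the curious, the sketch below reproduces that kind of problem in miniature with SQLite from Python. This is not the actual server code or its real schema (the table and column names are made up); it just shows how EXPLAIN QUERY PLAN reveals a full table scan on an un-indexed column, and how adding the index changes the plan.

```python
import sqlite3

# Toy setup: an activities table queried by a column (here "user_id") that
# initially has no index, similar in spirit to the bottleneck described above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE activities (activity_id INTEGER PRIMARY KEY, user_id TEXT, data TEXT)")
db.executemany(
    "INSERT INTO activities (user_id, data) VALUES (?, ?)",
    [(f"user{i % 500}", "payload") for i in range(10000)],
)

query = "SELECT COUNT(*) FROM activities WHERE user_id = ?"

def plan(db, query):
    # EXPLAIN QUERY PLAN shows whether SQLite scans the whole table
    # or searches via an index; the detail text is in the 4th column.
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + query, ("user42",))]

print(plan(db, query))   # e.g. ['SCAN activities']  -> full table scan

# The fix: add an index on the column used in the WHERE clause.
db.execute("CREATE INDEX idx_activities_user_id ON activities (user_id)")

print(plan(db, query))   # e.g. ['SEARCH activities USING COVERING INDEX idx_activities_user_id (user_id=?)']
```

On a small table the difference is invisible, which is exactly why the missing index only showed up once the backlog forced the server to run that query under real load.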
A few hours later, all was back to normal in the ConnectStats app and server realm, minus one bug and plus one optimisation 🙂