Never ending service and server saga

Time flies. It has been close to a year since Garmin opened its new API to third-party developers and I embarked on migrating ConnectStats to it. I had anticipated it would be fairly straightforward but, boy, it has been quite a ride, and the last few weeks continued to provide a lot of “excitement”. Let me share some of that story…

Quick recap of the service

In the past, ConnectStats connected to the Garmin Connect website to extract the data for the user's activities. This was the way ConnectStats had worked from the beginning (2011!!!). It was simple, as I didn’t have to maintain any online servers, but it was not very robust and repeatedly caused serious outages, because the Garmin Connect website would change its way of serving data without any type of support or documentation. I would typically find out via a multitude of bug reports and then have to reverse engineer what had changed on the website.

The new service is fully documented and supported, but the flip side is that it works by pushing the fit files of a user to the application web server in the cloud. Which means you need to have a web server in the cloud, which I didn’t…

A web server on the cheap

Being an old-timer programmer, I mostly knew how to program in the “traditional” way. When I started looking at what it would take to build the server, I had a quick look at Amazon AWS and other services like Digital Ocean, but it felt quite overwhelming. They felt much more professional and scalable, but my blog hosting server was just so much simpler and more familiar, providing just a php/mysql environment for a reasonable price.

So I started building the server for that environment. Its code is open source, as is ConnectStats itself. It turns out a simple environment is nice, but I also had to implement a lot of components for the server myself, as it was impossible to bring in external open source tools. It took me quite a while.

First attempt, September 2019 – worked great for one user!

My first implementation back in the fall worked great for my account. Quite excited, I started to enable it for a few test users. It had a few quirks, but it was the first time, and it ended up working for a dozen users for a few weeks. I felt ready to go to the next stage…

But as soon as I enabled 100 or so users, the whole thing collapsed, with the server dying while trying to process all the files being fed by Garmin.

The queue system to process the incoming feed was just too basic and silly. So I built a new better one. And all became happy for the 100 users.

Second attempt, November 2019 – worked great for 100 users!

By that point, I enabled everyone. It worked great for about one week, the queue was processing the events for all the new users with ease. Beautiful.

That is, until I got an email from my web hosting service saying that they were shutting down my system because it was using too many resources! It turned out that an “Unlimited” basic account meant unlimited “within reason”. My database had way too much data in it.

But not to worry: I called them, they were charming, congratulated me for having a website that seemed to generate so much traffic and data, and offered to upgrade me to the next service level at the next cost level. Which I did.

It took a bit of time for them to migrate my account. And now, while I was below the limit and the server was up, it was performing dreadfully slowly!

After I called back, they explained that I might need the next level again. A bit desperate, as the server had been down for a while, I agreed. So I migrated again, to yet another server tier at the next cost level.

But it actually still didn’t work! A bit of research later, I realised simple queries on my database were very slow. So I ran the optimiser on the database, and it all became blazing fast. Of course I was now using the “Enhance” Business Hosting plan…

Third attempt, January 2020 – worked great for 5000 users!

So I saw the users grow, which was quite satisfying. All was going well, but I was noticing my database size was growing at an alarming rate. The database has to save all the fit files of all the users it needs to service, so that when ConnectStats requests an update it can send back the activity files for the user…

One day, I noticed the site had stopped working altogether. I hadn’t received any email notification of an issue, but it appeared the disk was full! After just a few months of usage, that was alarming… Did I have to upgrade to the next tier from the provider? Some quick math showed that, at that rate of growth, the next tier's disk space wouldn’t last long either, and it would probably cost quite a bit of money very quickly.

So I did some investigation and found a few ways to improve the situation.

First, I was doing a backup of the database, which is definitely a good idea, but I was keeping the backup files on the server as well as on another separate server. So each file saved was taking double its size. Easy to fix: I moved the backup so it is kept only on the other server, which has cheap disk.

Second, I realised that some users were receiving hundreds of files at once from the Garmin service, which didn’t make sense. Typically a user would have one activity file sent at a time, as it’s quite unusual to run or cycle more than a few activities in one day.

After investigation, it turned out these were a lot of old activities from these users. The Garmin service is not supposed to do that; it should only send the latest ones. All these were definitely adding up in disk space!

I wrote to Garmin and put in a safeguard so that I would only save activities that are recent. Garmin replied that they had an issue with their service, which explained the old activities, and that they would work on fixing it.
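As an illustration of that kind of safeguard (a minimal sketch, not the actual server code, which is php), the filter can be as simple as dropping anything older than a cutoff. The 30-day cutoff, the `start_time` field and the function names are assumptions made for the example; the post does not say how "recent" was defined.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: the real threshold used by the server is not stated.
MAX_ACTIVITY_AGE = timedelta(days=30)


def is_recent(start_time, now=None, max_age=MAX_ACTIVITY_AGE):
    """Return True if the activity started within max_age of now."""
    now = now or datetime.now(timezone.utc)
    return now - start_time <= max_age


def filter_recent(activities, now=None, max_age=MAX_ACTIVITY_AGE):
    """Keep only the pushed activities that are recent enough to store."""
    return [a for a in activities if is_recent(a["start_time"], now, max_age)]
```

With a filter like this, a burst of hundreds of old activities pushed for one user would be discarded instead of being written to disk.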

So I was now back in business. My server had free disk space again, and the rate of increase was reduced. But I decided I had to start working on a service that was more scalable as the current setup would definitely reach the limit again…

Fourth attempt, February 2020 – worked great for 10,000 users!

So I knew I had time, but I decided to plan ahead for the future and start learning how to build a scalable service with a provider like Amazon AWS.

It became a fascinating journey. The way Amazon AWS lets you bring in the components you need and build APIs and serverless functions that scale is very impressive. There is definitely a learning curve though, and while I could get most of the basics working, I still have missing pieces to fully replicate the service I have built in the more traditional php/mysql framework. It isn’t finished, but as usual I kept all the code open source.

As I was still working with that concept, at about 10,000 users, I ran out of disk space again!!!

This time, I had very few options to free up space. And so it was time to do hot surgery on my server and move some pieces to AWS. The way to split was easy enough: I would keep all the database and APIs on my existing server, but would move all the files to the S3 storage service of AWS. I had already got some concept of this working in my experimentation…

Luckily, the way I had designed the app meant that the outage was transparent for users, as ConnectStats silently reverted to getting data from the Garmin website while the ConnectStats server couldn’t respond… So it was a bit stressful, but at least the users weren’t badly impacted and there was no flurry of bug reports…
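The fallback idea can be sketched in a few lines. This is only an illustration of the pattern in Python, not the app's own code; the two fetch callables and the source labels are hypothetical stand-ins for "ask the ConnectStats server" and "ask the Garmin website".

```python
def fetch_activities(fetch_from_server, fetch_from_garmin_site):
    """Try the primary server first; fall back to the legacy source on failure.

    Both arguments are hypothetical callables that return a list of
    activities or raise ConnectionError / TimeoutError.
    """
    try:
        return fetch_from_server(), "server"
    except (ConnectionError, TimeoutError):
        # The server is down or unreachable: silently use the old path.
        return fetch_from_garmin_site(), "garmin-website"
```

Returning the name of the source alongside the data makes it easy to log how often the fallback is actually being used.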

So I quickly wrote and tested the change to move to the hybrid system between my GoDaddy server and Amazon AWS.

I got it working on a test server with my own account and activities. Which meant I had to do a few extra runs to make sure they were uploaded fine in the hybrid system.

Then, to free up disk space (the server was basically full), I had to write a migration script that would take the data from the GoDaddy server, upload it to Amazon, then remove it from the GoDaddy server…

I had about 30 GB of data, so it took a few days to upload, as I was throttling the upload, fearing it would bring down the server if I pushed too much at once…
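For readers curious what throttling looks like, here is a minimal sketch, assuming a fixed pause between uploads. The `upload` callable, the rate and the function name are all hypothetical; the real migration script is not shown in this post.

```python
import time


def migrate_throttled(paths, upload, max_per_second=2.0, sleep=time.sleep):
    """Upload each path in turn, pausing between uploads to cap the rate.

    `upload` is a hypothetical callable (for example a thin wrapper around
    an S3 client); `sleep` is injectable so the pacing can be tested.
    """
    delay = 1.0 / max_per_second
    uploaded = []
    for path in paths:
        upload(path)
        uploaded.append(path)
        sleep(delay)  # leave the server some breathing room
    return uploaded
```

A lower `max_per_second` stretches the migration over a longer time but keeps the load on the source server small.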

Once the upload was completed, I did a few basic tests: the files were there and accessible from GoDaddy… so I switched the production server to that setup…

All seemed to work, and my server was living on the edge with very little disk space left, so I went ahead and deleted the old data…

Last step – worked great until I fixed my silly bugs…

That’s when I started receiving sporadic bug reports that ConnectStats was crashing… But the app was working for me! So frustrating is the life of an independent developer with limited testing resources 🙂

It didn’t take long though for me to be able to reproduce the crash. It had to do with trying to download an activity that had been saved in the migration from GoDaddy to Amazon. All the new activities coming in were fine, which explains why the majority of users didn’t notice… Most users only download new activities, as the old ones are saved by ConnectStats on the phone.

My migration script had a silly little bug that meant all the files were uploaded with a few bytes missing! This of course corrupted the fit files and caused ConnectStats to crash…

So I wrote a fix to the app to properly handle corrupted fit files, and did a new migration of my data from the backup to Amazon, this time without the bug… I am also pleased I had set up the backup system… It would have been quite unfortunate otherwise.
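A truncated fit file can often be spotted before parsing it. The sketch below is an illustration in Python, not the fix that went into the app. It relies on the FIT file header as documented in Garmin's FIT SDK: the first byte is the header size, bytes 4 to 7 hold the data size as a little-endian integer, bytes 8 to 11 are the ASCII tag `.FIT`, and the file ends with a 2-byte CRC. The function name is hypothetical, and the check assumes a single, non-chained fit file and does not verify the CRC itself.

```python
import struct


def fit_looks_complete(data: bytes) -> bool:
    """Cheap sanity check that a FIT file has not lost bytes.

    Compares the actual length against the length announced in the
    header: header size + data size + 2 bytes of trailing CRC.
    """
    if len(data) < 12:
        return False
    header_size = data[0]
    if header_size < 12 or data[8:12] != b".FIT":
        return False
    (data_size,) = struct.unpack_from("<I", data, 4)
    return len(data) == header_size + data_size + 2
```

Running a check like this on a sample of migrated files, before deleting the originals, would flag files that came out a few bytes short.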

So right now, the server for ConnectStats works in the hybrid mode between GoDaddy and Amazon. My contract with GoDaddy has another year or so to run, so I have time to build the next version of the server to be fully on Amazon AWS…

Hopefully, some of you found this interesting or even entertaining. I suspect a lot of users do not realise the issues and complexities that go into building a service in the cloud…

But I have to say, I learned a lot and it was quite interesting for me!

