While I am still preparing a new release of ConnectStats to fix a few issues, the coronavirus is right now a major distraction for most of us. As the virus spreads, there is a lot of information floating around and it can sometimes be a bit confusing. So I decided to see what I could verify on my own at home with basic analysis and tools. I’ll be using python/jupyter/pandas, which is very standard and easy to install, or can be run directly on Google Colab.
This may become a bit technical for some, but hopefully will be interesting and may help some people learn more about python and pandas as more of us have to stay home…
Data Sources
Of course, one can collect all the numbers every day from the different websites, or even build one's own web scraper. But many people are already doing that, and it is all available on the web.
Johns Hopkins University publishes a web dashboard that has become quite a reference, and they make all their collected data available on GitHub. So I started using that, and this post will look at how to use that data yourself.
I’ll also highlight another site that I found fascinating, which tracks the evolution of the genome of the virus and which I recommend people have a look at.
Note that none of the below is intended to be scientific. One should always follow official guidelines; no conclusion from the analysis below should override them.
It is only intended to show how you can use the data yourself and independently check what is being reported, and to show how one can use python and pandas for that.
The full notebook to replicate all the below can be seen here as html or here as a notebook. You can also open and run it on Google Colab.
Looking at the data
The data contains time series by country and region/state. It is fairly clean but requires a few steps to make it useful for time series analysis.
Two main steps are required:
- Consolidate the data for countries that are broken down into states. This is easily accomplished with a groupby in pandas.
- Transpose the data and convert the dates to datetime values.
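The two steps can be sketched as follows. This is only an illustration, not the code from the notebook: the small table is made up, and it assumes the layout of the Johns Hopkins time series files, with `Province/State`, `Country/Region`, `Lat` and `Long` columns followed by one column per date written like `1/22/20`.

```python
import pandas as pd

# Small stand-in for the Johns Hopkins confirmed-cases table (made-up numbers).
# Assumed layout: one row per province/state, one column per date.
raw = pd.DataFrame(
    {
        "Province/State": ["Hubei", "Beijing", None],
        "Country/Region": ["China", "China", "Italy"],
        "Lat": [30.97, 40.18, 43.0],
        "Long": [112.27, 116.41, 12.0],
        "1/22/20": [444, 14, 0],
        "1/23/20": [444, 22, 0],
        "1/24/20": [549, 36, 2],
    }
)

# Step 1: sum the provinces/states of each country into a single row.
by_country = (
    raw.drop(columns=["Province/State", "Lat", "Long"])
    .groupby("Country/Region")
    .sum()
)

# Step 2: transpose so that dates become the index, then parse them.
cases = by_country.T
cases.index = pd.to_datetime(cases.index, format="%m/%d/%y")
cases.index.name = "date"

print(cases)
```

With the real file, `raw` would come from `pd.read_csv` on the downloaded CSV instead of being typed in by hand.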
You end up then with data that looks like this
The conversion and merging of states data for the US is slightly more involved, but remains quite simple
Analysing the data
You can quite quickly realise that the growth is exponential. But the questions are how stable the rate of growth is over time and how it compares across countries.
The first step is to plot the log of the series
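As a rough sketch of why the log is useful, here is a made-up series growing by 25% a day (the numbers are illustrative, not real data). Under exponential growth the log of the series is a straight line, so its day-to-day change is constant.

```python
import numpy as np
import pandas as pd

# Made-up cumulative counts growing 25% a day, for illustration only.
dates = pd.date_range("2020-03-01", periods=10, freq="D")
cases = pd.Series(100 * 1.25 ** np.arange(10), index=dates, name="cases")

# Mask zeros before taking the log, since log(0) is undefined.
log_cases = np.log(cases.where(cases > 0))

# Under exponential growth the log increases by a constant amount each day.
daily_log_change = log_cases.diff().dropna()
print(daily_log_change.round(3).unique())
```

If matplotlib is installed, `log_cases.plot()` or `cases.plot(logy=True)` draws the corresponding chart.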
Visually, all the series seem to converge toward the same slope, but it can be a bit tricky to judge by eye. So let’s try to fit a linear model, or line, to the log of the data.
Python and pandas make it quite easy in a few lines. Here we’ll plot, for a given length of time (20 days) and a given state or country (California), the exponential model fitted to the data.
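A minimal sketch of such a fit, assuming a daily series of cumulative cases. The helper name `fit_exponential` is hypothetical and the use of `np.polyfit` is one possible choice, not necessarily what the notebook does. The idea is to fit a straight line to the log of the last `days` observations; the slope is the daily growth rate, and `log(2) / rate` gives the doubling time.

```python
import numpy as np
import pandas as pd


def fit_exponential(series: pd.Series, days: int = 20):
    """Fit log(cases) = intercept + rate * t over the last `days` observations."""
    window = series.iloc[-days:]
    window = window[window > 0]          # log is undefined at zero
    t = np.arange(len(window))           # day number inside the window
    rate, intercept = np.polyfit(t, np.log(window.to_numpy(dtype=float)), 1)
    fitted = pd.Series(np.exp(intercept + rate * t), index=window.index)
    return rate, intercept, fitted


# Made-up series: 50 cases growing at a continuous rate of 0.2 per day for 30 days.
dates = pd.date_range("2020-03-01", periods=30, freq="D")
cases = pd.Series(50 * np.exp(0.2 * np.arange(30)), index=dates)

rate, intercept, fitted = fit_exponential(cases, days=20)
doubling_days = np.log(2) / rate
print(f"daily growth rate: {rate:.3f}, doubling time: {doubling_days:.1f} days")
```

Calling the helper with a different `days` value, or on another column of the consolidated table, gives the coefficients for another period or area.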
You can then experiment with different countries, states, and periods of time to see how stable the coefficients are over time or between areas.
As previously stated, you need to be very careful in interpreting the results, but here is a comparison of New York and California, which seems to indicate a higher rate of growth in New York. Of course, this could be linked to new test results coming in rather than actual growth in infections.
Or Italy over two periods of time, which seems to indicate a slower rate of growth over the more recent period.
Please stay safe and healthy. Always follow the official guidelines.
I hope some of you found this interesting and that, as more of us have to stay home, it can be a good opportunity to learn and experiment with python and pandas.
The full notebook can be seen here as html or downloaded as a notebook
Wonderful. I’ve been retired for years now and it was interesting to see how the languages have developed. In the UK there is talk of the over 70s having to stay at home non-voluntarily, so maybe I’ll take up your suggestion 😊
Thanks again