Looking at the data

The first step is to get the data from the Johns Hopkins CSSE repository at https://github.com/CSSEGISandData/COVID-19

The directory csse_covid_19_data/csse_covid_19_time_series contains files that are well suited to time series analysis.
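
If you would rather not clone the whole repository, pandas can also read the CSVs straight from the raw GitHub URLs. The path below is a sketch of the raw-file layout as it stood at the time of writing; the repository has since renamed these files, so adjust the names if the request fails.

```python
# Raw-file URL layout as of early 2020; the repository later renamed
# these files, so adjust the file names if the request fails
base = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
        'csse_covid_19_data/csse_covid_19_time_series/')
confirmed_url = base + 'time_series_19-covid-Confirmed.csv'

# dfc = pd.read_csv(confirmed_url)   # requires network access
print(confirmed_url)
```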

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

confirmed_f = 'time_series_19-covid-Confirmed.csv'
deaths_f = 'time_series_19-covid-Deaths.csv'

dfc = pd.read_csv(confirmed_f)
dfd = pd.read_csv(deaths_f)

Looking at the data, it is organized with one row per country/region (sometimes per province/state) and the dates in columns.

In [18]:
dfc.head()
Out[18]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20
0 NaN Thailand 15.0000 101.0000 2 3 5 7 8 8 ... 47 48 50 50 50 53 59 70 75 82
1 NaN Japan 36.0000 138.0000 2 1 2 2 4 4 ... 360 420 461 502 511 581 639 639 701 773
2 NaN Singapore 1.2833 103.8333 0 1 3 3 4 5 ... 117 130 138 150 150 160 178 178 200 212
3 NaN Nepal 28.1667 84.2500 0 0 0 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
4 NaN Malaysia 2.5000 112.5000 0 0 0 3 4 4 ... 50 83 93 99 117 129 149 149 197 238

5 rows × 57 columns

So to prepare the data for time series analysis, the first thing we'll do is drop the latitude and longitude columns, group the data so we have one row per country, and then transpose so each row is a date.

In addition, we'll convert the index into a datetime index.

In [19]:
confirmed = dfc.drop(columns=['Lat','Long']).groupby( ['Country/Region']).sum().transpose()
confirmed.index = pd.to_datetime(confirmed.index)

dfd = pd.read_csv(deaths_f)
deaths = dfd.drop(columns=['Lat','Long']).groupby( ['Country/Region']).sum().transpose()
deaths.index = pd.to_datetime(deaths.index)
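
To make the reshaping concrete, here is the same drop/groupby/transpose pipeline on a tiny made-up frame (the regions and numbers are invented; the toy also drops Province/State so that sum() only sees numeric columns):

```python
import pandas as pd

# A tiny made-up frame in the same layout as the CSV:
# one row per region, dates as columns
raw = pd.DataFrame({
    'Province/State': [None, 'Hubei', 'Guangdong'],
    'Country/Region': ['Thailand', 'China', 'China'],
    'Lat': [15.0, 30.9, 23.3],
    'Long': [101.0, 112.3, 113.4],
    '1/22/20': [2, 444, 26],
    '1/23/20': [3, 444, 32],
})

# Drop the non-numeric and location columns, sum the regions of each
# country into one row, then flip so each row is a date
tidy = (raw.drop(columns=['Lat', 'Long', 'Province/State'])
           .groupby(['Country/Region']).sum().transpose())
tidy.index = pd.to_datetime(tidy.index)

print(tidy['China'].tolist())   # [470, 476]
```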

It's easy to scan the columns to find the exact spelling of a specific country, as below:

In [22]:
[x for x in confirmed.columns if 'Korea' in x]
Out[22]:
['Korea, South']

You can then define a list of countries you are interested in and focus on those, either looking at the raw numbers or at a simple plot of the data.

In [23]:
countries = [ 'United Kingdom', 'US', 'Italy', 'France', 'Korea, South' ]
deaths[ countries ].tail()
Out[23]:
Country/Region United Kingdom US Italy France Korea, South
2020-03-10 6 28 631 33 54
2020-03-11 8 36 827 48 60
2020-03-12 8 40 827 48 66
2020-03-13 8 47 1266 79 66
2020-03-14 21 54 1441 91 72
In [25]:
confirmed[ countries ].tail(20).plot()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x143f92bd0>

You can also try to extract information for US states. A quick look shows, though, that at different times the states are either broken down by county or summarized at the state level.

Let's first extract only the numbers for the US into confirmed_us. Looking for New York, you'll see that you need to search for both the state itself and the counties containing 'NY'.

In [32]:
confirmed_us = dfc[ dfc['Country/Region'] == 'US']
confirmed_us.head()
Out[32]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 3/5/20 3/6/20 3/7/20 3/8/20 3/9/20 3/10/20 3/11/20 3/12/20 3/13/20 3/14/20
100 Washington US 47.4009 -121.4905 0 0 0 0 0 0 ... 0 0 0 0 0 267 366 442 568 572
101 New York US 42.1657 -74.9481 0 0 0 0 0 0 ... 0 0 0 0 0 173 220 328 421 525
102 California US 36.1162 -119.6816 0 0 0 0 0 0 ... 0 0 0 0 0 144 177 221 282 340
103 Massachusetts US 42.2302 -71.5301 0 0 0 0 0 0 ... 0 0 0 0 0 92 95 108 123 138
104 Diamond Princess US 35.4437 139.6380 0 0 0 0 0 0 ... 45 45 45 45 45 46 46 46 46 46

5 rows × 57 columns

In [33]:
confirmed_us = dfc[ dfc['Country/Region'] == 'US']
states = [x for x in list(confirmed_us['Province/State']) if 'NY' in x or 'New York' in x]
states
Out[33]:
['New York',
 'Suffolk County, NY',
 'Ulster County, NY',
 'Rockland County, NY',
 'Saratoga County, NY',
 'Nassau County, NY',
 'New York County, NY',
 'Westchester County, NY']

The following logic collapses all the counties into the state, resulting in a single column for the state. (Summing is safe because on any given date the counts appear either in the state row or in the county rows, not both.)

In [36]:
nyc1 = confirmed_us[confirmed_us['Province/State'].isin( states )].drop(columns=['Lat', 'Long'])
nyc1[ 'Province/State' ] = 'New York'
nyc = nyc1.groupby( 'Province/State' ).sum().transpose()
nyc.tail()
Out[36]:
Province/State New York
3/10/20 173
3/11/20 220
3/12/20 328
3/13/20 421
3/14/20 525

We can then wrap this logic into a quick function so we can easily apply it to multiple states.

In [37]:
def FindState(df, statefull, stateshort):
    # Match both the state-level row (exact name) and the county rows (', XX' suffix)
    return [x for x in list(df['Province/State']) if stateshort in x or statefull == x]


def ExtractState(df, statefull, stateshort):
    # Collapse the state row and all its county rows into one column for the state
    states = FindState(df, statefull, stateshort)
    pre = df[df['Province/State'].isin(states)].drop(columns=['Lat', 'Long'])
    pre['Province/State'] = statefull
    return pre.groupby('Province/State').sum().transpose()
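
To see why the collapse works, here is the same pattern on a made-up frame where the early dates report by county and the later dates by state (the county names and numbers are invented; the .copy() just avoids a chained-assignment warning):

```python
import pandas as pd

# Made-up US rows: a state-level row plus two county rows
toy = pd.DataFrame({
    'Province/State': ['New York', 'Nassau County, NY', 'Westchester County, NY'],
    'Lat': [42.2, 40.7, 41.1],
    'Long': [-74.9, -73.6, -73.8],
    '3/9/20': [0, 4, 10],      # early dates: reported by county
    '3/10/20': [173, 0, 0],    # later dates: reported at the state level
})

# Relabel every matching row as the state, then sum and transpose
states = [x for x in list(toy['Province/State']) if ', NY' in x or 'New York' == x]
pre = toy[toy['Province/State'].isin(states)].drop(columns=['Lat', 'Long']).copy()
pre['Province/State'] = 'New York'
nyc = pre.groupby('Province/State').sum().transpose()

print(nyc['New York'].tolist())   # [14, 173]
```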
In [38]:
FindState(confirmed_us, 'Washington', ', WA')
Out[38]:
['Washington',
 'Kitsap, WA',
 'Kittitas County, WA',
 'Clark County, WA',
 'Jefferson County, WA',
 'Pierce County, WA',
 'Grant County, WA',
 'Snohomish County, WA',
 'King County, WA',
 'Skagit, WA',
 'Thurston, WA',
 'Island, WA',
 'Whatcom, WA']

Let's now build a new data frame combining a few states and countries of interest for analysis.

In [40]:
confirmed_st = ExtractState(confirmed_us, 'Washington', ', WA')
confirmed_st['New York'] = ExtractState(confirmed_us, 'New York', ', NY')
confirmed_st['California'] = ExtractState(confirmed_us, 'California', ', CA')

# The state frame still has string dates as its index; convert it so the
# country columns (whose index is already datetime) align on assignment
confirmed_st.index = pd.to_datetime(confirmed_st.index)

confirmed_st['United Kingdom'] = confirmed['United Kingdom']
confirmed_st['Italy'] = confirmed['Italy']
confirmed_st['France'] = confirmed['France']
confirmed_st['Korea, South'] = confirmed['Korea, South']
confirmed_st['US'] = confirmed['US']

confirmed_st.tail()
Out[40]:
Province/State Washington New York California United Kingdom Italy France Korea, South US
3/10/20 267 173 144 384 10149 1787 7513 959
3/11/20 366 220 177 459 12462 2284 7755 1281
3/12/20 442 328 221 459 12462 2284 7869 1663
3/13/20 568 421 282 801 17660 3667 7979 2179
3/14/20 572 525 340 1143 21157 4480 8086 2727

Analysing the data

We can now start looking at the data in more detail.

Plotting a few of the series clearly shows that the growth is exponential.

In [42]:
confirmed_st.tail(20).plot()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x144139690>

It is easy to apply a log transform to see if the series then look linear.

In [13]:
confirmed_st.apply(np.log).tail(15).plot()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x143c1d050>
In [43]:
confirmed_st.apply(np.log).tail(10)
Out[43]:
Province/State Washington New York California United Kingdom Italy France Korea, South US
3/5/20 4.248495 3.135494 3.931826 4.753590 8.257904 5.940171 8.714075 5.379897
3/6/20 4.356709 3.433987 4.077537 5.099866 8.441607 6.486161 8.793764 5.568345
3/7/20 4.624973 4.330733 4.394449 5.332719 8.679822 6.858565 8.859505 5.996452
3/8/20 4.804021 4.663439 4.553877 5.613128 8.905851 7.029088 8.897546 6.249975
3/9/20 4.804021 4.955827 4.615121 5.774552 9.123911 7.100027 8.919721 6.368187
3/10/20 5.587249 5.153292 4.969813 5.950643 9.225130 7.488294 8.924390 6.865891
3/11/20 5.902633 5.393628 5.176150 6.129050 9.430439 7.733684 8.956093 7.155396
3/12/20 6.091310 5.793014 5.398163 6.129050 9.430439 7.733684 8.970686 7.416378
3/13/20 6.342121 6.042633 5.641907 6.685861 9.779057 8.207129 8.984568 7.686621
3/14/20 6.349139 6.263398 5.828946 7.041412 9.959726 8.407378 8.997889 7.910957

Now we need to see if we can fit a linear model to the log series.

Note, though, that the data on 3/12/20 appears to be a repeat of the data on 3/11/20, most likely because no numbers were published that day. We should therefore remove that point, or it would bias the fit; this is the reason for the np.delete(x, -3) lines.
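
As a sanity check on the method itself, fitting a line to the log of a clean synthetic exponential (a made-up series growing 25% per day) recovers the growth rate exactly:

```python
import numpy as np

# A made-up series growing 25% per day: its log is a straight line
# with slope log(1.25), which polyfit should recover
x = np.arange(10)
series = 100 * 1.25 ** x
slope, intercept = np.polyfit(x, np.log(series), 1)

print(round(slope, 4), round(np.log(1.25), 4))
```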

In [47]:
def fit_model(df, which, n):
    # Make sure the index is a datetime index
    df.index = pd.to_datetime(df.index)

    logdf = df.apply(np.log).tail(n)

    x = np.arange(n)
    y = logdf[which].values

    # Drop the third point from the end (3/12/20), which repeats 3/11/20
    x = np.delete(x, -3)
    y = np.delete(y, -3)

    # Fit a line to the log series; reg[0] is the daily growth rate in log units
    reg = np.polyfit(x, y, 1)
    fit_fn = np.poly1d(reg)

    plt.plot(x, np.exp(y))
    plt.plot(x, np.exp(fit_fn(x)))
    plt.title('{} {}'.format(which, reg))
    
fit_model(confirmed_st, 'New York', 10)
In [52]:
fit_model(confirmed_st, 'New York', 10)
plt.show()
fit_model(confirmed_st, 'California', 20)
plt.show()
In [56]:
fit_model(confirmed_st, 'France', 15)
plt.show()
fit_model(confirmed_st, 'United Kingdom', 20)
plt.show()
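
The slope shown in each plot title (the first number of reg) is the daily growth rate of the log series, and it translates directly into a doubling time. A minimal helper, assuming a slope in natural-log units per day:

```python
import numpy as np

def doubling_time(slope):
    # Days for the series to double, given the fitted slope of the log series
    return np.log(2) / slope

# e.g. a fitted slope of 0.23/day means the counts double about every 3 days
print(round(doubling_time(0.23), 2))   # 3.01
```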