# Render our plots inline
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
You can read data from a CSV file using the read_csv function. By default, it assumes that the fields are comma-separated.
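To see what the default separator means in practice, here is a minimal sketch that parses small in-memory strings instead of a file. The sample text and the variable names (`comma_text`, `semicolon_text`) are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data, invented for this illustration
comma_text = "name,count\nA,1\nB,2\n"
semicolon_text = "name;count\nA;1\nB;2\n"

# With the default separator (a comma), each row is split into two fields
comma_df = pd.read_csv(io.StringIO(comma_text))
print(comma_df.columns.tolist())

# The same call on semicolon-separated text keeps each row as a single field
unsplit_df = pd.read_csv(io.StringIO(semicolon_text))
print(unsplit_df.columns.tolist())
```

The second result has only one column, because no comma was found to split on.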
We're going to be looking at some cyclist data from Montréal. Here's the original page (in French), but it's already included in this repository. We're using the data from 2012.
This dataset is a list of how many people were on 7 different bike paths in Montréal, each day.
broken_df = pd.read_csv('../data/bikes.csv', encoding='ISO-8859-1')
# Look at the first 3 rows
broken_df[:3]
 | Date;Berri 1;Brébeuf (données non disponibles);Côte-Sainte-Catherine;Maisonneuve 1;Maisonneuve 2;du Parc;Pierre-Dupuy;Rachel1;St-Urbain (données non disponibles) |
---|---|
0 | 01/01/2012;35;;0;38;51;26;10;16; |
1 | 02/01/2012;83;;1;68;153;53;6;43; |
2 | 03/01/2012;135;;2;104;248;89;3;58; |
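You'll notice that this is totally broken! Every row has ended up in a single column, because the fields are separated by semicolons rather than commas. read_csv has a bunch of options that will let us fix that, though. Here we'll
- change the column separator to a ;
- set the encoding to 'latin1' (the default is 'utf8')
- parse the dates in the 'Date' column
- tell it that our dates have the day first instead of the month first
- set the index to be the 'Date' column
fixed_df = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')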
fixed_df[:3]
Date | Berri 1 | Brébeuf (données non disponibles) | Côte-Sainte-Catherine | Maisonneuve 1 | Maisonneuve 2 | du Parc | Pierre-Dupuy | Rachel1 | St-Urbain (données non disponibles) |
---|---|---|---|---|---|---|---|---|---|
2012-01-01 | 35 | NaN | 0 | 38 | 51 | 26 | 10 | 16 | NaN |
2012-01-02 | 83 | NaN | 1 | 68 | 153 | 53 | 6 | 43 | NaN |
2012-01-03 | 135 | NaN | 2 | 104 | 248 | 89 | 3 | 58 | NaN |
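As a rough sketch of what the read_csv options used above do, the snippet below applies the same arguments to a tiny in-memory sample instead of the real file. The sample rows and the names `raw_bytes` and `sample_df` are assumptions made for illustration; the bytes are encoded as Latin-1 on purpose so that `encoding='latin1'` has something to decode.

```python
import io
import pandas as pd

# Hypothetical sample laid out like bikes.csv: semicolon-separated,
# day-first dates, Latin-1 encoded, with an empty last column.
raw_bytes = (
    "Date;Berri 1;Brébeuf\n"
    "01/01/2012;35;\n"
    "02/01/2012;83;\n"
).encode("latin1")

sample_df = pd.read_csv(
    io.BytesIO(raw_bytes),
    sep=";",               # fields are separated by semicolons
    encoding="latin1",     # decode the accented characters correctly
    parse_dates=["Date"],  # turn the Date strings into timestamps
    dayfirst=True,         # read 02/01/2012 as 2 January, not 1 February
    index_col="Date",      # use the dates as the row index
)

print(sample_df.index)
print(sample_df.columns.tolist())
```

The empty last field in each row is read as a missing value, which is why the real data shows NaN in the columns marked "données non disponibles".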
When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.
Here's an example:
fixed_df['Berri 1']
Date
2012-01-01      35
2012-01-02      83
2012-01-03     135
2012-01-04     144
2012-01-05     197
              ...
2012-11-01    2405
2012-11-02    1582
2012-11-03     844
2012-11-04     966
2012-11-05    2247
Name: Berri 1, Length: 310, dtype: int64
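The dictionary analogy can be sketched with a small hand-built DataFrame. The column names and values below are invented for illustration and are not part of the bike data.

```python
import pandas as pd

# Hypothetical counts for two made-up bike paths
small_df = pd.DataFrame({"Path A": [35, 83, 135], "Path B": [10, 6, 3]})

# Like a dictionary lookup: one column name gives back one column (a Series)
path_a = small_df["Path A"]
print(type(path_a).__name__)

# Membership tests check the column names, as they would check dictionary keys
print("Path A" in small_df)
print("Path C" in small_df)

# A list of names selects several columns and gives back a DataFrame
both = small_df[["Path A", "Path B"]]
print(both.shape)
```

A single name returns a Series, which is the kind of object printed above for 'Berri 1'.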
To plot a column, just add .plot() to the end! How could it be easier? =)
We can see that, unsurprisingly, not many people are biking in January, February, and March.
fixed_df['Berri 1'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11da42eb0>
We can also plot all the columns just as easily. We'll make it a little bigger, too. You can see that it's more squished together, but all the bike paths behave basically the same -- if it's a bad day for cyclists, it's a bad day everywhere.
fixed_df.plot(figsize=(15, 10))
<matplotlib.axes._subplots.AxesSubplot at 0x11dadd8e0>
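The `<matplotlib.axes...>` line printed under each plot is the return value of .plot(): a matplotlib Axes object. Here is a minimal sketch of keeping that object and using it to label the figure. The data, title and label text are invented for illustration, and the `Agg` backend is selected only so the snippet can run outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical daily counts indexed by date
dates = pd.date_range("2012-01-01", periods=4, freq="D")
counts = pd.Series([35, 83, 135, 144], index=dates, name="Path A")

# .plot() draws the line and returns the Axes it drew on
ax = counts.plot(figsize=(15, 5))

# The Axes can then be customised with ordinary matplotlib calls
ax.set_title("Cyclists per day")
ax.set_ylabel("Count")
```

In a notebook, ending the cell with a semicolon or assigning the result to a variable keeps the object's repr from being printed.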
Here's the code we needed to write to draw that graph, all together:
df = pd.read_csv('../data/bikes.csv', sep=';', encoding='latin1', parse_dates=['Date'], dayfirst=True, index_col='Date')
df['Berri 1'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11d6bb9d0>