# The usual preamble
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
# This is necessary to show lots of columns in pandas 0.12.
# Not necessary in pandas 0.13.
pd.set_option('display.width', 5000)
pd.set_option('display.max_columns', 60)
plt.rcParams['figure.figsize'] = (15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the of 311 service requests from NYC Open Data.
# because of mixed types we specify dtype to prevent any errors
complaints = pd.read_csv('../data/311-service-requests.csv', dtype='unicode')
Depending on your pandas version, you might see an error like "DtypeWarning: Columns (8) have mixed types". This means that it's encountered a problem reading in our data. In this case it almost certainly means that it has columns where some of the entries are strings and some are integers.
For now we're going to ignore it and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
When you print a large dataframe, it will only show you the first few rows.
If you don't see this, don't panic! The default behavior for large dataframes changed between pandas 0.12 and 0.13. Previous to 0.13 it would show you a summary of the dataframe. This includes all the columns, and how many non-null values there are in each column.
complaints
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027.0 | 197389.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349.0 | 201984.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088.0 | 246531.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730.0 | 222727.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | NaN | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815.0 | 233545.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
111064 | 26426013 | 10/04/2013 12:01:13 AM | 10/07/2013 04:07:16 PM | DPR | Department of Parks and Recreation | Maintenance or Facility | Structure - Outdoors | Park | 11213 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | BROOKLYN | NaN | NaN | Closed | 10/18/2013 12:01:13 AM | 10/07/2013 04:07:16 PM | 08 BROOKLYN | BROOKLYN | NaN | NaN | Brower Park | BROOKLYN | Brower Park | B012 | NaN | NaN | 7189658900 | Brooklyn, St. Marks, Kingston Avenues, Park Place | BROOKLYN | NY | 11213 | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
111065 | 26428083 | 10/04/2013 12:01:05 AM | 10/04/2013 02:13:50 AM | NYPD | New York City Police Department | Illegal Parking | Posted Parking Sign Violation | Street/Sidewalk | 11434 | NaN | NaN | NaN | NaN | GUY R BREWER BOULEVARD | ROCKAWAY BOULEVARD | INTERSECTION | JAMAICA | NaN | Precinct | Closed | 10/04/2013 08:01:05 AM | 10/04/2013 02:13:50 AM | 13 QUEENS | QUEENS | 1048801.0 | 178419.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.656160 | -73.767353 | (40.656160351546845, -73.76735262738222) |
111066 | 26428987 | 10/04/2013 12:00:45 AM | 10/04/2013 01:25:01 AM | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 10016 | 344 EAST 28 STREET | EAST 28 STREET | MOUNT CARMEL PLACE | 1 AVENUE | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/04/2013 08:00:45 AM | 10/04/2013 01:25:01 AM | 06 MANHATTAN | MANHATTAN | 990637.0 | 208987.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.740295 | -73.976952 | (40.740295354643706, -73.97695165980414) |
111067 | 26426115 | 10/04/2013 12:00:28 AM | 10/04/2013 04:17:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Talking | Club/Bar/Restaurant | 11226 | 1233 FLATBUSH AVENUE | FLATBUSH AVENUE | AVENUE D | NEWKIRK AVENUE | NaN | NaN | ADDRESS | BROOKLYN | NaN | Precinct | Closed | 10/04/2013 08:00:28 AM | 10/04/2013 04:17:32 AM | 14 BROOKLYN | BROOKLYN | 996654.0 | 172515.0 | Unspecified | BROOKLYN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.640182 | -73.955306 | (40.64018174662485, -73.95530566958138) |
111068 | 26428033 | 10/04/2013 12:00:10 AM | 10/04/2013 01:20:52 AM | NYPD | New York City Police Department | Blocked Driveway | Partial Access | Street/Sidewalk | 11236 | 1259 EAST 94 STREET | EAST 94 STREET | AVENUE J | AVENUE K | NaN | NaN | ADDRESS | BROOKLYN | NaN | Precinct | Closed | 10/04/2013 08:00:10 AM | 10/04/2013 01:20:52 AM | 18 BROOKLYN | BROOKLYN | 1011804.0 | 172470.0 | Unspecified | BROOKLYN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.640024 | -73.900717 | (40.640024057399216, -73.90071711703163) |
111069 rows × 52 columns
To select a column, we index with the name of the column, like this:
complaints['Complaint Type']
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent ... 111064 Maintenance or Facility 111065 Illegal Parking 111066 Noise - Street/Sidewalk 111067 Noise - Commercial 111068 Blocked Driveway Name: Complaint Type, Length: 111069, dtype: object
To get the first 5 rows of a dataframe, we can use a slice: df[:5]
.
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.
complaints[:5]
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027.0 | 197389.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349.0 | 201984.0 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088.0 | 246531.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730.0 | 222727.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | NaN | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815.0 | 233545.0 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, -73.94738703491433) |
We can combine these to get the first 5 rows of a column:
complaints['Complaint Type'][:5]
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
and it doesn't matter which direction we do it in:
complaints[:5]['Complaint Type']
0 Noise - Street/Sidewalk 1 Illegal Parking 2 Noise - Commercial 3 Noise - Vehicle 4 Rodent Name: Complaint Type, dtype: object
What if we just want to know the complaint type and the borough, but not the rest of the information? Pandas makes it really easy to select a subset of the columns: just index with list of columns you want.
complaints[['Complaint Type', 'Borough']]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
... | ... | ... |
111064 | Maintenance or Facility | BROOKLYN |
111065 | Illegal Parking | QUEENS |
111066 | Noise - Street/Sidewalk | MANHATTAN |
111067 | Noise - Commercial | BROOKLYN |
111068 | Blocked Driveway | BROOKLYN |
111069 rows × 2 columns
That showed us a summary, and then we can look at the first 10 rows:
complaints[['Complaint Type', 'Borough']][:10]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
5 | Noise - Commercial | QUEENS |
6 | Blocked Driveway | QUEENS |
7 | Noise - Commercial | QUEENS |
8 | Noise - Commercial | MANHATTAN |
9 | Noise - Commercial | BROOKLYN |
This is a really easy question to answer! There's a .value_counts()
method that we can use:
complaints['Complaint Type'].value_counts()
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 ... Trans Fat 1 X-Ray Machine/Equipment 1 Ferry Permit 1 DWD 1 Snow 1 Name: Complaint Type, Length: 165, dtype: int64
If we just wanted the top 10 most common complaints, we can do this:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
HEATING 14200 GENERAL CONSTRUCTION 7471 Street Light Condition 7117 DOF Literature Request 5797 PLUMBING 5373 PAINT - PLASTER 5149 Blocked Driveway 4590 NONCONST 3998 Street Condition 3473 Illegal Parking 3343 Name: Complaint Type, dtype: int64
But it gets better! We can plot them!
complaint_counts[:10].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x11a854e50>