911 Calls - Exploratory Data Analysis

Skills: Python, Pandas, Seaborn
Github

Introduction

For this project we’ll analyze the 911 call dataset from Kaggle. The data contains the following fields:

  • lat : String variable, Latitude
  • lng: String variable, Longitude
  • desc: String variable, Description of the Emergency Call
  • zip: String variable, Zipcode
  • title: String variable, Title
  • timeStamp: String variable, YYYY-MM-DD HH:MM:SS
  • twp: String variable, Township
  • addr: String variable, Address
  • e: String variable, Dummy variable (always 1)

Let’s start with some data analysis and visualisation imports.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

plt.rcParams['figure.figsize'] = (6, 4)
#Reading the data
df = pd.read_csv('data/911.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat          99492 non-null float64
lng          99492 non-null float64
desc         99492 non-null object
zip          86637 non-null float64
title        99492 non-null object
timeStamp    99492 non-null object
twp          99449 non-null object
addr         98973 non-null object
e            99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
#Checking the head of the dataframe
df.head()
lat lng desc zip title timeStamp twp addr e
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1

Basic Analysis

Let’s check out the top 5 zipcodes for calls.

df['zip'].value_counts().head(5)
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64

The top townships for the calls were as follows:

df['twp'].value_counts().head(5)
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64

For 90k + entries, how many unique call titles did we have?

df['title'].nunique()
110

Data Wrangling for Feature Creation

We can extract some generalised features from the columns in our dataset for further analysis.

In the title column, there’s a kind of ‘subcategory’ or ‘reason for call’ alloted to each entry (denoted by the text before the colon).

The timestamp column can be further segregated into Year, Month and Day of Week too.

Let’s start with creating a ‘Reason’ feature for each call.

df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df.tail()
lat lng desc zip title timeStamp twp addr e Reason
99487 40.132869 -75.333515 MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... 19401.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:06:00 NORRISTOWN MARKLEY ST & W LOGAN ST 1 Traffic
99488 40.006974 -75.289080 LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... 19003.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:07:02 LOWER MERION LANCASTER AVE & RITTENHOUSE PL 1 Traffic
99489 40.115429 -75.334679 CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... 19401.0 EMS: FALL VICTIM 2016-08-24 11:12:00 NORRISTOWN CHESTNUT ST & WALNUT ST 1 EMS
99490 40.186431 -75.192555 WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... 19002.0 EMS: NAUSEA/VOMITING 2016-08-24 11:17:01 HORSHAM WELSH RD & WEBSTER LN 1 EMS
99491 40.207055 -75.317952 MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... 19446.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:17:02 UPPER GWYNEDD MORRIS RD & S BROAD ST 1 Traffic

Now, let’s find out the most common reason for 911 calls, according to our dataset.

df['Reason'].value_counts()
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64
sns.countplot(df['Reason'])
<matplotlib.axes._subplots.AxesSubplot at 0x1165ad710>

png

Let’s deal with the time information we have. Checking the datatype of the timestamp column.

type(df['timeStamp'][0])
str

As the timestamps are still string types, it’ll make our life easier if we convert it to a python DateTime object, so we can extract the year, month, and day information more intuitively.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])

For a single DateTime object, we can extract information as follows.

time = df['timeStamp'].iloc[0]

print('Hour:',time.hour)
print('Month:',time.month)
print('Day of Week:',time.dayofweek)
Hour: 17
Month: 12
Day of Week: 3

Now let’s create new features for the above pieces of information.

df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)
df.head(3)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 3
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 3
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 3

The Day of the Week is an integer and it might not be instantly clear which number refers to which Day. We can map that information to a Mon-Sun string.

dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week
99489 40.115429 -75.334679 CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... 19401.0 EMS: FALL VICTIM 2016-08-24 11:12:00 NORRISTOWN CHESTNUT ST & WALNUT ST 1 EMS 11 8 Wed
99490 40.186431 -75.192555 WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... 19002.0 EMS: NAUSEA/VOMITING 2016-08-24 11:17:01 HORSHAM WELSH RD & WEBSTER LN 1 EMS 11 8 Wed
99491 40.207055 -75.317952 MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... 19446.0 Traffic: VEHICLE ACCIDENT - 2016-08-24 11:17:02 UPPER GWYNEDD MORRIS RD & S BROAD ST 1 Traffic 11 8 Wed

Let’s combine the newly created features, to check out the most common call reasons based on the day of the week.

sns.countplot(df['Day of Week'],hue=df['Reason'])

plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x116d9cbe0>

png

It makes sense for the number of traffic related 911 calls to be the lowest during the weekends, what’s also iteresting is that Emergency Service related calls are also low during the weekend.

sns.countplot(df['Month'],hue=df['Reason'])

plt.legend(bbox_to_anchor=(1.25,1))
<matplotlib.legend.Legend at 0x117dd1c88>

png

Now, let’s check out the relationship between the number of calls and the month.

byMonth = pd.groupby(df,by='Month').count()
byMonth['e'].plot.line(y='e')
plt.title('Calls per Month')
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x1031d65c0>

png

Using seaborn, let’s fit the number of calls to a month and see if there’s any concrete correlation between the two.

byMonth.reset_index(inplace=True)
sns.lmplot(x='Month',y='e',data=byMonth)
plt.ylabel('Number of Calls')
<matplotlib.text.Text at 0x109aa7fd0>

png

So, it does seem that there are fewer emergency calls during the holiday seasons.

Let’s extract the date from the timestamp, and see behavior in a little more detail.

df['Date']=df['timeStamp'].apply(lambda x: x.date())
df.head(2)
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week Date
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station ... 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 Thu 2015-12-10
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 Thu 2015-12-10

Grouping and plotting the data:

pd.groupby(df,'Date').count()['e'].plot.line(y='e')

plt.legend().remove()
plt.tight_layout()

png

We can also check out the same plot for each reason separately.

pd.groupby(df[df['Reason']=='Traffic'],'Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

png

pd.groupby(df[df['Reason']=='Fire'],'Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

png

pd.groupby(df[df['Reason']=='EMS'],'Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()

png

Let’s create a heatmap for the counts of calls on each hour, during a given day of the week.

day_hour = df.pivot_table(values='lat',index='Day of Week',columns='Hour',aggfunc='count')

day_hour
Hour 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
Day of Week
Fri 275 235 191 175 201 194 372 598 742 752 ... 932 980 1039 980 820 696 667 559 514 474
Mon 282 221 201 194 204 267 397 653 819 786 ... 869 913 989 997 885 746 613 497 472 325
Sat 375 301 263 260 224 231 257 391 459 640 ... 789 796 848 757 778 696 628 572 506 467
Sun 383 306 286 268 242 240 300 402 483 620 ... 684 691 663 714 670 655 537 461 415 330
Thu 278 202 233 159 182 203 362 570 777 828 ... 876 969 935 1013 810 698 617 553 424 354
Tue 269 240 186 170 209 239 415 655 889 880 ... 943 938 1026 1019 905 731 647 571 462 274
Wed 250 216 189 209 156 255 410 701 875 808 ... 904 867 990 1037 894 686 668 575 490 335

7 rows × 24 columns

** Now create a HeatMap using this new DataFrame. **

sns.heatmap(day_hour)

plt.tight_layout()

png

We see that most calls take place around the end of office hours on weekdays. We can create a clustermap to pair up similar Hours and Days.

sns.clustermap(day_hour)
<seaborn.matrix.ClusterGrid at 0x11c49f320>

png

And this concludes the exploratory analysis project.