Sajal Sharma

911 Calls - Exploratory Data Analysis

13.10.2016 -
Python, Pandas, Seaborn

Introduction

For this project we'll analyze the 911 call dataset from Kaggle. The data contains the following fields:

  • lat: Float variable, Latitude
  • lng: Float variable, Longitude
  • desc: String variable, Description of the Emergency Call
  • zip: Float variable, Zipcode
  • title: String variable, Title of the Emergency Call
  • timeStamp: String variable, YYYY-MM-DD HH:MM:SS
  • twp: String variable, Township
  • addr: String variable, Address
  • e: Integer variable, Dummy variable (always 1)

Let's start with some data analysis and visualisation imports.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (6, 4)

# Reading the data
df = pd.read_csv('data/911.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat 99492 non-null float64
lng 99492 non-null float64
desc 99492 non-null object
zip 86637 non-null float64
title 99492 non-null object
timeStamp 99492 non-null object
twp 99449 non-null object
addr 98973 non-null object
e 99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
# Checking the head of the dataframe
df.head()

Output:

|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e |
|---|-----|-----|------|-----|-------|-----------|-----|------|---|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
| 3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
| 4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |

Basic Analysis

Let's check out the top 5 zipcodes for calls.

df['zip'].value_counts().head(5)
19401.0 6979
19464.0 6643
19403.0 4854
19446.0 4748
19406.0 3174
Name: zip, dtype: int64
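As df.info() showed earlier, zip has missing values (86,637 non-null out of 99,492), and value_counts drops NaN entries by default. A quick sketch of both behaviours on stand-in values (illustrative only, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Stand-in values for the 'zip' column (illustrative only)
zips = pd.Series([19401.0, 19401.0, np.nan, 19464.0])

counts = zips.value_counts()     # NaN entries are dropped by default
n_missing = zips.isnull().sum()  # how many calls lack a zipcode
```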

The top townships for the calls were as follows:

df['twp'].value_counts().head(5)
LOWER MERION 8443
ABINGTON 5977
NORRISTOWN 5890
UPPER MERION 5227
CHELTENHAM 4575
Name: twp, dtype: int64

For over 99,000 entries, how many unique call titles did we have?

df['title'].nunique()
110

Data Wrangling for Feature Creation

We can extract some generalised features from the columns in our dataset for further analysis.

In the title column, there's a kind of 'subcategory' or 'reason for call' allotted to each entry (denoted by the text before the colon).

The timestamp column can also be broken down into Hour, Month, and Day of Week.

Let's start with creating a 'Reason' feature for each call.

df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df.tail()

Output:

|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason |
|---|-----|-----|------|-----|-------|-----------|-----|------|---|--------|
| 99487 | 40.132869 | -75.333515 | MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... | 19401.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:06:00 | NORRISTOWN | MARKLEY ST & W LOGAN ST | 1 | Traffic |
| 99488 | 40.006974 | -75.289080 | LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... | 19003.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:07:02 | LOWER MERION | LANCASTER AVE & RITTENHOUSE PL | 1 | Traffic |
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic |
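As an aside, the same 'Reason' feature can be derived without a Python-level apply, using pandas' vectorized string methods. A minimal sketch on stand-in titles (illustrative values in the dataset's format):

```python
import pandas as pd

# Stand-in titles (illustrative values only)
titles = pd.Series(['EMS: BACK PAINS/INJURY',
                    'Fire: GAS-ODOR/LEAK',
                    'Traffic: VEHICLE ACCIDENT -'])

# Split each title on ':' and keep the text before the colon
reasons = titles.str.split(':').str[0]
```

On the full dataframe the equivalent would be `df['title'].str.split(':').str[0]`, which avoids calling a lambda once per row.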

Now, let's find out the most common reason for 911 calls, according to our dataset.

df['Reason'].value_counts()
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64

sns.countplot(x='Reason', data=df)

[Figure: countplot of calls by Reason]

Let's deal with the time information we have. Checking the datatype of the timestamp column.

type(df['timeStamp'][0])
str

As the timestamps are still strings, it'll make our life easier to convert them to DateTime objects, so we can extract the year, month, and day information more intuitively.

df['timeStamp'] = pd.to_datetime(df['timeStamp'])

For a single DateTime object, we can extract information as follows.

time = df['timeStamp'].iloc[0]

print('Hour:', time.hour)
print('Month:', time.month)
print('Day of Week:', time.dayofweek)
Hour: 17
Month: 12
Day of Week: 3

Now let's create new features for the above pieces of information.

df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)

df.head(3)

Output:

|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week |
|---|-----|-----|------|-----|-------|-----------|-----|------|---|--------|------|-------|-------------|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | 3 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | 3 |
| 2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | 12 | 3 |
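In recent pandas versions the same three features can also be pulled out column-wise with the `.dt` accessor, avoiding a lambda per row. A minimal sketch on stand-in timestamps (the real data spans Dec 2015 to Aug 2016):

```python
import pandas as pd

# Stand-in timestamps (illustrative values only)
ts = pd.to_datetime(pd.Series(['2015-12-10 17:40:00',
                               '2016-08-24 11:17:02']))

# Extract the same attributes column-wise, without a per-row lambda
hours = ts.dt.hour
months = ts.dt.month
days = ts.dt.dayofweek  # Monday=0 ... Sunday=6
```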

The Day of Week is an integer, and it might not be instantly clear which number refers to which day. We can map that information to Mon-Sun strings.

dmap = {0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)

Output:

|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week |
|---|-----|-----|------|-----|-------|-----------|-----|------|---|--------|------|-------|-------------|
| 99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS | 11 | 8 | Wed |
| 99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS | 11 | 8 | Wed |
| 99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic | 11 | 8 | Wed |
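In recent pandas versions, `.dt.day_name()` produces weekday names directly, so the mapping dict is optional. A small sketch on stand-in timestamps (illustrative values only):

```python
import pandas as pd

# Stand-in timestamps (illustrative values only)
ts = pd.to_datetime(pd.Series(['2015-12-10 17:40:00',
                               '2016-08-24 11:17:02']))

# Full weekday names, with no manual integer-to-string mapping needed
names = ts.dt.day_name()
```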

Let's combine the newly created features, to check out the most common call reasons based on the day of the week.

sns.countplot(x='Day of Week', hue='Reason', data=df)
plt.legend(bbox_to_anchor=(1.25, 1))

[Figure: countplot of calls by Day of Week, split by Reason]

It makes sense that traffic-related 911 calls are lowest during the weekend; what's also interesting is that EMS calls are lower during the weekend too.

sns.countplot(x='Month', hue='Reason', data=df)

plt.legend(bbox_to_anchor=(1.25, 1))

[Figure: countplot of calls by Month, split by Reason]

Now, let's check out the relationship between the number of calls and the month.

byMonth = df.groupby('Month').count()
byMonth['e'].plot.line()

plt.title('Calls per Month')
plt.ylabel('Number of Calls')

[Figure: line plot of calls per month]

Using seaborn, let's fit the number of calls to a month and see if there's any concrete correlation between the two.

byMonth.reset_index(inplace=True)

sns.lmplot(x='Month', y='e', data=byMonth)
plt.ylabel('Number of Calls')

[Figure: linear fit of calls per month]

So, it does seem that there are fewer emergency calls during the holiday seasons.

Let's extract the date from the timestamp, and see the behavior in a little more detail.

df['Date'] = df['timeStamp'].apply(lambda x: x.date())
df.head(2)

Output:

|   | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week | Date |
|---|-----|-----|------|-----|-------|-----------|-----|------|---|--------|------|-------|-------------|------|
| 0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
| 1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |

Grouping and plotting the data:

df.groupby('Date').count()['e'].plot.line()

plt.legend().remove()
plt.tight_layout()

[Figure: line plot of calls per day]

We can also check out the same plot for each reason separately.

df[df['Reason']=='Traffic'].groupby('Date').count().plot.line(y='e')
plt.title('Traffic')
plt.legend().remove()
plt.tight_layout()

[Figure: Traffic calls per day]

df[df['Reason']=='Fire'].groupby('Date').count().plot.line(y='e')
plt.title('Fire')
plt.legend().remove()
plt.tight_layout()

[Figure: Fire calls per day]

df[df['Reason']=='EMS'].groupby('Date').count().plot.line(y='e')
plt.title('EMS')
plt.legend().remove()
plt.tight_layout()

[Figure: EMS calls per day]
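The three per-reason plots above repeat the same code; grouping once per reason in a loop or dict comprehension keeps things tidier. A minimal sketch on a tiny stand-in frame (illustrative rows only, not the real dataset):

```python
import pandas as pd

# Tiny stand-in frame (illustrative rows only)
df = pd.DataFrame({
    'Reason': ['EMS', 'Traffic', 'EMS', 'Fire', 'Traffic', 'EMS'],
    'Date':   ['2016-01-01', '2016-01-01', '2016-01-02',
               '2016-01-02', '2016-01-02', '2016-01-03'],
})

# One daily-count series per reason; each could then be passed to .plot.line()
daily_by_reason = {
    reason: df[df['Reason'] == reason].groupby('Date').size()
    for reason in ['Traffic', 'Fire', 'EMS']
}
```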

Let's create a heatmap for the counts of calls on each hour, during a given day of the week.

day_hour = df.pivot_table(values='lat', index='Day of Week', columns='Hour', aggfunc='count')

day_hour

Output:

Hour           0    1    2    3    4    5    6    7    8    9  ...   14   15    16    17   18   19   20   21   22   23
Day of Week
Fri          275  235  191  175  201  194  372  598  742  752  ...  932  980  1039   980  820  696  667  559  514  474
Mon          282  221  201  194  204  267  397  653  819  786  ...  869  913   989   997  885  746  613  497  472  325
Sat          375  301  263  260  224  231  257  391  459  640  ...  789  796   848   757  778  696  628  572  506  467
Sun          383  306  286  268  242  240  300  402  483  620  ...  684  691   663   714  670  655  537  461  415  330
Thu          278  202  233  159  182  203  362  570  777  828  ...  876  969   935  1013  810  698  617  553  424  354
Tue          269  240  186  170  209  239  415  655  889  880  ...  943  938  1026  1019  905  731  647  571  462  274
Wed          250  216  189  209  156  255  410  701  875  808  ...  904  867   990  1037  894  686  668  575  490  335

7 rows × 24 columns
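The same table can also be built with groupby plus unstack, which some find more explicit than pivot_table. A minimal sketch on stand-in rows (illustrative day/hour pairs only):

```python
import pandas as pd

# Stand-in rows (illustrative day/hour pairs only)
df = pd.DataFrame({
    'Day of Week': ['Fri', 'Fri', 'Fri', 'Mon'],
    'Hour':        [17, 17, 0, 17],
})

# Count rows per (day, hour) cell, then spread hours across columns;
# missing cells become 0 instead of NaN
day_hour = df.groupby(['Day of Week', 'Hour']).size().unstack(fill_value=0)
```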

Now we can create a heatmap using this new DataFrame.

sns.heatmap(day_hour)

plt.tight_layout()

[Figure: heatmap of call counts by day of week and hour]

We see that most calls take place around the end of office hours on weekdays. We can create a clustermap to pair up similar Hours and Days.

sns.clustermap(day_hour)

[Figure: clustermap of call counts by day of week and hour]

And this concludes the exploratory analysis project.

© 2022 Sajal Sharma.