911 Calls - Exploratory Data Analysis
Introduction
For this project we'll analyze the 911 call dataset from Kaggle. The data contains the following fields:
- lat: Float variable, Latitude
- lng: Float variable, Longitude
- desc: String variable, Description of the Emergency Call
- zip: Float variable, Zipcode (contains missing values)
- title: String variable, Title of the Emergency Call
- timeStamp: String variable, YYYY-MM-DD HH:MM:SS
- twp: String variable, Township
- addr: String variable, Address
- e: Integer variable, Dummy variable (always 1)
Let's start with some data analysis and visualisation imports.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (6, 4)

# Reading the data
df = pd.read_csv('data/911.csv')
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat          99492 non-null float64
lng          99492 non-null float64
desc         99492 non-null object
zip          86637 non-null float64
title        99492 non-null object
timeStamp    99492 non-null object
twp          99449 non-null object
addr         98973 non-null object
e            99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
```
```python
# Checking the head of the dataframe
df.head()
```
Output:
 | lat | lng | desc | zip | title | timeStamp | twp | addr | e
---|---|---|---|---|---|---|---|---|---
0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 |
1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 |
2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 |
3 | 40.116153 | -75.343513 | AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;... | 19401.0 | EMS: CARDIAC EMERGENCY | 2015-12-10 17:40:01 | NORRISTOWN | AIRY ST & SWEDE ST | 1 |
4 | 40.251492 | -75.603350 | CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S... | NaN | EMS: DIZZINESS | 2015-12-10 17:40:01 | LOWER POTTSGROVE | CHERRYWOOD CT & DEAD END | 1 |
Basic Analysis
Let's check out the top 5 zipcodes for calls.
```python
df['zip'].value_counts().head(5)
```

```
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64
```
The top townships for the calls were as follows:
```python
df['twp'].value_counts().head(5)
```

```
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64
```
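For intuition, `value_counts` sorts by frequency in descending order, which is why `head(5)` gives the top townships directly. A minimal sketch on made-up labels (not from the dataset):

```python
import pandas as pd

# Made-up township labels, just to illustrate how value_counts orders results
twp = pd.Series(['NORRISTOWN', 'ABINGTON', 'NORRISTOWN', 'ABINGTON', 'NORRISTOWN'])

top = twp.value_counts().head(2)
print(top.index.tolist())  # ['NORRISTOWN', 'ABINGTON']
print(top.tolist())        # [3, 2]
```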
For 90k+ entries, how many unique call titles did we have?
```python
df['title'].nunique()
```

```
110
```
Data Wrangling for Feature Creation
We can extract some generalised features from the columns in our dataset for further analysis.
In the title column, there's a kind of 'subcategory' or 'reason for call' allotted to each entry (denoted by the text before the colon).
The timeStamp column can be further split into Hour, Month and Day of Week features too.
Let's start with creating a 'Reason' feature for each call.
```python
df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df.tail()
```
Output:
 | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason
---|---|---|---|---|---|---|---|---|---|---
99487 | 40.132869 | -75.333515 | MARKLEY ST & W LOGAN ST; NORRISTOWN; 2016-08-2... | 19401.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:06:00 | NORRISTOWN | MARKLEY ST & W LOGAN ST | 1 | Traffic |
99488 | 40.006974 | -75.289080 | LANCASTER AVE & RITTENHOUSE PL; LOWER MERION; ... | 19003.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:07:02 | LOWER MERION | LANCASTER AVE & RITTENHOUSE PL | 1 | Traffic |
99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS |
99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS |
99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic |
Now, let's find out the most common reason for 911 calls, according to our dataset.
```python
df['Reason'].value_counts()
```

```
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64
```
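As a quick sanity check on the splitting logic, here's a minimal sketch on a toy frame (the rows below are made up to mimic the dataset's `Category: DETAIL` title format); the vectorised `str` accessor gives the same result as the `apply`/`lambda` approach above:

```python
import pandas as pd

# Toy titles mimicking the 'Category: DETAIL' format (rows are made up)
toy = pd.DataFrame({'title': ['EMS: BACK PAINS/INJURY',
                              'Fire: GAS-ODOR/LEAK',
                              'Traffic: VEHICLE ACCIDENT -']})

# Same idea as the apply/lambda version, via the vectorised str accessor
toy['Reason'] = toy['title'].str.split(':').str[0]
print(toy['Reason'].tolist())  # ['EMS', 'Fire', 'Traffic']
```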
```python
sns.countplot(x='Reason', data=df)
```
Now let's deal with the time information we have, starting by checking the datatype of the timeStamp column.
```python
type(df['timeStamp'].iloc[0])
```

```
str
```
As the timestamps are still strings, it'll make our lives easier to convert them to python DateTime objects, so we can extract the hour, month, and day information more intuitively.
```python
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
```
For a single DateTime object, we can extract information as follows.
```python
time = df['timeStamp'].iloc[0]

print('Hour:', time.hour)
print('Month:', time.month)
print('Day of Week:', time.dayofweek)
```

```
Hour: 17
Month: 12
Day of Week: 3
```
Now let's create new features for the above pieces of information.
```python
df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)

df.head(3)
```
Output:
 | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | 3 |
1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | 3 |
2 | 40.121182 | -75.351975 | HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St... | 19401.0 | Fire: GAS-ODOR/LEAK | 2015-12-10 17:40:00 | NORRISTOWN | HAWS AVE | 1 | Fire | 17 | 12 | 3 |
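As an aside, once a column holds DateTime values, the same features can be pulled out without `apply` by using the `.dt` accessor; a minimal sketch on synthetic timestamps (not taken from the dataset):

```python
import pandas as pd

# Synthetic timestamps, just to show the .dt accessor
ts = pd.Series(pd.to_datetime(['2015-12-10 17:40:00', '2016-08-24 11:17:02']))

print(ts.dt.hour.tolist())       # [17, 11]
print(ts.dt.month.tolist())      # [12, 8]
print(ts.dt.dayofweek.tolist())  # [3, 2]  (Monday=0 ... Sunday=6)
```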
The Day of Week is an integer, and it might not be instantly clear which number refers to which day. We can map that information to Mon-Sun strings.
```python
dmap = {0:'Mon', 1:'Tue', 2:'Wed', 3:'Thu', 4:'Fri', 5:'Sat', 6:'Sun'}
df['Day of Week'] = df['Day of Week'].map(dmap)

df.tail(3)
```
Output:
 | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week
---|---|---|---|---|---|---|---|---|---|---|---|---|---
99489 | 40.115429 | -75.334679 | CHESTNUT ST & WALNUT ST; NORRISTOWN; Station ... | 19401.0 | EMS: FALL VICTIM | 2016-08-24 11:12:00 | NORRISTOWN | CHESTNUT ST & WALNUT ST | 1 | EMS | 11 | 8 | Wed |
99490 | 40.186431 | -75.192555 | WELSH RD & WEBSTER LN; HORSHAM; Station 352; ... | 19002.0 | EMS: NAUSEA/VOMITING | 2016-08-24 11:17:01 | HORSHAM | WELSH RD & WEBSTER LN | 1 | EMS | 11 | 8 | Wed |
99491 | 40.207055 | -75.317952 | MORRIS RD & S BROAD ST; UPPER GWYNEDD; 2016-08... | 19446.0 | Traffic: VEHICLE ACCIDENT - | 2016-08-24 11:17:02 | UPPER GWYNEDD | MORRIS RD & S BROAD ST | 1 | Traffic | 11 | 8 | Wed |
Let's combine the newly created features, to check out the most common call reasons based on the day of the week.
```python
sns.countplot(x='Day of Week', hue='Reason', data=df)
plt.legend(bbox_to_anchor=(1.25, 1))
```
It makes sense that the number of traffic-related 911 calls is lowest during the weekend; what's also interesting is that EMS-related calls dip during the weekend too.
```python
sns.countplot(x='Month', hue='Reason', data=df)

plt.legend(bbox_to_anchor=(1.25, 1))
```
Now, let's check out the relationship between the number of calls and the month.
```python
byMonth = df.groupby('Month').count()
byMonth['e'].plot.line()

plt.title('Calls per Month')
plt.ylabel('Number of Calls')
```
Using seaborn, let's fit the number of calls to a month and see if there's any concrete correlation between the two.
```python
byMonth.reset_index(inplace=True)

sns.lmplot(x='Month', y='e', data=byMonth)
plt.ylabel('Number of Calls')
```
So, it does seem that there are fewer emergency calls during the holiday seasons.
Let's extract the date from the timestamp, and see behavior in a little more detail.
```python
df['Date'] = df['timeStamp'].apply(lambda x: x.date())
df.head(2)
```
Output:
 | lat | lng | desc | zip | title | timeStamp | twp | addr | e | Reason | Hour | Month | Day of Week | Date
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 40.297876 | -75.581294 | REINDEER CT & DEAD END; NEW HANOVER; Station ... | 19525.0 | EMS: BACK PAINS/INJURY | 2015-12-10 17:40:00 | NEW HANOVER | REINDEER CT & DEAD END | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
1 | 40.258061 | -75.264680 | BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP... | 19446.0 | EMS: DIABETIC EMERGENCY | 2015-12-10 17:40:00 | HATFIELD TOWNSHIP | BRIAR PATH & WHITEMARSH LN | 1 | EMS | 17 | 12 | Thu | 2015-12-10 |
Grouping and plotting the data:
```python
df.groupby('Date').count()['e'].plot.line()

plt.tight_layout()
```
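The grouping step above is just a count of rows per date; here's a minimal sketch on a tiny made-up event log (the timestamps are invented for illustration):

```python
import pandas as pd

# A tiny made-up event log to illustrate counting calls per date
ev = pd.DataFrame({'timeStamp': pd.to_datetime(['2016-01-01 08:00',
                                                '2016-01-01 09:30',
                                                '2016-01-02 10:00']),
                   'e': [1, 1, 1]})
ev['Date'] = ev['timeStamp'].dt.date

per_day = ev.groupby('Date').count()['e']
print(per_day.tolist())  # [2, 1]
```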
We can also check out the same plot for each reason separately.
```python
df[df['Reason']=='Traffic'].groupby('Date').count()['e'].plot.line()
plt.title('Traffic')
plt.tight_layout()
```
```python
df[df['Reason']=='Fire'].groupby('Date').count()['e'].plot.line()
plt.title('Fire')
plt.tight_layout()
```
```python
df[df['Reason']=='EMS'].groupby('Date').count()['e'].plot.line()
plt.title('EMS')
plt.tight_layout()
```
Let's create a heatmap for the counts of calls on each hour, during a given day of the week.
```python
day_hour = df.pivot_table(values='lat', index='Day of Week', columns='Hour', aggfunc='count')

day_hour
```
Output:
Day of Week | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Fri | 275 | 235 | 191 | 175 | 201 | 194 | 372 | 598 | 742 | 752 | ... | 932 | 980 | 1039 | 980 | 820 | 696 | 667 | 559 | 514 | 474
Mon | 282 | 221 | 201 | 194 | 204 | 267 | 397 | 653 | 819 | 786 | ... | 869 | 913 | 989 | 997 | 885 | 746 | 613 | 497 | 472 | 325
Sat | 375 | 301 | 263 | 260 | 224 | 231 | 257 | 391 | 459 | 640 | ... | 789 | 796 | 848 | 757 | 778 | 696 | 628 | 572 | 506 | 467
Sun | 383 | 306 | 286 | 268 | 242 | 240 | 300 | 402 | 483 | 620 | ... | 684 | 691 | 663 | 714 | 670 | 655 | 537 | 461 | 415 | 330
Thu | 278 | 202 | 233 | 159 | 182 | 203 | 362 | 570 | 777 | 828 | ... | 876 | 969 | 935 | 1013 | 810 | 698 | 617 | 553 | 424 | 354
Tue | 269 | 240 | 186 | 170 | 209 | 239 | 415 | 655 | 889 | 880 | ... | 943 | 938 | 1026 | 1019 | 905 | 731 | 647 | 571 | 462 | 274
Wed | 250 | 216 | 189 | 209 | 156 | 255 | 410 | 701 | 875 | 808 | ... | 904 | 867 | 990 | 1037 | 894 | 686 | 668 | 575 | 490 | 335
7 rows × 24 columns
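The `aggfunc='count'` pivot simply counts rows for each (day, hour) pair; the `values='lat'` column is only there to give `count` something to tally. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows to illustrate count-style pivoting
toy = pd.DataFrame({'Day of Week': ['Mon', 'Mon', 'Tue', 'Mon'],
                    'Hour': [8, 8, 8, 9],
                    'lat': [40.1, 40.2, 40.3, 40.4]})

counts = toy.pivot_table(values='lat', index='Day of Week',
                         columns='Hour', aggfunc='count')
print(counts.loc['Mon', 8])  # 2.0 (missing day/hour pairs become NaN)
```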
Now let's create a heatmap using this new DataFrame.
```python
sns.heatmap(day_hour)

plt.tight_layout()
```
We see that most calls take place around the end of office hours on weekdays. We can create a clustermap to group similar Hours and Days together.
```python
sns.clustermap(day_hour)
```
And this concludes the exploratory analysis project.