Sajal Sharma

Disaster Message Classifier

11.09.2022 -
Pandas, NLTK, Scikit-learn, XGBoost

1. Introduction

In this project, we'll build a multilabel classifier that assigns disaster-related messages to the appropriate categories. This can tell us about the nature of the event so that each message can be routed to the correct organization, enabling faster mobilization of resources.

The project is divided into two separate modules: "Extract, Transform, Load" (ETL) and Machine Learning. As you'll see, the initial data for the project is not clean. We'll take this opportunity to showcase some basic ETL skills and save the clean data to a SQLite database, which can then be loaded by the ML pipeline. Finally, we will train an XGBoost classifier to predict the labels for new messages.

This is a walkthrough notebook of the project. The finished scripts, and a web app with a user interface to allow message classification can be found here.

Note: This project is available as part of some Udacity nanodegrees.

import pandas as pd
import matplotlib
from sqlalchemy import create_engine

2. Data

Data is available to us in two different files: disaster_messages.csv and disaster_categories.csv.

Let's load these two datasets and take a look at their first few rows.

categories = pd.read_csv('../data/disaster_categories.csv')
messages = pd.read_csv('../data/disaster_messages.csv')
messages.head()

   id  message                                            original                                           genre
0  2   Weather update - a cold front from Cuba that c...  Un front froid se retrouve sur Cuba ce matin. ...  direct
1  7   Is the Hurricane over or is it not over            Cyclone nan fini osinon li pa fini                 direct
2  8   Looking for someone but no name                    Patnm, di Maryani relem pou li banm nouvel li ...  direct
3  9   UN reports Leogane 80-90 destroyed. Only Hospi...  UN reports Leogane 80-90 destroyed. Only Hospi...  direct
4  12  says: west side of Haiti, rest of the country ...  facade ouest d Haiti et le reste du pays aujou...  direct

We're only interested in the message column from the disaster_messages dataset. We'll use that column to train our message classifier, and ignore the other columns.

categories.head()

   id  categories
0  2   related-1;request-0;offer-0;aid_related-0;medi...
1  7   related-1;request-0;offer-0;aid_related-1;medi...
2  8   related-1;request-0;offer-0;aid_related-0;medi...
3  9   related-1;request-1;offer-0;aid_related-1;medi...
4  12  related-1;request-0;offer-0;aid_related-0;medi...

categories.iloc[0]['categories']

'related-1;request-0;offer-0;aid_related-0;medical_help-0;medical_products-0;search_and_rescue-0;security-0;military-0;child_alone-0;water-0;food-0;shelter-0;clothing-0;money-0;missing_people-0;refugees-0;death-0;other_aid-0;infrastructure_related-0;transport-0;buildings-0;electricity-0;tools-0;hospitals-0;shops-0;aid_centers-0;other_infrastructure-0;weather_related-0;floods-0;storm-0;fire-0;earthquake-0;cold-0;other_weather-0;direct_report-0'

The disaster_categories dataset contains the category labels for our messages, but in a serialized form: each row holds a single string of name-value pairs separated by semicolons, as the first row printed above shows. We'll need to convert the categories to a format better suited for our ML model.
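To make the serialized format concrete, here is a minimal sketch that parses one such string in plain Python. The helper name `parse_categories` is hypothetical and is not part of the project's scripts; it only illustrates the structure of the string.

```python
def parse_categories(serialized):
    """Turn 'related-1;request-0' into {'related': 1, 'request': 0}."""
    labels = {}
    for pair in serialized.split(";"):
        # split on the last hyphen only, so the label value is always the final part
        name, value = pair.rsplit("-", 1)
        labels[name] = int(value)
    return labels


example = "related-1;request-0;offer-0;aid_related-0"
parsed = parse_categories(example)
print(parsed)
```

Each name becomes a key and each trailing digit becomes an integer label, which is the shape the ETL steps aim for, one column per category.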

3. Extract, Transform, Load!

Let's start by merging the two datasets so that the messages and categories are present in the same dataframe.

df = messages.merge(categories, on='id')
df.head()

   id  message                                            original                                           genre   categories
0  2   Weather update - a cold front from Cuba that c...  Un front froid se retrouve sur Cuba ce matin. ...  direct  related-1;request-0;offer-0;aid_related-0;medi...
1  7   Is the Hurricane over or is it not over            Cyclone nan fini osinon li pa fini                 direct  related-1;request-0;offer-0;aid_related-1;medi...
2  8   Looking for someone but no name                    Patnm, di Maryani relem pou li banm nouvel li ...  direct  related-1;request-0;offer-0;aid_related-0;medi...
3  9   UN reports Leogane 80-90 destroyed. Only Hospi...  UN reports Leogane 80-90 destroyed. Only Hospi...  direct  related-1;request-1;offer-0;aid_related-1;medi...
4  12  says: west side of Haiti, rest of the country ...  facade ouest d Haiti et le reste du pays aujou...  direct  related-1;request-0;offer-0;aid_related-0;medi...

Cleaning and transforming the categories column

Let's start by splitting the categories column into separate columns for each category.

categories = df['categories'].str.split(';', expand=True)
categories.head()

   0          1          2        3              4               5                   6                    7           8           9              ...  26             27                      28                 29        30       31      32            33      34               35
0  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
1  related-1  request-0  offer-0  aid_related-1  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-1  floods-0  storm-1  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
2  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
3  related-1  request-1  offer-0  aid_related-1  medical_help-0  medical_products-1  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
4  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0

5 rows × 36 columns

Extracting the column names from a row of the categories dataframe.

category_column_names = categories.iloc[0].apply(lambda x: x.split("-")[0])
# Let's check which categories we have
print(list(category_column_names))

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

Using the extracted column names as the header of the categories dataframe:

categories.columns = category_column_names
categories.head()

   related    request    offer    aid_related    medical_help    medical_products    search_and_rescue    security    military    child_alone    ...  aid_centers    other_infrastructure    weather_related    floods    storm    fire    earthquake    cold    other_weather    direct_report
0  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
1  related-1  request-0  offer-0  aid_related-1  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-1  floods-0  storm-1  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
2  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
3  related-1  request-1  offer-0  aid_related-1  medical_help-0  medical_products-1  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0
4  related-1  request-0  offer-0  aid_related-0  medical_help-0  medical_products-0  search_and_rescue-0  security-0  military-0  child_alone-0  ...  aid_centers-0  other_infrastructure-0  weather_related-0  floods-0  storm-0  fire-0  earthquake-0  cold-0  other_weather-0  direct_report-0

5 rows × 36 columns

Next, we'll fix the values of the above dataset so that they are binary. A value of 1 indicates that a message belongs to the given category.

for column in categories:
    # the last character of a value indicates its binary label
    categories[column] = categories[column].str[-1].astype(int)

The related column contains values other than 0 or 1, so we will convert the 2s to 1s by restricting the maximum value of a category to 1.

categories['related'].value_counts()

1    20042
0     6140
2      204
Name: related, dtype: int64

categories['related'] = categories['related'].clip(0, 1)
categories['related'].value_counts()

1    20246
0     6140
Name: related, dtype: int64

Let's concatenate the categories dataframe back to our original dataframe.

# drop the original categories column from `df`
df.drop('categories', axis=1, inplace=True)
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head()

   id  message                                            original                                           genre   related  request  offer  aid_related  medical_help  medical_products  ...  aid_centers  other_infrastructure  weather_related  floods  storm  fire  earthquake  cold  other_weather  direct_report
0  2   Weather update - a cold front from Cuba that c...  Un front froid se retrouve sur Cuba ce matin. ...  direct  1        0        0      0            0             0                 ...  0            0                     0                0       0      0     0           0     0              0
1  7   Is the Hurricane over or is it not over            Cyclone nan fini osinon li pa fini                 direct  1        0        0      1            0             0                 ...  0            0                     1                0       1      0     0           0     0              0
2  8   Looking for someone but no name                    Patnm, di Maryani relem pou li banm nouvel li ...  direct  1        0        0      0            0             0                 ...  0            0                     0                0       0      0     0           0     0              0
3  9   UN reports Leogane 80-90 destroyed. Only Hospi...  UN reports Leogane 80-90 destroyed. Only Hospi...  direct  1        1        0      1            0             1                 ...  0            0                     0                0       0      0     0           0     0              0
4  12  says: west side of Haiti, rest of the country ...  facade ouest d Haiti et le reste du pays aujou...  direct  1        0        0      0            0             0                 ...  0            0                     0                0       0      0     0           0     0              0

5 rows × 40 columns

Removing duplicates and NaN rows, if any

# checking for duplicate rows in df
df.duplicated().sum()

171

# drop duplicates
df.drop_duplicates(keep='first', inplace=True)
# drop NaN messages
df.dropna(subset=["message"], axis=0, inplace=True)  # drop the row
# drop the id and original columns as they are not useful for the learning problem
df.drop(["id", "original"], axis=1, inplace=True)

Saving to a database

As part of this project's simulated scenario, we'll save the dataframe to a SQLite database.

database_filename = "disaster_db"
engine = create_engine('sqlite:///' + database_filename)
df.to_sql('messages', engine, if_exists="replace", index=False)

26215
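A quick sanity check after a save like this is to count the stored rows and confirm that no null messages slipped through. The sketch below is only an illustration: it uses Python's built-in sqlite3 module and a toy in-memory table (two messages from the data shown earlier, three columns only) rather than the notebook's SQLAlchemy engine and the full dataframe.

```python
import sqlite3

# In-memory database standing in for the notebook's "disaster_db" file (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (message TEXT, genre TEXT, related INTEGER)")
rows = [
    ("Is the Hurricane over or is it not over", "direct", 1),
    ("Looking for someone but no name", "direct", 1),
]
conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
conn.commit()

# Count all rows, then count rows whose message is missing
row_count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
null_messages = conn.execute(
    "SELECT COUNT(*) FROM messages WHERE message IS NULL"
).fetchone()[0]
print(row_count, null_messages)
conn.close()
```

Against the real file you would connect to the saved database instead of `:memory:`; the row count should then match the number returned by `to_sql`.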

4. Machine Learning Pipeline

Let's start by loading the data back from our database:

import re

import nltk
# uncomment below if NLTK data needs to be downloaded
# nltk.download(['punkt', 'wordnet'])
# nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.multioutput import MultiOutputClassifier

engine = create_engine('sqlite:///' + database_filename)
df = pd.read_sql_query('select * from messages', engine)
# the training data is a numpy array of all messages
X = df['message'].values
# the labels are all the different categories
Y = df.drop(columns=['message', 'genre'])
category_names = Y.columns
len(X)

26215
X[:5]

array(['Weather update - a cold front from Cuba that could pass over Haiti',
       'Is the Hurricane over or is it not over',
       'Looking for someone but no name',
       'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.',
       'says: west side of Haiti, rest of the country today and tonight'],
      dtype=object)

Y.head()

   related  request  offer  aid_related  medical_help  medical_products  search_and_rescue  security  military  child_alone  ...  aid_centers  other_infrastructure  weather_related  floods  storm  fire  earthquake  cold  other_weather  direct_report
0  1        0        0      0            0             0                 0                  0         0         0            ...  0            0                     0                0       0      0     0           0     0              0
1  1        0        0      1            0             0                 0                  0         0         0            ...  0            0                     1                0       1      0     0           0     0              0
2  1        0        0      0            0             0                 0                  0         0         0            ...  0            0                     0                0       0      0     0           0     0              0
3  1        1        0      1            0             1                 0                  0         0         0            ...  0            0                     0                0       0      0     0           0     0              0
4  1        0        0      0            0             0                 0                  0         0         0            ...  0            0                     0                0       0      0     0           0     0              0

5 rows × 36 columns

Splitting the dataset into train/test sets

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print(len(X_train), len(X_test))

20972 5243

Preprocessing the message texts

For the classifier to perform well, we need to preprocess the disaster message texts to standardize and tokenize them. We can define a function to do this, which can be used later when defining our model pipeline:

punctuation_regex = re.compile(r"[^\w\s]")
# use a separate name so the imported `stopwords` corpus is not shadowed
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()
pos_tags_to_lemmatize = ["n", "v"]

def tokenize(text):
    """
    Tokenizes a given text.
    Args:
        text: text string
    Returns:
        tokens: list of tokens
    """
    # lowercase string and remove punctuation
    text = punctuation_regex.sub(" ", text.lower()).strip()
    # tokenize text
    tokens = word_tokenize(text)
    # lemmatize as noun, then as verb, carrying the result forward on each pass
    for pos_tag in pos_tags_to_lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(token, pos=pos_tag) for token in tokens]
    # remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
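To see what the first normalization step does on its own, here is a small sketch that applies the same regular expression to one of the messages shown earlier. It uses only the standard library, so a plain whitespace split stands in for NLTK's `word_tokenize`, and lemmatization and stopword removal are left out; the helper name `normalize` is hypothetical.

```python
import re

punctuation_regex = re.compile(r"[^\w\s]")


def normalize(text):
    """Lowercase, replace punctuation with spaces, and trim the ends."""
    return punctuation_regex.sub(" ", text.lower()).strip()


message = "UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning."
normalized = normalize(message)
# whitespace split is a stand-in for word_tokenize (assumption for this sketch)
rough_tokens = normalized.split()
print(rough_tokens)
```

Note that the hyphen in "80-90" is replaced by a space, so the range becomes two separate tokens.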

Building the model pipeline

def build_model():
    """Builds classification model"""

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(XGBClassifier(learning_rate=0.1)))
    ])

    parameters = {
        "clf__estimator__max_depth": [8, 16],
        "clf__estimator__colsample_bytree": [0.5, 0.75]
    }

    cv = GridSearchCV(pipeline, cv=3, param_grid=parameters, n_jobs=1, scoring="f1_micro")
    return cv
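It helps to estimate how much work this grid search implies before fitting. The arithmetic below is a rough sketch: it assumes the 36 label columns seen earlier, relies on MultiOutputClassifier fitting one estimator per label, and on GridSearchCV refitting the best combination once on the full training set (its default behaviour).

```python
from itertools import product

# Values copied from the parameter grid in build_model()
param_grid = {
    "clf__estimator__max_depth": [8, 16],
    "clf__estimator__colsample_bytree": [0.5, 0.75],
}
cv_folds = 3
n_categories = 36  # one XGBClassifier per label column

combinations = list(product(*param_grid.values()))
pipeline_fits = len(combinations) * cv_folds
# add one pipeline fit for the final refit of the best combination
xgb_fits = (pipeline_fits + 1) * n_categories

print(len(combinations), pipeline_fits, xgb_fits)
```

With `n_jobs=1` all of these fits run sequentially, so fitting can take a while; adding more values to the grid multiplies the cost accordingly.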

Fitting the model

model = build_model()
model.fit(X_train, y_train)

Evaluating the model

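y_preds = model.predict(X_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test.values, y_preds, target_names=category_names))

The report below was produced with the two arguments in the reverse order, so read its precision column as recall and vice versa, and its support column as the number of predicted (not actual) positives. The per-category f1-scores are the same either way.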
                        precision  recall  f1-score  support

related                 0.96       0.82    0.89      4707
request                 0.55       0.79    0.65      622
offer                   0.00       0.00    0.00      0
aid_related             0.66       0.77    0.71      1881
medical_help            0.25       0.59    0.35      186
medical_products        0.27       0.64    0.38      114
search_and_rescue       0.18       0.72    0.29      40
security                0.03       0.33    0.06      9
military                0.32       0.57    0.41      95
child_alone             0.00       0.00    0.00      0
water                   0.71       0.82    0.76      294
food                    0.76       0.80    0.78      579
shelter                 0.60       0.75    0.67      375
clothing                0.48       0.73    0.58      51
money                   0.22       0.69    0.34      42
missing_people          0.13       0.80    0.23      10
refugees                0.24       0.66    0.35      62
death                   0.46       0.79    0.58      136
other_aid               0.15       0.66    0.25      158
infrastructure_related  0.05       0.40    0.09      43
transport               0.25       0.72    0.37      85
buildings               0.38       0.72    0.49      137
electricity             0.26       0.71    0.38      41
tools                   0.00       0.00    0.00      0
hospitals               0.05       0.75    0.10      4
shops                   0.00       0.00    0.00      0
aid_centers             0.04       0.29    0.06      7
other_infrastructure    0.03       0.32    0.05      19
weather_related         0.71       0.85    0.77      1206
floods                  0.58       0.87    0.70      289
storm                   0.63       0.73    0.68      416
fire                    0.28       0.75    0.41      20
earthquake              0.81       0.88    0.84      446
cold                    0.38       0.71    0.49      56
other_weather           0.12       0.39    0.19      76
direct_report           0.45       0.73    0.56      624

micro avg               0.61       0.79    0.69      12830
macro avg               0.33       0.60    0.40      12830
weighted avg            0.72       0.79    0.74      12830
samples avg             0.53       0.66    0.54      12830
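When reading a report like this, keep in mind that scikit-learn's `classification_report` takes the true labels as its first argument and the predictions as its second. The sketch below, on a small set of hypothetical labels and with a hand-written helper (not a scikit-learn function), shows why the order matters: swapping the two arguments exchanges precision and recall, while the F1 score stays the same.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# hypothetical labels for a single category
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

correct_order = precision_recall_f1(y_true, y_pred)
swapped_order = precision_recall_f1(y_pred, y_true)
print(correct_order)
print(swapped_order)
```

Averages that weight categories by support are also affected by the order, because support is counted from whichever array is passed first.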

Calculating accuracy scores per category:

# collect accuracy scores in a dict
category_name_2_accuracy_score = {}
for i in range(len(category_names)):
    category_name_2_accuracy_score[y_test.columns[i]] = accuracy_score(y_test.values[:, i], y_preds[:, i])
print(pd.Series(category_name_2_accuracy_score))

related                 0.813656
request                 0.898722
offer                   0.995232
aid_related             0.774747
medical_help            0.923899
medical_products        0.955369
search_and_rescue       0.973107
security                0.982071
military                0.970437
child_alone             1.000000
water                   0.970818
food                    0.949647
shelter                 0.946786
clothing                0.989701
money                   0.978257
missing_people          0.989701
refugees                0.970818
death                   0.970818
other_aid               0.881556
infrastructure_related  0.935342
transport               0.959756
buildings               0.961472
electricity             0.982262
tools                   0.994278
hospitals               0.989319
shops                   0.995804
aid_centers             0.988747
other_infrastructure    0.956132
weather_related         0.885180
floods                  0.957849
storm                   0.944688
fire                    0.991799
earthquake              0.972344
cold                    0.984360
other_weather           0.949838
direct_report           0.861911
dtype: float64

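This is an imbalanced dataset: for most categories, the large majority of messages are labeled 0. Accuracy is therefore inflated, and several categories above (offer, child_alone, tools, shops) score over 0.99 on accuracy while their f1-score is 0. The f1-score is a better measure of the model's performance.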
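The effect is easy to reproduce with made-up numbers. In the sketch below, a hypothetical category applies to 1% of messages, and a "model" that never predicts it still reaches 99% accuracy while finding none of the positive messages.

```python
n_messages = 1000
n_positive = 10  # hypothetical rare category: 1% of messages

y_true = [1] * n_positive + [0] * (n_messages - n_positive)
y_pred = [0] * n_messages  # never predicts the category

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_messages
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive
print(accuracy, recall)
```

With no true positives, recall is zero and so is the f1-score, which is why f1 exposes a failure that accuracy hides.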

Predicting categories for new messages

def predict(text):
    """Returns a list of predicted categories for the given text"""
    preds = model.predict([text])
    predicted_categories = [category for i, category in enumerate(y_test.columns) if preds[0][i] == 1]
    return predicted_categories

predict("after the floods in our area we are trapped. we need food and shelter ")

['related',
 'request',
 'aid_related',
 'food',
 'shelter',
 'weather_related',
 'floods',
 'direct_report']

5. Further Improvements

This notebook sticks to the basics in order to provide a good baseline model for this classification task. There is plenty of room for improvement here, including but not limited to:

  1. Using word embeddings (GloVe, word2vec) or even sentence embeddings (Universal Sentence Encoder) to represent the message text instead of the counts from a CountVectorizer. This should help the model generalize to similar or unseen words and may improve its scores too.
  2. Some of the category columns, such as related and child_alone, are highly skewed. We can look at adding more data for these categories.
  3. A neural network can be trained for this task. Better yet, a pre-trained, state-of-the-art transformer-based network can be fine-tuned using the data available to us.
  4. Certain classification categories, such as "related" or "child_alone" (no positive examples), are "noisy". These can either be removed, or more data can be acquired to supply the missing examples.
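For points 2 and 4, a first step is to measure how skewed each label column actually is. The sketch below uses a tiny hypothetical set of label columns in plain Python; with the notebook's label dataframe, `Y.mean()` would give the same per-column positive rates.

```python
# hypothetical label columns; the real ones come from the notebook's Y dataframe
label_columns = {
    "related": [1, 1, 1, 0, 1],
    "food": [0, 1, 0, 0, 0],
    "child_alone": [0, 0, 0, 0, 0],
}

# share of messages labeled 1 in each category
positive_rate = {name: sum(values) / len(values) for name, values in label_columns.items()}

# a column with a single class gives a classifier nothing to learn from
single_class = [name for name, rate in positive_rate.items() if rate in (0.0, 1.0)]
print(positive_rate)
print(single_class)
```

Columns flagged this way are candidates for removal or for collecting more examples, as suggested in the list above.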

6. Summary

In this project, we built an ETL pipeline to load messy data that was unsuitable for training, clean and transform it, and save it to a database. We also built an ML pipeline to tokenize message text and train an XGBoost classifier to classify messages into different categories.

We stuck to the basics for building this classifier, and there's plenty of room for improvement in the future using modern NLP architectures.

© 2022 Sajal Sharma.