Disaster Message Classifier
1. Introduction
In this project, we'll build a multilabel classifier to sort messages related to disaster events into appropriate categories. This can help tell us about the nature of the event so that each message can be routed to the right organizations, enabling faster mobilization of resources.
The project will be divided into two separate modules: "Extract, Transform, Load" (ETL) and Machine Learning. As you'll see, the initial data for the project is not clean. We'll take this opportunity to showcase some basic ETL skills and save the cleaned data to a SQLite database, which can then be loaded by the ML pipeline. Finally, we will train an XGBoost classifier to predict the labels for new messages.
This is a walkthrough notebook of the project. The finished scripts, along with a web app that provides a user interface for classifying messages, can be found here.
Note: This project is available as part of some Udacity nanodegrees.
```python
import pandas as pd
import matplotlib
from sqlalchemy import create_engine
```
2. Data
Data is available to us in two files: `disaster_messages.csv` and `disaster_categories.csv`.
Let's load these two datasets and take a look at their first few rows.
```python
categories = pd.read_csv('../data/disaster_categories.csv')
messages = pd.read_csv('../data/disaster_messages.csv')
messages.head()
```
|   | id | message | original | genre |
|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct |
We're only interested in the `message` column from the `disaster_messages` dataset. We'll use that column to train our message classifier and ignore the other columns.
```python
categories.head()
```
|   | id | categories |
|---|---|---|
| 0 | 2 | related-1;request-0;offer-0;aid_related-0;medi... |
| 1 | 7 | related-1;request-0;offer-0;aid_related-1;medi... |
| 2 | 8 | related-1;request-0;offer-0;aid_related-0;medi... |
| 3 | 9 | related-1;request-1;offer-0;aid_related-1;medi... |
| 4 | 12 | related-1;request-0;offer-0;aid_related-0;medi... |
```python
categories.iloc[0]['categories']
```
```
'related-1;request-0;offer-0;aid_related-0;medical_help-0;medical_products-0;search_and_rescue-0;security-0;military-0;child_alone-0;water-0;food-0;shelter-0;clothing-0;money-0;missing_people-0;refugees-0;death-0;other_aid-0;infrastructure_related-0;transport-0;buildings-0;electricity-0;tools-0;hospitals-0;shops-0;aid_centers-0;other_infrastructure-0;weather_related-0;floods-0;storm-0;fire-0;earthquake-0;cold-0;other_weather-0;direct_report-0'
```
The `disaster_categories` dataset contains the category labels for our messages, but in a serialized format. Above, we've taken a closer look at the categories string for the first row. We'll need to convert the categories to a format better suited for our ML model.
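To make the serialized format concrete, here's a minimal throwaway sketch of how one such string breaks down into category/value pairs. This is illustrative only; the ETL step below does the conversion with vectorized pandas operations instead:

```python
# illustrative only: parse a single serialized categories string by hand
raw = categories.iloc[0]['categories']
pairs = dict(item.rsplit('-', 1) for item in raw.split(';'))
print(pairs['related'], pairs['request'])  # e.g. '1' '0'
```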
3. Extract, Transform, Load!
Let's start by merging the two datasets so that the messages and categories are present in the same dataframe.
```python
df = messages.merge(categories, on='id')
df.head()
```
|   | id | message | original | genre | categories |
|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | related-1;request-0;offer-0;aid_related-0;medi... |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | related-1;request-0;offer-0;aid_related-1;medi... |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | related-1;request-0;offer-0;aid_related-0;medi... |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | related-1;request-1;offer-0;aid_related-1;medi... |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | related-1;request-0;offer-0;aid_related-0;medi... |
Cleaning and transforming the categories column
Let's start by splitting the categories column into separate columns for each category.
```python
categories = df['categories'].str.split(';', expand=True)
categories.head()
```
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 1 | related-1 | request-0 | offer-0 | aid_related-1 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-1 | floods-0 | storm-1 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 2 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 3 | related-1 | request-1 | offer-0 | aid_related-1 | medical_help-0 | medical_products-1 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 4 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |

5 rows × 36 columns
Extracting the column names from the first row of the categories dataframe:
```python
category_column_names = categories.iloc[0].apply(lambda x: x.split("-")[0])
# Let's check which categories we have
print([name for name in category_column_names])
```
```
['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']
```
Using the extracted column names as the header of the categories dataframe:
```python
categories.columns = category_column_names
categories.head()
```
|   | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 1 | related-1 | request-0 | offer-0 | aid_related-1 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-1 | floods-0 | storm-1 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 2 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 3 | related-1 | request-1 | offer-0 | aid_related-1 | medical_help-0 | medical_products-1 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |
| 4 | related-1 | request-0 | offer-0 | aid_related-0 | medical_help-0 | medical_products-0 | search_and_rescue-0 | security-0 | military-0 | child_alone-0 | ... | aid_centers-0 | other_infrastructure-0 | weather_related-0 | floods-0 | storm-0 | fire-0 | earthquake-0 | cold-0 | other_weather-0 | direct_report-0 |

5 rows × 36 columns
Next, we'll fix the values of the above dataframe so that they are binary: 1 indicates that a message belongs to the given category.
```python
for column in categories:
    # the last character of a value indicates its binary label
    categories[column] = categories[column].str[-1].astype(int)
```
The `related` column contains values other than 0 or 1, so we will convert the 2s to 1s by restricting the maximum value of a category to 1.
```python
categories['related'].value_counts()
```
```
1    20042
0     6140
2      204
Name: related, dtype: int64
```
```python
categories['related'] = categories['related'].clip(0, 1)
categories['related'].value_counts()
```
```
1    20246
0     6140
Name: related, dtype: int64
```
Let's concatenate the categories dataframe back to our original dataframe.
```python
# drop the original categories column from `df`
df.drop('categories', axis=1, inplace=True)
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head()
```
|   | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 40 columns
Removing duplicates and NaN rows, if any
```python
# checking for duplicate rows in df
df.duplicated().sum()
```
```
171
```
```python
# drop duplicates
df.drop_duplicates(keep='first', inplace=True)
# drop rows with NaN messages
df.dropna(subset=["message"], axis=0, inplace=True)
# drop the id and original columns as they are not useful for the learning problem
df.drop(["id", "original"], axis=1, inplace=True)
```
Saving to a database
As part of this project's simulated scenario, we'll use SQLite and save the cleaned dataframe to a database.
```python
database_filename = "disaster_db"
engine = create_engine('sqlite:///' + database_filename)
df.to_sql('messages', engine, if_exists="replace", index=False)
```
```
26215
```
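As an optional sanity check (using the same `engine` and table name as above), we can query the table we just wrote and confirm the row count matches:

```python
# optional sanity check: count the rows we just saved
pd.read_sql_query('SELECT COUNT(*) AS row_count FROM messages', engine)
```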
4. Machine Learning Pipeline
Let's start by loading the data back from our database:
```python
import re

import nltk
# uncomment below if NLTK data needs to be downloaded
# nltk.download(['punkt', 'wordnet'])
# nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.multioutput import MultiOutputClassifier

engine = create_engine('sqlite:///' + database_filename)
df = pd.read_sql_query('select * from messages', engine)
# the training data is a numpy array of all messages
X = df['message'].values
# the labels are all the different categories
Y = df.drop(columns=['message', 'genre'], axis=1)
category_names = Y.columns
len(X)
```
```
26215
```
```python
X[:5]
```
```
array(['Weather update - a cold front from Cuba that could pass over Haiti',
       'Is the Hurricane over or is it not over',
       'Looking for someone but no name',
       'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.',
       'says: west side of Haiti, rest of the country today and tonight'],
      dtype=object)
```
```python
Y.head()
```
|   | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

5 rows × 36 columns
Splitting the dataset into train/test sets
```python
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print(len(X_train), len(X_test))
```
```
20972 5243
```
Preprocessing the message texts
For the classifier to perform well, we need to preprocess the disaster message texts to standardize and tokenize them. We can define a function to do this, which can be used later when defining our model pipeline:
```python
punctuation_regex = re.compile(r"[^\w\s]")
stop_words = stopwords.words('english')
wordnet_lemmatizer = WordNetLemmatizer()
pos_tags_to_lemmatize = ["n", "v"]

def tokenize(text):
    """
    Tokenizes a given text.
    Args:
        text: text string
    Returns:
        tokens: list of tokens
    """
    # lowercase string and remove punctuation
    text = punctuation_regex.sub(" ", text.lower()).strip()
    # tokenize text
    tokens = word_tokenize(text)
    # lemmatize tokens based on pos tags, first as nouns, then as verbs
    for pos_tag in pos_tags_to_lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(token, pos=pos_tag) for token in tokens]
    # remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
```
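A quick spot check of the tokenizer on a made-up message (the exact tokens may vary slightly with the installed NLTK data):

```python
tokenize("Water and food needed urgently in flooded areas!")
# expected output, roughly: ['water', 'food', 'need', 'urgently', 'flood', 'area']
```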
Building the model pipeline
```python
def build_model():
    """Builds the classification model."""
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(XGBClassifier(learning_rate=0.1)))
    ])

    parameters = {
        "clf__estimator__max_depth": [8, 16],
        "clf__estimator__colsample_bytree": [0.5, 0.75]
    }

    cv = GridSearchCV(pipeline, cv=3, param_grid=parameters, n_jobs=1, scoring="f1_micro")
    return cv
```
Fitting the model
```python
model = build_model()
model.fit(X_train, y_train)
```
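Since `build_model` returns a `GridSearchCV` instance, the fitted `model` exposes the winning hyperparameter combination. The exact values will vary between runs, so treat the snippet below as a diagnostic rather than part of the pipeline:

```python
# inspect the grid search outcome
print(model.best_params_)
print(f"best cross-validated f1_micro: {model.best_score_:.3f}")
```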
Evaluating the model
```python
y_preds = model.predict(X_test)
# note: classification_report expects the true labels first, then the predictions
print(classification_report(y_test.values, y_preds, target_names=category_names))
```
```
                        precision    recall  f1-score   support

               related       0.96      0.82      0.89      4707
               request       0.55      0.79      0.65       622
                 offer       0.00      0.00      0.00         0
           aid_related       0.66      0.77      0.71      1881
          medical_help       0.25      0.59      0.35       186
      medical_products       0.27      0.64      0.38       114
     search_and_rescue       0.18      0.72      0.29        40
              security       0.03      0.33      0.06         9
              military       0.32      0.57      0.41        95
           child_alone       0.00      0.00      0.00         0
                 water       0.71      0.82      0.76       294
                  food       0.76      0.80      0.78       579
               shelter       0.60      0.75      0.67       375
              clothing       0.48      0.73      0.58        51
                 money       0.22      0.69      0.34        42
        missing_people       0.13      0.80      0.23        10
              refugees       0.24      0.66      0.35        62
                 death       0.46      0.79      0.58       136
             other_aid       0.15      0.66      0.25       158
infrastructure_related       0.05      0.40      0.09        43
             transport       0.25      0.72      0.37        85
             buildings       0.38      0.72      0.49       137
           electricity       0.26      0.71      0.38        41
                 tools       0.00      0.00      0.00         0
             hospitals       0.05      0.75      0.10         4
                 shops       0.00      0.00      0.00         0
           aid_centers       0.04      0.29      0.06         7
  other_infrastructure       0.03      0.32      0.05        19
       weather_related       0.71      0.85      0.77      1206
                floods       0.58      0.87      0.70       289
                 storm       0.63      0.73      0.68       416
                  fire       0.28      0.75      0.41        20
            earthquake       0.81      0.88      0.84       446
                  cold       0.38      0.71      0.49        56
         other_weather       0.12      0.39      0.19        76
         direct_report       0.45      0.73      0.56       624

             micro avg       0.61      0.79      0.69     12830
             macro avg       0.33      0.60      0.40     12830
          weighted avg       0.72      0.79      0.74     12830
           samples avg       0.53      0.66      0.54     12830
```
Calculating accuracy scores per category:
```python
# collect accuracy scores in a dict
category_name_2_accuracy_score = {}
for i in range(len(category_names)):
    category_name_2_accuracy_score[y_test.columns[i]] = accuracy_score(y_test.values[:, i], y_preds[:, i])
print(pd.Series(category_name_2_accuracy_score))
```
```
related                   0.813656
request                   0.898722
offer                     0.995232
aid_related               0.774747
medical_help              0.923899
medical_products          0.955369
search_and_rescue         0.973107
security                  0.982071
military                  0.970437
child_alone               1.000000
water                     0.970818
food                      0.949647
shelter                   0.946786
clothing                  0.989701
money                     0.978257
missing_people            0.989701
refugees                  0.970818
death                     0.970818
other_aid                 0.881556
infrastructure_related    0.935342
transport                 0.959756
buildings                 0.961472
electricity               0.982262
tools                     0.994278
hospitals                 0.989319
shops                     0.995804
aid_centers               0.988747
other_infrastructure      0.956132
weather_related           0.885180
floods                    0.957849
storm                     0.944688
fire                      0.991799
earthquake                0.972344
cold                      0.984360
other_weather             0.949838
direct_report             0.861911
dtype: float64
```
This is an imbalanced dataset: most categories for a given message will be 0, so even a trivial model that always predicts 0 can achieve high accuracy. The f1-score is a better measure of the model's performance.
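To quantify that imbalance, a quick look at the fraction of positive labels per category (a one-off check using the `Y` dataframe from earlier) shows how rare most categories are:

```python
# fraction of positive labels per category, rarest first
print(Y.mean().sort_values().head(10))
# child_alone has no positive examples at all, so its mean is 0.0
```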
Predicting categories for new messages
```python
def predict(text):
    """Returns a list of predicted categories for the given text."""
    preds = model.predict([text])
    predicted_categories = [category for i, category in enumerate(y_test.columns) if preds[0][i] == 1]
    return predicted_categories

predict("after the floods in our area we are trapped. we need food and shelter ")
```
```
['related',
 'request',
 'aid_related',
 'food',
 'shelter',
 'weather_related',
 'floods',
 'direct_report']
```
5. Further Improvements
This notebook sticks to the basics in order to provide a good baseline model for this classification task. There is plenty of room for improvement here, including but not limited to:
- Using word embeddings (GloVe, word2vec) or even sentence embeddings (Universal Sentence Encoder) to transform the message text, instead of using a CountVectorizer. This should allow the model to generalize better to unseen but similar words and improve accuracy; a minimal sketch of this idea appears after this list.
- Some of the category column values, like `related` and `child_alone`, are highly skewed. We can look at adding more data for these categories.
- A neural network can be trained for this task. Better yet, a pre-trained state-of-the-art transformer-based network can be fine-tuned using the data available to us.
- Certain classification categories are "noisy", such as "related", or have no positive examples at all, such as "child_alone". These can either be removed, or more data can be acquired to provide the missing examples for such cases.
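To illustrate the first bullet, here is a minimal sketch of a scikit-learn transformer that represents each message as the average of pre-trained word vectors. Note that `embedding_index` is a hypothetical dict mapping tokens to numpy vectors (e.g. parsed from a GloVe file); it is not defined anywhere in this notebook, and the commented-out pipeline only shows where such a transformer could replace the `CountVectorizer`/`TfidfTransformer` steps:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanEmbeddingVectorizer(BaseEstimator, TransformerMixin):
    """Represents each message as the mean of its tokens' word vectors.

    `embedding_index` is assumed to map token -> 1-D numpy vector of
    length `dim` (e.g. loaded from a GloVe file).
    """
    def __init__(self, embedding_index, dim=100):
        self.embedding_index = embedding_index
        self.dim = dim

    def fit(self, X, y=None):
        # nothing to learn from the data
        return self

    def transform(self, X):
        vectors = []
        for text in X:
            token_vecs = [self.embedding_index[t] for t in tokenize(text)
                          if t in self.embedding_index]
            # messages with no known tokens fall back to a zero vector
            vectors.append(np.mean(token_vecs, axis=0) if token_vecs
                           else np.zeros(self.dim))
        return np.vstack(vectors)

# hypothetical usage, replacing the bag-of-words steps in build_model():
# pipeline = Pipeline([
#     ('embed', MeanEmbeddingVectorizer(embedding_index, dim=100)),
#     ('clf', MultiOutputClassifier(XGBClassifier(learning_rate=0.1))),
# ])
```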
6. Summary
In this project, we built an ETL pipeline to load messy data that was unsuitable for training, clean and transform it, and save it to a database. We also built an ML pipeline to tokenize message text and train an XGBoost classifier that sorts messages into different categories.
We stuck to the basics for building this classifier, and there's plenty of room for improvement in the future using modern NLP architectures.