How to design a simple prediction model with Airbnb data
This blog is a tutorial on how to write a prediction model in Python in less time than breakfast takes.
Most top data scientists and Kagglers build their first working model quickly and submit it: the simplest solution provides a benchmark to optimize against.
This tutorial shows you how to predict the country of destination a user will book on Airbnb.
I will divide the whole job into small tasks, each with an explanation and a code demo. Without further ado, let's rock!
The tasks are :
- Data exploration
- Data cleaning
- Data modelling
Step 1 : Data Exploration
- 1.1 Load libraries
- 1.2 Load data
- 1.3 View data features/columns and summary
- 1.4 Get to know your variables : categorical, numerical, string, ID, target
- 1.5 First visualization of the data
In [8]:
# 1.1 Load libraries
import pandas as pd
from pandas import Series,DataFrame
# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
import xgboost as xgb
In [71]:
# 1.2 Load data
train = pd.read_csv('Kaggle data comp 1/input/train_users_2.csv')
test = pd.read_csv('Kaggle data comp 1/input/test_users.csv')
train['Type'] = 'Train' #Create a flag for Train and Test Data set
test['Type'] = 'Test'
# Combined both Train and Test Data set
fullData = pd.concat([train,test],axis=0)
In [72]:
print train.shape, test.shape
In [73]:
# 1.3 View the column names / summary of the dataset
print '\n---------------------------------------------------------\n'
print 'Information of our data:'
fullData.info()   # info() prints its summary directly, it returns None
print '\n---------------------------------------------------------\n'
In [74]:
# 1.4 Get to know your variables
# Check the data types of all variables (or columns if you want)
fullData.dtypes
Out[74]:
In [75]:
# 1.5 First Visualization of the data
# Plot the distribution of the labels
fig, (axis1, axis2) = plt.subplots(1,2,figsize=(15,4))
sns.countplot(x='country_destination', data=fullData, palette="husl",
ax = axis1)
sns.countplot(x='country_destination', data=train, palette="husl",
ax = axis2)
Out[75]:
Step 2 : Data Cleaning
- 2.1 Drop unnecessary variables
- 2.2 Find all categorical variables and process them one by one (loop)
- 2.2.1 Visualize the distribution of the variable values
- 2.2.2 (optional) Date-type variables have to be split
- 2.2.3 (optional) Fill NaN values with median values
- 2.2.4 (optional) Make sure the data type of newly created variables is justified, e.g., 'Year' is an 'int' instead of a 'float' type
- 2.2.5 (optional) Binarize ('one value' versus 'all other values')
- 2.2.6 Look for new insights from the newly created variables
- 2.3 Convert any remaining non-numerical variables to numerical
- 2.4 Special processing of some important numerical variables, e.g., age
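Before touching any column, a quick audit of missing values tells you where the cleaning effort is needed. A minimal sketch (the test rows will naturally show NaN for country_destination, since that is the target we predict):
# Count NaN values per column; show only the columns that have any
missing = fullData.isnull().sum()
print missing[missing > 0].sort_values(ascending=False)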
In [76]:
# 2.1 Drop unnecessary columns; they won't be useful for
# analysis or prediction
fullData = fullData.drop(['date_account_created',
'timestamp_first_active'], axis = 1)
In [77]:
# 2.2 Find all categorical variables (to select the numerical
# ones instead, just change 'O' to 'int64')
columnSeries = fullData.columns.to_series()
colType = columnSeries.groupby(fullData.dtypes == 'O').groups
colType[True]
Out[77]:
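An equivalent, arguably more readable way is pandas' select_dtypes; both approaches should list the same columns (a minimal sketch):
# Object (string) columns are the categorical candidates
cat_cols = fullData.select_dtypes(include=['object']).columns.tolist()
print cat_cols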
In [78]:
def FillNaNRandom(df, columnName):
    # The distinct (non-NaN) values that appear in this column
    range_col = df[columnName].value_counts().index
    count_col = len(range_col)
    count_nan = df[columnName].isnull().sum()
    print "There are %s different values in the list of :\n%s" % (count_col, range_col)
    if count_nan > 0:
        # Pick a random existing value for each NaN record
        rand = np.random.randint(0, count_col, size=count_nan)
        RandforNaN = range_col[rand].values
        # Boolean index marking the NaN rows
        NaNIndex = df[columnName].isnull()
        df.loc[NaNIndex, columnName] = RandforNaN
        print 'There are : %s NaN values in this variable!' % count_nan
    else:
        print 'There is no NaN value for this variable!'
    return df
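Random filling roughly preserves each category's share, but it injects noise and makes runs non-reproducible unless you seed numpy. A deterministic alternative, not what this tutorial uses, is filling with the most frequent value (a sketch):
def FillNaNMode(df, columnName):
    # Replace every NaN with the column's single most frequent value
    df[columnName] = df[columnName].fillna(df[columnName].mode()[0])
    return df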
In [79]:
# There are 12 original categorical variables in total
# 1. 'affiliate_channel' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='affiliate_channel', data=fullData, palette="husl",
ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'affiliate_channel')
# (3) Binarize : 'direct' versus all other channels
fullData["affiliate_channel"] = (fullData["affiliate_channel"] == 'direct').astype(int)
In [80]:
# There are 12 categorical variables in total
# 2. 'affiliate_provider' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='affiliate_provider', data=fullData, palette="husl",
ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'affiliate_provider')
# (3) Binary process
fullData["affiliate_provider"] =
(fullData["affiliate_provider"] == 'direct').astype(int)
In [81]:
# There are 12 categorical variables in total
# 3. 'country_destination' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='country_destination', data=fullData, palette="husl",
ax=axis1)
# (2) Binarize : create a 'booked' flag (0 means 'NDF', no booking)
fullData["booked"] = (fullData['country_destination'] != 'NDF').astype(int)
In [30]:
# There are 12 categorical variables in total
# 4. 'date_first_booking' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='date_first_booking', data=fullData, palette="husl",
ax=axis1)
Out[30]:
In [82]:
# Type : Date
# (2) Split a 'date' variable into 'Year' and 'Month'
def get_year(date):
    # a NaN value is not equal to itself, so NaN passes through
    if date == date:
        return int(str(date)[:4])
    return date

def get_month(date):
    if date == date:
        return int(str(date)[5:7])
    return date
# Create Year and Month columns
fullData['Year'] = fullData['date_first_booking'].apply(get_year)
fullData['Month']= fullData['date_first_booking'].apply(get_month)
# (3) Fill NaN of the newly created variables 'Year' and 'Month'
fullData['Year'].fillna(fullData['Year'].median(), inplace=True)
fullData['Month'].fillna(fullData['Month'].median(), inplace=True)
type(fullData['Year'].iloc[2]) # numpy.float64
# (4) Be careful of the data type of 'Year' and 'Month' variables.
# They are float instead of int. Convert 'float' to 'int'
fullData[['Year', 'Month']] = fullData[['Year', 'Month']].astype(int)
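The string slicing above relies on date_first_booking being formatted 'YYYY-MM-DD'. Under that same assumption, letting pandas parse the dates is a more robust alternative to step (2), shown for reference rather than to be run on top of it (it assumes a reasonably recent pandas):
# Parse the strings into datetimes; missing dates become NaT
dates = pd.to_datetime(fullData['date_first_booking'], errors='coerce')
# year/month come back as floats when NaT is present, so the same
# fillna + astype(int) steps as above still apply afterwards
fullData['Year'] = dates.dt.year
fullData['Month'] = dates.dt.month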
In [83]:
# (5) Look for new insights
# NOTICE : in 2014 and 2015 there are no 'NDF' (no-booking) records
fig, axis1 = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x="Year",hue="country_destination", data=fullData,
palette="husl", order=[2010,2011,2012,2013,2014,2015],
ax=axis1)
Out[83]:
In [84]:
# There are 12 original categorical variables in total
# 5. 'first_affiliate_tracked' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_affiliate_tracked', data=fullData,
palette="husl", ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_affiliate_tracked')
In [85]:
# There are 12 original categorical variables in total
# 6. 'first_browser' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_browser', data=fullData, palette="husl",
ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_browser')
In [86]:
# There are 12 original categorical variables in total
# 7. 'first_device_type' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_device_type', data=fullData, palette="husl",
ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_device_type')
In [87]:
# There are 12 original categorical variables in total
# 8. 'gender' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='gender', data=fullData, palette="husl", ax=axis1)
# (2) Assign other categories to either 'MALE' or 'FEMALE' category
i = 0
def get_gender(gender):
    # Alternate 'MALE'/'FEMALE' assignments for unknown genders
    global i
    if gender != 'FEMALE' and gender != 'MALE':
        i = i + 1
        return 'FEMALE' if (i % 2) else 'MALE'
    return gender
fullData['gender'] = fullData['gender'].apply(get_gender)
# (3) Map 'MALE' to 0 and 'FEMALE' to 1
fullData["gender"] = fullData["gender"].map({"FEMALE": 1, "MALE": 0})
In [88]:
# (4) Look for insights
fig, (axis1, axis2) = plt.subplots(2,1,sharex=True,figsize=(15,8))
# frequency of country_destination for every gender
sns.countplot(x="gender",hue="country_destination", data=
fullData[fullData['country_destination'] != 'NDF'],
palette="husl", ax=axis1)
# frequency of booked Vs no-booking users for every gender
sns.countplot(x="gender",hue="booked", data=fullData, palette="husl", ax=axis2)
Out[88]:
In [89]:
# There are 12 original categorical variables in total
# 9. 'id' variable
# (1) Visualization : 'id' is a unique identifier, one value per
# user, so there is nothing to plot and no predictive value in it
fullData['id'].iloc[0] # it shows 'gxn3p5htnn'
Out[89]:
In [90]:
# There are 12 original categorical variables in total
# 10. 'language' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='language', data=fullData, palette="husl", ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'language')
# (3) Binary process
fullData["language"] = (fullData["language"] == 'en').astype(int)
In [91]:
# There are 12 original categorical variables in total
# 11-12. 'signup_method' and 'signup_flow' variables
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='signup_method', data=fullData, palette="husl", ax=axis1)
# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'signup_method')
fullData = FillNaNRandom(fullData, 'signup_flow')
# (3) Binarize
fullData["signup_method"] = (fullData["signup_method"] == "basic").astype(int)
fullData["signup_flow"] = (fullData["signup_flow"] == 3).astype(int)
In [92]:
# 2.3 Process remaining 'object' type variables
# We will create a corresponding unique numerical value
# for each non-numerical value in a column.
ID_col = 'id'
target_col = 'country_destination'
from sklearn import preprocessing
for f in fullData.columns:
    if f == target_col or f == ID_col or f == 'Type':
        continue
    if fullData[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(np.unique(list(fullData[f].values)))
        fullData[f] = lbl.transform(list(fullData[f].values))
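pd.factorize does the same job as LabelEncoder in a single call, if you prefer staying inside pandas (an equivalent sketch):
for f in fullData.columns:
    if f in (target_col, ID_col, 'Type'):
        continue
    if fullData[f].dtype == 'object':
        # factorize returns (codes, uniques); any remaining NaN becomes -1
        fullData[f], _ = pd.factorize(fullData[f])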
In [93]:
fullData.dtypes
Out[93]:
In [94]:
# 2.4 Special processing of the numerical variable 'age'
# 2.4.1 Find and remove outliers by setting all age values > 100
# to NaN; these NaN values are filled with plausible ages below
a = fullData.age.values
fullData['age'] = np.where(a > 100, np.nan, a)
# get average, std, and number of NaN values in fullData
average_age = fullData.age.mean()
std_age = fullData.age.std()
count_nan_age = fullData.age.isnull().sum()
print count_nan_age
# generate random integers between (mean - std) and (mean + std)
rand = np.random.randint(int(average_age - std_age),
                         int(average_age + std_age),
                         size=count_nan_age)
# fill NaN values in the age column with the random values generated
fullData.loc[fullData.age.isnull(), 'age'] = rand
# convert type to integer
fullData['age'] = fullData.age.astype(int)
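Drawing random ages from mean ± one std keeps the distribution's spread; a simpler deterministic option, replacing the random fill above, is the median (a sketch):
# Deterministic alternative: fill missing ages with the median age
fullData['age'] = fullData['age'].fillna(fullData['age'].median()).astype(int)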
In [96]:
# 2.4.2 Check the conversion, see if it makes sense
fig, (axis1, axis2) = plt.subplots(1,2,figsize=(15,4))
# frequency of age values (where there was a booking)
fullData['age'][fullData.country_destination != 'NDF'].hist(ax=axis1)
# cut age values into ranges
fullData['age_range'] = pd.cut(fullData.age, [0, 20, 40, 60, 80, 100])
# frequency of country_destination for every age range
sns.countplot(x="age_range",hue="country_destination", data=
fullData[fullData.country_destination != 'NDF'],
palette="husl", ax=axis2)
# drop age_range
fullData.drop(['age_range'], axis=1, inplace=True)
In [97]:
fullData.dtypes
Out[97]:
Step 3 : Data modelling
- Final drop of unnecessary columns
- Defining training and testing sets
- Convert 'target_col' to numerical values
- Drop variables that are not useful for prediction
- Select a model and use the model to predict
In [98]:
# Final drop of unnecessary columns
fullData = fullData.drop(['booked','id','Year','Month'],axis=1)
fullData.dtypes
Out[98]:
In [99]:
# Defining training and testing sets
X_train = fullData.loc[fullData.Type == 'Train', 'Type':]
Y_train = X_train["country_destination"]
X_train = X_train.drop(["Type", 'country_destination'], axis=1)
print('---------')
X_train.info()
print('---------')
X_test = fullData.loc[fullData.Type == 'Test', 'Type':]
X_test = X_test.drop(["Type", 'country_destination'], axis=1)
X_test.info()
In [100]:
# Convert 'country_destination' to numerical values
# Get the name list of all countries
range_countries = fullData.country_destination.value_counts().index
# Get the dictionary of the target column
country_num_dic = dict(zip(range_countries,
range(len(range_countries))))
In [101]:
print country_num_dic
In [102]:
# Create numerical training labels (replacing the string labels above)
Y_train = train['country_destination'].map(country_num_dic)
In [103]:
# Select the model 'Random Forests'
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Out[103]:
In [104]:
# Predict on the test set, then check accuracy on the training set
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
Out[104]:
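The training-set score above is optimistic, since the model has already seen those labels. Cross-validation gives a fairer estimate; the cross_validation module imported earlier provides it (a sketch; in modern scikit-learn this lives in sklearn.model_selection instead):
# 3-fold cross-validation accuracy on the training data
scores = cross_validation.cross_val_score(random_forest, X_train, Y_train, cv=3)
print scores.mean()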
In [105]:
# Note: scoring the test set against our own predictions trivially
# gives 1.0 -- the true test labels are hidden on Kaggle, so the
# real check is the leaderboard score after submitting
random_forest.score(X_test, Y_pred)
Out[105]:
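To actually submit, the numerical predictions have to be mapped back to country codes and paired with the test ids. A minimal sketch of building the submission file (the filename is arbitrary):
# Invert the country -> number dictionary to recover country codes
num_country_dic = dict((v, k) for k, v in country_num_dic.items())
submission = pd.DataFrame({'id': test['id'],
                           'country': Series(Y_pred).map(num_country_dic)})
submission[['id', 'country']].to_csv('submission.csv', index=False)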