This blog aims to give a tutorial on "how to write a prediction model in Python in less than a breakfast time".

Most of the top data scientists and Kagglers build their first effective model quickly and submit. The simplest solution provides a bench mark solution to optimize.

This tutorial is to show you how to predict the courtry of destination a user is going to book on Airbnb.

I will divide the whole job into small tasks, with each giving both the explaination and the code demo. Without further ado, let's rock!

The tasks are :

Data exploration

Data cleaning

Data modelling

Step 1 : Data Exploration

1.1 Load libraries
1.2 Load data
1.3 View data features/columns and summary
1.4 Get to know your variables : categorical, numerical, string, ID, target
1.5 First visualization of the data

# 1.1 Load libraries
import pandas as pd
from pandas import Series,DataFrame

# numpy, matplotlib, seaborn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import cross_validation
import xgboost as xgb

# 1.2 Load data
train = pd.read_csv('Kaggle data comp 1/input/train_users_2.csv')
test = pd.read_csv('Kaggle data comp 1/input/test_users.csv')
train['Type'] = 'Train' #Create a flag for Train and Test Data set
test['Type'] = 'Test'
# Combined both Train and Test Data set
fullData = pd.concat([train,test],axis=0)

print train.shape, test.shape

(213451, 17) (62096, 16)

# 1.3 View the column names / summary of the dataset
print '\n---------------------------------------------------------\n'
print 'Information of our data:', fullData.info()
print '\n---------------------------------------------------------\n'


------------------------------------------------------------------

Information of our data:<class 'pandas.core.frame.DataFrame'>
Int64Index: 275547 entries, 0 to 62095
Data columns (total 17 columns):
Type                       275547 non-null object
affiliate_channel          275547 non-null object
affiliate_provider         275547 non-null object
age                        158681 non-null float64
country_destination        213451 non-null object
date_account_created       275547 non-null object
date_first_booking         88908 non-null object
first_affiliate_tracked    269462 non-null object
first_browser              275547 non-null object
first_device_type          275547 non-null object
gender                     275547 non-null object
id                         275547 non-null object
language                   275547 non-null object
signup_app                 275547 non-null object
signup_flow                275547 non-null int64
signup_method              275547 non-null object
timestamp_first_active     275547 non-null int64
dtypes: float64(1), int64(2), object(14)
memory usage: 37.8+ MB
 None

------------------------------------------------------------------

# 1.4 Get to know your variables
# Check the data types of all variables (or columns if you want)
fullData.dtypes

Type                        object
affiliate_channel           object
affiliate_provider          object
age                        float64
country_destination         object
date_account_created        object
date_first_booking          object
first_affiliate_tracked     object
first_browser               object
first_device_type           object
gender                      object
id                          object
language                    object
signup_app                  object
signup_flow                  int64
signup_method               object
timestamp_first_active       int64
dtype: object

# 1.5 First Visualization of the data
# Plot the distribution of the labels
fig, (axis1, axis2) = plt.subplots(1,2,figsize=(15,4))
sns.countplot(x='country_destination', data=fullData, palette="husl", 
              ax = axis1)
sns.countplot(x='country_destination', data=train, palette="husl", 
              ax = axis2)

<matplotlib.axes._subplots.AxesSubplot at 0x11ad87dd0>

Step 2 : Data Cleaning

1.1 Drop out unnecessary variables
1.2 Find out all categorical variables and process them one by one (loop)
- 1.2.1 Visualze the distribution of the variable values
- 1.2.2 (optional) Date-type variable have to be split
- 1.2.3 (optional) Fill NaN values with median values
- 1.2.4 (optional) Make sure the data type of newly created variables is justified, e.g., 'Year', is an 'int' instead of 'float' type
- 1.2.5 (optional) Binary process ('one value' versus 'non- one value')
- 1.2.6 Look for new insights due to the use of new variables
1.3 Convert any remaining non-numerical variables to numerical
1.4 Particular process of some important numerical variables, e.g., age

# 1.1 Drop unecessary columns, these columns won't be useful in 
# analysis and prediction
fullData = fullData.drop(['date_account_created', 
                          'timestamp_first_active'], axis = 1)

# 1.2 Find out all categorical variables (choose those numerical 
# variables, just change 'o' to 'int64')
columnSeries = fullData.columns.to_series()
colType = columnSeries.groupby(fullData.dtypes == 'O').groups 
colType[True]

['Type',
 'affiliate_channel',
 'affiliate_provider',
 'country_destination',
 'date_first_booking',
 'first_affiliate_tracked',
 'first_browser',
 'first_device_type',
 'gender',
 'id',
 'language',
 'signup_app',
 'signup_method']

def FillNaNRandom(df,columnName):
    count_col = len(np.unique(df[columnName].value_counts()))
    count_nan = df[columnName].isnull().sum()
    range_col = df[columnName].value_counts().index
    print "There are %s different values in the list of :\n%s" 
                                % (len(range_col), range_col)

    if count_nan > 0:
        rand = np.random.randint(0, count_col, size = count_nan)
        
        # Create a random selection of the 7 possible categorical 
        # values for all 'count_ran_department' 'NaN' records
        RandforNaN = range_col[rand]

        # Find indexes for the NaN values
        NaNIndex = df[columnName] != df[columnName]
        df[columnName][NaNIndex] = RandforNaN
        print 'There are : %s NaN values in this variable!' 
                                                % count_nan
    else:
        print 'There is no NaN values for this variable!'
    return df

# There are 12 original categorical variables in total
# 1. 'affiliate_channel' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='affiliate_channel', data=fullData, palette="husl", 
              ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'affiliate_channel')

# (3) Binary process
fullData["affiliate_channel"] = 
(fullData["affiliate_channel"] == 'direct').astype(int)

There are 8 different values in the list of :
Index([u'direct', u'sem-brand', u'sem-non-brand', u'seo', u'other', u'api', u'content', u'remarketing'], dtype='object')
There is no NaN values for this variable!

# There are 12 categorical variables in total
# 2. 'affiliate_provider' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='affiliate_provider', data=fullData, palette="husl", 
              ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'affiliate_provider')

# (3) Binary process
fullData["affiliate_provider"] = 
(fullData["affiliate_provider"] == 'direct').astype(int)

There are 18 different values in the list of :
Index([u'direct', u'google', u'other', u'facebook', u'bing', u'craigslist', u'padmapper', u'vast', u'yahoo', u'facebook-open-graph', u'gsp', u'meetup', u'email-marketing', u'naver', u'baidu', u'yandex', u'wayn', u'daum'], dtype='object')
There is no NaN values for this variable!

# There are 12 categorical variables in total
# 3. 'country_destination' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='country_destination', data=fullData, palette="husl", 
              ax=axis1)

# (2) Binary process
fullData["booked"] = 
(fullData['country_destination']!= 'NDF').astype(int)

# There are 12 categorical variables in total
# 4. 'date_first_booking' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='date_first_booking', data=fullData, palette="husl", 
              ax=axis1)

<matplotlib.axes._subplots.AxesSubplot at 0x114d9c2d0>

# Type : Date
# (2) Split a 'date' variable into 'year' and 'month'
def get_year(date):
    # if the value is nan, nan is not equal to another nan
    if date == date: 
        return int(str(date)[:4])
    return date

def get_month(date):
    if date == date: 
        return int(float(str(date)[5:7]))
    return date

# Create Year and Month columns
fullData['Year'] = fullData['date_first_booking'].apply(get_year)
fullData['Month']= fullData['date_first_booking'].apply(get_month)

# (3) Fill NaN of the new created variables 'Year' and 'Month'
fullData['Year'].fillna(fullData['Year'].median(), inplace=True)
fullData['Month'].fillna(fullData['Month'].median(), inplace=True)

type(fullData['Year'].iloc[2]) # numpy.float64

# (4) Be careful of the data type of 'Year' and 'Month' variables. 
# They are float instead of int. Convert 'float' to 'int'
fullData[['Year', 'Month']] = fullData[['Year', 'Month']].astype(int)

# (5) Look for new insights
# NOTICE : in year 2014 and 2015, there wasn't "no-booking"
fig, axis1 = plt.subplots(1,1,figsize=(15,4))
sns.countplot(x="Year",hue="country_destination", data=fullData, 
              palette="husl", order=[2010,2011,2012,2013,2014,2015], 
              ax=axis1)

<matplotlib.axes._subplots.AxesSubplot at 0x1189d3e50>

# There are 12 original categorical variables in total
# 5. 'first_affiliate_tracked' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_affiliate_tracked', data=fullData, 
              palette="husl", ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_affiliate_tracked')

There are 7 different values in the list of :
Index([u'untracked', u'linked', u'omg', u'tracked-other', u'product', u'marketing', u'local ops'], dtype='object')
There are : 6085 NaN values in this variable!

# There are 12 original categorical variables in total
# 6. 'first_browser' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_browser', data=fullData, palette="husl", 
              ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_browser')

There are 55 different values in the list of :
Index([u'Chrome', u'Safari', u'-unknown-', u'Firefox', u'Mobile Safari', u'IE', u'Chrome Mobile', u'Android Browser', u'AOL Explorer', u'Opera', u'Silk', u'IE Mobile', u'BlackBerry Browser', u'Chromium', u'Mobile Firefox', u'Maxthon', u'Apple Mail', u'Sogou Explorer', u'SiteKiosk', u'RockMelt', u'Iron', u'IceWeasel', u'Yandex.Browser', u'Pale Moon', u'CometBird', u'SeaMonkey', u'Camino', u'TenFourFox', u'Opera Mini', u'wOSBrowser', u'CoolNovo', u'Avant Browser', u'Opera Mobile', u'Mozilla', u'OmniWeb', u'SlimBrowser', u'Comodo Dragon', u'Crazy Browser', u'TheWorld Browser', u'Flock', u'Stainless', u'PS Vita browser', u'Googlebot', u'Kindle Browser', u'NetNewsWire', u'Google Earth', u'UC Browser', u'Conkeror', u'Outlook 2007', u'IBrowse', u'Palm Pre web browser', u'IceDragon', u'Nintendo Browser', u'Arora', u'Epic'], dtype='object')
There is no NaN values for this variable!

# There are 12 original categorical variables in total
# 7. 'first_device_type' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='first_device_type', data=fullData, palette="husl", 
              ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'first_device_type')

There are 9 different values in the list of :
Index([u'Mac Desktop', u'Windows Desktop', u'iPhone', u'iPad', u'Other/Unknown', u'Android Phone', u'Android Tablet', u'Desktop (Other)', u'SmartPhone (Other)'], dtype='object')
There is no NaN values for this variable!

# There are 12 original categorical variables in total
# 8. 'Gender' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='gender', data=fullData, palette="husl", ax=axis1)

# (2) Assign other categories to either 'MALE' or 'FEMALE' category
i = 0
def get_gender(gender):
    global i
    if gender != 'FEMALE' and gender != 'MALE':
        return 'FEMALE' if (i%2) else 'MALE'
    i = i+1
    return gender
fullData['gender'] = fullData['gender'].apply(get_gender)

# (3) Map 'MALE'to '0' and 'FEMALE' to '1'
fullData["gender"] = fullData["gender"].map({"FEMALE": 1, "MALE": 0})

# (4) Look for insights
fig, (axis1, axis2) = plt.subplots(2,1,sharex=True,figsize=(15,8))
# frequency of country_destination for every gender
sns.countplot(x="gender",hue="country_destination", data=
              fullData[fullData['country_destination'] != 'NDF'], 
              palette="husl", ax=axis1)
# frequency of booked Vs no-booking users for every gender
sns.countplot(x="gender",hue="booked", data=fullData, palette="husl", ax=axis2)

<matplotlib.axes._subplots.AxesSubplot at 0x1197932d0>

# There are 12 original categorical variables in total
# 9. 'id' variable
# (1) Visualization
fullData['id'].iloc[0] # it shows 'gxn3p5htnn'

'gxn3p5htnn'

# There are 12 original categorical variables in total
# 10. 'language' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='language', data=fullData, palette="husl", ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'language')

# (3) Binary process
fullData["language"] = (fullData["language"] == 'en').astype(int)

There are 26 different values in the list of :
Index([u'en', u'zh', u'fr', u'es', u'ko', u'de', u'it', u'ru', u'ja', u'pt', u'sv', u'nl', u'tr', u'da', u'pl', u'no', u'cs', u'el', u'th', u'hu', u'id', u'fi', u'ca', u'is', u'hr', u'-unknown-'], dtype='object')
There is no NaN values for this variable!

# There are 12 original categorical variables in total
# 11-12. 'signup_method' and 'signup_flow' variable
# (1) Visualization
fig, axis1 = plt.subplots(1,1,sharex=True,figsize=(15,4))
sns.countplot(x='signup_method', data=fullData, palette="husl", ax=axis1)

# (2) Fill NaN values randomly
fullData = FillNaNRandom(fullData, 'signup_method')
fullData = FillNaNRandom(fullData, 'signup_flow')

# (3) Binary process
fullData["signup_method"] = 
                  (fullData["signup_method"] == "basic").astype(int)
fullData["signup_flow"] = (fullData["signup_flow"] == 3).astype(int)

There are 4 different values in the list of :
Index([u'basic', u'facebook', u'google', u'weibo'], dtype='object')
There is no NaN values for this variable!
There are 18 different values in the list of :
Int64Index([0, 25, 12, 3, 2, 23, 24, 1, 8, 6, 21, 5, 20, 16, 15, 14, 10, 4], dtype='int64')
There is no NaN values for this variable!

# 1.3 Process remaining 'object' type variables
# We will create a corresponding unique numerical value 
# for each non-numerical value in a column.
ID_col = 'user_id'
target_col = 'country_destination'

from sklearn import preprocessing

for f in fullData.columns:
    if f == target_col or f == ID_col or f == 'Type': continue
    if fullData[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(np.unique(list(fullData[f].values)))
        fullData[f] = lbl.transform(list(fullData[f].values))

fullData.dtypes

Type                        object
affiliate_channel            int64
affiliate_provider           int64
age                        float64
country_destination         object
date_first_booking           int64
first_affiliate_tracked      int64
first_browser                int64
first_device_type            int64
gender                       int64
id                           int64
language                     int64
signup_app                   int64
signup_flow                  int64
signup_method                int64
booked                       int64
Year                         int64
Month                        int64
dtype: object

# 1.4 Particular process of numerical variable 'age'
# 1.4.1 Find and remove outliers by assigning all age values > 100 
# to NaN, these NaN values will be replaced with real ages below
a = fullData.age.values
fullData['age'] = np.where(a >100, np.nan, a)

# get average, std, and number of NaN values in fullData
average_age   = fullData.age.mean()
std_age       = fullData.age.std()
count_nan_age = fullData.age.isnull().sum()
print count_nan_age

# generate random numbers between (mean - std) & (mean + std)
rand = np.random.randint(average_age - std_age, average_age + std_age,
                         size = count_nan_age)

# fill NaN values in Age column with random values generated
fullData["age"][np.isnan(fullData.age)] = rand

# convert type to integer
fullData['age'] = fullData.age.astype(int)

119556

# 1.4.2 Check our converting, see if it makes sense
fig, (axis1, axis2) = plt.subplots(1,2,figsize=(15,4))

# frequency for age values(in case there was a booking)
fullData['age'][fullData.country_destination != 'NDF'].hist(ax=axis1)

# cut age values into ranges 
fullData['age_range'] = pd.cut(fullData.age, [0, 20, 40, 60, 80, 100])

# frequency of country_destination for every age range
sns.countplot(x="age_range",hue="country_destination", data=
              fullData[fullData.country_destination != 'NDF'], 
                                      palette="husl", ax=axis2)

# drop age_range
fullData.drop(['age_range'], axis=1, inplace=True)

fullData.dtypes

Type                       object
affiliate_channel           int64
affiliate_provider          int64
age                         int64
country_destination        object
date_first_booking          int64
first_affiliate_tracked     int64
first_browser               int64
first_device_type           int64
gender                      int64
id                          int64
language                    int64
signup_app                  int64
signup_flow                 int64
signup_method               int64
booked                      int64
Year                        int64
Month                       int64
dtype: object

Step 3 : Data modelling

Final drop of unnecessary columns
Defining traning and testing sets
Convert 'target_col' to numerical values
Drop variables that are not useful for prediction
Select a model and use the model to predict

# Final drop of unnecessary columns
fullData = fullData.drop(['booked','id','Year','Month'],axis=1)
fullData.dtypes

Type                       object
affiliate_channel           int64
affiliate_provider          int64
age                         int64
country_destination        object
date_first_booking          int64
first_affiliate_tracked     int64
first_browser               int64
first_device_type           int64
gender                      int64
language                    int64
signup_app                  int64
signup_flow                 int64
signup_method               int64
dtype: object

# Defining traning and testing sets
X_train = fullData.ix[fullData.Type == 'Train','Type':]

Y_train = X_train["country_destination"]
X_train = X_train.drop(["Type", 'country_destination'],axis=1)

print('---------')
X_train.info()
print('---------')
X_test  = fullData.ix[fullData.Type == 'Test','Type':]
X_test = X_test.drop(["Type", 'country_destination'],axis=1)
X_test.info()

---------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 213451 entries, 0 to 213450
Data columns (total 12 columns):
affiliate_channel          213451 non-null int64
affiliate_provider         213451 non-null int64
age                        213451 non-null int64
date_first_booking         213451 non-null int64
first_affiliate_tracked    213451 non-null int64
first_browser              213451 non-null int64
first_device_type          213451 non-null int64
gender                     213451 non-null int64
language                   213451 non-null int64
signup_app                 213451 non-null int64
signup_flow                213451 non-null int64
signup_method              213451 non-null int64
dtypes: int64(12)
memory usage: 21.2 MB
---------
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62096 entries, 0 to 62095
Data columns (total 12 columns):
affiliate_channel          62096 non-null int64
affiliate_provider         62096 non-null int64
age                        62096 non-null int64
date_first_booking         62096 non-null int64
first_affiliate_tracked    62096 non-null int64
first_browser              62096 non-null int64
first_device_type          62096 non-null int64
gender                     62096 non-null int64
language                   62096 non-null int64
signup_app                 62096 non-null int64
signup_flow                62096 non-null int64
signup_method              62096 non-null int64
dtypes: int64(12)
memory usage: 6.2 MB

# Convert 'country_destination' to numerical values
# Get the name list of all countries 
range_countries = fullData.country_destination.value_counts().index

# Get the dictionary of the target column
country_num_dic = dict(zip(range_countries,
                           range(len(range_countries))))

print country_num_dic

{'FR': 3, 'NL': 9, 'PT': 11, 'CA': 7, 'DE': 8, 'IT': 4, 'US': 1, 'NDF': 0, 'other': 2, 'AU': 10, 'GB': 5, 'ES': 6}

# Create training labels
Y_train    = train['country_destination'].map(country_num_dic)

# Select the model 'Random Forests'
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

# Use 'Random Forest' model to predict on the training set
Y_pred = random_forest.predict(X_test)

random_forest.score(X_train, Y_train)

0.99654721692566439

# Use 'Random Forest' model to predict on the testing set
random_forest.score(X_test, Y_pred)

1.0