Targeted advertising with Machine Learning

Yu Han Wu
Feb 21, 2021

Introduction

Accurately targeting potential customers can help to reduce the cost of conversion and increase the efficiency of advertising. In this project, we aim to use machine learning models to predict whether a future client will subscribe (yes/no) to a term deposit (variable y).

The dataset

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. Out of the datasets available, our group chose ‘bank-additional-full.csv’ to generate our models. It is the most recent dataset and contains 21 variables. We also used the full dataset with 41188 records rather than the version containing only 10% of the examples. We believe that by using the dataset with the additional variables and the full set of examples, we can construct a more robust machine learning model with higher predictive accuracy.

Step 1: Data Cleaning

The data contained ‘unknown’ strings in numerous cells and an attribute called duration, neither of which is useful in the predictive model. The respective data cleaning was done in R.

“unknown” could be found in the following 6 variables: job, marital, education, default, housing and loan. An “unknown” may represent a missing field. As missing fields have no predictive value, we decided to omit cases with “unknown” in the above variables.
We first let R recognise “unknown” as NA using na.strings, then used na.omit to remove rows containing “unknown”. This reduces our observations from 41188 to 30488, which is still a substantial amount for constructing our machine learning models. We then exported the cleaned data from R in csv format as “finalbank(1).csv”, which we read into python (spyder) for the rest of our analyses.

Duration refers to the length of each phone call. We removed duration as an input for our machine learning models. Duration was only provided in the data for benchmarking purposes by the bank and is not useful for building predictive models: our goal is to predict whether a future client will purchase a term deposit, so at prediction time the call has not yet been made and no duration will be available for a potential client. Thus, duration cannot be included in our model.
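For readers working in python, below is a minimal pandas sketch of the same cleaning steps we performed in R. The semicolon delimiter is an assumption based on the standard UCI format of ‘bank-additional-full.csv’, and the variable names are illustrative.

```python
import pandas as pd

# Read the raw file; the UCI bank-additional files are semicolon-delimited.
# Treating the string "unknown" as a missing value mirrors R's
# na.strings = "unknown".
bank = pd.read_csv("bank-additional-full.csv", sep=";", na_values="unknown")

# Drop rows with any missing field (the equivalent of R's na.omit),
# reducing the data from 41188 to 30488 observations.
bank = bank.dropna()

# Remove duration, which is unavailable before a call is made.
bank = bank.drop(columns=["duration"])

# Export the cleaned data for the modelling steps that follow.
bank.to_csv("finalbank(1).csv", index=False)
```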

Step 2: Data Partitioning

We split our dataset into a 70/30 train-test split, using a random seed of 7, and the held-out test set was used to assess the performance of all the models. Partitioning the data gives us a handle on how well the predictive model will perform on future data by simulating that eventuality: the predictions can be compared against the outcomes that actually occurred to test the accuracy of the model. Refer to Appendix 4 for the code in python (spyder).
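A minimal scikit-learn sketch of the split described above (the exact code is in Appendix 4); the variable names X and y are illustrative.

```python
from sklearn.model_selection import train_test_split

# X holds the input variables and y the target (term deposit: yes/no).
X = bank.drop(columns=["y"])
y = bank["y"]

# 70/30 train-test split with a fixed random seed of 7 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)
```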

Step 3: Handling Imbalanced Dataset

Upon further exploration of the dataset, we found that it is imbalanced. For our target variable ‘y’, the number of ‘no’ instances (26629) significantly outnumbers the ‘yes’ instances (3859), an approximate ratio of 7:1.
As a result, we employ upsampling techniques to balance the dataset. We chose to upsample rather than downsample, because downsampling would leave too small a dataset to work with. The upsampling is done only on the training set, as none of the information in the validation data should be used to create synthetic observations.

[Figure: Data before and after SMOTE]

For our upsampling method, we use SMOTE in python (spyder). For each sample of the minority class, in our case ‘yes’, SMOTE finds its k-nearest neighbours within the minority class. A line is drawn between the sample and one of these neighbours, and a synthetic minority-class sample is created at a random point along that line.
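A minimal sketch using the imbalanced-learn implementation of SMOTE, assuming the inputs have already been converted to numbers (SMOTE interpolates between samples, so in practice the label encoding described in Step 4 must be applied first); the seed of 7 mirrors our partitioning seed.

```python
from imblearn.over_sampling import SMOTE

# Oversample only the training set so that no information from the
# test set leaks into the synthetic observations.
smote = SMOTE(random_state=7)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```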

Step 4: Building the model

Python did not recognise the inputs in our dataset as factors but read them as strings, so we were initially unable to run our models. As a result, we used LabelEncoder to transform the non-numerical labels into numerical ones. It encodes the labels in each column as integer values between 0 and n_classes - 1.
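A minimal sketch of this encoding step, assuming the train/test frames from Step 2; fitting each encoder on both splits together is our choice here, to avoid errors from labels that appear in only one split.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode each non-numeric column as integers 0 .. n_classes - 1.
for col in X_train.select_dtypes(include="object").columns:
    le = LabelEncoder()
    le.fit(pd.concat([X_train[col], X_test[col]]))
    X_train[col] = le.transform(X_train[col])
    X_test[col] = le.transform(X_test[col])
```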

We ran 5 machine learning models: Logistic Regression (LR), CART, K-Nearest Neighbours (KNN), Random Forest (RF) and Support Vector Machine (SVM). We have derived the accuracy scores using the cross_val_score function.
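A sketch of this comparison using scikit-learn defaults; the 10-fold setting is an assumption, as the fold count is not stated above.

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# The five candidate models, with hyperparameters left at their defaults.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}

# Cross-validated accuracy on the balanced training set.
for name, model in models.items():
    scores = cross_val_score(model, X_train_res, y_train_res,
                             cv=10, scoring="accuracy")
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.4f}")
```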

Random Forest has the highest mean accuracy of 0.931 and the lowest standard deviation of 0.0512. We can see why: Random Forest adds additional randomness to the model, and instead of searching for the most important feature when splitting a node, it searches for the best feature among a random subset of features. This wider diversity tends to produce a better model. Furthermore, because Random Forest operates as an ensemble of decision trees, averaging many decorrelated trees reduces variance and, by the law of large numbers, guards against overfitting. It is especially important that our model does not overfit, since we used SMOTE to oversample the training data.

To further improve our basic model, we ran variable importance to determine the variables with the highest importance for our target output y. We set an importance threshold at 95% and kept the 16 variables within that threshold for our improved model.
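A sketch of this selection step, reading the 95% threshold as cumulative importance (our interpretation); it uses the Random Forest's built-in feature_importances_.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit the Random Forest and rank the inputs by importance.
rf = RandomForestClassifier(random_state=7)
rf.fit(X_train_res, y_train_res)

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances = importances.sort_values(ascending=False)

# Keep the variables whose cumulative importance stays within 95%.
selected = importances[importances.cumsum() <= 0.95].index
print(len(selected), "variables kept:", list(selected))
```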

The reduced variable set results in a 15% decrease in run time. Removing the less important variables yields a more efficient model, which will be especially useful with larger datasets.

The results

We found that the most important variables are socio-economic indicators such as the Euribor 3-month rate (euribor3m) and the number of employees (nr.employed), along with customer attributes such as housing loans, age, number of times contacted and the day of the week.

The result aligns with expectations: high borrowing rates and low employment are usually indicators of a poorly performing economy, and lower GDP also results in a decrease in savings, so term deposits usually do better in a well-performing economy. Customers who have housing loans are less likely to take up the term deposit due to a lack of free cash on hand, and people at retirement age are also less likely to take it up as they will need the cash.

Recommendation

The first recommendation is to collect consumer data such as housing loans, job and marital status, and the number of times a customer has been contacted. This data can be run through the model, which will immediately identify the potential customers.

Secondly, the bank should personalise its advertisements to customers who are less than 57 years old with no housing loans, and find new customers instead of calling existing contacts, since the chance of purchasing the term deposit decreases with the number of times a person is called.

Lastly, the bank should push this product when the economy is doing well and focus its marketing budget on other products when the economy is not, because the chance of people purchasing a term deposit falls as savings decrease with lower GDP.
