Starbucks Capstone

The capstone project for Udacity’s Data Scientist Nanodegree Program

Andrew Angeles
15 min read · Feb 21, 2022

Introduction

This project was completed as the capstone of the Udacity Data Scientist Nanodegree. It uses a mock dataset, created by Starbucks, that mimics customer behavior within the Starbucks Mobile Rewards App. The business question centers on using data to understand how users interact with different advertising efforts, in order to develop a data product that helps Starbucks engage users more effectively and helps users receive better-fitting offers.

Project Goal

The overarching goal of this project is to provide data tools that enable Starbucks to expose users to offers more efficiently. To this end, we will deliver 1) a visual analysis highlighting offer success rates across different demographic factors, and 2) predictive models that estimate the probability of offer success for a given customer (a binary classification problem).

Metrics

For the model development product, the key evaluation metric will be accuracy — the number of correct predictions divided by the total number of predictions. Accuracy is acceptable as the key evaluation measure because the 'success' and 'failure' classes are fairly balanced; if the classes were highly imbalanced, we would instead default to the F1-score, which accounts for the rates of false positives and false negatives.
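As a quick illustration of the distinction, the snippet below computes both metrics with scikit-learn on made-up labels and predictions (purely for demonstration, not project results):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels/predictions for illustration only: 1 = offer success, 0 = no success
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))  # correct predictions / total predictions -> 0.75
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall -> 0.75
```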

The Data

Starbucks provided three datasets, all in JSON format: portfolio.json, profile.json, and transcript.json.
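A minimal loading sketch with pandas; the file paths are assumptions, and the files are assumed to be line-delimited JSON records:

```python
import pandas as pd

# Paths are assumptions; adjust them to wherever the JSON files are stored.
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

# Expect roughly 10, 17,000, and 306,000+ rows respectively
print(portfolio.shape, profile.shape, transcript.shape)
```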

Portfolio Dataset

This dataset contains the universe of offers sent to users, with columns describing the characteristics of each offer, including its type, difficulty, and duration.

Overall shape: 6 columns and 10 rows

  • reward (int) — The reward given for completing an offer
  • channels (list of strings) — Distribution methods for the offer
  • difficulty (int) — The minimum spend required to complete an offer
  • duration (int) — How long the offer stays open, in days
  • offer_type (string) — The type of offer, ie BOGO, discount, or informational
  • id (string) — Offer id

Profile Dataset

This dataset contains demographic information for users of the Starbucks Mobile Rewards App, such as age, the date the user joined the platform, and income.

Overall shape: 5 columns and 17,000 rows

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

Transcript Dataset

A one-to-many dataset, with foreign-key references to Portfolio and Profile. It represents the transactions and interactions of users with offers on the Starbucks Mobile Rewards App. For example, with this dataset we can see when a customer made a purchase, received an offer, viewed an offer, or completed an offer.

Overall shape: 4 columns and 306,000+ rows

  • event (str) — record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since the start of the test. The data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

The above datasets are combined to help understand what leads to ‘successful’ offers, where an offer is sent to a user, the user views the offer, and the user then completes the requirements to unlock the offer.

Exploratory Analysis

Portfolio

portfolio.json

The above charts show the breakdown of offer types, with BOGO and discount making up most of the offers — 4 each. Looking at the chart on the right, we see that the most common offer duration is one week (7 days), with 5- and 10-day durations tied for second most common.

Profile

Following a review of offer data, we turn to user profile data. We start our review of user data by looking at gender and income data.

profile.json

Looking at gender, we note that most app customers identify as male (8,000+), roughly 6,000 identify as female, and the remainder selected neither. The age data shows two key things: 1) the median age is 58, and 2) the large bar at 118 represents users who omitted age-related information.

Next, we look at income, both overall and by gender.

Looking at customer income, we see that income ranges from a minimum of $30,000 to a maximum of $120,000, with both the mean and the median at around $65,000. The right-hand chart breaks income down by gender and shows that the median income for women is higher than that for men; however, given the large variance, the difference in median income across genders cannot be said to be statistically significant.

Transcript

Finally, we review data related to transactions and offers as captured in the transcript.json dataset.

Individual transaction distribution

The above charts plot the distribution of transaction size within the dataset; an initial review showed an extremely long right-hand tail, so the data was split at the $40 mark, which represents the ninety-ninth percentile. The left-hand chart shows all transactions under $40, which represent 99% of all transactions. A large portion of transactions occur at $10 and below, and the overall median transaction size stands at roughly $9.

Following the review of individual transactions, the next step is to view the distribution of average order size by user. Within the dataset there are 138,953 transactions totaling $1,775,452 across 16,578 users. On average, each user made about 8 transactions, with an average spend of roughly $14 per order. Plotting average order size per user gives a sense of users' spending behavior on the mobile platform.

Distribution of average user order

The data shows a multimodal distribution with a very long right tail. Splitting the data at the 90th percentile ($25) gives a better sense of the shape. On the left-hand chart we see a large peak at around $3, which suggests that most user orders on the platform are 'micro-transactions'; after that, the distribution is fairly flat between $15 and $25 per average customer order.

Data Processing and Cleaning

The initial data exploration exercise surfaced not only interesting characteristics of the data but also opportunities to clean the data and add new features to support further investigation. To support these data manipulation efforts, a custom function was created for each dataset, and within each function different steps were taken to process the data.

Some of the key changes included:

  • Encoding key variables, such as the 'channels' column in the portfolio.json dataset and 'gender' in the profile.json dataset. Encoding converts the string data in a single column into N columns with a binary value indicating whether each value applies, which supports model development later.
  • Renaming columns such as 'id' to more meaningful names: user_id in profile.json and offer_id in portfolio.json.
  • Binning variables of interest such as income, age, and length of membership in the profile.json dataset. Each variable was split into quartiles so that outcomes for individuals in the lower quartile can be compared with those in the higher quartiles, which gives a quick way to assess a variable's potential for predicting the desired outcome.
  • Extracting fields such as 'reward' and 'amount' from the dictionary-formatted 'value' column in the transcript.json dataset (a sketch of these steps follows this list).
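The sketch below illustrates these cleaning steps. The function names and exact implementation are illustrative rather than the project's actual code, but the column names follow the dataset descriptions above:

```python
import pandas as pd

def clean_portfolio(portfolio):
    """One-hot encode the 'channels' list column and rename 'id' to 'offer_id'."""
    df = portfolio.copy()
    for channel in ['email', 'mobile', 'social', 'web']:
        df[channel] = df['channels'].apply(lambda chans: int(channel in chans))
    return df.drop(columns='channels').rename(columns={'id': 'offer_id'})

def clean_profile(profile):
    """Encode gender, bin income into quartiles, and rename 'id' to 'user_id'."""
    df = profile.rename(columns={'id': 'user_id'}).copy()
    df = pd.concat([df, pd.get_dummies(df['gender'], prefix='gender')], axis=1)
    df['income_quartile'] = pd.qcut(df['income'], q=4, labels=[1, 2, 3, 4])
    return df

def clean_transcript(transcript):
    """Pull the offer id and 'amount' out of the dict-valued 'value' column."""
    df = transcript.rename(columns={'person': 'user_id'}).copy()
    df['offer_id'] = df['value'].apply(lambda v: v.get('offer id', v.get('offer_id')))
    df['amount'] = df['value'].apply(lambda v: v.get('amount'))
    return df.drop(columns='value')
```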

Merging Datasets

After reviewing each dataset individually, the next step is to merge the datasets so we can analyze potentially meaningful relationships between user demographic data and transaction data that might help in our overall goal of producing data models that predict offer success for users.
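A minimal merge sketch, continuing from the cleaning functions above (left joins on the renamed user_id and offer_id keys):

```python
# Left-join transcript events to user demographics and offer attributes.
merged = (clean_transcript(transcript)
          .merge(clean_profile(profile), on='user_id', how='left')
          .merge(clean_portfolio(portfolio), on='offer_id', how='left'))
```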

Earlier we reviewed spending patterns for the overall dataset; below we review spending across different demographic factors, taking advantage of the data binning undertaken in earlier processing steps.

The above charts uncover some interesting patterns, some expected and others less so. Starting on the left, we see that spending generally increases with income, which makes sense, as people with higher incomes likely have more disposable income and thus are more likely to spend. Turning to age, spending initially increases from the first to the second age quartile but then flattens out. Lastly, looking at spend by gender, women tend to spend more on the platform than men, and individuals not identified as men or women also spend a relatively high amount. Note that some of the spending difference between men and women may be related to the fact that, in this dataset, women tend to out-earn men, and as we saw earlier, higher income translates into higher overall spend; there may be a confluence of factors at play here.

Focusing on Offer Success

Offer success is defined as a user both viewing an offer and completing it. The viewing requirement is key: users don't opt in to offers, and offers are open to all, so some users will complete the steps required by an offer without ever having viewed it. We want to be both effective and efficient with our offers, such that we don't send offers to those who would spend anyway, and such that the offers we do send result in spending that would otherwise not occur. With this in mind, we create a new data frame with a column titled 'effective_offer' that is 1 when a user views and completes an offer and 0 otherwise. Given this definition, we automatically exclude all offers of type 'informational'.
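One simplified way to construct this label is sketched below, continuing from the merged frame above. It flags an offer as effective when a user has both viewed and completed it, and for brevity it skips the ordering and duration checks a fuller implementation would apply:

```python
# Simplified labeling sketch: an offer is 'effective' for a user if that user both
# viewed and completed it. A fuller version would also check that the view happened
# before the completion and within the offer's duration window.
events = merged[merged['event'].isin(['offer received', 'offer viewed', 'offer completed'])]

flags = (events
         .assign(viewed=lambda d: d['event'].eq('offer viewed'),
                 completed=lambda d: d['event'].eq('offer completed'))
         .groupby(['user_id', 'offer_id'])[['viewed', 'completed']]
         .max()
         .reset_index())

flags['effective_offer'] = (flags['viewed'] & flags['completed']).astype(int)
```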

With a simple binary outcome, we can now focus on analyzing offer effectiveness across different variables of interest; however, the first step is to establish a baseline of overall offer effectiveness.

Baseline Offer Success Rate

We see that the offer success rate is fairly even, split roughly 52% no success and 48% success. This balance of outcomes is favorable from a modeling perspective, as balanced classes eliminate the need to account for class imbalance and support the use of accuracy as our evaluation metric. Note, if we keep offers of type 'informational', the baseline results are roughly 61% unsuccessful and 39% successful. As previously stated, we exclude 'informational' offers because our goal is to create models that predict the probability that a user will complete an offer to obtain a reward (ie "spend $10 and get $2 off").

With the baseline success rate now established, we now review offer success across different demographic factors:

Gender

Offer success rate varied by gender just as we saw overall spend varied by gender. Overall, women were 1.25 times more likely to react positively to an offer than men, while those not identified as men or women were 1.35 times more likely to react positively. Based on the chart below, we can increase the offer success rate by targeting women and ‘others’.

Income

Much like spending, we note a positive relationship between income and offer success rate: the success rate increases at every progressive income level. Furthermore, comparing the success rates of the top and bottom quartiles shows a significant difference, suggesting that income is a potentially strong predictor of whether a given customer will act on an offer.

Age

Unlike gender and income, there is no observable relationship between age and users’ reaction to offers.

Membership Length

Reviewing the ‘platform’ age of the customer shows a generally positive relationship between the offer success rate and the platform age of the user. This suggests, all things being equal, a user who has been a Starbucks Mobile Reward App member longer is more likely to act on an offer than a user who just joined.

Offer Type

Turning to the offer itself, we see that BOGOs are slightly less effective at driving engagement than discounts. The small difference suggests that this variable might not be very useful in predicting offer success rates.

Offer Difficulty

Within the Starbucks App, some offers are more difficult to achieve than others. Looking at the below chart, there appears to be a weak relationship between offer difficulty and offer success. Though not a clear linear relationship, the most difficult offers have the lowest success rate.

Model Development

With a better understanding of the different demographic and offer variables and their respective relationship with the overall success rate, we can move to model development. Our model will attempt to leverage different variables such as user income, user platform age, offer difficulty, and offer distribution method in order to predict if the offer will be successful or unsuccessful. In our binary classification problem, we will attempt to maximize prediction accuracy and use this as our guiding metric.

Model Variables

After rounds of data analysis and visualization, a few key variables showed potential for use in developing a model aimed at predicting whether an offer will be a success; that is, whether a given user, when exposed to a given offer, will complete it.

The variables of interest are:

  • Gender, encoded
  • Income
  • Platform Age in days
  • Offer Payoff (calculated as reward/difficulty), with the intuition being that users are motivated not only by the size of the reward but also by the level of spending needed to unlock it (see the feature-construction sketch after this list).
  • Offer distribution channel (email, mobile, social, and web)
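The sketch below shows one way to assemble these features, continuing from the earlier label and cleaning sketches. The column and variable names (membership_days, offer_payoff, etc.) are assumptions for illustration:

```python
import pandas as pd

# Build the modeling frame from the label sketch ('flags') plus the cleaned
# profile and portfolio frames from the earlier processing sketch.
model_df = (flags
            .merge(clean_profile(profile), on='user_id', how='left')
            .merge(clean_portfolio(portfolio), on='offer_id', how='left'))

# Exclude informational offers and rows without demographic data.
model_df = model_df[model_df['offer_type'] != 'informational'].dropna(subset=['income'])

# Platform age in days, measured relative to the newest member's join date.
joined = pd.to_datetime(model_df['became_member_on'], format='%Y%m%d')
model_df['membership_days'] = (joined.max() - joined).dt.days

# Offer payoff: reward relative to the spend required to unlock it.
model_df['offer_payoff'] = model_df['reward'] / model_df['difficulty']

feature_cols = ['income', 'membership_days', 'offer_payoff',
                'gender_F', 'gender_M', 'gender_O',
                'email', 'mobile', 'social', 'web']
X = model_df[feature_cols]
y = model_df['effective_offer']
```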

Preprocessing

In addition to the variable encoding and data extraction conducted earlier, a few key data wrangling steps are needed to improve the effectiveness of any developed models.

  • Splitting data into training and test datasets using train_test_split()
  • Scaling/normalizing the data using StandardScaler(). Note, it is important to scale the data AFTER splitting to avoid information leakage between the training and test datasets (see the sketch below).
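A minimal sketch of these two steps, assuming the X and y assembled earlier; the split ratio and random seed are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set before any scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```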

Individual Learners

The first attempt at model development centered on the use of individual learners commonly used in classification tasks.

Random Forest

Per scikit-learn documentation, “A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.”

Fitting a random forest classifier yielded a model with an overall accuracy of 61.9% as shown below:

Random Forest Model Results
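A sketch of the baseline fit, assuming the scaled splits from the preprocessing sketch; default hyper-parameters are used, so the exact score will vary:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Default random forest as a baseline.
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_scaled, y_train)
print(accuracy_score(y_test, rf.predict(X_test_scaled)))
```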

Random Forest Parameter Hyper-tuning

To try to improve the results of the Random Forest model, we use a grid search (scikit-learn's GridSearchCV) to test different hyper-parameter values, namely n_estimators and min_samples_split. With this tuning, the accuracy of the Random Forest model improved to 64%, compared to the roughly 62% accuracy of the default model.

Parameter Hyper-tuning
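A sketch of the grid search, again assuming the scaled splits from earlier; the parameter values shown are illustrative rather than the exact grid used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values actually searched in the project may differ.
param_grid = {
    'n_estimators': [100, 200, 500],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring='accuracy', cv=5)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_)
print(grid.score(X_test_scaled, y_test))  # accuracy of the best refit model on the test set
```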

Logistic Regression

A classic model used in binary classification problems, logistic regression attempts to model the probability of a discrete outcome given a number of different input variables.

Using logistic regression produced largely the same result in terms of model accuracy, also 61.9%:

Logistic Model Results

Support Vector Machines (SVM)

Another powerful algorithm for classification problems is the SVM, which attempts to draw a line (or hyperplane) between N distinct classes while maximizing the distance between the classes and the hyperplane(s).

Using an SVM learner provided slightly better results compared to the two previous methods with an overall accuracy of 67.8%:

SVM Model Results
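A sketch of these two individual learners, assuming the scaled splits from the preprocessing sketch; probability=True on the SVC is included so the model can later feed a soft voting classifier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

log_reg = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
svc = SVC(probability=True, random_state=42).fit(X_train_scaled, y_train)

for name, model in [('logistic regression', log_reg), ('SVM', svc)]:
    print(name, accuracy_score(y_test, model.predict(X_test_scaled)))
```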

Ensemble Models

Though the individual learners provided some predictive power, there is potential room for improvement through the use of meta-learners. Unlike classic learners such as logistic regression or SVM, which take in different variables and produce an output, a meta-learner takes the outputs of different individual learners and combines them into a single result.

Voting Classifier

A voting classifier is a very intuitive meta-learner that effectively takes the majority result of the provided input models to create a final classification. In our example, the individual learners are the Random Forest, Logistic Regression, and SVM models created earlier. These three models each generate a 1 or 0 classification and an associated probability, which are fed into the voting classifier; the classifier effectively averages the three inputs and uses that average to determine the final 1 or 0 label.

Soft

The prediction is the average of the probabilities of the individual inputs. For example, if models A, B, and C gave probabilities of 0.4, 0.55, and 0.6, the voting classifier would output 0.5167 and label the item as 1 (assuming 1/0 binary classification). Our soft voting classifier achieved an accuracy of 65.8%.

Voting Classifier, Soft

Hard

Using 'hard' voting, the voting classifier takes a simple majority vote of the 1 or 0 labels produced by the individual models. Our hard voting classifier achieved an accuracy of 66.5%.

Voting Classifier, Hard
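A sketch of both voting variants, assuming the scaled splits from earlier; the estimators mirror the three individual learners described above:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

estimators = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('lr', LogisticRegression(max_iter=1000)),
    ('svc', SVC(probability=True, random_state=42)),  # probability=True is required for soft voting
]

soft_vote = VotingClassifier(estimators=estimators, voting='soft').fit(X_train_scaled, y_train)
hard_vote = VotingClassifier(estimators=estimators, voting='hard').fit(X_train_scaled, y_train)

print('soft:', soft_vote.score(X_test_scaled, y_test))
print('hard:', hard_vote.score(X_test_scaled, y_test))
```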

Stacking Classifier with Logistic Regression

This meta-learner takes the output of different learners and then uses a logistic regression model to combine the individual inputs into a final prediction. The overall accuracy rate was 67.6%, which was an improvement compared to the Voting Classifier.

Stacked Classifier with Logistic Regression

Stacking Classifier with Random Forest

This meta-learner takes the output of different learners and then uses a random forest model to combine the individual inputs into a final prediction. The overall accuracy rate declined to 62.5%.

Stacking Classifier with Random Forest
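A sketch of the stacking setup, assuming the scaled splits from earlier; swapping the final estimator switches between the two variants described above:

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

base_learners = [
    ('rf', RandomForestClassifier(random_state=42)),
    ('lr', LogisticRegression(max_iter=1000)),
    ('svc', SVC(probability=True, random_state=42)),
]

# The final estimator combines the base learners' predictions; swap in
# RandomForestClassifier() here to reproduce the second stacking variant.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train_scaled, y_train)
print(stack.score(X_test_scaled, y_test))
```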

Improvement

The predictive power of these models leaves ample room for improvement, whether through different models, adjusting the hyper-parameters of the models used, or better feature selection and engineering.

As seen in our GridSearch efforts with the Random Forest Model, we were able to incrementally increase the predictive power of the model using different values for the model's hyper-parameters. This suggests that the individual learners can be improved upon to a degree with some hyper-parameter tuning. A great next step would be to review all three individual models and refine the hyper-parameters to uncover better models.

Conclusion

Overall, the project provided a great opportunity to apply a number of lessons learned throughout the Udacity Data Scientist program. Though the data was created and provided by Starbucks, it still required a variety of data processing steps, such as extracting data from a column containing dictionary values.

Through the data exploration and visualization steps we learned some general heuristics that could help inform future offers on the Starbucks Mobile Rewards App, namely, we found:

  • Individuals with higher income were more likely to complete an offer: those in the bottom income quartile successfully completed offers 38% of the time, while those in the top quartile completed offers roughly 65% of the time.
  • Individuals who identified as female, or who chose not to identify as male or female, successfully completed offers 60% and 65% of the time, respectively, while individuals identified as men completed offers only 48% of the time.
  • In general, users newer to the Starbucks Mobile Rewards App were less likely to complete an offer: only 31% of newer users (lowest quartile of platform age) completed an offer, while more seasoned users successfully completed offers about 60% of the time.

During the data exploration process, we not only identified key data wrangling steps (extracting data, defining a successful offer, etc.) but also potential variables of interest for model development, such as income, gender, and platform age. Using what was learned during exploration, we created a number of individual and ensemble learners to predict whether a user would successfully complete an offer (receive an offer, view it, and complete it). Excluding informational offers, the baseline offer success rate is roughly 48%, with an unsuccessful rate of 52%.

Given that the two classes (successful and unsuccessful) are fairly balanced, we used accuracy to drive model development, with the end goal of creating something with better predictive power than a 50/50 guess. Using the individual learners Random Forest, Logistic Regression, and SVM, we created models with accuracies ranging from 62% to 67%. Following the development of individual learners, we turned to meta-learners, which are effectively models of models: a Voting Classifier and a Stacking Classifier, each taking the individual learners developed earlier as inputs. The end result was models with accuracy rates of roughly 67%, matching the predictive power of the single best individual learner.

The code and data needed for this project can be found on GitHub here:
