The FIFA 21 analysis project uses Python, with pandas for data manipulation and Plotly for visualization, to dissect player statistics and surface trends in attributes, demographics, and market values across the virtual football world.
The FIFA 21 dataset contains a wide range of information about players, including their names, nationalities, positions, team affiliations, and various skill ratings. To prepare this dataset for analysis, we should consider the following steps (a code sketch follows the list):
Remove Unnecessary Columns: Some columns may not be relevant for analysis. For example, photoUrl and playerUrl might not be needed.
Clean Data: Address any missing or inconsistent data. For instance, the Hits column appears to have newline characters.
Normalize Text Data: Ensure consistency in text data like team names, positions, and nationalities.
Convert Data Types: Some columns might be better represented in different data types. For example, ratings should be numeric.
Extract Useful Information: Some columns contain multiple pieces of information. For instance, Team & Contract could be split into separate columns for team and contract duration.
Handle Columns with Mixed Types: As indicated by the warning, at least one column has mixed data types which should be addressed.
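A minimal pandas sketch of these cleaning steps is below. The file name and column names (photoUrl, playerUrl, Hits, Nationality, "Team & Contract") are assumptions based on a typical raw FIFA 21 export and may need adjusting to the actual dataset.

```python
import pandas as pd

# Read the raw data; low_memory=False avoids the mixed-type column warning
df = pd.read_csv("fifa21_raw_data.csv", low_memory=False)

# 1. Remove columns that are not needed for the analysis
df = df.drop(columns=["photoUrl", "playerUrl"], errors="ignore")

# 2. Clean the Hits column: strip stray newline characters and coerce to numeric
df["Hits"] = pd.to_numeric(df["Hits"].astype(str).str.strip(), errors="coerce")

# 3. Normalize text data (trim whitespace)
df["Nationality"] = df["Nationality"].str.strip()

# 4./5. Split "Team & Contract" into separate team and contract columns
parts = df["Team & Contract"].astype(str).str.strip().str.split("\n", n=1, expand=True)
df["Team"] = parts[0].str.strip()
df["Contract"] = parts[1].str.strip() if parts.shape[1] > 1 else pd.NA
```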
To draw insights from the cleaned and organized FIFA 21 dataset, we can explore various aspects of the data. Some potential areas of analysis include:
Player Demographics: Analyzing the distribution of players’ ages, nationalities, and teams. This can reveal which countries or clubs are most represented in the game.
Skill Ratings: Examining the distribution of overall ratings (↓OVA), potential ratings (POT), and specific skill attributes like pace (PAC), shooting (SHO), etc. Insights could include identifying the top-rated players, positions with the highest skill ratings, or trends in player abilities.
Physical Attributes: Analyzing height and weight distributions to understand the physical makeup of top players. This can also be broken down by position or skill level.
Value and Wages: Investigating the relationship between a player’s value (e.g., release clause), wage, and their skill ratings. This can highlight the economics of the game and the correlation between a player’s cost and their abilities.
Position Analysis: Understanding the distribution of players across different positions and analyzing the skill set required for each position.
Player Development and Aging: Studying how players’ ratings change with age to understand at what age players peak and start to decline.
Comparison by Leagues and Teams: Comparing different leagues and teams in terms of their players’ average ratings, potential, and physical attributes.
These insights provide a basic understanding of the demographics, physical attributes, and skill levels of players in FIFA 21. Next, let us explore the dataset in more depth and see what further insights it offers.
The scatter plot illustrates the relationship between a player’s overall rating, their release clause, and their wage in FIFA 21. Here are some key insights:
In this analysis, we explore the relationship between a player's economic value (measured by release clause and wage) and key playing attributes (like pace, shooting, and passing). This can illustrate which attributes are most valued in the market. For example, a strong correlation between shooting ability and economic value would indicate that players with high shooting skills command higher market prices and wages. This insight is valuable for understanding what drives a player's market value in the virtual economy of FIFA 21.
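A hedged Plotly sketch of the rating/value/wage scatter and a quick attribute correlation check is below, continuing from the cleaned dataframe above. The column names (↓OVA, Release Clause, Wage, Name, PAC, SHO, PAS) and the currency format ("€103.5M") are assumptions about the raw dataset.

```python
import pandas as pd
import plotly.express as px

def parse_money(s: pd.Series) -> pd.Series:
    """Convert strings like '€103.5M' or '€67K' to numeric euros."""
    s = s.astype(str).str.replace("€", "", regex=False)
    multiplier = (s.str.endswith("M").map({True: 1e6, False: 1.0})
                  * s.str.endswith("K").map({True: 1e3, False: 1.0}))
    return pd.to_numeric(s.str.rstrip("MK"), errors="coerce") * multiplier

df["release_clause_eur"] = parse_money(df["Release Clause"])
df["wage_eur"] = parse_money(df["Wage"])

# Overall rating vs. release clause, with bubble size encoding wage
fig = px.scatter(df, x="↓OVA", y="release_clause_eur",
                 size=df["wage_eur"].fillna(0), hover_name="Name",
                 title="Overall rating vs. release clause (bubble size = wage)")
fig.show()

# Which skill attributes track economic value most closely?
print(df[["release_clause_eur", "wage_eur", "PAC", "SHO", "PAS"]]
      .corr()["release_clause_eur"].sort_values(ascending=False))
```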
In this project, we conducted a comprehensive analysis of the FIFA 21 dataset using Python and its libraries like pandas, matplotlib, and seaborn. We focused on extracting meaningful insights about the players’ demographics, skills, and economic values within the game. Our analysis included cleaning and organizing the raw data, followed by various explorations such as understanding player performance across different ages and positions, and examining the correlation between players’ economic values and their on-field attributes. These analyses provided a deeper understanding of the trends and dynamics in the virtual football world of FIFA 21, revealing how player characteristics like age, position, and skill attributes influence their market value and performance within the game. This project not only showcased the power of data analysis in sports analytics but also offered valuable insights into the factors that contribute to a player’s success and their economic value in FIFA 21.
Our aim is to harness the analytical prowess of Python’s data science libraries—Pandas for data manipulation, Seaborn for advanced visualizations, and Scikit-learn for machine learning clustering—to decode the complexities of NBA draft success and illuminate the patterns that underpin basketball excellence.
The dataset contains information about NBA players drafted in various years.
Handling Missing Values: Filled with median values or categorized as ‘Unknown’ for colleges.
Verification of Data Types: Ensured correct data types for analysis readiness.
Data Integrity: Confirmed the consistency and accuracy of the dataset.
Trend Analysis:
Player Performance Analysis:
Team Analysis:
Advanced Statistical Analysis:
Draft Analysis:
(i) Number of Players Drafted per Year:
This chart shows the count of players drafted each year. It appears that the number of players drafted has remained relatively consistent over the years, with slight variations. This consistency might be due to the fixed number of picks in each NBA draft.
(ii) Average Points Per Game Over the Years:
The line chart displays the average points scored per game by players from each draft year. There seems to be some fluctuation in the scoring average over time. This could be influenced by various factors like changes in playing styles, the evolving skill sets of players, or the defensive strategies of different eras.
(i) Correlation Matrix
The heatmap shows the correlation between various performance metrics. High positive correlations are evident between certain metrics, indicating that players who excel in one area often perform well in others. For instance, points per game (PPG) is positively correlated with win shares and value over replacement, suggesting that high scorers generally contribute significantly to their teams’ success.
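A hedged seaborn sketch of such a heatmap is below; the file name and the metric column names (points_per_game, win_shares, value_over_replacement, ...) are assumptions about the cleaned draft dataframe.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names; rename to match the actual NBA draft dataset
nba_df = pd.read_csv("nba_draft.csv")
metrics = ["points_per_game", "average_total_rebounds", "average_assists",
           "win_shares", "value_over_replacement"]

corr = nba_df[metrics].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between performance metrics")
plt.tight_layout()
plt.show()
```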
(ii) Top Performers
a) Top Scorers (Points Per Game):
b) Top Players Based on Win Shares:
(i) Teams’ Draft Picks
Number of Players Drafted by Each Team:
This bar chart illustrates the total number of players each NBA team has drafted. Some teams have drafted more players than others, which could be due to various factors like team strategies, performance in seasons (affecting draft order), and trades.
(ii) Team Performance Metrics
Average Points Per Game by Team:
Teams like NOH (New Orleans Hornets), VAN (Vancouver Grizzlies), and CLE (Cleveland Cavaliers) lead in this metric. This indicates that these teams, historically, have selected players who tend to score more on average.
(i) Elbow Method
In the provided graph, the elbow appears to be at around 4 clusters. This suggests that increasing the number of clusters beyond 4 does not yield a significant decrease in WCSS. Therefore, 4 is likely a good choice for the number of clusters to use for this dataset, as it represents a point where we have a reasonable trade-off between the number of clusters and the within-cluster variance.
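A sketch of how such an elbow plot can be produced with scikit-learn is below; the performance-metric columns used for clustering are assumptions, and `nba_df` is the cleaned dataframe from the earlier steps.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Assumed performance metrics used for clustering; adjust to the actual columns
features = nba_df[["points_per_game", "average_total_rebounds",
                   "average_assists", "win_shares"]].dropna()
X = StandardScaler().fit_transform(features)

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares

plt.plot(k_values, wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```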
(ii) Cluster Analysis
The dataset was grouped into four clusters, representing different types of players based on their performance metrics. Here’s a summary of each cluster:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
(i) Average Win Shares by Draft Rank:
The line chart illustrates the average win shares across different draft ranks. There’s a clear trend showing that players with lower draft ranks (meaning they were picked earlier) tend to have higher average win shares. This suggests that earlier draft picks, on average, are more successful in contributing to their team’s success.
(ii) Comparison of Early vs. Late Draft Picks:
a) Early Picks (Ranks 1-15): The average win shares for early draft picks is approximately 34.14. This indicates that players chosen in the top 15 contribute significantly to their teams, which aligns with expectations as these players are often highly touted prospects.
b) Late Picks (Ranks 46-60): The average win shares for late draft picks is about 5.62, considerably lower than that for early picks. This reflects the common understanding that later draft picks are less likely to make a substantial impact, although there are always exceptions.
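A sketch of this comparison is below, assuming columns named overall_pick (the draft rank) and win_shares.

```python
# Average win shares by draft rank, plus the early-vs-late comparison
avg_ws_by_rank = nba_df.groupby("overall_pick")["win_shares"].mean()

early = nba_df.loc[nba_df["overall_pick"].between(1, 15), "win_shares"].mean()
late = nba_df.loc[nba_df["overall_pick"].between(46, 60), "win_shares"].mean()
print(f"Early picks (1-15) average win shares:  {early:.2f}")
print(f"Late picks (46-60) average win shares: {late:.2f}")

avg_ws_by_rank.plot(kind="line", xlabel="Draft rank", ylabel="Average win shares",
                    title="Average win shares by draft rank")
```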
The graphical representation and the calculated averages corroborate the general consensus in the NBA Draft that earlier selections are expected to perform better than later ones. It’s important to note that there are many outliers and individual success stories that defy these averages, but the overall trend aligns with these findings.
In this data analysis project, we meticulously cleaned and explored an NBA draft dataset, uncovering insights into draft trends, player performance, and team strategies. We identified consistent drafting patterns, established correlations between performance metrics, and discerned the average output of players per team. Utilizing advanced statistical techniques, we clustered players into categories that suggested typical roles and analyzed draft success, confirming that early draft picks generally have more successful careers based on win shares. The project highlighted data analysis’s utility in sports analytics, providing a foundation for further investigative studies and potential improvements in player evaluation and team decision-making processes.
The task is to develop a machine learning model to predict user churn. An accurate model will help prevent churn, improve user retention, and grow Waze’s business.
Waze’s free navigation app makes it easier for drivers around the world to get to where they want to go. Waze’s community of map editors, beta testers, translators, partners, and users helps make each drive better and safer.
Throughout this project, we’ll see references to the problem-solving framework PACE. The notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute. The PACE stages will be cycled through four times, once for each milestone, as we work toward the best model. The PACE strategy equips us to complete the project in a systematic manner while keeping a record of the work.
Build a dataframe for the churn data. After the dataframe is complete, organize the data for the process of exploratory data analysis, and update the progress and insights. Use tools to create visuals for an executive summary to help non-technical stakeholders engage and interact with the data.
1. For EDA of the data, import the data and packages that will be most helpful, such as pandas, numpy, and matplotlib.
2. Read in the data and store it as a dataframe object called df.
1. Since we are interested in user churn, the label column is essential. Besides label, variables that tie to user behaviors will be the most applicable. All variables tie to user behavior except ID.
2. ID can be dropped from the analysis since we are not interested in identifying a particular user. ID does not provide meaningful information about the churn (unless ID is assigned based on user sign-up time).
3. To check for missing data, we can use df.info() and inspect the Non-Null Count column. The difference between the number of non-nulls and the number of rows in the data is the number of missing values for the variable.
4. If the missing data are missing completely at random (MCAR), meaning that the reason for missingness is independent of the data values themselves, we can proceed with a complete-case analysis by removing the rows with missing values. Otherwise, we need to investigate the root cause of the missingness and make sure it won’t interfere with the statistical inference and modeling.
5. Generate summary statistics using the describe() method.
6. Generate summary information using the info() method.
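A minimal setup sketch for these steps (the file name is illustrative):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read in the churn data and store it as a dataframe called df
df = pd.read_csv("waze_dataset.csv")

df.info()             # non-null counts reveal missing values (e.g., in the label column)
print(df.describe())  # summary statistics for the numeric variables
```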
Now that we know which data columns to use, it is time to decide which data visualization makes the most sense for EDA of the Waze dataset.
1. sessions: The number of occurrences of a user opening the app during the month
The sessions variable is a right-skewed distribution with half of the observations having 56 or fewer sessions. However, as indicated by the boxplot, some users have more than 700.
2. drives: An occurrence of driving at least 1 km during the month
The drives information follows a distribution similar to the sessions variable. It is right-skewed, approximately log-normal, with a median of 48. However, some drivers had over 400 drives in the last month.
3. total sessions: A model estimate of the total number of sessions since a user has onboarded
The total_sessions variable is a right-skewed distribution. The median total number of sessions is 159.6. This is interesting information because, if the median number of sessions in the last month was 56 and the median total sessions was ~160, then it seems that a large proportion of a user's (estimated) total sessions might have taken place in the last month. This is something that can be examined more closely later.
4. n_days_after_onboarding: The number of days since a user signed up for the app
The total user tenure (i.e., number of days since onboarding) is a uniform distribution with values ranging from near-zero to ~3,500 (~9.5 years).
5. driven_km_drives: Total kilometers driven during the month
The total kilometers driven in the last month per user is a right-skewed distribution, with half the users driving fewer than 3,495 kilometers. The users in this dataset drive a lot. The longest distance driven in the month was over half the circumference of the earth.
6. duration_minutes_drives: Total duration driven in minutes during the month
The duration_minutes_drives variable has a heavily skewed right tail. Half of the users drove less than ~1,478 minutes (~25 hours), but some users clocked over 250 hours over the month.
7. activity_days: Number of days the user opens the app during the month
Within the last month, users opened the app a median of 16 times. The box plot reveals a centered distribution. The histogram shows a nearly uniform distribution of ~500 people opening the app on each count of days. However, there are ~250 people who didn't open the app at all and ~250 people who opened the app every day of the month. This distribution is noteworthy because it does not mirror the sessions distribution, which we might think would be closely correlated with activity_days.
8. driving_days: Number of days the user drives (at least 1 km) during the month
The number of days users drove each month is almost uniform, and it largely correlates with the number of days they opened the app that month, except the driving_days distribution tails off on the right.
However, there were almost twice as many users (~1,000 vs. ~550) who did not drive at all during the month. This might seem counterintuitive when considered together with the information from activity_days. That variable had ~500 users opening the app on each of most of the day counts, but there were only ~250 users who did not open the app at all during the month and ~250 users who opened the app every day. Let us flag this for further investigation later.
9. device: The type of device a user starts a session with
There are nearly twice as many iPhone users as Android users represented in this data.
10. label: Binary target variable ("retained" vs. "churned") indicating whether a user churned at any point during the month
Less than 18% of the users churned.
11. driving_days vs activity_days:
Because both driving_days and activity_days represent counts of days over a month and they're also closely related, we can plot them together on a single histogram. This will help to better understand how they relate to each other without having to scroll back and forth comparing histograms in two different places.
As observed previously, this might seem counterintuitive. After all, why are there fewer people who didn’t use the app at all during the month and more people who didn’t drive at all during the month?
On the other hand, it could just be illustrative of the fact that, while these variables are related to each other, they’re not the same. People probably just open the app more than they use the app to drive—perhaps to check drive times or route information, to update settings, or even just by mistake.
Confirm the maximum number of days for each variable: driving_days and activity_days max out at 30 and 31, respectively. Although it's possible that not a single user drove all 31 days of the month, it's highly unlikely, considering there are 15,000 people represented in the dataset.
One other way to check the validity of these variables is to plot a simple scatter plot with the x-axis representing one variable and the y-axis representing the other.
Notice that there is a theoretical limit: if you use the app to drive, then by definition that day must also count as an activity day. In other words, you cannot have more driving days than activity days. None of the samples in this data violate this rule, which is good for the next stages.
12. Retention by Device:
Plot a histogram that has four bars—one for each device-label combination—to show how many iPhone users were retained/churned and how many Android users were retained/churned.
The proportion of churned users to retained users is consistent between device types.
13. Churn rate per number of driving days:
The churn rate is highest for people who didn’t use Waze much during the last month. The more times they used the app, the less likely they were to churn. While 40% of the users who didn’t use the app at all last month churned, nobody who used the app 30 days churned. This isn’t surprising. If people who used the app a lot churned, it would likely indicate dissatisfaction. When people who don’t use the app churn, it might be the result of dissatisfaction in the past, or it might be indicative of a lesser need for a navigational app. Maybe they moved to a city with good public transportation and don’t need to drive anymore.
1. Nearly all the variables were either very right-skewed or uniformly distributed. For the right-skewed distributions, this means that most users had values in the lower end of the range for that variable. For the uniform distributions, this means that users were generally equally likely to have values anywhere within the range for that variable.
2. Most of the data was not problematic, and there was no indication that any single variable was completely wrong. However, several variables had highly improbable or perhaps even impossible outlying values, such as driven_km_drives. Some of the monthly variables also might be problematic, such as activity_days and driving_days, because one has a max value of 31 while the other has a max value of 30, indicating that data collection might not have occurred in the same month for both of these variables.
3. Less than 18% of users churned, and ~82% were retained.
4. Distance driven per driving day had a positive correlation with user churn. The farther a user drove on each driving day, the more likely they were to churn. On the other hand, number of driving days had a negative correlation with churn. Users who drove more days of the last month were less likely to churn.
Conduct hypothesis testing on the data for the churn data. Investigate Waze’s dataset to determine which hypothesis testing method best serves the data and the churn project.
1. Research Question: Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?
2. Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.
1. Create a dictionary called map_dictionary that contains the class labels ('Android' and 'iPhone') as keys and the values you want to convert them to (2 and 1) as values.
2. Create a new column called device_type that is a copy of the device column.
3. Use the map() method on the device_type series. Pass map_dictionary as its argument. Reassign the result back to the device_type series.
When we pass a dictionary to the Series.map() method, it will replace the data in the series where that data matches the dictionary's keys. The values that get imputed are the values of the dictionary.
Example:

df['column']

| column |
|--------|
| A |
| B |
| A |
| B |

map_dictionary = {'A': 2, 'B': 1}
df['column'] = df['column'].map(map_dictionary)

df['column']

| column |
|--------|
| 2 |
| 1 |
| 2 |
| 1 |
We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type.
Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
Steps to Conduct a 2-Sample T-test:
1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis
Note: This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).
Hypotheses:
$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.
Next, let us choose 5% as the significance level and proceed with a two-sample t-test.
We can use the stats.ttest_ind() function to perform the test.
Isolate the drives column for iPhone users.
Isolate the drives column for Android users.
Perform the t-test, as sketched below.
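A sketch of these steps with SciPy (equal_var=False gives Welch's t-test, which does not assume equal variances):

```python
from scipy import stats

# Map device labels to numeric codes and isolate the two groups
map_dictionary = {'Android': 2, 'iPhone': 1}
df['device_type'] = df['device'].map(map_dictionary)

iphone_drives = df.loc[df['device_type'] == 1, 'drives']
android_drives = df.loc[df['device_type'] == 2, 'drives']

# Two-sample t-test
t_stat, p_value = stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)
print(t_stat, p_value)  # a p-value above 0.05 means we fail to reject the null hypothesis
```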
Result: Since the p-value is larger than the chosen significance level (5%), we fail to reject the null hypothesis. We can conclude that there is not a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.
One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or the user interface of the Waze app may provide more data to investigate churn.
We will create a binomial logistic regression model for the churn project. We’ll determine the type of regression model that is needed and develop one using Waze’s churn project data.
Import the following packages to build a regression model:
Packages for numerics + dataframes
import pandas as pd
import numpy as np
Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
Packages for Logistic Regression & Confusion Matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
1. km_per_driving_day
You know from earlier EDA that churn rate correlates with distance driven per driving day in the last month. It might be helpful to engineer a feature that captures this information.
Create a new column in df called km_per_driving_day, which represents the mean distance driven per driving day for each user.
Call the describe() method on the new column.
Note that some values are infinite. This is the result of there being values of zero in the driving_days column. Pandas imputes a value of infinity in the corresponding rows of the new column because division by zero is undefined.
a) Convert these values from infinity to zero. You can use np.inf to refer to a value of infinity.
b) Call describe() on the km_per_driving_day column to verify that it worked.
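A sketch of this step, using the column names from the EDA above:

```python
import numpy as np

# Mean kilometers driven per driving day for each user
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
df['km_per_driving_day'].describe()

# driving_days == 0 produces inf; convert those values to zero, then re-check
df.loc[df['km_per_driving_day'] == np.inf, 'km_per_driving_day'] = 0
df['km_per_driving_day'].describe()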
2. professional_driver
Create a new, binary feature called professional_driver that is a 1 for users who had 60 or more drives **and** drove on 15+ days in the last month.
Note: The objective is to create a new feature that separates professional drivers from other drivers. In this scenario, domain knowledge and intuition are used to determine these deciding thresholds, but ultimately they are arbitrary.
3. Perform a quick inspection of the new variable:
Check the count of professional drivers and non-professionals
Within each class (professional and non-professional) calculate the churn rate
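A sketch of the feature and the inspection:

```python
import numpy as np

# 1 if the user had >= 60 drives AND drove on 15 or more days last month, else 0
df['professional_driver'] = np.where(
    (df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0
)

# Class counts, then churn rate within each class
print(df['professional_driver'].value_counts())
print(df.groupby('professional_driver')['label'].value_counts(normalize=True))
```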
The churn rate for professional drivers is 7.6%, while the churn rate for non-professionals is 19.9%. This seems like it could add predictive signal to the model.
1. Encode Categorical Variable:
Change the data type of the label column to be binary. This change is needed to train a logistic regression model.
Assign a 0 for all retained users.
Assign a 1 for all churned users.
Save this variable as label2 so as not to overwrite the original label variable.
2. Determine whether assumptions have been met:
The following are the assumptions for logistic regression:
a) Independent observations (This refers to how the data was collected.)
b) No extreme outliers
c) Little to no multicollinearity among X predictors
d) Linear relationship between X and the logit of y
For the first assumption, we can assume that observations are independent for this project. The second assumption has already been addressed. The multicollinearity assumption is checked next, and the linearity assumption will be verified after modeling.
3. Collinearity:
If any pair of predictor variables has a Pearson correlation coefficient with an absolute value greater than 0.7, those variables are strongly multicollinear. Therefore, only one variable from each such pair should be used in our model.
Which variables are multicollinear with each other?
sessions and drives: 1.0
driving_days and activity_days: 0.95
4. Create dummies if necessary:
If we have selected device as an X variable, we will need to create dummy variables since this variable is categorical.
Create a new, binary column called device2 that encodes user devices as follows:
Android -> 0
iPhone -> 1
5. Assign predictor variables and target
To build the model we need to determine what X variables we want to include in the model to predict our target, label2.
Drop the following variables and assign the results to X:
a) label (this is the target)
b) label2 (this is the target)
c) device (this is the non-binary-encoded categorical variable)
d) sessions (this had high multicollinearity)
e) driving_days (this had high multicollinearity)
Note: sessions and driving_days were selected to be dropped, rather than drives and activity_days. The reason for this is that the features that were kept for modeling had slightly stronger correlations with the target variable than the features that were dropped.
6. Split the data
a) It is important to do a train/test split to obtain accurate estimates of predictive performance. We always want to fit the model on the training set and evaluate it on the test set to avoid data leakage.
b) Because the target class is imbalanced (82% retained vs. 18% churned), we want to make sure that we don't get an unlucky split that over- or under-represents the frequency of the minority class. Set the function's stratify parameter to y to ensure that the minority class appears in both train and test sets in the same proportion that it does in the overall dataset.
c) Use scikit-learn to instantiate a logistic regression model. Add the argument penalty = None (or penalty = 'none' in older scikit-learn versions).
d) Removing the penalty is important since the predictors are unscaled.
e) Fit the model on X_train and y_train.
f) Call the .coef_ attribute on the model to get the coefficients of each variable. The coefficients are in order of how the variables are listed in the dataset.
g) Call the model's intercept_ attribute to get the intercept of the model.
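A condensed sketch of steps 1–6. The max_iter value is an assumption to help convergence, and penalty=None requires scikit-learn ≥ 1.2 (older versions use penalty='none'):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Drop rows with missing labels, then encode the target and the device variable
df = df.dropna(subset=['label'])
df['label2'] = np.where(df['label'] == 'churned', 1, 0)
df['device2'] = np.where(df['device'] == 'iPhone', 1, 0)

# Drop the targets, the raw categorical column, and the multicollinear features
X = df.drop(columns=['label', 'label2', 'device', 'sessions', 'driving_days', 'ID'],
            errors='ignore')
y = df['label2']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(penalty=None, max_iter=400)
model.fit(X_train, y_train)

print(model.coef_)        # one coefficient per predictor, in column order
print(model.intercept_)
```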
7. Check final assumptions
Verify the linear relationship between X and the estimated log odds (known as logits) by making a regplot.
Call the model's predict_proba() method to generate the probability of response for each sample in the training data. (The training data is the argument to the method.) Assign the result to a variable called training_probabilities. This results in a 2-D array where each row represents a user in X_train. The first column is the probability of the user not churning, and the second column is the probability of the user churning.
In logistic regression, the relationship between a predictor variable and the dependent variable does not need to be linear, however, the log-odds (a.k.a., logit) of the dependent variable with respect to the predictor variable should be linear.
a) Create a dataframe called logit_data that is a copy of df.
b) Create a new column called logit in the logit_data dataframe. The data in this column should represent the logit for each user.
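A sketch of this check; note that, to keep row counts aligned with the training probabilities, the logit frame is built from X_train here:

```python
import numpy as np
import seaborn as sns

# Probability of [not churning, churning] for each training sample
training_probabilities = model.predict_proba(X_train)

# log-odds (logit) of churning for each training sample
logit_data = X_train.copy()
logit_data['logit'] = [np.log(prob[1] / prob[0]) for prob in training_probabilities]

# The logit should be roughly linear in each predictor, e.g. activity_days
sns.regplot(x='activity_days', y='logit', data=logit_data,
            scatter_kws={'s': 2, 'alpha': 0.5})
```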
1. Results and Evaluation
a) If the logistic assumptions are met, the model results can be appropriately interpreted. Use the code block below to make predictions on the test data. y_preds = model.predict(X_test)
b) Now, use the score() method on the model with X_test and y_test as its two arguments. The default score in scikit-learn is accuracy.
2. Show results with a confusion matrix
Classification Report:
The model has mediocre precision and very low recall, which means that it makes a lot of false negative predictions and fails to capture users who will churn.
3. Importance of Model’s Features
4. Conclusions
a) activity_days was by far the most important feature in the model. It had a negative correlation with user churn. This was not surprising, as this variable was very strongly correlated with driving_days, which was known from EDA to have a negative correlation with churn.
b) In previous EDA, user churn rate increased as the values in km_per_driving_day increased. The correlation heatmap in this notebook revealed this variable to have the strongest positive correlation with churn of any of the predictor variables by a relatively large margin. In the model, it was the second-least-important variable.
c) New features could be engineered to try to generate better predictive signal, as they often do if we have domain knowledge. In the case of this model, one of the engineered features (professional_driver) was the third-most-predictive predictor. It could also be helpful to scale the predictor variables, and/or to reconstruct the model with different combinations of predictor variables to reduce noise from unpredictive features.
d) It would be helpful to have drive-level information for each user (such as drive times, geographic locations, etc.). It would probably also be helpful to have more granular data to know how users interact with the app. For example, how often do they report or confirm road hazard alerts? Finally, it could be helpful to know the monthly count of unique starting and ending locations each driver inputs.
We will create the final machine learning model for the churn project using feature engineering and two tree-based models: random forest and XGBoost. The project will be completed through model development and evaluation.
1. Ethical Considerations
a) What are you being asked to do?
Predict if a customer will churn or be retained.
b) What are the ethical implications of the model? What are the consequences of the model making errors, i.e., what is the likely effect when it predicts a false negative (when the model says a Waze user won't churn, but they actually will)?
Waze will fail to take proactive measures to retain users who are likely to stop using the app. For example, Waze might proactively push an app notification to users, or send a survey to better understand user dissatisfaction.
c) What is the likely effect of the model when it predicts a false positive (i.e., when the model says a Waze user will churn, but they actually won’t)?
Waze may take proactive measures to retain users who are NOT likely to churn. This may lead to an annoying or negative experience for loyal users of the app.
d) Do the benefits of such a model outweigh the potential problems?
The proactive measures taken by Waze might have unintended effects on users, and these effects might encourage user churn. Follow-up analysis on the effectiveness of the measures is recommended. If the measures are reasonable and effective, then the benefits will most likely outweigh the problems.
2. Import packages and libraries needed to build and evaluate random forest and XGBoost classification models
3. Now read in the dataset as df0 and inspect the first five rows.
1. Feature Engineering
To begin, create a copy of df0 to preserve the original dataframe. Call the copy df.
a) km_per_driving_day
(i) Create a feature representing the mean number of kilometers driven on each driving day in the last month for each user. Add this feature as a column to df.
(ii) Get descriptive statistics for this new feature
(iii) Convert these values from infinity to zero. You can use np.inf to refer to a value of infinity.
(iv) Call describe() on the km_per_driving_day column to verify that it worked.
b) percent_sessions_in_last_month
Create a new column percent_sessions_in_last_month that represents the percentage of each user's total sessions that were logged in their last month of use.
c) professional_driver
Create a new, binary feature called professional_driver that is a 1 for users who had 60 or more drives **and** drove on 15+ days in the last month.
Note: The objective is to create a new feature that separates professional drivers from other drivers. In this scenario, domain knowledge and intuition are used to determine these deciding thresholds, but ultimately they are arbitrary.
d) total_sessions_per_day
Now, create a new column that represents the mean number of sessions per day since onboarding.
e) km_per_hour
Create a column representing the mean kilometers per hour driven in the last month.
f) km_per_drive
Create a column representing the mean number of kilometers per drive made in the last month for each user.
g) percent_of_sessions_to_favorite
Finally, create a new column that represents the percentage of total sessions that were used to navigate to one of the users’ favorite places.
This is a proxy representation for the percent of overall drives that are to a favorite place. Since total drives since onboarding are not contained in this dataset, total sessions must serve as a reasonable approximation.
People whose drives to non-favorite places make up a higher percentage of their total drives might be less likely to churn, since they’re making more drives to less familiar places.
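A sketch of features b) through g) is below (features a) and c) were sketched earlier). The names total_navigations_fav1 and total_navigations_fav2 are assumed column names for navigations to a user's favorite places.

```python
import numpy as np

# b) share of all (estimated) sessions that occurred in the last month
df['percent_sessions_in_last_month'] = df['sessions'] / df['total_sessions']

# d) mean sessions per day since onboarding
df['total_sessions_per_day'] = df['total_sessions'] / df['n_days_after_onboarding']

# e) mean kilometers per hour driven in the last month
df['km_per_hour'] = df['driven_km_drives'] / (df['duration_minutes_drives'] / 60)

# f) mean kilometers per drive in the last month (guard against drives == 0)
df['km_per_drive'] = (df['driven_km_drives'] / df['drives']).replace(np.inf, 0)

# g) proxy for the share of sessions used to navigate to a favorite place
df['percent_of_sessions_to_favorite'] = (
    df['total_navigations_fav1'] + df['total_navigations_fav2']
) / df['total_sessions']
```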
2. Drop missing values
Because we know from previous EDA that there is no evidence of a non-random cause of the 700 missing values in the label column, and because these observations comprise less than 5% of the data, use the dropna() method to drop the rows that are missing this data.
3. Outliers
We know from previous EDA that many of these columns have outliers. However, tree-based models are resilient to outliers, so there is no need to make any imputations.
4. Variable Encoding
a) Dummying features
Because this dataset only has one remaining categorical feature (device), it's not necessary to use one of these special functions. We can just implement the transformation directly.
Create a new, binary column called device2 that encodes user devices as follows:
Android -> 0
iPhone -> 1
b) Target Encoding
The target variable is also categorical, since a user is labeled as either “churned” or “retained.” Change the data type of the label column to be binary. This change is needed to train the models.
Assign a 0 for all retained users.
Assign a 1 for all churned users.
Save this variable as label2 so as not to overwrite the original label variable.
5. Feature Selection
Tree-based models can handle multicollinearity, so the only feature that can be cut is ID, since it doesn't contain any information relevant to churn.
Note, however, that device won't be used simply because it's a copy of device2.
Drop ID from the df dataframe.
6. Evaluation Metric
Before modeling, we must decide on an evaluation metric. This will depend on the class balance of the target variable and the use case of the model.
First, examine the class balance of the target variable.
Approximately 18% of the users in this dataset churned. This is an unbalanced dataset, but not extremely so. It can be modeled without any class rebalancing.
Now, consider which evaluation metric is best. Remember, accuracy might not be the best gauge of performance because a model can have high accuracy on an imbalanced dataset and still fail to predict the minority class.
It was already determined that the risks involved in making a false positive prediction are minimal. No one stands to get hurt, lose money, or suffer any other significant consequence if they are predicted to churn. Therefore, let us select the model based on the recall score.
1. Modeling workflow and model selection process
The final modeling dataset contains 14,299 samples. This is towards the lower end of what might be considered sufficient to conduct a robust model selection process, but still doable.
a) Split the data into train/validation/test sets (60/20/20)
Note that, when deciding the split ratio and whether or not to use a validation set to select a champion model, consider both how many samples will be in each data partition, and how many examples of the minority class each would therefore contain. In this case, a 60/20/20 split would result in ~2,860 samples in the validation set and the same number in the test set, of which ~18%—or 515 samples—would represent users who churn.
b) Fit models and tune hyperparameters on the training set
c) Perform final model selection on the validation set
d) Assess the champion model’s performance on the test set
2. Split the data
Now you’re ready to model. The only remaining step is to split the data into features/target variable and training/validation/test sets.
a) Define a variable X that isolates the features. Remember not to use device.
b) Define a variable y that isolates the target variable (label2).
c) Split the data 80/20 into an interim training set and a test set. Don’t forget to stratify the splits, and set the random state to 42.
d) Split the interim training set 75/25 into a training set and a validation set, yielding a final ratio of 60/20/20 for training/validation/test sets. Again, don’t forget to stratify the splits and set the random state.
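A sketch of the split, assuming the dataframe already contains label2 and device2:

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=['label', 'label2', 'device'])  # keep device2, drop raw device
y = df['label2']

# 80/20 interim-train/test, then 75/25 train/validation -> 60/20/20 overall
X_tr, X_test, y_tr, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, stratify=y_tr, random_state=42
)
```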
3. Modelling
a) Random Forest
Begin by using GridSearchCV to tune a random forest model.
(i) Instantiate the random forest classifier rf and set the random state.
(ii) Create a dictionary cv_params of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better your model will fit the data, but the longer it will take.
max_depth
max_features
max_samples
min_samples_leaf
min_samples_split
n_estimators
(iii) Define a set scoring of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).
(iv) Instantiate the GridSearchCV object rf_cv. Pass to it as arguments:
rf
cv_params
scoring
cv=_
refit=_
refit should be set to 'recall'.
(v) Now let us fit the model to the training data
(vi) Examine the best average score across all the validation folds
(vii) Examine the best combination of hyperparameters
(viii) Pass the GridSearch object to the make_results() function.
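A sketch of steps (i)–(vii) is below. The hyperparameter values are illustrative, and make_results() is a custom helper from the course materials not shown here, so best_score_ and best_params_ are printed instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42)

# Illustrative search space; widen or narrow it depending on the runtime budget
cv_params = {
    'max_depth': [None, 5],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 4],
    'max_features': [0.5, 'sqrt'],
    'n_estimators': [100, 300],
}
scoring = {'accuracy', 'precision', 'recall', 'f1'}

rf_cv = GridSearchCV(rf, param_grid=cv_params, scoring=scoring, cv=4, refit='recall')
rf_cv.fit(X_train, y_train)

print(rf_cv.best_score_)   # best mean cross-validated recall
print(rf_cv.best_params_)
```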
Aside from the accuracy, the scores aren't that good. However, recall that when we built the logistic regression model in the previous PACE cycle the recall was ~0.09, which means that this model has 33% better recall and about the same accuracy, and it was trained on less data.
b) XGBoost
Let us try to improve the scores using an XGBoost model.
(i) Instantiate the XGBoost classifier xgb and set objective='binary:logistic'. Also set the random state.
(ii) Create a dictionary cv_params of the following hyperparameters and their corresponding values to tune:
max_depth
min_child_weight
learning_rate
n_estimators
(iii) Define a set scoring of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
(iv) Instantiate the GridSearchCV object xgb_cv. Pass to it as arguments:
xgb
cv_params
scoring
cv=_
refit='recall'
(v) Now fit the model to the X_train and y_train data.
(vi) Get the best score from the model, the best parameters, and use the make_results() function to output all of the scores of the model.
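A matching sketch for the XGBoost search (hyperparameter values are again illustrative):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

xgb = XGBClassifier(objective='binary:logistic', random_state=42)

cv_params = {
    'max_depth': [4, 8],
    'min_child_weight': [1, 5],
    'learning_rate': [0.1, 0.3],
    'n_estimators': [100, 300],
}
scoring = {'accuracy', 'precision', 'recall', 'f1'}

xgb_cv = GridSearchCV(xgb, param_grid=cv_params, scoring=scoring, cv=4, refit='recall')
xgb_cv.fit(X_train, y_train)

print(xgb_cv.best_score_, xgb_cv.best_params_)
```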
This model fit the data even better than the random forest model. The recall score is nearly double the recall score from the logistic regression model from the previous PACE cycle, and it's almost 50% better than the random forest model's recall score, while maintaining a similar accuracy and precision score.
4. Model selection
Now, let us use the best random forest model and the best XGBoost model to predict on the validation data. Whichever performs better will be selected as the champion model.
a) Random Forest
Notice that the scores went down from the training scores across all metrics, but only by very little. This means that the model did not overfit the training data.
b) XGBoost
Just like with the random forest model, the XGBoost model’s validation scores were lower, but only very slightly. It is still the clear champion.
1. Use champion model to predict on test data
Now, let us use the champion model to predict on the test dataset. This is to give a final indication of how we should expect the model to perform on new future data, should we decide to use the model.
The recall was exactly the same as it was on the validation data, but the precision declined notably, which caused all of the other scores to drop slightly. Nonetheless, this is still within the acceptable range for performance discrepancy between validation and test scores.
2. Confusion Matrix
The model predicted three times as many false negatives as it did false positives, and it correctly identified only 16.6% of the users who actually churned.
3. Feature Importance
Use the plot_importance function to inspect the most important features of your final model.
The XGBoost model made more use of many of the features than did the logistic regression model from the previous course, which weighted a single feature (activity_days) very heavily in its final prediction.
4. Identify an optimal decision threshold
The default decision threshold for most implementations of classification algorithms—including scikit-learn’s—is 0.5. This means that, in the case of the Waze models, if they predicted that a given user had a 50% probability or greater of churning, then that user was assigned a predicted value of 1—the user was predicted to churn.
As recall increases, precision decreases. But what if we determined that false positives aren’t much of a problem? For example, in the case of this Waze project, a false positive could just mean that a user who will not actually churn gets an email and a banner notification on their phone. It’s very low risk.
Instead of using the model's default decision threshold of 0.5, let us use a lower threshold of 0.4:
The predict_proba() method returns a 2-D array of probabilities where each row represents a user. The first number in the row is the probability of belonging to the negative class, the second number in the row is the probability of belonging to the positive class.
We can generate new predictions based on this array of probabilities by changing the decision threshold for what is considered a positive response. For example, we can convert the predicted probabilities to {0, 1} predictions with a threshold of 0.4, as sketched below. In other words, any user who has a value ≥ 0.4 in the second column will get assigned a prediction of 1, indicating that they churned.
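A sketch of rethresholding the champion model's probabilities at 0.4:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Column 1 of predict_proba is the probability of churning
probabilities = xgb_cv.best_estimator_.predict_proba(X_test)
new_preds = np.array([1 if p[1] >= 0.4 else 0 for p in probabilities])

print('recall:   ', recall_score(y_test, new_preds))
print('precision:', precision_score(y_test, new_preds))
print('f1:       ', f1_score(y_test, new_preds))
print('accuracy: ', accuracy_score(y_test, new_preds))
```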
Let us compare this with the results from earlier.
Recall and F1 score increased significantly, while precision and accuracy decreased marginally.
So, using the precision-recall curve as a guide, suppose we knew that we'd be satisfied if the model had a recall score of 0.5 and we were willing to accept the ~30% precision score that comes with it. In other words, we would be happy if the model successfully identified half of the people who will actually churn, even if it means that when the model says someone will churn, it's only correct about 30% of the time.
5. Conclusion
a) Splitting the data three ways means that there is less data available to train the model than splitting just two ways. However, performing model selection on a separate validation set enables testing of the champion model by itself on the test set, which gives a better estimate of future performance than splitting the data two ways and selecting a champion model by performance on the test data.
b) Logistic regression models are easier to interpret. Because they assign coefficients to predictor variables, they reveal not only which features factored most heavily into their final predictions, but also the directionality of the weight. In other words, they tell you if each feature is positively or negatively correlated with the target in the model’s final prediction.
c) Tree-based model ensembles are often better predictors. If the most important thing is the predictive power of the model, then tree-based modeling will usually win out against logistic regression. They also require much less data cleaning and require fewer assumptions about the underlying distributions of their predictor variables, so they’re easier to work with.
d) New features could be engineered to try to generate better predictive signal, as they often do if you have domain knowledge. In the case of this model, the engineered features made up over half of the top 10 most-predictive features used by the model. It could also be helpful to reconstruct the model with different combinations of predictor variables to reduce noise from unpredictive features.
Predicting Taxi Gratuities in New York City
The goal of this project is to create a multiple linear regression and a random forest model to predict whether a rider will leave a high gratuity. This project uses yellow taxi trips taken in New York City during 2017. We will start by conducting EDA on the provided dataset. We will then prepare, create, and analyze an A/B test, whose results should help find ways to generate more revenue for taxi cab drivers. Next, we will build a multiple linear regression model, as it allows us to consider more than one variable against the variable we're measuring, opening the door to a much more thorough and flexible analysis. Finally, we will build a machine learning model to predict whether a customer will not leave a tip, intended for an app that alerts taxi drivers to customers who are unlikely to tip, since drivers depend on tips.
Throughout this project, we’ll see references to the problem-solving framework PACE. The notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute. The PACE stages will be cycled through four times, once for each milestone, as we work toward the best model. The PACE strategy equips us to complete the project in a systematic manner while keeping a record of the work.
For EDA of the data, import the data and packages that would be most helpful, such as pandas, numpy and matplotlib. Then, import the dataset.
1. Decide which columns are applicable
2. Consider functions that help you understand and structure the data.
head()
describe()
info()
groupby()
sort_values()
3. Use size, describe(), and info() to better understand the data and make sure there are no missing values.
There is no missing data according to the results from the info() function.
4. Select data visualization types that will help you understand and explain the data.
1. Box Plots
Perform a check for outliers on relevant columns such as trip distance and trip duration. Remember, some of the best ways to identify the presence of outliers in data are box plots and histograms.
a) trip_distance
The majority of trips were journeys of less than two miles. The number of trips falls away steeply as the distance traveled increases beyond two miles.
b) total_amount
The total cost of each trip also has a distribution that skews right, with most costs falling in the $5-15 range.
c) tip_amount
The distribution for tip amount is right-skewed, with nearly all the tips in the $0-3 range.
d) tip_amount by vendor
Separating the tip amount by vendor reveals that there are no noticeable aberrations in the distribution of tips between the two vendors in the dataset. Vendor two has a slightly higher share of the rides, and this proportion is approximately maintained for all tip amounts.
Next, zoom in on the upper end of the range of tips to check whether vendor one gets noticeably more of the most generous tips.
The proportions are maintained even at these higher tip amounts, with the exception being at highest extremity, but this is not noteworthy due to the low sample size at these tip amounts.
e) Mean tips by passenger count
Examine the unique values in the passenger_count column.
Nearly two thirds of the rides were single occupancy, though there were still nearly 700 rides with as many as six passengers. Also, there are 33 rides with an occupancy count of zero, which doesn’t make sense. These would likely be dropped unless a reasonable explanation can be found for them.
Mean tip amount varies very little by passenger count. Although it does drop noticeably for four-passenger rides, it’s expected that there would be a higher degree of fluctuation because rides with four passengers were the least plentiful in the dataset (aside from rides with zero passengers).
f) Create month and day columns
Monthly rides are fairly consistent, with notable dips in the summer months of July, August, and September, and also in February.
Surprisingly, Wednesday through Saturday had the highest number of daily rides, while Sunday and Monday had the least.
Thursday had the highest gross revenue of all days, and Sunday and Monday had the least. Interestingly, although Saturday had only 35 fewer rides than Thursday, its gross revenue was ~$6,000 less than Thursday’s—more than a 10% drop.
Monthly revenue generally follows the pattern of monthly rides, with noticeable dips in the summer months of July, August, and September, and also one in February.
g) Plot mean trip distance by drop-off location
This plot presents a characteristic curve related to the cumulative density function of a normal distribution. In other words, it indicates that the drop-off points are relatively evenly distributed over the terrain. This is good to know, because geographic coordinates were not included in this dataset, so there was no obvious way to test for the distribution of locations.
To confirm this conclusion, consider the following experiment:
(i) Create a sample of coordinates from a normal distribution—in this case 1,500 pairs of points from a normal distribution with a mean of 10 and a standard deviation of 5
(ii) Calculate the distance between each pair of coordinates
(iii) Group the coordinates by endpoint and calculate the mean distance between that endpoint and all other points it was paired with
(iv) Plot the mean distance for each unique endpoint
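A sketch of this simulation (the seed and plotting details are arbitrary):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

# (i) 1,500 pairs of 2-D coordinates drawn from Normal(mean=10, std=5)
start = np.round(rng.normal(10, 5, size=(1500, 2)), 1)
end = np.round(rng.normal(10, 5, size=(1500, 2)), 1)

# (ii) Euclidean distance between each start/end pair
distances = np.sqrt(((start - end) ** 2).sum(axis=1))

# (iii) Group by endpoint and compute the mean distance to its paired start points
sim = pd.DataFrame({'end': [tuple(p) for p in end], 'distance': distances})
mean_distance = sim.groupby('end')['distance'].mean().sort_values()

# (iv) Plot the mean distance for each unique endpoint
plt.figure(figsize=(12, 4))
plt.bar(range(len(mean_distance)), mean_distance.values)
plt.xlabel('Unique endpoints (sorted by mean distance)')
plt.ylabel('Mean distance')
plt.show()
```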
The curve described by this graph is nearly identical to that of the mean distance traveled by each taxi ride to each drop-off location. This reveals that the drop-off locations in the taxi dataset are evenly distributed geographically. Note, however, that this does not mean that there was an even distribution of rides to each drop-off point. Let us examine this next.
h) Histogram of rides by drop-off location
Notice that out of the 200+ drop-off locations, a disproportionate number of locations receive the majority of the traffic, while all the rest get relatively few trips. It’s likely that these high-traffic locations are near popular tourist attractions like the Empire State Building or Times Square, airports, and train and bus terminals. However, it would be helpful to know the location that each ID corresponds with. Unfortunately, this is not in the data.
1. EDA helps a data professional to get to know the data, understand its outliers, clean its missing values, and prepare it for future modeling.
2. Visualizations helped us understand that this dataset has some outliers that we will need to make decisions on prior to designing a model.
1. The research question for this data project: “Is there a relationship between total fare amount and payment type?”
2. Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.
1. Use descriptive statistics to conduct Exploratory Data Analysis (EDA).
In the dataset, payment_type
is encoded in integers:
We are interested in the relationship between payment type and the fare amount the customer pays. One approach is to look at the average fare amount for each payment type.
Based on the averages shown, it appears that customers who pay in credit card tend to pay a larger fare amount than customers who pay in cash. However, this difference might arise from random sampling, rather than being a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
2. Hypothesis Testing
Null hypothesis: There is no difference in average fare between customers who use credit cards and customers who use cash.
Alternative hypothesis: There is a difference in average fare between customers who use credit cards and customers who use cash
For the purpose of this exercise, our hypothesis test is the main component of the A/B test.
$H_0$: There is no difference in the average fare amount between customers who use credit cards and customers who use cash.
$H_A$: There is a difference in the average fare amount between customers who use credit cards and customers who use cash.
We choose 5% as the significance level and proceed with a two-sample t-test.
Since the p-value is significantly smaller than the significance level of 5%, we reject the null hypothesis.
Notice the ‘e-12’ at the end of the p-value result.
We can conclude that there is a statistically significant difference in the average fare amount between customers who use credit cards and customers who use cash.
1. The key business insight is that encouraging customers to pay with credit cards can generate more revenue for taxi cab drivers.
2. This project requires an assumption that passengers were forced to pay one way or the other, and that once informed of this requirement, they always complied with it. The data was not collected this way; so, an assumption had to be made to randomly group data entries to perform an A/B test. This dataset does not account for other likely explanations. For example, riders might not carry lots of cash, so it’s easier to pay for longer/farther trips with a credit card. In other words, it’s far more likely that fare amount determines payment type, rather than vice versa.
1. Import the packages that are needed for building linear regression models.
1. Convert pickup & dropoff columns to datetime
2. Create a new column called duration that represents the total number of minutes that each taxi ride took.
3. Call df.info() to inspect the columns and decide which ones to check for outliers.
4. Plot a box plot for each feature: trip_distance, fare_amount, and duration.
a) All three variables contain outliers. Some are extreme, but others not so much.
b) It’s 30 miles from the southern tip of Staten Island to the northern end of Manhattan and that’s in a straight line. With this knowledge and the distribution of the values in this column, it’s reasonable to leave these values alone and not alter them. However, the values for fare_amount and duration definitely seem to have problematic outliers on the higher end.
Imputations
1. trip_distance outliers
a) To check, we sort the column values, eliminate duplicates, and inspect the 10 smallest values.
b) Calculate the count of rides where the trip_distance
is zero.
148 out of ~23,000 rides is relatively insignificant. We could impute it with a value of 0.01, but it’s unlikely to have much of an effect on the model. Therefore, the trip_distance
column will remain untouched with regard to outliers.
2. `fare_amount` outliers
The range of values in the `fare_amount` column is large, and the extremes don’t make much sense.
Low values: Negative values are problematic. Values of zero could be legitimate if the taxi logged a trip that was immediately canceled.
High values: The maximum fare amount in this dataset is nearly \$1,000, which seems very unlikely. High values for this feature can be capped based on intuition and statistics. The interquartile range (IQR) is \$8. The standard formula of Q3 + (1.5 * IQR) yields $26.50, which doesn’t seem an appropriate cap for the maximum fare. In this case, we’ll use a factor of 6, which results in a cap of $62.50.
a) Impute values less than $0 with 0.
b) Now impute the maximum value as Q3 + (6 * IQR).
3. `duration` outliers
The `duration` column has problematic values at both the lower and upper extremities.
Low values: There should be no values that represent negative time. Impute all negative durations with 0.
High values: Impute high values the same way you imputed the high-end outliers for fares: Q3 + (6 * IQR).
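A hedged sketch of the imputation logic for both columns, following the Q3 + (6 * IQR) rule above:

```python
def cap_outliers(df, col, iqr_factor=6):
    """Set negative values to 0 and cap high values at Q3 + iqr_factor * IQR."""
    # Reassign negative values to 0.
    df.loc[df[col] < 0, col] = 0

    # Compute the upper threshold from the IQR.
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    upper_threshold = q3 + iqr_factor * (q3 - q1)

    # Cap values above the threshold.
    df.loc[df[col] > upper_threshold, col] = upper_threshold
    return df

df = cap_outliers(df, "fare_amount")   # works out to roughly $62.50 here
df = cap_outliers(df, "duration")
```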
Feature Engineering
1. Create `mean_distance` column
For example, if our data were:
Trip | Start | End | Distance |
---|---|---|---|
1 | A | B | 1 |
2 | C | D | 2 |
3 | A | B | 1.5 |
4 | D | C | 3 |
The results should be:
A -> B: 1.25 miles
C -> D: 2 miles
D -> C: 3 miles
Notice that C -> D is not the same as D -> C. All trips that share a unique pair of start and end points get grouped and averaged.
Then, a new column `mean_distance` will be added where the value at each row is the average for all trips with those pickup and dropoff locations:
Trip | Start | End | Distance | mean_distance |
---|---|---|---|---|
1 | A | B | 1 | 1.25 |
2 | C | D | 2 | 2 |
3 | A | B | 1.5 | 1.25 |
4 | D | C | 3 | 3 |
Let us begin by creating a helper column called `pickup_dropoff`, which contains the unique combination of pickup and dropoff location IDs for each row.
One way to do this is to convert the pickup and dropoff location IDs to strings and join them, separated by a space. The space is to ensure that, for example, a trip with pickup/dropoff points of 12 & 151 gets encoded differently than a trip with points 121 & 51.
So, the new column would look like this:
Trip | Start | End | pickup_dropoff |
---|---|---|---|
1 | A | B | ‘A B’ |
2 | C | D | ‘C D’ |
3 | A | B | ‘A B’ |
4 | D | C | ‘D C’ |
Now, let us use a `groupby()` statement to group each row by the new `pickup_dropoff` column, compute the mean, and capture the values only in the `trip_distance` column. Assign the results to a variable named `grouped`.
2. Create a `mean_distance` column that is a copy of the `pickup_dropoff` helper column.
3. Use the `map()` method on the `mean_distance` series. Pass `grouped_dict` as its argument. Reassign the result back to the `mean_distance` series.
When we pass a dictionary to the `Series.map()` method, it replaces each value in the series that matches one of the dictionary’s keys with that key’s corresponding value.
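A minimal sketch of steps 1–3, assuming the location ID columns are `PULocationID` and `DOLocationID` (as used later in this write-up) and that `grouped_dict` is simply `grouped` converted to a dictionary, a conversion that is implied above but not shown:

```python
# Helper column: unique pickup/dropoff pair, space-separated.
df["pickup_dropoff"] = (
    df["PULocationID"].astype(str) + " " + df["DOLocationID"].astype(str)
)

# Mean trip distance for each pickup/dropoff pair.
grouped = df.groupby("pickup_dropoff")["trip_distance"].mean()
grouped_dict = grouped.to_dict()

# Map each ride to the mean distance for its pickup/dropoff pair.
df["mean_distance"] = df["pickup_dropoff"]
df["mean_distance"] = df["mean_distance"].map(grouped_dict)
```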
4. Repeat the process used to create the `mean_distance` column to create a `mean_duration` column.
5. Create two new columns, `day` (name of day) and `month` (name of month), by extracting the relevant information from the `tpep_pickup_datetime` column.
6. Create `rush_hour` column
Define rush hour as:
Create a binary `rush_hour` column that contains a 1 if the ride was during rush hour and a 0 if it was not.
7. Create a scatterplot to visualize the relationship between `mean_duration` and `fare_amount`.
8. Drop features that are redundant, irrelevant, or that will not be available in a deployed environment.
9. Create a pairplot to visualize pairwise relationships between `fare_amount`, `mean_duration`, and `mean_distance`.
These variables all show linear correlation with each other.
10. Next, code a correlation matrix to help determine the most correlated variables.
Visualize a correlation heatmap of the data.
`mean_duration` and `mean_distance` are both highly correlated with the target variable, `fare_amount`. They’re also correlated with each other, with a Pearson correlation of 0.87.
This model will predict `fare_amount`, which will be used as a predictor variable in machine learning models. Therefore, let us try modeling with both variables even though they are correlated.
1. Set your X and y variables. X represents the features and y represents the outcome (target) variable.
2. Dummy encode categorical variables
3. Create training and testing sets. The test set should contain 20% of the total samples. Set `random_state=0`.
4. Use `StandardScaler()`, `fit()`, and `transform()` to standardize the `X_train` variables. Assign the results to a variable called `X_train_scaled`.
5. Instantiate your model and fit it to the training data.
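A minimal sketch of steps 3–5, assuming `X` and `y` have already been defined and dummy encoded:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# 20% of samples held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then transform it.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Instantiate and fit the linear regression model.
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
```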
6. Evaluate Model
a) Train Data
Evaluate the model performance by calculating the residual sum of squares and the coefficient of determination (R²). Also calculate the Mean Absolute Error, Mean Squared Error, and Root Mean Squared Error.
b) Test Data
Calculate the same metrics on the test data. Scale the `X_test` data using the scaler that was fit to the training data. Do not refit the scaler to the testing data; just transform it. Call the results `X_test_scaled`.
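Continuing the sketch above, the test-set metrics might be computed like this (the training-set metrics are computed the same way with `X_train_scaled`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Transform (do not refit) the test data with the training scaler.
X_test_scaled = scaler.transform(X_test)
y_pred_test = lr.predict(X_test_scaled)

print("R^2:", r2_score(y_test, y_pred_test))
print("MAE:", mean_absolute_error(y_test, y_pred_test))
print("MSE:", mean_squared_error(y_test, y_pred_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_test)))
```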
The model performance is high on both training and test sets, suggesting that there is little bias in the model and that the model is not overfit. In fact, the test scores were even better than the training scores.
For the test data, an R² of 0.868 means that 86.8% of the variance in the `fare_amount` variable is described by the model.
The mean absolute error is informative here because, for the purposes of the model, an error of two is not more than twice as bad as an error of one; there is no need for a metric that disproportionately penalizes larger errors.
1. Obtain `actual`, `predicted`, and `residual` values for the testing set, and store them as columns in a `results` dataframe.
2. Create a scatterplot to visualize `actual` vs. `predicted`.
3. Visualize the distribution of the `residuals` using a histogram.
4. Calculate `results['residual'].mean()`.
The distribution of the residuals is approximately normal and has a mean of -0.015. The residuals represent the variance in the outcome variable that is not explained by the model. A normal distribution around zero is good, as it demonstrates that the model’s errors are evenly distributed and unbiased.
5. Create a scatterplot of `residuals` over `predicted`.
The model’s residuals are evenly distributed above and below zero, with the exception of the sloping lines from the upper-left corner to the lower-right corner, which you know are the imputed maximum of \$62.50 and the flat rate of \$52 for JFK airport trips.
6. Use the `coef_` attribute to get the model’s coefficients. The coefficients are output in the order of the features that were used to train the model.
The coefficients reveal that `mean_distance` was the feature with the greatest weight in the model’s final prediction. A common misinterpretation is that for every mile traveled, the fare amount increases by a mean of \$7.13. This is incorrect. The data used to train the model was standardized with `StandardScaler()`. As such, the units are no longer miles, so we cannot say “for every mile traveled…”, as stated above. The correct interpretation of this coefficient is: controlling for other variables, for every +1 change in standard deviation of `mean_distance`, the fare amount increases by a mean of \$7.13.
Note also that because some highly correlated features were not removed, the confidence interval of this assessment is wider.
So let us translate this back into miles instead of standard deviations.
a) Calculate the standard deviation of `mean_distance` in the `X_train` data.
b) Divide the coefficient (7.133867) by the result to yield a more intuitive interpretation.
Now we can make a more intuitive interpretation: for every 3.57 miles traveled, the fare increased by a mean of \$7.13. Or, reduced: for every 1 mile traveled, the fare increased by a mean of \$2.00.
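The translation is a single division; a sketch, using the coefficient reported above:

```python
# Standard deviation of mean_distance in the (unscaled) training data.
sd_mean_distance = X_train["mean_distance"].std()

# Dollars per mile implied by the standardized coefficient.
dollars_per_mile = 7.133867 / sd_mean_distance
print(sd_mean_distance, dollars_per_mile)   # ~3.57 miles, ~$2.00 per mile
```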
7. Conclusion
Drivers who didn’t receive tips will probably be upset that the app told them a customer would leave a tip. If it happened often, drivers might not trust the app. Drivers are unlikely to pick up people who are predicted to not leave tips. Customers will have difficulty finding a taxi that will pick them up, and might get angry at the taxi company. Even when the model is correct, people who can’t afford to tip will find it more difficult to get taxis, which limits the accessibility of taxi service to those who pay extra.
It’s not good to disincentivize drivers from picking up customers. It could also cause a customer backlash. The problems seem to outweigh the benefits.
Effectively limiting equal access to taxis is ethically problematic, and carries a lot of risk.
We can build a model that predicts the most generous customers. This could accomplish the goal of helping taxi drivers increase their earnings from tips while preventing the wrongful exclusion of certain people from using taxis.
This is a supervised learning, classification task. We could use accuracy, precision, recall, F-score, area under the ROC curve, or a number of other metrics. However, we don’t have enough information at this time to know which are most appropriate. We need to know the class balance of the target variable.
Import packages and libraries needed to build and evaluate random forest and XGBoost classification models.
Begin by reading in the data. There are two dataframes: one containing the original data, the other containing the mean durations, mean distances, and predicted fares from the earlier modeling, saved as `nyc_preds_means.csv`.
Join the two dataframes using any method.
Feature Engineering
Copy `df0` and assign the result to a variable called `df1`. Then, use a Boolean mask to filter `df1` so it contains only customers who paid with a credit card.
Notice that there isn’t a column that indicates tip percent, which is what we need to create the target variable. We’ll have to engineer it.
Add a `tip_percent` column to the dataframe by performing the following calculation:
Round the result to three places beyond the decimal. This is an important step. It affects how many customers are labeled as generous tippers. In fact, without performing this step, approximately 1,800 people who do tip ≥ 20% would be labeled as not generous.
Now create another column called `generous`. This will be the target variable. The column should be a binary indicator of whether or not a customer tipped ≥ 20% (0=no, 1=yes). A sketch of these steps follows the list below.
1. Begin by making the `generous` column a copy of the `tip_percent` column.
2. Reassign the column by converting it to Boolean (True/False).
3. Reassign the column by converting Boolean to binary (1/0).
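A sketch of the `tip_percent` and `generous` steps, assuming tip percent is tip amount divided by the non-tip portion of the bill (the exact divisor used in the original notebook is not shown above, so treat it as an assumption), with the column names `tip_amount` and `total_amount` also assumed:

```python
# Tip as a fraction of the non-tip portion of the bill (assumed formula),
# rounded to three decimal places as described above.
df1["tip_percent"] = round(
    df1["tip_amount"] / (df1["total_amount"] - df1["tip_amount"]), 3
)

# Binary target: 1 if the customer tipped >= 20%, else 0.
df1["generous"] = df1["tip_percent"]
df1["generous"] = (df1["generous"] >= 0.2)
df1["generous"] = df1["generous"].astype(int)
```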
Next, engineer four new columns that represent time of day bins. Each column should contain binary values (0=no, 1=yes) that indicate whether a trip began (picked up) during the following times:
`am_rush` = [06:00–10:00)
`daytime` = [10:00–16:00)
`pm_rush` = [16:00–20:00)
`nighttime` = [20:00–06:00)
Now, create a `month` column that contains only the abbreviated name of the month when each passenger was picked up, then convert the result to lowercase.
Drop redundant and irrelevant columns, as well as those that would not be available when the model is deployed. This includes information like payment type, trip distance, tip amount, tip percentage, total amount, toll amount, etc. The target variable (`generous`) must remain in the data because it will get isolated as the `y` data for modeling.
Variable Encoding
Many of the columns are categorical and will need to be dummied (converted to binary). Some of these columns are numeric, but they actually encode categorical information, such as `RatecodeID` and the pickup and dropoff locations. To make these columns recognizable to the `get_dummies()` function as categorical variables, you’ll first need to convert them to strings.
1. Define a variable called `cols_to_str`, which is a list of the numeric columns that contain categorical information and must be converted to string: `RatecodeID`, `PULocationID`, `DOLocationID`.
2. Write a for loop that converts each column in `cols_to_str` to string.
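A sketch of the conversion loop, followed by the `get_dummies()` call it prepares for (`drop_first=True` is an illustrative choice, not stated above):

```python
import pandas as pd

# Numeric columns that actually encode categories.
cols_to_str = ["RatecodeID", "PULocationID", "DOLocationID"]

# Convert each to string so get_dummies() treats it as categorical.
for col in cols_to_str:
    df1[col] = df1[col].astype(str)

# Dummy-encode all categorical columns.
df1 = pd.get_dummies(df1, drop_first=True)
```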
Evaluation Metric
Examine the class balance of the target variable
A little over half of the customers in this dataset were “generous” (tipped ≥ 20%). The dataset is very nearly balanced.
To determine a metric, consider the cost of both kinds of model error:
False positives (the model predicts a tip ≥ 20%, but the customer does not give one)
False negatives (the model predicts a tip < 20%, but the customer gives more)
False positives are worse for cab drivers, because they would pick up a customer expecting a good tip and then not receive one, frustrating the driver.
False negatives are worse for customers, because a cab driver would likely pick up a different customer who was predicted to tip more—even when the original customer would have tipped generously.
The F1 score is the metric that places equal weight on precision and recall, and therefore balances the costs of false positives and false negatives.
Modelling
1. Split the Data
a) Define a variable `y` that isolates the target variable (`generous`).
b) Define a variable `X` that isolates the features.
c) Split the data into training and testing sets. Put 20% of the samples into the test set, stratify the data, and set the random state.
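A minimal sketch of the split in a)–c) (the random-state value is illustrative):

```python
from sklearn.model_selection import train_test_split

# Isolate the target and the features.
y = df1["generous"]
X = df1.drop(columns=["generous"])

# Stratified 80/20 split; the random-state value is illustrative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```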
2. Random Forest
Begin by using `GridSearchCV` to tune a random forest model.
a) Instantiate the random forest classifier `rf` and set the random state.
b) Create a dictionary `cv_params` of any of the following hyperparameters and their corresponding values to tune. The more you tune, the better your model will fit the data, but the longer it will take.
max_depth
max_features
max_samples
min_samples_leaf
min_samples_split
n_estimators
c) Define a set `scoring` of scoring metrics for GridSearch to capture (precision, recall, F1 score, and accuracy).
d) Instantiate the `GridSearchCV` object `rf1`. Pass to it as arguments:
`rf`
`cv_params`
`scoring`
`cv=_`
`refit=_`
e) Examine the best average score across all the validation folds, examine the best combination of hyperparameters, and use the `make_results()` function to output all of the scores of your model.
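A hedged sketch of steps a)–e); the hyperparameter values, `cv=4`, and `refit='f1'` are illustrative choices for the blanks left above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# a) Random forest classifier with a fixed random state.
rf = RandomForestClassifier(random_state=42)

# b) Hyperparameter grid (values are illustrative, not tuned).
cv_params = {
    'max_depth': [None, 10],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 4],
    'n_estimators': [300],
}

# c) Set of scoring metrics for GridSearch to capture.
scoring = {'accuracy', 'precision', 'recall', 'f1'}

# d) Grid search, refit on F1 (illustrative choices for the blanks).
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='f1')
rf1.fit(X_train, y_train)

# e) Best mean validation F1 score and the hyperparameters that produced it.
print(rf1.best_score_)
print(rf1.best_params_)
```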
This is an acceptable model across the board. Typically scores of 0.65 or better are considered acceptable, but this is always dependent on the use case.
f) Use your model to predict on the test data. Assign the results to a variable called `rf_preds`. Call `rf_test_scores` to output the results.
All scores increased by at most ~0.02.
3. XGBoost
Let us try to improve the scores using an XGBoost model.
a) Instantiate the XGBoost classifier `xgb` and set `objective='binary:logistic'`. Also set the random state.
b) Create a dictionary `cv_params` of the following hyperparameters and their corresponding values to tune:
max_depth
min_child_weight
learning_rate
n_estimators
c) Define a set `scoring` of scoring metrics for grid search to capture (precision, recall, F1 score, and accuracy).
d) Instantiate the `GridSearchCV` object `xgb1`. Pass to it as arguments:
`xgb`
`cv_params`
`scoring`
`cv=_`
`refit='f1'`
e) Now fit the model to the `X_train` and `y_train` data.
f) Get the best score from this model and the best parameters, and use the `make_results()` function to output all of the scores of your model.
g) Use your model to predict on the test data. Assign the results to a variable called `xgb_preds`. Call `xgb_test_scores` to output the results.
The F1 score is ~0.01 lower than the random forest model. Both models are acceptable, but the random forest model is the champion.
4. Plot a confusion matrix of the champion model’s predictions on the test data
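A minimal sketch of the confusion matrix plot, assuming `rf_preds` and `y_test` from the steps above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# rf_preds are the champion model's predictions on the test set.
ConfusionMatrixDisplay.from_predictions(y_test, rf_preds)
plt.title("Random forest test-set confusion matrix")
plt.show()
```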
The model is almost twice as likely to predict a false positive as it is to predict a false negative, so type I errors are more common. This is less desirable, because it’s better for a driver to be pleasantly surprised by a generous tip when they weren’t expecting one than to be disappointed by a low tip when they were expecting a generous one. However, the overall performance of this model is satisfactory.
5. Feature Importance
1. This model performs acceptably. Its F1 score was 0.7235 and it had an overall accuracy of 0.6865. It correctly identified ~78% of the actual generous tippers in the test set, which is 48% better than a random guess. It may be worthwhile to test the model with a select group of taxi drivers to get feedback.
2. Unfortunately, random forest is not the most transparent machine learning algorithm. We know that `VendorID`, `predicted_fare`, `mean_duration`, and `mean_distance` are the most important features, but we don’t know how they influence tipping. This would require further exploration. It is interesting that `VendorID` is the most predictive feature. This seems to indicate that one of the two vendors tends to attract more generous customers. It may be worth performing statistical tests on the different vendors to examine this further.
3. There are almost always additional features that can be engineered, but hopefully the most obvious ones were generated during the first round of modeling. In our case, we could try creating three new columns that indicate whether the trip distance is short, medium, or far. We could also engineer a column that gives the ratio (amount of money from the fare amount up to the nearest higher multiple of $5) / (fare amount). For example, if the fare were $12, the value in this column would be 0.25, because the gap from $12 up to the nearest higher multiple of $5 ($15) is $3, and $3 divided by $12 is 0.25. The intuition for this feature is that people might simply round up their tip, so journeys with fares just under a multiple of $5 may have lower tip percentages than those with fares just over a multiple of $5. We could also do the same thing for fares relative to the nearest $10.
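A sketch of the rounding-gap feature described above; the column name is made up for illustration, and it assumes a fare column (or the predicted fare) is available when the feature is computed:

```python
import numpy as np

# Distance from the fare up to the next higher multiple of $5, as a
# fraction of the fare. Example: $12 -> (15 - 12) / 12 = 0.25.
# Fares that are exact multiples of $5 map to 0 under this implementation.
next_multiple_of_5 = np.ceil(df1["fare_amount"] / 5) * 5
df1["gap_to_next_5_ratio"] = (
    (next_multiple_of_5 - df1["fare_amount"]) / df1["fare_amount"]
)
```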
4. It would probably be very helpful to have past tipping behavior for each customer. It would also be valuable to have accurate tip values for customers who pay with cash. It would be helpful to have a lot more data. With enough data, we could create a unique feature for each pickup/dropoff combination.
In just 11 years, Airbnb has grown from nothing to a $30 billion firm, with over half a billion people resting their heads in an Airbnb property each night. They currently host over 7 million listings across 100,000 cities in 220 countries and regions. In this vast market of accommodation and short-term rentals, accurately predicting rental prices has become increasingly important for platforms such as Airbnb. This ability to set competitive, yet profitable prices is not only beneficial to the hosts in generating revenue, but also the business as it enables Airbnb to obtain a clearer understanding of their price dynamics, improve their strategic decision-making, and increase their global market share.
Airbnb operates on a sharing-economy platform, however, unlike other firms operating on this platform such as Uber or Lyft, Airbnb allows the individual hosts to set their price, and offers them an algorithm tool for price suggestions. Pricing has been one of the biggest challenges for Airbnb and it has been identified that Airbnb hosts forfeit around 46% of the revenue they should be earning due to imperfect pricing strategy.
The Airbnb Manchester 2018 data set will be used for analysis and it contains a wide range of comprehensive variables that play a role in understanding the price set for the rentals. There are 3272 observations along with 27 different variables across the data set. R Studio will be used for the analysis of this data set. Let us look at the classification of the variables and their types using a table.
1. Variable Classification
2. Limitations of Data Set
Sample Bias: Since the data only encompasses listings in Manchester, it might not provide the most accurate results in predicting prices for listings in other areas, indicating that it could be geographically skewed.
Time Dynamics: Since the data only covers 2018, the model generated might prove inadequate at predicting later periods due to changes in housing price dynamics and economic instability.
External Factors: The data does not capture the metrics of external variables, including policies, regulatory changes, and political conditions.
3. Data Cleaning
Boxplot of Price
We can clearly see an outlier in our dependent variable price at $11,900, and hence it is important to eliminate this outlier before our analysis to avoid inaccurate results.
Boxplot of Price ≤ 4000
The above plot displays that we have successfully eliminated our outlier and completed the cleaning process, and the data is now ready for analysis.
1. Superhost vs Price
The boxplot analysis shows that super hosts (1) charge lower prices than normal hosts (0) despite higher reviews and more listings, likely aiming to increase booking volume and maintain high occupancy rates. In contrast, normal hosts, with fewer listings and lower review scores, charge more to compensate for less frequent bookings.
2. Overall Rating and Reviews Per Year vs Price
The above analysis shows that hosts with a higher overall rating (more than 80) charge more, consistent with our expectations. However, hosts with fewer reviews (0-25) charge higher prices, as they might not rent out frequently. This highlights the need to use log transformations of price to address skewed data and extreme values for clearer insights.
3. Distance vs Price
The above graph uses log(price) to display a clearer understanding of the relationship between price and distance. The graph makes it evident that the majority of Airbnb rentals are located closer to the city center and command a higher average price due to high demand, compared to those further away, which fewer people prefer to rent, leading to lower average prices.
4. Relevant Variables vs Price
Log(price) enables generating clear plots with a linear fit line. All four variables have a positive correlation with price. Accommodations for 2-6 guests and properties with 1-2 bedrooms and 1-2 bathrooms are most common, indicating a focus on individuals and small groups. Hosts with 1-5 listings typically charge lower prices and rent out more frequently, while those with more listings, likely professional managers, demand higher prices. These variables will be further analysed while building our model to understand their importance in predicting price.
1. Variable Sets
Four variable sets are created so that the variables are spread evenly across the models.
Variable Set 1: super host, host listings, host identity, distance, type_1, type_2, type_3, type_4
Variable Set 2: entire home, private room, shared room, accommodates
Variable Set 3: bathrooms, bedrooms, real bed, wireless, breakfast
Variable Set 4: instant bookable, cancellation policy, smoking, guest profile, guest phone, reviews per year, overall rating
2. Model Generation
4 conditional regression tree models will be generated to identify the most important variable in predicting price. In order to develop a more trustworthy model less sensitive to the variables used and combat the limitations of Regression Trees, we will use Random Forest as our next ML model. Random Forest generates a very large number of regression trees and averages out all the predictions to improve accuracy of the results.
3. Model Selection
a) A 10-fold cross-validation approach will be utilized, segmenting the dataset into ten parts to train the model ten times—each instance using nine parts for training and one part for testing.
b) The trained models for each of the four formulas (formula 1, 2, 3, and 4) will be stored in a list named ‘models’ through iterative training.
c) Upon completing all iterations, the performance of each model will be evaluated and summarized to identify the most effective formula for the dataset.
1. Algorithm for Formula
2. Conditional Regression Trees
a) Regression Tree - Formula 1
b) Regression Tree - Formula 2
The max depth for the above tree is set to 3 to keep the model easy to interpret. The variable “accommodates” is shown as the main predictor of price. Following the tree from node 1 to node 10, an Airbnb that can accommodate more than 8 but fewer than 10 people has 91 observations in the data set and an average price of $206.57, which is also used as the predicted price for any observation in this range.
c) Regression Tree - Formula 3
A max depth was not set, in order to see the full extent of the tree; it ranks “accommodates” as the best predictor of price, with a depth of 6.
d) Regression Tree - Formula 4
The above regression tree model is a composite of all the variable sets indicating that the model finds “accommodates” as the most important predictor of price in our data set. A max depth of 5 is set in order to avoid overfitting of the data and maintain the accuracy of the model. This result is consistent with the results generated by the model using Formula 2 & 3 which both include the “accommodates” variable and rank it the highest.
3. Random Forest
We first set the seed to 100 across all models in order to ensure reproducibility and that our code runs the same every time. Furthermore, this helps in comparison as it ensures no result is due to random chance or error.
a) Random Forest Model - Formula 1
Ntree = Min(100): Sets and generates a min of 100 regression trees
The above plot indicates that the model found the “host listing” variable to be the most important predictor of price; its percentage increase in mean squared error indicates the decrease in model accuracy (1.91) if this variable were not included.
This value is generated when the model keeps the values of the other 7 variables in the variable set constant and randomly permutes the values of the “host listing” variable to measure its impact on the accuracy of the prediction.
Ntree = Min(1000): Sets and generates a min of 1000 regression trees
The above plot indicates that “type 3” is a more accurate predictor of price as more trees are generated, with a higher % increase in MSE (4.85%), which shows the fall in model accuracy if this variable is removed.
Difference in Error Graph of Ntree Min(100) & Ntree Min(1000)
The error graph also indicates that using more than 40 trees does not improve the accuracy of the model, regardless of how high an Ntree number we input, suggesting that this is the extent of Model 1’s accuracy. The error graph generated using a minimum of 1000 regression trees, shown on the right-hand side, supports this statement.
b) Random Forest Model - Formula 2
Variable Importance Plot Model 2 - Ntree Min(500)
The above figure indicates the variable importance and the error plot for formula 2, which consists of variable sets 1 and 2. “accommodates” is the variable with the highest percentage increase in MSE (5.96%); however, “distance” also has a % increase in MSE of 4.6%, making it an important predictor of price along with the number of people the rental can accommodate. When the minimum number of trees is set to 500, the error plot indicates that trees generated beyond 100 do not have any impact on the accuracy of the model, as shown by the model.rf plot on the right-hand side.
c) Random Forest Model - Formula 3
Variable Importance Plot Model 3 - Ntree Min(1500)
The above figure shows the variable importance plot of Model/Formula 3. “accommodates” has the highest percentage increase in MSE, at 11%, indicating that 11% of the model accuracy would be lost without this variable and showcasing the value it holds in predicting price. “bedrooms” finishes a close second at 10.06%.
d) Random Forest Model - Formula 4
Variable Importance Plot Model 4 - Ntree Min(3000)
Formula 4, containing all the variable sets, is used to generate Random Forest Model 4. The variable importance indicates that “accommodates” is the best predictor of price in our data set, with a % increase in MSE of 56.7%. This implies that more than half the accuracy of the model would be lost without this variable, underlining how important the number of people a rental can accommodate is in predicting its price.
Both regression trees and random forest estimated “Accommodates” as the most important variable in predicting price. Random forest typically amplifies the predictive indicators found in individual trees. If a variable is consistently the best splitter in individual regression trees, it will likely emerge as important in the random forest model as well.
1. Model Selection
Cross Validation Results
The above figure indicates the performance of each model on 3 different metrics, enabling us to choose the best model based on their results in the cross-validation process.
a) R-Squared: Higher value implies that a model explains a greater proportion of variance. All the models perform quite similarly; however, Model 1 seems to have a slight edge over the others.
b) Mean Absolute Error (MAE): Models 3 & 4 have a smaller MAE, indicating that on average these models generate smaller errors while predicting price.
c) Root Mean Square Error (RMSE): RMSE is the square root of the average of squared differences between prediction and actual observation. Model 3 appears to have the lowest RMSE.
The cross-validation concludes that Model 3 has the best accuracy in predicting price. It consists of Variable Set 1,2 and 3 and the variable importance ranking of the top 5 variables is given below.
Model 3 Variable Ranking
The number of people a listing can hold and the bedrooms it has are the most important factors in customer choice and willingness to pay. A study in 2015 concluded that not all amenities have equal influence on pricing and that the price of the rental heavily relies on its capacity. It also claimed that some variables have the opposite effect, wherein their inclusion does not affect the price perceived by the customer. Evidence of this is the negative % increase in MSE value (-1.8%) held by “breakfast”, showcasing that guests do not consider breakfast a significant value add.
A study in 2016 revealed a significant positive relationship between the perceived trustworthiness of the host, based on their photo, and the listing price. An increase of one unit in visual-based trust led to a 7% increase in the price of the listing, confirming the importance of “host_identity” in predicting price. It also revealed a positive relationship between apartment size (number of bedrooms) and price. Hence, Airbnb should make hosts aware of these metrics to increase their revenue.
The increasing importance of connectivity justifies the importance of “wireless” in predicting price, as customers view this as a necessity rather than a luxury, increasing their willingness to pay. A more in-depth understanding of the variables with lower importance is also valuable for hosts, helping them avoid unnecessary investments and reduce costs to maximise profit.
The study resulted in identifying the most important variables in predicting Airbnb prices, while acknowledging limitations such as regionality, data scope, temporality, and feature incompleteness of some key variables that influence price such as aesthetic appeal and neighborhood of the listing. Future research can implement real-time market trends and optimize models for transferability between regions. The practical implications suggest focusing on listing size and host credibility to optimize pricing strategies. The study’s significance lies in its potential to enhance revenue for hosts and Airbnb, reinforcing the platform’s competitive stance in the hospitality sector through data-driven decision-making. This understanding of price predictors, derived from machine learning insights, can inform both strategic decisions at the platform level and operational decisions at the individual host level.