Predicting red wine prices from SAQ.com (Société des alcools du Québec)
23 Jun 2019
Objective: The goal of this machine learning project is to develop a model that predicts red wine prices using SAQ data. We will also interpret the results to identify the variables that are most predictive.
Problem definition
In this post, we will build a regression model to predict red wine prices. We will first scrape the data we need from the SAQ website (Société des alcools du Québec), then build a regressor on the result.
In this machine learning pipeline, we will:
- Get red wine data from SAQ.com
- Prepare environment & upload data
- Exploratory Data Analysis (EDA)
- Naive Baseline
- Feature Engineering & Model Selection
- Model Evaluation on test data
- Model Analysis (Interpretability)
- Conclusion
Note that the dataset has approximately 5000 rows. Our evaluation metric will be MAE (mean absolute error). Feel free to check out my repository on GitHub.
Important: For the purposes of this project we will follow these steps in a linear fashion, but a machine learning pipeline is an iterative process.
Get red wine data from SAQ.com
Scrapy vs BeautifulSoup:
Our project is small, the scraping logic is not very complex, and we want the job done quickly, so we will use BeautifulSoup to keep things simple. If your project needs more customization (proxies, data pipelines, and so on), Scrapy might be a better choice.
from bs4 import BeautifulSoup
import pandas as pd
import requests
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

class AuthError(Exception):
    def __init__(self):
        self.msg = "auth error"

def saq_urls():
    wine_list = ['https://www.saq.com/webapp/wcs/stores/servlet/SearchDisplay?searchType=&showOnly=product&langId=-1&catalogId=50000&storeId=20002&categoryIdentifier=06&pageSize=9900']
    for w in wine_list:
        source = requests.get(w, headers=headers).text
        soup = BeautifulSoup(source, 'lxml')
        saq_containers = soup.find_all('div', class_="ColD affichageMozaique")
        first = saq_containers[0]
        # Get all the product links
        product_urls = [url.get('href') for url in first.find_all('a')[1:]]
    # Drop duplicate URLs while preserving order
    complete_list = list(dict.fromkeys(product_urls))
    return complete_list
def redwine_data():
    wines_list = saq_urls()
    name, price, producer, wines_info = [], [], [], []
    for page in wines_list:
        # Retry the request until the server responds
        source = ''
        while source == '':
            try:
                source = requests.get(page, headers=headers).text
                break
            except requests.exceptions.RequestException:
                print("Connection refused by the server..")
                print("Sleep for 5 seconds")
                time.sleep(5)
                print("Continue...")
                continue
        soup = BeautifulSoup(source, 'lxml')
        w_name = soup.find_all("h1", {"class": "product-description-title"})
        w_price = soup.find_all("p", {"class": "price"})
        w_producer = soup.find_all("td", {"class": "product-detailsR"})[0]
        w_infos = soup.find_all("div", {"id": "details"})
        for info in w_name:
            name.append(info.text)
        for info in w_price:
            # Strip the label and currency symbol, e.g. 'Regular price: $15.95' -> '15.95'
            price.append(info.text.replace('Regular price: ', '').replace('$', ''))
        for info in w_producer:
            producer.append(info.text)
        for info in w_infos:
            # Collapse whitespace so each detail label becomes a single token, split at carriage returns
            wines_info.append(info.text.strip().replace('\n', '').replace(' ', '').replace('\r', ' '))

    def nextword(target, source):
        # Return the token immediately following `target` in a token list
        for i, w in enumerate(source):
            if w == target:
                return source[i + 1]

    country = [nextword('DetailedinfoCountry', s.split()) for s in wines_info]
    region = [nextword('Region', s.split()) for s in wines_info]
    origin = [nextword('Designationoforigin', s.split()) for s in wines_info]
    designation = [nextword('Regulateddesignation', s.split()) for s in wines_info]
    size = [nextword('Size', s.split()) for s in wines_info]
    alcohol = [nextword('Degreeofalcohol', s.split()) for s in wines_info]
    # Sugar content follows the alcohol percentage token
    sg = [nextword('%', s.split()) for s in wines_info]
    sg1 = [s.rpartition('Sugarcontent')[2] for s in sg]
    sugar = [s[0:3] for s in sg1]
    cols = ['Name', 'Country', 'Region', 'Origin', 'Designation', 'Producer', 'Size', 'Alcohol', 'Sugar', 'Price']
    df = pd.DataFrame({'Name': name, 'Country': country, 'Region': region, 'Origin': origin,
                       'Designation': designation, 'Producer': producer, 'Size': size,
                       'Alcohol': alcohol, 'Sugar': sugar, 'Price': price})[cols]
    df['Price'] = df['Price'].apply(pd.to_numeric, errors='coerce')
    df.to_csv(r'C:\Users\Admin\Documents\Python\SAQ\saq_data.csv', encoding='utf-8', index=False)

def main():
    print('Getting data from saq.com')
    redwine_data()
    print('Done')
    print('Your data is ready')

if __name__ == "__main__":
    main()
Preparing environment & uploading data
Importing packages
from sklearn.compose import ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, KBinsDiscretizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import model_selection
from sklearn.model_selection import learning_curve
from sklearn.linear_model import SGDRegressor
from sklearn.svm import LinearSVR
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from catboost import CatBoostRegressor
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
import io
import time
import d6tpipe
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)
%config InlineBackend.figure_format ='retina'
random_state= 42
scoring_metric = 'neg_mean_absolute_error'
api = d6tpipe.api.APIClient()
api.setToken('<Your Token>') # Your Token
api.list_pipes()
pipe = d6tpipe.Pipe(api, 'saqredwine2019')
pipe.pull()
# Drop unnecessary column: Unnamed: 0
cols = list(pd.read_csv(pipe.dirpath/'saq_redwine_5000_GCP2.csv', nrows =1))
df = pd.read_csv(pipe.dirpath/'saq_redwine_5000_GCP2.csv', usecols =[i for i in cols if i != 'Unnamed: 0']).astype({'Size':object})
df
It’s a clean, easy-to-understand dataset. Here’s what the features mean:
- Country: Country of origin
- Region: Region where the wine was made
- Designation: Product of a single vineyard with that vineyard’s name appearing on the wine label.
- Producer: Wine-producer
- Size: Volume in mL or L (750 mL, 1.5 L)
- Alcohol: Degree of Alcohol (%)
def prepare_data(data, target_name):
    """
    Separate X and y, and list the numerical and categorical columns
    so these can be processed separately in the pipeline.
    """
    if target_name is not None:
        # Drop rows where the target is 0 or missing
        data = data.loc[data[target_name] != 0]
        data = data.dropna(subset=[target_name])
        X = data.drop(target_name, axis=1)
        y = data[target_name]
    else:
        X = data
        y = None
    # Lists of numerical & categorical columns
    num_cols = X.select_dtypes("number").columns
    cat_cols = X.select_dtypes("object").columns
    return X, y, num_cols, cat_cols
# retrieve X, y and separated columns names for numerical and categorical data
X, y, num_cols, cat_cols = prepare_data(df, 'Price')
# Split train & test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_state)
Exploratory Data Analysis
Things we will check in this dataset:
- Is the data intuitive?
- Missing values
- Outliers
- Linear relationships
- Cross-validation strategy
# Stats
stats = pd.DataFrame(columns=['Columns', 'Types', 'Values', 'Uniques', 'Uniques(no nulls)', 'Missing(n)', 'Missing(%)'])
eda = pd.DataFrame()
for c in X_train.columns:
    eda['Columns'] = [c]
    eda['Types'] = X_train[c].dtypes
    eda['Values'] = [X_train[c].unique()]
    eda['Uniques'] = len(list(X_train[c].unique()))
    eda['Uniques(no nulls)'] = int(X_train[c].nunique())
    eda['Missing(n)'] = X_train[c].isnull().sum()
    eda['Missing(%)'] = (X_train[c].isnull().sum() / len(X_train)).round(3) * 100
    stats = stats.append(eda)
stats
Price Distribution
from scipy import stats
from scipy.stats import norm, skew
# Target describe()
print(y_train.describe())
plt.figure(figsize=(22,6))
plt.subplot(1,3,1)
sns.distplot(y_train, bins=50)
(mu, sigma) = norm.fit(y_train)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Price Distribution')
plt.subplot(1,3,2)
stats.probplot(y_train, plot=plt)
plt.subplot(1,3,3)
sns.boxplot(y_train)
plt.title('Price Boxplot')
plt.show()
stats.skew(y_train)
# save jpeg
# bplot.figure.savefig('price_boxplot.jpg', format='jpeg')
So, with a positive skewness of 5.0, Price is definitely heavy-tailed and has some outliers that we need to take care of. For this kind of positive skew, a log transformation usually works well.
Log Transformed Price Distribution
# Log transformation on Price
ylog_train = np.log1p(y_train)
# Plot
plt.figure(figsize=(22,6))
plt.subplot(1,3,1)
sns.distplot(ylog_train, bins=50, fit=norm)
(mu, sigma) = norm.fit(ylog_train)
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],loc='best')
plt.ylabel('Frequency')
plt.title('Log Price Distribution')
plt.subplot(1,3,2)
stats.probplot(ylog_train, plot=plt)
plt.subplot(1,3,3)
sns.boxplot(ylog_train)
plt.title('Price Boxplot')
plt.show()
stats.skew(ylog_train)
Categorical & Numerical Features: Linear Relationships
# Select the numeric columns
numeric_subset = X_train.select_dtypes('number')
# Create columns with square root and log of numeric columns
for col in numeric_subset.columns:
    # Skip the target column
    if col == 'Price':
        continue
    else:
        numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Select the categorical columns
categorical_subset = X_train[['Country', 'Region', 'Designation', 'Producer', 'Size']]
# One-hot encode
categorical_subset = pd.get_dummies(categorical_subset)
features = pd.concat([numeric_subset, categorical_subset, y_train], axis=1)
# Correlations with the target
correlations = features.corr()['Price'].sort_values()
# Top 10 negative correlations
correlations.head(10)
# Top 10 positive correlations
correlations.tail(10)
There are no strong positive linear relationships, although we do see that Producer is slightly positively correlated with the target Price.
EDA Conclusions:
- Almost all features are categorical
- Region, Designation, Alcohol, and our target Price have missing values
- Alcohol has more than 16% missing values
- Name acts as an ID column but contains some interesting values as a text feature
- High number of unique categorical values: Producer and Region
- The target Price distribution varies by Producer (see the quick check below)
Naive Baseline
For our naive baseline, we will predict the median Price from the training set for every observation in the test set. We will then calculate the MAE to gauge the improvement of our final model, which will need to beat this naive baseline.
Mean Absolute Error (MAE)
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

print("Naive Baseline Performance on the test set: MAE = %0.4f" % mae(y_test, np.median(y_train)))
Naive Baseline Performance on the test set: MAE = 43.2902
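As an aside, scikit-learn's DummyRegressor encapsulates this same median baseline; a minimal sketch, equivalent to the hand-rolled version above:
from sklearn.dummy import DummyRegressor
# Predict the training median for every test observation
dummy = DummyRegressor(strategy='median')
dummy.fit(X_train, y_train)
print("DummyRegressor baseline: MAE = %0.4f" % mae(y_test, dummy.predict(X_test)))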
# Numeric: Alcohol
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))])
# Categorical: Country, Region, Designation, Producer, Size
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])  # handle_unknown='ignore' skips categories unseen in training
# ColumnTransformer
cat_feat = ['Country', 'Region', 'Designation', 'Producer', 'Size']
num_feat = ['Alcohol']
column_trans = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_feat),
        ('cat', categorical_transformer, cat_feat)])

lr = Pipeline([('column_trans', column_trans),
               ('LinearRegression', LinearRegression())])
rf = Pipeline([('column_trans', column_trans),
               ('RandomForestRegressor', RandomForestRegressor(random_state=random_state))])
sdg = Pipeline([('column_trans', column_trans),
               ('SGDRegressor', SGDRegressor(loss='squared_loss', random_state=random_state))])
gbr = Pipeline([('column_trans', column_trans),
               ('GradientBoostingRegressor', GradientBoostingRegressor(random_state=random_state))])
lsvr = Pipeline([('column_trans', column_trans),
               ('LinearSVR', LinearSVR(random_state=random_state))])
cat = Pipeline([('column_trans', column_trans),
               ('CatBoostRegressor', CatBoostRegressor(verbose=1000, random_seed=random_state))])
# Setting up cross-validation
scores = []
results = []
names = []
cv = KFold(n_splits=5, shuffle=True, random_state=5)
for est, name in zip((lr, rf, sdg, gbr, lsvr, cat),
                     ('LinearRegression', 'RandomForestRegressor', 'SGDRegressor',
                      'GradientBoostingRegressor', 'LinearSVR', 'CatBoostRegressor')):
    start_time = time.time()
    # Cross-validate each regressor on the training data
    cv_score = cross_val_score(est, X=X_train, y=y_train, cv=cv, scoring=scoring_metric)
    scores.append([name.strip(), -1 * cv_score.mean(), cv_score.std(), (time.time() - start_time)])
    results.append(cv_score)
    names.append(name)
scores = pd.DataFrame(scores, columns=["Regressor", 'MAE', '+/-', 'Time(s)']).sort_values('MAE', ascending=True)
# scores.style.highlight_min()
scores
Most of our models outperform the naive baseline. Next, we will compare the models with some feature engineering and hyperparameter tuning.
Important: Scikit-Learn's gradient boosting implementation is generally slower than XGBoost's, but here we'll stick to Scikit-Learn because the syntax is more familiar.
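For reference, swapping in XGBoost would only change the final step of the pipeline; a sketch assuming the xgboost package is installed (the n_estimators value is illustrative):
from xgboost import XGBRegressor
# Drop-in alternative to GradientBoostingRegressor in the same pipeline
xgb = Pipeline([('column_trans', column_trans),
                ('XGBRegressor', XGBRegressor(n_estimators=300, random_state=random_state))])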
(Light) Feature Engineering & Model Selection
- Create a new feature Year from Name
- Impute missing values in numerical and categorical features
- Quantile binning on Alcohol
- One-hot encode categorical features
- Apply a log transform to Price, since the target is skewed
- Evaluate using nested cross-validation (5x2-fold): the inner loop is responsible for hyperparameter tuning and model selection, and the outer loop is used to evaluate the model selected by the inner loop
from sklearn.base import BaseEstimator, TransformerMixin

class YearTransformer(BaseEstimator, TransformerMixin):
    """Extract a four-digit vintage year (20xx) from the product name."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.str.extract(r'(20\d{2})', expand=True).replace(np.nan, 'Other', regex=True)
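# Quick sanity check (hypothetical names, not from the scraped data):
# YearTransformer().fit_transform(pd.Series(['Château Exemple 2016', 'Cuvée Rouge']))
# extracts '2016' from the first name and maps the second, which has no 20xx match, to 'Other'.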
# Numeric: quantile-based adaptive binning on the Alcohol variable
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('KBinsDiscretizer', KBinsDiscretizer(n_bins=4, encode='onehot', strategy='quantile'))])
# Categorical: Country, Region, Designation, Producer, Size
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Categorical: Name -> Year
year_transformer = Pipeline([
    ('year', YearTransformer()),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# ColumnTransformer
cat_feat = ['Country', 'Region', 'Designation', 'Producer', 'Size']
num_feat = ['Alcohol']
year_feat = 'Name'
column_trans = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_feat),
        ('cat', categorical_transformer, cat_feat),
        ('year', year_transformer, year_feat)])

rf = Pipeline([('column_trans', column_trans),
               ('RandomForestRegressor', RandomForestRegressor(random_state=random_state))])
sdg = Pipeline([('column_trans', column_trans),
               ('SGDRegressor', SGDRegressor(loss='squared_loss', random_state=random_state))])
gbr = Pipeline([('column_trans', column_trans),
               ('GradientBoostingRegressor', GradientBoostingRegressor(random_state=random_state))])
lsvr = Pipeline([('column_trans', column_trans),
               ('LinearSVR', LinearSVR(random_state=random_state))])
cat = Pipeline([('column_trans', column_trans),
               ('CatBoostRegressor', CatBoostRegressor(verbose=1000, random_seed=random_state))])
# Model hyperparameters & tuning space
param_grid_rf = [{'RandomForestRegressor__max_features': [4, 5, 6],
                  'RandomForestRegressor__n_estimators': [100, 200, 300]}]
param_grid_sdg = [{'SGDRegressor__alpha': [1e-5, 1e-4, 1e-3]}]
param_grid_gbr = [{'GradientBoostingRegressor__n_estimators': [900, 1000],
                   'GradientBoostingRegressor__max_depth': [5, 6, 7]}]
param_grid_lsvr = [{'LinearSVR__C': [17, 18, 19, 20]}]
param_grid_cat = [{'CatBoostRegressor__iterations': [1000, 1600],
                   'CatBoostRegressor__depth': [8, 12],
                   'CatBoostRegressor__bagging_temperature': [2, 3]}]

# Multiple GridSearchCV objects, one for each algorithm
gridcvs = {}
inner_cv = KFold(n_splits=2, shuffle=True, random_state=random_state)
for pgrid, est, name in zip((param_grid_rf, param_grid_sdg, param_grid_gbr, param_grid_lsvr, param_grid_cat),
                            (rf, sdg, gbr, lsvr, cat),
                            ('RandomForestRegressor', 'SGDRegressor', 'GradientBoostingRegressor', 'LinearSVR', 'CatBoostRegressor')):
    gcv = GridSearchCV(estimator=est,
                       param_grid=pgrid,
                       scoring=scoring_metric,
                       n_jobs=1,
                       cv=inner_cv,
                       verbose=0,
                       refit=True)
    gridcvs[name] = gcv
outer_cv = KFold(n_splits=5, shuffle=True, random_state=random_state)
outer_scores = {}
results = []
names = []
for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train,
                                   y=y_train,
                                   cv=outer_cv,
                                   scoring=scoring_metric)
    outer_scores[name] = nested_score
    print(f'{name}: outer {-1*nested_score.mean():.2f} +/- {nested_score.std():.2f}')
    results.append(nested_score)
    names.append(name)
fig = plt.figure(figsize=(10, 7), dpi=80, facecolor='w', edgecolor='k')
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names, rotation=45)
plt.show()
Generally, random search is better when we have limited knowledge of the best model hyperparameters: we can use random search to narrow down the options, then run grid search over a more limited range. A sketch of that first step is shown below.
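A minimal sketch of that coarse first pass with scikit-learn's RandomizedSearchCV (the parameter distributions below are illustrative, not values tuned for this dataset):
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
# Coarse random search to narrow the range before a finer grid search
param_dist = {'GradientBoostingRegressor__n_estimators': randint(100, 1500),
              'GradientBoostingRegressor__max_depth': randint(3, 10)}
rand_search = RandomizedSearchCV(gbr, param_distributions=param_dist, n_iter=20,
                                 scoring=scoring_metric, cv=inner_cv,
                                 random_state=random_state)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)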
Model Evaluation on Test Data
Now, we will select and use the best model from hyperparameter tuning in order to estimate the generalization performance on the test set. If this generalization error is similar to the validation error, we have reason to believe that our model will perform well on unseen data.
In order to estimate how the model performance varies with training set size, we will also plot the Learning curve for our model.
gcv_model_select = GridSearchCV(estimator=gbr,
param_grid=param_grid_gbr,
scoring=scoring_metric,
n_jobs=-1,
cv=inner_cv,
verbose=1,
refit=True)
gcv_model_select.fit(X_train, y_train)
best_model = gcv_model_select.best_estimator_
print('Best Parameters: %s' % gcv_model_select.best_params_)
Fitting 2 folds for each of 12 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 24 out of 24 | elapsed: 51.3s finished
Best Parameters: {'GradientBoostingRegressor__max_depth': 5, 'GradientBoostingRegressor__n_estimators': 900}
best_model = Pipeline([('column_trans', column_trans), ('GradientBoostingRegressor',
TransformedTargetRegressor(regressor = GradientBoostingRegressor(n_estimators= 900, max_depth=5, random_state=random_state),
func=np.log1p, inverse_func=np.expm1))])
best_model.fit(X_train, y_train)
train_MAE = mean_absolute_error(y_true=y_train, y_pred=best_model.predict(X_train))
test_MAE = mean_absolute_error(y_true=y_test, y_pred=best_model.predict(X_test))
print(f'Training Mean absolute Error: {train_MAE:.2f}')
print(f'Test Mean Absolute Error: {test_MAE:.2f}')
Training Mean absolute Error: 15.49
Test Mean Absolute Error: 27.96
Learning curves
With learning curves, we can estimate how our model's performance varies with training set size.
def plot_learning_curve(estimator, clf, X, y, ylim=None, cv=None, train_sizes=None):
    plt.figure(figsize=(10, 7), dpi=80, facecolor='w', edgecolor='k')
    plt.title(f'Learning Curve ({clf})')
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    # Shade +/- one standard deviation around each mean curve
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    plt.grid(True)
best_model = Pipeline([('column_trans', column_trans), ('GradientBoostingRegressor',
TransformedTargetRegressor(regressor = GradientBoostingRegressor(n_estimators= 900, max_depth=5, random_state=random_state),
func=np.log1p, inverse_func=np.expm1))])
train_sizes = np.linspace(.1, 1.0, 10)
ylim = (0.2, 1.01)
cv = 2
plot_learning_curve(best_model, 'GradientBoostingRegressor', X_train, y_train,
ylim=ylim, cv=cv, train_sizes=train_sizes)
plt.show()
Looking at the plot above, we can see two distinct trends. As the number of training samples grows, the training score declines while the cross-validation score improves.
Also, the training score is still greater than the validation score at the maximum number of training samples, so adding more training samples would most likely improve generalization.
To get a sense of the predictions, we can plot the distribution of true values on the test set and the predicted values.
Density plot of predictions
final_predictions = best_model.predict(X_test)
# Density plot of the final predictions and the test values
plt.figure(figsize=(22,6))
sns.kdeplot(final_predictions, label = 'Predicted values')
sns.kdeplot(y_test, label = 'Test values')
# Label the plot
plt.xlabel('Price'); plt.ylabel('Density');
plt.title('Final Predictions & Test Values');
The distributions look nearly the same, although the density of the predicted values is closer to the median of the test values. It appears the model is less accurate at predicting extreme values and instead predicts values closer to the median.
Model Analysis (Interpretability)
In this section, we will try to understand how the model makes predictions.
What features in the data did the model think are most important?
def get_column_names_from_ColumnTransformer(column_transformer):
    col_name = []
    # The last transformer is the ColumnTransformer's 'remainder'
    for transformer_in_columns in column_transformer.transformers_[:-1]:
        raw_col_name = transformer_in_columns[2]
        if isinstance(transformer_in_columns[1], Pipeline):
            transformer = transformer_in_columns[1].steps[-1][1]
        else:
            transformer = transformer_in_columns[1]
        try:
            names = transformer.get_feature_names(cat_feat)
        except AttributeError:  # if there is no get_feature_names method, use the raw column name
            names = raw_col_name
        if isinstance(names, np.ndarray):
            col_name += names.tolist()
        elif isinstance(names, list):
            col_name += names
        elif isinstance(names, str):
            col_name.append(names)
    return col_name
# Feature importances into a dataframe
pipeline = Pipeline([('column_trans', column_trans),
('GradientBoostingRegressor',GradientBoostingRegressor(n_estimators= 900, max_depth=6, min_samples_leaf= 5, random_state=random_state))])
pipeline.fit(X_train, y_train)
feature_results = pd.DataFrame({'feature': get_column_names_from_ColumnTransformer(column_trans),
'importance': pipeline.steps[1][1].feature_importances_})
feature_results = feature_results.sort_values('importance', ascending = False).reset_index(drop=True)
# Top 20 features
sns.set()
plt.figure(figsize=(17,5))
plt.title(f'Feature Importance')
sns.barplot(x='importance', y='feature', data=feature_results.head(20), orient='h', color = 'b')
The importance score assigned to each feature is a measure of how often that feature was selected, and how much of an effect it had in reducing impurity when it was selected. Since we one-hot encode most of our categorical features, the resulting sparsity means our continuous variable Alcohol is assigned a higher feature importance.
We should consider other interpretability methods (permutation importance or SHAP values) to help combat this preference for continuous variables over binary ones.
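For example, here is a sketch of permutation importance on the raw input columns (this assumes scikit-learn >= 0.22, where sklearn.inspection.permutation_importance was introduced):
from sklearn.inspection import permutation_importance
# Shuffle each raw column of X_test and measure the resulting drop in score
perm = permutation_importance(best_model, X_test, y_test,
                              scoring=scoring_metric, n_repeats=5,
                              random_state=random_state)
for i in perm.importances_mean.argsort()[::-1]:
    print(f'{X_test.columns[i]}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}')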
Conclusions
How did we do?
Not so great.
Looking at our learning curve, it seems we could improve generalization with more training samples. We could also test and add new variables to our dataset, such as red wine descriptions for some text analysis.
Why even do this?
How can SAQ make more money from predicting red wine prices?
Our text-description idea could help: we could look into the relationship between description text (or wine critics' reviews) and price. If we can identify certain words or characteristics associated with expensive wines, the SAQ could price its red wines accordingly.
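If we did collect description text, a minimal sketch of that idea could feed TF-IDF features into a regressor (the description column is hypothetical; it is not part of the scraped dataset):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
# Hypothetical: regress price on wine-description text
text_model = Pipeline([('tfidf', TfidfVectorizer(max_features=5000, stop_words='english')),
                       ('ridge', Ridge())])
# text_model.fit(descriptions, prices)  # 'descriptions' would be a newly scraped column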