Rohan Dawar

Predicting Wine Prices with Machine Learning

Updated: Jan 10, 2022

The world of wine is immense and perhaps a bit intimidating. Buying a new wine can often feel like gambling: is it going to be worth the price, or will I end up with unpalatable vinegar? The taste of wine, like taste in anything, is subjective. Wine critics like James Suckling have built public reputations on the supposed merit of their superior assessments. Consumers extend a certain level of trust to their judgement, attributing it to a more 'refined' palate and a richer sensory vocabulary.

Additionally, countless papers have been published dissecting the relationship between price tag and perceived quality, and they often conclude, unsurprisingly, that price can dictate, or at least correlate with, perceived quality. Both the price tag and the critic's review can inform a suitable purchase, but they can also "prime" your senses and therefore affect how you experience the wine.


My solution is to strip away both the senses and the price tag. I'm going to build a machine with no taste buds that can tell us how much a bottle of wine should be worth, without knowing its price. It will do this by looking at other attributes, such as the country of origin and the grape varietal. I will then compare the predictions against the target prices and see how the models perform.


Outline

1. Finding A Dataset and Cleaning
2. Input Engineering
3. Data Analysis
4. Baseline Models
5. Linear Regression Models
6. Decision Tree Models
7. Gradient Boosting Models
8. Cross Validation
9. Hyper-Parameter Tuning
10. Conclusion

1. Finding A Dataset and Cleaning


For this I head over to Kaggle, the one-stop shop for datasets, analysis and ML models. I found this dataset from 2017 by user zynicide, who scraped wine listings from WineMag (AKA WineEnthusiast). In total the dataset includes ~150k rows, but after dropping rows without a price we are left with 137,235 entries. The "points" scores (given to each wine by a verified taster) all fall between 80 and 100, so we can scale them to 0-1. The prices are all in 2017 U.S. dollars, as listed in WineEnthusiast's database.
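A rough sketch of this cleaning step, assuming pandas and the CSV file name from the Kaggle download:

```python
import pandas as pd

# Load the scraped WineMag listings (file name assumed from the Kaggle dataset)
df = pd.read_csv("winemag-data_first150k.csv")

# Drop rows with no listed price -- price is the prediction target
df = df.dropna(subset=["price"])

# Points all fall between 80 and 100, so rescale them to 0-1
df["points"] = (df["points"] - 80) / 20

print(len(df))  # 137,235 entries remain
```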


 

2. Input Engineering

  • Word Counter

Since the description column is unique for every entry in the dataset, we can use a word counter to find the 500 most frequent words, narrow those down to the 72 most frequent descriptive words, and then check each description against that list. Here is a peek at the 15 most frequent words:
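A minimal sketch of how this counting might look; the stop-word list and the descriptive words shown are illustrative stand-ins for the hand-picked 72:

```python
from collections import Counter

# Illustrative stop-word list to filter out non-descriptive words
stop_words = {"the", "and", "a", "of", "with", "this", "is", "in", "it", "to", "on"}

counts = Counter(
    word
    for desc in df["description"].str.lower()
    for word in desc.split()
    if word.isalpha() and word not in stop_words
)
top_500 = [word for word, _ in counts.most_common(500)]

# Flag each description against the chosen descriptive words as binary input columns
descriptive_words = ["cherry", "oak", "crisp"]  # illustrative subset of the 72
for word in descriptive_words:
    df[f"word_{word}"] = df["description"].str.lower().str.contains(word).astype(int)
```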

Once the models are fitted, we can determine which of these words has the most impact on the price of the bottle.


  • Mean Target Encoding

For columns with too many unique values to one-hot encode, we can perform mean target encoding (MTE): replacing each value with the mean price of the rows that share it. In this case, I will mean target encode the following columns: region_1, region_2, winery and designation. From this process we can determine which values of these columns correspond to the highest and lowest bottle prices:
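A sketch of this encoding with pandas, including the country-of-origin fallback for missing values described below:

```python
# Mean target encoding: replace each category with the mean price of its rows
mte_cols = ["region_1", "region_2", "winery", "designation"]
for col in mte_cols:
    df[col + "_mte"] = df[col].map(df.groupby(col)["price"].mean())

# Rows with no value fall back to the mean price of their country of origin
country_means = df["country"].map(df.groupby("country")["price"].mean())
for col in mte_cols:
    df[col + "_mte"] = df[col + "_mte"].fillna(country_means)
```

Note that in a stricter setup the encoding would be computed on the training split only, so that target information does not leak into validation.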



Rows without values for the aforementioned columns are filled with the mean target value of their country of origin. In the ML models I apply a MinMaxScaler to these MTE values. A zip file of all these MTE values can be found here:


  • One-Hot Encoding

Lastly, I one-hot encode the country, province and variety columns, as these have a small enough number of unique values to be encoded as their own columns in the dataframe.
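With pandas this might look like:

```python
# One-hot encode the low-cardinality categorical columns
df = pd.get_dummies(df, columns=["country", "province", "variety"])
```

Since get_dummies creates one binary column per unique value, this is where the bulk of the input columns below comes from.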


After input engineering, our dataset has a total of 1,191 input columns: 72 are word columns, 4 are MTE columns, 1 is points, and the rest are one-hot encoded from the unique values of country, province and variety.


 

3. Data Analysis


Before building models, we can perform some simple exploratory analysis to look for trends that may help inform model predictions. Below is a collection of graphs produced with plotly express from the main dataframe:
 

4. Baseline Models

To get a baseline accuracy for predicting price, I first set up some dumb models. To score these models I use Root Mean Squared Error (RMSE) and Mean Fractional Error (MFE). RMSE tells us, on average, how many dollars the model is off by, with larger misses penalized more heavily. MFE tells us, on average, the fraction (percentage) the model is off by. Here are the dumb baseline models:


Dumb Random Model:

Predicts a random integer between the minimum and maximum price in the dataset.


Dumb Mean Model:

Always predicts the mean price of the dataset: $33.13.


Dumb Median Model:

Always predicts the median price of the dataset: $24.
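A minimal sketch of these baselines and both metrics; the exact MFE formula is my assumption (mean absolute error as a fraction of the true price), and in the post the scores are computed on a held-out validation split rather than the full dataset:

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mfe(y_true, y_pred):
    # Mean fractional error: average absolute miss as a fraction of the true price
    return np.mean(np.abs(y_true - y_pred) / y_true)

y_val = df["price"].to_numpy()
rng = np.random.default_rng(42)

baselines = {
    "Dumb Random": rng.integers(int(y_val.min()), int(y_val.max()) + 1, size=len(y_val)),
    "Dumb Mean": np.full(len(y_val), y_val.mean()),
    "Dumb Median": np.full(len(y_val), np.median(y_val)),
}
for name, preds in baselines.items():
    print(f"{name}: RMSE={rmse(y_val, preds):.2f}, MFE={mfe(y_val, preds):.2%}")
```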


Results:

| Model | Validation RMSE | Validation MFE % | Time To Train (sec) |
| --- | --- | --- | --- |
| Dumb Random | 1302.85 | 5,373 | 53.16 |
| Dumb Mean | 36.32 | 79.74 | 0.02 |
| Dumb Median | 37.45 | 53.06 | 0.03 |

So a "good" model is one that can beat our dumb mean model, which can guess the price of a bottle with a root mean squared error of $36.32.


 

5. Linear Regression Models


These models are based on classical linear regression, with both Ridge and Lasso adding built-in regularization that helps reduce over-fitting:

| Model | Validation RMSE | Validation MFE % | Time To Train (sec) |
| --- | --- | --- | --- |
| Ridge Regression | 15.01 | 30.75 | 11.93 |
| Ridge Regression Log | 15.36 | 47.90 | 11.80 |
| Lasso Regression | 30.40 | 58.59 | 3.17 |
| Lasso Regression Log | 35.06 | 57.04 | 2.12 |
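For reference, a sketch of the Ridge fits with scikit-learn; I am assuming a train/validation split (X_train, X_val, y_train, y_val) and reading the "Log" variants as fitting on log-price and exponentiating the predictions back to dollars:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Plain Ridge regression on dollar prices
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
rmse_ridge = np.sqrt(mean_squared_error(y_val, ridge.predict(X_val)))

# "Log" variant: fit on log-price, then convert predictions back to dollars
ridge_log = Ridge(alpha=1.0)
ridge_log.fit(X_train, np.log1p(y_train))
rmse_log = np.sqrt(mean_squared_error(y_val, np.expm1(ridge_log.predict(X_val))))

# For a linear model, feature importance can be read off the coefficients
top_features = sorted(zip(X_train.columns, ridge.coef_), key=lambda t: -abs(t[1]))[:10]
```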

Ridge Regression Feature Importance:


 

6. Decision Tree Models


These models are based on decision tree regression. Multiple decision trees can then be ensembled into a random forest, which provides a way to "average out" error:

| Model | Validation RMSE | Validation MFE % | Time To Train (sec) |
| --- | --- | --- | --- |
| Decision Tree Regression | 17.58 | 9.05 | 10.76 |
| Decision Tree Regression Log | 16.58 | 8.92 | 10.28 |
| Random Forest Regression | 9.96 | 8.81 | 1039.23 |
| Random Forest Regression Log | 9.19 | 8.23 | 978.85 |
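A sketch of the log-target random forest, the best performer above, under the same split assumptions and with default hyper-parameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

forest = RandomForestRegressor(n_jobs=-1, random_state=42)
forest.fit(X_train, np.log1p(y_train))
rmse_forest = np.sqrt(mean_squared_error(y_val, np.expm1(forest.predict(X_val))))

# Built-in importances back the feature-importance chart below
importances = sorted(zip(X_train.columns, forest.feature_importances_), key=lambda t: -t[1])
```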

Random Forest Regression Log Feature Importance:


 

7. Gradient Boosting Models


These models are based on gradient boosting, where each successive regression tree is fit on a modified version of the original training set (the residual errors of the ensemble built so far):

| Model | Validation RMSE | Validation MFE % | Time To Train (sec) |
| --- | --- | --- | --- |
| Gradient Boosting | 12.47 | 22.25 | 374.80 |
| Gradient Boosting Log | 14.90 | 17.45 | 368.45 |
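The corresponding sketch, again assuming the same split:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Each new tree is fit to the residuals of the trees before it
gb = GradientBoostingRegressor(random_state=42)
gb.fit(X_train, y_train)
rmse_gb = np.sqrt(mean_squared_error(y_val, gb.predict(X_val)))
```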

Gradient Boosting Feature Importance:

Gradient Boosting Log Feature Importance:


 

8. Cross Validation

In cross validation with 3 folds, the random forest regression log model's validation RMSE varies by about 11, which is quite significant. This means that further k-fold cross validation should be explored, on this dataset or a larger one, to obtain the most generalizable training split.
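A sketch of that 3-fold check, assuming the full feature matrix X and target y; note that these per-fold scores come back in log-price space, so a custom scorer would be needed to reproduce the dollar RMSE spread quoted above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# sklearn negates MSE so that "higher is better" holds for all scorers
scores = cross_val_score(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    X, np.log1p(y), cv=3, scoring="neg_mean_squared_error",
)
print(np.sqrt(-scores))  # per-fold RMSE in log-price space
```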

 

9. Hyper-Parameter Tuning

In tuning the random forest log model, we see that the default settings for most hyper-parameters produce the lowest error, with the exception of max_features and n_estimators:

A minimum RMSE of 9.108 is achieved with n_estimators = 200, but tuning max_features and n_estimators together could be explored further.
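A sketch of a joint grid search over those two hyper-parameters (the grid values here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_grid={"n_estimators": [100, 200, 400], "max_features": ["sqrt", 0.3, 1.0]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, np.log1p(y_train))
print(search.best_params_, np.sqrt(-search.best_score_))
```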

 

10. Conclusion

We now have a model, trained in roughly 16 minutes from our input-engineered dataframe, that can predict the price of a bottle of wine with an average error of $9, or about 8%. This error could be reduced further with more cross-validation, hyper-parameter tuning and input engineering.


Here are some hand-picked examples showcasing a good, an average and a poor prediction from the model:

Further analysis can be done to uncover any attribute-error trends in order to determine which attributes cause the model to perform poorly.


For those wanting to run the models, I have included the .joblib files for the most promising models, the Random Forest and Gradient Boosting (log and normal):


Here is the .csv file of the full dataframe with dropped NaN prices and new features added:


And finally, here is a link to the full Jupyter notebook containing my code and work for this project:

Cheers! 🍷



