Predicting Retail Sales Data (Python ML)
This project was done with a team of three students for the machine learning course CIS 519. We set out to tackle the Kaggle M5 forecasting challenge described here. Our goal was to create a computational method to accurately predict sales of Walmart goods based on historical data.
Overview
In this project, we evaluate and compare various ML, DL, and Bayesian learning approaches to forecast item-level sales data for Walmart's products.
The problem of time-series prediction is common across a multitude of domains but is particularly important for retailers, so they can prepare appropriate
inventory and staffing levels.
There have been many studies evaluating classic statistical techniques for time-series problems, such as Exponential Smoothing (ETS) and the
Autoregressive Integrated Moving Average (ARIMA) model, but far fewer that use decision tree models, NBD regression, or Long Short-Term Memory Recurrent Neural Networks
(LSTM-RNNs). We make single-step predictions and compare their quality on the basis of RMSE over a 657-day test period.
Probability Model
Uncertainty distributions offer a different way to explore the data, and their forecasts are worth comparing against the results of our machine learning algorithms.
The model we use is the Negbin II, a negative binomial (NBD) regression that models sales counts as a Gamma mixture of Poissons with covariates. This model yielded a test RMSE of 1.68.
The results are shown below:

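For readers curious how such a model might be fit in code, the sketch below uses statsmodels' NB2 negative binomial regression; the feature matrices and target vectors here are placeholder stand-ins, not our actual engineered data.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical design matrices: rows are (item, day) observations,
# columns are covariates such as price, calendar, and promotion flags.
X_train = sm.add_constant(np.random.rand(1000, 4))   # placeholder features
y_train = np.random.poisson(2.0, size=1000)          # placeholder daily unit sales
X_test = sm.add_constant(np.random.rand(200, 4))
y_test = np.random.poisson(2.0, size=200)

# NB2 ("Negbin II"): a Gamma mixture of Poissons, variance = mu + alpha * mu^2
nbd = sm.NegativeBinomial(y_train, X_train, loglike_method="nb2")
result = nbd.fit(disp=False)

# The point forecast is the conditional mean; score it with RMSE.
mu_hat = result.predict(X_test)
rmse = np.sqrt(np.mean((mu_hat - y_test) ** 2))
print(f"Test RMSE: {rmse:.2f}")
```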
Decision Tree Models
As a foundation for our foray into machine learning models, we explore one of the simpler model families in the space: decision trees.
To benchmark, we build a simple decision stump and a single decision tree. By passing in the features compiled during our feature engineering step,
we can learn regressors that, ideally, forecast our data reasonably well. The results for the decision stump and tree are, as expected, fairly inaccurate:
the stump simply chooses between sales of zero and the mean, while the tree is more nuanced but still extrapolates poorly to test data.
Next we try a random forest and a boosted tree. To decide the optimal tree depth for our AdaBoost algorithm, we examine a plot of training vs. validation
errors as the tree depth increases.

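The depth sweep could be reproduced along the lines of the sketch below, assuming scikit-learn; the data is a hypothetical stand-in for our engineered features, and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error

# Placeholder train/validation splits standing in for the engineered features
# (lags, calendar variables, prices).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((2000, 10)), rng.poisson(2.0, 2000)
X_val, y_val = rng.random((500, 10)), rng.poisson(2.0, 500)

for depth in range(1, 9):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    boosted = AdaBoostRegressor(
        estimator=DecisionTreeRegressor(max_depth=depth),  # base_estimator in older scikit-learn
        n_estimators=50,
    ).fit(X_train, y_train)

    tree_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
    boost_rmse = np.sqrt(mean_squared_error(y_val, boosted.predict(X_val)))
    print(f"depth={depth}  tree RMSE={tree_rmse:.3f}  AdaBoost RMSE={boost_rmse:.3f}")
```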
The validation error appears to decrease up to a depth of around 4, then remains constant for AdaBoost (and increases for a simple tree), so we use a maximum depth of 4 for both our tree and AdaBoost models. Below is the combined performance of the two final tree models on the validation set. As can be seen below, both models tend to systematically overpredict sales, a result of higher sales values in the training set. Both also have test RMSE values of approximately 1.90.

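A minimal sketch of fitting the two final tree models at that depth, again assuming scikit-learn and placeholder data in place of our engineered features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Placeholder data standing in for the engineered train/test features.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((2000, 10)), rng.poisson(2.0, 2000)
X_test, y_test = rng.random((500, 10)), rng.poisson(2.0, 500)

# Final tree models at the depth chosen from the validation curve (max_depth=4).
forest = RandomForestRegressor(n_estimators=200, max_depth=4, n_jobs=-1).fit(X_train, y_train)
boosted = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=4), n_estimators=50
).fit(X_train, y_train)

for name, model in [("random forest", forest), ("AdaBoost", boosted)]:
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: test RMSE = {rmse:.2f}")
```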
LSTM Neural Network
We were able to capture the temporal component of our regression using the random forest with feature engineering, but Recurrent Neural Networks are perhaps
better suited to a time-series task, as they inherently preserve sequential dependencies. RNNs, and specifically LSTMs, were popularized by their
success on speech and text data, but we can adapt the model to our task. Specifically, we use the many-to-one variant, in which many input time steps
are used to compute a single output value. To construct inputs for the LSTM we used a shifted sliding window:
each sample's features consist of the 28 days of sales prior to the target day. This model's test RMSE was 1.78.

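A minimal sketch of the sliding-window construction and a many-to-one LSTM is shown below, assuming Keras; the layer sizes, training settings, and the placeholder sales series are illustrative assumptions rather than our exact configuration.

```python
import numpy as np
import tensorflow as tf

WINDOW = 28  # each sample's inputs are the 28 days prior to the target day

def make_windows(series, window=WINDOW):
    """Shifted sliding window: X[i] = series[i:i+window], y[i] = series[i+window]."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)  # shape: (samples, window, 1)

# Placeholder daily sales series for one item; the real series comes from the M5 data.
sales = np.random.poisson(2.0, size=1000).astype("float32")
X, y = make_windows(sales)

# Many-to-one LSTM: a sequence of 28 days in, a single next-day forecast out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)
```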
By adding more features to our network (and increasing the complexity of our model), we were able to make the LSTM more accurate. We added 1-day and 1-week lags, along with categorical embeddings of variables such as day-of-week, month, and year. With these improvements, we brought the test RMSE of our LSTM model down to 1.63.

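One way such a network could be wired up is sketched below, assuming Keras' functional API; the channel counts, embedding dimensions, and input names are assumptions for illustration, not our exact architecture.

```python
import tensorflow as tf

WINDOW = 28

# Sequence input: for each of the 28 prior days we feed the sales value plus
# hypothetical 1-day and 1-week lag features (3 channels per time step).
seq_in = tf.keras.layers.Input(shape=(WINDOW, 3), name="sales_and_lags")

# Categorical inputs for the target day, mapped to small learned embeddings.
dow_in = tf.keras.layers.Input(shape=(1,), name="day_of_week")
month_in = tf.keras.layers.Input(shape=(1,), name="month")
year_in = tf.keras.layers.Input(shape=(1,), name="year")

dow_emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(7, 3)(dow_in))
month_emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(12, 3)(month_in))
year_emb = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(6, 2)(year_in))

# Merge the sequence summary with the calendar embeddings, then regress.
lstm_out = tf.keras.layers.LSTM(32)(seq_in)
merged = tf.keras.layers.Concatenate()([lstm_out, dow_emb, month_emb, year_emb])
hidden = tf.keras.layers.Dense(16, activation="relu")(merged)
output = tf.keras.layers.Dense(1)(hidden)

model = tf.keras.Model([seq_in, dow_in, month_in, year_in], output)
model.compile(optimizer="adam", loss="mse")
```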
Model Comparisons
The simplest and most intuitive benchmark is the naive approach of using the previous day's sales to forecast the current day's sales, which yields an
RMSE of 2.39. From the standpoint of our managerial end goal, this is the number we want to beat. Purely on the basis of out-of-sample RMSE, the LSTM is our
best-performing model, a substantial improvement over the one-day-lag approach. However, it is complex and takes time to train, especially if scaled to a
larger data set. The probability model has strong out-of-sample fit and scales much more efficiently to larger data sets. The random forest model is a
significant improvement over the naive approach, but is not nearly as accurate as either of the other models. The optimal model will depend on the
situation and the specific application.
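For reference, the naive one-day-lag baseline described above amounts to the following; the series here is a placeholder, and on the real 657-day test period this baseline scored an RMSE of 2.39.

```python
import numpy as np

def naive_lag1_rmse(series):
    """Forecast each day's sales with the previous day's sales and score with RMSE."""
    actual = series[1:]
    forecast = series[:-1]  # yesterday's value, reused as today's prediction
    return np.sqrt(np.mean((forecast - actual) ** 2))

# Placeholder daily sales series standing in for the actual test period.
sales = np.random.poisson(2.0, size=657).astype(float)
print(f"Naive 1-day-lag RMSE: {naive_lag1_rmse(sales):.2f}")
```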