Predicting the Future (of product sales!)
- stellabonaparte
- Dec 7, 2021
- 5 min read

With supply chain woes top of mind, businesses are taking a second look at how they allocate inventory. Is a product in one store collecting dust while another store is out of stock? Leveraging sales, geographic, and demographic data can help predict future sales.
The Tasks:
Task 1: Create store formats for existing stores. Categorize 85 stores into inventory segments. Each segments will have its own inventory plan. How many segments are optimal?
Task 2: Project which of these segments new stores will fall into.
Task 3: Create a 12-month sales forecast for new stores. The Tools:
Alteryx aggregation, formula, and predictive tools
Tableau data visualization software
The Files:
storedemographicdata.csv
storeinformation.csv
storesalesdata.csv
Note: If you would like a more technical view of how I arrived at my findings, just scroll down to the In-Depth Breakdown section. If you prefer plain English, continue here.
Task 1: Determine Store Formats
Grouped the sales data by store, year, and inventory category. Then joined this to the geographic data.
Ran three iterations of the Alteryx Cluster Diagnostics tool (the purpose of the Cluster tool is to categorize things into segments.)
Then visualized the results as a map.
In step 2 above I started by clustering the stores into six inventory segments and the results showed two of the six segments would be too small. Next I ran 4 clusters and then 3. A review of the statistics showed that 3 clusters were optimal.

Looking at the map below, imagine if inventory was distributed based on geography alone rather than on accurate sales data!


The number of stores in each segment and their total sales:
Cluster Store.Count Total Sales.2015
1 25 698302817.97
2 35 968990542.87
3 25 807040485.46
Task 2: Predict which of the three store formats NEW stores will fit into.
I employed Alteryx's Model Comparison tool to compare three different types of predictive models to find which one would be most accurate and most precise. This was the Boosted predictive model.

The Variable Importance Plot on the Boosted Model show that the three most important variables to the prediction of store segment were:
· Child in household Age 0 to 9
· Home Value of 750K Plus
· Age 65 Plus
Variable Importance Plot

Task 3: Predicting Sales
The final task involved using both ETS and ARIMA time-series models in Alteryx to determine which would be most accurate.
First, I reviewed the charts to determine if and to what degree there was Error, Trend, and Seasonality (ETS) so this was reflected in the ETS model.
I allowed the ARIMA model to be chosen by the Alteryx ARIMA tool.
I ran the ETS and ARIMA models.
After comparing the performance results of each model against their holdout samples and against each other, and reviewing the generated statistics, I am confident that the ETS model is the optimal one.
Tableau Visualization of the 12-period forecast for existing stores with historic data:

I ran a separate ETS forecast for each cluster. Then multiplied by the number of new stores in each cluster and joined the data. Finally, I put the new store forecasts in an Excel table with the existing store forecasts.
12-Month Forecasts for New and Existing Stores- Month/2016

All three tasks have been completed. This concludes my overview of the Udacity Predictive Analytics for Business capstone project. Thanks for reading. Below, the In-Depth Analysis has the same information with more technical details.
In-Depth Analysis
Task 1: Determine Store Formats for Existing Stores
The optimal number of store segments for existing stores is three. I did three iterations of the Alteryx Cluster Diagnostics tool and visualized the results with Tableau.

First, using the K-Means method, I ran it with six clusters, but the segmentation was indistinct and diffuse. I could also see that two of the six segments would be too small.
Next, I ran 4 clusters and then 3 clusters. The C-H Index showed that 3 clusters were optimal because the median for 3 clusters was higher than for 4 clusters and more compact than for 2 clusters. (While the median was higher for 2, the AR and C-H indices both showed more IQR compactness using 3 clusters.)

The number of stores in each segment and their total sales:
Cluster Store.Count Sum_Total.Store.Sales.2015
1 25 698302817.97
2 35 968990542.87
3 25 807040485.46
One way the clusters differ from each other is that Cluster 1 has the smallest average distance therefore is more compact, which may indicate it will be easiest to determine an optimal inventory mix for the stores in this segment. It also has the largest separation compared to the other clusters:

Tableau Visualization:



Task 2: Formats for New Stores
After running and comparing the Decision Tree, Forest, and Boosted Models using the Model Comparison tool and validating with the original 20% validation sample I chose the Boosted Model as the best option.
In the Model Comparison Report, Boosted was the highest in accuracy 76.5% (vs. Forest 70.6% and DT 64.7%)
It was also highest in the averaged precision and recall (F1) at 83.3% (vs. Forest 75% and DT 66.7%):

The Variable Importance Plot on the Boosted Model show that the three most important variables to the prediction of store segment were:
· Age0to9
· HVal750KPlus
· Age65Plus

The resulting predicted segments for new stores are as follows:

Task 3: Predicting Sales
Choice of ETS Notation
Error: Since the error is increasing and decreasing over time, the E component is multiplicative (m).
Trend: The trend is not linear, nor does it follow a discernable pattern, thus the T component is none (n).
Seasonality: The seasonality is decreasing slightly over time, so the S component is multiplicative (m).
Decomposition Plot results:

I allowed the ARIMA model to be chosen by the Alteryx ARIMA tool.
After running the ETS and ARIMA models, I observed that the results of the ARIMA (1,0,0) (1,1,0) model still showed some autocorrelation in the ACF and PACF plots.


External Validation
ETS model results using TS Compare tool:
The ETS model forecast conforms closely to the actual values as shown by the blue line:

ETS Model Accuracy Measures:

ARIMA model accuracy measures:

Notable differences between the ETS and ARIMA models:
· As can be seen above, the RMSE measures show that the ETS model has significantly less standard deviation from the mean with a measure of 663,707 vs. the ARIMA measure of 1,050,239.
· The MASE for ETS (0.3257) is also lower than that of ARIMA (0.5463) which indicates a significant reduction in error for the ETS model.
After comparing these performance results of each model against their holdout samples and against each other, I am confident that the ETS model is the optimal one.

After summarizing to get the average sum, I separated the clusters and ran an ETS forecast for each cluster. Then I multiplied by the number of new stores in each cluster using the Formula tool and joined the data. Finally, I put the new store forecasts in an Excel table with the existing store forecasts.
12-Month Forecasts for New and Existing Stores
Month/2016

Comments