top of page
Search

Predicting the Future (of product sales!)

  • Writer: stellabonaparte
    stellabonaparte
  • Dec 7, 2021
  • 5 min read

ree

With supply chain woes top of mind, businesses are taking a second look at how they allocate inventory. Is a product in one store collecting dust while another store is out of stock? Leveraging sales, geographic, and demographic data can help predict future sales.


The Tasks:

  • Task 1: Create store formats for existing stores. Categorize 85 stores into inventory segments. Each segments will have its own inventory plan. How many segments are optimal?

  • Task 2: Project which of these segments new stores will fall into.

  • Task 3: Create a 12-month sales forecast for new stores. The Tools:

  • Alteryx aggregation, formula, and predictive tools

  • Tableau data visualization software


The Files:

  • storedemographicdata.csv

  • storeinformation.csv

  • storesalesdata.csv

Note: If you would like a more technical view of how I arrived at my findings, just scroll down to the In-Depth Breakdown section. If you prefer plain English, continue here.

Task 1: Determine Store Formats

  1. Grouped the sales data by store, year, and inventory category. Then joined this to the geographic data.

  2. Ran three iterations of the Alteryx Cluster Diagnostics tool (the purpose of the Cluster tool is to categorize things into segments.)

  3. Then visualized the results as a map.

In step 2 above I started by clustering the stores into six inventory segments and the results showed two of the six segments would be too small. Next I ran 4 clusters and then 3. A review of the statistics showed that 3 clusters were optimal.


ree

Looking at the map below, imagine if inventory was distributed based on geography alone rather than on accurate sales data!




ree

ree








The number of stores in each segment and their total sales:


Cluster Store.Count Total Sales.2015


1 25 698302817.97

2 35 968990542.87

3 25 807040485.46


Task 2: Predict which of the three store formats NEW stores will fit into.

I employed Alteryx's Model Comparison tool to compare three different types of predictive models to find which one would be most accurate and most precise. This was the Boosted predictive model.

ree













The Variable Importance Plot on the Boosted Model show that the three most important variables to the prediction of store segment were:

· Child in household Age 0 to 9

· Home Value of 750K Plus

· Age 65 Plus


Variable Importance Plot

ree

Task 3: Predicting Sales


The final task involved using both ETS and ARIMA time-series models in Alteryx to determine which would be most accurate.

  1. First, I reviewed the charts to determine if and to what degree there was Error, Trend, and Seasonality (ETS) so this was reflected in the ETS model.

  2. I allowed the ARIMA model to be chosen by the Alteryx ARIMA tool.

  3. I ran the ETS and ARIMA models.

  4. After comparing the performance results of each model against their holdout samples and against each other, and reviewing the generated statistics, I am confident that the ETS model is the optimal one.

Tableau Visualization of the 12-period forecast for existing stores with historic data:

ree

I ran a separate ETS forecast for each cluster. Then multiplied by the number of new stores in each cluster and joined the data. Finally, I put the new store forecasts in an Excel table with the existing store forecasts.


12-Month Forecasts for New and Existing Stores- Month/2016

ree




















All three tasks have been completed. This concludes my overview of the Udacity Predictive Analytics for Business capstone project. Thanks for reading. Below, the In-Depth Analysis has the same information with more technical details.


In-Depth Analysis


Task 1: Determine Store Formats for Existing Stores


The optimal number of store segments for existing stores is three. I did three iterations of the Alteryx Cluster Diagnostics tool and visualized the results with Tableau.


ree


First, using the K-Means method, I ran it with six clusters, but the segmentation was indistinct and diffuse. I could also see that two of the six segments would be too small.

Next, I ran 4 clusters and then 3 clusters. The C-H Index showed that 3 clusters were optimal because the median for 3 clusters was higher than for 4 clusters and more compact than for 2 clusters. (While the median was higher for 2, the AR and C-H indices both showed more IQR compactness using 3 clusters.)

ree














The number of stores in each segment and their total sales:


Cluster Store.Count Sum_Total.Store.Sales.2015


1 25 698302817.97

2 35 968990542.87

3 25 807040485.46


One way the clusters differ from each other is that Cluster 1 has the smallest average distance therefore is more compact, which may indicate it will be easiest to determine an optimal inventory mix for the stores in this segment. It also has the largest separation compared to the other clusters:


ree

Tableau Visualization:


ree




ree

ree






Task 2: Formats for New Stores


After running and comparing the Decision Tree, Forest, and Boosted Models using the Model Comparison tool and validating with the original 20% validation sample I chose the Boosted Model as the best option.


In the Model Comparison Report, Boosted was the highest in accuracy 76.5% (vs. Forest 70.6% and DT 64.7%)

It was also highest in the averaged precision and recall (F1) at 83.3% (vs. Forest 75% and DT 66.7%):


ree

The Variable Importance Plot on the Boosted Model show that the three most important variables to the prediction of store segment were:

· Age0to9

· HVal750KPlus

· Age65Plus


ree











The resulting predicted segments for new stores are as follows:


ree












Task 3: Predicting Sales


Choice of ETS Notation

Error: Since the error is increasing and decreasing over time, the E component is multiplicative (m).

Trend: The trend is not linear, nor does it follow a discernable pattern, thus the T component is none (n).

Seasonality: The seasonality is decreasing slightly over time, so the S component is multiplicative (m).


Decomposition Plot results:


ree














I allowed the ARIMA model to be chosen by the Alteryx ARIMA tool.

After running the ETS and ARIMA models, I observed that the results of the ARIMA (1,0,0) (1,1,0) model still showed some autocorrelation in the ACF and PACF plots.


ree










ree









External Validation

ETS model results using TS Compare tool:

The ETS model forecast conforms closely to the actual values as shown by the blue line:

ree














ETS Model Accuracy Measures:


ree

ARIMA model accuracy measures:


ree

Notable differences between the ETS and ARIMA models:

· As can be seen above, the RMSE measures show that the ETS model has significantly less standard deviation from the mean with a measure of 663,707 vs. the ARIMA measure of 1,050,239.

· The MASE for ETS (0.3257) is also lower than that of ARIMA (0.5463) which indicates a significant reduction in error for the ETS model.

After comparing these performance results of each model against their holdout samples and against each other, I am confident that the ETS model is the optimal one.



ree

After summarizing to get the average sum, I separated the clusters and ran an ETS forecast for each cluster. Then I multiplied by the number of new stores in each cluster using the Formula tool and joined the data. Finally, I put the new store forecasts in an Excel table with the existing store forecasts.


12-Month Forecasts for New and Existing Stores

Month/2016


ree



 
 
 

Comments


©2021 by S. Bonaparte Portfolio. Proudly created with Wix.com

bottom of page