
Working with Time Series data: splitting the dataset and putting the model into production

Cross Validated - Asked by Fernando Camargo on December 13, 2021

I’ve been working with ML for some time now, especially Deep Learning, but I haven’t worked with Time Series before, and I’ve now started working on a project for Demand Forecasting. I’m studying the statistical / auto-regressive methods and also trying to understand how CNNs and LSTMs can be used to tackle the problem. But I’m having a hard time sorting some things out in my head, mainly about how to split the dataset and put the model into production. So, here are my two main doubts:

Time Series Nested Cross-Validation

I started using Time Series Nested Cross-Validation. Alright, I understand that it’s not the only option, but I think it’s a great fit for tuning my model’s hyperparameters and making sure it doesn’t overfit. Since in production I’ll have to forecast the next 90 days, my test set is always 90 days. But here is the question: with statistical / auto-regressive models (like ARIMA), when I finish tuning the parameters, what should I do? Should I put the model with the largest training set into production? But wouldn’t I be missing the most recent 90 days of data? Is it safe to retrain it on the whole dataset with the same parameters so that it doesn’t miss this data?

After a lot of research into how to use LSTM and other Machine Learning models for Time Series, I understood that the training dataset needs to be transformed into samples with a rolling window. I mean, I pass a window through the dataset with N elements as input and M elements as output, with the window moving one step at a time. Alright, but then, how do I split the training dataset into training and validation (to use ModelCheckpoint and EarlyStopping)? I’ve seen some tutorials using a random split of these generated samples, but I feel that creates data leakage between the training and validation sets. The other option seems to be splitting temporally before the rolling-window process (e.g. keeping 90 days as a validation set). That sounds better to me, since no data would be leaked. But then, how would I put it into production? If I simply pick the model trained with the largest dataset, it would be missing 90 days from the test set plus 90 days from the validation set, so it wouldn’t pick up recent trends. And I don’t think it’s safe to simply retrain the model on the whole dataset with the same hyperparameters, since I wouldn’t have a way to early-stop the training process.
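To make the second option concrete, here is a minimal sketch of what I mean by splitting temporally before the rolling-window transformation (placeholder data; the `make_windows` helper and the 90-day figures are just for illustration):

```python
import numpy as np

def make_windows(series, n_in, n_out):
    """Slide a window one step at a time: n_in past values as input,
    the next n_out values as the target."""
    X, y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start:start + n_in])
        y.append(series[start + n_in:start + n_in + n_out])
    return np.array(X), np.array(y)

demand = np.random.rand(1000)          # placeholder daily demand history
n_in, n_out, val_days = 90, 90, 90
cut = len(demand) - val_days           # everything after `cut` is held out

# Training windows come only from the history before the cut.
X_train, y_train = make_windows(demand[:cut], n_in, n_out)

# Validation windows may use pre-cut history as *inputs*, but every
# *target* lies entirely in the held-out 90 days, so nothing leaks.
X_val, y_val = make_windows(demand[cut - n_in:], n_in, n_out)
```

With n_out = 90 and a 90-day validation span this yields a single validation window; holding out a longer span (e.g. 180 days) would give more validation samples to drive EarlyStopping.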

I understand that I need to retrain my model constantly because the world is changing and the model needs to pick up new trends in the data. So, after finding the best hyperparameters, I expect the model to be automatically retrained with them on a given schedule (every week, for example). But I can’t wrap my head around these doubts. Am I training a model to predict the next 90 days using data that ends 90 days ago (with the statistical models) or 180 days ago (with ML)?

2 Answers

For standard statistical methods (ARIMA, ETS, Holt-Winters, etc...)

I don't recommend any form of cross-validation (even time series cross-validation is a little tricky to use in practice). Instead, use a simple test/train split for experiments and initial proofs of concept, etc...

Then, when you go to production, don't bother with a train/test/evaluate split at all. As you correctly pointed out, you don't want to lose the valuable information present in the last 90 days. Instead, in production you train multiple models on the entire data set and then choose the one that gives you the lowest AIC or BIC.

This approach (trying multiple models and picking the one with the lowest information criterion) can be thought of, intuitively, as a form of grid search where the information criterion plays the role of a penalized loss, in the same spirit as MSE with L2 regularization.

In the large-data limit, the AIC is equivalent to leave-one-out CV, and the BIC is equivalent to K-fold CV (if I recall correctly). See chapter 7 of Elements of Statistical Learning for details and a general discussion of how to train models without using a test set.

This approach is used by most production-grade demand forecasting tools, including the one my team uses. For developing your own solution, if you are using R, then the auto.arima and ETS functions from the forecast and fable packages will perform this AIC/BIC optimization for you automatically (and you can also tweak some of the search parameters manually as needed).

If you are using Python, then the ARIMA and Statespace APIs will return the AIC and BIC for each model you fit, but you will have to write the grid-search loop yourself. There are some packages that perform automatic time series model selection similar to auto.arima, but last I checked (a few months back) they weren't mature yet (definitely not production grade).
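To make that loop concrete, here is a rough sketch with statsmodels (the series, order ranges, and 90-day horizon are placeholders; adapt them to your data):

```python
import itertools
import warnings
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.rand(500)  # placeholder for the full demand history

best_aic, best_order, best_model = np.inf, None, None
for p, d, q in itertools.product(range(4), range(3), range(4)):
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            res = ARIMA(y, order=(p, d, q)).fit()
    except Exception:
        continue  # some orders fail to converge; just skip them
    if res.aic < best_aic:
        best_aic, best_order, best_model = res.aic, (p, d, q), res

# The winning model was fit on the entire history (no held-out test set),
# so it is the one you deploy and use to forecast the next 90 days.
forecast = best_model.forecast(steps=90)
```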

For LSTM-based forecasting, the philosophy is a little different.

For experiments and proof of concept, again use a simple train/test split (especially if you are going to compare against other models like ARIMA, ETS, etc...) - basically what you describe in your second option.

Then bring in your whole dataset, including the 90 days you originally left out for validation, and apply some Hyperparameter search scheme to your LSTM with the full data set. Bayesian Optimization is one of the most popular hyperparameter tuning approaches right now.
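For instance, here is a minimal sketch using Optuna (whose default TPE sampler is one Bayesian-flavored option) to tune a small Keras LSTM; the data shapes, search ranges, and epoch counts are placeholders, not recommendations:

```python
import numpy as np
import optuna
import tensorflow as tf

n_in, n_out = 90, 90
# Placeholder windowed data: (samples, input window, features) and (samples, horizon).
X = np.random.rand(2000, n_in, 1).astype("float32")
y = np.random.rand(2000, n_out).astype("float32")

def objective(trial):
    units = trial.suggest_int("units", 16, 128)
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(units, input_shape=(n_in, 1)),
        tf.keras.layers.Dense(n_out),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    hist = model.fit(X, y, validation_split=0.2, epochs=10,
                     batch_size=64, verbose=0)
    return min(hist.history["val_loss"])

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```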

Once you've found the best hyperparameters, deploy your model to production and start scoring its performance.

Here is one important difference between LSTM and Statistical models:

Usually statistical models are re-trained every time new data comes in (for the various teams I have worked for, we retrain the models every week or sometimes every night - in production we always use different flavors of exponential smoothing models).

You don't have to do this for an LSTM; instead, you only need to retrain it every 3 to 6 months, or you can automatically re-trigger the retraining process whenever the performance monitoring indicates that the error has gone above a certain threshold.
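One way to wire up such a trigger (illustrative names and threshold; choose the metric and cutoff from your own backtests):

```python
import numpy as np

ERROR_THRESHOLD = 0.15   # e.g. an acceptable MAPE, chosen from backtesting
WINDOW = 28              # how many recent days of live forecasts to score

def mape(actual, forecast):
    return float(np.mean(np.abs((actual - forecast) / actual)))

def should_retrain(recent_actuals, recent_forecasts):
    """True when the error over the last WINDOW days drifts past the threshold."""
    return mape(recent_actuals[-WINDOW:], recent_forecasts[-WINDOW:]) > ERROR_THRESHOLD

# e.g. in a daily monitoring job:
# if should_retrain(actuals, forecasts):
#     kick_off_training_pipeline()   # hypothetical hook into your scheduler
```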

BUT - and this is a very important BUT!!!! - you can do this only because your LSTM has been trained on several hundred or thousand products/time series simultaneously, i.e. it is a global model. This is why it is "safe" not to retrain an LSTM so frequently: it has already seen so many previous examples of time series that it can pick up on trends and changes in a newer product without having to adapt to the local, time-series-specific dynamics.

Note that because of this, you will have to include additional product features (product category, price, brand, etc...) in order for the LSTM to learn the similarities between the different products. LSTM only performs better than statistical methods in demand forecasting if it is trained on a large set of different products. If you train a separate LSTM for each individual product's time series, then you will almost certainly end up overfitting, and a statistical method is guaranteed to work better (and is easier to tune because of the above-mentioned IC trick).
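Here is a sketch of what such a global model can look like with the Keras functional API: one input for the demand window, one for static product features (encoded category, price, brand, ...). Shapes and layer sizes are placeholders.

```python
import tensorflow as tf

n_in, n_out, n_static = 90, 90, 8   # window length, horizon, number of product features

seq_in = tf.keras.Input(shape=(n_in, 1), name="demand_window")
static_in = tf.keras.Input(shape=(n_static,), name="product_features")

h = tf.keras.layers.LSTM(64)(seq_in)                  # temporal dynamics
h = tf.keras.layers.Concatenate()([h, static_in])     # mix in product identity
h = tf.keras.layers.Dense(64, activation="relu")(h)
out = tf.keras.layers.Dense(n_out)(h)

model = tf.keras.Model([seq_in, static_in], out)
model.compile(optimizer="adam", loss="mse")
# model.fit([windows, product_features], targets, ...) is run across ALL products at once.
```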

To recap:

In both cases, do retrain on the entire data set, including the 90-day validation set, after doing your initial train/validation split.

  • For statistical methods, use a simple time series train/test split for some initial validation and proofs of concept, but don't bother with CV for hyperparameter tuning. Instead, train multiple models in production and use the AIC or the BIC as the metric for automatic model selection. Also, perform this training and selection as frequently as possible (i.e. each time you get new demand data).
  • For LSTM, train a global model on as many time series and products as you can, and use additional product features so that the LSTM can learn similarities between products. This makes it safe to retrain the model every few months instead of every day or every week. If you can't do this (because you don't have the extra features, or you only have a limited number of products, etc...), don't bother with LSTM at all and stick with statistical methods instead.
  • Finally, look at hierarchical forecasting, which is another approach that is very popular for demand forecasting with multiple related products. 

Answered by Skander H. on December 13, 2021

Simply select the forecast horizon based upon how often you will update your forecasts. Assume you have 200 observations and plan on reforecasting every 7 periods. Take the first 193 values and predict the observations for periods 194-200. Then take the first 186 values and predict periods 187-193, and so on. In this way all of your history is used to obtain a model and parameters to predict the next 7 values from K origins (test points).
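A minimal sketch of that rolling-origin scheme, with a naive last-value forecast standing in for whatever model you identify at each origin:

```python
import numpy as np

series = np.random.rand(200)   # placeholder history of 200 observations
horizon = 7

errors = []
for end in range(len(series) - horizon, horizon, -horizon):  # origins 193, 186, 179, ...
    train, test = series[:end], series[end:end + horizon]
    forecast = np.repeat(train[-1], horizon)  # stand-in: replace with your identified model
    errors.append(np.mean(np.abs(test - forecast)))

print(np.mean(errors))  # average 7-step-ahead error across all origins
```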

Now at each point in the future remodel using all of the known data to predict the next 7 values.

It is important to note that one can specify a model or allow empirical identification à la https://autobox.com/pdfs/ARIMA%20FLOW%20CHART.pdf at each of the test points in order to provide a measure of the expected adequacy/inadequacy.

In this way your model is DYNAMIC and is identified based upon all of the historical data.

Now, what I suggest is that at each model-building stage you EXPLICITLY test for constancy of parameters AND constancy of the error variance in order to yield a useful model AND respond to model dynamics (changes). In this way you effectively discard data that is no longer relevant, as things may have changed such that older data needs to be put aside (parameter constancy) or at least modified via variance-stabilizing weights (GLS).
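As a rough stand-in for those checks (the specific tests in the author's own tooling differ), statsmodels offers a CUSUM test on residuals for parameter stability and a Breusch-Pagan test for non-constant error variance; the series and regressors below are placeholders:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import breaks_cusumolsresid, het_breuschpagan

y = np.random.rand(200)                     # placeholder series
X = sm.add_constant(np.arange(len(y)))      # placeholder regressors (e.g. trend, lags)
res = sm.OLS(y, X).fit()

_, cusum_pval, _ = breaks_cusumolsresid(res.resid)   # parameter-stability check
_, bp_pval, _, _ = het_breuschpagan(res.resid, X)    # constant-variance check

if cusum_pval < 0.05:
    print("parameters may not be constant: consider discarding older data")
if bp_pval < 0.05:
    print("error variance may not be constant: consider GLS / variance-stabilizing weights")
```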

Answered by IrishStat on December 13, 2021
