What is a backtest and what information does it provide?
Systematic trading strategies aim to detect recurring patterns and exploit them profitably. A disciplined implementation of this goal requires clear rules for entries, exits and position sizing. The quality and robustness of the trading strategy, which has previously been implemented in code, can then be checked against historical data. This simulation of buy and sell transactions is called backtesting. The backtest answers the following questions, among others, and thus enables an objective assessment of the trading strategy:
What performance has the strategy achieved in the past?
Which drawdowns occurred in the past and how long did they last?
How often does the pattern to be traded usually occur and how often can it be traded profitably?
How does the strategy behave in different market phases?
What share of the overall performance is attributable to outliers?
How is the equity curve correlated with the benchmark or with other trading strategies?
Detailed information on the risk and return of the trading strategy can be found in the performance report in the form of various key figures. These include, for example, the profit factor, the maximum drawdown, the Sharpe and Sortino ratios, and the skewness and kurtosis of the return distribution (more on this in the article "The most important key figures for assessing systematic trading strategies").
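As a rough illustration, the following sketch computes these key figures from a series of daily strategy returns using NumPy and SciPy; the function name and the assumption of 252 trading days per year and a zero risk-free rate are ours, not part of any particular backtesting package.

```python
# Minimal sketch: common backtest key figures from a series of periodic strategy returns.
# Assumes daily returns, 252 trading days per year and a zero risk-free rate (illustrative).
import numpy as np
from scipy import stats

def performance_metrics(returns, periods_per_year=252):
    returns = np.asarray(returns, dtype=float)

    # Profit factor: gross profits divided by gross losses
    gross_profit = returns[returns > 0].sum()
    gross_loss = -returns[returns < 0].sum()
    profit_factor = gross_profit / gross_loss if gross_loss > 0 else np.inf

    # Maximum drawdown of the compounded equity curve
    equity = np.cumprod(1.0 + returns)
    drawdown = equity / np.maximum.accumulate(equity) - 1.0
    max_drawdown = drawdown.min()

    # Annualised Sharpe and Sortino ratios
    ann = np.sqrt(periods_per_year)
    sharpe = ann * returns.mean() / returns.std(ddof=1)
    downside = returns[returns < 0]
    sortino = ann * returns.mean() / downside.std(ddof=1) if len(downside) > 1 else np.inf

    return {
        "profit_factor": profit_factor,
        "max_drawdown": max_drawdown,
        "sharpe": sharpe,
        "sortino": sortino,
        "skewness": stats.skew(returns),
        "kurtosis": stats.kurtosis(returns),  # excess kurtosis
    }
```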
The code must be checked
Every trading strategy is based on an algorithm written in a programming language and then applied to historical data. Correct syntax is required for it to work within the software used. To avoid errors during backtesting, however, it is also essential to look at the details. For example, it should be checked whether a so-called look-ahead bias exists. A widespread source of error is the use of price data or variables for generating trading signals that were not yet available at the time the signal was supposedly generated. A classic example is the use of the closing price for an intraday entry, even though the close is logically only fixed at the end of the trading day. This error is often overlooked, especially in trading strategies that use multiple time frames or subsequently revised data sets.
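A minimal sketch of how such a look-ahead error can be avoided in practice: the example below assumes daily bars with "open" and "close" columns in a pandas DataFrame and uses an illustrative moving-average rule. The decisive detail is the shift of the signal, so that a value derived from today's close is only acted on from the next bar onwards.

```python
# Minimal sketch of avoiding look-ahead bias with pandas (column names and rule are illustrative).
import pandas as pd

def moving_average_signal(prices: pd.DataFrame, window: int = 200) -> pd.Series:
    """prices must contain 'close' and 'open' columns indexed by date."""
    ma = prices["close"].rolling(window).mean()
    raw_signal = (prices["close"] > ma).astype(int)   # only known at the close of each bar
    return raw_signal.shift(1)                        # therefore act on it one bar later

def backtest_returns(prices: pd.DataFrame, signal: pd.Series) -> pd.Series:
    # The lagged signal is combined with open-to-open returns starting at the next open,
    # so no trade ever uses information that was not yet available.
    open_to_open = prices["open"].pct_change().shift(-1)
    return (signal * open_to_open).dropna()
```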
Data quality is the be-all and end-all
In addition to correct programming code, impeccable quality of the underlying data is one of the central prerequisites for a precise backtest.
If corners are cut at this point, all further steps are pointless, as the result will be distorted trading signals and correspondingly useless performance figures. Many individual aspects of the data matter: first of all, the data must be correct and complete and cover a sufficiently long history containing different market phases. Depending on which instruments are to be tested and later traded, adjustments must be made or different data sets are required.

In the case of shares, for example, splits and dividend payments play an important role in calculating entry and exit signals and the resulting performance. The price of a stock that has undergone one or even several splits in recent decades is usually adjusted retrospectively for the entire history, so that the prices actually traded in the past are distorted. A remedy is therefore data sets containing "as traded" prices, which provide a clean basis for testing.

In futures contracts, on the other hand, price gaps after a contract change play a central role. To obtain correct results, all historical rollover gaps must be adjusted. Several methods have been established for this purpose (link to the study). The very frequently used backward adjustment removes the price jumps that exist at the time of the contract change due to a contango or backwardation structure.
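The following sketch illustrates the idea of backward adjustment under simplifying assumptions: the input is an already stitched, unadjusted continuous series plus a mapping of roll dates to the price gap (new contract minus expiring contract) measured on each roll date. All names and values are illustrative.

```python
# Minimal sketch of additive backward adjustment of a stitched futures series (illustrative).
import pandas as pd

def backward_adjust(series: pd.Series, roll_gaps: dict) -> pd.Series:
    """series: unadjusted continuous series; roll_gaps: {roll_date: new_price - old_price}."""
    adjusted = series.copy()
    for roll_date, gap in roll_gaps.items():
        # Shift all prices before the roll date by the gap so the roll jump disappears;
        # the most recent contract's prices stay untouched.
        adjusted[adjusted.index < roll_date] += gap
    return adjusted

# Toy example: the old contract settles at 101.5 and the new one at 103.5 on the roll date,
# so a gap of +2.0 is added to the entire earlier history.
idx = pd.to_datetime(["2024-03-01", "2024-03-04", "2024-03-05", "2024-03-06"])
prices = pd.Series([100.0, 101.0, 103.5, 104.0], index=idx)
print(backward_adjust(prices, {pd.Timestamp("2024-03-05"): 2.0}))
```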
Another stumbling block is backtests that show several entries and exits within one trading day but do not use intraday data. Since high-resolution data (e.g. tick data) is usually not available for long periods, this shortcoming is common. Because the sequence of the highest and lowest price "within" the daily candle (intrabar) is not known, caution is required when multiple entries and exits occur within the same bar. Due to the insufficient resolution of the data series, it cannot be determined with certainty whether, for example, the stop was triggered first or the price target was reached first. Algorithms often use simplified assumptions here, which, however, have little to do with actual market events.
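One conservative way to handle this ambiguity is sketched below: if both the stop and the profit target lie inside a daily bar's range, the worst case (stop hit first) is assumed for a long position. The data structure and function are illustrative, not part of any specific backtesting software.

```python
# Minimal sketch of a conservative intrabar fill rule on daily OHLC bars (long position).
from dataclasses import dataclass

@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float

def resolve_exit(bar: Bar, stop: float, target: float):
    """Exit logic for a long position during one daily bar."""
    stop_hit = bar.low <= stop
    target_hit = bar.high >= target
    if stop_hit and target_hit:
        # Sequence inside the bar is unknown -> assume the worst case (stop first).
        return stop, "stop (ambiguous bar, worst case assumed)"
    if stop_hit:
        return stop, "stop"
    if target_hit:
        return target, "target"
    return None, "position still open"

# Both levels lie inside the bar's range, so the conservative rule books the stop.
print(resolve_exit(Bar(open=100, high=106, low=97, close=104), stop=98, target=105))
```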
Pre-inclusion and survivorship bias
An often neglected aspect of data collection is the correct historical composition of stock indices.
If a trading strategy is to trade, for example, constituents of the S&P 500 Index, a fair benchmark comparison requires that all additions to and deletions from the index are represented in exactly the same form in the backtest. A cardinal error is to test with the current index members and thus to assume that these shares were part of the benchmark throughout the entire test period. An example:
The Google share was included in the index in 2006. At that time, the share price was almost USD 400.
If the backtest goes back further, for example to 2004, the Google share would be part of the simulation even before its actual inclusion in the index, at prices of around USD 100.
This so-called pre-inclusion bias significantly overstates the performance, so that the backtest loses its informative value.
Survivorship bias describes another phenomenon related to the composition of equity indices: stocks that are no longer part of the underlying index today drop out of the backtest if the historical composition is not taken into account. Those shares usually show weak performance (or even insolvency) before they are removed from the index. It is precisely this circumstance that also causes the backtest performance to be overstated, because the weak stocks are never considered in the backtest. This effect is particularly evident in the US technology index Nasdaq, where the majority of the index members from 2000 are no longer part of the index today. Both effects lead to an overstatement of the backtest performance when unsuitable data is used. To eliminate this source of error, a data history is required that takes into account all historical index changes and also contains the data series of delisted stocks.
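A minimal sketch of how such a point-in-time universe can be represented: each stock carries its historical membership intervals, and the tradable universe for any given date is derived from them. The symbols and dates are illustrative (the GOOGL entry mirrors the example above, "OLDTECH" stands in for an arbitrary delisted name).

```python
# Minimal sketch of a point-in-time index universe (illustrative data).
import datetime as dt

# symbol -> list of (date_added, date_removed) membership intervals; None = still a member.
memberships = {
    "GOOGL": [(dt.date(2006, 3, 31), None)],
    "OLDTECH": [(dt.date(1998, 1, 2), dt.date(2002, 6, 30))],   # hypothetical delisted stock
}

def in_index(symbol: str, day: dt.date) -> bool:
    return any(added <= day and (removed is None or day < removed)
               for added, removed in memberships.get(symbol, []))

def tradable_universe(day: dt.date) -> list:
    return sorted(s for s in memberships if in_index(s, day))

print(tradable_universe(dt.date(2004, 6, 1)))   # no GOOGL yet -> avoids pre-inclusion bias
print(tradable_universe(dt.date(2000, 6, 1)))   # delisted name included -> avoids survivorship bias
```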
Consideration of realistic costs
Every transaction costs money in real trading and must be taken into account in the backtest at a realistic level. In addition to the pure transaction costs, which consist of commissions and the bid-ask spread, there are other cost components that can vary greatly depending on the trading instrument and region. Particularly for trading strategies with high turnover, costs can eat up a significant portion of the performance. The order types used in the respective trading strategy also play an important role. For example, breakouts that are tested and traded with stop orders may incur higher slippage than mean-reversion approaches that use limit orders. This can be addressed by taking the prevailing liquidity into account and by limiting the maximum position size depending on the market depth.
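A simple per-trade cost model along these lines might look like the following sketch; all parameter values (commission, spread, slippage per order type, participation cap) are purely illustrative and would have to be calibrated to the instrument actually traded.

```python
# Minimal sketch of a per-trade cost model; every parameter value is illustrative.
SLIPPAGE_BPS = {"stop": 5.0, "market": 2.0, "limit": 0.0}   # stop orders tend to slip more

def transaction_cost(price: float, shares: int, order_type: str,
                     commission_per_share: float = 0.005,
                     spread: float = 0.02) -> float:
    commission = commission_per_share * shares
    half_spread = 0.5 * spread * shares                        # crossing half the bid-ask spread
    slippage = price * shares * SLIPPAGE_BPS.get(order_type, 2.0) / 10_000
    return commission + half_spread + slippage

def capped_position(target_shares: int, avg_daily_volume: int,
                    max_participation: float = 0.01) -> int:
    # Cap the order at a small fraction of typical liquidity as a crude market-depth proxy.
    return min(target_shares, int(avg_daily_volume * max_participation))

size = capped_position(target_shares=50_000, avg_daily_volume=1_000_000)
print(size, round(transaction_cost(price=100.0, shares=size, order_type="stop"), 2))
```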
Curve Fitting - the greatest danger in backtesting
The main danger in backtesting is the (often unconscious) over-adaptation of the algorithm to the data series. This is known as curve fitting.
The more parameters a trading strategy contains, the more closely it can be fitted to the past. In extreme cases, all possible combinations of the strategy's parameters could be evaluated and the most profitable setting selected for real trading. The look into the past may appear promising in the simulation, but the algorithm lacks something crucial: the ability to consistently generate good results in the future - with real money. In other words: the strategy lacks robustness.
The following example on the left shows the in-sample period of an optimised trading strategy with a Sharpe Ratio of 1.59. Applying it to the out-of-sample period (right graph) shows impressively what the consequences of over-optimisation are: the Sharpe Ratio falls from 1.59 to -0.18. [1]
Sensitivity tests provide information on the robustness of the trading strategy
An important method for testing the robustness of a trading approach is to look at the performance figures of the trading strategy when the underlying parameters are varied. The idea behind it: if a trading strategy delivers positive results only for a few settings, it will only generate profits in the future if market conditions remain exactly the same. History shows that this is not the case - every market changes over time. Therefore, many different parameter combinations of the trading strategy are calculated. The more of these combinations produce positive results, the more stable the trading strategy is and the lower the risk of curve fitting.
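Sketched below is such a sensitivity test for a hypothetical two-parameter strategy: all combinations on a grid are evaluated and the share of combinations that remain profitable is reported. run_backtest is only a stand-in for the real backtest engine.

```python
# Minimal sketch of a parameter sensitivity test over a two-dimensional grid.
import itertools

def run_backtest(fast: int, slow: int) -> float:
    """Stand-in for a real backtest engine: returns a toy Sharpe ratio that peaks
    around (20, 100) and degrades smoothly for neighbouring settings."""
    return 1.5 - 0.02 * abs(fast - 20) - 0.005 * abs(slow - 100)

def sensitivity_grid(fast_values, slow_values, min_sharpe=0.5):
    results = {}
    for fast, slow in itertools.product(fast_values, slow_values):
        if fast >= slow:
            continue                                  # skip nonsensical combinations
        results[(fast, slow)] = run_backtest(fast, slow)
    robust_share = sum(s >= min_sharpe for s in results.values()) / len(results)
    return results, robust_share

results, robust_share = sensitivity_grid(range(5, 55, 5), range(50, 250, 25))
# The larger the share of combinations that stay profitable, the lower the curve-fitting risk.
print(f"{robust_share:.0%} of combinations reach the minimum Sharpe ratio")
```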
Another test that examines the robustness of a trading strategy is the multi-market test. Here, a trading strategy that is intended to trade the DAX future, for example, is also tested on other stock index futures that do not correlate with the DAX future. The more markets show solid results, the higher the robustness of the strategy can be rated, and vice versa.
Furthermore, the consistency of the performance plays an important role.
The performance and stability of the underlying trading strategy is therefore examined across different market phases. The data should definitely include up, down and sideways trend phases, as well as phases of high and low volatility. Evaluating the results in all market regimes shows how the strategy reacts to different market conditions and how consistent the results are. The more consistent they are and the less influence outliers have, the more robust the strategy is.
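One possible way to structure such a regime analysis is sketched below: daily strategy returns are grouped by a simple trend filter and a volatility filter derived from the benchmark. The regime definitions (200-day moving average, 20-day volatility, 2% band) are illustrative choices, not prescriptions.

```python
# Minimal sketch: group daily strategy returns by market regime and compare results.
import numpy as np
import pandas as pd

def regime_breakdown(strategy_returns: pd.Series, benchmark: pd.Series) -> pd.DataFrame:
    """Both series are assumed to be daily and to share the same DatetimeIndex."""
    ma = benchmark.rolling(200).mean()
    trend = np.where(benchmark > ma * 1.02, "up",
             np.where(benchmark < ma * 0.98, "down", "sideways"))
    vol = benchmark.pct_change().rolling(20).std()
    vol_regime = np.where(vol > vol.median(), "high vol", "low vol")
    frame = pd.DataFrame({"return": strategy_returns.values,
                          "trend": trend,
                          "volatility": vol_regime},
                         index=strategy_returns.index)
    # Mean, dispersion and number of observations per market regime.
    return frame.groupby(["trend", "volatility"])["return"].agg(["mean", "std", "count"])
```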
The Out-of-Sample Test - How does the strategy perform on "unknown territory"?
The final and decisive test for the quality and survivability of a trading strategy is the Out-of-Sample Test. This involves splitting the available data history into one data set for training and optimizing the strategy (In-Sample, IS for short) and another that is subsequently used to validate the previously obtained backtest results (Out-of-Sample, OOS for short).
The model is trained and adapted on data area A and then verified on data area B - i.e. on unknown data terrain.
This is the only way to check whether the tested strategy is robust and can therefore be used for real trading, or whether it is just the result of an over-adjustment to the noise of the historical data series.
If this process is repeated several times, it is referred to as walk-forward analysis (WFA). Here, the optimized parameters from the in-sample phase are used for signal generation during the subsequent out-of-sample phase, thereby ensuring periodic (re-)synchronization between model and data set. With the help of the walk-forward method, the behavior of the trading strategy can be observed both during the development/training phase (IS) and during the verification phase (OOS). The walk-forward test thus provides information on how well the trading strategy copes with market changes in "real life" (i.e. without prior knowledge of the data series) and how well the adjustment succeeds. Comparing the performance of the respective IS and OOS phases allows a clear statement as to whether real-money trading should be considered or not.
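A minimal sketch of the rolling window mechanics, with optimise and evaluate as placeholders for the actual strategy code; the window lengths (roughly three years in-sample and one year out-of-sample on daily data) are arbitrary examples.

```python
# Minimal sketch of rolling walk-forward windows: optimise on the in-sample slice,
# then apply the chosen parameters to the following out-of-sample slice.
import pandas as pd

def walk_forward(data: pd.DataFrame, optimise, evaluate,
                 is_length: int = 750, oos_length: int = 250) -> list:
    oos_results = []
    start = 0
    while start + is_length + oos_length <= len(data):
        is_slice = data.iloc[start:start + is_length]
        oos_slice = data.iloc[start + is_length:start + is_length + oos_length]
        params = optimise(is_slice)                       # fitted only on in-sample data
        oos_results.append(evaluate(oos_slice, params))   # verified on unseen data
        start += oos_length                               # roll the window forward
    return oos_results

# Usage (conceptually): walk_forward(price_history, optimise=my_grid_search, evaluate=my_backtest)
```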
Monte Carlo - the royal road to validation
Scientific studies show that this approach provides the best possible estimate of future performance. The Monte Carlo method simulates a large number of data sets whose statistical properties correspond to those of real market data and which are overlaid with random noise. Unlike bootstrap resampling, for example, where the simulation is based on actually observed market data, the Monte Carlo method uses artificial data. For more information on Monte Carlo, please refer to our article "A factory for tactical algorithms".
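The following sketch illustrates the basic idea with a simple parametric simulation: many artificial return paths are drawn whose mean and volatility match the observed strategy returns, and the distribution of final equity and maximum drawdown across paths is examined. It is a toy example, not the procedure described in the referenced article.

```python
# Minimal sketch of Monte Carlo validation with synthetic return paths (illustrative).
import numpy as np

def monte_carlo_paths(returns, n_paths=1_000, seed=42):
    """Simulate return paths whose mean and volatility match the observed series."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(returns), np.std(returns, ddof=1)
    simulated = rng.normal(mu, sigma, size=(n_paths, len(returns)))   # artificial data + noise
    equity = np.cumprod(1.0 + simulated, axis=1)
    max_dd = (equity / np.maximum.accumulate(equity, axis=1) - 1.0).min(axis=1)
    return equity[:, -1], max_dd

# Toy input: five years of fictitious daily strategy returns.
observed = np.random.default_rng(0).normal(0.0005, 0.01, 1_250)
final_equity, drawdowns = monte_carlo_paths(observed)
print(f"5th percentile of final equity:  {np.percentile(final_equity, 5):.2f}")
print(f"5th percentile of max drawdown: {np.percentile(drawdowns, 5):.2%}")
```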
Conclusion
When used correctly, backtesting offers a number of advantages for asset managers and investors. The simulation of the trading strategy provides valuable information on the profitability and risk of the model - before a single euro is put at risk. Thanks to the clearly and unambiguously defined set of rules, the trading strategy can be implemented one-to-one in the market. Negative influences of emotions and behavioural patterns are reduced to a minimum. And the robustness of the strategy can be put to the test with the help of extensive test procedures.