Part 3: AI Challenges | Forward-Looking Bias

Author: Dimitrios Markonis, Data Science Manager

In parts 1 and 2 of our series AI Challenges and How We Safeguard Against Them, we discussed overfitting and the curse of dimensionality. This third part focuses on another challenge associated with using AI, namely “forward-looking bias,” and the steps we take to avoid it.

Using data from the future

Forward-looking bias—also known as look-ahead bias—is a common problem in data analysis. It occurs when future information is included in the data set used to make predictions. It’s like cheating on a test by looking at the answers before you start. In the case of AI, forward-looking bias occurs when future information is used to train the model, and the consequence is an overly optimistic estimate of the model’s performance.

How forward-looking bias impacts probability of technical and regulatory success (PTRS) assessments

Forward-looking bias can sneak into an AI model when working with time series data and analyzing past performance. Time series data is a sequence of data points collected at regular intervals and ordered in time, so that changes over time can be captured. Forward-looking bias can be introduced both during model training and during model evaluation.
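
One illustrative way the bias sneaks in during training is computing a feature over the whole series, so that each data point “sees” values that come after it. Here is a minimal sketch in Python, assuming pandas; the series and features are made up purely for illustration and are not part of our pipeline.

```python
import pandas as pd

# Toy daily sales series; the numbers are invented purely for illustration.
sales = pd.Series(
    [10, 12, 11, 15, 14, 18, 20],
    index=pd.date_range("2021-07-01", periods=7, freq="D"),
)

# Leaky feature: a centered rolling mean lets each day "see" the days after it.
leaky_feature = sales.rolling(window=3, center=True).mean()

# Safe feature: a trailing rolling mean uses only the current day and the past.
safe_feature = sales.rolling(window=3).mean()

print(pd.DataFrame({"sales": sales, "leaky": leaky_feature, "safe": safe_feature}))
```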

To avoid forward-looking bias, an AI model should only be fed historical data that corresponds to the information available at the time of the analysis. For example, if we want to predict the probability of technical and regulatory success (PTRS) for trials in the design stage, we cannot use training data that contain information that would only become available later, once the first trial outcomes are known.

Here is a concrete example. At Intelligencia AI, we perform studies for our customers that compare our AI-powered PTRS assessments with the historical benchmark for a drug. Sometimes, these studies look at the past, trying to answer the question, “What PTRS would the Intelligencia AI model have assigned to a drug program if we had done the analysis on July 31, 2021?”

To avoid forward-looking bias, our July 31, 2021, PTRS calculation cannot include any data generated on or after August 1, 2021. A glimpse into the future would give the model an unfair advantage and make its predictions appear more accurate than they could realistically have been at the time.
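
As a minimal sketch of what enforcing such a cutoff can look like in code (the table, column names, and dates below are illustrative assumptions, not our actual data model), assuming pandas:

```python
import pandas as pd

# Hypothetical program-level records; IDs, dates, and columns are illustrative only.
records = pd.DataFrame({
    "program_id": ["A", "B", "C", "D"],
    "event_date": pd.to_datetime(["2021-03-15", "2021-07-30", "2021-08-02", "2022-01-10"]),
})

# The analysis is "as of" July 31, 2021: only records generated on or before
# the cutoff may be used to train or score the model.
cutoff = pd.Timestamp("2021-07-31")
as_of_view = records[records["event_date"] <= cutoff]

print(as_of_view)  # rows dated August 1, 2021 or later are excluded
```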

How to avoid forward-looking bias

By taking decisive steps, we avoid accidentally sneaking information from the future into our models and ensure that our analyses always reflect the state of knowledge at the time of the prediction.

Here is a short overview of the three mission-critical approaches we use to avoid forward-looking bias.

  • Expert data curation helps avoid forward-looking bias by ensuring that the data used to train and test a model is representative and free from systematic errors, and in particular that it does not contain any information from the future. By carefully selecting, preprocessing and validating the data, we mitigate biases that could lead to inaccurate model predictions. Expert data curation requires domain expertise in both data science and drug development, and building that broad in-house expertise has been a focus of Intelligencia AI ever since we were founded.
  • Out-of-sample testing is another way of safeguarding against bias. It involves splitting the data into two sets at a defined point in time: the set before that point is used for model training, and the set after that point is used to validate that the model makes accurate predictions.

Here is a way to think about this approach. Imagine you want to predict how many people will shop on Amazon in December based on past years’ data. You use data from January to November to train the model. In the second step, you test the model using data from December that it has not seen before. If the model accurately predicts the December shopping trends, it’s a sign that it is a good model; the short code sketch after this list shows what such a time-based split looks like.

  • Prospective assessment helps safeguard against any forward-looking bias that could lead to inflated performance metrics. It involves evaluating a model’s performance by applying it to future data that did not yet exist when the model was built. This method ensures that the model can make accurate predictions in real-world scenarios.

In our shopping example, you would collect historical data from previous years, train the model with that data, use the model to make predictions for the coming holiday season and then wait until it is December again to collect the actual data. The last step—after December data collection is completed—is to compare the model’s predictions to the actual data collected. This comparison tells you how accurately your model can predict future shopping behavior.
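
To make the shopping example more tangible, here is a minimal sketch of the time-based split and the final comparison, with a simple scikit-learn linear model as a stand-in; all numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy monthly shopper counts, ordered by time; the numbers are made up.
months = np.arange(1, 13).reshape(-1, 1)              # January .. December
shoppers = np.array([100, 102, 101, 105, 110, 112,
                     115, 118, 121, 125, 140, 190])   # holiday spike in December

# Time-based split: January through November trains the model, December is
# held out as the "future" that the model is not allowed to see.
train_X, train_y = months[:11], shoppers[:11]
test_X, test_y = months[11:], shoppers[11:]

model = LinearRegression().fit(train_X, train_y)
prediction = model.predict(test_X)

# Only once the real December figure is in hand do we compare it to the forecast.
print(f"Predicted December shoppers: {prediction[0]:.0f}, actual: {test_y[0]}")
```

The essential point is that the split follows the timeline rather than a random shuffle, so nothing from December can leak into training, and the comparison happens only after the actual December data exists.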

In summary

Forward-looking bias is a serious issue in the pharmaceutical industry because it can lead to inaccurate predictions. Using data that was not available at the time of the analysis produces overconfident and overly optimistic results, which then impact drug development strategies, resource allocation, and investment decisions.

At Intelligencia AI, we ensure the accuracy and reliability of our predictions by applying proven, effective approaches to counter forward-looking bias. Our ASCO 2024 analysis is another example and validation of the accuracy of our PTRS assessments. Be sure to read the full blog series, with part 1 on overfitting and part 2 on the curse of dimensionality. Get in touch with us to learn more about our techniques and dive deeper.