The field of forecasting is poised for a major revolution thanks to recent breakthroughs in artificial intelligence (AI) and large language models (LLMs).

The weatherman often gets it wrong. Now imagine a future where forecasts for anything from stock market trends to election results are not just accurate, but also timely and cost-effective. That future might be closer than we think, thanks to the relentless march of technology, and specifically to advances in artificial intelligence.

The recent study by a team from UC Berkeley, led by Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt, shines a spotlight on this very possibility, exploring how language models (LMs) can be harnessed to predict the future.

Approaching Human-Level Forecasting with Language Models
Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

The Study in a Nutshell

The crux of the research lies in leveraging language models to forecast future events with a precision that rivals, and sometimes surpasses, human forecasters. The researchers developed a system that integrates retrieval-augmented LMs to sift through vast amounts of data, generate predictions, and aggregate these into a final forecast. They tested this system against predictions from human experts across several forecasting platforms, with the results being nothing short of impressive.

Here's the exciting part: the forecasts aren't just numbers. These LMs can also provide explanations for their predictions! Imagine a world where AI helps us anticipate economic shifts, political events, or even natural disasters. Pretty cool, right?

Paper Summary

  • This work studies whether language models (LMs) can forecast events at the level of competitive human forecasters by developing a retrieval-augmented LM system.
  • The system automatically searches for relevant information, generates forecasts, and aggregates predictions.
  • A large dataset of questions from competitive forecasting platforms was collected to facilitate the study.
  • On a test set published after the knowledge cut-offs of the LMs, the end-to-end performance of the system was evaluated against the aggregates of human forecasts.
  • On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it.
  • The retrieval system generates search queries to retrieve relevant news articles, ranks the articles by relevance, and summarizes the top articles.
  • The reasoning system takes the summarized articles and the question, prompts LMs to generate forecasts, and aggregates the forecasts into a final prediction using a trimmed mean.
  • A self-supervised approach is used to fine-tune an LM to make accurate predictions and explanatory reasoning, by selecting outputs that outperform the crowd and using them to teach the model better reasoning methods.
  • Hyperparameter search is used to identify the system configuration that leads to the best end-to-end performance.
  • The work suggests that using LMs to automatically forecast the future could provide accurate predictions at scale and help inform decision making.

Methodology

The researchers developed a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. The system automates three key components:

  1. Retrieval - gathers relevant information from news sources using LMs for query expansion, relevance ranking, and summarization
  2. Reasoning - weighs available data and makes a forecast using prompted and fine-tuned LMs
  3. Aggregation - ensembles individual forecasts into an aggregated prediction
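For orientation, here is a minimal structural sketch of how these three stages fit together. The stage functions are passed in as parameters because they stand in for the components described in the sections that follow; none of the names here come from the paper's code.

```python
from typing import Callable
import statistics

def forecast_question(
    question: str,
    generate_queries: Callable[[str], list[str]],
    retrieve_articles: Callable[[list[str]], list[str]],
    rank_and_summarize: Callable[[str, list[str]], str],
    forecast_once: Callable[[str, str], float],
    n_samples: int = 6,
) -> float:
    """Skeleton of the pipeline: retrieval -> reasoning -> aggregation."""
    # 1. Retrieval: expand the question into search queries, fetch news articles,
    #    then rank and summarize the most relevant ones.
    queries = generate_queries(question)
    articles = retrieve_articles(queries)
    summaries = rank_and_summarize(question, articles)

    # 2. Reasoning: sample several probabilistic forecasts from the LM.
    probs = [forecast_once(question, summaries) for _ in range(n_samples)]

    # 3. Aggregation: a plain mean is shown here; the paper aggregates with a
    #    trimmed mean (see the aggregation discussion below).
    return statistics.fmean(probs)
```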

To train and evaluate the system, the researchers collected a large dataset of binary forecasting questions from 5 competitive platforms. The test set contains only questions published after the knowledge cut-offs of the LMs to prevent leakage.

Result: Beyond Human Accuracy

The optimized system, tuned via hyperparameter search and self-supervised fine-tuning, approaches the performance of aggregated human forecasts on the test set, as measured by Brier score. It is the first automated system reported to near the level of the human crowd, an aggregate that itself generally outperforms individual forecasters.

Aggregating the system's predictions with crowd forecasts also consistently outperforms either one individually.

This not only underscores the potential of LMs in forecasting tasks but also hints at a future where LMs could become indispensable tools for analysts across various domains.

The Forecasting Framework Used

The framework the researchers use for forecasting with language models (LMs) combines retrieval-augmented forecasting with self-supervised fine-tuning, with the aim of pushing LM predictions to a near-human, and occasionally better-than-human, level.

1. Retrieval-Augmented Forecasting System

At the heart of the system is a retrieval component that scours news sources for relevant information, which the LM then uses to make informed predictions. This mimics how human experts draw on a wide array of sources to reach a conclusion, albeit at a scale and speed unattainable by humans.

a. Search Query Generation

The system begins with the LM generating search queries based on the forecasting question at hand. This step involves understanding the question's context and relevant factors to formulate queries that can fetch pertinent historical data and news articles.
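As a rough illustration, the query-expansion step might look like the snippet below, using the OpenAI Python client as one possible backend. The prompt wording and model name are illustrative choices, not the paper's exact ones.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_queries(question: str, n: int = 5) -> list[str]:
    """Ask an LM to expand a forecasting question into search-engine queries."""
    prompt = (
        f"You will help forecast this question: {question}\n"
        f"Write {n} short search-engine queries that would surface news "
        "relevant to resolving it. Return one query per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-tuned chat model could be used here
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [q.strip("- ").strip() for q in lines if q.strip()]
```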

b. News Retrieval

Using the generated queries, the system retrieves articles from various news APIs. This step is crucial for gathering the most recent and relevant information that can impact the forecast.

💡
The research suggests that the crucial factor for accurate forecasting was the quality and relevance of the data retrieved during the initial stages of the process. This underscores the potential impact of external information on enhancing the predictive capabilities of language models beyond their pre-trained knowledge.

c. Relevance Filtering and Re-ranking

Not all retrieved articles are equally relevant. The LM then ranks these articles based on their relevance to the forecasting question, filtering out less pertinent information to ensure the reasoning process is based on high-quality data.
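One simple way to approximate this filtering step is to have the LM score each article's relevance on a small numeric scale and keep only the highest-scoring ones. The rating scale and prompt below are a hedged sketch, not the paper's actual rating prompt.

```python
from openai import OpenAI

client = OpenAI()

def rate_relevance(question: str, article_text: str) -> int:
    """Ask the LM for a rough 1-6 relevance rating of an article to the question."""
    prompt = (
        f"Question: {question}\n\nArticle:\n{article_text[:4000]}\n\n"
        "On a scale of 1 (irrelevant) to 6 (directly relevant), how useful is this "
        "article for forecasting the question? Reply with a single number."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[{"role": "user", "content": prompt}],
    )
    digits = [c for c in resp.choices[0].message.content if c.isdigit()]
    return int(digits[0]) if digits else 1

def top_k_articles(question: str, articles: list[str], k: int = 10) -> list[str]:
    """Keep the k articles the LM rates as most relevant."""
    return sorted(articles, key=lambda a: rate_relevance(question, a), reverse=True)[:k]
```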

d. Text Summarization

Given the vast amount of information available, the LM summarizes the top articles to distill the most critical details pertinent to the forecasting question. This summarized information serves as the foundation for generating forecasts.
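The summarization step can follow the same pattern: one LM call per top-ranked article, asking only for the details that bear on the question. Again, this is a sketch rather than the paper's prompt.

```python
from openai import OpenAI

client = OpenAI()

def summarize_article(question: str, article_text: str) -> str:
    """Condense an article down to the facts that bear on the forecasting question."""
    prompt = (
        "Summarize the following article in a few sentences, keeping only facts "
        f"relevant to forecasting this question: {question}\n\n{article_text[:6000]}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```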

2. Reasoning and Prediction Generation

a. Reasoning with Summarized Data

With the summarized information at hand, the LM engages in a reasoning process, contemplating different outcomes based on the data. This step often involves weighing various factors and considering potential scenarios that could affect the forecast.

b. Generating Predictions

Based on its reasoning, the LM generates probabilistic forecasts regarding the future event. These predictions are the LM's best estimates, given the data it has analyzed.
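In practice, the forecast has to be extracted from free-form text, since the LM's reasoning ends with a probability. A small parsing helper might look like the following, assuming the model has been instructed to end its answer with a line such as `Probability: 0.27`; that output format is an assumption for illustration, not the paper's specification.

```python
import re

def extract_probability(completion: str, default: float = 0.5) -> float:
    """Pull the final probability out of a scratchpad-style completion."""
    matches = re.findall(r"[Pp]robability:\s*([01](?:\.\d+)?)", completion)
    if not matches:
        return default
    p = float(matches[-1])
    # Clamp away hard 0/1 answers, which are heavily punished when wrong
    return min(max(p, 0.01), 0.99)
```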

3. Aggregation and Final Forecasting

a. Ensembling Predictions

Recognizing the value of diverse perspectives, the system aggregates multiple forecasts into a single prediction. This could involve averaging the predictions or employing more sophisticated ensembling techniques to combine insights from different models or forecasting rounds.
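The aggregation method named in the paper summary above is a trimmed mean; a minimal NumPy version is shown below. The trim fraction is illustrative, not the paper's exact setting.

```python
import numpy as np

def trimmed_mean(forecasts: list[float], trim_frac: float = 0.2) -> float:
    """Average the forecasts after dropping the most extreme values on each side."""
    x = np.sort(np.asarray(forecasts, dtype=float))
    k = int(len(x) * trim_frac)
    return float(x.mean()) if k == 0 else float(x[k:-k].mean())

# Example: six sampled forecasts for the same question; the extremes on both ends are dropped
print(trimmed_mean([0.62, 0.55, 0.70, 0.58, 0.95, 0.60]))  # 0.625
```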

b. Fine-tuning for Improved Accuracy

To enhance the LM's forecasting accuracy, the researchers apply a self-supervised fine-tuning approach: the model is trained on its own generated reasonings and predictions, keeping only the outputs that outperformed the human crowd. This fine-tuning helps the LM learn effective reasoning strategies and apply them to new forecasting questions.
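A hedged sketch of the data-selection idea: keep only the model's own reasoning/prediction pairs whose Brier score beats the crowd's on the same question, and use those as fine-tuning targets. The field names and record layout here are assumptions for illustration, not the paper's data format.

```python
def select_finetuning_examples(samples: list[dict]) -> list[dict]:
    """Keep self-generated outputs that outperformed the crowd on their question.

    Each sample is assumed to hold: 'question', 'reasoning', 'prediction' (0-1),
    'crowd_prediction' (0-1), and 'outcome' (0 or 1).
    """
    def brier(p: float, o: int) -> float:
        return (p - o) ** 2

    kept = []
    for s in samples:
        if brier(s["prediction"], s["outcome"]) < brier(s["crowd_prediction"], s["outcome"]):
            # Train the model to reproduce the reasoning and prediction that beat the crowd
            kept.append({"prompt": s["question"], "completion": s["reasoning"]})
    return kept
```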

Language Models and Performance (Brier Score)

The researchers evaluated a wide range of instruction-tuned language models to assess their out-of-the-box forecasting abilities. The table below lists the models used in the study, along with their baseline performance (i.e., without additional information retrieval or forecasting-specific fine-tuning). The metric is the Brier score, where a lower score indicates better forecasting performance.

| Model | Zero-shot Brier score | Scratchpad Brier score |
| --- | --- | --- |
| GPT-3.5-Turbo | 0.237 | 0.257 |
| GPT-3.5-Turbo-1106 | 0.274 | 0.261 |
| GPT-4 (GPT-4-0613) | 0.219 | 0.222 |
| GPT-4-1106-Preview | 0.208 | 0.209 |
| Llama-2-7B | 0.353 | 0.264 |
| Llama-2-13B | 0.226 | 0.268 |
| Mistral-7B-Instruct | 0.237 | 0.243 |
| Mixtral-8x7B-Instruct | 0.238 | 0.238 |
| Claude-2.1 | 0.220 | 0.215 |
| Gemini-Pro | 0.243 | 0.230 |

Key Observations

  • The GPT-4-1106-Preview model showed the best baseline performance among the evaluated models, indicating that it may have the most potential for forecasting tasks among those tested.
  • Llama-2-7B had significantly higher Brier scores in the zero-shot setup, suggesting it was less naturally suited to forecasting tasks compared to other models.
  • Claude-2.1 demonstrated relatively consistent performance between zero-shot and scratchpad approaches, with scratchpad prompting slightly improving its forecasting accuracy.
  • Across all models, the scratchpad approach generally did not dramatically improve performance over the zero-shot approach, suggesting that structured reasoning alone, without additional retrieved information, adds little to these models' baseline forecasting ability.

These performance metrics underscore the variability in natural forecasting ability across different LMs and highlight the importance of model selection and potential fine-tuning in developing a system for forecasting future events.

The Brier Score

The Brier Score is a measure used to evaluate the accuracy of probabilistic predictions. It applies across various domains, including weather forecasting, sports betting, and more recently, evaluating the performance of language models in forecasting tasks. Here’s a breakdown of how it works, what it means, and its limitations:

How It Works

The Brier Score is calculated as follows:

$$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} (f_i - o_i)^2$$

where:

  • $N$ is the number of predictions made,
  • $f_i$ is the forecasted probability for the $i^{th}$ event (the model's predicted probability that the event will occur),
  • $o_i$ is the actual outcome of the $i^{th}$ event, coded as 1 if the event occurs and 0 if it does not.

The score thus represents the mean squared difference between the predicted probabilities and the actual outcomes.
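In code, the score is just a mean squared error between predicted probabilities and 0/1 outcomes. The small example below also shows the 0.25 you get from always answering 0.5 on binary questions, a useful reference point for the model scores reported above.

```python
import numpy as np

def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

outcomes = [1, 0, 1, 0]
print(brier_score([0.9, 0.2, 0.7, 0.1], outcomes))  # 0.0375 -- sharp, well-placed forecasts
print(brier_score([0.5, 0.5, 0.5, 0.5], outcomes))  # 0.25   -- always answering 0.5
```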

What It Means

The Brier Score ranges from 0 to 1, where:

  • A score of 0 indicates perfect accuracy,
  • A score of 1 indicates the worst possible accuracy.

Lower scores are better, reflecting that the predicted probabilities are close to the actual outcomes. A good model should have a Brier Score significantly lower than the score achieved by random guessing or by always predicting the mean probability of the outcomes in the dataset.

Limitations

While the Brier Score is a useful and widely used measure, it has some limitations:

  1. Sensitivity to Imbalanced Datasets: In cases where the outcomes are heavily skewed towards one result (e.g., when one outcome is much more common than the other), the Brier Score may not fully capture the model's predictive skill. Models that naively predict the more frequent outcome could appear to perform well.
  2. Lack of Differentiation Between Types of Errors: The Brier Score treats all errors the same, regardless of the direction of the error (overestimation vs. underestimation) or the context in which the prediction is made. It doesn't distinguish between the cost of false positives and false negatives, which can be important in certain applications.
  3. Not Intuitive: While the Brier Score provides a quantitative measure of accuracy, it can be less intuitive to interpret compared to other metrics like accuracy, precision, and recall, especially for stakeholders not familiar with probabilistic forecasting.
  4. No Information on Calibration: The Brier Score evaluates the accuracy of the probabilities but does not provide information on the calibration of the model. A model can have a low Brier Score but still be poorly calibrated, meaning that the probabilities do not reflect true likelihoods. Calibration must be assessed separately, often using calibration plots or related metrics.

Despite these limitations, the Brier Score remains a valuable tool for evaluating the performance of probabilistic forecasts, offering a concise metric to compare models and assess improvements over time.

Interesting Sidenotes on the Scratchpad Approach

What is the Scratchpad Approach

The scratchpad approach is a method used to structure the interaction with language models (LMs), guiding them to process and generate responses in a more deliberate and reasoned manner. This method involves providing a set of instructions or prompts that lead the LM through a series of steps, encouraging it to "think aloud" or externalize its reasoning process as it arrives at a conclusion or forecast. The aim is to make the LM's thought process more transparent and to potentially improve the quality and reliability of its outputs by encouraging a more analytical approach to problem-solving.

In the context of the research paper on forecasting with language models, the scratchpad approach was used to guide the LMs in generating forecasts for future events. Specifically, the LM was prompted to:

  1. Provide reasons why the answer might be no. The LM lists considerations or evidence that would support a negative outcome for the forecasted event.
  2. Provide reasons why the answer might be yes. The LM enumerates factors or evidence supporting a positive outcome for the event.
  3. Aggregate considerations. The LM synthesizes the reasons for and against the forecasted outcome, weighing them to form a more balanced view.
  4. Output a forecast. Based on the aggregated considerations, the LM provides a probabilistic forecast (a number between 0 and 1) indicating the likelihood of the event occurring.

This approach was designed to mimic the process of human judgmental forecasting, where forecasters use a combination of evidence, domain knowledge, and logical reasoning to make predictions. By structuring the LM's "thinking" in this way, the researchers aimed to improve the LM's forecasting capability and make its reasoning process more interpretable and aligned with human forecasting practices.
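A prompt template in this spirit might look as follows. The wording is a paraphrase of the four steps above, not the paper's exact scratchpad prompt, and the placeholders are illustrative.

```python
SCRATCHPAD_TEMPLATE = """Question: {question}

Relevant article summaries:
{summaries}

Please answer in four steps:
1. Reasons why the answer might be no.
2. Reasons why the answer might be yes.
3. Aggregate these considerations, weighing them against each other.
4. Output your final answer as a single probability between 0 and 1,
   on its own line in the form "Probability: X.XX".
"""

# Fill the template with a (hypothetical) question and the retrieved summaries
prompt = SCRATCHPAD_TEMPLATE.format(
    question="Will <event> happen before <date>?",
    summaries="- summary of article 1\n- summary of article 2",
)
```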

My Thoughts on Why It Did Not Significantly Improve Predictions

The scratchpad approach did not deliver a significant improvement over the zero-shot approach, and a few plausible reasons come to mind: the quality and relevance of the retrieved data, the effectiveness of the prompting strategy itself, and the nature of forecasting tasks when applied to language models (LMs).

  1. Quality and Relevance of Retrieved Data: One potential reason why the scratchpad approach didn't markedly enhance performance could be attributed to the quality and relevance of the information retrieved through the retrieval-augmented process. If the retrieved data provided to the LM during the scratchpad prompting was highly relevant and comprehensive, it might have already contained sufficient information for making accurate forecasts. In such cases, the additional reasoning steps facilitated by the scratchpad method may not contribute significantly to the prediction accuracy. Essentially, if the retrieved information was all that was necessary for making predictions, the addition of structured reasoning might not leverage additional insights from the pre-trained knowledge of the LMs, especially if the forecasting questions were primarily dependent on recent events or specific information beyond the LMs' pre-training cut-off.
  2. Effectiveness of the Prompting Strategy: The scratchpad approach relies heavily on the quality of the prompts used to guide the LM in generating forecasts. It's possible that the prompts designed for the scratchpad approach did not effectively elicit the LMs' reasoning capabilities or guide the model to integrate its pre-trained knowledge with the retrieved data optimally. If the prompts were not adequately tailored to encourage deeper reasoning or if the models were not effectively utilizing their pre-trained knowledge in conjunction with the new information, this could limit the added value of the scratchpad method.
  3. Nature of Forecasting Tasks with LMs: Forecasting future events is inherently challenging, particularly for events where outcomes are highly uncertain or where the relevant information is not fully captured in the model's pre-training data. Language models, even when augmented with retrieval systems, primarily generate forecasts based on patterns and information they have been exposed to during training. For many forecasting tasks, especially those closely aligned with the models' knowledge cut-offs, the pre-trained knowledge might not add significantly to the information retrieved about the specific forecasting question. In such cases, the scratchpad's structured reasoning process may not significantly outperform a more direct prediction strategy that relies on recently retrieved information.

Implications for the Future

The results suggest that in the near future, LM-based systems may generate forecasts at the level of competitive human forecasters. This could enable automated, scalable forecasting to complement human predictions and inform decision making across governments, companies, and other institutions.

The work provides the largest, most recent forecasting dataset to date, and proposes novel retrieval and self-supervised fine-tuning approaches to optimize LM-based forecasting. This paves the way for further research into developing high-performing automated forecasting systems.

Aiding Human Decision-Making

The promise of LMs in forecasting opens up exciting possibilities for enhancing human decision-making. From policy formulation to investment strategies, the ability to accurately predict outcomes could become a game-changer, making processes more efficient and outcomes more reliable.

Democratizing Forecasting

By making accurate forecasting more accessible and cost-effective, LMs could democratize the process, enabling smaller organizations or even individuals to make data-driven decisions that were previously beyond their reach.

Continuous Improvement

As LMs continue to evolve, their forecasting abilities will likely improve, expanding their applicability to a broader range of questions and scenarios. This ongoing advancement could eventually lead to LMs becoming a standard tool in the forecaster's arsenal.

Challenges and Limitations

However, it's not all smooth sailing. The study also acknowledges the limitations of current LMs, particularly their dependence on historical data, which might not always be predictive of future events. Moreover, the system's efficacy varied across different types of questions, indicating room for improvement in how LMs handle nuanced or complex forecasting tasks.

Unanswered Questions

If you're like me, you're wondering how this framework can be implemented. But not so fast: building on this research into forecasting with language models (LMs), several critical questions need to be answered before you can implement, and potentially expand upon, such a system. These questions touch on the source of data, model selection, evaluation metrics, and practical application concerns.

  1. Where is the "crowd" data coming from?
    • Which platforms or sources will be used to gather human forecast aggregates?
    • How will the data be accessed (APIs, web scraping, partnerships)?
    • What criteria will be used to select and filter questions from these platforms?
  2. How will relevant information be retrieved and evaluated for its relevance to the forecasting questions?
    • What sources of information will be used for retrieval (news databases, scientific journals)?
    • How will the system ensure the retrieved information is current and accurately relevant to the specific forecasting question?
  3. Which language models will be used, and why?
    • Based on what criteria will specific LMs be chosen (knowledge cut-off dates, performance on similar tasks, computational resources)?
    • Will multiple LMs be evaluated and compared, or will a single model be used throughout?
  4. How will the language models be fine-tuned?
    • What data will be used for fine-tuning the models?
    • How will the fine-tuning process be designed to improve forecasting accuracy without introducing biases?
  5. What methodology will be used to aggregate individual LM forecasts into a final forecast?
    • Will the system employ a simple average, a weighted average based on confidence scores, or another method?
    • How will the system ensure that the aggregation method improves overall forecasting performance?
  6. How will the system's forecasting performance be evaluated?
    • What metrics (e.g., Brier score, accuracy) will be used to assess forecasting performance?
    • How will these metrics be calculated, and what benchmarks (e.g., human crowd performance) will be used for comparison?
  7. What are the ethical considerations and potential biases in using LMs for forecasting?
    • How will the system account for and mitigate potential biases in the data or model predictions?
    • What ethical guidelines will be followed to ensure the responsible use of forecasting predictions?
  8. How will the system be integrated into decision-making processes?
    • Who will be the users of the forecasting system (governments, businesses, research institutions)?
    • How will the forecasts be presented to users to inform decision-making effectively?
  9. How will the system handle forecasting questions with insufficient data or high uncertainty?
    • What strategies will be employed when available information is too sparse or ambiguous for confident forecasting?
    • How will the system communicate the level of uncertainty in its forecasts to users?
  10. What scalability and maintenance considerations must be addressed?
    • How will the system be scaled to handle a large volume of forecasting questions across various domains?
    • What processes will be put in place for regular maintenance and updates of the LM and retrieval system?

Answering these questions is crucial for anyone looking to build or improve upon a system for forecasting future events using language models. Each question highlights a key area of consideration in the design, implementation, and operation of such a system, ensuring its effectiveness, reliability, and ethical use.


The research by Halawi and his team at UC Berkeley marks a significant step forward in our understanding and utilization of language models for forecasting. While challenges remain, the potential benefits of incorporating LMs into forecasting processes are too significant to ignore. As we stand on the cusp of what could be a revolution in forecasting, it's clear that the future of forecasting is not just about predicting the weather but about unlocking the predictive power of AI to inform and guide decision-making across the spectrum of human endeavor.
