A fox, a hedgehog and an LLM – it sounds like the beginning of a joke, but this is in fact the introduction to a topic that is consuming the attention of Philip Tetlock, psychology professor, prediction methodology pioneer and author of the seminal work, Superforecasting: The Art and Science of Prediction. In our wide-ranging conversation, we explored his identification of canonical human behaviors – the foxes and the hedgehogs – the role of “the wisdom of the crowd” versus expert knowledge, and what defines “superforecasters”: are they super-intelligent, or just super-thinkers? Tetlock has spent a career delineating these groups and their defining attributes and prediction capabilities, in the hope that these findings will influence how we, as humans, make important economic and political decisions. Recently, he has extended this work to focus on how LLMs could assist, or even replace, humans when forecasting uncertain futures, and he has arrived at a clear conclusion: “It is absolutely crucial to integrate LLMs into almost all lines of inquiry,” says Tetlock.
But Tetlock’s prediction story starts decades ago, in the days of the Cold War: “I was a moderately conventional research psychologist who had some grounding in cognitive and social psychology. I had the sense that there were certain basic principles of human cognition that were relevant to these debates. In 1985, I was a young faculty member, I’d just gotten tenure at Berkeley and Gorbachev had just become the General Party Secretary of the Soviet Union. And there was a ferocious debate going on among national security people and politicians about U.S. relations with the Soviet Union. There were some who thought that Gorbachev was a real reformer, and there were others who thought that he was more of a ‘neo-Stalinist in Gucci garb,’ as one person put it. So, this is a situation where virtually nobody really gets it right, and virtually nobody admits that they’re wrong; everybody claims that they were right in some fundamental sense. So that’s what motivated me to get involved in forecasting.”
His first forays into the murky realm of prediction resulted in the creation of a “prediction tournament” in which a group of nearly 300 experts with knowledge across the political and economic spectrum were asked to provide predictions for more than 100 questions derived from current affairs, with forecast timescales anywhere from one to ten years out, resulting in more than 28,000 predictions over the period. And one of the most remarkable conclusions gave rise to one of his most famous quotes: that, on average, these experts were no more accurate than “dart-throwing chimpanzees,” i.e., random guessing. Tetlock is now less fond of that pithy quote despite its headline-grabbing popularity, as it led to the unfounded conclusion that all expert predictions over all timescales were worthless, when instead the reality was that there were two important distinctions to be made (with apologies to George Orwell):
1) Not all predictors are equal – some are more equal than others
2) Not all timescales are equal – the shorter the time period, the better the predictability
This initial project revealed critical clues that became the foundation of his future work, in particular, the identification of the aforementioned hedgehogs and foxes.
The hedgehog and fox monikers originally derive from an essay by the philosopher, Isaiah Berlin, based on an old Greek adage, “a fox knows many things, but a hedgehog knows one big thing,” which Berlin used to sort writers into two schools: one that views the world through a single lens or “Big Idea,” and the other that draws on a wide variety of experiences that cannot be attributed to a single school of thought or paradigm. This simple didactic framework was picked up by Tetlock and used to characterize different approaches that expert forecasters took to the task of prediction. One of his key findings was that experts frequently behaved as the proverbial “hedgehogs,” tied to one primary philosophy that dictated their thinking, and they did not actively consider other points of view. However, a subgroup also emerged who made predictions by stitching together diverse sources of information and views, rather than selecting them based on a primary prevailing principle. Moreover, these individuals engaged in symmetric “Bayesian updating” of forecasts, adjusting them in either direction in accordance with the evidence, whereas hedgehogs only updated their forecasts when the evidence supported their primary thesis. As Tetlock summarizes the results in Superforecasting, “the first [group] failed to do better than random guessing and, in their longer-range forecasts, even managed to lose to the chimp. The second group beat the chimp, though not by a wide margin…however modest their foresight was, they had some.”
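For readers who want to see the mechanics, here is a minimal sketch in Python of that kind of symmetric updating; the probabilities and likelihood ratios are illustrative numbers, not figures from Tetlock’s studies:

```python
def bayesian_update(prior: float, likelihood_ratio: float) -> float:
    """Revise a forecast probability in the light of new evidence.

    prior: current probability assigned to the event (between 0 and 1).
    likelihood_ratio: P(evidence | event) / P(evidence | no event).
    Returns the posterior probability, using Bayes' rule in odds form.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# A "fox" updates symmetrically: evidence can move the forecast either way.
print(bayesian_update(0.60, 2.0))   # supportive evidence -> ~0.75
print(bayesian_update(0.60, 0.5))   # contrary evidence   -> ~0.43
```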
Importantly, when evaluating the defining behaviors of the latter group, Tetlock found that “it wasn’t whether they had PhDs or access to classified information. Nor was it what they thought – whether they were liberals or conservatives, optimists or pessimists. The critical factor was how they thought.” In short, they were very similar to Berlin’s definition of a fox.
Tetlock is an adherent of the System 1 versus System 2 cognitive modes outlined by his friend and collaborator, Daniel Kahneman, which have become a bedrock framework in this series of interviews. One of the defining characteristics of the foxes was their successful navigation of the cognitive reflection test: “that’s where you view people’s first reactions to all the problems as wrong, your intuitive System 1 reaction is wrong. It’s a test, essentially, of your ability to exercise System 2 correction over incorrect System 1,” Tetlock explains.
But the relevance of Kahneman’s insights to an individual’s predictive skill goes even deeper. Specifically, one must exhibit the ability to master or overcome the following four mental heuristic “traps”:
The Availability Heuristic: substituting ease of recall for the frequency of actual occurrence
The Affect Heuristic: likes and dislikes dictate our beliefs more than rational analysis
The Substitution Heuristic: we replace a hard question with a simpler one that comes readily to mind, e.g., replacing a question about the likelihood of an accident or incident with one’s own experience of such events
The Narrative Heuristic: the plausibility of the story trumps the analysis of probability in the human mind
In essence, these System 1 tendencies can be ascribed to “the propulsive urge behind all human efforts to comprehend reality,” as Tetlock puts it. He goes on to identify the core issue, “the problem is that we move too fast from confusion and uncertainty…to a clear and confident conclusion.” This behavior clearly served us well in our ancestral past when we were evading lions in the grasslands of the savannah, but under less life-threatening circumstances, it would behoove us to override these tendencies and to more actively manage the transition from System 1 reactive response to System 2 reasoning and retrospection.
The key behavior Tetlock identifies is to be aware of the deceptive power of these intuitions by checking whether the default use of System 1 is valid for a given circumstance or question. For this, he references the Kahneman-Klein definition of expert or learnable systems, which again has featured as a key theme of this series: there needs to be a well-defined environment with the opportunity to acquire sufficient data to observe and analyze its essential behavior. He points out that the geopolitical space is one in which it is exceedingly difficult to make expert predictions, as forecasters get very delayed, erratic feedback and, as a result, it’s very hard for them to develop well-calibrated probability estimates. Conversely, in areas like sports, poker and bridge, short-range meteorology and some areas of medicine and business, which Tetlock calls “learning friendly environments,” forecasters get a lot of regular, rapid feedback on their judgments.
As Tetlock puts it, “Politics is a learning unfriendly environment, so people are going to be very good at spinning narratives that sound very persuasive and plausible, but those narratives aren’t going to translate very reliably into accurate forecasts.” And he agrees with Yuval Noah Harari’s observations about the power of narrative in binding communities, “narratives and rituals, a combination of the two, are really the glue that holds communities of co-believers together,” creating a strong affinity for these System 1 narratives that must be overcome, he argues.
Critically, foxes are proficient at not taking System 1’s outputs at face value; they invoke System 2 to consider alternative points of view and so-called “counterfactuals,” which challenge these seductive intuitions. Tetlock explains that “counterfactuals balance hindsight bias – the difficulty people have in remembering what they thought before. Counterfactual scenarios can reconnect us to our past states of ignorance…it’s good mental hygiene and it’s useful for debiasing.”
To round out the key observations from these first prediction groups, Tetlock found that another well-known phenomenon was also at work – the so-called “wisdom of the crowd.” The wisdom of the crowd was first conjectured by Aristotle (as an aside, it is continually humbling to see how prescient the ancient Greeks were about all aspects of the human cognitive condition and behavior) and popularized by the author, James Surowiecki, in his book The Wisdom of Crowds, in which it was shown that if certain conditions of independence of the viewpoints were met, the noise or uncertainty of each view was effectively cancelled out, resulting in an aggregate prediction that is far more accurate than that of any individual. There are a multitude of popular examples of this phenomenon, most notably the stock market, for which the market value determined by the aggregate view of the crowd of investors produces, on average, more accurate pricing of assets than any expert stock picker; or the jury system, for which the wisdom of a crowd of twelve is held to be superior to that of any single (expert) judge. At the more prosaic level, the crowd view of the number of beans in a jar is famously more accurate than any individual guess, as is the perspective contained on digital “crowd” forums such as Wikipedia, Reddit, StackOverflow and the like. The same behavior was also apparent in the prediction groups – the aggregate forecast across any of the groups, e.g., the hedgehogs or the foxes, was more accurate than that of any individual in each group. But, again, the foxes had an advantage as they each individually “internally aggregate” different perspectives, so the aggregate forecast for the group of foxes is effectively an “aggregate of aggregates,” which enhances the wisdom of the foxy crowd.
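A quick way to see why this aggregation works is a small simulation. The sketch below (in Python, assuming independent, unbiased guessers with Gaussian noise, which is an idealization rather than anything from Tetlock’s data) shows the crowd estimate landing far closer to the truth than the typical individual guess:

```python
import random

random.seed(42)
TRUE_VALUE = 1000  # e.g., the number of beans in a jar (illustrative)

# 500 independent guessers: each is unbiased but individually noisy.
guesses = [random.gauss(TRUE_VALUE, 300) for _ in range(500)]

crowd_estimate = sum(guesses) / len(guesses)
crowd_error = abs(crowd_estimate - TRUE_VALUE)
typical_individual_error = sum(abs(g - TRUE_VALUE) for g in guesses) / len(guesses)

print(f"Crowd error:              {crowd_error:.1f} beans")
print(f"Typical individual error: {typical_individual_error:.1f} beans")
```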
In 2011, this compelling set of findings motivated the recently formed U.S. IARPA agency to create a forecasting challenge in which five teams were invited to compete with a reference group of intelligence community analysts in an internal prediction market. Tetlock seized on this opportunity and created the Good Judgment Project (GJP), together with fellow University of Pennsylvania faculty, Barbara Mellers and Don Moore. They recruited 2,800 individuals who were interested in committing to work on the project for little reward (the payment was a $250 gift voucher for active participation) and who were able to pass some basic cognitive and bias tests. The GJP team was initially pitted against four other teams, including teams from MIT and the University of Michigan. Tetlock summarizes the results: “In year 1, GJP beat the official control group by 60%. In year 2, we beat the control group by 78%. GJP also beat its university-affiliated competitors, including the University of Michigan and MIT, by hefty margins, from 30% to 70%.”
A good summary of the overall competition and results can be found here and here. Notably, the wisdom of the crowd was clearly apparent: on average, teams were 23 percent more accurate than individuals. When the performance of the different teams was compared, it was found that the intelligence prediction markets beat the ordinary teams (comprised of a mixture of hedgehogs and lesser foxes) by about 20 percent, but teams of “superforecasters” who demonstrated the most foxlike behavior beat the prediction markets by 15 percent to 30 percent. The magnitude of this difference is remarkable, given the fact that the economic and political questions asked were of the type, “Will there be a violent incident in the South China Sea in 2013 that kills at least one person?” for which there were no exact reference cases or analytical frameworks, and so the lure of simple retrospective narratives is undeniable. Tetlock summarizes the accomplishment as follows: “Superforecasters could predict out 300 days with the same accuracy as ordinary predictors could predict 100 days.” He also draws an analogy to a standard eye-chart test: superforecasters could effectively “see” three more rows of a standard eight-row chart, which he describes as “life-changing.”
So far, our discussion has not considered the role of the intelligence of the forecasters. Tetlock agrees with Kahneman’s definition of intelligence as “the ability to find relevant information and apply attention,” and that this definition is “very appropriate to the forecasting world, where it’s really about skill at picking out signals in a noisy world.” He further decomposes intelligence into two components: “crystallized” and “fluid” intelligence. The former is the product of all one’s knowledge and experiences analyzed and aggregated into accumulated perspective or “wisdom.” The latter is more a description of raw problem-solving ability, which can be measured by tests such as Raven’s Progressive Matrices.
Tetlock opines that “both are useful in forecasting; both the raw problem-solving power of fluid intelligence and the capacity to sum up a wide range of associative networks of knowledge (crystallized intelligence) that bear on a problem.” He further observes that foxes apply more fluid intelligence than hedgehogs to their forecasting, as they exhibit a higher degree of curiosity and desire to seek out alternative explanations and viewpoints.
There are a number of other key attributes of superforecasters: they fall into Kahneman’s heuristic traps (see above) less often and, in particular, they favor accuracy over narrative coherence, focusing on how much data they can gather and how well that data can be interpreted. Finally, when placed together in teams, superforecasters shared best practices; as Tetlock puts it, “they were not just drawing on their own cognitive resources, they were drawing on the cognitive resources of other talented team members, so it just put them in another league altogether.”
Interestingly, it was also observed that superforecasters tend to become more open-minded over time, but this seemingly appealing attribute has an inherent risk in prediction scenarios, as Tetlock describes it, “narrative richness and probability are inversely related; the easier it is for imaginative, open-minded people to transport themselves into possible worlds, the more at risk they are of assigning too much probability to too many possibilities.”
In summary, in order to be a superforecaster, one must use fluid and crystallized intelligence and be open-minded to counterfactuals…but not too open. And one must resist seductive narratives as they typically hide strong ideological messages and biases.
But what does this all have to do with LLMs?
Clearly, one of the key superforecaster skills is being able to identify relevant (if not perfect) reference cases and alternative scenarios or counterfactuals. In addition, the GJP team found that two techniques, known as “Fermizing” and “Extremizing,” were also critical to achieving the best predictions. Fermizing is named after the Fermi problem methodology pioneered by the scientist Enrico Fermi, who taught his students to break complex problems into smaller constituent components and to perform what would commonly be called back-of-the-envelope, or order-of-magnitude, estimations on each component.
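To make the idea concrete, here is a toy Fermi decomposition in Python, using the classic piano-tuner question; both the question and every number in it are illustrative assumptions rather than anything drawn from the GJP’s work:

```python
# Toy Fermi decomposition: "How many piano tuners are there in Chicago?"
# Every number below is an order-of-magnitude guess, not a measured fact.
population           = 3_000_000   # people in the city
people_per_household = 2.5
households           = population / people_per_household
share_with_piano     = 0.05        # roughly 1 household in 20
pianos               = households * share_with_piano
tunings_per_piano    = 1           # per year
tunings_per_tuner    = 4 * 5 * 50  # 4 a day, 5 days a week, 50 weeks a year
tuners               = pianos * tunings_per_piano / tunings_per_tuner

print(round(tuners))  # order-of-magnitude answer: tens of tuners (~60 here)
```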
Extremizing is an algorithmic overlay built on top of the wisdom of the crowd. In essence, it recognizes that a crowd of forecasters, each of whom has access to different pieces of information, is not exactly equivalent to one “omniscient” individual (with the same analytical abilities) who has access to all of the information and therefore would exhibit more predictive certainty. So, to account for this effect, one should move from the simple median answer to a value that is more extreme in the same direction, in proportion to the diversity of the views represented. It was consistently found that this approach, applied to the teams of average forecasters, resulted in predictions that were more accurate and almost as good as those of the superforecasters.
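One common way to implement this, drawn from the broader forecast-aggregation literature rather than from the GJP’s specific algorithm, is a simple power transformation of the crowd probability; the exponent below is an illustrative choice:

```python
def extremize(crowd_prob: float, a: float = 2.5) -> float:
    """Push an aggregated crowd probability away from 0.5.

    crowd_prob: the crowd's aggregate probability for an event.
    a: extremizing exponent (> 1). In principle it should grow with the
       diversity of information spread across forecasters; the fixed
       value used here is purely illustrative.
    """
    p, q = crowd_prob, 1.0 - crowd_prob
    return p**a / (p**a + q**a)

print(extremize(0.70))  # ~0.89: mimics a better-informed single forecaster
print(extremize(0.50))  # stays at 0.50: no shared signal to amplify
```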
So, given all the above, it is reasonable to conjecture that LLMs could act to help in the following ways: i) to expose relevant and counterfactual scenarios that increase the diversity of views considered – effectively acting as extremizers; ii) to break problems into fermized chunks and find appropriate reference cases for each; and iii) to enhance crystallized-intelligence analyses by providing facile access to all sources of knowledge, complementing human fluid intelligence.
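As a purely hypothetical illustration of point ii), an LLM assistant might be prompted along the following lines; the prompt wording and workflow are illustrative assumptions, not something taken from Tetlock’s experiments:

```python
# Hypothetical prompt sketch for using an LLM as a "fermizing" assistant.
# The wording, question and workflow are illustrative assumptions only.
FERMIZE_PROMPT = """You are assisting a human forecaster.
Question: {question}

1. Break the question into 3-6 smaller sub-questions (a Fermi decomposition).
2. For each sub-question, suggest a reference class and a rough base rate.
3. List one counterfactual scenario that would materially change the estimate.
4. Combine the pieces into a single probability with a short rationale.
"""

question = ("Will there be a violent incident in the South China Sea "
            "in 2013 that kills at least one person?")
print(FERMIZE_PROMPT.format(question=question))
```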
Tetlock points out that “It’s very tempting as you get older and older, to keep falling back on the things you know and these over-learned, over-rehearsed routines, and that leads you to Max Planck, and his idea that science advances funeral by funeral – by the death of the opponents of new ideas.” In this context, the lack of adherence to any particular world view by LLMs can be advantageous to the investigation of future unknowns. Importantly, Tetlock argues that forecasting is also a better test of any LLM’s capabilities than standardized tests, for which the LLM performance is too often influenced by being trained on the test materials, or by being optimized in post-training to enhance performance in a specific test or set of tests.
Early work from a group at UC Berkeley showed that LLMs that were not prompted to apply superforecasting methodologies were inferior forecasters to the wisdom of the human crowd but, by fine-tuning the models appropriately, these LLMs could approach the human crowd forecasts across a wide variety of categories, particularly where human forecasts are less certain.
Tetlock extended this analysis of the role of LLMs by taking a slightly different tack: evaluating the effect of LLMs acting as assistants to human forecasters. The team designed two LLM assistants, one instructed to provide high-quality superforecasting advice by prompting it to follow the 10 rules of superforecasting, and the other designed to be perversely overconfident in its forecasts, neglecting known base rate statistics, thus providing “noisy” or strongly hedgehog-like forecasting advice. A third, less advanced model that did not know anything about prediction methodologies or offer any related advice was used as a control. The ~1000 pre-selected human participants then attempted to use these LLMs to answer a set of six forecasting questions, such as “How many refugees and migrants will arrive in Europe by sea in the Mediterranean in December (of the current year)?”
A key conclusion was that human forecasters interacting with either LLM assistant enhanced their prediction accuracy by as much as 28 percent compared to the control group, or by as much as 41 percent for the superforecasting assistant, if an outlier question was removed. The surprising results didn’t stop there. Tetlock and team found that LLM-assisted forecasting did not benefit less skilled forecasters more than the more skilled ones, or help more with harder questions, although the small question sample size may have impaired the ability to distinguish these effects. But the clear conclusion was that just the activity of interacting with a state-of-the-art LLM was helpful to all human participants under all circumstances. Tetlock therefore concluded that LLMs “can improve human reasoning even in contexts outside the model’s training data, pointing to the potential for true human-AI complementarity.”
Tetlock and team have since extended their analysis of LLMs with a second series of experiments in which they explored what he calls The Wisdom of the Silicon Crowd. In this work, a set of state-of-the-art LLMs were given scratchpad prompts describing the essential behaviors of superforecasters, and then the models were presented with a set of prediction questions being posed on the Metaculus forecasting platform. The model responses were then compared to those from a set of 925 human predictors answering the same questions. The experiment utilized a “silicon crowd” comprising a diversity of models of different sizes and geographic origins, with and without internet access. Remarkably, Tetlock and team found that the aggregate forecasting accuracy across the crowd of LLMs was indistinguishable from that of the human crowd. In addition, they found that when the LLMs were given the human crowd estimate as an additional input, their estimates improved markedly, providing the first evidence that humans can assist LLMs that, in turn, can further assist humans in a symbiotic relationship.
An interesting additional finding was that there was significant variation in model accuracy across different topic areas, with the models being less comparable to human forecasters in topic areas such as climate, politics and education, and more comparable for topics such as law, literature, economics and technology. This is arguably consistent with the fact that the latter are well-described in the written literature upon which the models are trained, which is less the case for the former topic areas for which there is considerably more ambiguity, complexity or diversity of data.
Finally, in his most recent work, Tetlock and collaborators are evaluating the performance of the state-of-the-art models and different crowds of human forecasters as part of a continuous forecasting platform they have created. The results are continuously updated and published on forecastbench.org. The current league table shows that a crowd of LLMs comes close to matching the average human crowd but is still 30 percent below the level of teams of superforecasters: the gold standard for forecasting.
So what does this mean for the future?
Tetlock sees the current state of LLMs as undoubtedly a promising new predictive tool: “LLMs are notoriously black box entities. But I’m going to make the inferential leap that what’s going on is similar to what goes on when we aggregate human forecasters, and that there are different forecasters that have different components of the signal that’s relevant, and when you blend them, you get something that’s more powerful than any of the individuals taken together.”
Furthermore, he believes that LLMs will become superforecasters in their own right (without requiring human oversight): “I am 80 percent confident this would happen in the next two to five years. I think [LLMs will] be able to mimic the versatility of superforecasters in a broad-based geopolitical tournament.” He also believes that LLMs may help superforecasters become ultra-forecasters – exhibiting capabilities beyond their current limits, “but it will be a domain-specific achievement. I’m not expecting it to be a system-wide kind of thing.”
But he also wonders to what extent these AI systems demonstrate real intelligence, contemplating that “LLMs are doing more than just summarizing – they are wonderful at getting the ‘gist’ of different points of view. The question is, can they do more than that? Can they create novel syntheses? Can they go beyond differentiating the points of view that already exist in the debate space, and integrate the clashing points of view and bring out the strengths of the strongest, and tamp down the weaknesses, and produce something that’s altogether novel and viable?” This goes to the heart of the question that underlies David Eagleman’s “Level 2” intelligence test of “conceptualizing”: the ability to propose a new model or medium that produces a new understanding or form of expression that is grounded in experienced reality.
Although the answer to this question is not yet clear, Tetlock sees a clear path for real progress to be made with AI, if we include “world models” inside AI systems, or as entities that those systems can invoke. He explains that this will allow us to “describe the world across all domains of interest.” He points out that, “You can predict things in the physical world because you have definitive models with certainty, but when mixed with the political or psychological worlds, there are no definitive models, so the number of futures is too large to be predictable.”
And he agrees with Yann LeCun’s vision of a future “Society of Machines,” saying “there probably won’t be one model that is expert in all domains. It’s much more likely there’ll be a whole host of models with expertise in a specific domain that we aggregate in some way, which mirrors progress in human history between different experts and styles of thinking – like foxes and hedgehogs.”
Turning to the more prosaic world of business, Tetlock observes that “you have different [needs for different] job functions. The CEO probably is required to be more foxlike; the CFO is required to be a hedgehog.” And this duality is important, “I don’t mean to relegate the hedgehogs to a purely peripheral role, because they’re absolutely crucial…the foxes are deriving a lot of their ideas from the hedgehogs. They steal ideas from hedgehogs.”
He illustrates this point with reference to progress in the scientific community, “hedgehogs represent a core scientific value. Any idiot can explain a lot with a lot; parsimony is a core scientific value. The capacity to integrate a wide range of facts within an elegant set of internally consistent explanatory principles is regarded as a very high achievement in science. You want to give Nobel Prizes for things like that. More foxlike thinking is regarded by many famous scientists as being somewhat second rate; the foxes are the guys who come up and clear up the anomalies, the complications, the messy stuff—the cleanup crew. It is interesting that the world seems to be structured in a somewhat perverse fashion that a style of thinking that works best in basic science may not work very well when it comes to applied forecasting in messy real-world events!”
So, at the end of a fascinating discussion that revealed a great deal about the messy business of predicting the future – something we as humans continuously have to do to survive and thrive in our complex world – the moral of the story seems to be that, in the future, a plethora of AI systems will behave as foxes and hedgehogs in different domains, based on different world models, and will assist their human hedgehog and fox counterparts to better predict future outcomes. And maybe this human-AI collective cooperation will help us all become super- or even ultra-forecasters so we can make more intelligent, informed decisions in all aspects of our lives: personal, professional and political.