Artificial intelligence has become increasingly prevalent as the emerging technology spreads across a wide range of professions and industries.
This trend has fueled enthusiasm around the potential applications of artificial intelligence and lifted the valuations of major companies like semiconductor firm Nvidia (NASDAQ:NVDA) and software giant Microsoft (NASDAQ:MSFT), which are considered to be at the forefront of the AI arms race. These gains, in turn, have supported the broader stock market even during recent periods of economic uncertainty.
But there are still doubts about the effectiveness of artificial intelligence and its potential to take over tasks traditionally performed by humans.
In a memo, Bernstein analysts sought to explore this question by challenging AI models to evaluate vast amounts of data and provide professional-level financial analysis.
Their test included both "horizontal," or multipurpose, large language models like ChatGPT from OpenAI or Grok from xAI and more specialized "vertical" large language models. The test was also divided into two phases: the first focused on basic skills such as creating graphs and extracting data, while the second explored whether the models could formulate their own opinions and make reliable judgments about management decisions.
Phase one found that many AI platforms struggled with terminology changes, such as the replacement of "employee costs" with "employee benefit expenses." And while the analysts noted that the models' prowess in creating graphs was "strong," data reliability remained a key issue, with only a few models able to retrieve numbers with reasonable accuracy.
In the second phase, the models showed "remarkable efficiency" in more cognitively complex tasks, such as extracting key investor concerns from management calls or creating a timeline of issues at a company. Some models even managed to conduct effective tone analysis of earnings calls, "highlighting instances where management appeared less confident or evasive," the analysts said.
"The results were astonishing, and given the response time to this vast amount of data, do we dare say it outperforms humans?" wrote the analysts.