Copyright forbes

You can’t just scrape 100,000 investment bankers off the internet.
The AI spending spree continues apace with no end in sight. In earnings calls Wednesday, Meta, Google and Microsoft all said they would continue to hike their capital expenditures to build out AI data centers, on top of the $400 billion that major tech companies are on track to spend on AI infrastructure this year, according to Morgan Stanley estimates.
But so far, return on investment for AI use cases has been weak. A recent MIT study found that 95% of generative AI pilots at companies have failed to develop beyond the proof-of-concept stage. Leading AI labs like OpenAI are still having trouble getting their AI to do the ‘grunt work’ tasks of entry-level investment bankers. Generative AI seems to be heading towards its “trough of disillusionment” moment.
Chatbots like ChatGPT and Gemini can draft a perfect sonnet in a few seconds. They can code and write movie scripts. So why is AI still so far behind in actually being able to take over white collar work?
The answer: training data, specifically for specialized real-world workflows. It turns out that while there is an abundance of publicly available data for how to generate language or code, there is significantly less for how to manage an IPO or merger. That data is largely private.
For writing sonnets, models have been trained on an entire internet’s worth of text, which includes immense amounts of poetry and literature. This is data that AI companies were able to obtain for free by scraping the entirety of the internet.
On the other hand, for the real-world workflows that define an entry-level investment banker’s day, even things like specialized data entry and spreadsheet manipulation, there is almost no easily scrapable training data.
“Large language models do well when you collect a lot of data, and we don’t have nearly as much data for real-world tasks, meaning the AI hasn’t seen the examples it needs to master these specific skillsets,” said Robert Nishihara, co-founder of Anyscale, a company that provides AI software infrastructure.
In real-world, multi-step tasks, generative AI’s inherent lack of control is a critical flaw. Because the AI produces different responses each time and is prone to hallucination, even small errors quickly compound across multiple steps, causing the entire workflow to rapidly go off the rails.
Error reduction is why AI labs are doubling down on domain-specific data collection with experts, because each domain has its own set of corner cases that needs to be accounted for, said Lake Dai, founder of Sancus Ventures.
It’s becoming a big industry: Surge AI, a company whose entire business is providing AI labs with human-generated training data, reached $1 billion in recurring revenue last year. Experts can make over $100 an hour to generate the data to automate their own jobs.
That data can be people doing their office jobs for hours at a time, said Paco Guzmán, head of research at Handshake, a training data provider. For example, this could be an investment banker formatting a presentation in the right way, or a doctor entering patient notes into a health record system.
“There’s a huge demand for this type of data because AI model makers want to be a partner for every single professional and help them increase productivity, so once they’re done with the financial domain, there is medical, recruiting, and an endless amount of other domains,” he said.
But even if AI labs manage to collect thousands, even millions of examples of white collar work, will AIs be able to do our jobs for us wholesale?
Not necessarily, because current AI models still can’t learn like humans can, said Nishihara. “To do a job, you need to be able to learn on the job,” he said. “Humans can learn on the job and learn from mistakes and only one example, but today’s AI models can’t do that,” he said. “We won’t fully replace white collar work until they can do that.”