Unstructured, which preps sloppy data for LLM training, has raised $40 million at a $230 million valuation.
Brian Raymond is founder and CEO of Unstructured, an AI startup that makes disparate human-generated data ready to be used to train and improve AI models.
Artificial intelligence models require vast amounts of data to train and improve. And while there is ample data to address that need, it is often a mess of various formats — PDFs, HTML, Word docs, emails — that must be distilled down before being fed into those AI models.
That’s Unstructured’s sweet spot: “really messy, sloppy data,” as founder and CEO Brian Raymond describes it. The startup transforms more than 30 different file formats into one that a machine learning model can understand.
“We're focused on the ugly underside of AI that nobody wants to touch,” Raymond told Forbes. “Developers frickin’ hate this stuff.”
Unstructured said Thursday it has raised $40 million in a Series B round led by Menlo Ventures with participation from Databricks Ventures and NVentures, NVIDIA’s venture capital arm, among others. The new fundraise values the company at $230 million and brings its total capital to $65 million.
“The ability to chunk the data is actually kind of an art form in itself,” Tim Tully, a partner at Menlo Ventures, told Forbes, adding that Unstructured’s tool helped him build an AI application that could process board meeting data and present it to the firm’s LPs.
Unstructured says that some 50,000 organizations use its open-source software to prep their data for AI training. Developers, who must download the tool every time they need to update an AI model with fresh data, do so about a million times a month on average, Raymond said. The company uses its own mix of models to detect a document’s file type and what’s inside it, then routes the contents through the appropriate reformatting pipeline, which converts them into the JSON format preferred by most AI models.
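The detect, route and convert flow described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not Unstructured’s code or API: every function name here is hypothetical, and a plain file-extension lookup stands in for the models the company says it uses to identify file types.

```python
import json
from email import message_from_string
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def parse_html(raw):
    # Hypothetical HTML pipeline: keep the text, drop the tags.
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return extractor.parts


def parse_email(raw):
    # Hypothetical email pipeline: subject line plus a single-part body.
    msg = message_from_string(raw)
    body = "" if msg.is_multipart() else msg.get_payload()
    return [f"Subject: {msg['Subject']}", body.strip()]


def parse_text(raw):
    # Fallback pipeline: split plain text into paragraphs on blank lines.
    return [p.strip() for p in raw.split("\n\n") if p.strip()]


# Assumption: the extension decides the route. The article says the real
# product relies on models for this step, not a lookup table.
PIPELINES = {".html": parse_html, ".eml": parse_email, ".txt": parse_text}


def to_json(filename, raw):
    """Route one document to its pipeline and emit a JSON list of elements."""
    suffix = Path(filename).suffix.lower()
    kind = suffix if suffix in PIPELINES else ".txt"
    elements = [
        {"source": filename, "type": kind.lstrip("."), "text": text}
        for text in PIPELINES[kind](raw)
    ]
    return json.dumps(elements, indent=2)


if __name__ == "__main__":
    print(to_json("page.html", "<h1>Title</h1><p>Body text</p>"))
```

Whatever the input format, the output has the same shape: a list of records with a source, a type and a text field. That uniformity is the point of the routing step, since downstream code then only has to handle one format.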
Unstructured says it has about 1,000 paying customers, among them the U.S. military, which uses its tools to prepare classified data to train its own large language models, and Independent Health, a health insurance company training its AI on insurance policies.
Raymond, 38, a former CIA officer, founded Unstructured in July 2022 after working at enterprise AI company Primer AI. There, he said, he realized the need for a tool that cleaned up and readied troves of enterprise data for LLM training, a problem no one wanted to solve.
“Nobody is passionate about getting data ready, everyone's passionate about the models themselves,” he said. “Our vision is to connect human-generated data with foundation models.”
Source: Forbes