
CambioML’s Unified LLM Interface for Data Cleaning and Extraction

Simplifying Extraction and Cleanse for LLM Training with LLM-Based Data Enrichment Tools

ai generated image of data cleaning

The AI race has tech companies scrambling to develop the best-performing large language models (LLMs) for their users. One critical way to ensure model quality and performance is data cleaning.

Data cleaning is the process by which data scientists fix or remove faulty data in the dataset used to train a model. Faulty data includes records that are incorrectly formatted, corrupted, duplicated, or incomplete. Repairing or eliminating them is vital because they can lead to unreliable outcomes and algorithms.
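
As a rough illustration of what fixing or removing such records can look like, here is a minimal Python sketch. The `Record` fields and the cleaning rules are assumptions made for this example only; this is not CambioML's or Uniflow's code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Record:
    """A hypothetical training example; the field names are assumptions."""
    text: Optional[str]
    label: Optional[str]


def clean(records: Iterable[Record]) -> list[Record]:
    """Drop incomplete and duplicated records and normalize formatting."""
    seen: set[tuple[str, str]] = set()
    cleaned: list[Record] = []
    for rec in records:
        # Incomplete: a missing or blank field makes the example unusable.
        if not rec.text or not rec.text.strip():
            continue
        if not rec.label or not rec.label.strip():
            continue
        # Incorrectly formatted: collapse stray whitespace, lowercase the label.
        text = " ".join(rec.text.split())
        label = rec.label.strip().lower()
        # Duplicated: keep only the first occurrence of each normalized pair.
        key = (text, label)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(Record(text, label))
    return cleaned


raw = [
    Record("  Great   product ", "Positive"),
    Record("Great product", "positive"),  # duplicate once normalized
    Record("Arrived broken", None),       # incomplete: no label
    Record("", "negative"),               # incomplete: empty text
    Record("Arrived broken", "negative"),
]
result = clean(raw)
for rec in result:
    print(rec)
```

Of the five raw records, only two survive. Real pipelines need far more rules than this, which is part of why the work eats up so much time.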

However, existing methods of data cleaning are much too time-consuming, with data scientists having to spend most of their efforts on collecting, cleaning, and preparing data for training. CambioML, a YC-backed startup founded by Rachel Hu, aims to change this by creating LLM-based data enrichment tools that can accurately retrieve and transform data through Uniflow, the startup’s open-source Python library.

cambioML website homepage

Photo Courtesy of CambioML

Data Cleaning and Transforming

The performance of LLMs, which are machine learning models* trained to comprehend and generate human language text, depends on various factors, from the amount of noise that can distort patterns to biases that systematically push data away from its true value. Removing or, at the very least, reducing noise and bias requires processing and cleaning the data, and doing so helps ML models work fairly and accurately.

Given how significant a step data cleaning is, extraordinary measures are often taken to accomplish it, and the sheer volume of data that needs cleaning and processing demands a lot of manual work. According to CambioML’s team, data cleaning takes up over 50% of machine learning scientists’ time. At companies that handle pre-trained models, 50% of data scientists often spend most of their time building data-cleaning pipelines.

*Machine Learning Models: Computer programs used to recognize patterns and make predictions from previously unseen datasets.
Meme: “Rechecking extractions? Ain’t nobody got time for that.”

In the most basic sense, machine learning scientists clean data by extracting the information they need from documents and transforming it into a format suitable for training. Traditional extraction relies on PDF parsers*, which often struggle to extract text accurately from complex PDF layouts, forcing machine learning scientists to spend time rechecking results.

Once extraction is done, scientists move on to data transformation. To fine-tune LLMs with feedback-based learning* methods such as RLHF or RLAIF*, machine learning scientists have to build datasets that contain both a preferred answer and a rejected answer for every potential question. This takes a significant number of labor hours, as a pair of positive and negative responses has to be created for each question before the model can learn to respond accurately.
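
To make the idea of a response pair concrete, here is a minimal Python sketch of one way such a dataset could be stored. The field names `prompt`, `chosen`, and `rejected` and the JSON Lines layout are assumptions for illustration; the article does not say which format CambioML or Uniflow uses.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class PreferencePair:
    """One feedback-learning example; the field names are assumptions."""
    prompt: str
    chosen: str    # the preferred answer
    rejected: str  # the rejected answer


def validate(pair: PreferencePair) -> None:
    """Reject pairs that cannot teach the model a preference."""
    if not pair.prompt.strip():
        raise ValueError("prompt is empty")
    if not pair.chosen.strip() or not pair.rejected.strip():
        raise ValueError("both answers are required")
    if pair.chosen.strip() == pair.rejected.strip():
        raise ValueError("chosen and rejected answers are identical")


def to_jsonl(pairs: list[PreferencePair]) -> str:
    """Serialize validated pairs as one JSON object per line."""
    lines = []
    for pair in pairs:
        validate(pair)
        lines.append(json.dumps(asdict(pair)))
    return "\n".join(lines)


pairs = [
    PreferencePair(
        prompt="What does a PDF parser do?",
        chosen="It extracts data such as text from PDF documents.",
        rejected="It prints PDF documents.",
    ),
]
jsonl = to_jsonl(pairs)
print(jsonl)
```

Every question needs its own hand-checked pair like this one, which is where the labor hours pile up.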

*PDF Parsers: Software used to extract data from PDF documents.
*Feedback-based Learning: A method incorporating corrective feedback from humans or synthetic sources to improve the efficiency and performance of algorithms.
*RLHF and RLAIF: Reinforcement learning from human feedback and reinforcement learning from AI feedback.
Meme (approved crying cat): “When you just finished creating response pairs and your boss gives you another load to work on.”

Having experienced these pain points as machine learning scientists and engineers at Amazon Web Services (AWS) and Tesla, CambioML’s team built Uniflow to make extraction and transformation more efficient.

From Unstructured Data to Vital Insights

Uniflow’s extraction feature, available through either CambioML’s open-source Python library or its Pro API version*, lets users input multiple raw documents as PDFs, HTML files, or URLs and have Uniflow extract their contents accurately. Because Uniflow is LLM-agnostic*, users can perform the extraction with CambioML’s in-house-trained models or with existing models such as OpenAI’s GPT-3.5 and GPT-4, Google’s Gemini 1.5 and MultiModal, models on AWS Bedrock, Mistral-7B via Hugging Face, and more.

*API: Application programming interface, a software intermediary that allows two applications to communicate with each other.
*LLM-Agnostic: An approach where a platform is not dependent on any one specific large language model.
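
For readers curious about what LLM-agnostic can mean in code, here is a minimal Python sketch. The `UnifiedLLM` class and the two stand-in backends are hypothetical and exist only for this illustration; this is not Uniflow’s actual API, and a real backend would call a hosted or local model instead of a plain function.

```python
from __future__ import annotations

from typing import Callable

# In this sketch, a backend is any function from prompt text to response text.
Backend = Callable[[str], str]


class UnifiedLLM:
    """Hypothetical wrapper: one call signature for every registered model."""

    def __init__(self) -> None:
        self._backends: dict[str, Backend] = {}

    def register(self, name: str, backend: Backend) -> None:
        """Make a backend available under a short name."""
        self._backends[name] = backend

    def run(self, name: str, prompt: str) -> str:
        """Send the prompt to one named backend."""
        if name not in self._backends:
            raise KeyError(f"unknown backend: {name}")
        return self._backends[name](prompt)

    def compare(self, prompt: str) -> dict[str, str]:
        """Send the same prompt to every backend and collect the outputs."""
        return {name: backend(prompt) for name, backend in self._backends.items()}


# Stand-in backends so the sketch runs without network access or API keys.
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


def first_five_words(text: str) -> str:
    return " ".join(text.split()[:5])


llm = UnifiedLLM()
llm.register("stub-a", first_sentence)
llm.register("stub-b", first_five_words)
outputs = llm.compare("Uniflow extracts text. It also transforms it.")
for name, output in outputs.items():
    print(f"{name}: {output}")
```

Because the calling code never names a specific model, swapping one model for another only changes what gets registered.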

Photo Courtesy of CambioML

Once documents are extracted and turned into clean text, users can use Uniflow’s features to convert the text into the desired format, whether to fit a database schema*, build LLM datasets*, or customize prompts*. Uniflow’s LLM-agnostic interface also lets users compare data outputs across the many LLM options so they can pick the best data for training.
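
As one small illustration of fitting clean text to a database schema, here is a Python sketch that uses the standard library’s `sqlite3` module. The `invoices` table, the `Key: value` text layout, and the helper names are all assumptions invented for this example; Uniflow’s own transformation step is LLM-based and is not shown here.

```python
from __future__ import annotations

import sqlite3

# Hypothetical target schema; the table and column names are assumptions.
SCHEMA = "CREATE TABLE invoices (vendor TEXT NOT NULL, total REAL NOT NULL)"


def parse_fields(clean_text: str) -> dict[str, str]:
    """Turn 'Key: value' lines of already-extracted text into a dict."""
    fields: dict[str, str] = {}
    for line in clean_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            fields[key.strip().lower()] = value.strip()
    return fields


def insert_invoice(conn: sqlite3.Connection, clean_text: str) -> None:
    """Coerce the parsed fields to the schema's types and store one row."""
    fields = parse_fields(clean_text)
    conn.execute(
        "INSERT INTO invoices (vendor, total) VALUES (?, ?)",
        (fields["vendor"], float(fields["total"])),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
insert_invoice(conn, "Vendor: Acme Corp\nTotal: 1250.50\n")
rows = conn.execute("SELECT vendor, total FROM invoices").fetchall()
print(rows)
```

Text that lacks a required field raises an error here instead of silently producing a bad row, which is the point of fitting data to a schema.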

By creating this unified LLM interface*, CambioML believes that companies can efficiently extract the insights they need for the best research and development results for their machine learning models. With the data enrichment tools found in Uniflow, CambioML says companies and enterprises can reduce the time spent cleaning by up to 90% while discovering insights from 10 times more data, increasing revenue opportunities.

*Database Schema: A representation of data that shows how the data in a database should be stored logically.
*LLM Datasets: Massive collections of text and code used to train and fine-tune large language models.
*Customize Prompts: Building specific instructions or requests to guide a large language model towards a desired outcome.
*Unified LLM Interface: A single interface through which users can work with many different LLMs in the same way.

Source: CambioML’s GitHub

Meme & AI-Generated Picture

data scientists to noise and bias you shall not pass meme
ai generated image of data cleaning
