Deepset’s Open-Source Solution to Simplify Big Data Processing

PLUS: Voiceling

Mornin’ miners⛏️,

Happy Friday!

Welcome to "Digger Insights" - your daily 5-minute enlightenment on the most recent tech updates! In our enjoyable and easily digestible dispatches, we break down the latest tech trends, giving you a quick and comprehensive view.

Join us, and in just a few minutes each day, you'll gain an advantageous perspective on the ever-evolving tech landscape. Ready to dive in?

Let’s get to it!

Today’s Highlights:

  • ⚙️Deepset’s Open-Source Solution to Simplify Big Data Processing🖥️

  • 💻Voiceling🎙️

Startup deepset Develops Open-Source Platform for Simplified Data Processing

More and more companies across various sectors are using large language models (LLMs) to improve their everyday work efficiency, especially companies that deal with enormous amounts of data.

Even so, some companies have trouble getting the most out of LLMs. A Germany-based startup named deepset dedicates its resources to helping companies figure out how to apply LLMs to their business applications.

LLM Support

According to research conducted by the International Data Corporation (IDC) in collaboration with Seagate, the global datasphere will grow to 163 zettabytes by 2025. A single zettabyte is about a trillion gigabytes. In a world where data has become a critical part of nearly everyone’s life, this could mean greater business opportunities and better user experiences, especially in heavily data-driven parts of the world.

Photo Courtesy of IDC and Seagate
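
To get a feel for that scale, here is a quick back-of-the-envelope conversion. It assumes decimal (SI) units, where a gigabyte is 10^9 bytes and a zettabyte is 10^21 bytes; the function name is ours, purely for illustration.

```python
# Decimal (SI) storage units, assumed here for the conversion.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


# The 2025 forecast quoted above.
print(f"{zettabytes_to_gigabytes(163):,} GB")  # prints "163,000,000,000,000 GB"
```

In other words, 163 zettabytes works out to 163 trillion gigabytes.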

However, this also means businesses would have to deal with a gargantuan amount of data that they would have to search, summarize, and analyze. Working with data would become immensely difficult, even more than it is now. Having LLMs automate data processes would probably make the lives of data-focused workers much simpler or, at the very least, a little less overwhelming.

Still, as we learned in volume 3 of the Demystifying LLM Deep Dive series, building your own LLM to perform NLP* tasks is time-consuming and far too expensive for some. It is also difficult to figure out how to train a model, which pre-trained model to start from if one is needed, and so on.

This is where deepset comes in. Its open-source framework, Haystack, lets developers choose the components they need to build NLP projects and applications, all in one place.

*NLP: Natural language processing, a machine learning technology that gives computers the ability to interpret and comprehend human language.

Photo Courtesy of deepset

LLM Framework

Haystack provides crucial components like open-source LLMs, vector databases, and tools like file converters to assist businesses with all sorts of NLP processes. These tools can help with preprocessing, fine-tuning, and so on. Haystack also has a retrieval architecture that could become a monumental help if the world’s data really does reach 163 zettabytes.

Photo Courtesy of deepset

The startup’s retrieval architecture can retrieve relevant information from large datasets based on the user’s query and input. This system can find and rank documents, web pages, and other sources of data according to the information the user is searching for. This could save data workers a lot of time compared with performing searches and retrievals manually.

According to deepset co-founder and CEO Milos Rusic, the platform enables developers to build production-ready applications without having to be NLP or LLM researchers. The startup’s cloud service, deepset Cloud, also supports this idea.
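
To make the idea of "find and rank" concrete, here is a minimal sketch of query-based document ranking in plain Python. It is not Haystack’s API or deepset’s method. It is a classic TF-IDF and cosine similarity approach that we picked for illustration, and the names `tokenize` and `rank_documents` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return (index, score) pairs for the documents, best match first.

    Scores are TF-IDF weighted cosine similarities. Documents that share
    no terms with the query get a score of 0.0.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    def idf(term):
        # Smoothed inverse document frequency, always positive.
        return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0

    def vectorize(tokens):
        # Map each term to its term count multiplied by its IDF weight.
        counts = Counter(tokens)
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)

    query_vec = vectorize(tokenize(query))
    scores = [
        (index, cosine(query_vec, vectorize(tokens)))
        for index, tokens in enumerate(tokenized)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = [
    "Aircraft operation guidelines for pilots in the cockpit",
    "Funding round led by venture investors",
    "Legal regulations, templates, and precedents",
]
for index, score in rank_documents("cockpit guidelines", docs):
    print(f"{score:.3f}  {docs[index]}")
```

Production retrieval systems typically go further, for example by using embeddings and vector databases rather than raw word overlap, but the flow is the same: score every document against the query, then sort.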

History of Cloud

deepset was founded in 2018 by Rusic, CTO Malte Pietsch, and co-founder Timo Möller. From the very start, the startup focused on helping its customers perform NLP tasks, and it expanded its portfolio by launching deepset Cloud in 2022. The cloud service is model-agnostic, meaning it can be used with any LLM, and it is SOC 2-certified*, which indicates that it maintains a high level of information security.

*SOC 2: a compliance framework for service organizations that specifies how they should manage customer data, with standards based on security, availability, processing integrity, confidentiality, and privacy.

Photo Courtesy of deepset

deepset Cloud allows AI teams to build customizable, flexible LLM systems while keeping full ownership of all their data. It also supports developers in taking NLP applications from experimentation to production. Developers can experiment with different language models, observing and comparing them to see which best fits their application.

The startup currently works with enterprises in the UK, the US, and Europe. A legal publishing house named Manz used deepset Cloud to develop LLM-enabled products that find relevant regulations, templates, and precedents among millions of documents. The R&D team of aircraft maker Airbus used Haystack for an application that helps pilots find and use the most relevant aircraft operation guidelines whenever they need them, even when they’re already in the cockpit.

Photo Courtesy of Airbus

deepset recently raised $30 million to further develop Haystack and deepset Cloud by adding new capabilities and optimizing for virtual private cloud (VPC)* setups. The startup will also add capabilities to its cloud service to enable retrieval-augmented generation. Another of deepset’s priorities is to create a platform that will be viable for customers with heavy privacy constraints.

The funding round was led by Balderton Capital, with support from Google Ventures, Harpoon, System.One, and Lunar. deepset has raised a total of $46 million in capital.

*Virtual Private Cloud: an isolated private cloud hosted within a public cloud, securing virtual networking environments like IP addresses, subnets, and network gateways.

Favorite Product of the Day

Voiceling

There are definitely times when we find an interesting video in a language we don’t understand and end up not watching it, because reading or manually translating foreign-language subtitles gets too tiring. Dubbing exists, but a lot of the time, few language options are available.

What if you could have a tool that dubs and translates videos in ANY language with just one click? Meet Voiceling, an AI Chrome extension that integrates seamlessly with YouTube and supports more than 30 languages so you can enjoy global content in your native language!

Photo Courtesy of Voiceling

Voiceling has realistic voiceovers and can recognize gender, detect multiple speakers, and preserve background noise. This way, voices in videos are closely matched, with each speaker assigned a unique voice.

Download Voiceling’s Chrome extension here!

Meme & AI-Generated Picture

Job Postings

  • Cruise - Senior Security Software Engineer II - San Francisco, CA (Remote/Hybrid)

  • CrunchyRoll - Principal Product Designer, Service Monetization - San Francisco, CA (Remote/Hybrid)

  • Kunai - IAM (Okta) Engineer - San Francisco, CA (Remote)

  • Two Barrels LLC - Creative Director - Salt Lake City, UT (Remote/Hybrid)

Promote your product/service to Digger Insights’ Community

Advertise with Digger Insights. Digger Insights’ Miners are professionals and business owners with diverse industry backgrounds who are looking for interesting and helpful tools, products, services, jobs, events, apps, and books. Email us at [email protected]

We’d greatly appreciate your feedback. Send it to [email protected]