October 3, 2023
A few weeks ago I updated our post on startup tech stacks. Today I want to delve a little further into the tech stacks that startups are using for LLM-based projects.
LangChain (a vendor in the space) and Sequoia (a top VC firm) have both posted excellent summaries of the emerging LLM tech stack. I'll build on their work here by adding what we've seen based on our own work in the field and on our conversations with early-stage startup CTOs.
There are a wide variety of use cases that companies are trying to tackle with generative AI and LLMs (large language models). Relative to "regular" applications, these use cases involve dealing with relatively unstructured inputs from users, and with providing relatively unstructured outputs as well. What do I mean by unstructured? In the pre-LLM world, it was quite difficult to build a good chatbot, because open-ended questions from users are hard for computers to parse and understand - they quite literally do not conform to any particular structure! Generative AI outputs are meant to resemble human output as well - while you can generate a CSV as output from an LLM, it's more likely that you are looking for a generated image or block of text that resembles something a human might create.
Large language models are used in these applications to pull understanding from human input and to generate output, but a lot of glue is needed in between. Since LLMs have limited context windows (a maximum input length), how do you analyze large inputs? How do you search an existing body of knowledge using the power of an LLM? And what about dealing with non-text inputs and outputs, whether audio (think speech), image, or even video? These needs necessitate a deeper stack of technologies.
Let's start by defining terms, as the space is evolving rapidly (see if you can guess which sentences in this section were written by an LLM! The rest was all me, for better or worse):
Large Language Models: Large Language Models (aka ChatGPT in the public imagination) are artificial neural networks that are trained on vast amounts of text data scraped from the internet. They can contain billions to trillions of weights and work by taking an input text and repeatedly predicting the next token or word. LLMs are used for a variety of natural language processing tasks such as question answering, text generation, and natural language understanding. My favorite explainer on LLMs is this one by Stephen Wolfram - it's detailed and drills down into how ChatGPT works without overwhelming you with math or code.
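As a mental model, "repeatedly predicting the next token" looks roughly like the loop below. This is a conceptual sketch only - `predict_next_token` is a hypothetical stand-in for a real model's forward pass:

```python
# Conceptual sketch of autoregressive generation - not real model code.
# `predict_next_token` is a hypothetical stand-in for a neural network's
# forward pass, which scores every possible next token given the sequence.
def generate(prompt_tokens, predict_next_token, max_tokens=50, stop_token="<eos>"):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next_token(tokens)  # choose the most likely next token
        if next_token == stop_token:
            break
        tokens.append(next_token)  # output so far becomes part of the next input
    return tokens
```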
LLM Orchestration Frameworks: LLM Orchestration Frameworks provide a way to manage and control large language models. They can help to simplify the development and deployment of LLM-based applications, and they can also help to improve the performance and reliability of these applications. Your mileage will vary by programming language, as some frameworks support Python first and foremost (as it's the language most closely associated with data science).
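To give a flavor of what "orchestration" looks like in practice, here's a minimal sketch against LangChain's 2023-era Python API (assumes `pip install langchain openai` and an OPENAI_API_KEY in the environment; the prompt is made up):

```python
# Minimal LangChain chain: a prompt template piped into an OpenAI chat model.
# API names reflect LangChain as of late 2023 and may have changed since.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
chain = LLMChain(llm=llm, prompt=prompt)

print(chain.run(text="LLM orchestration frameworks wrap model calls, prompts, and data sources behind a common interface."))
```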
Vector Databases: Vector Databases store data as high-dimensional vectors (aka embeddings) - mathematical representations that capture the semantic "meaning" of a block of text. Comparing vector embeddings allows efficient, fast lookup of nearest neighbors in the N-dimensional space, typically powered by k-nearest neighbor (k-NN) indexes. In English, this means that you can use vector embeddings to find related text, whether based on a user query or based on a comparison of documents.
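To make "nearest neighbor" concrete, here's a toy lookup in plain numpy. The 4-dimensional vectors are made up for illustration - real embeddings have hundreds or thousands of dimensions and come from an embedding model:

```python
# Toy nearest-neighbor search over embeddings using cosine similarity.
import numpy as np

docs = {
    "refund policy": np.array([0.90, 0.10, 0.00, 0.20]),
    "shipping times": np.array([0.10, 0.80, 0.30, 0.00]),
    "returning a product": np.array([0.85, 0.20, 0.10, 0.15]),
}
# Pretend embedding of the query "how do I get my money back?"
query = np.array([0.88, 0.15, 0.05, 0.18])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query; the top hit is the best match.
for name, vec in sorted(docs.items(), key=lambda kv: -cosine(query, kv[1])):
    print(f"{cosine(query, vec):.3f}  {name}")
```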
Image Generation: Image Generation is the task of creating new images, typically from a text prompt, using models trained on existing datasets of images. By now you've seen the astronauts riding unicorns, the world leaders imagined against all sorts of backdrops, and even short movie trailers - all entirely generated via ML-based algorithms leveraging some of the same underlying technology as LLMs. While these algorithms have struggled to render text within images, their capabilities in that regard are also improving fast.
OCR (Reading text from images): OCR stands for Optical Character Recognition - the ability to read text from images such as scanned documents or photographs. Modern OCR software uses machine learning and computer vision approaches to identify individual characters. If you've ever completed a reCAPTCHA (I know you have), you've helped train image recognition algorithms! While OCR has existed for decades, its accuracy has greatly improved in the LLM era, and text extracted from images can serve as an important input into LLM-based processes.
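If you want to experiment without a cloud vendor, the open-source Tesseract engine is one common route - a minimal sketch via the pytesseract wrapper (assumes the tesseract binary is installed plus `pip install pytesseract pillow`; the file name is a placeholder):

```python
# Extract text from a scanned image using Tesseract via pytesseract.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_invoice.png"))
print(text)  # raw extracted text, ready to feed into an LLM prompt
```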
Speech-to-Text (and vice versa): Speech-to-text conversion means that an app recognizes the words spoken by a person and converts the voice to written text. Text-to-speech conversion means that an app recognizes written text and converts it into spoken audio. As with OCR, converting speech to text provides valuable input for LLM-based workflows. Conversion of text back to spoken audio is progressing rapidly in terms of realism - these advances power calling engines now being used to both make and take phone calls from real humans!
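On the speech-to-text side, OpenAI's Whisper API makes transcription a one-call affair - a minimal sketch using the pre-1.0 Python SDK as of late 2023 (assumes OPENAI_API_KEY is set; the file name is a placeholder):

```python
# Transcribe an audio file with OpenAI's hosted Whisper model.
import openai

with open("meeting_recording.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file)

print(transcript["text"])  # recognized speech, ready for an LLM workflow
```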
LLMs - What Should I Use? For starters, here's the latest leaderboard on Large Language Model rankings. GPT-4 continues to be in the lead, although Claude by Anthropic has narrowed the gap. In terms of models that can be run privately, LLaMA 2 by Meta is not far behind the leaders either.
While the options are growing, in discussions with CTOs and in my own experience, the best place to start continues to be OpenAI. It's easy to stand up and get going, and chances are that your product issues lie elsewhere anyway! Once you get traction, you can then consider going in-house or moving elsewhere based on your needs. Startups focused on solving particular vertical-specific problems are best served by focusing on those issues first, powered by the most accessible model available. The list of models available on HuggingFace grows daily, but time spent training those models is better spent AFTER product-market fit, when you are just optimizing away the cost of a vendor like OpenAI. Startups focused on building core ML technology - if you're reading this blog post to learn something… err… find a new line of work.
LLM Orchestration Frameworks - Useful? LangChain, LlamaIndex, Semantic Kernel, and other frameworks promise to make using LLMs easy - except that it's already really easy!
Example Call via OpenAI APIs
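Here's roughly what a call looks like with the OpenAI Python SDK as it stood in late 2023 (pre-1.0 style; assumes OPENAI_API_KEY is set in your environment, and the prompt is just an example):

```python
# A complete LLM "integration" in a dozen lines: one chat completion call.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of vector databases in two sentences."},
    ],
)
print(response["choices"][0]["message"]["content"])
```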
Similarly, they make integrating with vector databases like Pinecone or Chroma easy - but that's pretty easy to do directly as well, as seen in this Python- or TypeScript-based example app.
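For illustration, here's a minimal sketch against Chroma's 2023-era Python client (assumes `pip install chromadb`; the documents and IDs are made up, and Chroma falls back to its default embedding function when you don't supply one):

```python
# Store a few documents in Chroma and run a semantic query against them.
import chromadb

client = chromadb.Client()  # in-memory instance; persistent options exist
collection = client.create_collection("support_docs")

collection.add(
    documents=["Refunds are processed within 5 business days.",
               "We ship worldwide from our Denver warehouse."],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["how do I get my money back?"], n_results=1)
print(results["documents"])  # best semantic match: the refunds document
```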
As a result, we're not actually seeing as much adoption as we might have expected. Frameworks do help structure code in two important ways: 1) so that future developers can quickly understand how an application works and update it, and 2) so that future changes, whether swapping out the LLM implementation or the vector database, can be made easily. While I don't recommend future-proofing as a high-value exercise for any startup, if you feel a particular sensitivity around the risk that developers might leave or that implementation choices might change, using a framework here could help.
Vector Databases: Some form of embedding storage and search is proving essential for LLM applications. Pinecone and Chroma are among the leading startups in this space, but it's worth noting that all the major cloud providers (AWS, GCP, Azure) offer a solution as well. Why use a vector database? Put simply, they are the key to semantic search. Want to search a large volume of your company's internal documents? Want to create a client-specific content search? What about a search of a website's contents?
These search use cases are powered by vector embeddings. You chop a document up into small, overlapping sections and send each of these to an embedding model, which returns a vector embedding. Store these vector embeddings in a vector database, and you can query efficiently for nearest matches. The embeddings encapsulate the model's understanding of the "meaning" of the text, enabling more accurate search across documents. Note that very short fragments of text result in lower quality search results, and traditional methods like Elasticsearch can outperform in use cases like these. While the examples above focused on text, vector databases can also be used for similarity search across images, video, and other forms of content - provided a good model for generating vectors from the content!
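Here's a rough sketch of that chunk-embed-store flow, using OpenAI's embedding endpoint (pre-1.0 SDK style as of late 2023; the chunk size, overlap, and file name are all illustrative and worth tuning for your documents):

```python
# Chunk a document with overlap, then embed each chunk for vector storage.
import openai

def chunk(text, size=500, overlap=100):
    # Slide a window across the text so neighboring chunks share context.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

document = open("handbook.txt").read()  # placeholder document
chunks = chunk(document)

resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
vectors = [item["embedding"] for item in resp["data"]]

# Each (chunk, vector) pair would now be upserted into your vector database;
# at query time, embed the user's question the same way and search for
# the nearest stored vectors.
```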
Image Generation and OCR: Unless you've deliberately avoided this topic until now, you're familiar with the work of OpenAI's DALL·E, currently in its third (substantially improved) version. Stable Diffusion (and DreamStudio by Stability AI) is an important competitor in this space which outperforms DALL·E 2 in many instances. Midjourney is another powerful image generation service which excels at more artistic imagery - but it's only available via Discord. You can interact with the Discord API to access Midjourney (certainly an odd - and perhaps deliberately difficult for businesses - way to expose access). On the OCR side of the equation, AWS, GCP, and Azure all provide ML-based OCR solutions as part of their AI services. Numerous other players exist, but the state of the art has progressed rapidly here, and you should get serviceable results from the tech giants - use whichever integrates most easily into your use case.
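Programmatic image generation is similarly simple - a minimal sketch against OpenAI's image endpoint (pre-1.0 SDK style; the prompt and size are just examples):

```python
# Generate a single image from a text prompt via OpenAI's image API.
import openai

resp = openai.Image.create(
    prompt="an astronaut riding a unicorn, watercolor style",
    n=1,
    size="1024x1024",
)
print(resp["data"][0]["url"])  # temporary URL to the generated image
```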
Speech-to-Text (and back): Turning recorded audio streams into text is key for powering LLM-based processes, and the reverse is needed in order to communicate back to users in a near-human voice. While Google's speech-to-text API provides industry-leading quality for real-time transcription (in over 100 languages), startups like Eleven Labs are leading the way on cloned-speech generation. Speech-to-text is a well-solved problem at this point, as the technology is available on every smartphone and Alexa device. Human speech generation is a newer frontier, and real-time performance still suffers, with audible lags in most demos I've heard. In short bursts you can fool a human listener, but "conversation" with such a bot comes off stilted pretty quickly - as of Oct 2023, but by the time you read this, the problem may have been solved!
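For a taste of cloned-speech generation, here's a sketch against Eleven Labs' REST API (endpoint shape per their public docs as of 2023; the voice ID and API key are placeholders from your own account):

```python
# Synthesize speech from text via the Eleven Labs text-to-speech endpoint.
import requests

VOICE_ID = "your-voice-id"  # placeholder: choose a voice in your account
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "your-api-key"},  # placeholder key
    json={"text": "Thanks for calling! How can I help you today?"},
)
with open("reply.mp3", "wb") as f:
    f.write(resp.content)  # MP3 bytes of the synthesized speech
```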
Successful LLM-based projects are more than just a thin interface on top of GPT-4. At the same time, they can very well be less than a custom-trained model built through the work of highly-paid ML specialists. By chaining together capabilities from an LLM, an image generator, an OCR library, and speech processing, you can build applications that were only dreams in 2022! Start from the growing ecosystem of API providers, and progress to HuggingFace open-source models when those no longer provide what you need (or once you're at the point of scaling). You can certainly build from scratch using PyTorch, Keras, TensorFlow, and other libraries - but I haven't discussed those in this article for a reason: if you're not directly in the ML industry, it's likely a waste of your time.