Learnings from 6 LLM Projects in 6 Months
Sadly, None of This Was Written By ChatGPT
Jul 11, 2023
A lot happened in the first half of 2023, but in tech everyone went ChatGPT-crazy, and then LLM-crazy in general. Suddenly every other client conversation we have has morphed into some variant of "how can I use LLMs to make my app do...?" We've learned a lot in the last six months delivering generative AI-based software across a range of sectors, from healthcare to real estate to marketing and fintech. LinkedIn fills up with breathless stories about the wonders of LLMs by the minute, so I thought I'd share Fraction's experiences from the front lines of trying to get this new technology to solve real problems. This is not intended as a counterpoint, as machine learning capabilities continue to advance by the day, but hopefully our real-world experiences help you avoid some mistakes.
Big Picture Takeaways:
Build vs Buy: Do your MVP with the APIs. LLM leaderboards and rankings vary, but all of the highest performing models are still proprietary, and many lack callable APIs. Many of the best open-source models are currently only available on non-commercial licensing terms. While it may be technically possible to train a smaller ~10B parameter model to outperform the large general-purpose models on a specific use case, the cost-benefit of building your MVP this way just isn't worth it. We performed precisely this evaluation for a client and nudged them toward starting with OpenAI in order to establish what's possible first (even more applicable now that GPT-4 has entered GA release). Testing and iterating this way can provide a path toward eventually replacing the API with your own in-house model. There may be valid business reasons to go in house, but when you do, be prepared to step back before you go forward.
Find use cases where a 50-60% reduction in human effort is a big win. A lot of LLM output seems great on the surface, but is inaccurate (or simply banal) upon closer inspection. Can you identify a use case where generative AI output can reduce the time needed for a human to complete the job? Perhaps that's looking up helpful information automatically, perhaps that's generating a draft response email, or perhaps that's using vector-based search to help users chat with relevant documents.
Can you reliably predict when LLM output will be shaky? If AI can reliably handle 60% of the work that's great - but only if the human effort involved in checking the work doesn't waste the savings. In the healthcare space, a concrete example of this would be reading radiology images - if ML could do this reliably for a given type of scan, it could be put to use. But if the false negative or false positive rates are unpredictable, this forces a radiologist to re-read every scan, negating the benefit. Expect to test your prompts quite a bit, and also to play with temperature settings when you need predictable responses.
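One rough way to spot shaky output is to ask the same question several times and see how often the answers agree. Below is a minimal sketch of that idea, not a definitive test harness. The `llm` callable is a hypothetical stand-in for whatever API wrapper you use (with the temperature you've chosen), and `fake_llm` exists only so the example runs without a network call.

```python
import itertools
from collections import Counter
from typing import Callable


def agreement_rate(llm: Callable[[str], str], prompt: str, runs: int = 5) -> tuple[str, float]:
    """Call the model `runs` times and report the most common answer
    along with the fraction of runs that produced it."""
    answers = [llm(prompt).strip().lower() for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs


# Hypothetical stand-in for a real model call: it cycles through canned replies.
_canned = itertools.cycle(["Urgent", "urgent ", "routine", "urgent", "urgent"])


def fake_llm(prompt: str) -> str:
    return next(_canned)


answer, rate = agreement_rate(fake_llm, "Classify this email as urgent or routine.", runs=5)
print(answer, rate)
```

A low agreement rate on a class of inputs is a hint that those are the cases to route to a human, or that the prompt (or temperature) needs more work.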
Garbage in / garbage out still applies. We found we had to clean emails substantially before putting them into a vector database for "related" searches in order to get any value. In this particular case, that's because emails tend to contain a lot of garbage data (people's signatures and re-copied threads dilute the signal in the message). The good news: you can use AI to do the cleaning step too! But overall, don't forget that these models need clean input data.
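To make the email cleaning step concrete, here is a minimal sketch using plain heuristics rather than AI. The markers it looks for (a `--` signature delimiter, `>` quoted lines, an "On ... wrote:" header) are assumptions chosen for illustration; real mailboxes are messier and will need more rules.

```python
import re

# Illustrative heuristic: a line like "On Mon, Jul 3, 2023, Bob wrote:"
# usually introduces a re-copied thread.
QUOTE_HEADER = re.compile(r"^On .+ wrote:$")


def clean_email(body: str) -> str:
    """Keep only the new message text, dropping signatures and quoted threads."""
    kept = []
    for line in body.splitlines():
        stripped = line.rstrip()
        if stripped == "--" or QUOTE_HEADER.match(stripped):
            break  # everything below is a signature or an older message
        if stripped.startswith(">"):
            continue  # a quoted line copied from an earlier email
        kept.append(stripped)
    return "\n".join(kept).strip()


raw = """Hi team,

The lease for unit 4B is ready to sign.
> Can you send the lease?

--
Jane Doe
Acme Realty

On Mon, Jul 3, 2023, Bob wrote:
> Can you send the lease?
"""

cleaned = clean_email(raw)
print(cleaned)
```

Only the cleaned text would then be embedded and stored, so the signature and the repeated thread no longer dilute the "related" search.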
LLM Dos + Don'ts:
Do tons of prompt engineering! When the engineer says, "this is the best I could get it to respond" just remember that many engineers are not ideal writers. Invest the effort in iterative prompting, and do this manually via the chat UIs, before handing off to an engineer to build a system around it.
Do use AI to check its own work. For instance, if you want to generate a color scheme excluding brown, you can generate the hex codes, and then ask the AI whether there's any brown in the result. Iterative layering like this will improve responses, and is one of the approaches OpenAI used to improve performance in GPT-4.
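The color scheme example might look something like the sketch below: one call generates, a second call checks, and the loop retries if the check fails. The `llm` callable, the prompt wording, and the scripted `fake_llm` replies are all hypothetical, included only so the sketch runs on its own.

```python
from typing import Callable


def palette_without(llm: Callable[[str], str], banned: str, max_attempts: int = 3) -> list[str]:
    """Generate hex codes, then ask the model to verify none of them look `banned`."""
    for _ in range(max_attempts):
        raw = llm(
            "Give me three hex color codes for a brand palette, comma-separated. "
            f"Do not include any {banned}."
        )
        colors = [c.strip() for c in raw.split(",") if c.strip()]
        verdict = llm(
            f"Do any of these colors look {banned}? Answer yes or no. {', '.join(colors)}"
        )
        if verdict.strip().lower().startswith("no"):
            return colors  # the checking pass found nothing objectionable
    raise RuntimeError(f"No {banned}-free palette after {max_attempts} attempts")


# Scripted replies standing in for a real model: the first palette fails the check.
_script = iter([
    "#8B4513, #FFFFFF, #1E90FF",
    "Yes, #8B4513 is brown.",
    "#2E8B57, #FFFFFF, #1E90FF",
    "No.",
])


def fake_llm(prompt: str) -> str:
    return next(_script)


palette = palette_without(fake_llm, "brown")
print(palette)
```

Capping the attempts matters in practice, since each retry costs two more API calls.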
Don't expect an LLM to work without enough data (or any machine learning approach for that matter). If you don't have enough data, you don't have enough data. Consider implementing an initial model via an LLM (or via a deterministic model if the results are too poor) to ship an MVP, and then collect data by grading the responses of this initial algorithm. This training data can later be used to build a v2 model using RLHF (reinforcement learning from human feedback).
Don't expect or build your strategy around perfection. I'm repeating myself here, but too often I see founders waiting for perfection instead of shipping and iterating, and I've seen that repeat with LLM projects. It might be great if an LLM could do all of the work of a top notch designer or brand expert, but reducing the workload by 60% can increase the human's productivity by 2.5X. Take that win via a hybrid approach, rather than waiting for the perfect solution (which may end up being an unreachable asymptote).
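The 2.5X figure follows from simple arithmetic: if a task now takes 40% of the human time it used to, the same person can complete 1 / 0.4 = 2.5 times as many tasks in the same hours. A quick sketch, assuming all of the saved time goes back into the same kind of work:

```python
def productivity_multiplier(effort_reduction: float) -> float:
    """Tasks completed per hour after the reduction, relative to before."""
    if not 0 <= effort_reduction < 1:
        raise ValueError("effort_reduction must be in [0, 1)")
    return 1 / (1 - effort_reduction)


for reduction in (0.5, 0.6):
    print(f"{reduction:.0%} less effort -> {productivity_multiplier(reduction):.1f}x output")
```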
Thoughts for Developers / Implementation Considerations:
Chain Your Calls. Expect to use LLMs in an iterative manner to improve the quality of output. LangChain, Microsoft Semantic Kernel, and similar projects make this easier. The core idea: if you break a problem down into bite-sized chunks, LLM performance can improve remarkably.
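Stripped of any framework, chaining just means feeding the output of one focused prompt into the next. The sketch below shows the bare idea and is not the LangChain or Semantic Kernel API. The `llm` callable, the prompt templates, and `fake_llm` are hypothetical placeholders.

```python
from typing import Callable


def chain(llm: Callable[[str], str], steps: list[str], initial_input: str) -> str:
    """Run each prompt template in order, passing the previous output as {input}."""
    result = initial_input
    for template in steps:
        result = llm(template.format(input=result))
    return result


steps = [
    "Extract the key facts from this email:\n{input}",
    "Draft a short, polite reply based on these facts:\n{input}",
]

# Stand-in model that records each prompt it receives and returns a numbered reply.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"OUTPUT {len(calls)}"


final = chain(fake_llm, steps, "Hi, can we move the closing to Friday?")
print(final)
```

Each step stays small enough to test and tune on its own, which is most of the benefit.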
In many use cases traditional approaches still perform, and are an order of magnitude faster. Keyword search and Elasticsearch ain't dead yet... and neither are deterministic algorithms. Sometimes it's better to just use the LLM to put a pretty, chatty wrapper around a deterministic answer.
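As an illustration of the "chatty wrapper" pattern, the sketch below computes a mortgage payment with the standard fixed-rate amortization formula and only asks the model to phrase the result. The mortgage scenario, the `llm` callable, and the prompt wording are assumptions made for the example.

```python
from typing import Callable


def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed-rate amortization formula: deterministic, no LLM involved."""
    r = annual_rate / 12  # monthly interest rate
    n = years * 12        # number of monthly payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)


def chatty_answer(llm: Callable[[str], str], principal: float, annual_rate: float, years: int) -> str:
    """Compute the number in code, then let the model handle only the wording."""
    payment = monthly_payment(principal, annual_rate, years)
    prompt = (
        f"The exact monthly payment is ${payment:,.2f}. "
        "Write one friendly sentence telling the user this amount. "
        "Do not change or recompute the number."
    )
    return llm(prompt)


# Stand-in model that echoes its prompt, so the example runs offline.
reply = chatty_answer(lambda prompt: prompt, 300_000, 0.06, 30)
print(reply)
```

The model never does the math here, so a wrong number can only come from the model ignoring its instructions, which is easy to check by searching the reply for the computed figure.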
Performance matters, and the APIs can't always provide it. For some use cases you just can't wait on the APIs, or can't afford their (lack of) reliability. For these you'll have to invest in running and training a model locally, or in caching responses where it makes sense. Note that OpenAI and other API providers continue to improve capacity and performance, so this problem may prove ephemeral.
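Caching can be as simple as remembering the response for each exact prompt. Here is a minimal in-memory sketch, assuming identical prompts should get identical answers. The `CachedLLM` class and the `slow_llm` stand-in are hypothetical names, and a production version would likely need persistence and expiry.

```python
import hashlib
from typing import Callable


class CachedLLM:
    """Wrap a model call so that repeated prompts skip the slow API round trip."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self._llm = llm
        self._cache: dict[str, str] = {}
        self.misses = 0  # number of times the underlying model was actually called

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._llm(prompt)
        return self._cache[key]


def slow_llm(prompt: str) -> str:
    # Stand-in for a real, slow API call.
    return f"answer to: {prompt}"


cached = CachedLLM(slow_llm)
first = cached("What are your office hours?")
second = cached("What are your office hours?")
print(first == second, cached.misses)
```

This only helps when the same prompt recurs, so it suits FAQ-style lookups better than free-form chat.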
The technology in this space is changing daily, but it's helpful to ground yourself in some basic tech realities. AI and LLMs can perform when you've properly set the table for them. But expect to write a nontrivial amount of glue code (developers are nothing if not digital plumbers), and expect to spend even more time and effort properly defining (and getting feedback on) your use case. What good is all the intelligence in the world, applied to a problem poorly defined?
<Plug> If you've found this helpful, contact me to learn about engaging with us - our fractional resources are helping deliver production-ready LLM-based software! </Plug>