Six months of shipping generative AI across healthcare, real estate, and fintech taught us that the gap between a demo and a production system is wider than anyone admits.
In the first half of 2023, every client conversation shifted. “How can I use LLMs to make my app do…?” became the default opening. We spent six months trying to answer that question in production — across healthcare, real estate, marketing, and fintech — and came back with something more useful than hype: a set of patterns that actually work and a few that don’t.
The short answer for any MVP: use the APIs. LLM leaderboards shift constantly, but the highest-performing models are still proprietary, and many of the best open-source alternatives are only available under non-commercial licenses.
Prompt engineering: the practice of crafting, iterating, and systematically testing the inputs given to a large language model in order to elicit more accurate, reliable, and contextually appropriate outputs — without modifying the underlying model weights. Effective prompt engineering is a skill distinct from general software engineering and often has a larger impact on output quality than model selection.
We ran this evaluation for a client who wanted to train their own smaller model (~10B parameters) to outperform the hosted models on a specific use case. The cost-benefit didn’t hold up. Starting with OpenAI, even before GPT-4 hit general availability, let the team prove what was possible first, then iterate. If there are valid business reasons to go in-house (data privacy, cost at scale, latency), be prepared to step back before going forward. The API phase generates the training data and success criteria that make in-house training tractable.
This is particularly true for teams where the LLM component is one part of a larger system. The engineering cost of running your own inference infrastructure rarely pays off until you’re processing at significant volume — and by then, you’ll have learned exactly what behavior you need to optimize for.
The most useful framing we found: look for use cases where a 50–60% reduction in human effort is a meaningful win. A lot of LLM output seems impressive in demos but turns out to be inaccurate or banal on closer inspection. The question isn’t whether the AI can replace the human — it’s whether it can reduce the time the human needs to complete the job.
Examples that passed this test: generating draft response emails from a CRM context; using vector-based search to help users surface relevant documents from a private corpus; auto-populating first-pass reports from structured data with a human reviewing the final output. Examples that didn’t: use cases where any error forced a human to re-check the entire output, negating the time savings.
The critical variable is predicting error rate before you build. In healthcare, for instance, if ML can reliably handle a given type of radiology scan classification, it’s valuable. But if the false negative and false positive rates are unpredictable, a radiologist must re-read every scan — and you’ve added cost without adding value. The same logic applies to any high-stakes classification task. Run your prompts at volume, study the failure modes, and model whether the error rate makes the human review cheaper or more expensive than doing the work manually.
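The review-cost argument above can be written down as a small model. This is a minimal sketch with made-up numbers: the function names and the per-item minutes are assumptions for illustration, not figures from our projects. It assumes a person reviews every output and redoes the ones that are wrong.

```python
def assisted_minutes(n_items: int, review_min: float, error_rate: float, fix_min: float) -> float:
    """Expected human minutes when a person reviews every model output
    and redoes the share of outputs that turn out to be wrong."""
    return n_items * (review_min + error_rate * fix_min)


def manual_minutes(n_items: int, manual_min: float) -> float:
    """Human minutes when the work is done entirely by hand."""
    return n_items * manual_min


def break_even_error_rate(manual_min: float, review_min: float, fix_min: float) -> float:
    """Error rate at which assisted review costs the same as manual work."""
    return (manual_min - review_min) / fix_min


# Illustrative assumptions: 10 min by hand, 2 min to review, 12 min to fix an error.
for rate in (0.1, 0.7):
    assisted = assisted_minutes(1000, review_min=2, error_rate=rate, fix_min=12)
    manual = manual_minutes(1000, manual_min=10)
    print(f"error rate {rate:.0%}: assisted {assisted:.0f} min vs manual {manual:.0f} min")

print(f"break-even error rate: {break_even_error_rate(10, 2, 12):.0%}")
```

Note what happens when errors are unpredictable: the reviewer has to re-check everything, so `review_min` approaches `manual_min` and the break-even error rate drops to zero.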
One finding that surprised us: data cleaning matters as much as model selection. We had to strip email threads of signatures and re-copied content before loading them into a vector database — because that noise diluted the signal enough to make retrieval unreliable. The good news: you can use AI for the cleaning step too. But don’t treat data quality as someone else’s problem.
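A cleaning pass of this kind might look like the sketch below. The patterns are assumed heuristics for illustration (common sign-offs, `>` quoting, an "On ... wrote:" header); a real corpus will need patterns tuned to the mail clients that produced it.

```python
import re

# Assumed heuristics: lines that typically start a signature block.
SIGNATURE_START = re.compile(
    r"^(--\s*$|Sent from my |Best regards|Kind regards|Thanks,?\s*$)", re.IGNORECASE
)
# Assumed heuristic: the header most mail clients put above a quoted reply.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")


def clean_email_body(text: str) -> str:
    """Keep only the newly written part of an email body."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if QUOTE_HEADER.match(stripped) or SIGNATURE_START.match(stripped):
            break  # everything below is signature or re-copied thread
        if stripped.startswith(">"):
            continue  # inline quoted line
        kept.append(line.rstrip())
    return "\n".join(kept).strip()


raw = (
    "Hi Sam,\n\nThe lease renewal is due on Friday.\n\n"
    "Best regards,\nAlex\nAcme Realty\n\n"
    "On Mon, Jan 2, 2023 at 9:00 AM Sam <sam@example.com> wrote:\n"
    "> When is the renewal due?\n"
)
print(clean_email_body(raw))
```

Only the cleaned text would then be embedded and loaded into the vector database.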
For teams looking to understand which AI projects are worth pursuing at all, Fraction’s engineers can help you run a structured AI opportunity assessment for sales and revenue workflows before committing to a build.
The most consistent pattern we observed: prompt engineering is underinvested on almost every project. Engineers write a prompt, get something plausible, and move on. The output at launch is 60% of what it could be — and no one knows it.
The fix is simple but requires discipline: iterate on prompts manually through a chat UI before an engineer builds a system around them. Many engineers are not strong writers. Handing the prompting to someone who is — or spending dedicated time on iterative refinement — consistently produces better results than a technically skilled engineer writing prompts quickly.
One technique that works particularly well: use AI to check its own work. If you want to generate a color scheme that excludes brown, generate the hex codes, then ask the model whether the result contains any brown. Layered checks like this improve output quality, usually at the cost of one extra call per check. OpenAI has described a related idea in GPT-4’s development: the model is used to evaluate outputs, not just generate them once.
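The generate-then-check loop can be sketched as follows. The `llm` argument is a hypothetical callable that takes a prompt and returns text; it stands in for whatever client you use. The scripted stand-in exists only to show the control flow without a network call.

```python
from typing import Callable, List


def palette_without_brown(llm: Callable[[str], str], max_attempts: int = 3) -> str:
    """Generate a palette, then ask the model to check it; retry on failure."""
    generate = "Return five hex color codes for a website palette, comma-separated. No brown."
    for _ in range(max_attempts):
        palette = llm(generate)
        verdict = llm(
            "Does this palette contain any shade of brown? Answer YES or NO only.\n" + palette
        )
        if verdict.strip().upper().rstrip(".") == "NO":
            return palette
    raise ValueError("no palette passed the self-check")


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: returns canned replies in order."""
    queue = list(replies)
    return lambda prompt: queue.pop(0)


demo_llm = make_scripted_llm(["#8B4513, #FFFFFF", "YES", "#1E90FF, #FFFFFF", "NO"])
print(palette_without_brown(demo_llm))  # prints "#1E90FF, #FFFFFF"
```

Capping the attempts keeps a stubborn prompt from looping forever and makes the extra cost of the check predictable.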
Temperature settings also matter more than most guides suggest. When you need predictable, structured output — JSON, classifications, ordered lists — lower temperature settings reduce the variance in ways that make downstream processing far more reliable. Treat temperature as a first-class parameter, not an afterthought.
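One way to treat temperature as a first-class parameter is to name it and pair it with validation of the structured output. In this sketch, `complete` is a hypothetical callable taking a prompt and a temperature, and the labels and the value `0.0` are assumptions for illustration. A low temperature reduces variance but may not remove it entirely, which is why the output is still validated.

```python
import json
from typing import Callable

STRUCTURED_TEMPERATURE = 0.0  # assumed setting for JSON and classification calls
ALLOWED_LABELS = {"billing", "bug", "other"}  # hypothetical label set


def classify_ticket(complete: Callable[[str, float], str], text: str) -> str:
    """Ask for a JSON label at low temperature and validate the reply."""
    prompt = (
        'Classify the support ticket as "billing", "bug", or "other". '
        'Reply with JSON only, for example {"label": "bug"}.\n\n' + text
    )
    raw = complete(prompt, STRUCTURED_TEMPERATURE)
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict) or data.get("label") not in ALLOWED_LABELS:
        raise ValueError(f"unexpected model output: {raw!r}")
    return data["label"]
```

Downstream code then only ever sees one of the allowed labels or an exception it can handle.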
Before ChatGPT’s rise in late 2022, we observed that startups’ tech stacks leaned heavily toward full-stack JavaScript — around 65% in our experience, with Python at about 25%. By mid-2023, that had shifted sharply: nearly 50% of the startups we worked with were using Python as a key language.
The reason is structural, not fashionable. Python is the native language of machine learning. Most LLM tooling — including LangChain — releases Python support first, with the most depth and the fewest rough edges. LangChain’s Python documentation and APIs are substantially more complete than their JavaScript equivalents. New capabilities, new model integrations, and new orchestration patterns almost always appear in Python before they’re available anywhere else.
This doesn’t mean JavaScript teams are blocked. But it does mean they’ll hit Python-first limitations at the edges of new tooling, and they should plan for that. For teams choosing their stack for a new AI-heavy product, Python is now a serious consideration even if the rest of the product runs on Node. If you want to understand the full picture of what an LLM tech stack looks like across tooling, models, and infrastructure, that tradeoff is worth mapping explicitly before you start.
The single most impactful implementation pattern we used across projects: break problems into smaller pieces and chain calls. Rather than asking the model to do everything in one prompt, decompose the task into discrete, verifiable steps. Each step’s output becomes the next step’s input, and you can validate — and correct — at each stage before errors compound.
| Approach | When it works | Main risk |
|---|---|---|
| Single large prompt | Simple, well-defined tasks with limited output variability | Errors compound silently; hard to debug which step failed |
| Chained calls | Multi-step tasks where intermediate outputs can be validated | More API calls and latency; requires more upfront design |
| Deterministic + LLM hybrid | When the answer is knowable but needs a conversational wrapper | Requires clear boundaries between what the LLM and algorithm each handle |
LangChain and Microsoft Semantic Kernel are the primary tools for managing chained calls. They handle state, manage context windows, and make it easier to inject validation between steps. The core idea — break the problem into bite-sized chunks — is what delivers the quality improvement. The framework just makes it manageable to implement.
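The chaining idea itself does not depend on a framework. Below is a minimal sketch in plain Python rather than LangChain or Semantic Kernel: each step has a validator, and a failed step is retried before the chain gives up. The step functions are ordinary functions standing in for model calls, and all names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# (step name, function that produces output, function that validates it)
Step = Tuple[str, Callable[[str], str], Callable[[str], bool]]


def run_chain(steps: List[Step], initial: str, retries: int = 1) -> str:
    """Feed each step's validated output into the next step."""
    value = initial
    for name, run, is_valid in steps:
        for _ in range(retries + 1):
            candidate = run(value)
            if is_valid(candidate):
                value = candidate
                break
        else:  # no attempt passed validation
            raise RuntimeError(f"step {name!r} failed validation")
    return value


def extract_amount(text: str) -> str:
    match = re.search(r"\d[\d,]*\.?\d*", text)
    return match.group(0).replace(",", "") if match else ""


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


steps: List[Step] = [
    ("extract", extract_amount, is_number),
    ("phrase", lambda a: f"The invoice total is {float(a):.2f}.", lambda s: s.endswith(".")),
]
print(run_chain(steps, "Invoice total: 1,250.00 USD"))
```

Because the failure names the step, a bad output is traced to one stage instead of surfacing as a vague problem in the final answer.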
One pattern worth calling out specifically: in many use cases, traditional deterministic approaches still outperform LLMs and are orders of magnitude faster. Keyword search and Elasticsearch aren’t dead. Sometimes the right answer is to use the LLM to put a friendly, conversational wrapper around a deterministic result, rather than asking the LLM to derive the answer itself. Recognizing when that’s the right call is part of what makes an experienced AI engineer valuable.
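A minimal sketch of that split: a deterministic keyword search decides which documents match, and the model is asked only to phrase the result. The toy scoring function stands in for a real search engine, and `llm` is a hypothetical prompt-to-text callable.

```python
import re
from typing import Callable, Dict, List


def keyword_search(docs: Dict[str, str], query: str, top_k: int = 3) -> List[str]:
    """Rank documents by how many query words they contain (toy scoring)."""
    terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(re.findall(r"\w+", text.lower())))
        if overlap:
            scored.append((-overlap, doc_id))  # negative so best matches sort first
    return [doc_id for _, doc_id in sorted(scored)[:top_k]]


def answer(llm: Callable[[str], str], docs: Dict[str, str], query: str) -> str:
    """The search decides WHAT to return; the model only decides HOW to say it."""
    hits = keyword_search(docs, query)
    if not hits:
        return "No matching documents found."  # no model call needed
    prompt = (
        "Write one friendly sentence telling the user which documents match. "
        "Mention only the documents listed.\n"
        f"Query: {query}\nMatches: {', '.join(hits)}"
    )
    return llm(prompt)
```

The boundary is explicit: the model never sees the corpus, so it cannot invent a match the search did not return.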
Performance at scale is the other consideration most projects underweight early. API latency and reliability can be real blockers for user-facing features. Caching responses where appropriate, and designing around API unavailability, are production requirements — not post-launch concerns. For teams ready to move from proof of concept to production systems, understanding what production-grade LLM and RAG engineering actually requires is a useful starting point before scoping the build.
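Caching and designing around unavailability can be sketched as a thin wrapper around the API call. `call` is a hypothetical function that sends a prompt and returns text; the retry count and delays are assumed values. Exact-match caching like this only suits prompts where a repeated answer is acceptable.

```python
import time
from typing import Callable, Dict, Optional


class ResilientClient:
    """Cache successful replies, retry with backoff, and signal failure with None."""

    def __init__(
        self,
        call: Callable[[str], str],
        retries: int = 2,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._call = call
        self._retries = retries
        self._base_delay = base_delay
        self._sleep = sleep
        self._cache: Dict[str, str] = {}

    def complete(self, prompt: str) -> Optional[str]:
        if prompt in self._cache:
            return self._cache[prompt]  # no API call at all
        for attempt in range(self._retries + 1):
            try:
                result = self._call(prompt)
            except Exception:  # real code should catch the client library's specific errors
                if attempt < self._retries:
                    self._sleep(self._base_delay * 2 ** attempt)  # exponential backoff
                continue
            self._cache[prompt] = result
            return result
        return None  # caller shows a fallback instead of failing the whole feature
```

Returning `None` rather than raising pushes the "API is down" case into the calling code, where the product can decide what the user sees.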
Praveen Ghanta is a five-time founder and serial entrepreneur. He is the founder of DevHawk.ai, an AI-powered engineering management platform, and Fraction.work, which connects fast-growing companies with top fractional tech and growth marketing talent. Previously, he founded HiddenLevers, a risk analytics platform for wealth management that he bootstrapped from inception to acquisition by Orion Advisor Solutions in 2021, serving thousands of advisors and $600B in assets. He earlier founded SmartWorkGroups, acquired by Intralinks in 2000.