AI Tools

Learnings from 6 LLM Projects in 6 Months

Six months of shipping generative AI across healthcare, real estate, and fintech taught us that the gap between a demo and a production system is wider than anyone admits.

Praveen Ghanta, CEO, Hire Fraction · July 11, 2023 ·8 min read

LLMsprompt engineeringgenerative AILangChainAI production

What you’ll learn

Why starting with proprietary APIs beats training your own model — even for use cases where custom models eventually win
The specific threshold (50–60% effort reduction) that separates a useful LLM integration from one that just adds overhead
How to predict before deployment whether your LLM’s error rate will negate its productivity gains
Why nearly 50% of startups shifted to Python as a primary language in the first half of 2023 — and what that means for your stack
The exact prompt engineering practice that consistently improves LLM output quality without changing the underlying model

In the first half of 2023, every client conversation shifted. “How can I use LLMs to make my app do…?” became the default opening. We spent six months trying to answer that question in production — across healthcare, real estate, marketing, and fintech — and came back with something more useful than hype: a set of patterns that actually work and a few that don’t.

Should you build your own LLM or use the APIs?

The short answer for any MVP: use the APIs. LLM leaderboards shift constantly, but the highest-performing models are still proprietary, and many of the best open-source alternatives are only available under non-commercial licenses.

Definition

Prompt engineering: the practice of crafting, iterating, and systematically testing the inputs given to a large language model in order to elicit more accurate, reliable, and contextually appropriate outputs — without modifying the underlying model weights. Effective prompt engineering is a skill distinct from general software engineering and often has a larger impact on output quality than model selection.

We ran this evaluation for a client who wanted to train their own smaller model (~10B parameters) to outperform on a specific use case. The cost-benefit didn’t hold up. Starting with OpenAI — even before GPT-4 hit general availability — let the team prove what was possible first, then iterate. If there are valid business reasons to go in-house (data privacy, cost at scale, latency), be prepared to step back before going forward. The API phase generates the training data and success criteria that make in-house training tractable.

This is particularly true for teams where the LLM component is one part of a larger system. The engineering cost of running your own inference infrastructure rarely pays off until you’re processing at significant volume — and by then, you’ll have learned exactly what behavior you need to optimize for.

Building an AI feature and not sure where to start?

Get a scoped project plan with story-point estimates for your LLM integration — model selection, prompt pipeline, evaluation framework, and deployment. Free and instant.

Scope Your Project for Free

No call required. Takes a few minutes.

Which LLM use cases actually deliver value in production?

The most useful framing we found: look for use cases where a 50–60% reduction in human effort is a meaningful win. A lot of LLM output seems impressive in demos but turns out to be inaccurate or banal on closer inspection. The question isn’t whether the AI can replace the human — it’s whether it can reduce the time the human needs to complete the job.

Examples that passed this test: generating draft response emails from a CRM context; using vector-based search to help users surface relevant documents from a private corpus; auto-populating first-pass reports from structured data with a human reviewing the final output. Examples that didn’t: use cases where any error forced a human to re-check the entire output, negating the time savings.

The critical variable is predicting error rate before you build. In healthcare, for instance, if ML can reliably handle a given type of radiology scan classification, it’s valuable. But if the false negative and false positive rates are unpredictable, a radiologist must re-read every scan — and you’ve added cost without adding value. The same logic applies to any high-stakes classification task. Run your prompts at volume, study the failure modes, and model whether the error rate makes the human review cheaper or more expensive than doing the work manually.

One finding that surprised us: data cleaning matters as much as model selection. We had to strip email threads of signatures and re-copied content before loading them into a vector database — because that noise diluted the signal enough to make retrieval unreliable. The good news: you can use AI for the cleaning step too. But don’t treat data quality as someone else’s problem.

For teams looking to understand which AI projects are worth pursuing at all, Fraction’s engineers can help you run a structured AI opportunity assessment for sales and revenue workflows before committing to a build.

How do you do prompt engineering properly — not just good enough?

The most consistent pattern we observed: prompt engineering is underinvested on almost every project. Engineers write a prompt, get something plausible, and move on. The output at launch is 60% of what it could be — and no one knows it.

The fix is simple but requires discipline: iterate on prompts manually through a chat UI before an engineer builds a system around them. Many engineers are not strong writers. Handing the prompting to someone who is — or spending dedicated time on iterative refinement — consistently produces better results than a technically skilled engineer writing prompts quickly.

One technique that works particularly well: use AI to check its own work. If you want to generate a color scheme that excludes brown, generate the hex codes, then ask the model whether the result contains any brown. Layered checks like this improve output quality without adding latency in most cases. This iterative self-checking approach is also one of the techniques OpenAI used in GPT-4’s development — the model is used to evaluate and improve its own outputs, not just generate them once.

Temperature settings also matter more than most guides suggest. When you need predictable, structured output — JSON, classifications, ordered lists — lower temperature settings reduce the variance in ways that make downstream processing far more reliable. Treat temperature as a first-class parameter, not an afterthought.

Why is Python becoming the dominant language for LLM projects?

Before ChatGPT’s rise in late 2022, we observed that startups’ tech stacks leaned heavily toward full-stack JavaScript — around 65% in our experience, with Python at about 25%. By mid-2023, that had shifted sharply: nearly 50% of the startups we worked with were using Python as a key language.

The reason is structural, not fashionable. Python is the native language of machine learning. Most LLM tooling — including LangChain — releases Python support first, with the most depth and the fewest rough edges. LangChain’s Python documentation and APIs are substantially more complete than their JavaScript equivalents. New capabilities, new model integrations, and new orchestration patterns almost always appear in Python before they’re available anywhere else.

This doesn’t mean JavaScript teams are blocked. But it does mean they’ll hit Python-first limitations at the edges of new tooling, and they should plan for that. For teams choosing their stack for a new AI-heavy product, Python is now a serious consideration even if the rest of the product runs on Node. If you want to understand the full picture of what an LLM tech stack looks like across tooling, models, and infrastructure, that tradeoff is worth mapping explicitly before you start.

How does chaining LLM calls improve output quality?

The single most impactful implementation pattern we used across projects: break problems into smaller pieces and chain calls. Rather than asking the model to do everything in one prompt, decompose the task into discrete, verifiable steps. Each step’s output becomes the next step’s input, and you can validate — and correct — at each stage before errors compound.

Approach	When it works	Main risk
Single large prompt	Simple, well-defined tasks with limited output variability	Errors compound silently; hard to debug which step failed
Chained calls	Multi-step tasks where intermediate outputs can be validated	More API calls and latency; requires more upfront design
Deterministic + LLM hybrid	When the answer is knowable but needs a conversational wrapper	Requires clear boundaries between what the LLM and algorithm each handle

LangChain and Microsoft Semantic Kernel are the primary tools for managing chained calls. They handle state, manage context windows, and make it easier to inject validation between steps. The core idea — break the problem into bite-sized chunks — is what delivers the quality improvement. The framework just makes it manageable to implement.

One pattern worth calling out specifically: in many use cases, traditional deterministic approaches still outperform LLMs and are orders of magnitude faster. Keyword search and elastic search aren’t dead. Sometimes the right answer is to use the LLM to put a friendly, conversational wrapper around a deterministic result — rather than asking the LLM to derive the answer itself. Recognizing when that’s the right call is part of what makes an experienced AI engineer valuable.

Performance at scale is the other consideration most projects underweight early. API latency and reliability can be real blockers for user-facing features. Caching responses where appropriate, and designing around API unavailability, are production requirements — not post-launch concerns. For teams ready to move from proof of concept to production systems, understanding what production-grade LLM and RAG engineering actually requires is a useful starting point before scoping the build.

Frequently asked questions

Should you use proprietary LLM APIs or train your own model for an MVP?

For most MVPs, start with proprietary APIs like OpenAI. The highest-performing models are still proprietary, and the cost-benefit of training a custom 10B-parameter model rarely makes sense at the MVP stage. Use the API phase to prove what’s possible, then evaluate whether an in-house model is justified once you have real usage data and a working product.

What is a realistic target for how much LLMs can reduce human effort?

A 50–60% reduction in human effort is a meaningful and realistic target for well-scoped LLM use cases. That translates to roughly a 2–2.5x productivity gain for the human working alongside the AI. Waiting for LLMs to eliminate the human entirely is usually a mistake — the hybrid approach delivers real value now, while the technology continues to improve.

How important is prompt engineering compared to the underlying model choice?

Prompt engineering has an outsized impact on output quality and is frequently underinvested. Many engineers are not strong writers, so prompts get written quickly and left alone. Iterating manually through a chat UI before handing off to an engineer can dramatically improve results — and catching bad prompts early is far cheaper than rebuilding a system around them.

Why is Python becoming more important for LLM projects than JavaScript?

Python is the native language of machine learning. Most LLM tooling — including LangChain — releases Python support first and with the most depth. By mid-2023, Fraction observed nearly 50% of startups using Python as a key language, up sharply from the 25% baseline seen before ChatGPT’s rise. Teams building on JavaScript can still get far, but they’ll hit Python-first limitations at the edges of new tooling.

When does chaining LLM calls actually improve output quality?

Chaining improves quality when the problem can be broken into discrete, verifiable sub-steps. Rather than asking the model to do everything in one pass, chain calls let you validate intermediate outputs, use the LLM to check its own work, and correct errors before they compound. Tools like LangChain and Microsoft Semantic Kernel are built specifically to manage this kind of multi-step orchestration.

What kinds of LLM use cases are most likely to fail in production?

Use cases fail most often when the error rate is unpredictable and forces humans to re-check every output — negating the efficiency gain — or when the input data is too noisy for the model to extract reliable signal. Healthcare and high-stakes classification tasks are especially vulnerable. The fix is identifying failure modes before deployment and building systems that surface low-confidence outputs for human review rather than treating all outputs as equally reliable.

Praveen Ghanta

CEO, Hire Fraction

Praveen Ghanta is a five-time founder and serial entrepreneur. He is the founder of DevHawk.ai, an AI-powered engineering management platform, and Fraction.work, which connects fast-growing companies with top fractional tech and growth marketing talent. Previously, he founded HiddenLevers, a risk analytics platform for wealth management that he bootstrapped from inception to acquisition by Orion Advisor Solutions in 2021, serving thousands of advisors and $600B in assets. He earlier founded SmartWorkGroups, acquired by Intralinks in 2000.

Connect on LinkedIn →

Get started

Get an Instant Project Plan + Cost Estimate

Describe your software or AI project. Get a full scope with story-point pricing, sprint estimates, and a downloadable plan in minutes. No calls, no waiting.