Category: AI & Technology

  • Where AI Earns Its Keep in Professional Services, and Where It Quietly Fails

    There are two ways to be wrong about AI in professional services, and almost every firm is wrong in one of them. The first is to treat AI as a discontinuity — to assume it is about to remake the profession, displace the practitioners, and reward the firms that bet aggressively on rebuilding themselves around it. The second is to treat AI as a fad — to assume it is hype, that the existing way of doing things will reassert itself, and that any investment in it is a tax on a profession that has worked fine for a hundred years. Both views are reassuring in their certainty. Both are wrong. The reality is messier and more interesting, and getting it right requires resisting the urge to be certain about something that is still in motion.

    Every conversation about AI in professional services eventually arrives at the same set of questions. Will AI replace attorneys, accountants, bookkeepers? Will small firms lose to large firms with better technology? Will technology-first competitors disrupt incumbents? These are the wrong questions, or at least they are the wrong first questions. The right first question is far more boring: where, specifically, in the work that this firm does every day, can AI make the work better?

    The honest answer for most small firms today is “a few places, narrowly, with careful supervision.” That is less exciting than the broader claims, but it is what we actually see when we deploy these tools inside our firms. The places where AI works are surprising. The places where it does not are also surprising. The difference between the two has almost nothing to do with the underlying model and almost everything to do with the structure of the work — which is the part that gets the least attention in the AI discourse and that, in our experience, matters the most.

    Where AI Earns Its Keep Today

    Document review. Not the final review by an attorney, but the first-pass triage. Finding the relevant clauses in a hundred-page contract, surfacing the unusual provisions, comparing against a known-good template. The attorney still does the legal judgment, but she does it on a curated and annotated document instead of a raw one. The time savings are real. The accuracy improvement is also real — the AI does not get tired on page sixty, the way a human reviewer does, which means the unusual clause that hides on page sixty-two no longer gets missed.

    Drafting. Standard letters, standard motions, standard engagement letters, standard responses to common client questions. The output is never publishable as-is, but it is far better than starting from a blank page. The skill is in writing the prompt correctly and in editing the output rigorously. The associate who knows how to do both does the same work in half the time. The skill of editing AI output is, importantly, not the same as the skill of drafting from scratch. It requires a different cognitive posture — a critical, suspicious, line-by-line read rather than a generative one. Firms that train their associates explicitly in this skill get more out of AI drafting than firms that simply hand the tools to the team and hope.

    Research. Tax research, case research, regulatory research. AI search is good at finding the relevant authority. It is not yet good at synthesizing the authority into a defensible answer. So we use it to find what to read, not to decide what to do. The distinction is operational: AI is a research assistant, not a research conclusion. The firm that treats it as a research conclusion will eventually issue an opinion that is wrong, lose a client over it, and discover that AI hallucination is not a theoretical risk — it is a malpractice risk hiding inside a productivity tool.

    Bookkeeping categorization. The marginal AI improvement here is enormous because the work is repetitive, the categories are well-defined, and the corrections are easy to learn from. The bookkeeper goes from coding every transaction to reviewing the AI’s codings. Throughput doubles. Accuracy goes up. This is the canonical example of AI fitting the structure of the work — a high-volume, well-defined, correctable task with clear feedback loops. Where the structure of the work matches the strength of the model, the value is unambiguous. Where the structure does not match, no amount of model improvement helps.
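
    In code, the shape of that loop is simple. Below is a minimal sketch, with a keyword rule standing in for the model call and an illustrative category list; a real deployment would call a model here and use the firm's actual chart of accounts.

    ```python
    from dataclasses import dataclass

    CATEGORIES = {"office_supplies", "travel", "software", "meals", "uncategorized"}
    REVIEW_THRESHOLD = 0.90  # below this confidence, a human codes the transaction

    @dataclass
    class Transaction:
        description: str
        amount: float

    def suggest_category(txn: Transaction) -> tuple[str, float]:
        # Stand-in for the model call; a keyword rule only so the sketch runs.
        if "airline" in txn.description.lower():
            return "travel", 0.95
        return "uncategorized", 0.10

    def triage(txns: list[Transaction]) -> tuple[list, list]:
        auto_coded, review_queue = [], []
        for txn in txns:
            category, confidence = suggest_category(txn)
            if category in CATEGORIES and confidence >= REVIEW_THRESHOLD:
                auto_coded.append((txn, category))  # bookkeeper spot-checks these
            else:
                review_queue.append(txn)            # bookkeeper codes these by hand
        return auto_coded, review_queue

    coded, queue = triage([Transaction("United Airlines BOS-SFO", 412.50)])
    ```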

    Where AI Quietly Fails

    Anything that requires the model to understand who the client actually is, what they actually want, and what they have actually agreed to. AI does not know your client. It cannot tell you whether the answer that is technically correct is also the answer your client should hear, in the way your client should hear it, given the relationship you have with them.

    Anything that involves novel judgment. The first time a fact pattern looks like X but is actually Y, AI will get it wrong, because it is averaging across cases it has seen. The exceptions are where the practitioner earns her living. AI cannot replace the practitioner there and probably should not try.

    Anything that introduces material risk. We do not let AI send anything to clients without human review. We do not let AI sign anything. We do not let AI make decisions that we would not let a first-year associate make on her own. The standard is the same one we have always used for first-year work: useful, but always reviewed.

    The “quiet failure” framing in this section’s title is deliberate. AI does not usually fail loudly. It fails by producing output that looks plausible, that is wrong in ways that are subtle, and that requires a knowledgeable reviewer to catch. The firms that get hurt by AI are not the firms whose AI tools crashed. They are the firms whose AI tools worked just well enough to be trusted by people who did not have the skill to verify the output. The reviewer-skill problem is the actual problem. The model-quality problem is secondary, and the vendors will solve it faster than firms will solve the reviewer-skill problem. The firms that invest in their reviewers’ AI literacy are the firms that will use AI well over the next decade. The firms that invest only in tools are buying half the answer.

    The Deterministic Layer Is the Point

    We have written elsewhere about the line between deterministic systems and nondeterministic ones. AI is nondeterministic by nature. The work in our firms is mostly deterministic by nature — the same kinds of matters, the same kinds of documents, the same kinds of decisions, with the same kinds of safeguards. The way we use AI is to put nondeterministic steps inside deterministic workflows, with deterministic checks on the output. This is unglamorous and it is also what works.

    A modern tax controversy practice looks the same as it did five years ago from the client’s perspective. The forms are the same. The deadlines are the same. The IRS is the same. What has changed is that the steps in between — pulling transcripts, summarizing notices, drafting responses, calculating projections — happen faster and with fewer errors. The practitioner spends more time on the substantive judgment and less time on the mechanical work. That is the entire promise of AI in this kind of practice, and it is enough.

    The architectural insight here is worth stating explicitly. The job of the firm is to deliver deterministic outcomes — the right legal advice, the right return, the right book of accounts. The job of the workflow is to deliver those outcomes reliably. AI is a tool that can do some of the intermediate steps faster, but it cannot be allowed to compromise the determinism of the outcome. So we wrap the nondeterministic AI steps in deterministic scaffolding: a structured input, a known-good template, a human reviewer, a checklist verification. The scaffolding is the firm’s promise to the client. The AI is the productivity multiplier inside the scaffolding. Firms that get this layering right move faster without losing reliability. Firms that get it wrong move faster and lose reliability at the same time, and the loss of reliability is not visible until the first time it matters, at which point it is too late.
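
    What that layering looks like in code is unglamorous, which is the point. A minimal sketch, with a placeholder for the drafting model and an illustrative checklist; the scaffolding, not the model, carries the promise to the client.

    ```python
    from dataclasses import dataclass

    @dataclass
    class EngagementInput:  # structured input: the deterministic front door
        client_name: str
        matter_type: str
        fee_arrangement: str

    REQUIRED_PHRASES = [    # deterministic checklist applied to the output
        "scope of engagement",
        "fee arrangement",
        "termination",
    ]

    def draft_with_llm(data: EngagementInput) -> str:
        # Placeholder for the nondeterministic drafting step.
        return (f"Scope of engagement: {data.matter_type} for {data.client_name}. "
                f"Fee arrangement: {data.fee_arrangement}. Termination: on notice.")

    def produce_draft(data: EngagementInput) -> str:
        draft = draft_with_llm(data)
        missing = [p for p in REQUIRED_PHRASES if p not in draft.lower()]
        if missing:
            raise ValueError(f"draft failed checklist, missing: {missing}")
        return draft  # passing the checklist earns a human reviewer, not a client
    ```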

    The Talent Implication Most Firms Miss

    The popular narrative is that AI will reduce the need for junior associates. The narrative is partially right and mostly misleading. The mechanical work that junior associates used to do — first-pass document review, basic research, template drafting — is exactly the work AI is best at. So firms will indeed need fewer hours of that work from juniors. But the firms that will thrive are the ones that take the time they used to spend supervising juniors on mechanical work and reinvest it in training juniors on the judgment work that AI cannot do. The result is the same headcount but a different developmental curve — juniors doing harder, more cognitively demanding work earlier in their careers, and reaching senior-level judgment faster than their predecessors did.

    The firms that get this wrong will hollow out their talent pipeline. They will keep the same supervision model — juniors doing mechanical work, seniors reviewing it — but with AI in the middle, which means the juniors are not actually doing the mechanical work, which means they are not building the muscle that the mechanical work used to build. Five years later, those firms will have senior associates who have never had to read a hundred-page contract from cover to cover, and who therefore cannot reliably catch the things that the AI missed. The talent risk of AI is not that it will replace the juniors. The talent risk is that it will produce a generation of seniors who never developed the underlying skill that AI is now imperfectly performing. The cure is to be deliberate about what juniors do learn, given what they no longer have to do.

    What We Are Building Toward

    Over the next several years we expect AI to keep moving from optional to assumed inside the firms we own. The associates we hire will use it because it makes their work better. The clients will benefit because the work will be faster, cheaper, and more accurate. The competitive advantage will accrue to firms that integrate AI carefully into their existing workflows, not to firms that try to rebuild themselves around it. Quiet integration beats loud rebranding every time.

    The firms that will struggle are not the ones that are slow to adopt AI. They are the ones whose underlying processes were so undocumented and so ad-hoc that they cannot tell where AI would fit. The pre-condition for using AI well is having a real process to begin with. That has always been the pre-condition for everything else in a professional services firm, too. AI is not a way to skip the work of building a real firm. It is a multiplier that rewards firms that have already done the work. The multiplier on zero is still zero, and a lot of firms are about to discover that their AI investments are multiplying the wrong thing.

    What to Do Monday Morning

    Pick three tasks in your firm that are repetitive, well-defined, and currently consume meaningful associate time. Document those tasks. Then pilot AI on them, with explicit human review at every step. Measure the time savings, the accuracy delta, and the reviewer experience. The pilot is the data. Once you have the data, you can decide whether to expand the use of AI on that task — and whether to expand it to other tasks of similar shape.
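
    The measurement itself is arithmetic, not analytics. A sketch, with illustrative numbers rather than real pilot data:

    ```python
    def time_savings_pct(baseline_minutes: float, pilot_minutes: float) -> float:
        return 100.0 * (baseline_minutes - pilot_minutes) / baseline_minutes

    def accuracy_delta(baseline_error_rate: float, pilot_error_rate: float) -> float:
        # Positive means the pilot produced fewer reviewer-caught errors.
        return baseline_error_rate - pilot_error_rate

    print(time_savings_pct(120, 70))   # illustrative: about 41.7% less associate time
    print(accuracy_delta(0.08, 0.03))  # illustrative: 5 points fewer errors
    ```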

    Resist the temptation to deploy AI across the firm at once. The deployment that scales is the one that is preceded by a documented process and followed by measured results. The deployment that fails is the one that is announced before it is tested. There is enormous pressure inside firms right now to be seen to be using AI. The pressure is mostly cultural, not commercial, and it is causing firms to make commitments faster than their actual experience with the tools supports.

    And finally, decide explicitly what you will not let AI do. Write it down. Tell the team. That list is as important as the list of what you will let AI do, because the boundary is what protects the firm from the quiet failure mode. Firms with clear lines about what AI does and does not do produce reliable work with AI in the mix. Firms with fuzzy lines produce variable work and eventually a malpractice claim. The clear-line firm is the durable firm, and the durability is the point.

  • Deterministic vs. Nondeterministic: Where LLMs Fit in Modern Automation

    Every few years, an entirely new primitive arrives that forces operators to rethink how work gets done. The spreadsheet did it. The relational database did it. The web browser did it. Each one looked, on arrival, like a curiosity for hobbyists and a threat to whatever workflow it eventually displaced. Each one ended up redrawing the org chart of the average company. The pattern is so consistent that, once you have seen a few of these transitions, the temptation is to assume you know what to do with the next one. The temptation is usually wrong. What stays constant across these transitions is not the playbook for adopting the new primitive — it is the fact that the operators who get the next one right are the ones who refuse to assume they already know how to think about it.

    Large language models are the latest primitive, and they have already triggered the usual cycle: breathless optimism, predictable backlash, and a quieter, more interesting middle phase in which serious operators figure out where the new tool actually belongs. We sit on the boards and in the cap tables of small technology companies working through that middle phase right now. What follows is what we are seeing, what we believe, and where we think the puck is going. The thesis, briefly stated: the right mental model for LLMs is not “intelligent software” but “probabilistic infrastructure,” and the companies that internalize that distinction will build dramatically more durable systems than the ones that do not.

    The Old Contract: Determinism as a Feature

    For the last fifty years, enterprise software has been built on a single, almost moral premise: given the same input, produce the same output, every time. This is determinism, and it is not a stylistic choice. It is the load-bearing assumption underneath payroll, accounting, settlement, claims processing, identity, access control, and the long tail of internal tooling that makes a company function. When a deterministic system gives a different answer on Tuesday than it gave on Monday, that is not creativity. That is a bug, and in regulated industries it is often a fineable one.

    Deterministic systems are testable. They are debuggable. They produce audit trails that satisfy auditors, regulators, and the occasional plaintiff’s attorney. They are also, by design, brittle. Tell a rule engine something it has not seen before and it either fails loudly or, worse, fails quietly. For decades, the industry has papered over that brittleness with two expensive ingredients: humans and consultants. Both are now in shorter supply than the work requires.

    The deeper observation is that the deterministic regime made an implicit bet about the world: that the inputs to enterprise systems could be cleaned up before they arrived. Forms would be filled out correctly. PDFs would conform to templates. Customers would describe their problems in approved categories. The bet was never fully true, but it was true enough for a long time that the industry got away with it. The cleanup work was hidden inside the human layer — clerks, support agents, paralegals, billing operators — who normalized messy reality into the structured shapes the deterministic systems could consume. The work was real, but it was outside the software, so it did not show up in the architecture. LLMs are the first technology that can credibly absorb a large fraction of that hidden cleanup work, which is why their impact will be larger than the surface metrics suggest.

    The New Primitive: Probabilistic by Design

    LLMs invert the old contract. They are probabilistic, sampling from distributions over possible outputs. Ask the same question twice and you may get two reasonable, non-identical answers. For an operator coming from the deterministic world, this feels like a regression. It is not. It is the price of a capability that traditional software has never had: the ability to read messy, unstructured, ambiguous input and produce a useful response without anyone having to enumerate the rules in advance.

    The strategic question for any operator is therefore not whether to adopt LLMs. That question is already settled by the economics. The strategic question is where, inside the business, the new primitive belongs, and just as importantly, where it does not.

    The cultural transition is harder than the technical one. Engineers who have spent decades treating nondeterminism as the enemy now have to learn to design with it, around it, and on top of it. The right mental model is not “LLMs make software smarter.” The right mental model is “LLMs are a new kind of dependency, with different failure modes than the dependencies engineers are used to managing.” This sounds modest. It is the entire game. Companies that treat LLMs as smart software try to push them deeper into the core, where the consequences of variance are highest, and discover the hard way that probabilistic systems do not belong there. Companies that treat LLMs as a probabilistic dependency keep them at the edges, wrap them in deterministic scaffolding, and discover that the same model can do an enormous amount of useful work without ever being asked to make decisions it cannot reliably make.

    Where Determinism Still Wins

    Determinism still wins anywhere correctness is binary and the cost of variance is high. Moving money. Writing to a system of record. Granting or revoking access. Calculating tax. Signing a contract. Executing a known sequence of API calls. These are the places where the answer is either right or wrong, where there is no graceful degradation, and where the auditor will eventually come asking. We tell our portfolio companies, bluntly, that an LLM has no business making any of these decisions on its own. The deterministic layer is not legacy. It is the spine.

    The reason the deterministic spine has to remain deterministic is not just regulatory. It is epistemic. A regulator, a customer, or an internal investigator must be able to look at a system and answer the question, “why did this happen?” Deterministic systems answer that question by replaying the inputs and showing the same outputs. Probabilistic systems cannot answer it the same way, because the same inputs do not always produce the same outputs. You can sometimes recover a plausible explanation by examining the prompt, the context, and the seed — but “plausible” is not “deterministic,” and any business that has to defend its decisions in front of an auditor will discover that the difference matters.

    Where LLMs Earn Their Keep

    LLMs earn their keep at the edges of the business, in the places where structured systems have always struggled. Reading a customer email and figuring out what the customer actually wants. Pulling line items out of a PDF invoice that arrived in a format no one has seen before. Triaging a support ticket. Drafting a first pass of a contract, a memo, a follow-up. Classifying a transaction. Summarizing a meeting. Translating a stakeholder’s loose request into a structured query against a database that already exists.

    None of this is glamorous. All of it is expensive when done by humans, and all of it has historically been the work that breaks every rule-based system the moment reality drifts. This is the work LLMs were built for, and it is the work where the ROI in our portfolio has been most consistent and most defensible.

    The pattern of where LLMs work and where they do not is, ultimately, a pattern about the structure of the underlying task. Tasks with high input variance and tolerant-of-variance outputs — interpretation, summarization, classification, first-draft generation — are where LLMs shine. Tasks with low input variance and intolerant-of-variance outputs — money movement, identity, compliance — are where LLMs fail. The mistake operators make is treating the LLM as a general-purpose tool that can be pointed at anything. It is not. It is a specialized tool with a particular shape, and the operators who match the tool to tasks of the right shape are the ones who get returns from it. The operators who match the tool to tasks of the wrong shape are the ones writing the cautionary press releases.

    The Hybrid Pattern

    The systems that are working in production are neither purely deterministic nor purely LLM-driven. They are hybrids, and the pattern is converging across our companies and across the industry.

    An LLM sits at the edge, where the input is messy. It interprets, extracts, classifies, and proposes. Deterministic code sits at the core, where the action is consequential. It validates, executes, and logs. Between the two lives a contract: structured outputs, schema validation, allow-listed tool calls, retries, and human-in-the-loop review for the cases that fall outside the contract. The LLM proposes. The deterministic layer disposes. The audit trail survives.
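
    A minimal sketch of that seam, with illustrative tool names and a made-up proposal format. The LLM's only power is to produce a string; everything downstream of the string is deterministic code.

    ```python
    import json
    import logging

    logging.basicConfig(level=logging.INFO)

    ALLOWED_TOOLS = {  # the allow-list: anything else is refused outright
        "create_ticket": lambda args: f"ticket created: {args['summary']}",
        "lookup_order":  lambda args: f"order status for {args['order_id']}",
    }

    def dispose(proposal_json: str) -> str:
        proposal = json.loads(proposal_json)   # malformed output fails here
        tool, args = proposal["tool"], proposal["args"]
        if tool not in ALLOWED_TOOLS:
            raise PermissionError(f"tool {tool!r} is not allow-listed")
        result = ALLOWED_TOOLS[tool](args)     # deterministic execution
        logging.info("tool=%s args=%s result=%s", tool, args, result)  # audit trail
        return result

    # The LLM proposes; the deterministic layer disposes.
    print(dispose('{"tool": "lookup_order", "args": {"order_id": "A-1042"}}'))
    ```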

    This is not a theoretical architecture. It is, increasingly, the default. The companies that have skipped this discipline and pointed an LLM directly at production systems have been the ones generating the cautionary tales that fill the trade press.

    The hybrid pattern is also where the architectural craft has migrated. A decade ago, the interesting design decisions in enterprise software were about data models, API contracts, and consistency guarantees. Today, the most interesting decisions are about where the seam between LLM and deterministic code sits, how the contract between them is enforced, and what happens when the LLM produces something the deterministic layer cannot accept. These are real engineering decisions with real consequences, and the engineers who get them right are the ones whose systems will run quietly in production for years. The engineers who treat the seam as an afterthought are the ones whose systems will be torn out within a year of deployment, replaced by something better-designed by a competitor.

    Taming the Nondeterminism

    You can narrow an LLM’s variance, though you cannot eliminate it. Temperature settings, constrained decoding, JSON schemas, function calling, evals, and automated retries all push the distribution of outputs toward something tight enough to ship. The teams that are winning treat their LLM layer the way good engineering teams have always treated flaky dependencies: assume it will misbehave, design the surrounding system to catch it, and measure relentlessly. Evals are the new unit tests, and the companies that take them seriously are the ones whose systems do not embarrass them in front of customers.
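
    The retry discipline, sketched with a placeholder where a temperature-zero model call would sit: validate every output against the contract, retry a bounded number of times, and escalate to a human when the contract is never met.

    ```python
    import json

    REQUIRED_KEYS = {"category", "confidence"}  # the output contract

    def call_model(prompt: str) -> str:
        # Placeholder; a real system would call the provider here.
        return '{"category": "travel", "confidence": 0.93}'

    def call_with_contract(prompt: str, max_attempts: int = 3) -> dict:
        for _ in range(max_attempts):
            raw = call_model(prompt)
            try:
                parsed = json.loads(raw)
            except json.JSONDecodeError:
                continue                        # malformed output counts as a miss
            if isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys():
                return parsed                   # output met the contract
        raise RuntimeError(f"no valid output in {max_attempts} attempts; escalate to a human")
    ```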

    The discipline of evals deserves special emphasis, because it is the discipline that most determines whether an LLM-powered system can be operated at scale. An eval is a test that measures, against a defined set of inputs, whether the LLM’s outputs meet a defined standard. The companies that run hundreds of evals per release catch regressions before they reach customers. The companies that run none discover regressions after customers notice. The cost of building evals is real and the cost of not building them is larger, but the second cost is invisible until it suddenly is not. We push our portfolio companies, hard, to invest in evals well before they feel necessary, because the alternative is to operate the system blind for as long as the team will tolerate, which is always longer than is wise.
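
    A minimal eval harness is not much code, which removes the usual excuse. A sketch, with stand-in cases and a stand-in classifier where the real LLM-backed step would sit:

    ```python
    import sys

    EVAL_CASES = [  # fixed inputs and a defined standard; these are stand-ins
        {"input": "Flight BOS-SFO, United", "expected": "travel"},
        {"input": "Adobe subscription",     "expected": "software"},
        {"input": "Client lunch",           "expected": "meals"},
    ]

    def classify(text: str) -> str:
        # Stand-in for the LLM-backed step being evaluated.
        t = text.lower()
        if "flight" in t:
            return "travel"
        if "lunch" in t:
            return "meals"
        return "software"

    def run_evals(threshold: float = 0.9) -> None:
        passed = sum(classify(c["input"]) == c["expected"] for c in EVAL_CASES)
        rate = passed / len(EVAL_CASES)
        print(f"eval pass rate: {rate:.0%} ({passed}/{len(EVAL_CASES)})")
        if rate < threshold:
            sys.exit(1)  # fail the release the way a failing unit test would

    run_evals()
    ```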

    Build, Buy, or Wrap

    A recurring question in our boardrooms is what to build, what to buy, and what to wrap. Our working answer: do not build foundation models, do not buy thin wrappers, and wrap aggressively wherever the underlying model is a commodity. The durable value at the small-company scale is almost never in the model itself. It is in the proprietary data, the workflow integration, the deterministic guardrails, and the trust the company has earned with its customers. Those are the assets that compound. The model underneath will be replaced, probably more than once, before the decade is out.

    The most reliable way for a small company to lose money in this market is to fall in love with a particular model provider and design the product around the provider’s specific capabilities. The model market is moving too fast, and the provider that is on top this quarter will not necessarily be on top next quarter. The companies that abstract the model behind their own interface — that can swap providers as the price-performance curve moves — preserve optionality that the companies that hardwire a particular API quietly lose. Optionality is not a feature anyone buys directly, but it is a feature that, over five years, separates the companies that compound from the companies that have to keep rebuilding.
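
    A sketch of what that abstraction looks like, with two hypothetical adapters in place of real vendor SDKs. Application code depends on the interface and never on a vendor, so the swap is a configuration change rather than a rewrite.

    ```python
    from typing import Protocol

    class CompletionProvider(Protocol):
        def complete(self, prompt: str) -> str: ...

    class ProviderA:  # hypothetical adapter; would wrap vendor A's SDK
        def complete(self, prompt: str) -> str:
            return f"[provider A] {prompt}"

    class ProviderB:  # hypothetical adapter; would wrap vendor B's SDK
        def complete(self, prompt: str) -> str:
            return f"[provider B] {prompt}"

    def summarize_ticket(ticket: str, llm: CompletionProvider) -> str:
        # Application code sees only the interface the company owns.
        return llm.complete(f"Summarize this support ticket: {ticket}")

    print(summarize_ticket("Customer cannot log in after password reset", ProviderA()))
    ```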

    What We Are Watching

    Three things are on our radar going into the next stretch. First, the slow professionalization of LLM operations: evals, observability, prompt versioning, and the rest of the unglamorous plumbing that turns a demo into a system. Second, the migration of agentic patterns out of the lab and into real workflows, which will force a much more serious conversation about permissions, identity, and liability than the industry has had so far. Third, the quiet consolidation of the deterministic layer itself, as workflow engines, orchestrators, and policy systems learn to host LLM steps as first-class citizens rather than bolt-ons.

    Of these three, agentic patterns are the one most likely to be misjudged, in both directions. The optimists will claim that agents are the future of all enterprise work, and they will be partially right and mostly early. The pessimists will claim that agents are a parlor trick that does not scale, and they will be partially right about the current generation and entirely wrong about the trajectory. The truth is that agents will work, but only in narrow, well-instrumented contexts, with clear permissions, careful identity controls, and explicit human accountability. The companies that build that scaffolding before they ship agents will be fine. The companies that ship agents without it will produce the next round of cautionary tales.
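
    A sketch of the minimum scaffolding an agent action should pass through, with illustrative identities and permissions: an identity, a per-agent allow-list, and an audit record that names the accountable human.

    ```python
    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class AgentIdentity:
        agent_id: str
        owner: str  # the human who answers for this agent's actions

    PERMISSIONS = {  # illustrative: note the absence of anything that moves money
        "billing-agent": {"read_invoice", "draft_dunning_email"},
    }

    AUDIT_LOG: list[dict] = []

    def authorize(agent: AgentIdentity, action: str) -> bool:
        allowed = action in PERMISSIONS.get(agent.agent_id, set())
        AUDIT_LOG.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent": agent.agent_id, "owner": agent.owner,
            "action": action, "allowed": allowed,
        })
        return allowed

    agent = AgentIdentity("billing-agent", owner="jane@example.com")
    print(authorize(agent, "draft_dunning_email"))  # True
    print(authorize(agent, "issue_refund"))         # False, and logged either way
    ```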

    The Real Question

    The interesting question, then, is not whether deterministic or nondeterministic systems are better. That framing belongs to a debate that has already been settled by the operators actually shipping software. The interesting question is where, inside your specific business, the line between the two should sit, and who in your organization has the authority and the technical judgment to draw it.

    Get that line right, and you get a business that is both reliable and adaptive: the audit trail your regulators expect, paired with the flexibility your customers have quietly started to demand. Get it wrong in either direction, and you end up with a system that is either too brittle to serve the market or too loose to be trusted with it. The companies that figure this out early will spend the next decade compounding the advantage. The ones that do not will spend it explaining themselves.

    What to Do Monday Morning

    Map your existing system into two columns. Determinism on one side; tolerated nondeterminism on the other. Be honest about which side each step actually belongs on, not which side it currently sits on. Most companies, when they do this exercise for the first time, discover that they have placed several deterministic-by-nature operations inside an LLM call, and several LLM-appropriate operations inside rigid rule engines. The map is the diagnosis. The work that follows is the cure.
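
    The exercise can literally be a script. A sketch, with illustrative steps; the mismatch report is the diagnosis.

    ```python
    STEPS = [  # (step, where it currently sits, where it actually belongs)
        ("parse inbound email",     "deterministic", "llm"),
        ("calculate invoice total", "llm",           "deterministic"),
        ("write to ledger",         "deterministic", "deterministic"),
        ("draft customer reply",    "llm",           "llm"),
    ]

    for step, current, target in STEPS:
        if current != target:
            print(f"MISMATCH: {step!r} sits in {current}, belongs in {target}")
    ```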

    Build the contract between the two layers with the same rigor you would use for an API contract with an external vendor. Schema validation, allow-listed tool calls, structured outputs, retries, fallback paths, human-in-the-loop review for edge cases. Treat the LLM as if it were a contractor whose work product had to be checked before it could be accepted into the system. Because that is what it is.
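
    The contractor posture, sketched with illustrative field names: the work product passes an acceptance check or it lands in a human review queue with the original input preserved.

    ```python
    import json

    REVIEW_QUEUE: list[dict] = []  # the human-in-the-loop fallback path

    def accept_extraction(source_doc: str, llm_output: str) -> dict | None:
        try:
            fields = json.loads(llm_output)
        except json.JSONDecodeError:
            fields = None
        ok = (
            isinstance(fields, dict)
            and set(fields) == {"vendor", "total", "due_date"}
            and isinstance(fields["total"], (int, float))
            and fields["total"] >= 0
        )
        if ok:
            return fields                      # accepted into the system of record
        REVIEW_QUEUE.append({"doc": source_doc, "raw": llm_output})
        return None                            # a human finishes this one

    print(accept_extraction(
        "invoice-001.pdf",
        '{"vendor": "Acme", "total": 412.50, "due_date": "2025-07-01"}',
    ))
    ```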

    And finally, invest in evals before you feel ready. The companies that ship LLM-powered systems without evals are not saving time — they are deferring the cost of measurement, and the cost compounds. Evals are not a sign of maturity; they are a prerequisite for it. Build them early, run them often, and treat the eval suite as a first-class artifact of the codebase. The companies that do this will look, in five years, dramatically more competent than the companies that did not, and the gap will be visible in their products, their reliability, and their margins.