Artificial intelligence is genuinely useful for strategic and structural work, but it is not yet a trusted operator inside production systems. The gap between the marketing promise and the operating reality is a governance problem, and it is one that boards are only beginning to face.
AI accelerates serious work, but production deployment requires governance, staging, rollback, scope control, and human review that the tools do not provide on their own.
Artificial intelligence is a capable implementer but not a reliable autonomous operator inside production systems. The most consequential failures are not visible errors but invisible technical debt that accumulates while surface output looks correct. Multi-model workflows, with one AI planning and another executing, improve architectural discipline but increase rather than decrease the supervision burden on the human operator. The missing layer in enterprise AI adoption is governance: scope control, staging, rollback, dependency mapping, human review, and a strict bias toward native platform functionality over custom code. The firms that benefit most from artificial intelligence will be the firms that professionalize their adoption rather than the firms that automate fastest.
For almost two months, I used artificial intelligence to help rebuild the website of a U.S.-registered investment bank. This was not a side project, not a sandbox, and not a demonstration assembled for a conference panel. It was a regulated, client-facing platform on which brand credibility, search visibility, structured content, page performance, and institutional trust all carried weight. The work involved a small team, a live production environment, and the ordinary commercial pressure that attaches to any piece of infrastructure that represents a financial advisory firm to its prospective counterparties.
The experience changed how I think about artificial intelligence in serious work, and it did so in a way that I did not anticipate at the outset. The model that has dominated public discussion in 2025 and into 2026 is one of accelerating autonomy, with each generation of tooling described as more capable, more agentic, and closer to the threshold at which entire professional functions can be handed over and run unattended. That model is not wrong, exactly. But it is incomplete in a way that becomes obvious the moment one moves from controlled demonstrations into the messier reality of live systems with dependencies, regulatory exposure, and consequences that do not reverse cleanly.
What I learned was not that artificial intelligence is overhyped. The leverage I obtained from it was real, and in several domains it was substantial. What I learned was that the marketing narrative around these tools is selling a quality of execution that the tools cannot yet provide on their own, and that the gap between the demonstration and the deployment is the place where most enterprise adoption is going to either succeed or fail in the years immediately ahead. The resolution of that gap is not a technical problem. It is a governance problem, and it is one that companies, boards, and operators are only beginning to take seriously.
THE MARKETING PROMISE AND THE OPERATING REALITY
The promise being sold to the market is one of ease. Modern artificial intelligence systems are presented as approachable, increasingly autonomous, and capable of replacing meaningful portions of human execution across knowledge work. The promise is directionally accurate. In a controlled environment, with a clean prompt and a bounded task, today's models produce output that is genuinely impressive. They write code, they generate content, they structure documents, they draft schemas, they summarize long material, and they produce recommendations that often hold up under inspection. The speed is real. The leverage is real. The reduction in friction across many knowledge tasks is also real, and anyone who pretends otherwise is not engaging seriously with what these tools have become.
Production systems, however, are not chat windows. A website, a customer relationship management platform, a financial model, a compliance workflow, an investor database, or a code repository is not a discrete task that can be completed and walked away from. It is an operating environment with dependencies that the artificial intelligence cannot see and cannot reliably reason about, and those dependencies have a way of becoming visible only after they are violated. A change in one place can affect another. A small script can conflict with an animation that loads three seconds later. A styling adjustment can break a layout at a breakpoint that the model never tested. A field in a content management system can fail silently in a way that looks correct on the page and incorrect to the search engine that indexes it.
This is the gap that most public discussion of artificial intelligence is currently glossing over. The tools are capable of producing a correct answer in isolation. The problem is that production work is not isolated. It requires context, awareness of constraints that have accumulated over time, scope discipline, the willingness to test before changing, the existence of a path back when the change is wrong, and the judgment to recognize when not to act at all. Those qualities are inconsistent in the current generation of artificial intelligence systems, and the marketing language obscures the inconsistency rather than acknowledging it.
WHERE THE TOOLS ARE GENUINELY STRONG
It would be intellectually dishonest to write about this experience without first acknowledging where artificial intelligence performed exceptionally well, because the strongest finding from the project is not that these tools failed but that their strengths and their weaknesses are not evenly distributed. They cluster, and they cluster in ways that suggest both how to use them now and how to think about their probable evolution.
In the strategic and structural layers of the work, artificial intelligence was an extraordinary collaborator. It helped construct a coherent site architecture, define service-page frameworks, organize the firm's positioning across mergers and acquisitions advisory, private placements, closed-end fund rights offerings, applied artificial intelligence advisory, corporate development services, and sector-specific work. It helped structure a content management system designed for retrieval by traditional search engines, by answer engines, and by large language models, which is not a trivial schema problem and which most professional service firms have not yet addressed. The strategic translation between the firm's commercial positioning and its digital expression was faster, sharper, and more consistent than I could have produced unassisted, and the underlying architecture has held its shape through every subsequent disruption to the project.
The same was true of business writing once tone and framing had been established. The institutional voice that runs through the site, across more than two hundred pieces of content and six service pages, is consistent in a way that ordinarily requires either a single very disciplined writer or a small team operating under tight editorial supervision. Artificial intelligence produced that consistency at a velocity no human team could match. Frequently asked questions, service descriptions, calls to action, metadata, and article summaries all carried the same register, the same tightness, and the same refusal of the marketing softness that disfigures most professional service writing. This is the kind of work for which the current tools are simply better than what most firms are willing to budget for, and the implications of that fact for how content operations are staffed are going to be substantial.
Schema and structured data design were similarly strong. The output is declarative. The rules are bounded. The result can be inspected against a specification. The same applies to taxonomy design, content classification, and the architecture of metadata for retrieval by language models. Across all of these surfaces, the tools were not merely useful. They were the right instrument for the task, and the work product was production-grade on the first or second pass.
"The dangerous part was not that artificial intelligence made mistakes. The dangerous part was that the mistakes often looked like progress." WHERE THE TOOLS ARE GENUINELY WEAK The dangerous part was not that artificial intelligence made mistakes. Every implementer makes mistakes, and every mature engineering organization is built around the assumption that mistakes will occur. The dangerous part was that the mistakes often looked like progress, and that the appearance of progress was, in the moment, indistinguishable from the substance of it.
The most revealing pattern of the project was that the tools repeatedly solved the visible problem while creating an invisible one. A layout issue would be addressed by injecting a script that later conflicted with another script. A page would render correctly above the fold while concealing duplicated structural sections beneath it that no human reader would notice but that a search engine crawler would find immediately. A visual animation would work in one context and create a performance issue elsewhere. A small styling change would bypass the platform's native conventions and introduce a maintenance liability that would not surface for weeks. None of this was incompetence on the part of the artificial intelligence. It was, in each case, a confident execution of a reasonable plan that did not reckon with what it could not see.
In one case, an entire page on the site became unrecoverable after conflicting front-end logic was introduced into it. The platform's recovery mechanism, which is the only rollback path available to users of that system, ran for over an hour without completing and required intervention from the platform's support team. The site was eventually restored through the manual creation of a duplicate copy from the last working backup. The episode was not catastrophic in the long run, but it was instructive. The artificial intelligence had executed each individual step competently. The combination of the steps was a destruction event. The model could write the code. It could not reliably understand what the code would do in combination with the code that was already there.
That distinction is worth dwelling on, because it is the distinction that separates a capable junior implementer from a senior one. A junior developer, given a task, will write code that accomplishes the task. A senior developer, given the same task, will first ask what should not be touched, what dependencies exist, what testing should occur, and what path back is available if the change does not work. The current generation of artificial intelligence has the first capability in abundance. It does not yet reliably have the second. And the marketing of these tools, which is uniformly aimed at making them feel approachable to non-engineers, actively obscures the absence of the second capability rather than warning users about it.
The default behavior of the tools under pressure compounded the problem. When something did not work, the model's reflex was to add more code, another workaround, another custom rule, another layer of logic on top of the layers already present. This created the appearance of forward motion. The visible problem would often disappear. The system, however, became more fragile with each addition. By the middle of the project, the platform's script registry, which can only grow and which provides no means of cleanup, contained over four hundred entries on the original site and over six hundred on the rebuilt copy, against an applied ceiling of fifteen. The remaining entries were dead weight, accumulated from iterative fixes, and they were unrecoverable through any tool the platform exposed.
This is a broader lesson that applies far beyond a single web project. Artificial intelligence makes complexity cheap to generate. It does not make complexity cheap to maintain. Anything that can be produced in three seconds can also be produced four hundred times in twenty minutes, and a system that has accumulated four hundred near-identical fixes is not a system that has been improved. It is a system that has been buried.
TWO MODELS ARE BETTER THAN ONE, BUT NOT ENOUGH
Roughly halfway through the project, I changed the workflow. Instead of issuing instructions directly to the execution model, I began drafting them through a second artificial intelligence acting as a planning and quality layer. The second model would review the state of the work, formulate a structured brief, identify constraints I had documented in earlier sessions, and propose a more disciplined version of the instruction than I would have produced on my own. The brief would then be passed to the execution model, which would carry it out.
The change was significant. The quality of execution improved noticeably. Single-pass clean builds, which had been rare in the early phase, became the norm. Architectural decisions that had previously been made inline during execution were now made deliberately, before any code was written. The supervising model functioned, in effect, as a senior engineer reviewing the work plan of a junior one, and the junior one performed better against a clearer plan. There was a moment in the session record, after one such review produced a structurally cleaner approach to an animation problem that had defeated three earlier attempts, when the execution model wrote the words "Stop. The other model is right." That sentence captured the dynamic precisely.
It would be tempting to present this as a solution. It is not. It is a partial governance pattern that improves architectural quality without resolving the underlying issue, and it carries costs of its own that are worth being honest about.
The first cost is that the supervising model cannot execute. Every instruction it produces still has to be carried out by the execution model, and the execution model continues to make execution-level mistakes when the brief does not anticipate a specific platform constraint. Silent failure modes, payload limits, foreground tab requirements, and class-replacement behaviors continued to cause regressions even when the supervising model had produced a structurally clean plan. The supervisor improves planning. It does not improve the reliability of the execution layer.
The second cost is that the human remains in the critical path. The supervising model produces briefs that I have to validate before they are executed, because the supervisor cannot directly observe the platform constraints either. Without that validation step, the supervising model would produce architecturally clean instructions that would still fail in execution. The human cannot be removed from this loop without forfeiting the benefit of the loop.
The third cost is the most important one, and it is the cost that should give pause to anyone designing multi-model workflows for enterprise use. Two models in sequence impose more cognitive load on the operator, not less. The operator is now the project memory, the constraint validator, the adjudicator between two models that may disagree on approach, and the quality reviewer of the final output. The supervision burden does not decrease. It shifts upward in the abstraction stack, from the level of code review to the level of architectural review. The work becomes harder to delegate, not easier, because the operator must now hold the entire system in mind in order to know whether either model is producing output that is appropriate to the constraints that neither model can fully see.
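The shape of this workflow, with the human sitting between the planning model and the execution model, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the tooling used on the project: `run_task`, the stub planner and executor, and the operator's validation rule are all hypothetical stand-ins, and real models would replace the stub functions.

```python
from typing import Callable, Optional


def run_task(
    task: str,
    constraints: list[str],
    plan: Callable[[str, list[str]], str],
    validate: Callable[[str], bool],
    execute: Callable[[str], str],
    log: list[tuple[str, str]],
) -> Optional[str]:
    """Plan, pause for human validation, then execute; record every step."""
    brief = plan(task, constraints)
    log.append(("brief", brief))
    if not validate(brief):
        # The operator rejected the brief, so nothing reaches the executor.
        log.append(("rejected", task))
        return None
    result = execute(brief)
    log.append(("executed", result))
    return result


# Hypothetical stand-ins for the two models and the human reviewer.
def stub_planner(task: str, constraints: list[str]) -> str:
    return f"{task}; do not touch: {', '.join(constraints)}"


def stub_executor(brief: str) -> str:
    return f"done: {brief}"


def operator_check(brief: str) -> bool:
    # Assumed rule: the brief must restate a constraint the operator knows about.
    return "global animation script" in brief


log: list[tuple[str, str]] = []
outcome = run_task(
    "tighten hero spacing",
    ["global animation script"],  # constraints held by the operator
    stub_planner,
    operator_check,
    stub_executor,
    log,
)
```

The constraints enter the pipeline from the operator rather than from either model, which is the point made above: the human is the project memory, and removing the `validate` step removes the only place where unseen platform constraints are checked.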
THE LAYER NO ONE IS BUILDING
The layer that the market is currently underbuilding is the governance layer that sits above the model. Most of the public discussion of artificial intelligence in the enterprise is focused on the model itself, on which model is most capable, which prompting strategy produces the best results, which fine-tuning approach is most effective, and which agentic framework is most autonomous. These are the wrong questions to be asking first. The binding constraint in production deployment is not the model. It is the operating discipline that surrounds it, and that discipline is not provided by any of the tools and is not yet being built around them at the speed it should be.
In a serious production environment, the minimum framework that should surround any artificial intelligence touching live systems includes scope control before execution, a staging environment that mirrors production with sufficient fidelity to catch dependency conflicts, a rollback mechanism that actually works rather than one that exists in the documentation but fails under pressure, version control or at minimum a reliable change log, dependency mapping prior to any structural change, human review before deployment, post-change quality assurance, explicit limits on what the model is permitted to access or modify, a strong bias toward native platform functionality before any custom solution is introduced, and a strict prohibition on broad changes when the requested task is narrow.
None of this is glamorous. None of this resembles the marketing of agentic artificial intelligence. But this is the substance of what makes production systems safe, and it is the substance that has been almost entirely absent from the way artificial intelligence is being introduced into operational workflows. The disciplines I have just listed are not new. They are the accumulated lessons of every system that has ever broken, codified into the standard practice of mature engineering organizations over the past forty years. Code review, staging, versioning, rollback, testing, and scope control exist for reasons. The reasons did not disappear when the implementer became a language model. If anything, the reasons became more important, because the language model is faster, more confident, and less aware of consequences than any human implementer it has displaced.
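Three of these controls (scope control, human review before deployment, and a rollback point taken before the change) can be made concrete in a short Python sketch. It is a minimal illustration under assumptions, not a description of any real platform: `ChangeRequest`, `Gate`, and the path patterns are hypothetical names, and the "site" is a plain dictionary standing in for a production system.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class ChangeRequest:
    """A proposed change from an AI implementer (hypothetical structure)."""
    task: str
    touched_paths: list[str]
    approved_by: Optional[str] = None  # name of the human reviewer, if any


@dataclass
class Gate:
    """Pre-execution checks: scope control, human review, rollback point."""
    allowed_patterns: list[str]
    snapshots: list[dict[str, str]] = field(default_factory=list)

    def out_of_scope(self, request: ChangeRequest) -> list[str]:
        # A path that matches none of the allowed patterns is out of scope.
        return [
            path for path in request.touched_paths
            if not any(fnmatchcase(path, p) for p in self.allowed_patterns)
        ]

    def apply(self, request: ChangeRequest, site: dict[str, str],
              changes: dict[str, str]) -> None:
        violations = self.out_of_scope(request)
        if violations:
            raise PermissionError(f"out of scope: {violations}")
        if set(changes) - set(request.touched_paths):
            raise PermissionError("change touches undeclared paths")
        if request.approved_by is None:
            raise PermissionError("human review required before deployment")
        self.snapshots.append(dict(site))  # rollback point, taken first
        site.update(changes)

    def rollback(self, site: dict[str, str]) -> None:
        # Restore the most recent snapshot in place.
        previous = self.snapshots.pop()
        site.clear()
        site.update(previous)


gate = Gate(allowed_patterns=["pages/services/*"])
site = {"pages/services/ma.html": "old", "scripts/global.js": "animation"}
request = ChangeRequest("fix heading", ["pages/services/ma.html"],
                        approved_by="reviewer")
gate.apply(request, site, {"pages/services/ma.html": "new"})
```

The gate refuses a narrow task that reaches into `scripts/global.js`, refuses any change without a named reviewer, and always records a snapshot before writing, so the path back exists before it is needed.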
The companies that will benefit most from artificial intelligence in the years immediately ahead are not going to be the companies that adopt the most tools or that integrate them most aggressively. They are going to be the companies that build the strongest governance layer around them. They will know where the tools are powerful and where the tools are fragile. They will separate planning from execution. They will preserve human judgment at the points where the cost of error is highest. They will use artificial intelligence aggressively, but they will not use it blindly. They will treat the output of these systems as leverage, not as authority, and they will design their workflows around the understanding that the model is an implementer rather than a decision maker.
THE QUESTIONS THAT BOARDS SHOULD BE ASKING
This is not, in the end, a story about a single website. The website was the laboratory, but the lessons travel. The same dynamics apply to any deployment of artificial intelligence inside an enterprise system, which is to say that they apply to almost every meaningful application of the technology that is currently being built or considered.
If an artificial intelligence agent can modify a customer relationship management platform, update a financial model, generate client-facing materials, alter a data room, interact with customer records, change a website, or write code into a production environment, then the relevant question for any board, any executive committee, and any investor evaluating the deployment is not whether the model can perform the task. The relevant questions are what the model can access, what it can change, who approves the change, whether there is an audit trail, whether the change can be reversed, whether the output has been tested, whether the model understands the dependencies it is operating among, what happens when it makes a confident mistake, and who owns the resulting risk.
These are not technology questions. They are governance questions, and the discipline required to answer them is the discipline that boards have always been expected to bring to any operational system that touches client trust, regulatory exposure, or financial integrity. The tools have changed. The discipline has not, and the firms that recognize this and act on it will compound the advantage that artificial intelligence offers them. The firms that do not will discover, sometimes catastrophically, that an unsupervised implementer working at machine speed inside a production system is not the productivity multiplier the marketing promised. It is an operational risk that the existing risk management framework was never designed to handle.
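For teams that want to operationalize the questions above, one option is to treat them as a record that must be complete before an agent is allowed near a live system. The Python sketch below is illustrative only: `DeploymentReview`, its field names, and `open_questions` are hypothetical and simply mirror the questions listed in this section.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DeploymentReview:
    """One answer per governance question; None means unanswered."""
    access_scope: Optional[str] = None      # what the model can access
    change_scope: Optional[str] = None      # what it can change
    approver: Optional[str] = None          # who approves the change
    audit_trail: Optional[str] = None       # where changes are recorded
    rollback_path: Optional[str] = None     # how a change is reversed
    test_evidence: Optional[str] = None     # how the output was tested
    dependency_map: Optional[str] = None    # known dependencies
    failure_response: Optional[str] = None  # plan for a confident mistake
    risk_owner: Optional[str] = None        # who owns the resulting risk


def open_questions(review: DeploymentReview) -> list[str]:
    """Return the names of the questions that still lack an answer."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


review = DeploymentReview(approver="head of operations",
                          risk_owner="chief risk officer")
remaining = open_questions(review)
```

A deployment would proceed only when `open_questions` returns an empty list, which turns the board-level questions into a gate rather than a discussion point.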
For investors evaluating artificial intelligence companies, the diligence implications follow directly. Demonstration quality is not enough. Speed is not enough. Usage growth is not enough. The hard questions are about workflow reliability, permissioning, auditability, exception handling, the human review loops that exist around the model, and the true cost of supervision once the deployment is at scale. A product that performs beautifully in a demonstration may still be unsafe in a production deployment, and a model that appears intelligent in a chat may still lack the judgment required to operate inside a complex business system without inflicting damage that no one budgeted for. None of this makes the underlying technology unworthy of investment. It does mean that the diligence model that was built for software-as-a-service in the 2010s is insufficient for artificial intelligence in the 2020s, and the firms that update their diligence accordingly will allocate capital better than the firms that do not.
INFRASTRUCTURE, NOT MAGIC
My conclusion, after two months of using artificial intelligence in a real production environment, is not that companies should slow their adoption of these tools. The conclusion is that companies should professionalize their adoption, and that the professionalization is the work that almost no one is currently doing.
Artificial intelligence is now part of the operating stack. It will increasingly touch content, code, workflows, due diligence, analytics, marketing, compliance, customer support, and internal operations across every category of enterprise. This direction is not reversing, and the firms that resist it on principle will simply be outcompeted by the firms that engage with it more thoughtfully. The question is not whether to adopt. The question is how to adopt without absorbing risks that the existing governance frameworks were never built to absorb.
Operating stacks need controls. They have always needed controls. The introduction of artificial intelligence into the stack does not eliminate the need for those controls. It increases it, because the implementer is now faster, more confident, and less constrained by tacit knowledge of what should not be touched than any implementer that came before. The companies that internalize this and build accordingly will compound the leverage that artificial intelligence offers them. The companies that treat artificial intelligence as a substitute for the disciplines that produced reliable systems in the first place will discover, on a timeline of months rather than years, that those disciplines were what made the systems reliable to begin with.
Artificial intelligence can accelerate serious work. It can sharpen thinking. It can compress timelines. It can permit small teams to produce output that would previously have required much larger ones. All of this is real, and the leverage is meaningful. But when these tools touch production, the standard against which their output is measured has to change. In production, impressive is not enough. The work has to be controlled. It has to be reviewable. It has to be reversible. It has to be safe. And the responsibility for ensuring that it meets those standards belongs to the operators and the institutions that deploy these tools, not to the tools themselves and not to the companies that sell them.
The discipline problem is not an artificial intelligence problem. It is a management problem, and it is the problem that will separate the firms that benefit from this technology from the firms that are damaged by it.
Marcus Magarian is Managing Director, Technology and Cross-Border Transactions at Chatsworth Securities LLC.
Two months of using artificial intelligence to rebuild a regulated financial services website demonstrated that the tools are exceptionally strong in strategic translation, content authoring, and structured configuration, and meaningfully weaker at execution inside live systems with dependencies. The most dangerous failure mode was not error but the appearance of progress, with visible problems solved while invisible technical debt accumulated underneath. Adopting a second AI as a planning and quality layer improved architectural discipline but did not eliminate the need for human supervision, because the operator remained the project memory and the constraint validator. The binding constraint in enterprise AI deployment is not model capability but the governance layer above the model, which almost no one is yet building at the speed the technology requires.