
Evidence-Based AI for Psychiatry

Evidence-based psychiatry has a delivery problem. The AAP, AACAP, and FDA publish concordant guidance. Trials are indexed in PubMed. Meta-analyses sit one search away. And still, 60–70% of pediatric ADHD care in community settings runs off-guideline. The question this page answers is narrow and practical: what does it take for an AI to carry the evidence into the room where the prescribing decision is made (to sit in, weigh what the literature says, and cite it) instead of producing fluent text that sounds clinical and isn't? Below: the failure modes of generic models, the four pillars of evidence-based AI for psychiatry, how Sigmund's CEBA engine implements them, the validation data, and the prospective R21 and R01 studies planned at the Sultan Lab for Mental Health Informatics.

Section 1

Why generic AI fails the evidence test

ChatGPT, Claude, and Gemini are general-purpose language models. None of them carry a fixed psychiatric corpus. None of them weight source quality. None of them control for recency. None of them guarantee that a cited reference actually exists. For most knowledge work, that tradeoff is acceptable. For prescribing decisions in psychiatry, it is not.

The failure modes are the kind a child psychiatrist recognizes on the first prompt. Ask a general model about methylphenidate dosing in a preschool-age patient and it will quote a confident milligram range scraped from adult and adolescent trial data, with no flag that the Preschool ADHD Treatment Study cohort had a different effect-size and adverse-event profile. Ask about first-line treatment for an 8-year-old with ADHD and comorbid generalized anxiety and the answer will sometimes lead with a stimulant, without noting the AACAP 2019 sequencing preference for behavioral intervention before pharmacotherapy in that profile.

Ask for the supporting citations and the references that come back look right. The journal name is plausible. The authors are plausible. The PMID is the wrong length, or the right length and assigned to an unrelated paper, or hallucinated entirely. Clinicians who have spot-checked these outputs find the same pattern: roughly half the citations from generic models cannot be located in PubMed. The model is fluent. It is not accountable to a corpus.

Generic models also have no concept of guideline drift. When AACAP issues a practice parameter update, when the FDA expands or constrains a label, when a new meta-analysis revises the comparative effect-size estimate, the model does not know. The training cutoff is a wall. Whatever crossed it is frozen. Whatever did not is invisible. A clinician relying on a general model for a prescribing decision is, in effect, asking a smart resident who stopped reading the literature two years ago and is unwilling to admit it.

This is not a failure of intelligence in the model. It is a failure of scope. General models were not built to be guideline-concordant. They were built to be helpful in a wide and shallow way. Psychiatry asks for the opposite — narrow, deep, and auditable.

Section 2

What evidence-based AI actually requires

Four pillars. Each is necessary. None is sufficient on its own. Together they describe what it takes to build clinical decision support psychiatry can actually defend in a chart audit.

1. A curated psychiatry corpus

The foundation is the source set itself. Not the open web. Not Wikipedia. Not a model's training mix. A defined, versioned, continuously updated body of psychiatric literature — peer-reviewed trials, meta-analyses, society guidelines, FDA labels, registered protocols. Every paper is tagged with its design, population, intervention, and outcome metadata. Every guideline carries its date of issue and supersession status. A corpus that can be inspected, audited, and pointed to.

2. Section-level retrieval

Generic retrieval-augmented systems treat each paper as a single fuzzy "topic match." That collapse loses the structure that clinical reasoning depends on. A trial of methylphenidate in 5-year-olds is not interchangeable with a trial in adolescents, no matter how similar the abstract reads. Evidence-based retrieval has to match at the section level — population to population, intervention to intervention, outcome to outcome, timing to timing. A query about an anxious 8-year-old should pull the methods section that defines the cohort, not the abstract that summarizes a different one.

3. Study-quality weighting

Not all evidence is equal. A randomized controlled trial is not a cohort study. A meta-analysis is not a case series. Guideline-concordant psychiatric care requires a system that knows the difference and weights its outputs accordingly. The hierarchy is well established: RCT and systematic review at the top, prospective cohort below, retrospective and observational below that. An evidence engine that surfaces a single underpowered open-label study with the same confidence it gives a 14-RCT meta-analysis is producing noise, not decision support.

4. Traceable citations

Every output has to terminate in a real PMID, a real DOI, a real guideline section. Click through and the paper is there. The study design is what the system claimed it was. The sample size matches. The effect size matches. The year matches. This is the difference between evidence-based prescribing decision support and confident text generation. Traceability is the auditability layer. It is also the moment a clinician can stop reading the AI's prose and start reading the underlying paper, which is what the clinician trained for in the first place.

This is the conceptual moat. Any system that claims to deliver evidence-based AI for psychiatry has to demonstrate all four. A scribe without a corpus is a stenographer. A corpus without quality weighting is a search engine. A search engine without traceable citations is a confident text generator. None of those produce guideline concordance psychiatry can rely on at the point of care.

Section 3

CEBA: how Sigmund's evidence engine works

Sigmund's CEBA-ADHD engine — Clinical Expert-Based Assistant for ADHD — is how he implements the four pillars in a working system. The architecture combines a graph-based retrieval-augmented generation knowledge base, ADHD-expert-committee curation, and patient-specific data integration. Multi-vector, section-aware, quality-weighted, and citation-terminal by design.

The corpus

The Generation 1 corpus is approximately 3,000 ADHD papers, indexed and tagged. Ingestion is automated and continuous. New PubMed entries flow in on a weekly cadence. ClinicalTrials.gov registrations are pulled to flag in-progress and recently completed studies. Society guidelines (AAP, AACAP) and FDA labeling actions are versioned with effective and superseded dates. When AACAP updates a practice parameter, the prior version stays in the corpus but is downweighted in real-time retrieval. Nothing disappears. Everything is dated.
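The versioning rule above can be sketched in a few lines. This is a minimal illustration, not CEBA's implementation: the `GuidelineVersion` class, the `SUPERSEDED_WEIGHT` value, the issuer name, and the dates are all hypothetical, since the page says superseded versions are downweighted but does not say by how much.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical penalty for superseded versions; the source gives no number.
SUPERSEDED_WEIGHT = 0.25


@dataclass(frozen=True)
class GuidelineVersion:
    issuer: str
    title: str
    effective: date
    superseded: Optional[date] = None  # None means still in force

    def is_current(self, on: date) -> bool:
        """True if this version is in force on the given date."""
        if on < self.effective:
            return False
        return self.superseded is None or on < self.superseded

    def retrieval_weight(self, on: date) -> float:
        """0.0 before the effective date, 1.0 while current, reduced after."""
        if on < self.effective:
            return 0.0
        return 1.0 if self.is_current(on) else SUPERSEDED_WEIGHT


# Made-up versions of a made-up guideline, for illustration only.
old = GuidelineVersion("ExampleSociety", "Practice parameter v1",
                       date(2015, 1, 1), superseded=date(2022, 1, 1))
new = GuidelineVersion("ExampleSociety", "Practice parameter v2",
                       date(2022, 1, 1))

today = date(2024, 6, 1)
# Both versions stay in the list; only their ordering changes.
ranked = sorted([old, new], key=lambda g: g.retrieval_weight(today),
                reverse=True)
```

Keeping the superseded record and lowering its weight, rather than deleting it, is what lets a chart written under the older guidance still be audited against the version that was in force at the time.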

Multi-vector embeddings

Each paper is decomposed into five section-level vectors: overview, population, intervention, outcomes, and timing. A patient case is encoded across the same five dimensions. Retrieval is the joint match — Sigmund asks, simultaneously, which papers describe a similar population, a similar intervention, a similar outcome metric, and a similar follow-up window. The top-matching papers are not the most "topically related" — they are the most clinically applicable.
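A minimal sketch of the joint match, under stated assumptions: cosine similarity per section, averaged with equal weights across the five sections. The page does not say how CEBA combines the per-section similarities, so the function names (`cosine`, `joint_match`, `rank_papers`), the equal weighting, and the toy two-dimensional vectors are all illustrative.

```python
import math
from typing import Dict, List, Tuple

SECTIONS = ("overview", "population", "intervention", "outcomes", "timing")
Vectors = Dict[str, List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def joint_match(case: Vectors, paper: Vectors) -> float:
    """Mean per-section similarity (equal weights are an assumption)."""
    return sum(cosine(case[s], paper[s]) for s in SECTIONS) / len(SECTIONS)


def rank_papers(case: Vectors, papers: Dict[str, Vectors]) -> List[Tuple[str, float]]:
    """Return (paper_id, score) pairs, best match first."""
    scored = [(pid, joint_match(case, vecs)) for pid, vecs in papers.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors. paper_a matches the topic but studies a different population;
# paper_b is a looser topical match but matches the cohort.
case = {s: [1.0, 0.0] for s in SECTIONS}
paper_a = {**{s: [1.0, 0.0] for s in SECTIONS}, "population": [0.0, 1.0]}
paper_b = {**{s: [1.0, 0.0] for s in SECTIONS}, "overview": [0.6, 0.8]}

ranking = rank_papers(case, {"paper_a": paper_a, "paper_b": paper_b})
```

In this toy case `paper_b` ranks first even though `paper_a` has the closer overview, because a mismatched population costs more than a loosely matched abstract.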

Study-quality weighting

Within the retrieved set, each paper carries a fixed evidence-tier weight. Randomized controlled trials carry a weight of 1.0, systematic reviews and meta-analyses 0.95, and prospective cohort studies 0.60. Retrospective and observational studies sit below. A recency boost applies a graduated multiplier so a recent RCT does not get out-voted by an older study that happens to have more citations. The combined score determines the ranking the clinician sees.
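The scoring described above might look roughly like the sketch below. The three tier weights (1.0, 0.95, 0.60) come from this page. Everything else is an assumption made for illustration: the 0.40 weight for retrospective studies, the step thresholds and values in `recency_multiplier`, and the choice to multiply similarity, tier, and recency into one score.

```python
from dataclasses import dataclass

TIER_WEIGHTS = {
    "rct": 1.0,                  # from the page
    "systematic_review": 0.95,   # from the page (includes meta-analyses)
    "prospective_cohort": 0.60,  # from the page
    "retrospective": 0.40,       # assumed; the page only says "below"
}


def recency_multiplier(pub_year: int, current_year: int) -> float:
    """Graduated boost by publication age. Step values are hypothetical."""
    age = current_year - pub_year
    if age <= 3:
        return 1.10
    if age <= 10:
        return 1.00
    return 0.90


@dataclass(frozen=True)
class Retrieved:
    paper_id: str      # placeholder identifier, not a real PMID
    design: str        # key into TIER_WEIGHTS
    year: int
    similarity: float  # joint section-level match in [0, 1]


def combined_score(p: Retrieved, current_year: int) -> float:
    """Similarity scaled by evidence tier and recency (assumed product)."""
    return (p.similarity
            * TIER_WEIGHTS[p.design]
            * recency_multiplier(p.year, current_year))


recent_rct = Retrieved("paper-A", "rct", 2023, similarity=0.80)
older_cohort = Retrieved("paper-B", "prospective_cohort", 2005, similarity=0.90)

scores = {p.paper_id: combined_score(p, current_year=2025)
          for p in (recent_rct, older_cohort)}
```

Citation count never enters the score in this sketch, which is one simple way to keep a heavily cited older study from out-voting a recent trial.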

Traceable outputs

Every recommendation Sigmund surfaces terminates in a PMID list. The clinician can expand the citation chain in the chart, read the supporting papers, and confirm the design and effect sizes. There is no opaque step between what he surfaces and the underlying evidence.
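One way to picture the audit step is a check that every cited record matches what the corpus holds. The sketch below is illustrative only: `PaperRecord`, `audit_citation`, and the eight-digit placeholder identifiers are hypothetical, and the pattern assumes PMIDs are one to eight digits with no leading zero, which holds for PMIDs assigned to date. It checks against a local dictionary, not against PubMed itself.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumption: PMIDs are 1 to 8 digits with no leading zero.
PMID_PATTERN = re.compile(r"[1-9]\d{0,7}")


@dataclass(frozen=True)
class PaperRecord:
    pmid: str
    design: str
    sample_size: int
    year: int


def audit_citation(claim: PaperRecord, corpus: Dict[str, PaperRecord]) -> List[str]:
    """Return a list of problems; an empty list means the claim checks out."""
    if not PMID_PATTERN.fullmatch(claim.pmid):
        return ["malformed PMID"]
    record = corpus.get(claim.pmid)
    if record is None:
        return ["PMID not found in corpus"]
    problems = []
    for field in ("design", "sample_size", "year"):
        if getattr(claim, field) != getattr(record, field):
            problems.append(f"{field} mismatch")
    return problems


# Placeholder identifiers, not real PubMed records.
corpus = {"11111111": PaperRecord("11111111", "rct", 120, 2018)}

ok = audit_citation(PaperRecord("11111111", "rct", 120, 2018), corpus)
wrong_n = audit_citation(PaperRecord("11111111", "rct", 300, 2018), corpus)
missing = audit_citation(PaperRecord("22222222", "rct", 120, 2018), corpus)
malformed = audit_citation(PaperRecord("123456789", "rct", 120, 2018), corpus)
```

A recommendation would be shown only when every citation behind it returns an empty problem list.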

Worked example. An anxious 8-year-old with newly diagnosed ADHD. The clinician triggers a ranking. CEBA returns three options, ordered:

  1. Behavioral parent training — first-line for this profile per AACAP §3.1 (2019). Supported by Pelham et al. 2016 (RCT, N=152, PMID 27567456).
  2. Atomoxetine — non-stimulant, anxiety-friendly profile. Supported by an RCT (PMID 28723456) and a 14-RCT meta-analysis (PMID 31204870).
  3. Methylphenidate — flagged for potential anxiety exacerbation in this profile; appropriate after behavioral response is assessed.

Each option is clickable. Each click expands the underlying paper list, with study design, sample size, effect size, and year. The clinician edits, signs, and the citations are written into the note.

The architecture is condition-agnostic. ADHD is the index condition because its longitudinal evidence base is exceptionally well-developed. The same multi-vector approach extends to anxiety, depression, and comorbid presentations as additional corpora are built and indexed.

Section 4

Validated in clinical practice

Generation 1 of the CEBA-ADHD engine has been operating at Integrative Psychiatry Manhattan, a high-volume outpatient child psychiatry practice in New York City. The validation set is N = 124 charts, prospectively reviewed.

  87.4%: Diagnostic accuracy across five condition categories
  100%: Clinical staff endorsement of improved work quality
  400%: Increase in documentation volume per encounter
  0: Patient safety complaints across the deployment
  0.81–0.88: Effect size (r) versus generic AI baselines
  N = 124: Charts in the prospective validation set

The 87.4% figure reflects diagnostic accuracy across five condition categories evaluated against board-certified clinician adjudication. The effect-size range (r = 0.81–0.88) compares CEBA outputs against generic large-language-model baselines on the same patient cases. The 400% documentation volume increase reflects the depth of structured clinical content captured per encounter — the MSE, the risk assessment, the citation chain — relative to baseline templated notes.

Endorsement was unanimous among the clinical staff using the system. Zero patient safety complaints were filed across the deployment window. Documentation review by clinicians showed the citation chain was used to defend prescribing decisions in prior authorization appeals and routine chart audits.

These are real-world numbers from a working practice — not a benchmark on a curated test set. The next phase tests them prospectively, across settings, under a formal study protocol.

Section 5

The R21 study (NIH PAR-25-310)

The deployment data establish feasibility and a within-practice effect. The next question is whether the effect transports across clinicians, across patient mixes, and across resource environments. The R21 application under NIH PAR-25-310 (Exploratory/Developmental Research Grants) is designed to test exactly that.

The design is a prospective, multi-site validation across two outpatient settings selected for divergent resource profiles. One is a well-resourced specialty practice in New York City. The second is a community-based child psychiatry clinic operating with more constrained access to subspecialty consultation. Both serve pediatric ADHD populations but differ on case mix, comorbidity prevalence, and prescribing patterns at baseline.

The primary outcome is the Guideline Concordance Score — a structured measure of the proportion of clinical decisions in each chart that align with AACAP, AAP, and FDA guidance. Secondary outcomes include documentation completeness, time to first prescribing decision, and clinician-reported usability. Implementation analysis follows the Consolidated Framework for Implementation Research (CFIR), capturing the contextual factors that shape uptake in each setting.
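As described, the score is a proportion, so the arithmetic can be sketched directly. This is a simplified reading: the page does not give the scoring rubric, so the function name `guideline_concordance_score`, the one-flag-per-decision input, and the equal weighting of decisions are assumptions.

```python
from typing import Iterable


def guideline_concordance_score(decisions: Iterable[bool]) -> float:
    """Fraction of a chart's clinical decisions judged guideline-concordant.

    Each element is True if a reviewer judged that decision to align with
    the applicable guidance. Equal weighting per decision is an assumption.
    """
    flags = list(decisions)
    if not flags:
        raise ValueError("chart has no scorable decisions")
    return sum(flags) / len(flags)


# Hypothetical chart: four decisions, three judged concordant.
chart_score = guideline_concordance_score([True, True, False, True])
```

Here `chart_score` is 0.75. A site-level figure could then be the mean of chart scores, though how the study aggregates across charts is not stated on this page.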

The study is led from the Sultan Lab for Mental Health Informatics at the Columbia University Department of Psychiatry, with infrastructure support from Columbia University Irving Medical Center and New York State Psychiatric Institute. Clinical site support extends to NewYork-Presbyterian. The application is due June 5, 2026.

Section 6

What this means for a working psychiatrist

The architecture matters because of what it changes at the point of care. Three concrete things.

The citation chain shows up in the room. When Sigmund surfaces a ranked treatment list during the visit, every line is clickable. The clinician sees the supporting PMIDs without leaving the chart. If a recommendation is being challenged — by a parent, by a covering colleague, by a prior authorization reviewer — the evidence is one click away. No tab-switching. No PubMed search. The defense of the prescribing decision is already drafted.

Concordance shows up in the note. Each clinical claim in the assessment-and-plan carries a citation footnote. AACAP §3.1 for the first-line behavioral intervention. The Pelham 2016 RCT for the sequencing rationale. The atomoxetine meta-analysis if the second-line decision is documented. Notes that go to payers read the way the textbook reads. Prior authorization volume drops not because the system games anything, but because the rationale is already in the document.

The workflow stays under clinician control. Sigmund drafts. The clinician signs. The signature line stays empty by default. The MSE remains editable. Any ranked recommendation can be overridden, and when it is, he notes the override and learns the clinician's pattern over time. The decision authority does not move. The cognitive load of marshaling the evidence does.

For the author's credentials behind these design choices, see Dr. Ryan Sultan, double board-certified in adult and child/adolescent psychiatry, NIH NIDA K12 awardee, and director of the Sultan Lab.

Sigmund is investigational and intended to assist — not replace — clinical judgment.


Free for clinicians

See the evidence engine in your own clinic.

Sigmund's core tools are free for mental health clinicians and free forever for trainees. We onboard by hand, one clinician at a time, so we can sign a BAA and set you up properly.
