Impact and challenges of large language models in healthcare
Large language models (LLMs) continue to revolutionize healthcare delivery and patient engagement. However, as these models have matured, one challenge has emerged as the defining concern: context management. When trained on robust health datasets and provided with the right contextual information at the right time, LLMs can play a crucial role in enhancing patient care through data-driven insights.
This guide explains the current state and challenges of large language models in healthcare, and how organizations can implement them responsibly with proper context management strategies.
FAQs about LLMs in healthcare
What is an LLM?
LLMs are massive neural networks trained on enormous volumes of text. In essence, LLMs can process vast sequences of text and extract meaning — fast. Since mid-2024, these models have become dramatically more capable, with context windows expanding to 200K+ tokens and costs dropping by 80-90%.
When applied to healthcare, large language models offer proven solutions for improving patient care and streamlining processes.
What is the role of LLMs in medicine?
LLMs can answer questions, summarize text, paraphrase complicated jargon, and translate words into different languages. They can now also use tools, call external systems, and orchestrate complex multi-step workflows. From quickly scanning a long patient file to helping patients understand their diagnosis to autonomously scheduling follow-up care, these models are transforming healthcare delivery.
What are the applications of medical LLMs?
Medical providers now rely on large language models in healthcare to:
- Streamline administrative tasks: Clinicians spend roughly 33% of their workday on activities outside of patient care. LLMs now automate scheduling, prior authorization requests, referral management, and follow-up care instructions with minimal human oversight.
- Manage clinical documentation: LLMs summarize patient notes and medical histories, extract structured data from unstructured text, and generate draft care plans. Modern implementations use retrieval-augmented generation (RAG) to pull relevant context from electronic health records (EHRs) in real-time.
- Proactively detect adverse events: Using data from EHRs, LLMs can automatically detect patterns indicating potential adverse health events, drug interactions, or care gaps that require intervention.
- Orchestrate care workflows: Agentic LLM systems can manage complex, multi-step care coordination tasks — from identifying high-risk patients to coordinating outreach to tracking engagement to escalating to clinicians when needed.
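The RAG pattern mentioned above can be sketched in a few lines. This is a toy illustration with keyword-overlap scoring and made-up clinical notes; a production system would use embedding-based retrieval and an actual LLM call rather than the placeholder prompt below.

```python
# Minimal RAG sketch: retrieve the most relevant notes, then assemble a prompt.
# Scoring and note contents are illustrative only.

def score(query: str, doc: str) -> int:
    """Count query terms that appear in the document (toy relevance score)."""
    terms = set(query.lower().split())
    return sum(1 for t in terms if t in doc.lower())

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k most relevant documents for the query."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Assemble retrieved context and the question into one prompt."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

notes = [
    "2024-03-01: Patient reports chest pain on exertion; ECG ordered.",
    "2024-02-10: Routine follow-up, blood pressure well controlled.",
    "2024-03-05: ECG shows ST depression; cardiology referral placed.",
]
prompt = build_prompt("What did the ECG show?", notes)
print(prompt)
```

In practice the retrieval step runs against a vector index over the EHR, and the assembled prompt is sent to the model rather than printed.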
Despite the extensive benefits of LLMs, implementing them effectively in healthcare environments presents unique challenges.
The challenges of LLMs in healthcare
Context management
The healthcare industry has learned a crucial lesson: the quality of LLM outputs is almost entirely determined by the context provided. While early concerns focused on model accuracy and hallucinations, the real challenge is ensuring models have access to the right information at the right time.
Brendan Smith-Elion, VP of Product Management at Arcadia, describes this as "the context problem."
"These models are incredibly capable, but they're only as good as the information you feed them. The hard part isn't the AI — it's architecting systems that can dynamically assemble the relevant patient data, clinical guidelines, organizational policies, and real-time information the model needs to make good decisions."
Let’s review several factors that have made context management the primary concern.
The rise of the Model Context Protocol (MCP)
Anthropic's Model Context Protocol, released in 2024, standardized how LLMs connect to data sources. This has accelerated adoption but also revealed the complexity of healthcare data integration. Organizations now need strategies for:
- Connecting LLMs to EHRs, claims systems, HIEs, and external data sources
- Managing permissions and data access patterns across systems
- Ensuring context freshness and accuracy
- Orchestrating multiple context sources for complex queries
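One of the concerns above — context freshness — can be made concrete with a small registry that tracks how stale each source is allowed to be. This is a hedged sketch, not a real MCP client; the source names and freshness budgets are illustrative.

```python
# Hypothetical registry of context sources with per-source freshness budgets.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ContextSource:
    name: str
    last_refreshed: datetime
    max_age: timedelta  # how stale this source may be before we flag it

    def is_fresh(self, now: datetime) -> bool:
        return now - self.last_refreshed <= self.max_age

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
sources = [
    ContextSource("ehr_labs", now - timedelta(minutes=5), timedelta(minutes=15)),
    ContextSource("claims", now - timedelta(days=10), timedelta(days=7)),
]
stale = [s.name for s in sources if not s.is_fresh(now)]
print(stale)  # → ['claims']
```

A real deployment would drive these checks from system metadata rather than hard-coded timestamps, and would block or annotate model calls that depend on a stale source.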
Agentic architectures demand better context
Modern healthcare LLM applications aren't single-shot queries — they're autonomous agents that make decisions over minutes or hours. These agents need to maintain context across multiple steps, tools, and data sources. Poor context management leads to agents that lose track of patient state, repeat actions, or make decisions on stale data.
Regulatory scrutiny on data provenance
The U.S. Food and Drug Administration (FDA) and Office of the National Coordinator for Health Information Technology (ONC) now require clear documentation of the data used to inform AI decisions. This means tracking not just model outputs, but the complete context provided to models — a significant architectural challenge.
Implementation challenges
Healthcare organizations also encounter challenges in implementing these new solutions, including:
Context assembly and orchestration
The biggest challenge isn't model capability — it's assembling the right context from disparate healthcare systems. Healthcare organizations must:
- Build context pipelines: Develop systems that can query EHRs, claims databases, lab systems, ADT feeds, and external sources in real-time
- Implement smart retrieval: Use vector databases and semantic search to find relevant patient history, similar cases, and applicable guidelines
- Manage context windows strategically: Even with 200K token windows, you must prioritize what information matters most for each use case
- Handle context freshness: Ensure lab results, medication changes, and care plan updates are reflected immediately
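The "manage context windows strategically" point above amounts to a packing problem: given prioritized context items and a token budget, include what fits, highest priority first. The sketch below uses a crude word count as a stand-in for real tokenization, and the item texts are invented examples.

```python
# Greedy token-budget packer for prioritized context items (illustrative).

def pack_context(items: list[tuple[int, str]], budget: int) -> list[str]:
    """items = (priority, text); include highest-priority items that fit."""
    chosen, used = [], 0
    for _, text in sorted(items, key=lambda it: -it[0]):
        cost = len(text.split())  # crude token estimate; use a real tokenizer
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen

items = [
    (3, "Active medication list: metformin 500mg, lisinopril 10mg."),
    (2, "Most recent HbA1c: 7.9% on 2025-04-12."),
    (1, "Full visit history 2018-2025 ..."),
]
print(pack_context(items, budget=15))
```

Production systems typically layer smarter strategies on top (summarizing low-priority items instead of dropping them), but the budget-first discipline is the same.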
Model drift and knowledge boundaries
LLMs constantly receive updates, but healthcare organizations need stability. Modern approaches include:
- Versioning strategies: Lock models to specific versions for regulated use cases
- Hybrid architectures: Use general-purpose LLMs for reasoning but fine-tuned smaller models for domain-specific medical knowledge
- RAG as a hedge: Retrieval-augmented generation (RAG) reduces reliance on model training data by grounding responses in your organization's current documentation
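The versioning strategy above can be as simple as a per-use-case registry that pins regulated workflows to a validated model version while letting exploratory work track the latest release. Model names and use cases below are illustrative.

```python
# Sketch of per-use-case model pinning (names are examples, not endorsements).

MODEL_REGISTRY = {
    "prior_auth": {"model": "claude-sonnet-4", "pinned": True},   # validated, locked
    "drafting":   {"model": None,              "pinned": False},  # follows latest
}

def resolve_model(use_case: str, latest: str) -> str:
    """Return the pinned model for regulated use cases, else the latest."""
    entry = MODEL_REGISTRY[use_case]
    return entry["model"] if entry["pinned"] else latest

print(resolve_model("prior_auth", latest="claude-opus-4"))  # pinned version
print(resolve_model("drafting", latest="claude-opus-4"))    # latest release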
Trust, transparency, and auditability
Healthcare users demand to know why an AI made a recommendation. This requires:
- Explainable context chains: Track what data the model accessed and how it influenced the output
- Human-in-the-loop by design: Structure workflows so clinicians verify AI decisions at critical points
- Audit trails for compliance: Log all context provided, model responses, and user actions
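An audit trail that captures the full context chain can start as a structured record per AI decision. The field names below are a hypothetical schema, not a standard; the key idea is logging what the model saw alongside what it produced.

```python
# Sketch of one audit record per AI decision (field names are illustrative).
import json
from datetime import datetime, timezone

def audit_record(patient_id, context_ids, model, output, reviewer=None):
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "patient_id": patient_id,
        "context_documents": context_ids,  # exactly what the model was given
        "model": model,
        "output": output,
        "human_reviewer": reviewer,        # None until a clinician verifies
    }

rec = audit_record("pt-123", ["note-9", "lab-44"], "model-v1", "draft care plan")
print(json.dumps(rec, indent=2))
```

Because the record lists context document IDs rather than raw PHI, the log itself can be kept smaller and access-controlled separately from the underlying data.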
Infrastructure cost and complexity
While per-token costs have dropped dramatically, the infrastructure around context management is expensive:
- Vector databases for semantic search: Pinecone, Weaviate, or Amazon OpenSearch
- Real-time data pipelines: Streaming EHR events, claims updates, and external data
- Security and PHI protection: Encrypted data stores, secure enclaves for model inference, BAA-compliant infrastructure
- Integration layers: MCP servers, FHIR APIs, custom connectors to legacy systems
4 practical steps for implementing healthcare LLMs
Healthcare organizations need a structured approach that prioritizes context management. The "Plan, Do, Study, Act" (PDSA) cycle provides this framework. Watch the following video for an in-depth explainer of this process from Brendan Smith-Elion, or continue reading for a quick overview:
Video transcript
My name is Brendan Smith-Elion. I’m a VP of Product here at Arcadia.
Today, we’re going to be talking about large language models — the newest buzzword and three-letter acronym in healthcare — including their use cases, how you can fine-tune them, and how to drive success.
This will be a high-level discussion that introduces some technical concepts. We’ll talk about how to think about large language models in healthcare, including biases, potential pitfalls, and a general framework for using them effectively.
So what’s so challenging about using large language models in healthcare? There are four key things we’re going to talk about.
The first is that they’re difficult to tune. A large language model is a massive neural network — essentially a giant graph of facts with weights between those facts. Most general-purpose models contain healthcare knowledge, but they also include a huge amount of external knowledge that can influence outputs.
This includes things like biases, consumer behaviors, and unrelated facts that can bleed into healthcare use cases. If you don’t tune or adjust these models, you can end up with unexpected or inappropriate information in the output.
The second challenge is unpredictable results. Large language models are constantly being tuned and revised. Many commercial models have new information injected into them on an ongoing basis, which leads to drift.
In traditional healthcare AI and machine learning, you can manage drift because you control the data. With large language models, especially commercial ones, there’s an added dimension of drift as the underlying model itself changes.
The third thing to consider is that large language models are most powerful when they have additional data and context. The more relevant information you provide, the more accurate and precise the output becomes.
However, injecting that level of context can be challenging out of the box. You need to think carefully about strategies to ensure the model has the right information for your specific use case.
The final consideration is that large language models improve with feedback. If you build a feedback loop into your system — even something as simple as a thumbs-up signal — you can continuously improve results over time.
This brings us to a familiar concept in healthcare: continuous quality improvement. Plan, do, study, act. This framework works extremely well when thinking about large language model implementation.
During the planning phase, think about prioritization. What jobs can you automate? What opportunities exist in your ecosystem?
For each opportunity, evaluate how ready your data is. Is it available? Clean? Trusted? Aggregated? Actionable? If the data isn’t accurate and actionable, the output won’t be either.
You can score data readiness on a scale from one to ten, where ten represents highly actionable data and one represents non-actionable data.
Another critical factor is user trust in AI. One of the biggest failures in my career was trying to convince a doctor to change behavior using a sepsis prediction algorithm — not because the model was inaccurate, but because the clinician didn’t trust the result.
Trust is essential whether the user is a patient, clinician, or payer. Some healthcare use cases are particularly sensitive, and that needs to be reflected in how you prioritize and design solutions.
You should also evaluate the relative infrastructure cost. While we often think of language models as text-based, many are now multimodal, incorporating audio, video, and images — which significantly increases cost.
Another factor is existing behavioral levers. Where do you already have clinical decision support, trusted interfaces, or widely adopted applications that automation can plug into?
Once you score these dimensions, you can rank opportunities and decide which ones to pursue first.
From there, experiment. Use freely available tools like GPT or Gemini to understand what outputs look like and build experimental frameworks that gather both qualitative and quantitative feedback.
This might include a “Wizard of Oz” approach, where users interact with a system that appears automated, but is manually guided behind the scenes to understand whether it influences behavior.
Reinforcement learning from human feedback is critical. With as few as 50 curated examples per use case, you can significantly improve output quality through techniques like few-shot learning.
Clinicians can help by correcting outputs and showing what they would actually write. Feeding those examples back into the model dramatically improves results.
You can further refine outputs through expert review, stack ranking results, and reinforcing preferred responses. Over time, variance decreases and confidence in the output increases.
It’s also essential to capture user feedback early, whether through structured inputs, behavioral signals, or simple indicators like thumbs up or thumbs down.
Auditing outputs through randomized review helps identify unexpected or clinically inappropriate results that may not surface through analytics alone.
Ultimately, this approach — combining feedback loops, expert input, and reinforcement learning — allows healthcare organizations to build models that are accurate, trustworthy, and clinically meaningful.
To wrap up, focus on outcome metrics like time to complete tasks and overall workflow efficiency. If your solution adds friction, extra steps, or unnecessary clicks, it’s not helping users.
The goal is to simplify workflows, improve patient management, and drive meaningful impact without disruption.
Plan: Design Your Context Architecture
Before selecting models or use cases, design how you'll manage context. Start with an inventory of your data sources:
- What systems contain patient data? (EHR, claims, labs, pharmacy, ADT, registries, etc.)
- Which sources have APIs? Which ones require custom integration?
- What's the latency for each source? (real-time vs. batch)
- Where are the gaps in your data?
Then, map use cases to context requirements:
- Build a grid ranking opportunities by: data availability, integration complexity, latency requirements, and clinical value
- For each opportunity, document: What context is required? Where does it live? How fresh must it be? Who has access rights?
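The opportunity-ranking grid described above can be prototyped in a spreadsheet or a few lines of code. The dimensions, scores, and use-case names below are illustrative placeholders for an organization's own assessments.

```python
# Sketch of the opportunity-ranking grid: score each use case 1-10 per
# dimension, then rank by total. All scores here are made up for illustration.

def rank(opportunities: dict[str, dict[str, int]]) -> list[str]:
    """Rank opportunities by the sum of their dimension scores, highest first."""
    return sorted(opportunities, key=lambda k: -sum(opportunities[k].values()))

grid = {
    "prior_auth":  {"data_availability": 8, "integration": 5, "latency_fit": 7, "clinical_value": 9},
    "care_gaps":   {"data_availability": 9, "integration": 7, "latency_fit": 8, "clinical_value": 8},
    "bedside_cds": {"data_availability": 6, "integration": 3, "latency_fit": 4, "clinical_value": 9},
}
print(rank(grid))  # → ['care_gaps', 'prior_auth', 'bedside_cds']
```

Weighting the dimensions (for example, doubling clinical value) is a one-line change, which makes it easy to test how sensitive the ranking is to your assumptions.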
Next, choose your architecture pattern:
- RAG for document-heavy workflows: Prior authorizations, clinical guideline adherence, care plan generation
- Agentic for multi-step processes: Care coordination, patient outreach campaigns, referral management
- Hybrid for complex use cases: Combine RAG for knowledge retrieval with agents for workflow orchestration
Finally, plan for MCP integration:
- Identify which systems will expose MCP servers
- Design your context server architecture (centralized vs. distributed)
- Plan for authentication, rate limiting, and access control
Do: Implement Context First
With architecture designed, build context management infrastructure first and models second. Start with context pipelines:
- Implement real-time data streaming from critical systems (ADT, labs, medications)
- Build semantic search over clinical notes and patient history
- Create context assembly logic that gathers relevant information based on use case
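The "context assembly logic" step above can be expressed as a recipe per use case: each use case declares which sources it needs, and an assembler gathers only those. Source names and the stubbed fetchers below are illustrative.

```python
# Sketch of use-case-driven context assembly (source names are examples).

CONTEXT_RECIPES = {
    "care_plan":  ["problem_list", "medications", "recent_labs"],
    "scheduling": ["upcoming_appointments", "care_gaps"],
}

def assemble_context(use_case: str, fetchers: dict) -> dict:
    """Gather only the sources this use case declares it needs."""
    return {src: fetchers[src]() for src in CONTEXT_RECIPES[use_case]}

# Stub fetchers standing in for EHR/claims/registry queries.
fetchers = {
    "problem_list": lambda: ["type 2 diabetes"],
    "medications": lambda: ["metformin 500mg"],
    "recent_labs": lambda: [{"HbA1c": 7.9}],
    "upcoming_appointments": lambda: [],
    "care_gaps": lambda: ["annual eye exam"],
}
ctx = assemble_context("care_plan", fetchers)
print(sorted(ctx))
```

Keeping recipes declarative makes it straightforward to audit which data elements each use case touches — useful for both governance and the access-control planning above.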
Then, experiment with model selection:
- Test multiple models (Claude, GPT, Gemini) with the same context to compare outputs
- Consider fine-tuned medical models (Med-PaLM, BioGPT derivatives) for specific domains
- Use smaller, faster models for high-volume tasks; reserve large models for complex reasoning
Next, implement reinforcement learning from human feedback with context awareness:
- Collect feedback not just on outputs, but on whether the model had sufficient context
- Train models to request additional context when uncertain
- Build feedback loops that improve context retrieval, not just model responses
Finally, use MCP to standardize integrations:
- Implement MCP servers for your key systems
- Leverage existing MCP connectors for common platforms (Epic, Cerner, Athena)
- Build custom MCP servers for legacy systems
Study: Evaluate Context Completeness
Assessment now focuses on context quality as much as output quality. Begin by conducting an expert review with context auditing:
- Show clinical experts both the model output AND the context provided
- Ask: "Did the model have enough information to make this recommendation?"
- Identify systematic gaps in context assembly
Then, measure context metrics:
- Context retrieval latency (are we fast enough for real-time use cases?)
- Context relevance (does semantic search return the right documents?)
- Context completeness (are we missing critical data elements?)
- Context freshness (how old is the data we're providing?)
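The completeness metric above is easy to operationalize: define the data elements a use case requires and measure what fraction were actually present in the assembled context. The required fields below are an illustrative set, not a clinical standard.

```python
# Sketch of a context-completeness metric (required fields are illustrative).

REQUIRED = {"medications", "allergies", "recent_labs"}

def completeness(context: dict) -> float:
    """Fraction of required elements present and non-empty."""
    present = {k for k in REQUIRED if context.get(k)}
    return len(present) / len(REQUIRED)

ctx = {"medications": ["metformin"], "allergies": [], "recent_labs": [{"HbA1c": 7.9}]}
print(round(completeness(ctx), 2))  # allergies is empty, so 2 of 3 → 0.67
```

Tracked over time and broken down by source system, a falling completeness score often surfaces upstream integration failures before they show up as bad model outputs.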
Next, test edge cases:
- Complex patients with extensive histories
- Rare conditions where guidelines may be ambiguous
- High-risk scenarios where missing context could harm patients
Finally, monitor for context drift:
- Track whether your data sources maintain consistent structure
- Alert on schema changes in upstream systems
- Audit for data quality issues that degrade context
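A basic schema-drift check compares incoming records against the fields you expect and flags anything missing or new. The expected fields and sample record below are illustrative.

```python
# Sketch of a schema-drift check for an upstream feed (fields are examples).

def schema_drift(expected: set[str], record: dict) -> dict:
    """Report fields the record is missing and fields we didn't expect."""
    keys = set(record)
    return {"missing": sorted(expected - keys), "unexpected": sorted(keys - expected)}

expected = {"patient_id", "lab_code", "value", "collected_at"}
record = {"patient_id": "pt-1", "lab_code": "HbA1c", "value": 7.9, "units": "%"}
print(schema_drift(expected, record))
```

Wired into the ingestion pipeline, a non-empty `missing` or `unexpected` list can raise the upstream-change alert described above before stale or malformed context reaches a model.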
Act: Operationalize with Context Governance
Moving to production requires ongoing context management. Start by implementing context monitoring:
- Dashboard showing context assembly success rates by source system
- Alerts for stale data or retrieval failures
- Audit logs showing what context informed each AI decision
Then, build feedback loops:
- Capture when users override AI recommendations (suggests insufficient context)
- Track "dead ends" where workflows don't complete (context gaps?)
- A/B test different context assembly strategies
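The override-tracking idea above needs little more than a counter per use case: a rising override rate is a cheap, continuous signal that the model may be working from insufficient context. The class below is a minimal sketch.

```python
# Sketch: track how often users override AI recommendations per use case.
from collections import defaultdict

class OverrideTracker:
    def __init__(self):
        self.counts = defaultdict(lambda: {"total": 0, "overridden": 0})

    def record(self, use_case: str, overridden: bool):
        c = self.counts[use_case]
        c["total"] += 1
        c["overridden"] += overridden  # bool counts as 0 or 1

    def rate(self, use_case: str) -> float:
        c = self.counts[use_case]
        return c["overridden"] / c["total"] if c["total"] else 0.0

t = OverrideTracker()
for flag in (True, False, False, True):
    t.record("prior_auth", flag)
print(t.rate("prior_auth"))  # → 0.5
```

Segmenting the rate by clinician, patient cohort, or context-assembly strategy turns the same counter into the A/B testing signal mentioned above.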
Next, establish governance:
- Clinical advisory board reviews model performance monthly
- Data stewards monitor context quality and completeness
- Regular audits of what data is accessed and how it's used
Finally, plan for regulatory requirements:
- Document your context sources and assembly logic
- Maintain versioned datasets showing what data was available when
- Prepare audit trails for FDA/ONC review
The key to success with healthcare LLMs: Context-aware AI architecture
Healthcare organizations succeeding with LLMs share a common pattern: they've invested heavily in context management infrastructure before worrying about model selection or fine-tuning.
Key architectural elements include:
- Unified health data platform: Centralized repository with real-time streaming from source systems
- Vector database for semantic retrieval: Fast, relevant context retrieval from unstructured data
- MCP server layer: Standardized connections between LLMs and healthcare systems
- Orchestration layer: Manages multi-step agent workflows and context assembly
- Observability and governance: Tracks what context is used and how models respond
This infrastructure is expensive and complex to build, but it's table stakes for production LLM applications in healthcare. Organizations trying to skip ahead to model fine-tuning without solving context management consistently struggle with accuracy, auditability, and user trust.
Modern context management in action
With proper context architecture, several use cases have matured significantly:
- Prior authorization automation: RAG systems pull relevant clinical notes, guidelines, and plan criteria; agents orchestrate form completion and submission; success rates now exceed 85% for routine cases.
- Care gap closure: Agents identify gaps, assemble patient context, generate personalized outreach, track responses, and escalate to care managers; operating at scale across entire populations.
- Clinical decision support at the point of care: Real-time context assembly from EHR provides relevant guidelines, drug interactions, and similar patient outcomes delivered at the point of care in seconds.
- Patient engagement: Conversational agents with full patient context handle appointment scheduling, medication questions, and care plan education, maintaining context across multiple interactions over weeks.
The future of LLMs in healthcare
As LLMs continue to evolve, context management will remain the differentiator between experimental pilots and production healthcare applications. The Model Context Protocol represents a critical standardization effort, but healthcare organizations must still solve the hard problems of data integration, real-time assembly, and governance.
The opportunity is massive: LLMs that truly understand patient context can transform care delivery. But the path requires disciplined investment in data infrastructure, not just AI models.
Get in on the ground floor of context-aware healthcare AI. Learn how Arcadia's data platform provides the unified health data foundation needed for LLM applications.