Skip to main content

ADR-012: Ollama for LLM Runtime

Status

Deprecated - May 2026

:::warning Deprecated — May 2026 Ollama was replaced by Azure OpenAI (GPT-4o-mini for chat, text-embedding-3-large for embeddings) to leverage enterprise SLA, managed scaling, and Microsoft ecosystem integration. Local model hosting via Ollama is no longer used in production. This ADR is kept for historical reference. :::

Context

BookWorm's AI-enhanced features require a robust, scalable, and cost-effective Large Language Model (LLM) runtime to power intelligent functionalities across the platform. The requirements include:

  • Book Recommendations: Personalized book suggestions based on reading history and preferences
  • Content Analysis: Automatic book categorization, summary generation, and metadata extraction
  • Search Enhancement: Semantic search capabilities with natural language query understanding
  • Customer Support: AI-powered chatbot for customer inquiries and support
  • Content Moderation: Automated review and rating content moderation
  • Inventory Intelligence: Demand prediction and inventory optimization insights
  • User Experience: Natural language interfaces for complex queries and interactions
  • Performance: Low-latency inference for real-time user interactions
  • Cost Control: Predictable and scalable cost structure without per-token pricing
  • Privacy: On-premises or private cloud deployment for sensitive data handling
  • Flexibility: Support for multiple model architectures and easy model switching

The choice of LLM runtime significantly impacts system performance, operational costs, data privacy, and the overall quality of AI-powered features.

Decision

Adopt Ollama as the primary Large Language Model runtime, integrated with Microsoft Agent Framework for AI orchestration, to provide powerful, cost-effective, and privacy-conscious AI capabilities across BookWorm's platform.

AI Architecture Strategy

Ollama Model Management

  • Model Repository: Centralized management of multiple LLM models for different use cases
  • Resource Optimization: GPU-aware deployment with automatic GPU detection
  • Container Integration: Aspire-based deployment with OpenWebUI interface
  • Azure Container App Support: Scalable cloud deployment with min replicas configuration

Chat Service Agent Orchestration

  • Sequential Processing: Multi-agent pipeline for comprehensive chat responses
  • Specialized Agents: BookAgent, LanguageAgent, SentimentAgent, SummarizeAgent
  • RAG Integration: Hybrid search capabilities with vector embeddings
  • Real-time Streaming: Live response streaming with Redis backplane

BookWorm AI Agents

AgentPurposeIntegrationCapabilities
BookAgentBook search and recommendationsCatalog search, MCP toolsPersonalized suggestions, catalog queries
LanguageAgentTranslation and language processingText processingMulti-language support
SentimentAgentEmotion analysisCustomer feedbackPositive/Negative/Neutral classification
SummarizeAgentContent summarizationText processingKey insights extraction

Rationale

Why Ollama?

Cost-Effective and Predictable

  1. No Token-Based Pricing: Fixed infrastructure costs without per-request charges
  2. Open Source Models: Access to state-of-the-art models without licensing fees
  3. Resource Efficiency: Optimized model serving with quantization and optimization
  4. Horizontal Scaling: Scale compute resources based on actual usage patterns
  5. Multi-Tenancy: Serve multiple applications from shared infrastructure

Privacy and Data Control

  1. On-Premises Deployment: Complete control over data and model inference
  2. No External API Calls: All AI processing happens within private infrastructure
  3. Data Sovereignty: Compliance with data protection regulations and policies
  4. Model Customization: Fine-tune models on proprietary data without external exposure
  5. Audit Trail: Complete visibility into AI processing and decision-making

Performance and Reliability

  1. Low Latency: Direct model access without network round-trips to external services
  2. High Throughput: Optimized inference serving with batching and caching
  3. Offline Capability: Continue AI functionality during internet connectivity issues
  4. Custom Optimization: Hardware-specific optimizations for GPU and CPU inference
  5. Availability Control: Service-level agreements based on internal infrastructure

Flexibility and Model Diversity

  1. Multiple Model Support: Run different models for specialized tasks
  2. Model Versioning: A/B test different model versions and capabilities
  3. Custom Models: Deploy fine-tuned or domain-specific models
  4. Easy Migration: Switch between models without changing application code
  5. Experimentation: Rapid prototyping with new models and approaches

Why Agent Framework Integration?

.NET Ecosystem Alignment

  1. Native .NET Integration: First-class support for .NET applications and patterns
  2. Dependency Injection: Seamless integration with ASP.NET Core DI container
  3. Configuration: Familiar configuration patterns and environment management
  4. Testing: Comprehensive testing framework for AI-powered applications
  5. Observability: Built-in telemetry, logging, and performance monitoring

Enterprise AI Patterns

  1. Prompt Engineering: Centralized prompt management with versioning and testing
  2. Function Calling: Integrate AI with business logic through semantic functions
  3. Memory Systems: Persistent context and conversation state management
  4. Plugin Architecture: Modular AI capabilities with reusable components
  5. Security: Built-in security patterns for AI applications and data handling

LLM Architecture Overview

Implementation Strategy

Ollama Deployment with Aspire

public static async Task AddOllama(
this IDistributedApplicationBuilder builder,
Action<IResourceBuilder<OllamaResource>>? configure = null)
{
var ollama = builder
.AddOllama(Components.Ollama.Resource)
.WithDataVolume()
.WithOpenWebUI()
.WithImagePullPolicy(ImagePullPolicy.Always)
.WithLifetime(ContainerLifetime.Persistent)
.PublishAsAzureContainerApp((_, app) => app.Template.Scale.MinReplicas = 0);

if (await ollama.IsUseGpu())
{
ollama.WithGPUSupport();
}

configure?.Invoke(ollama);
}

Agent Orchestration Service

public sealed class AgentOrchestrationService : IAgentOrchestrationService
{
public async Task<string> ProcessAgentsSequentiallyAsync(
string userMessage,
Guid conversationId,
Guid assistantReplyId,
CancellationToken cancellationToken = default)
{
var runtime = new InProcessRuntime();
await runtime.StartAsync(cancellationToken);

SequentialOrchestration orchestration = new(
orchestrateAgents.LanguageAgent,
orchestrateAgents.SummarizeAgent,
orchestrateAgents.SentimentAgent,
orchestrateAgents.BookAgent)
{
ResponseCallback = ResponseCallbackAsync,
};

var result = await orchestration.InvokeAsync(userMessage, runtime, cancellationToken);
var finalResults = await result.GetValueAsync(TimeSpan.FromSeconds(60), cancellationToken);


public sealed class HybridSearch : ISearch
{
public async Task<IReadOnlyList<TextSnippet>> SearchAsync(
string text,
ICollection<string> keywords,
int maxResults = 20,
CancellationToken cancellationToken = default)
{
var vector = await embeddingGenerator.GenerateVectorAsync(
text, cancellationToken: cancellationToken);

await collection.EnsureCollectionExistsAsync(cancellationToken);
var vectorCollection = (IKeywordHybridSearchable<TextSnippet>)collection;

var options = new HybridSearchOptions<TextSnippet>
{
VectorProperty = r => r.Vector,
AdditionalProperty = r => r.Description,
};

var nearest = vectorCollection.HybridSearchAsync(
vector, keywords, maxResults, options, cancellationToken);

return await nearest.ToListAsync(cancellationToken);
}
}

Consequences

Benefits

  • Cost predictability with fixed infrastructure costs vs per-token pricing
  • Data privacy with on-premises processing and no external API calls
  • Performance control with direct model access and hardware optimization
  • Agent orchestration enabling complex multi-step AI workflows
  • Hybrid search combining vector similarity with keyword matching
  • Real-time streaming for responsive chat experiences

Trade-offs

  • Infrastructure complexity requiring GPU management and model deployment
  • Operational overhead for model updates and performance monitoring
  • Resource requirements for GPU-enabled hardware and memory management
  • Expertise needs for AI model optimization and troubleshooting

Alternatives Considered

OpenAI GPT API

  • Pros: Best-in-class performance, managed service, extensive capabilities
  • Cons: High per-token costs, data privacy concerns, external dependency
  • Decision: Cost and privacy concerns outweigh performance benefits

Azure OpenAI Service

  • Pros: Enterprise-grade security, Microsoft ecosystem integration
  • Cons: External dependency, significant costs at scale, limited model control
  • Decision: Preference for self-hosted solution with complete control

Hugging Face Transformers

  • Pros: Open source, extensive model library, community support
  • Cons: Complex deployment, requires ML expertise, optimization challenges
  • Decision: Ollama provides better production-ready deployment