# ADR-012: Ollama for LLM Runtime
## Status
Accepted - July 2024
## Context
BookWorm's AI-enhanced features require a robust, scalable, and cost-effective Large Language Model (LLM) runtime to power intelligent functionalities across the platform. The requirements include:
- Book Recommendations: Personalized book suggestions based on reading history and preferences
- Content Analysis: Automatic book categorization, summary generation, and metadata extraction
- Search Enhancement: Semantic search capabilities with natural language query understanding
- Customer Support: AI-powered chatbot for customer inquiries and support
- Content Moderation: Automated moderation of review and rating content
- Inventory Intelligence: Demand prediction and inventory optimization insights
- User Experience: Natural language interfaces for complex queries and interactions
- Performance: Low-latency inference for real-time user interactions
- Cost Control: Predictable and scalable cost structure without per-token pricing
- Privacy: On-premises or private cloud deployment for sensitive data handling
- Flexibility: Support for multiple model architectures and easy model switching
The choice of LLM runtime significantly impacts system performance, operational costs, data privacy, and the overall quality of AI-powered features.
## Decision
Adopt Ollama as the primary Large Language Model runtime, integrated with Microsoft Semantic Kernel for AI orchestration, to provide powerful, cost-effective, and privacy-conscious AI capabilities across BookWorm's platform.
### AI Architecture Strategy
#### Ollama Model Management
- Model Repository: Centralized management of multiple LLM models for different use cases (see the registration sketch after this list)
- Resource Optimization: GPU-aware deployment with automatic GPU detection
- Container Integration: Aspire-based deployment with an OpenWebUI interface
- Azure Container App Support: Scalable cloud deployment with a minimum-replica setting that permits scale-to-zero
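A model-registration sketch in the Aspire AppHost, assuming the CommunityToolkit.Aspire.Hosting.Ollama integration; the model and project names below are illustrative assumptions, not mandated by this ADR:
```csharp
// Illustrative only: model and project names are assumptions.
var ollama = builder.AddOllama("ollama")
    .WithDataVolume()
    .WithOpenWebUI();

var chat = ollama.AddModel("llama3.2");               // conversational agents
var embeddings = ollama.AddModel("nomic-embed-text"); // vector embeddings for RAG

// Services reference the model resources they depend on
builder.AddProject<Projects.BookWorm_Chat>("chat-api")
    .WithReference(chat)
    .WaitFor(chat);
```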
#### Chat Service Agent Orchestration
- Sequential Processing: Multi-agent pipeline for comprehensive chat responses
- Specialized Agents: BookAgent, LanguageAgent, SentimentAgent, SummarizeAgent
- RAG Integration: Hybrid search capabilities with vector embeddings
- Real-time Streaming: Live response streaming with Redis backplane
#### BookWorm AI Agents
| Agent | Purpose | Integration | Capabilities |
|---|---|---|---|
| BookAgent | Book search and recommendations | Catalog search, MCP tools | Personalized suggestions, catalog queries |
| LanguageAgent | Translation and language processing | Text processing | Multi-language support |
| SentimentAgent | Emotion analysis | Customer feedback | Positive/Negative/Neutral classification |
| SummarizeAgent | Content summarization | Text processing | Key insights extraction |
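Each agent in the table above can be expressed as a Semantic Kernel `ChatCompletionAgent` with its own instructions. A minimal sketch for SentimentAgent; the instruction wording is an assumption, not taken from the BookWorm source:
```csharp
// Minimal sketch; the instruction text is an illustrative assumption.
ChatCompletionAgent sentimentAgent = new()
{
    Name = "SentimentAgent",
    Description = "Classifies the emotional tone of customer feedback.",
    Instructions = """
        Analyze the user's message and classify its sentiment as
        Positive, Negative, or Neutral. Respond with the label only.
        """,
    Kernel = kernel, // a Kernel configured with the Ollama chat connector
};
```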
## Rationale
### Why Ollama?
#### Cost-Effective and Predictable
- No Token-Based Pricing: Fixed infrastructure costs without per-request charges
- Open Source Models: Access to state-of-the-art models without licensing fees
- Resource Efficiency: Efficient model serving through quantization and runtime optimizations
- Horizontal Scaling: Scale compute resources based on actual usage patterns
- Multi-Tenancy: Serve multiple applications from shared infrastructure
#### Privacy and Data Control
- On-Premises Deployment: Complete control over data and model inference
- No External API Calls: All AI processing happens within private infrastructure
- Data Sovereignty: Compliance with data protection regulations and policies
- Model Customization: Fine-tune models on proprietary data without external exposure
- Audit Trail: Complete visibility into AI processing and decision-making
#### Performance and Reliability
- Low Latency: Direct model access without network round-trips to external services
- High Throughput: Optimized inference serving with batching and caching
- Offline Capability: Continue AI functionality during internet connectivity issues
- Custom Optimization: Hardware-specific optimizations for GPU and CPU inference
- Availability Control: Service levels determined by internal infrastructure rather than an external provider
#### Flexibility and Model Diversity
- Multiple Model Support: Run different models for specialized tasks
- Model Versioning: A/B test different model versions and capabilities
- Custom Models: Deploy fine-tuned or domain-specific models
- Easy Migration: Switch between models without changing application code
- Experimentation: Rapid prototyping with new models and approaches
### Why Semantic Kernel Integration?
#### .NET Ecosystem Alignment
- Native .NET Integration: First-class support for .NET applications and patterns
- Dependency Injection: Seamless integration with the ASP.NET Core DI container (see the sketch after this list)
- Configuration: Familiar configuration patterns and environment management
- Testing: Comprehensive testing framework for AI-powered applications
- Observability: Built-in telemetry, logging, and performance monitoring
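Wiring Semantic Kernel to a local Ollama endpoint uses ordinary ASP.NET Core registration. A minimal sketch, assuming the prerelease Microsoft.SemanticKernel.Connectors.Ollama package; the model id and endpoint are illustrative assumptions:
```csharp
// Minimal sketch: the model id and endpoint are illustrative assumptions.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddKernel()
    .AddOllamaChatCompletion(
        modelId: "llama3.2",
        endpoint: new Uri("http://localhost:11434")); // Ollama's default port

var app = builder.Build();
```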
#### Enterprise AI Patterns
- Prompt Engineering: Centralized prompt management with versioning and testing
- Function Calling: Integrate AI with business logic through semantic functions (sketched below)
- Memory Systems: Persistent context and conversation state management
- Plugin Architecture: Modular AI capabilities with reusable components
- Security: Built-in security patterns for AI applications and data handling
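As a concrete example of a semantic function, a prompt can be registered and invoked like ordinary .NET code; the prompt wording below is an assumption for illustration:
```csharp
// Illustrative prompt function; the prompt wording is an assumption.
var summarize = kernel.CreateFunctionFromPrompt(
    "Summarize the following book description in two sentences:\n{{$input}}");

var summary = await kernel.InvokeAsync(
    summarize, new KernelArguments { ["input"] = bookDescription });
```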
## LLM Architecture Overview
## Implementation Strategy
### Ollama Deployment with Aspire
```csharp
public static async Task AddOllama(
    this IDistributedApplicationBuilder builder,
    Action<IResourceBuilder<OllamaResource>>? configure = null)
{
    var ollama = builder
        .AddOllama(Components.Ollama.Resource)   // CommunityToolkit Aspire Ollama container
        .WithDataVolume()                        // persist pulled models across restarts
        .WithOpenWebUI()                         // bundle the OpenWebUI management interface
        .WithImagePullPolicy(ImagePullPolicy.Always)
        .WithLifetime(ContainerLifetime.Persistent)
        // Scale to zero when idle to keep infrastructure costs predictable
        .PublishAsAzureContainerApp((_, app) => app.Template.Scale.MinReplicas = 0);

    // Enable GPU passthrough only when a compatible GPU is detected
    if (await ollama.IsUseGpu())
    {
        ollama.WithGPUSupport();
    }

    configure?.Invoke(ollama);
}
```
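Publishing the container app with `MinReplicas = 0` lets the Ollama instance scale to zero between requests, which is the main lever behind the fixed, predictable cost structure called out in the Context, particularly when GPU-backed nodes are involved.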
### Agent Orchestration Service
```csharp
public sealed class AgentOrchestrationService : IAgentOrchestrationService
{
    public async Task<string> ProcessAgentsSequentiallyAsync(
        string userMessage,
        Guid conversationId,
        Guid assistantReplyId,
        CancellationToken cancellationToken = default)
    {
        // Host the agents in-process for the duration of this request
        var runtime = new InProcessRuntime();
        await runtime.StartAsync(cancellationToken);

        // Fixed pipeline: language handling -> summarization -> sentiment -> book recommendations
        // (orchestrateAgents is an injected dependency exposing the four agents)
        SequentialOrchestration orchestration = new(
            orchestrateAgents.LanguageAgent,
            orchestrateAgents.SummarizeAgent,
            orchestrateAgents.SentimentAgent,
            orchestrateAgents.BookAgent)
        {
            // Streams each agent's intermediate output back to the client (see sketch below)
            ResponseCallback = ResponseCallbackAsync,
        };

        var result = await orchestration.InvokeAsync(userMessage, runtime, cancellationToken);

        // Wait for the final agent's answer, bounded by a 60-second timeout
        var finalResults = await result.GetValueAsync(TimeSpan.FromSeconds(60), cancellationToken);

        await runtime.RunUntilIdleAsync();
        return finalResults;
    }
}
```
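The `ResponseCallback` above is the hook for the real-time streaming path. A minimal sketch of such a callback, assuming a hypothetical `IConversationPublisher` abstraction over the Redis backplane (the name is not from the ADR):
```csharp
// Minimal sketch: IConversationPublisher is an assumed abstraction over the
// Redis backplane; it is not an API defined by this ADR.
private async ValueTask ResponseCallbackAsync(ChatMessageContent response)
{
    // Relay each agent's intermediate message so connected clients can
    // watch the pipeline progress in real time.
    await publisher.PublishAsync(
        response.AuthorName ?? "assistant",
        response.Content ?? string.Empty);
}
```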
### RAG Hybrid Search
```csharp
public sealed class HybridSearch : ISearch
{
    public async Task<IReadOnlyList<TextSnippet>> SearchAsync(
        string text,
        ICollection<string> keywords,
        int maxResults = 20,
        CancellationToken cancellationToken = default)
    {
        // Embed the query text once; the vector is reused for the hybrid search
        var vector = await embeddingGenerator.GenerateVectorAsync(
            text, cancellationToken: cancellationToken);

        await collection.EnsureCollectionExistsAsync(cancellationToken);

        var vectorCollection = (IKeywordHybridSearchable<TextSnippet>)collection;

        var options = new HybridSearchOptions<TextSnippet>
        {
            VectorProperty = r => r.Vector,          // vector similarity over the embedding
            AdditionalProperty = r => r.Description, // keyword (full-text) matching
        };

        var nearest = vectorCollection.HybridSearchAsync(
            vector, keywords, maxResults, options, cancellationToken);

        // Unwrap the scored results into the snippet records the caller expects
        return await nearest.Select(r => r.Record).ToListAsync(cancellationToken);
    }
}
```
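Hybrid search requires the record type to expose both a vector property and a full-text-indexed data property. A plausible shape for `TextSnippet` using the Microsoft.Extensions.VectorData attributes; the key type and vector dimensions are assumptions, since only `Vector` and `Description` appear in the search code above:
```csharp
// Plausible record shape: the key type and dimensions are assumptions; only
// the Vector and Description properties are referenced in the search above.
public sealed class TextSnippet
{
    [VectorStoreKey]
    public Guid Id { get; set; }

    [VectorStoreData(IsFullTextIndexed = true)] // enables the keyword half of hybrid search
    public string Description { get; set; } = string.Empty;

    [VectorStoreVector(768)] // must match the embedding model's output dimensions
    public ReadOnlyMemory<float> Vector { get; set; }
}
```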
## Consequences
### Benefits
- Cost predictability with fixed infrastructure costs vs per-token pricing
- Data privacy with on-premises processing and no external API calls
- Performance control with direct model access and hardware optimization
- Agent orchestration enabling complex multi-step AI workflows
- Hybrid search combining vector similarity with keyword matching
- Real-time streaming for responsive chat experiences
### Trade-offs
- Infrastructure complexity requiring GPU management and model deployment
- Operational overhead for model updates and performance monitoring
- Resource requirements for GPU-enabled hardware and memory management
- Expertise needs for AI model optimization and troubleshooting
## Alternatives Considered
### OpenAI GPT API
- Pros: Best-in-class performance, managed service, extensive capabilities
- Cons: High per-token costs, data privacy concerns, external dependency
- Decision: Cost and privacy concerns outweigh performance benefits
### Azure OpenAI Service
- Pros: Enterprise-grade security, Microsoft ecosystem integration
- Cons: External dependency, significant costs at scale, limited model control
- Decision: Preference for self-hosted solution with complete control
### Hugging Face Transformers
- Pros: Open source, extensive model library, community support
- Cons: Complex deployment, requires ML expertise, optimization challenges
- Decision: Ollama offers a more production-ready deployment path