ADR-012: Ollama for LLM Runtime

Status

Accepted - July 2024

Context

BookWorm's AI-enhanced features require a robust, scalable, and cost-effective Large Language Model (LLM) runtime to power intelligent functionalities across the platform. The requirements include:

  • Book Recommendations: Personalized book suggestions based on reading history and preferences
  • Content Analysis: Automatic book categorization, summary generation, and metadata extraction
  • Search Enhancement: Semantic search capabilities with natural language query understanding
  • Customer Support: AI-powered chatbot for customer inquiries and support
  • Content Moderation: Automated moderation of user-submitted reviews and ratings
  • Inventory Intelligence: Demand prediction and inventory optimization insights
  • User Experience: Natural language interfaces for complex queries and interactions
  • Performance: Low-latency inference for real-time user interactions
  • Cost Control: Predictable and scalable cost structure without per-token pricing
  • Privacy: On-premises or private cloud deployment for sensitive data handling
  • Flexibility: Support for multiple model architectures and easy model switching

The choice of LLM runtime significantly impacts system performance, operational costs, data privacy, and the overall quality of AI-powered features.

Decision

Adopt Ollama as the primary Large Language Model runtime, integrated with Microsoft Semantic Kernel for AI orchestration, to provide powerful, cost-effective, and privacy-conscious AI capabilities across BookWorm's platform.

AI Architecture Strategy

Ollama Model Management

  • Model Repository: Centralized management of multiple LLM models for different use cases (see the sketch after this list)
  • Resource Optimization: GPU-aware deployment with automatic GPU detection
  • Container Integration: Aspire-based deployment with OpenWebUI interface
  • Azure Container App Support: Scalable cloud deployment with scale-to-zero (MinReplicas = 0)
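As a minimal sketch of what model management looks like in the AppHost (assuming the CommunityToolkit.Aspire.Hosting.Ollama package; the resource and model names are illustrative, not BookWorm's actual catalog):

var ollama = builder
    .AddOllama("ollama")
    .WithDataVolume();

// Each AddModel pulls and registers a named model at startup, so the
// deployment manifest pins exactly which models each environment runs.
// Model names below are examples only.
var chatModel = ollama.AddModel("chat", "llama3.2");
var embeddingModel = ollama.AddModel("embedding", "nomic-embed-text");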

Chat Service Agent Orchestration

  • Sequential Processing: Multi-agent pipeline for comprehensive chat responses
  • Specialized Agents: BookAgent, LanguageAgent, SentimentAgent, SummarizeAgent
  • RAG Integration: Hybrid search capabilities with vector embeddings
  • Real-time Streaming: Live response streaming with Redis backplane (see the sketch below)
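Streaming typically pairs a SignalR hub with a Redis backplane so streamed tokens reach clients connected to any instance. A minimal sketch, where ChatStreamHub and IChatService are hypothetical names, not BookWorm's actual types:

// Program.cs: the Redis backplane fans streamed messages out across instances
builder.Services
    .AddSignalR()
    .AddStackExchangeRedis(builder.Configuration.GetConnectionString("redis")!);

// Hypothetical hub that streams tokens to the caller as they are generated
public sealed class ChatStreamHub(IChatService chat) : Hub
{
    public IAsyncEnumerable<string> Stream(
        string prompt, CancellationToken cancellationToken) =>
        chat.StreamReplyAsync(prompt, cancellationToken);
}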

BookWorm AI Agents

| Agent | Purpose | Integration | Capabilities |
| --- | --- | --- | --- |
| BookAgent | Book search and recommendations | Catalog search, MCP tools | Personalized suggestions, catalog queries |
| LanguageAgent | Translation and language processing | Text processing | Multi-language support |
| SentimentAgent | Emotion analysis | Customer feedback | Positive/Negative/Neutral classification |
| SummarizeAgent | Content summarization | Text processing | Key insights extraction |
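Each agent can be expressed as a Semantic Kernel ChatCompletionAgent with its own instructions; a minimal sketch (the instructions shown are illustrative, not BookWorm's actual prompts):

var sentimentAgent = new ChatCompletionAgent
{
    Name = "SentimentAgent",
    Instructions =
        "Classify the sentiment of the user's message as Positive, " +
        "Negative, or Neutral, and briefly justify the classification.",
    Kernel = kernel, // a Kernel configured with the Ollama connector
};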

Rationale

Why Ollama?

Cost-Effective and Predictable

  1. No Token-Based Pricing: Fixed infrastructure costs without per-request charges
  2. Open Source Models: Access to state-of-the-art models without licensing fees
  3. Resource Efficiency: Optimized model serving with quantization and optimization
  4. Horizontal Scaling: Scale compute resources based on actual usage patterns
  5. Multi-Tenancy: Serve multiple applications from shared infrastructure

Privacy and Data Control

  1. On-Premises Deployment: Complete control over data and model inference
  2. No External API Calls: All AI processing happens within private infrastructure
  3. Data Sovereignty: Compliance with data protection regulations and policies
  4. Model Customization: Fine-tune models on proprietary data without external exposure
  5. Audit Trail: Complete visibility into AI processing and decision-making

Performance and Reliability

  1. Low Latency: Direct model access without network round-trips to external services
  2. High Throughput: Optimized inference serving with batching and caching
  3. Offline Capability: Continue AI functionality during internet connectivity issues
  4. Custom Optimization: Hardware-specific optimizations for GPU and CPU inference
  5. Availability Control: Service-level agreements based on internal infrastructure

Flexibility and Model Diversity

  1. Multiple Model Support: Run different models for specialized tasks
  2. Model Versioning: A/B test different model versions and capabilities
  3. Custom Models: Deploy fine-tuned or domain-specific models
  4. Easy Migration: Switch between models without changing application code
  5. Experimentation: Rapid prototyping with new models and approaches

Why Semantic Kernel Integration?

.NET Ecosystem Alignment

  1. Native .NET Integration: First-class support for .NET applications and patterns
  2. Dependency Injection: Seamless integration with the ASP.NET Core DI container (see the sketch after this list)
  3. Configuration: Familiar configuration patterns and environment management
  4. Testing: Comprehensive testing framework for AI-powered applications
  5. Observability: Built-in telemetry, logging, and performance monitoring
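A sketch of that registration, assuming the Microsoft.SemanticKernel.Connectors.Ollama package; the configuration keys are illustrative. Because the model id comes from configuration, swapping models (the Easy Migration point above) is a config-only change:

// Program.cs: register Semantic Kernel with the Ollama connector.
// "Ollama:ChatModel" and "Ollama:Endpoint" are illustrative config keys;
// changing the model id swaps models without touching application code.
builder.Services
    .AddKernel()
    .AddOllamaChatCompletion(
        modelId: builder.Configuration["Ollama:ChatModel"]!,
        endpoint: new Uri(builder.Configuration["Ollama:Endpoint"]!));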

Enterprise AI Patterns

  1. Prompt Engineering: Centralized prompt management with versioning and testing
  2. Function Calling: Integrate AI with business logic through semantic functions (sketched after this list)
  3. Memory Systems: Persistent context and conversation state management
  4. Plugin Architecture: Modular AI capabilities with reusable components
  5. Security: Built-in security patterns for AI applications and data handling
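For function calling, business logic is exposed to the model as a plugin. A sketch in which CatalogPlugin, ICatalogService, and its method are hypothetical names:

public sealed class CatalogPlugin(ICatalogService catalog)
{
    // The model can invoke this function when a query needs live catalog data;
    // ICatalogService stands in for a real application service
    [KernelFunction("search_books")]
    [Description("Searches the book catalog by title, author, or topic.")]
    public Task<IReadOnlyList<string>> SearchBooksAsync(string query) =>
        catalog.SearchTitlesAsync(query);
}

// Registration: kernel.Plugins.AddFromType<CatalogPlugin>();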

LLM Architecture Overview

Implementation Strategy

Ollama Deployment with Aspire

public static async Task AddOllama(
    this IDistributedApplicationBuilder builder,
    Action<IResourceBuilder<OllamaResource>>? configure = null)
{
    var ollama = builder
        .AddOllama(Components.Ollama.Resource)
        .WithDataVolume()
        .WithOpenWebUI()
        .WithImagePullPolicy(ImagePullPolicy.Always)
        .WithLifetime(ContainerLifetime.Persistent)
        .PublishAsAzureContainerApp((_, app) => app.Template.Scale.MinReplicas = 0);

    // Enable GPU passthrough only when a compatible GPU is detected on the host
    if (await ollama.IsUseGpu())
    {
        ollama.WithGPUSupport();
    }

    configure?.Invoke(ollama);
}
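In the AppHost this composes as a single awaited call, since the extension is async in order to probe for a GPU:

// AppHost Program.cs
var builder = DistributedApplication.CreateBuilder(args);
await builder.AddOllama();
builder.Build().Run();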

Agent Orchestration Service

public sealed class AgentOrchestrationService : IAgentOrchestrationService
{
    public async Task<string> ProcessAgentsSequentiallyAsync(
        string userMessage,
        Guid conversationId,
        Guid assistantReplyId,
        CancellationToken cancellationToken = default)
    {
        var runtime = new InProcessRuntime();
        await runtime.StartAsync(cancellationToken);

        // Agents run in a fixed order; each agent's output feeds the next
        SequentialOrchestration orchestration = new(
            orchestrateAgents.LanguageAgent,
            orchestrateAgents.SummarizeAgent,
            orchestrateAgents.SentimentAgent,
            orchestrateAgents.BookAgent)
        {
            ResponseCallback = ResponseCallbackAsync,
        };

        var result = await orchestration.InvokeAsync(userMessage, runtime, cancellationToken);
        var finalResults = await result.GetValueAsync(TimeSpan.FromSeconds(60), cancellationToken);

        return finalResults;
    }
}

Hybrid Search with Vector Embeddings

public sealed class HybridSearch : ISearch
{
    public async Task<IReadOnlyList<TextSnippet>> SearchAsync(
        string text,
        ICollection<string> keywords,
        int maxResults = 20,
        CancellationToken cancellationToken = default)
    {
        // Embed the query text for the vector-similarity half of the search
        var vector = await embeddingGenerator.GenerateVectorAsync(
            text, cancellationToken: cancellationToken);

        await collection.EnsureCollectionExistsAsync(cancellationToken);
        var vectorCollection = (IKeywordHybridSearchable<TextSnippet>)collection;

        var options = new HybridSearchOptions<TextSnippet>
        {
            VectorProperty = r => r.Vector,
            AdditionalProperty = r => r.Description,
        };

        // Combine vector similarity with keyword matching in a single query
        var nearest = vectorCollection.HybridSearchAsync(
            vector, keywords, maxResults, options, cancellationToken);

        return await nearest.ToListAsync(cancellationToken);
    }
}
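Callers pass both the free-text query and extracted keywords, so a single call benefits from semantic similarity and exact matching. An illustrative call site:

var snippets = await search.SearchAsync(
    "books about distributed systems",
    keywords: ["microservices", "event-driven"],
    maxResults: 10);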

Consequences

Benefits

  • Cost predictability with fixed infrastructure costs vs per-token pricing
  • Data privacy with on-premises processing and no external API calls
  • Performance control with direct model access and hardware optimization
  • Agent orchestration enabling complex multi-step AI workflows
  • Hybrid search combining vector similarity with keyword matching
  • Real-time streaming for responsive chat experiences

Trade-offs

  • Infrastructure complexity requiring GPU management and model deployment
  • Operational overhead for model updates and performance monitoring
  • Resource requirements for GPU-enabled hardware and memory management
  • Expertise needs for AI model optimization and troubleshooting

Alternatives Considered

OpenAI GPT API

  • Pros: Best-in-class performance, managed service, extensive capabilities
  • Cons: High per-token costs, data privacy concerns, external dependency
  • Decision: Cost and privacy concerns outweigh performance benefits

Azure OpenAI Service

  • Pros: Enterprise-grade security, Microsoft ecosystem integration
  • Cons: External dependency, significant costs at scale, limited model control
  • Decision: Preference for self-hosted solution with complete control

Hugging Face Transformers

  • Pros: Open source, extensive model library, community support
  • Cons: Complex deployment, requires ML expertise, optimization challenges
  • Decision: Ollama provides better production-ready deployment