    Multimodal Support

    Memory OS goes beyond text - store and retrieve memories from images, audio, documents, and more. Build AI that truly understands the world.

    What is Multimodal Memory?

    Multimodal memory means your AI can remember and understand information from multiple formats - not just text, but images, audio, documents, and video. This creates richer, more contextual AI experiences.

    Just like humans remember faces, voices, and visual scenes, your AI can too.

    Text-Only Memory

    "User showed me a product screenshot"

    Limited context - the AI can't actually see what was in the screenshot.

    Multimodal Memory

    "User showed product screenshot"
    + Image: Blue dashboard with analytics charts

    Full context - the AI can see and understand the actual image content.
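
    In code, the difference is just an attachment. A minimal sketch using the client set up in the Code Examples section below (the attachment fields are illustrative):

    # Text-only: the AI can recall only the sentence itself
    memory.create_memory(content="User showed me a product screenshot")

    # Multimodal: the attached image is analyzed, so its contents
    # become part of the memory and are searchable too
    memory.create_memory(
        content="User showed me a product screenshot",
        attachments=[{
            "type": "image",
            "url": "https://example.com/screenshot.png"
        }]
    )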

    Supported Formats

    Images

    Store screenshots, photos, diagrams, and visual content with automatic vision analysis.

    Formats:
    JPG, PNG, GIF, WebP, SVG

    Audio

    Remember voice conversations, audio notes, and spoken content with transcription.

    Formats:
    MP3, WAV, M4A, OGG

    Documents

    Store PDFs, Word docs, and spreadsheets with full text extraction and understanding.

    Formats:
    PDF, DOCX, TXT, CSV, XLSX

    Video

    Extract key frames and transcribe audio from video content for searchable memories.

    Formats:
    MP4, WebM, MOV
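
    The format lists above lend themselves to a simple lookup table. A minimal sketch for routing files by extension before upload (this helper is illustrative, not part of the SDK):

    SUPPORTED_FORMATS = {
        "image": {"jpg", "jpeg", "png", "gif", "webp", "svg"},  # jpeg added as the common alias of jpg
        "audio": {"mp3", "wav", "m4a", "ogg"},
        "document": {"pdf", "docx", "txt", "csv", "xlsx"},
        "video": {"mp4", "webm", "mov"},
    }

    def detect_modality(filename: str) -> str | None:
        """Return the modality for a filename, or None if unsupported."""
        ext = filename.rsplit(".", 1)[-1].lower()
        for modality, extensions in SUPPORTED_FORMATS.items():
            if ext in extensions:
                return modality
        return None

    assert detect_modality("dashboard.png") == "image"
    assert detect_modality("report.pdf") == "document"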

    How It Works

    1. Upload & Process

    Upload any supported format. Memory OS automatically processes it using AI vision, speech-to-text, or document parsing.

    Image → Vision AI → "Blue dashboard with 3 charts showing upward trends"

    2. Extract & Embed

    Content is extracted and converted into semantic embeddings, making it searchable by meaning.

    Text + Image Description → Embeddings → Searchable Memory
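
    Conceptually, this step can be sketched as follows; embed stands in for whatever embedding model Memory OS runs internally and is purely illustrative:

    # Illustrative only: `embed` stands in for the internal embedding model
    def embed(text: str) -> list[float]:
        ...

    content = "Product dashboard showing Q4 analytics"
    image_description = "Blue dashboard with 3 charts showing upward trends"

    # The text and the vision-derived description are embedded together,
    # so a search query can match either one by meaning
    vector = embed(f"{content}\n{image_description}")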

    3. Search & Retrieve

    Search across all formats using natural language. Find images by describing what's in them, audio by what was said.

    Query: "dashboard screenshots" → Finds all relevant images

    Code Examples

    Storing an Image Memory

    from memorystack import MemoryStackClient
    
    memory = MemoryStackClient(
        api_key="your_api_key",
        user_id="user_123"
    )
    
    # Store image with context
    result = memory.create_memory(
        content="Product dashboard showing Q4 analytics",
        memory_type="visual",
        attachments=[{
            "type": "image",
            "url": "https://example.com/dashboard.png",
            "description": "Blue dashboard with 3 charts"
        }],
        metadata={
            "category": "product_design",
            "date": "2024-01-15"
        }
    )
    
    print(f"Stored memory: {result['id']}")
    # Vision AI automatically analyzes the image

    Storing Audio Memory

    # Store voice note
    result = memory.create_memory(
        content="Meeting notes from product review",
        memory_type="audio",
        attachments=[{
            "type": "audio",
            "url": "https://example.com/meeting.mp3",
            "duration": 1800  # 30 minutes
        }],
        metadata={
            "participants": ["Alice", "Bob"],
            "meeting_type": "product_review"
        }
    )
    
    # Audio is automatically transcribed
    # Transcription is embedded and searchable
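
    Storing a Document Memory

    Documents follow the same pattern. A sketch assuming a "document" attachment type that mirrors the image and audio examples above (the exact field names are assumptions):

    # Store a PDF - text is extracted and embedded automatically
    result = memory.create_memory(
        content="Q4 planning document with budget breakdown",
        memory_type="document",
        attachments=[{
            "type": "document",
            "url": "https://example.com/q4-plan.pdf"
        }],
        metadata={"category": "planning"}
    )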

    Searching Multimodal Memories

    # Search across all formats
    results = memory.search_memories(
        query="dashboard designs with charts",
        limit=10
    )
    
    # Filter by attachment type
    image_memories = memory.search_memories(
        query="product screenshots",
        filters={
            "has_attachment": "image"
        }
    )
    
    # Search audio transcriptions
    meeting_notes = memory.search_memories(
        query="API discussion",
        filters={
            "has_attachment": "audio",
            "memory_type": "meeting"
        }
    )
    
    # Process results
    for mem in results['results']:
        print(f"Content: {mem['content']}")
        if mem.get('attachments'):
            for att in mem['attachments']:
                print(f"  - {att['type']}: {att['url']}")
                if att.get('analysis'):
                    print(f"    Analysis: {att['analysis']}")

    Real-World Use Cases

    🎨 Design Assistant

    Remember all design iterations, mockups, and feedback. Search by visual similarity or description.

    Example:
    "Show me all dashboard designs with dark mode" → Finds relevant screenshots

    🎙️ Meeting Assistant

    Record and transcribe meetings automatically. Search through all past discussions by topic.

    Example:
    "What did we decide about the API pricing?" → Finds relevant meeting segments

    📚 Research Assistant

    Store PDFs, papers, and documents. Search across all your research materials by concept.

    Example:
    "Papers about transformer architectures" → Finds relevant documents and excerpts

    Best Practices

    ✅ Do

    • Add descriptive text context with attachments
    • Use appropriate memory types for different formats
    • Include relevant metadata (date, category, etc.)
    • Optimize image sizes before uploading
    • Use transcription for searchable audio

    ❌ Don't

    • Upload without any text description
    • Store extremely large files (> 50MB)
    • Forget to add searchable metadata
    • Mix unrelated content in one memory
    • Skip format validation (see the sketch below)
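
    A pre-upload check covering the last two points might look like this; the 50MB ceiling comes from the guideline above, and detect_modality is the illustrative helper from the Supported Formats section:

    import os

    MAX_FILE_SIZE = 50 * 1024 * 1024  # 50MB guideline from above

    def validate_upload(path: str) -> None:
        """Reject files that are too large or in an unsupported format."""
        if os.path.getsize(path) > MAX_FILE_SIZE:
            raise ValueError(f"{path} exceeds the 50MB limit")
        if detect_modality(path) is None:
            raise ValueError(f"{path} has an unsupported format")

    # validate_upload("dashboard.png")  # raises if too large or unsupported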

    Next Steps