Multimodal Support
Memory OS goes beyond text - store and retrieve memories from images, audio, documents, and more. Build AI that truly understands the world.
What is Multimodal Memory?
Multimodal memory means your AI can remember and understand information from multiple formats - not just text, but images, audio, documents, and video. This creates richer, more contextual AI experiences.
Just like humans remember faces, voices, and visual scenes, your AI can too.
Text-Only Memory
Limited context - the AI stores only a text description, so it can't actually see what was in a screenshot.
Multimodal Memory
Full context - the AI can see and understand the actual image content.
Supported Formats
Images
Store screenshots, photos, diagrams, and visual content with automatic vision analysis.
Audio
Remember voice conversations, audio notes, and spoken content with transcription.
Documents
Store PDFs, Word docs, and spreadsheets with full text extraction and understanding.
Video
Extract key frames and transcribe audio from video content for searchable memories.
How It Works
Upload & Process
Upload any supported format. Memory OS automatically processes it using AI vision, speech-to-text, or document parsing.
Image → Vision AI → "Blue dashboard with 3 charts showing upward trends"
Extract & Embed
Content is extracted and converted into semantic embeddings, making it searchable by meaning.
Text + Image Description → Embeddings → Searchable Memory
Search & Retrieve
Search across all formats using natural language. Find images by describing what's in them, audio by what was said.
Query: "dashboard screenshots" → Finds all relevant imagesCode Examples
Storing an Image Memory
from memorystack import MemoryStackClient

memory = MemoryStackClient(
    api_key="your_api_key",
    user_id="user_123"
)

# Store image with context
result = memory.create_memory(
    content="Product dashboard showing Q4 analytics",
    memory_type="visual",
    attachments=[{
        "type": "image",
        "url": "https://example.com/dashboard.png",
        "description": "Blue dashboard with 3 charts"
    }],
    metadata={
        "category": "product_design",
        "date": "2024-01-15"
    }
)

print(f"Stored memory: {result['id']}")
# Vision AI automatically analyzes the image
Storing Audio Memory
# Store voice note
result = memory.create_memory(
    content="Meeting notes from product review",
    memory_type="audio",
    attachments=[{
        "type": "audio",
        "url": "https://example.com/meeting.mp3",
        "duration": 1800  # 30 minutes
    }],
    metadata={
        "participants": ["Alice", "Bob"],
        "meeting_type": "product_review"
    }
)

# Audio is automatically transcribed
# Transcription is embedded and searchable
Searching Multimodal Memories
# Search across all formats
results = memory.search_memories(
    query="dashboard designs with charts",
    limit=10
)

# Filter by attachment type
image_memories = memory.search_memories(
    query="product screenshots",
    filters={
        "has_attachment": "image"
    }
)

# Search audio transcriptions
meeting_notes = memory.search_memories(
    query="API discussion",
    filters={
        "has_attachment": "audio",
        "memory_type": "meeting"
    }
)

# Process results
for mem in results['results']:
    print(f"Content: {mem['content']}")
    if mem.get('attachments'):
        for att in mem['attachments']:
            print(f"  - {att['type']}: {att['url']}")
            if att.get('analysis'):
                print(f"    Analysis: {att['analysis']}")
Real-World Use Cases
🎨 Design Assistant
Remember all design iterations, mockups, and feedback. Search by visual similarity or description.
🎙️ Meeting Assistant
Record and transcribe meetings automatically. Search through all past discussions by topic.
📚 Research Assistant
Store PDFs, papers, and documents. Search across all your research materials by concept.
Best Practices
✅ Do
• Add descriptive text context with attachments
• Use appropriate memory types for different formats
• Include relevant metadata (date, category, etc.)
• Optimize image sizes before uploading (a resizing sketch follows this list)
• Use transcription for searchable audio
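One way to keep image uploads small before calling create_memory is to downscale them locally. This is a generic pre-processing sketch using Pillow, not part of the Memory OS SDK; the 2048 px cap and JPEG quality of 85 are arbitrary example values.
# Generic pre-upload resize with Pillow (not part of the Memory OS SDK)
from PIL import Image

def shrink_image(path: str, out_path: str, max_side: int = 2048) -> str:
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # preserves aspect ratio; only shrinks
    img.convert("RGB").save(out_path, "JPEG", quality=85)
    return out_path

shrink_image("dashboard.png", "dashboard_small.jpg")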
❌ Don't
• Upload without any text description
• Store extremely large files (> 50 MB)
• Forget to add searchable metadata
• Mix unrelated content in one memory
• Skip format validation (a validation sketch follows this list)
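A minimal client-side validation sketch, based on the 50 MB limit mentioned above. The allowed extensions are illustrative assumptions, not an official list from Memory OS.
import os

# Hypothetical pre-upload check; extensions are illustrative, not official
ALLOWED = {".png", ".jpg", ".jpeg", ".mp3", ".wav", ".pdf", ".docx", ".mp4"}
MAX_BYTES = 50 * 1024 * 1024  # 50 MB limit noted above

def validate_attachment(path: str) -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED:
        raise ValueError(f"Unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("File exceeds 50 MB limit")

validate_attachment("meeting.mp3")  # raises ValueError if the file is invalid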
