AI Building Blocks
What can you transform? Start from what you have (images, text, audio, video, or documents) and discover which building blocks turn it into what you need. The focus is on production-ready solutions, not research.
From Image
15 blocks
Image Understanding (5)
Image Embedding
Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.
Image Captioning
Generate natural language descriptions of image content. Enables text-based search over visual content.
Visual Question Answering
Answer natural language questions about images. Combines vision and language understanding.
Optical Character Recognition
Detect and read text in images and documents. Core for document intake, receipts, and scene text search.
Chart and Table Understanding
Parse charts, diagrams, and tables into structured data for analysis and QA.
Image Perception (5)
Object Detection
Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.
Image Segmentation
Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.
Depth Estimation
Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.
Pose Estimation
Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.
Optical Flow
Estimate pixel-wise motion between frames. Useful for video editing, stabilization, and robotics.
Image Transformation (5)
Image to 3D
Generate 3D models from single or multiple images. Powers 3D asset creation, VR/AR, and e-commerce.
Image to Video
Animate still images into videos. Bring photos to life with natural motion.
Image Transformation
Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.
Background Removal
Segment foreground and remove or replace backgrounds for product photos and portraits.
Face Anonymization
Blur, mask, or re-synthesize faces to protect privacy in images and video frames.
From Text
18 blocks
Text Retrieval (3)
Text Embedding
Convert text into dense vector representations for semantic search, clustering, and retrieval.
Cross-Encoder Reranking
Re-score retrieved passages with a cross-encoder to boost search precision.
Hybrid Sparse + Dense Retrieval
Combine lexical (BM25) and dense retrieval with weighted fusion or cascades to improve recall and precision for search and RAG.
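As a minimal sketch of the weighted-fusion variant described above: the snippet below min-max normalizes each retriever's scores and blends them with a weight `alpha`. The function names, the normalization choice, and the sample scores are all assumptions made for illustration; real systems may prefer rank-based fusion or a cascade instead.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    lexical: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Blend normalized scores: alpha * dense + (1 - alpha) * lexical.

    A document missing from one retriever contributes 0 for that retriever.
    """
    lex_n = min_max_normalize(lexical)
    den_n = min_max_normalize(dense)
    fused = {
        doc_id: alpha * den_n.get(doc_id, 0.0) + (1 - alpha) * lex_n.get(doc_id, 0.0)
        for doc_id in set(lex_n) | set(den_n)
    }
    # Sort by score descending, breaking ties by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores, for illustration only.
bm25_scores = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}
dense_scores = {"doc_b": 0.90, "doc_c": 0.80, "doc_d": 0.40}
ranking = weighted_fusion(bm25_scores, dense_scores, alpha=0.5)
```

Here `doc_b` ranks first because both retrievers score it reasonably well, even though BM25 alone would have preferred `doc_a`. Raising `alpha` shifts the ranking toward the dense retriever.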
Text to Media (4)
Image Generation
Generate images from text descriptions. Powers creative tools, marketing, and synthetic data.
Text to Speech
Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.
Text to 3D
Generate 3D models from text descriptions. Enables rapid prototyping and creative 3D content generation.
Text to Video
Generate videos from text descriptions. The frontier of generative AI for content creation.
Text Generation (3)
Language Model
Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.
Controllable Generation
Generate text with constraints on style, length, structure, or safety guardrails.
Code Generation & Repair
Generate, refactor, or fix code with language models specialized for programming.
Text Analysis (4)
Text Classification
Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.
Named Entity Recognition
Extract named entities (people, organizations, locations, dates) from text. Key for information extraction and knowledge graphs.
PII Detection & Anonymization
Detect and redact personally identifiable information to support privacy and regulatory compliance.
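A rough sketch of the redaction step, assuming a rule-based approach: two deliberately simplified regular expressions (email addresses and one US-style phone format) are replaced with bracketed labels. The patterns and the `redact` helper are illustrative assumptions, not a complete detector.

```python
import re

# Simplified, illustrative patterns. They will miss many real-world formats.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace each pattern match with a bracketed label such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
redacted = redact(sample)
```

Note that the name "Jane" passes through untouched: names, addresses, and other free-form identifiers generally need a model-based detector (for example, the Named Entity Recognition block) rather than regular expressions alone.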
Hallucination Detection
Score or flag generated text for factuality and grounding.
Text Transformation (4)
Machine Translation
Translate text between languages. Essential for global communication, localization, and cross-lingual applications.
Text Summarization
Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.
Question Answering
Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.
Long-Context Summarization
Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.
From Audio
9 blocks
Speech Recognition
Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.
Audio Classification
Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.
Voice Activity Detection
Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.
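To make the idea concrete, here is a minimal energy-threshold sketch: audio is split into short frames, and runs of frames whose RMS energy exceeds a threshold are reported as active spans. The frame length, threshold, and function name are assumptions for illustration; production detectors are usually model-based and far more robust to noise.

```python
import math
from typing import List, Optional, Tuple


def frame_energy_vad(
    samples: List[float],
    sample_rate: int,
    frame_ms: int = 20,
    threshold: float = 0.01,
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose per-frame RMS exceeds threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    n_frames = len(samples) // frame_len
    segments: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        active = rms > threshold
        if active and start is None:
            start = i * frame_len / sample_rate      # span opens
        elif not active and start is not None:
            segments.append((start, i * frame_len / sample_rate))  # span closes
            start = None
    if start is not None:
        # Signal ended while still active; close the final span.
        segments.append((start, n_frames * frame_len / sample_rate))
    return segments


# Synthetic signal: 0.5 s silence, 0.5 s of a 220 Hz tone, 0.5 s silence.
sr = 8000
silence = [0.0] * (sr // 2)
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 2)]
segments = frame_energy_vad(silence + tone + silence, sr)
```

An energy threshold cannot tell speech from other loud sounds, so this is best read as a picture of the input/output contract (samples in, time spans out) rather than a usable detector.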
Audio Transformation
Transform audio signals: enhance, denoise, separate sources, change voice, or convert music styles.
Speaker Diarization
Determine 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.
Keyword Spotting
Detect wake words and short commands with low latency and tiny footprints.
Speech Emotion Recognition
Classify speaker emotion or affective state from voice.
Voice Cloning
Replicate a speaker’s voice or convert one voice to another (speech-to-speech voice conversion).
Audio Watermark Detection
Detect or verify watermarks in synthetic or distributed audio.
From Video
5 blocks
Video Understanding
Understand and describe video content. Powers video search, summarization, and analysis.
Action Recognition
Classify actions or activities in video clips for safety, sports, and analytics.
Multi-Object Tracking
Track multiple objects across video frames with consistent identities.
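One simple way to keep identities consistent is greedy IoU association: each new detection is matched to the existing track whose last box overlaps it most. The sketch below assumes per-frame bounding boxes are already available from a detector; the class name, the `min_iou` threshold, and the policy of dropping unmatched tracks immediately are illustrative assumptions, and practical trackers add motion models and appearance features.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    def __init__(self, min_iou: float = 0.3) -> None:
        self.min_iou = min_iou
        self.tracks: Dict[int, Box] = {}
        self._next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Match detections to existing tracks and return {track_id: box}."""
        # Score every (track, detection) pair, then take the best pairs first.
        pairs = sorted(
            (
                (iou(box, det), tid, j)
                for tid, box in self.tracks.items()
                for j, det in enumerate(detections)
            ),
            reverse=True,
        )
        assigned: Dict[int, Box] = {}
        used = set()
        for score, tid, j in pairs:
            if score < self.min_iou:
                break  # remaining pairs overlap too little to match
            if tid in assigned or j in used:
                continue
            assigned[tid] = detections[j]
            used.add(j)
        # Unmatched detections start new tracks; unmatched tracks are dropped.
        for j, det in enumerate(detections):
            if j not in used:
                assigned[self._next_id] = det
                self._next_id += 1
        self.tracks = assigned
        return assigned


tracker = GreedyIoUTracker()
frame1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
# Same two objects, shifted slightly and listed in the opposite order.
frame2 = tracker.update([(51, 50, 61, 60), (1, 0, 11, 10)])
```

Even though the detections arrive in a different order in the second frame, each object keeps the id it was given in the first.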
Video OCR
Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.
Audio-Visual Speech Separation
Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.
From Document
3 blocks
Document Extraction
Extract structured information from documents like PDFs, invoices, forms, and contracts.
Document Question Answering
Answer questions about document content including text, tables, and layouts. Essential for document AI.
Document RAG Pipeline
Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and documents, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.
Common Pipelines
Pre-built combinations of building blocks for common use cases.
Direct Visual Search
Embed images directly with CLIP/SigLIP, then search by text or image query.
- Photo library search
- E-commerce visual search
- Real-time indexing
- Text-to-image search
Caption + RAG Visual Search
Generate captions for images, embed the captions, and search them via text RAG.
- Detailed scene search
- Accessibility-first apps
- Human-readable index
- Can describe complex scenes
Document RAG Pipeline
Extract text from documents, chunk it, embed the chunks, retrieve the most relevant ones, and generate an answer with an LLM.
- Enterprise search
- Legal document QA
- Grounds LLM in your data
- Citable sources
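The chunk, embed, retrieve, and generate stages can be sketched end to end in a few lines. In the snippet below, `embed` is a toy bag-of-words stand-in for a real embedding model, and the final LLM call is left out: the sketch stops at building a prompt with numbered sources. All function names, chunk sizes, and the sample text are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into overlapping windows of `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Number the retrieved chunks so the answer can cite them."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


doc = (
    "The warranty covers parts and labor for two years. "
    "Returns are accepted within thirty days of delivery. "
    "Shipping is free for orders above fifty dollars."
)
chunks = chunk_words(doc, size=10, overlap=2)
question = "How long is the warranty?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)  # this string would be sent to an LLM
```

Numbering the sources in the prompt is what makes the "citable sources" property possible: the model can be asked to reference `[1]`, `[2]`, and so on, and each number maps back to a specific chunk.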
Voice Assistant Pipeline
Transcribe speech to text, process the transcript with an LLM, and respond with text-to-speech.
- Voice assistants
- Call center bots
- Natural interaction
- Hands-free
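Structurally, this pipeline is three stages chained together, which the sketch below expresses as a small dataclass holding three callables. The `VoiceAssistant` class and the placeholder stages are hypothetical; they exist only so the example runs without any model, and each would be swapped for a real speech recognizer, LLM, and speech synthesizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Chains three stages; each is any callable with the matching signature."""

    transcribe: Callable[[bytes], str]  # speech-to-text
    respond: Callable[[str], str]       # language model
    synthesize: Callable[[str], bytes]  # text-to-speech

    def handle(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        reply = self.respond(text)
        return self.synthesize(reply)


# Placeholder stages (hypothetical): bytes are treated as UTF-8 text.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_asr, fake_llm, fake_tts)
reply_audio = assistant.handle(b"turn on the lights")
```

Keeping each stage behind a plain callable makes it straightforward to replace one component, or to test the orchestration logic in isolation, without touching the others.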
Example: Text Search in Photo Database
You have thousands of photos and want to search them with text queries like "sunset at the beach" or "birthday party with cake". Here are your options:
Direct CLIP Embedding
Embed images directly with CLIP/SigLIP. Text queries are embedded in the same space. Simple, real-time capable.
Best for: General visual concepts, fast indexing, product similarity
Caption + Text RAG
Generate detailed captions with a VLM, then use text embedding for search. More descriptive, human-readable index.
Best for: Complex scene descriptions, debugging, accessibility requirements
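Both options end in the same operation: a nearest-neighbor search over vectors. They differ only in what gets embedded at indexing time (the image itself, or its caption). The sketch below shows that shared search step with tiny hand-written vectors; the file names and numbers are made up, and in practice the vectors would come from a CLIP/SigLIP image encoder (option 1) or a text embedding model applied to captions (option 2).

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank indexed items by cosine similarity to the query vector."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 3-dimensional vectors; real embeddings have hundreds of dimensions.
photo_index = {
    "beach_sunset.jpg": [0.9, 0.1, 0.0],
    "birthday_cake.jpg": [0.1, 0.9, 0.1],
    "office_desk.jpg": [0.0, 0.2, 0.9],
}
# Stand-in for the embedded text query "sunset at the beach".
query_vec = [0.8, 0.2, 0.0]
results = top_k(query_vec, photo_index, k=3)
```

A linear scan like this is fine for thousands of photos; larger collections typically move the same comparison into an approximate nearest-neighbor index.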