The composable operations of AI.
Every AI pipeline reduces to a sequence of typed transformations: something goes in, something else comes out. This index catalogues those transformations by input modality, lists the implementations that currently matter, and links each block to its registry page.
Start from what you have — image, text, audio, video, a document — and follow the arrows to what you need.
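One way to picture a typed transformation is as a small record that carries its input modality, its output modality, and a function. The sketch below is illustrative only: `Block`, `compose`, and the two stand-in lambdas are hypothetical names, not part of the registry, and the "models" are trivial placeholders.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Block:
    """A typed transformation: something of `source` goes in, `target` comes out."""
    name: str
    source: str  # input modality, e.g. "Image"
    target: str  # output modality, e.g. "Text"
    run: Callable[[Any], Any]


def compose(*blocks: Block) -> Block:
    """Chain blocks, checking that each output modality feeds the next input."""
    if not blocks:
        raise ValueError("need at least one block")
    for left, right in zip(blocks, blocks[1:]):
        if left.target != right.source:
            raise TypeError(
                f"{left.name} outputs {left.target}, "
                f"but {right.name} expects {right.source}"
            )

    def run(value: Any) -> Any:
        for block in blocks:
            value = block.run(value)
        return value

    name = " → ".join(block.name for block in blocks)
    return Block(name, blocks[0].source, blocks[-1].target, run)


# Placeholder implementations; a real pipeline would call a model here.
captioner = Block("Image Captioning", "Image", "Text",
                  lambda image: f"a photo of {image}")
embedder = Block("Text Embedding", "Text", "Vector",
                 lambda text: [float(len(word)) for word in text.split()])

caption_then_embed = compose(captioner, embedder)
```

Composing `embedder` before `captioner` raises a `TypeError`, which is the point of tracking modalities: a chain is only valid when the arrows line up.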
Text, in and out.
- 01 Text → Vector: Text Embedding
  Convert text into dense vector representations for semantic search, clustering, and retrieval.
  Read the block page →
  Implementations:
  - OpenAI text-embedding-3-large [api]
  - Cohere embed-v3 [api]
  - Voyage AI voyage-3 [api]
  - BGE-M3 [open-source]
  - + 2 more on the block page
- 02 Text → Text: Language Model
  Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.
  Read the block page →
  Implementations:
  - GPT-4o [api]
  - Claude 3.5 Sonnet [api]
  - Gemini 1.5 Pro [api]
  - Llama 3.1 405B [open-source]
  - + 2 more on the block page
- 03 Text → Image: Image Generation
  Generate images from text descriptions. Powers creative tools, marketing, and synthetic data.
  Read the block page →
  Implementations:
  - DALL-E 3 [api]
  - Midjourney [api]
  - Stable Diffusion 3 [open-source]
  - FLUX.1 [open-source]
  - + 1 more on the block page
- 04 Text → Audio: Text to Speech
  Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.
  Read the block page →
  Implementations:
  - ElevenLabs [api]
  - OpenAI TTS [api]
  - Coqui XTTS [open-source]
  - Bark [open-source]
- 05 Text → 3D Model: Text to 3D
  Generate 3D models from text descriptions. Enables rapid prototyping and creative 3D content generation.
  Read the block page →
  Implementations:
  - Shap-E [open-source]
  - Point-E [open-source]
  - MVDream [open-source]
  - Meshy [api]
  - + 1 more on the block page
- 06 Text → Video: Text to Video
  Generate videos from text descriptions. The frontier of generative AI for content creation.
  Read the block page →
  Implementations:
  - Sora [api]
  - Runway Gen-3 [api]
  - Kling [api]
  - CogVideoX [open-source]
  - + 1 more on the block page
- 07 Text → Structured Data: Text Classification
  Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.
  Read the block page →
  Implementations:
  - SetFit [open-source]
  - DistilBERT [open-source]
  - DeBERTa-v3 [open-source]
  - BART-large-mnli [open-source]
  - + 1 more on the block page
- 08 Text → Text: Machine Translation
  Translate text between languages. Essential for global communication, localization, and cross-lingual applications.
  Read the block page →
  Implementations:
  - NLLB-200 [open-source]
  - Google Cloud Translation [api]
  - DeepL [api]
  - MADLAD-400 [open-source]
  - + 1 more on the block page
- 09 Text → Text: Text Summarization
  Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.
  Read the block page →
  Implementations:
  - BART-large-cnn [open-source]
  - Pegasus [open-source]
  - LongT5 [open-source]
  - Claude [api]
  - + 1 more on the block page
- 10 Text → Text: Question Answering
  Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.
  Read the block page →
  Implementations:
  - RoBERTa-SQuAD [open-source]
  - DPR (Dense Passage Retrieval) [open-source]
  - FiD (Fusion-in-Decoder) [open-source]
  - Perplexity API [api]
  - + 1 more on the block page
- 11 Text → Structured Data: Named Entity Recognition
  Extract named entities (people, organizations, locations, dates) from text. Key for information extraction and knowledge graphs.
  Read the block page →
  Implementations:
  - spaCy [open-source]
  - GLiNER [open-source]
  - BERT-NER [open-source]
  - Flair [open-source]
  - + 1 more on the block page
- 12 Text → Structured Data: Cross-Encoder Reranking
  Re-score retrieved passages with a cross-encoder to boost search precision.
  Read the block page →
  Implementations:
  - Cohere Rerank [api]
  - BGE-Reranker [open-source]
  - monoT5 [open-source]
- 13 Text → Structured Data: PII Detection & Anonymization
  Detect and redact personally identifiable information to stay compliant.
  Read the block page →
  Implementations:
  - Microsoft Presidio [open-source]
  - spaCy PII Pipelines [open-source]
  - AWS Comprehend PII [api]
- 14 Text → Structured Data: Hallucination Detection
  Score or flag generated text for factuality and grounding.
  Read the block page →
  Implementations:
  - RAGAS [open-source]
  - SelfCheckGPT [open-source]
  - G-Eval [open-source]
- 15 Text → Text: Controllable Generation
  Generate text with constraints on style, length, structure, or safety guardrails.
  Read the block page →
  Implementations:
  - Guidance/Outlines [open-source]
  - Guardrails AI [open-source]
  - NeMo Guardrails [open-source]
- 16 Text → Text: Long-Context Summarization
  Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.
  Read the block page →
  Implementations:
  - Gemini 1.5 Pro [api]
  - Claude 3.5 Sonnet, 200K context [api]
  - Llama 3.1 70B, 128K context [open-source]
- 17 Text → Text: Code Generation & Repair
  Generate, refactor, or fix code with language models specialized for programming.
  Read the block page →
  Implementations:
  - GPT-4o (Code) [api]
  - DeepSeek-Coder-V2 [open-source]
  - CodeLlama 70B Instruct [open-source]
- 18 Text → Structured Data: Hybrid Sparse + Dense Retrieval
  Combine lexical (BM25) and dense retrieval with weighted fusion or cascades to improve recall and precision for search and RAG.
  Read the block page →
  Implementations:
  - Elasticsearch + ELSER/BM25 [open-source]
  - Pyserini + Faiss [open-source]
  - Weaviate Hybrid [open-source]
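The weighted fusion mentioned in block 18 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any listed implementation works internally: scores from each retriever are min-max normalized so they are comparable, then blended with a weight `alpha`. The function names, the `alpha` parameter, and the example scores are all hypothetical.

```python
from __future__ import annotations


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (score - lo) / (hi - lo) for doc, score in scores.items()}


def weighted_fusion(
    lexical: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend normalized scores: alpha weights dense, (1 - alpha) weights lexical.

    A document missing from one retriever's results contributes 0 from that side.
    """
    lex, den = min_max(lexical), min_max(dense)
    fused = {
        doc: alpha * den.get(doc, 0.0) + (1 - alpha) * lex.get(doc, 0.0)
        for doc in set(lex) | set(den)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up scores for illustration: raw BM25 values and cosine similarities.
bm25_scores = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}
dense_scores = {"doc_b": 0.9, "doc_c": 0.8, "doc_d": 0.3}

ranking = weighted_fusion(bm25_scores, dense_scores, alpha=0.5)
```

With these toy inputs `doc_b` ranks first because it scores reasonably under both retrievers, while `doc_a` appears only in the lexical results. Rank-based schemes such as reciprocal rank fusion are a common alternative when raw scores are hard to normalize.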
Image, in and out.
- 01 Image → Vector: Image Embedding
  Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.
  Read the block page →
  Implementations:
  - OpenAI CLIP [open-source]
  - SigLIP [open-source]
  - OpenCLIP [open-source]
  - DINOv2 [open-source]
- 02 Image → Text: Image Captioning
  Generate natural language descriptions of image content. Enables text-based search over visual content.
  Read the block page →
  Implementations:
  - GPT-4 Vision [api]
  - Claude 3.5 Sonnet [api]
  - LLaVA [open-source]
  - BLIP-2 [open-source]
  - + 1 more on the block page
- 03 Image → Bounding Boxes: Object Detection
  Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.
  Read the block page →
  Implementations:
  - YOLOv8/YOLOv11 [open-source]
  - RT-DETR [open-source]
  - Grounding DINO [open-source]
  - Florence-2 [open-source]
  - + 1 more on the block page
- 04 Image → Segmentation Mask: Image Segmentation
  Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.
  Read the block page →
  Implementations:
  - Segment Anything (SAM) [open-source]
  - SAM 2 [open-source]
  - Mask2Former [open-source]
  - YOLOv8-seg [open-source]
  - + 1 more on the block page
- 05 Image → Depth Map: Depth Estimation
  Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.
  Read the block page →
  Implementations:
  - Depth Anything V2 [open-source]
  - MiDaS [open-source]
  - ZoeDepth [open-source]
  - Marigold [open-source]
- 06 Image → 3D Model: Image to 3D
  Generate 3D models from single or multiple images. Powers 3D asset creation, VR/AR, and e-commerce.
  Read the block page →
  Implementations:
  - TripoSR [open-source]
  - LGM (Large Gaussian Model) [open-source]
  - InstantMesh [open-source]
  - CSM (Common Sense Machines) [api]
  - + 1 more on the block page
- 07 Image → Video: Image to Video
  Animate still images into videos. Bring photos to life with natural motion.
  Read the block page →
  Implementations:
  - Stable Video Diffusion [open-source]
  - Runway Gen-3 Alpha [api]
  - Kling [api]
  - Pika [api]
- 08 Image → Text: Visual Question Answering
  Answer natural language questions about images. Combines vision and language understanding.
  Read the block page →
  Implementations:
  - GPT-4V [api]
  - Claude 3.5 Sonnet [api]
  - LLaVA [open-source]
  - Qwen-VL [open-source]
  - + 1 more on the block page
- 09 Image → Image: Image Transformation
  Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.
  Read the block page →
  Implementations:
  - Stable Diffusion XL [open-source]
  - ControlNet [open-source]
  - InstructPix2Pix [open-source]
  - Real-ESRGAN [open-source]
  - + 1 more on the block page
- 10 Image → Text: Optical Character Recognition
  Detect and read text in images and documents. Core for document intake, receipts, and scene text search.
  Read the block page →
  Implementations:
  - PaddleOCR [open-source]
  - TrOCR [open-source]
  - Tesseract [open-source]
- 11 Image → Structured Data: Pose Estimation
  Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.
  Read the block page →
  Implementations:
  - RTMPose (MMPose) [open-source]
  - MoveNet [open-source]
  - OpenPose [open-source]
- 12 Image → Structured Data: Optical Flow
  Estimate pixel-wise motion between frames. Useful for video editing, stabilization, and robotics.
  Read the block page →
  Implementations:
  - RAFT [open-source]
  - GMFlow [open-source]
  - FlowFormer [open-source]
- 13 Image → Image: Background Removal
  Segment foreground and remove or replace backgrounds for product photos and portraits.
  Read the block page →
  Implementations:
  - MODNet [open-source]
  - U^2-Net [open-source]
  - Segment Anything [open-source]
- 14 Image → Image: Face Anonymization
  Blur, mask, or re-synthesize faces to protect privacy in images and video frames.
  Read the block page →
  Implementations:
  - DeepPrivacy2 [open-source]
  - YOLOv8-face + OpenCV [open-source]
  - BriaRMBG + Blur [open-source]
- 15 Image → Structured Data: Chart and Table Understanding
  Parse charts, diagrams, and tables into structured data for analysis and QA.
  Read the block page →
  Implementations:
  - Table Transformer (TATR) [open-source]
  - DocTR [open-source]
  - ChartQA Models [open-source]
Audio, in and out.
- 01 Audio → Text: Speech Recognition
  Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.
  Read the block page →
  Implementations:
  - OpenAI Whisper API [api]
  - Whisper (local) [open-source]
  - Deepgram [api]
  - AssemblyAI [api]
  - + 2 more on the block page
- 02 Audio → Structured Data: Audio Classification
  Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.
  Read the block page →
  Implementations:
  - Audio Spectrogram Transformer (AST) [open-source]
  - Wav2Vec2 [open-source]
  - CLAP [open-source]
  - YAMNet [open-source]
  - + 1 more on the block page
- 03 Audio → Structured Data: Voice Activity Detection
  Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.
  Read the block page →
  Implementations:
  - Silero VAD [open-source]
  - WebRTC VAD [open-source]
  - pyannote VAD [open-source]
  - SpeechBrain VAD [open-source]
- 04 Audio → Audio: Audio Transformation
  Transform audio signals: enhance, denoise, separate sources, change voice, or convert music styles.
  Read the block page →
  Implementations:
  - Demucs [open-source]
  - RVC (Retrieval-based Voice Conversion) [open-source]
  - so-vits-svc [open-source]
  - DeepFilterNet [open-source]
  - + 1 more on the block page
- 05 Audio → Structured Data: Speaker Diarization
  Separate 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.
  Read the block page →
  Implementations:
  - pyannote.audio [open-source]
  - NVIDIA NeMo Diarization [open-source]
  - Resemblyzer [open-source]
- 06 Audio → Structured Data: Keyword Spotting
  Detect wake words and short commands with low latency and tiny footprints.
  Read the block page →
  Implementations:
  - Google Speech Commands KWS [open-source]
  - openWakeWord [open-source]
  - Picovoice Porcupine [api]
- 07 Audio → Structured Data: Speech Emotion Recognition
  Classify speaker emotion or affective state from voice.
  Read the block page →
  Implementations:
  - SpeechBrain SER [open-source]
  - Wav2Vec2-Emotion [open-source]
  - Emo-CLAP [open-source]
- 08 Audio → Audio: Voice Cloning
  Replicate a speaker’s voice or convert one voice to another (speech-to-speech).
  Read the block page →
  Implementations:
  - RVC [open-source]
  - so-vits-svc [open-source]
  - OpenVoice [open-source]
- 09 Audio → Structured Data: Audio Watermark Detection
  Detect or verify watermarks in synthetic or distributed audio.
  Read the block page →
  Implementations:
  - Audiowmark [open-source]
  - AudioSeal Detector [open-source]
  - Stable Signature (beta) [api]
Video, in and out.
- 01 Video → Text: Video Understanding
  Understand and describe video content. Powers video search, summarization, and analysis.
  Read the block page →
  Implementations:
  - Gemini 1.5 Pro [api]
  - GPT-4V (with frames) [api]
  - VideoLLaMA [open-source]
  - InternVideo2 [open-source]
  - + 1 more on the block page
- 02 Video → Structured Data: Action Recognition
  Classify actions or activities in video clips for safety, sports, and analytics.
  Read the block page →
  Implementations:
  - TimeSformer [open-source]
  - VideoMAE [open-source]
  - SlowFast [open-source]
- 03 Video → Structured Data: Multi-Object Tracking
  Track multiple objects across video frames with consistent identities.
  Read the block page →
  Implementations:
  - ByteTrack [open-source]
  - StrongSORT/BoT-SORT [open-source]
  - OC-SORT [open-source]
- 04 Video → Text: Video OCR
  Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.
  Read the block page →
  Implementations:
  - PaddleOCR + FFmpeg [open-source]
  - EasyOCR [open-source]
  - TrOCR + SAM tracking [open-source]
- 05 Video → Audio: Audio-Visual Speech Separation
  Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.
  Read the block page →
  Implementations:
  - AV-Separation (MS3) [open-source]
  - SpeechSplit + Visual Conditioning [open-source]
  - NeMo AV-Diarization [open-source]
Document, in and out.
- 01 Document → Structured Data: Document Extraction
  Extract structured information from documents like PDFs, invoices, forms, and contracts.
  Read the block page →
  Implementations:
  - Docling (IBM) [open-source]
  - Unstructured.io [open-source]
  - Azure Document Intelligence [api]
  - Google Document AI [api]
  - + 1 more on the block page
- 02 Document → Text: Document Question Answering
  Answer questions about document content including text, tables, and layouts. Essential for document AI.
  Read the block page →
  Implementations:
  - GPT-4V [api]
  - LayoutLMv3 [open-source]
  - Donut [open-source]
  - DocVQA-BERT [open-source]
  - + 1 more on the block page
- 03 Document → Text: Document RAG Pipeline
  Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and documents, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.
  Read the block page →
  Implementations:
  - LlamaIndex [open-source]
  - LangChain [open-source]
  - Haystack [open-source]
  - RAGFlow [open-source]
  - + 4 more on the block page
Blocks, composed.
Frequently-assembled chains. Each is a reading of several blocks as one operation; the individual blocks remain linked for substitution.
Direct Visual Search
Embed images directly with CLIP/SigLIP, search by text or image query.
Image to Vector (Image → Vector) → Text to Vector (Text → Vector)
Good for:
- Photo library search
- E-commerce visual search
Strengths:
- Real-time indexing
- Text-to-image search
- Simple pipeline
Trade-offs:
- May miss fine details
- Abstract concepts can be weak
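Direct visual search works because the text query and the images are embedded into the same vector space, so retrieval reduces to nearest-neighbour lookup by cosine similarity. The sketch below assumes that shared space and uses hand-written three-dimensional vectors as stand-ins for real CLIP or SigLIP embeddings; the file names, vectors, and function names are hypothetical.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector: list, index: dict, top_k: int = 2) -> list:
    """Return the top_k (name, similarity) pairs, best match first."""
    scored = [(name, cosine(query_vector, vector)) for name, vector in index.items()]
    # Sort by similarity descending, then by name so ties are deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]


# Toy image index: in practice each vector comes from an image encoder.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Toy query vector: in practice the text encoder's output for a query string.
text_query = [0.8, 0.2, 0.1]

results = search(text_query, image_index)
```

A linear scan like this is fine for small collections; larger indexes usually rely on an approximate nearest-neighbour library instead.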
Caption + RAG Visual Search
Generate captions for images, embed captions, search via text RAG.
Image to Text (Image → Text) → Text to Vector (Text → Vector)
Good for:
- Detailed scene search
- Accessibility-first apps
Strengths:
- Human-readable index
- Can describe complex scenes
- Debuggable
Trade-offs:
- Slower indexing
- Caption quality limits retrieval
- Higher cost
Document RAG Pipeline
Extract text from documents, chunk, embed, retrieve, generate with LLM.
Document to Structured (Document → Structured Data) → Text to Vector (Text → Vector) → Text to Text (Text → Text)
Good for:
- Enterprise search
- Legal document QA
- Knowledge base
Strengths:
- Grounds LLM in your data
- Citable sources
Trade-offs:
- Chunking strategy matters
- Multi-step latency
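The chunk, embed, retrieve steps of this chain can be sketched without any external service. Everything here is an illustrative assumption: the embedder is a bag-of-words counter standing in for a dense embedding model, the chunk size and overlap are arbitrary, the sample text is made up, and the final generation step is left as a prompt string that would be passed to whichever LLM is used.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 12, overlap: int = 4) -> list:
    """Split text into overlapping word windows."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]


def embed(text: str) -> Counter:
    """Stand-in embedder: word counts instead of a learned dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)  # Counter gives 0 if missing
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list, top_k: int = 2) -> list:
    """Rank chunks by similarity to the question; earlier chunks win ties."""
    query = embed(question)
    order = sorted(
        range(len(chunks)),
        key=lambda i: (-cosine(query, embed(chunks[i])), i),
    )
    return [chunks[i] for i in order[:top_k]]


def build_prompt(question: str, context: list) -> str:
    """Number the retrieved chunks so the answer can cite its sources."""
    sources = "\n".join(f"[{n}] {text}" for n, text in enumerate(context, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )


document = (
    "Invoices are due within thirty days of receipt. "
    "Late payments incur a two percent monthly fee. "
    "Refunds are processed within ten business days. "
    "Contact the billing team for disputes about any invoice."
)
question = "When are refunds processed?"

chunks = chunk(document)
top_chunks = retrieve(question, chunks)
prompt = build_prompt(question, top_chunks)
```

The fixed word windows cut across sentence boundaries, which is one concrete way the "chunking strategy matters" trade-off shows up: two adjacent chunks can score identically for the same question because each holds part of the relevant sentence's surroundings.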
Voice Assistant Pipeline
Speech-to-text, process with LLM, text-to-speech response.
Good for:
- Voice assistants
- Call center bots
- Accessibility
Strengths:
- Natural interaction
- Hands-free
Trade-offs:
- Latency stacks up
- Error propagation
Where each block lands.
- /tasks (All tasks): The alphabetical index of every benchmark the registry tracks.
- /vision (Vision router): The image-and-video benchmark pages, with current SOTA and reproduction notes.
- /ocr (OCR & document AI): Document-input blocks with verified CER and F1 on the canonical splits.
- /speech (Speech): Speech-input blocks (ASR, diarisation, TTS), tracked by WER and MOS.
- /llm (LLMs): Text-input and text-output blocks, benchmarked across reasoning, coding and retrieval.
- /methodology (Methodology): How the registry gates every number each of these blocks points to.