Codesota · Building blocks · The composable operations of a pipeline · Issue: April 22, 2026

Editorial · Building blocks

The composable operations of AI.

Every AI pipeline reduces to a sequence of typed transformations: something goes in, something else comes out. This index catalogues those transformations by input modality, lists the implementations that currently matter, and links each block to its registry page.

Start from what you have — image, text, audio, video, a document — and follow the arrows to what you need.
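The idea of a pipeline as a chain of typed transformations can be sketched in a few lines of Python. This is a minimal illustration, not part of any registry: the `Block` class, the `compose` helper, and the two stub lambdas are hypothetical names invented here, and the stubs stand in for real models.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Block:
    """One typed transformation: consumes `source`, emits `target`."""
    name: str
    source: str
    target: str
    run: Callable[[Any], Any]


def compose(blocks: List[Block]) -> Block:
    """Chain blocks, refusing any pair whose modalities do not line up."""
    if not blocks:
        raise ValueError("need at least one block")
    for left, right in zip(blocks, blocks[1:]):
        if left.target != right.source:
            raise TypeError(
                f"{left.name} emits {left.target}, "
                f"but {right.name} expects {right.source}"
            )

    def run_all(value: Any) -> Any:
        for block in blocks:
            value = block.run(value)
        return value

    name = " -> ".join(b.name for b in blocks)
    return Block(name, blocks[0].source, blocks[-1].target, run_all)


# Stubs standing in for real models (hypothetical, for illustration only).
caption = Block("Image Captioning", "Image", "Text",
                lambda image: "a dog on a beach")
embed = Block("Text Embedding", "Text", "Vector",
              lambda text: [float(len(word)) for word in text.split()])

caption_search = compose([caption, embed])
print(caption_search.source, "->", caption_search.target)  # Image -> Vector
```

Reversing the order, `compose([embed, caption])`, raises a `TypeError` because a Vector output cannot feed an Image input; that check is the "follow the arrows" rule made explicit.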

§ From text

Text, in and out.

18 blocks · LLM registry →

01
Text→Vector
Text Embedding
Convert text into dense vector representations for semantic search, clustering, and retrieval.
Read the block page →
Implementations
- OpenAI text-embedding-3-large · api
- Cohere embed-v3 · api
- Voyage AI voyage-3 · api
- BGE-M3 · open-source
- + 2 more on the block page
02
Text→Text
Language Model
Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.
Read the block page →
Implementations
- GPT-4o · api
- Claude 3.5 Sonnet · api
- Gemini 1.5 Pro · api
- Llama 3.1 405B · open-source
- + 2 more on the block page
03
Text→Image
Image Generation
Generate images from text descriptions. Powers creative tools, marketing, and synthetic data.
Read the block page →
Implementations
- DALL-E 3 · api
- Midjourney · api
- Stable Diffusion 3 · open-source
- FLUX.1 · open-source
- + 1 more on the block page
04
Text→Audio
Text to Speech
Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.
Read the block page →
Implementations
- ElevenLabs · api
- OpenAI TTS · api
- Coqui XTTS · open-source
- Bark · open-source
05
Text→3D Model
Text to 3D
Generate 3D models from text descriptions. Enables rapid prototyping and creative 3D content generation.
Read the block page →
Implementations
- Shap-E · open-source
- Point-E · open-source
- MVDream · open-source
- Meshy · api
- + 1 more on the block page
06
Text→Video
Text to Video
Generate videos from text descriptions. The frontier of generative AI for content creation.
Read the block page →
Implementations
- Sora · api
- Runway Gen-3 · api
- Kling · api
- CogVideoX · open-source
- + 1 more on the block page
07
Text→Structured Data
Text Classification
Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.
Read the block page →
Implementations
- SetFit · open-source
- DistilBERT · open-source
- DeBERTa-v3 · open-source
- BART-large-mnli · open-source
- + 1 more on the block page
08
Text→Text
Machine Translation
Translate text between languages. Essential for global communication, localization, and cross-lingual applications.
Read the block page →
Implementations
- NLLB-200 · open-source
- Google Cloud Translation · api
- DeepL · api
- MADLAD-400 · open-source
- + 1 more on the block page
09
Text→Text
Text Summarization
Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.
Read the block page →
Implementations
- BART-large-cnn · open-source
- Pegasus · open-source
- LongT5 · open-source
- Claude · api
- + 1 more on the block page
10
Text→Text
Question Answering
Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.
Read the block page →
Implementations
- RoBERTa-SQuAD · open-source
- DPR (Dense Passage Retrieval) · open-source
- FiD (Fusion-in-Decoder) · open-source
- Perplexity API · api
- + 1 more on the block page
11
Text→Structured Data
Named Entity Recognition
Extract named entities (people, organizations, locations, dates) from text. Key for information extraction and knowledge graphs.
Read the block page →
Implementations
- spaCy · open-source
- GLiNER · open-source
- BERT-NER · open-source
- Flair · open-source
- + 1 more on the block page
12
Text→Structured Data
Cross-Encoder Reranking
Re-score retrieved passages with a cross-encoder to boost search precision.
Read the block page →
Implementations
- Cohere Rerank · api
- BGE-Reranker · open-source
- monoT5 · open-source
13
Text→Structured Data
PII Detection & Anonymization
Detect and redact personally identifiable information to stay compliant.
Read the block page →
Implementations
- Microsoft Presidio · open-source
- spaCy PII Pipelines · open-source
- AWS Comprehend PII · api
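A rough sense of what a detect-and-redact block does can be had from a pattern-based sketch. This is not how any of the listed implementations work internally; the two regular expressions, the `detect` and `redact` helpers, and the sample sentence are all assumptions made for illustration, and real detectors cover many more entity types with statistical models.

```python
import re
from typing import List, Tuple

# Illustrative patterns only (assumed here, not taken from any library).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def detect(text: str) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) spans, sorted by start position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


def redact(text: str) -> str:
    """Replace each detected span with a <LABEL> placeholder."""
    out, cursor = [], 0
    for label, start, end in detect(text):
        if start < cursor:  # skip a span that overlaps an earlier one
            continue
        out.append(text[cursor:start])
        out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sample = "Mail ana@example.com or call +1 415 555 0100."
print(redact(sample))  # Mail <EMAIL> or call <PHONE>.
```

Returning spans from `detect` separately from `redact` matches the block's Text→Structured Data signature: the structured output is the span list, and redaction is one possible consumer of it.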
14
Text→Structured Data
Hallucination Detection
Score or flag generated text for factuality and grounding.
Read the block page →
Implementations
- RAGAS · open-source
- SelfCheckGPT · open-source
- G-Eval · open-source
15
Text→Text
Controllable Generation
Generate text with constraints on style, length, structure, or safety guardrails.
Read the block page →
Implementations
- Guidance/Outlines · open-source
- Guardrails AI · open-source
- NeMo Guardrails · open-source
16
Text→Text
Long-Context Summarization
Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.
Read the block page →
Implementations
- Gemini 1.5 Pro · api
- Claude 3.5 Sonnet 200K · api
- Llama 3.1 70B 128K · open-source
17
Text→Text
Code Generation & Repair
Generate, refactor, or fix code with language models specialized for programming.
Read the block page →
Implementations
- GPT-4o (Code) · api
- DeepSeek-Coder-V2 · open-source
- CodeLlama 70B Instruct · open-source
18
Text→Structured Data
Hybrid Sparse + Dense Retrieval
Combine lexical (BM25) and dense retrieval with weighted fusion or cascades to improve recall and precision for search and RAG.
Read the block page →
Implementations
- Elasticsearch + ELSER/BM25 · open-source
- Pyserini + Faiss · open-source
- Weaviate Hybrid · open-source
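The weighted-fusion variant of hybrid retrieval can be sketched as follows. This is one simple way to do it under stated assumptions: scores are min-max normalized per retriever, `alpha` is a hypothetical weight on the dense side, and the score dictionaries are made-up numbers rather than output from BM25 or an embedding model. The listed engines may use other fusion schemes.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(lexical: Dict[str, float],
         dense: Dict[str, float],
         alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Rank documents by alpha * dense + (1 - alpha) * lexical.

    A document missing from one retriever contributes 0 from that side.
    """
    lex, den = min_max(lexical), min_max(dense)
    fused = {
        doc: alpha * den.get(doc, 0.0) + (1 - alpha) * lex.get(doc, 0.0)
        for doc in set(lex) | set(den)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores for four documents (illustrative only).
bm25_scores = {"a": 12.0, "b": 7.0, "c": 2.0}
dense_scores = {"b": 0.91, "c": 0.88, "d": 0.40}

ranking = [doc for doc, _ in fuse(bm25_scores, dense_scores)]
print(ranking)  # ['b', 'a', 'c', 'd']
```

Document `b` wins because it scores reasonably on both sides, while `a` is strong lexically but absent from the dense results. Setting `alpha=0.0` falls back to the pure lexical order.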

§ From image

Image, in and out.

15 blocks · Vision router →

01
Image→Vector
Image Embedding
Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.
Read the block page →
Implementations
- OpenAI CLIP · open-source
- SigLIP · open-source
- OpenCLIP · open-source
- DINOv2 · open-source
02
Image→Text
Image Captioning
Generate natural language descriptions of image content. Enables text-based search over visual content.
Read the block page →
Implementations
- GPT-4 Vision · api
- Claude 3.5 Sonnet · api
- LLaVA · open-source
- BLIP-2 · open-source
- + 1 more on the block page
03
Image→Bounding Boxes
Object Detection
Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.
Read the block page →
Implementations
- YOLOv8/YOLOv11 · open-source
- RT-DETR · open-source
- Grounding DINO · open-source
- Florence-2 · open-source
- + 1 more on the block page
04
Image→Segmentation Mask
Image Segmentation
Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.
Read the block page →
Implementations
- Segment Anything (SAM) · open-source
- SAM 2 · open-source
- Mask2Former · open-source
- YOLOv8-seg · open-source
- + 1 more on the block page
05
Image→Depth Map
Depth Estimation
Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.
Read the block page →
Implementations
- Depth Anything V2 · open-source
- MiDaS · open-source
- ZoeDepth · open-source
- Marigold · open-source
06
Image→3D Model
Image to 3D
Generate 3D models from single or multiple images. Powers 3D asset creation, VR/AR, and e-commerce.
Read the block page →
Implementations
- TripoSR · open-source
- LGM (Large Multi-View Gaussian Model) · open-source
- InstantMesh · open-source
- CSM (Common Sense Machines) · api
- + 1 more on the block page
07
Image→Video
Image to Video
Animate still images into videos. Bring photos to life with natural motion.
Read the block page →
Implementations
- Stable Video Diffusion · open-source
- Runway Gen-3 Alpha · api
- Kling · api
- Pika · api
08
Image→Text
Visual Question Answering
Answer natural language questions about images. Combines vision and language understanding.
Read the block page →
Implementations
- GPT-4V · api
- Claude 3.5 Sonnet · api
- LLaVA · open-source
- Qwen-VL · open-source
- + 1 more on the block page
09
Image→Image
Image Transformation
Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.
Read the block page →
Implementations
- Stable Diffusion XL · open-source
- ControlNet · open-source
- InstructPix2Pix · open-source
- Real-ESRGAN · open-source
- + 1 more on the block page
10
Image→Text
Optical Character Recognition
Detect and read text in images and documents. Core for document intake, receipts, and scene text search.
Read the block page →
Implementations
- PaddleOCR · open-source
- TrOCR · open-source
- Tesseract · open-source
11
Image→Structured Data
Pose Estimation
Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.
Read the block page →
Implementations
- RTMPose (MMPose) · open-source
- MoveNet · open-source
- OpenPose · open-source
12
Image→Structured Data
Optical Flow
Estimate pixel-wise motion between frames. Useful for video editing, stabilization, and robotics.
Read the block page →
Implementations
- RAFT · open-source
- GMFlow · open-source
- FlowFormer · open-source
13
Image→Image
Background Removal
Segment foreground and remove or replace backgrounds for product photos and portraits.
Read the block page →
Implementations
- MODNet · open-source
- U^2-Net · open-source
- Segment Anything · open-source
14
Image→Image
Face Anonymization
Blur, mask, or re-synthesize faces to protect privacy in images and video frames.
Read the block page →
Implementations
- DeepPrivacy2 · open-source
- YOLOv8-face + OpenCV · open-source
- BriaRMBG + Blur · open-source
15
Image→Structured Data
Chart and Table Understanding
Parse charts, diagrams, and tables into structured data for analysis and QA.
Read the block page →
Implementations
- Table Transformer (TATR) · open-source
- DocTR · open-source
- ChartQA Models · open-source

§ From audio

Audio, in and out.

9 blocks · Speech registry →

01
Audio→Text
Speech Recognition
Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.
Read the block page →
Implementations
- OpenAI Whisper API · api
- Whisper (local) · open-source
- Deepgram · api
- AssemblyAI · api
- + 2 more on the block page
02
Audio→Structured Data
Audio Classification
Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.
Read the block page →
Implementations
- Audio Spectrogram Transformer (AST) · open-source
- Wav2Vec2 · open-source
- CLAP · open-source
- YAMNet · open-source
- + 1 more on the block page
03
Audio→Structured Data
Voice Activity Detection
Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.
Read the block page →
Implementations
- Silero VAD · open-source
- WebRTC VAD · open-source
- pyannote VAD · open-source
- SpeechBrain VAD · open-source
04
Audio→Audio
Audio Transformation
Transform audio signals: enhance, denoise, separate sources, change voice, or convert music styles.
Read the block page →
Implementations
- Demucs · open-source
- RVC (Retrieval-based Voice Conversion) · open-source
- so-vits-svc · open-source
- DeepFilterNet · open-source
- + 1 more on the block page
05
Audio→Structured Data
Speaker Diarization
Separate 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.
Read the block page →
Implementations
- pyannote.audio · open-source
- NVIDIA NeMo Diarization · open-source
- Resemblyzer · open-source
06
Audio→Structured Data
Keyword Spotting
Detect wake words and short commands with low latency and tiny footprints.
Read the block page →
Implementations
- Google Speech Commands KWS · open-source
- openWakeWord · open-source
- Picovoice Porcupine · api
07
Audio→Structured Data
Speech Emotion Recognition
Classify speaker emotion or affective state from voice.
Read the block page →
Implementations
- SpeechBrain SER · open-source
- Wav2Vec2-Emotion · open-source
- Emo-CLAP · open-source
08
Audio→Audio
Voice Cloning
Replicate a speaker’s voice or convert one voice to another (speech-to-speech).
Read the block page →
Implementations
- RVC · open-source
- so-vits-svc · open-source
- OpenVoice · open-source
09
Audio→Structured Data
Audio Watermark Detection
Detect or verify watermarks in synthetic or distributed audio.
Read the block page →
Implementations
- Audiowmark · open-source
- AudioSeal Detector · open-source
- Stable Signature (beta) · api

§ From video

Video, in and out.

5 blocks · Vision router →

01
Video→Text
Video Understanding
Understand and describe video content. Powers video search, summarization, and analysis.
Read the block page →
Implementations
- Gemini 1.5 Pro · api
- GPT-4V (with frames) · api
- VideoLLaMA · open-source
- InternVideo2 · open-source
- + 1 more on the block page
02
Video→Structured Data
Action Recognition
Classify actions or activities in video clips for safety, sports, and analytics.
Read the block page →
Implementations
- TimeSformer · open-source
- VideoMAE · open-source
- SlowFast · open-source
03
Video→Structured Data
Multi-Object Tracking
Track multiple objects across video frames with consistent identities.
Read the block page →
Implementations
- ByteTrack · open-source
- StrongSORT/BoT-SORT · open-source
- OC-SORT · open-source
04
Video→Text
Video OCR
Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.
Read the block page →
Implementations
- PaddleOCR + FFmpeg · open-source
- EasyOCR · open-source
- TrOCR + SAM tracking · open-source
05
Video→Audio
Audio-Visual Speech Separation
Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.
Read the block page →
Implementations
- AV-Separation (MS3) · open-source
- SpeechSplit + Visual Conditioning · open-source
- NeMo AV-Diarization · open-source

§ From document

Document, in and out.

3 blocks · OCR & document AI →

01
Document→Structured Data
Document Extraction
Extract structured information from documents like PDFs, invoices, forms, and contracts.
Read the block page →
Implementations
- Docling (IBM) · open-source
- Unstructured.io · open-source
- Azure Document Intelligence · api
- Google Document AI · api
- + 1 more on the block page
02
Document→Text
Document Question Answering
Answer questions about document content including text, tables, and layouts. Essential for document AI.
Read the block page →
Implementations
- GPT-4V · api
- LayoutLMv3 · open-source
- Donut · open-source
- DocVQA-BERT · open-source
- + 1 more on the block page
03
Document→Text
Document RAG Pipeline
Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and documents, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.
Read the block page →
Implementations
- LlamaIndex · open-source
- LangChain · open-source
- Haystack · open-source
- RAGFlow · open-source
- + 4 more on the block page

§ Common pipelines

Blocks, composed.

Frequently-assembled chains. Each is a reading of several blocks as one operation; the individual blocks remain linked for substitution.

Direct Visual Search
Embed images directly with CLIP or SigLIP, then search by text or image query.
Image to Vector (Image → Vector) → Text to Vector (Text → Vector)
Good for
- Photo library search
- E-commerce visual search
Strengths
- Real-time indexing
- Text-to-image search
- Simple pipeline
Trade-offs
- May miss fine details
- Abstract concepts can be weak
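The search step of this pipeline comes down to nearest-neighbour lookup in a shared embedding space. The sketch below assumes the image and text embeddings already exist and live in the same space; the three-dimensional vectors, file names, and the `cosine` and `search` helpers are invented for illustration and do not come from CLIP or SigLIP.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: Sequence[float],
           index: Dict[str, Sequence[float]],
           k: int = 2) -> List[Tuple[str, float]]:
    """Return the k indexed items most similar to the query vector."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: -pair[1])[:k]


# Hypothetical image embeddings (real ones have hundreds of dimensions).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the text query "sunny coastline".
text_query = [0.8, 0.2, 0.1]

for name, score in search(text_query, image_index):
    print(f"{name}: {score:.2f}")
```

Because the query is only a vector, the same `search` call serves image-to-image lookup: pass an image embedding instead of a text embedding.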
Caption + RAG Visual Search
Generate captions for images, embed the captions, and search them with text RAG.
Image to Text (Image → Text) → Text to Vector (Text → Vector)
Good for
- Detailed scene search
- Accessibility-first apps
Strengths
- Human-readable index
- Can describe complex scenes
- Debuggable
Trade-offs
- Slower indexing
- Caption quality limits retrieval
- Higher cost
Document RAG Pipeline
Extract text from documents, chunk it, embed the chunks, retrieve the relevant ones, and generate an answer with an LLM.
Document to Structured (Document → Structured Data) → Text to Vector (Text → Vector) → Text to Text (Text → Text)
Good for
- Enterprise search
- Legal document QA
- Knowledge base
Strengths
- Grounds LLM in your data
- Citable sources
Trade-offs
- Chunking strategy matters
- Multi-step latency
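The chunk, embed, retrieve, and prompt steps can be walked through with a toy example. Everything here is an assumption made for illustration: `embed` is a word-count stand-in for a real embedding model, the chunk size and overlap are arbitrary, the document text is made up, and the final LLM call is left out, so the sketch stops at the prompt a generator would receive.

```python
import math
import re
from collections import Counter
from typing import List


def chunk(text: str, size: int = 12, overlap: int = 4) -> List[str]:
    """Split text into word windows that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: similarity(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Number the sources so the generator can cite them."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return ("Answer using only the sources below and cite them.\n"
            f"{sources}\nQuestion: {question}")


document = (
    "The lease begins on 1 March and runs for twelve months. "
    "Rent is due on the first day of each month. "
    "Either party may terminate with sixty days written notice. "
    "The deposit equals two months of rent and is refundable."
)
question = "How much notice is needed to terminate?"

chunks = chunk(document)
context = retrieve(question, chunks)
print(build_prompt(question, context))
```

Changing `size` and `overlap` changes which words end up together in a chunk, and therefore what the retriever can find, which is why chunking strategy is listed as a trade-off.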
Voice Assistant Pipeline
Transcribe speech to text, process it with an LLM, and speak the response with text-to-speech.
Audio to Text (Audio → Text) → Text to Text (Text → Text) → Text to Audio (Text → Audio)
Good for
- Voice assistants
- Call center bots
- Accessibility
Strengths
- Natural interaction
- Hands-free
Trade-offs
- Latency stacks up
- Error propagation
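Both trade-offs follow from running the stages strictly in sequence, which a small simulation makes visible. The stage functions below are stubs, not real models, and the per-stage latencies are invented numbers chosen only to show how they add up; `Stage` and `run_pipeline` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]
    latency_ms: float  # assumed per-call latency, illustrative only


def run_pipeline(stages: List[Stage], value: Any) -> Tuple[Any, List[Tuple[str, float]]]:
    """Run stages in order and record cumulative latency after each one."""
    trace = []
    elapsed = 0.0
    for stage in stages:
        value = stage.run(value)
        elapsed += stage.latency_ms
        trace.append((stage.name, elapsed))
    return value, trace


# Stub stages. The recognizer deliberately mishears "weather" as "whether"
# so the mistake can be followed through the later stages.
stages = [
    Stage("Speech Recognition", lambda audio: "what is the whether", 300.0),
    Stage("Language Model", lambda text: f"You asked: {text}", 700.0),
    Stage("Text to Speech", lambda text: text.encode("utf-8"), 250.0),
]

audio_out, trace = run_pipeline(stages, b"fake-audio-bytes")
for name, elapsed in trace:
    print(f"{name}: {elapsed:.0f} ms elapsed")
print(audio_out)
```

The user hears nothing until the last stage finishes, so the wait is the sum of all three latencies, and the transcription error made in the first stage is still present in the spoken reply because no later stage has the original audio to check against.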

Related · Further reading

Where each block lands.

All routes verified live · April 2026
