Codesota · Building blocks
The composable operations of a pipeline
Issue: April 22, 2026
Editorial · Building blocks

The composable
operations of AI.

Every AI pipeline reduces to a sequence of typed transformations: something goes in, something else comes out. This index catalogues those transformations by input modality, lists the implementations that currently matter, and links each block to its registry page.

Start from what you have — image, text, audio, video, a document — and follow the arrows to what you need.
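
One way to read "typed transformations" is as functions tagged with an input and an output modality, where two blocks compose only when the modalities line up. The sketch below is illustrative only: the `Block` class, the `compose` helper, and the placeholder lambdas are hypothetical names introduced here, not part of any registry or library.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Block:
    """A hypothetical pipeline block: a function tagged with its modalities."""
    name: str
    source: str                 # input modality, e.g. "audio"
    target: str                 # output modality, e.g. "text"
    fn: Callable[[Any], Any]


def compose(blocks: List[Block]) -> Block:
    """Chain blocks, refusing any pair whose modalities do not match."""
    if not blocks:
        raise ValueError("need at least one block")
    for left, right in zip(blocks, blocks[1:]):
        if left.target != right.source:
            raise TypeError(
                f"{left.name} outputs {left.target}, "
                f"but {right.name} expects {right.source}"
            )

    def run(value: Any) -> Any:
        for block in blocks:
            value = block.fn(value)
        return value

    return Block(
        name=" > ".join(b.name for b in blocks),
        source=blocks[0].source,
        target=blocks[-1].target,
        fn=run,
    )


# Placeholder functions stand in for real models (assumptions, not real APIs).
asr = Block("asr", "audio", "text", lambda audio: "hello")
llm = Block("llm", "text", "text", lambda text: text.upper())
tts = Block("tts", "text", "audio", lambda text: b"wav:" + text.encode())

voice = compose([asr, llm, tts])   # audio in, audio out
```

Swapping one implementation for another then amounts to replacing a single `Block` with another of the same `source` and `target`.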

§ From text

Text, in and out.

  • 01
    Text → Vector
    Text Embedding

    Convert text into dense vector representations for semantic search, clustering, and retrieval.

    Read the block page →
    Implementations
    • OpenAI text-embedding-3-large (api)
    • Cohere embed-v3 (api)
    • Voyage AI voyage-3 (api)
    • BGE-M3 (open-source)
    • + 2 more on the block page
  • 02
    Text → Text
    Language Model

    Transform, generate, or reason about text. The core building block for chatbots, summarization, translation, and more.

    Read the block page →
    Implementations
    • GPT-4o (api)
    • Claude 3.5 Sonnet (api)
    • Gemini 1.5 Pro (api)
    • Llama 3.1 405B (open-source)
    • + 2 more on the block page
  • 03
    Text → Image
    Image Generation

    Generate images from text descriptions. Powers creative tools, marketing, and synthetic data.

    Read the block page →
    Implementations
    • DALL-E 3 (api)
    • Midjourney (api)
    • Stable Diffusion 3 (open-source)
    • FLUX.1 (open-source)
    • + 1 more on the block page
  • 04
    Text → Audio
    Text to Speech

    Convert text to natural-sounding speech. Powers voice assistants, audiobooks, and accessibility features.

    Read the block page →
    Implementations
    • ElevenLabs (api)
    • OpenAI TTS (api)
    • Coqui XTTS (open-source)
    • Bark (open-source)
  • 05
    Text → 3D Model
    Text to 3D

    Generate 3D models from text descriptions. Enables rapid prototyping and creative 3D content generation.

    Read the block page →
    Implementations
    • Shap-E (open-source)
    • Point-E (open-source)
    • MVDream (open-source)
    • Meshy (api)
    • + 1 more on the block page
  • 06
    Text → Video
    Text to Video

    Generate videos from text descriptions. The frontier of generative AI for content creation.

    Read the block page →
    Implementations
    • Sora (api)
    • Runway Gen-3 (api)
    • Kling (api)
    • CogVideoX (open-source)
    • + 1 more on the block page
  • 07
    Text → Structured Data
    Text Classification

    Classify text into predefined categories. Powers spam detection, sentiment analysis, topic categorization, and content moderation.

    Read the block page →
    Implementations
    • SetFit (open-source)
    • DistilBERT (open-source)
    • DeBERTa-v3 (open-source)
    • BART-large-mnli (open-source)
    • + 1 more on the block page
  • 08
    Text → Text
    Machine Translation

    Translate text between languages. Essential for global communication, localization, and cross-lingual applications.

    Read the block page →
    Implementations
    • NLLB-200 (open-source)
    • Google Cloud Translation (api)
    • DeepL (api)
    • MADLAD-400 (open-source)
    • + 1 more on the block page
  • 09
    Text → Text
    Text Summarization

    Condense long documents into concise summaries. Essential for news aggregation, research, and document processing.

    Read the block page →
    Implementations
    • BART-large-cnn (open-source)
    • Pegasus (open-source)
    • LongT5 (open-source)
    • Claude (api)
    • + 1 more on the block page
  • 10
    Text → Text
    Question Answering

    Answer questions based on context or knowledge. Foundation for chatbots, search, and knowledge systems.

    Read the block page →
    Implementations
    • RoBERTa-SQuAD (open-source)
    • DPR (Dense Passage Retrieval) (open-source)
    • FiD (Fusion-in-Decoder) (open-source)
    • Perplexity API (api)
    • + 1 more on the block page
  • 11
    Text → Structured Data
    Named Entity Recognition

    Extract named entities (people, organizations, locations, dates) from text. Key for information extraction and knowledge graphs.

    Read the block page →
    Implementations
    • spaCy (open-source)
    • GLiNER (open-source)
    • BERT-NER (open-source)
    • Flair (open-source)
    • + 1 more on the block page
  • 12
    Text → Structured Data
    Cross-Encoder Reranking

    Re-score retrieved passages with a cross-encoder to boost search precision.

    Read the block page →
    Implementations
    • Cohere Rerank (api)
    • BGE-Reranker (open-source)
    • monoT5 (open-source)
  • 13
    Text → Structured Data
    PII Detection & Anonymization

    Detect and redact personally identifiable information to stay compliant.

    Read the block page →
    Implementations
    • Microsoft Presidio (open-source)
    • spaCy PII Pipelines (open-source)
    • AWS Comprehend PII (api)
  • 14
    Text → Structured Data
    Hallucination Detection

    Score or flag generated text for factuality and grounding.

    Read the block page →
    Implementations
    • RAGAS (open-source)
    • SelfCheckGPT (open-source)
    • G-Eval (open-source)
  • 15
    Text → Text
    Controllable Generation

    Generate text with constraints on style, length, structure, or safety guardrails.

    Read the block page →
    Implementations
    • Guidance/Outlines (open-source)
    • Guardrails AI (open-source)
    • NeMo Guardrails (open-source)
  • 16
    Text → Text
    Long-Context Summarization

    Summarize 100K+ token inputs like transcripts, hearings, or books with structured outputs.

    Read the block page →
    Implementations
    • Gemini 1.5 Pro (api)
    • Claude 3.5 Sonnet 200K (api)
    • Llama 3.1 70B 128K (open-source)
  • 17
    Text → Text
    Code Generation & Repair

    Generate, refactor, or fix code with language models specialized for programming.

    Read the block page →
    Implementations
    • GPT-4o (Code) (api)
    • DeepSeek-Coder-V2 (open-source)
    • CodeLlama 70B Instruct (open-source)
  • 18
    Text → Structured Data
    Hybrid Sparse + Dense Retrieval

    Combine lexical (BM25) and dense retrieval with weighted fusion or cascades to improve recall and precision for search and RAG.

    Read the block page →
    Implementations
    • Elasticsearch + ELSER/BM25 (open-source)
    • Pyserini + Faiss (open-source)
    • Weaviate Hybrid (open-source)
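
The weighted fusion mentioned under Hybrid Sparse + Dense Retrieval can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any of the listed systems implement it: scores are min-max normalised per retriever, a document missing from one list contributes 0 for that retriever, and `alpha` is a hypothetical weight on the dense side. The function names and example scores are invented for the sketch.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Blend normalised scores: alpha weights dense, (1 - alpha) weights sparse."""
    s, d = min_max(sparse), min_max(dense)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up scores: BM25 on the left, cosine similarity on the right.
bm25_scores = {"a": 12.0, "b": 7.0, "c": 2.0}
dense_scores = {"b": 0.9, "c": 0.8, "d": 0.1}
ranking = [doc for doc, _ in weighted_fusion(bm25_scores, dense_scores)]
```

With these inputs, document `b` ranks first because it scores well in both lists, while `a` and `d` each appear in only one. Rank-based schemes such as reciprocal rank fusion avoid the normalisation step altogether and are a common alternative.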
§ From image

Image, in and out.

  • 01
    Image → Vector
    Image Embedding

    Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.

    Read the block page →
    Implementations
    • OpenAI CLIP (open-source)
    • SigLIP (open-source)
    • OpenCLIP (open-source)
    • DINOv2 (open-source)
  • 02
    Image → Text
    Image Captioning

    Generate natural language descriptions of image content. Enables text-based search over visual content.

    Read the block page →
    Implementations
    • GPT-4 Vision (api)
    • Claude 3.5 Sonnet (api)
    • LLaVA (open-source)
    • BLIP-2 (open-source)
    • + 1 more on the block page
  • 03
    Image → Bounding Boxes
    Object Detection

    Locate and classify objects in images with bounding boxes. Foundational for autonomous vehicles, surveillance, and robotics.

    Read the block page →
    Implementations
    • YOLOv8/YOLOv11 (open-source)
    • RT-DETR (open-source)
    • Grounding DINO (open-source)
    • Florence-2 (open-source)
    • + 1 more on the block page
  • 04
    Image → Segmentation Mask
    Image Segmentation

    Classify each pixel in an image. Enables precise object boundaries for medical imaging, autonomous vehicles, and image editing.

    Read the block page →
    Implementations
    • Segment Anything (SAM) (open-source)
    • SAM 2 (open-source)
    • Mask2Former (open-source)
    • YOLOv8-seg (open-source)
    • + 1 more on the block page
  • 05
    Image → Depth Map
    Depth Estimation

    Predict depth from a single image. Critical for 3D reconstruction, AR/VR, and robotics.

    Read the block page →
    Implementations
    • Depth Anything V2 (open-source)
    • MiDaS (open-source)
    • ZoeDepth (open-source)
    • Marigold (open-source)
  • 06
    Image → 3D Model
    Image to 3D

    Generate 3D models from single or multiple images. Powers 3D asset creation, VR/AR, and e-commerce.

    Read the block page →
    Implementations
    • TripoSR (open-source)
    • LGM (Large Gaussian Model) (open-source)
    • InstantMesh (open-source)
    • CSM (Common Sense Machines) (api)
    • + 1 more on the block page
  • 07
    Image → Video
    Image to Video

    Animate still images into videos. Bring photos to life with natural motion.

    Read the block page →
    Implementations
    • Stable Video Diffusion (open-source)
    • Runway Gen-3 Alpha (api)
    • Kling (api)
    • Pika (api)
  • 08
    Image → Text
    Visual Question Answering

    Answer natural language questions about images. Combines vision and language understanding.

    Read the block page →
    Implementations
    • GPT-4V (api)
    • Claude 3.5 Sonnet (api)
    • LLaVA (open-source)
    • Qwen-VL (open-source)
    • + 1 more on the block page
  • 09
    Image → Image
    Image Transformation

    Transform images: style transfer, inpainting, super-resolution, editing, or generation from image prompts.

    Read the block page →
    Implementations
    • Stable Diffusion XL (open-source)
    • ControlNet (open-source)
    • InstructPix2Pix (open-source)
    • Real-ESRGAN (open-source)
    • + 1 more on the block page
  • 10
    Image → Text
    Optical Character Recognition

    Detect and read text in images and documents. Core for document intake, receipts, and scene text search.

    Read the block page →
    Implementations
    • PaddleOCR (open-source)
    • TrOCR (open-source)
    • Tesseract (open-source)
  • 11
    Image → Structured Data
    Pose Estimation

    Detect human or object keypoints. Enables AR overlays, sports analytics, and motion capture.

    Read the block page →
    Implementations
    • RTMPose (MMPose) (open-source)
    • MoveNet (open-source)
    • OpenPose (open-source)
  • 12
    Image → Structured Data
    Optical Flow

    Estimate pixel-wise motion between frames. Useful for video editing, stabilization, and robotics.

    Read the block page →
    Implementations
    • RAFT (open-source)
    • GMFlow (open-source)
    • FlowFormer (open-source)
  • 13
    Image → Image
    Background Removal

    Segment foreground and remove or replace backgrounds for product photos and portraits.

    Read the block page →
    Implementations
    • MODNet (open-source)
    • U²-Net (open-source)
    • Segment Anything (open-source)
  • 14
    Image → Image
    Face Anonymization

    Blur, mask, or re-synthesize faces to protect privacy in images and video frames.

    Read the block page →
    Implementations
    • DeepPrivacy2 (open-source)
    • YOLOv8-face + OpenCV (open-source)
    • BriaRMBG + Blur (open-source)
  • 15
    Image → Structured Data
    Chart and Table Understanding

    Parse charts, diagrams, and tables into structured data for analysis and QA.

    Read the block page →
    Implementations
    • Table Transformer (TATR) (open-source)
    • DocTR (open-source)
    • ChartQA Models (open-source)
§ From audio

Audio, in and out.

  • 01
    Audio → Text
    Speech Recognition

    Transcribe spoken audio into text. The foundation for voice interfaces, meeting transcription, and audio search.

    Read the block page →
    Implementations
    • OpenAI Whisper API (api)
    • Whisper (local) (open-source)
    • Deepgram (api)
    • AssemblyAI (api)
    • + 2 more on the block page
  • 02
    Audio → Structured Data
    Audio Classification

    Classify audio into categories like music genres, environmental sounds, speaker emotions, or speech commands.

    Read the block page →
    Implementations
    • Audio Spectrogram Transformer (AST) (open-source)
    • Wav2Vec2 (open-source)
    • CLAP (open-source)
    • YAMNet (open-source)
    • + 1 more on the block page
  • 03
    Audio → Structured Data
    Voice Activity Detection

    Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.

    Read the block page →
    Implementations
    • Silero VAD (open-source)
    • WebRTC VAD (open-source)
    • pyannote VAD (open-source)
    • SpeechBrain VAD (open-source)
  • 04
    Audio → Audio
    Audio Transformation

    Transform audio signals: enhance, denoise, separate sources, change voice, or convert music styles.

    Read the block page →
    Implementations
    • Demucs (open-source)
    • RVC (Retrieval-based Voice Conversion) (open-source)
    • so-vits-svc (open-source)
    • DeepFilterNet (open-source)
    • + 1 more on the block page
  • 05
    Audio → Structured Data
    Speaker Diarization

    Separate 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.

    Read the block page →
    Implementations
    • pyannote.audio (open-source)
    • NVIDIA NeMo Diarization (open-source)
    • Resemblyzer (open-source)
  • 06
    Audio → Structured Data
    Keyword Spotting

    Detect wake words and short commands with low latency and tiny footprints.

    Read the block page →
    Implementations
    • Google Speech Commands KWS (open-source)
    • openWakeWord (open-source)
    • Picovoice Porcupine (api)
  • 07
    Audio → Structured Data
    Speech Emotion Recognition

    Classify speaker emotion or affective state from voice.

    Read the block page →
    Implementations
    • SpeechBrain SER (open-source)
    • Wav2Vec2-Emotion (open-source)
    • Emo-CLAP (open-source)
  • 08
    Audio → Audio
    Voice Cloning

    Replicate a speaker’s voice or convert one voice to another (speech-to-speech).

    Read the block page →
    Implementations
    • RVC (open-source)
    • so-vits-svc (open-source)
    • OpenVoice (open-source)
  • 09
    Audio → Structured Data
    Audio Watermark Detection

    Detect or verify watermarks in synthetic or distributed audio.

    Read the block page →
    Implementations
    • Audiowmark (open-source)
    • AudioSeal Detector (open-source)
    • Stable Signature (beta) (api)
§ From video

Video, in and out.

  • 01
    Video → Text
    Video Understanding

    Understand and describe video content. Powers video search, summarization, and analysis.

    Read the block page →
    Implementations
    • Gemini 1.5 Pro (api)
    • GPT-4V (with frames) (api)
    • VideoLLaMA (open-source)
    • InternVideo2 (open-source)
    • + 1 more on the block page
  • 02
    Video → Structured Data
    Action Recognition

    Classify actions or activities in video clips for safety, sports, and analytics.

    Read the block page →
    Implementations
    • TimeSformer (open-source)
    • VideoMAE (open-source)
    • SlowFast (open-source)
  • 03
    Video → Structured Data
    Multi-Object Tracking

    Track multiple objects across video frames with consistent identities.

    Read the block page →
    Implementations
    • ByteTrack (open-source)
    • StrongSORT/BoT-SORT (open-source)
    • OC-SORT (open-source)
  • 04
    Video → Text
    Video OCR

    Extract on-screen text from video frames for subtitles, broadcast monitoring, and compliance.

    Read the block page →
    Implementations
    • PaddleOCR + FFmpeg (open-source)
    • EasyOCR (open-source)
    • TrOCR + SAM tracking (open-source)
  • 05
    Video → Audio
    Audio-Visual Speech Separation

    Separate or enhance speech in videos using both audio and lip cues. Improves meeting transcription, TV/movie captioning, and noisy recordings.

    Read the block page →
    Implementations
    • AV-Separation (MS3) (open-source)
    • SpeechSplit + Visual Conditioning (open-source)
    • NeMo AV-Diarization (open-source)
§ From document

Document, in and out.

  • 01
    Document → Structured Data
    Document Extraction

    Extract structured information from documents like PDFs, invoices, forms, and contracts.

    Read the block page →
    Implementations
    • Docling (IBM) (open-source)
    • Unstructured.io (open-source)
    • Azure Document Intelligence (api)
    • Google Document AI (api)
    • + 1 more on the block page
  • 02
    Document → Text
    Document Question Answering

    Answer questions about document content including text, tables, and layouts. Essential for document AI.

    Read the block page →
    Implementations
    • GPT-4V (api)
    • LayoutLMv3 (open-source)
    • Donut (open-source)
    • DocVQA-BERT (open-source)
    • + 1 more on the block page
  • 03
    Document → Text
    Document RAG Pipeline

    Build a complete Retrieval-Augmented Generation system for documents. Parse PDFs and documents, chunk intelligently, embed for semantic search, retrieve relevant context, and generate grounded answers with LLMs.

    Read the block page →
    Implementations
    • LlamaIndex (open-source)
    • LangChain (open-source)
    • Haystack (open-source)
    • RAGFlow (open-source)
    • + 4 more on the block page
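
The stages named in the Document RAG Pipeline block (chunk, embed, retrieve, then hand context to an LLM) can be traced with a toy sketch. Everything here is a stand-in chosen so the example runs without dependencies: the bag-of-words `embed` function replaces a real embedding model, parsing is assumed to have already produced plain text, and the final LLM call is left out, with `build_prompt` only showing what would be sent. All function names are hypothetical and unrelated to the frameworks listed above.

```python
import math
import re
from collections import Counter
from typing import List


def chunk(text: str, size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into word windows of `size`, each sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a real system uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: List[str]) -> str:
    """Number the retrieved chunks so the generated answer can cite them."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{sources}\nQuestion: {question}"
    )


passages = [
    "Invoices are due within thirty days.",
    "The warranty covers parts for two years.",
    "Support is available on weekdays.",
]
top = retrieve("How long is the warranty?", passages, k=1)
prompt = build_prompt("How long is the warranty?", top)
```

Numbering the retrieved chunks in the prompt is one simple way to keep answers traceable to their sources.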
§ Common pipelines

Blocks, composed.

Frequently-assembled chains. Each is a reading of several blocks as one operation; the individual blocks remain linked for substitution.

  • Direct Visual Search

    Embed images directly with CLIP/SigLIP, search by text or image query.

    Image to Vector (Image → Vector) · Text to Vector (Text → Vector)
    Good for
    • Photo library search
    • E-commerce visual search
    Strengths
    • Real-time indexing
    • Text-to-image search
    • Simple pipeline
    Trade-offs
    • May miss fine details
    • Abstract concepts can be weak
  • Caption + RAG Visual Search

    Generate captions for images, embed captions, search via text RAG.

    Image to Text (Image → Text) · Text to Vector (Text → Vector)
    Good for
    • Detailed scene search
    • Accessibility-first apps
    Strengths
    • Human-readable index
    • Can describe complex scenes
    • Debuggable
    Trade-offs
    • Slower indexing
    • Caption quality limits retrieval
    • Higher cost
  • Document RAG Pipeline

    Extract text from documents, chunk, embed, retrieve, generate with LLM.

    Document to Structured (Document → Structured Data) · Text to Vector (Text → Vector) · Text to Text (Text → Text)
    Good for
    • Enterprise search
    • Legal document QA
    • Knowledge base
    Strengths
    • Grounds LLM in your data
    • Citable sources
    Trade-offs
    • Chunking strategy matters
    • Multi-step latency
  • Voice Assistant Pipeline

    Speech-to-text, process with LLM, text-to-speech response.

    Audio to Text (Audio → Text) · Text to Text (Text → Text) · Text to Audio (Text → Audio)
    Good for
    • Voice assistants
    • Call center bots
    • Accessibility
    Strengths
    • Natural interaction
    • Hands-free
    Trade-offs
    • Latency stacks up
    • Error propagation
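
The two trade-offs listed for the Voice Assistant Pipeline can be made concrete with back-of-envelope arithmetic. The sketch below assumes the stages run strictly one after another with no streaming, and that stage failures are independent, so latencies add and success rates multiply. The `Stage` class and every number are hypothetical placeholders, not measurements of any listed implementation.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass
class Stage:
    name: str
    latency_ms: float     # hypothetical per-stage latency
    success_rate: float   # hypothetical chance the stage output is usable


def end_to_end(stages: List[Stage]) -> Tuple[float, float]:
    """Sequential stages: latencies sum, independent success rates multiply."""
    total_latency = sum(s.latency_ms for s in stages)
    overall_success = prod(s.success_rate for s in stages)
    return total_latency, overall_success


# Illustrative numbers only.
pipeline = [
    Stage("speech-to-text", 300, 0.95),
    Stage("language model", 800, 0.98),
    Stage("text-to-speech", 400, 0.99),
]
latency, success = end_to_end(pipeline)
```

Under these assumptions the chain takes 1500 ms and succeeds about 92% of the time, which is worse than any single stage. Streaming partial results between stages is the usual way to reduce the perceived latency.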
Related · Further reading

Where each block lands.

All routes verified live · April 2026