Four tasks, one backbone.
Classification — “is this a cat?” — is the oldest framing and the one that set the pace. After AlexNet in 2012, ResNet’s skip connections (2015) and the Vision Transformer (2020) pushed ImageNet top-1 from ~74% (the pre-ResNet plateau) to ~90%. Detection and segmentation are classification at finer spatial granularity: boxes and masks over the same feature maps.
The modern pattern is a shared encoder — a ViT, a ConvNeXt, or a hierarchical Swin backbone — feeding a task head. Co-DETR and DINO attach a transformer decoder for set-prediction detection; Mask2Former unifies semantic, instance and panoptic heads over the same features; SAM 2 trains the head to accept point, box and mask prompts.
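The shared-encoder pattern can be sketched in a few lines. This is a shape-level illustration only: `encode`, `cls_head` and `mask_head` are hypothetical names, the "backbone" just emits a random feature map of the right shape, and the heads are single linear layers standing in for real decoder stacks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "backbone": maps an (H, W, 3) image to an (H/4, W/4, C) feature
# map. It ignores pixel content — only the shapes mirror a real encoder.
def encode(image, channels=64):
    h, w, _ = image.shape
    return rng.standard_normal((h // 4, w // 4, channels))

# Classification head: global-average-pool the feature map, then one
# linear layer to class logits.
def cls_head(feats, num_classes=10):
    pooled = feats.mean(axis=(0, 1))                       # (C,)
    w = rng.standard_normal((feats.shape[-1], num_classes))
    return pooled @ w                                      # (num_classes,)

# Segmentation head: a 1x1 "conv" (per-pixel linear map) over the SAME
# features, giving per-pixel class logits.
def mask_head(feats, num_classes=10):
    w = rng.standard_normal((feats.shape[-1], num_classes))
    return feats @ w                                       # (H/4, W/4, num_classes)

image = rng.standard_normal((64, 64, 3))
feats = encode(image)            # computed once
cls_logits = cls_head(feats)     # (10,)
mask_logits = mask_head(feats)   # (16, 16, 10)
```

The point the article makes lives in the last three lines: the expensive call (`encode`) runs once, and each task head is a cheap map over the same features — detection heads like Co-DETR's decoder slot in the same way.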
What changed after 2023 is that the encoder can be multimodal. CLIP and SigLIP align image and text into a shared vector space; the same embedding that powers visual search also conditions a vision-language model like Qwen3-VL or GPT-5.4, which then runs document parsing, VQA and captioning through one decoder.
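The image-text alignment that CLIP and SigLIP learn reduces, at inference time, to cosine similarity in one vector space. A minimal sketch, assuming normalized embeddings — the vectors below are random placeholders for what the trained encoders would produce:

```python
import numpy as np

# L2-normalize rows so that a dot product equals cosine similarity.
def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(1)

# Placeholder embeddings: 4 images and 3 captions in a shared 512-d space.
# In a real system these come from the CLIP/SigLIP image and text towers.
image_embs = normalize(rng.standard_normal((4, 512)))
text_embs = normalize(rng.standard_normal((3, 512)))

# Similarity matrix: rows are captions, columns are images. The argmax of
# each row is the retrieved image for that caption — visual search in one
# matrix multiply.
sims = text_embs @ image_embs.T   # (3, 4) cosine similarities
best = sims.argmax(axis=1)        # best-matching image per caption
```

Zero-shot classification is the same operation with class-name prompts as the text side, and a VLM conditions its decoder on the image tower's output rather than on the pooled similarity.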
Document OCR is the current hinge task. A VLM that reads PDFs well has to handle layout, table structure, multi-column text and chart reasoning — a broader surface than any prior CV benchmark. That’s why PaddleOCR-VL, Qwen3-VL and dots.ocr now lead on benchmarks where hand-tuned OCR pipelines led a year ago.