Vision & Documents
Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.
Tasks in Vision & Documents
Image Classification
Categorizing images into predefined classes.
Object Detection
Locating and classifying objects in images.
Semantic Segmentation
Assigning a class label to every pixel in an image.
Instance Segmentation
Delineating each individual object instance with its own pixel-level mask.
Depth Estimation
Estimating scene depth from visual input.
Pose / Keypoint Detection
Detecting body, object, or hand keypoints.
Video Understanding
Recognizing and interpreting actions, events, and semantics in video.
Document OCR
Converting scanned documents and images into machine-readable text.
Scene Text Detection
Detecting text regions in natural scene images.
Scene Text Recognition
Recognizing text content in natural scene images.
Handwriting Recognition
Recognizing handwritten text from images.
Document Layout Analysis
Identifying document blocks, tables, figures, and reading order.
Document Parsing
Converting PDFs and document images into structured formats.
Table Recognition
Extracting table structure and cells from documents.
Document Question Answering
Answering questions over visual documents.
Visual Anomaly Detection
Finding visual defects and outliers in images.
Explore Other Areas
Language & Knowledge
Language understanding, retrieval, QA, RAG, factuality, information extraction, multilingual evaluation, and knowledge-heavy reasoning.
Audio & Speech
ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.
Multimodal Media
Cross-modal image, text, audio, video, and 3D tasks where input and output span multiple media types.
Code & Software Engineering
Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.