
CVonline: Image Databases
This is a collated list of image and video databases that people
have found useful for computer vision research and algorithm evaluation.
An important article
How Good Is My Test Data? Introducing Safety Analysis for Computer Vision
(by Zendel, Murschitz, Humenberger, and Herzner)
introduces a methodology for ensuring that your dataset has sufficient
variety that algorithm results on the dataset are representative of
the results that one could expect in a real setting.
In particular, the team have produced a
Checklist of potential
hazards (imaging situations) that may cause algorithms to have problems.
Ideally, test datasets should have examples of the relevant hazards.
Index by Topic
- Action Databases
- Agriculture
- Animals (including Insects)
- Attribute recognition
- Autonomous Driving
- Biological/Medical
- Camera calibration
- Character and Document Understanding
- Event Camera Data
- Face and Eye/Iris Databases
- Fingerprints and Other Non-Face/Eye Biometric Datasets
- General Images
- General RGBD, 3D Point Cloud, and depth datasets
- General Videos
- Hand, Hand Grasp, Hand Action and Gesture Databases
- Image, Video and Shape Database Retrieval
- Object Databases
- People (static and dynamic), human body pose
- People Detection and Tracking Databases (See also Surveillance)
- Remote Sensing
- Robotics
- Scenes or Places, Scene Segmentation or Classification
- Segmentation
- Simultaneous Localization and Mapping
- Surveillance and Tracking (See also People)
- Textures
- Underwater Data
- Urban Datasets
- Vision and Natural Language
- Other Collection Pages
- Miscellaneous Topics
Other helpful sites are:
- Kaggle's computer vision dataset listing - 3615 (on June 13, 2025) ungrouped datasets (a minimal download sketch follows this list)
- Papers with code and image datasets - 3224 (on June 13, 2025) ungrouped datasets
- Academic Torrents - computer vision - a set of 30+ large datasets available in BitTorrent form
- Machine learning datasets - see CV tab
- YACVID - a tagged index to some computer vision datasets
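Datasets hosted on Kaggle can usually be fetched programmatically once an API token is installed. Below is a minimal sketch that calls the official kaggle command-line tool from Python; it assumes the kaggle package is installed, a token is present at ~/.kaggle/kaggle.json, and the dataset slug shown is a placeholder rather than a specific dataset from the listing.

    # Minimal sketch: download and unzip a Kaggle dataset via the official CLI.
    # Assumes "pip install kaggle" and an API token at ~/.kaggle/kaggle.json.
    import subprocess

    def fetch_kaggle_dataset(slug: str, dest: str = "data") -> None:
        # Equivalent to running: kaggle datasets download -d <slug> -p <dest> --unzip
        subprocess.run(
            ["kaggle", "datasets", "download", "-d", slug, "-p", dest, "--unzip"],
            check=True,
        )

    if __name__ == "__main__":
        fetch_kaggle_dataset("owner/some-vision-dataset")  # placeholder slug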
Action Databases
See also:
Action Recognition's dataset summary with league tables (Gall, Kuehne, Bhattarai).
- 19NonSense - non-obstructive sensing-based, fully annotated accelerometer and gyroscope data of 19 activities under multiple contexts, captured from 13 subjects wearing e-Shoes and a smart watch (Thi-Lan Le) [26/12/2020]
- 20bn-Something-Something - densely-labeled video clips that show humans performing predefined basic actions with everyday objects (108,499 video clips involving actions between 2 objects, 174 labels) (Twenty Billion Neurons GmbH) [Before 28/12/19]
- 3D online action dataset - There are seven action categories (Microsoft and Nanyang Technological University) [Before 28/12/19]
- 50 Salads - fully annotated 4.5 hour dataset of RGB-D video + accelerometer data, capturing 25 people preparing two mixed salads each (Dundee University, Sebastian Stein) [Before 28/12/19]
- ActivityNet - A Large-Scale Video Benchmark for Human Activity Understanding (200 classes, 100 videos per class, 648 video hours) (Heilbron, Escorcia, Ghanem and Niebles) [Before 28/12/19]
- Action Detection in Videos - MERL Shopping Dataset consists of 106 videos, each of which is a sequence about 2 minutes long (Michael Jones, Tim Marks) [Before 28/12/19]
- Actor and Action Dataset - 3782 videos, seven classes of actors performing eight different actions (Xu, Hsieh, Xiong, Corso) [Before 28/12/19]
- An analyzed collation of various labeled video datasets for action recognition (Kevin Murphy) [Before 28/12/19]
- Animal Kingdom - multiple annotated tasks to enable a more thorough understanding of natural animal behaviors, which includes 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes (Ng, Ong, Zheng, Ni, Yeo, Liu) [8/10/24]
- ARIC: Activity Recognition In Classroom Dataset - a multimodal classroom surveillance image dataset featuring 32 distinct activity categories captured from multiple perspectives. (L. Xu, F. Meng, Q. Wu et al.) [11/07/2025]
- AQA-7 - Dataset for assessing the quality of 7 different actions. It contains 1106 action samples and AQA scores. (Parmar, Morris) [29/12/19]
- ASLAN Action similarity labeling challenge database (Orit Kliper-Gross) [Before 28/12/19]
- Attribute Learning for Understanding Unstructured Social Activity - Database of videos containing 10 categories of unstructured social events to recognise, also annotated with 69 attributes. (Y. Fu Fudan/QMUL, T. Hospedales Edinburgh/QMUL) [Before 28/12/19]
- Audio-Visual Event (AVE) dataset - contains 4143 YouTube videos covering 28 event categories; the videos are temporally labeled with audio-visual event boundaries. (Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu) [Before 28/12/19]
- AVA: A Video Dataset of Atomic Visual Action - 80 atomic visual actions in 430 15-minute movie clips. (Google Machine Perception Research Group) [Before 28/12/19]
- AVA-Kinetics Localized Human Actions Video - collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these newly annotated Kinetics clips. (A. Li, M. Thotakuri, et al.) [24/07/25]
- BBDB - Baseball Database (BBDB) is a large-scale baseball video dataset that contains 4200 hours of full baseball game videos with 400,000 temporally annotated activity segments. (Shim, Minho, Young Hwi, Kyungmin, Kim, Seon Joo) [Before 28/12/19]
- BEHAVE Interacting Person Video Data with markup (Scott Blunsden, Bob Fisher, Aroosha Laghaee) [Before 28/12/19]
- Berkeley Multimodal Human Action Database (MHAD) - a controlled multimodal human action dataset capturing 11 actions performed by 12 subjects (seven male, five female) with five repetitions per action. (F. Ofli, R. Chaudhry, G. Kurillo et al.) [31/07/25]
- BU-action Datasets - Three image action datasets (BU101, BU101-unfiltered, BU203-unfiltered) that have 1:1 correspondence with classes of the video datasets UCF101 and ActivityNet. (S. Ma, S. A. Bargal, J. Zhang, L. Sigal, S. Sclaroff.) [Before 28/12/19]
- Berkeley MHAD: A Comprehensive Multimodal Human Action Database (Ferda Ofli) [Before 28/12/19]
- Berkeley Multimodal Human Action Database - five different modalities to expand the fields of application (University of California at Berkeley and Johns Hopkins University) [Before 28/12/19]
- BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis - a dataset for audio-conditioned dance motion synthesis focusing on breakdancing sequences, with high quality annotations for complex body poses and dance movements (Moltisanti, Wu, Dai, Loy) [3/12/2022]
- Breakfast dataset - 1712 video clips showing 10 kitchen activities, hand-segmented into 48 atomic action classes. (H. Kuehne, A. B. Arslan and T. Serre) [Before 28/12/19]
- Bristol Egocentric Object Interactions Dataset - Contains videos shot from a first-person (egocentric) point of view of 3-5 users performing tasks in six different locations (Dima Damen, Teesid Leelaswassuk and Walterio Mayol-Cuevas, Bristol University) [Before 28/12/19]
- Brown Breakfast Actions Dataset - 70 hours, 4 million frames of 10 different breakfast preparation activities (Kuehne, Arslan and Serre) [Before 28/12/19]
- CAD-120 dataset - focuses on high level activities and object interactions (Cornell University) [Before 28/12/19]
- CAD-60 dataset - The CAD-60 and CAD-120 data sets comprise RGB-D video sequences of humans performing activities (Cornell University) [Before 28/12/19]
- CarDA - Car Door Assembly Activities Dataset - a multi-modal dataset for car door assembly activities with synchronized RGB-D videos and pose data in a real manufacturing setting (K. Papoutsakis, N. Bakalos, K. Fragkoulis et al.) [10/07/25]
- Cardiopulmonary Resuscitation Performance: Video, Demographic and Evaluation Data - A video dataset for action quality assessment (AQA) in cardiopulmonary resuscitation (CPR). (Constable, Zhang, Connor, Monk, Rajsic, Ford, Park, Barker, Platt, Porteous, Grierson, Shum) [13/8/25]
- CATER: A diagnostic dataset for Compositional Actions and TEmporal Reasoning - A synthetic video understanding benchmark, with tasks that by-design require temporal reasoning to be solved (Girdhar, Ramanan) [29/12/19]
- ChimpACT - a comprehensive dataset for deciphering the longitudinal behavior and social relations of chimpanzees within a social group, intended to advance understanding of communication and sociality in non-human primates. (X. Ma, S. P. Kaufhold et al.) [10/07/25]
- CholecT50 - Endoscopic videos of laparoscopic cholecystectomy surgeries, with annotations formatted as (instrument, verb, target), to support research on detailed action recognition in laparoscopic procedures. (Nwoye, Yu, Gonzalez, Seeliger, Mascagni, Mutter, Marescaux, Padoy) [1/8/24]
- CVBASE06: annotated sports videos (Janez Pers) [Before 28/12/19]
- Charades Dataset - 10,000 videos from 267 volunteers, each annotated with multiple activities, captions, objects, and temporal localizations. (Sigurdsson, Varol, Wang, Laptev, Farhadi, Gupta) [Before 28/12/19]
- CMU MMAC (CMU Kitchen) - a multi-modal activity dataset captured in a fully instrumented kitchen, featuring 25 subjects performing 5 cooking recipes. (F. De la Torre, J. Hodgins, J. Montano et al.) [31/07/25]
- Composable activities dataset - Different combinations of 26 atomic actions formed 16 activity classes which were performed by 14 subjects and annotations were provided (Pontificia Universidad Catolica de Chile and Universidad del Norte) [Before 28/12/19]
- Continuous Multimodal Multi-view Dataset of Human Fall - The dataset consists of both normal daily activities and simulated falls for evaluating human fall detection. (Thanh-Hai Tran) [Before 28/12/19]
- CONVERSE - Human Conversational Interaction Dataset - a human interaction recognition dataset intended for the exploration of classifying naturally executed conversational scenarios between a pair of individuals via the use of pose-and appearance-based features (Edwards, Deng, Xie) [29/12/2020]
- Cornell Activity Datasets CAD 60, CAD 120 (Cornell Robot Learning Lab) [Before 28/12/19]
- Dataset & Benchmark for Understanding Human-Centric Situations - a dataset called MovieGraphs, which provides detailed graph-based annotations of social situations depicted in movie clips. (P. Vicol and M. Tapaswi, et al.) [05/06/25]
- DeepMind Kinetics datasets - 650,000 video clips, 700 human action classes, including human-object and human-human interactions. Includes AVA Kinetics, Kinetics 700, Kinetics 600, Kinetics 400 (DeepMind Carreira, Noland, Hillier, Zisserman) [31/5/20]
- DeepSport - pairs of successive images captured in different basketball arenas during professional games, with ground truth annotations of the ball position. (UC Louvain ISPGroup)
- DemCare dataset - a diverse collection of data from different sensors, useful for human activity recognition from wearable/depth and static IP cameras, speech recognition for Alzheimer's disease detection, and physiological data for gait analysis and abnormality detection. (K. Avgerinakis, A. Karakostas, S. Vrochidis, I. Kompatsiaris) [Before 28/12/19]
- Depth-included Human Action video dataset - It contains 23 different actions (CITI in Academia Sinica) [Before 28/12/19]
- DMLSmartActions dataset - Sixteen subjects performed 12 different actions in a natural manner. (University of British Columbia) [Before 28/12/19]
- DogCentric Activity Dataset - first-person videos taken from a camera mounted on top of a *dog* (Michael Ryoo) [Before 28/12/19]
- "DVSACT16 - DVS Datasets for Object Tracking, Action Recognition and Object Recognition" - Dataset contains recordings from DVS on tracking datasets. (Hu, Liu, Pfeiffer, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- EATSENSE - 135 RGBD videos, averaging 11 minutes each, captured at 15 FPS. Each frame is labeled with the main atomic action primitive. A single person is in each video, eating a variety of foods, with a variety of styles. (Ahmed Raza) [11/11/2024]
- Edinburgh ceilidh overhead video data - 16 ground-truthed dances viewed from overhead, where the 10 dancers follow a structured dance pattern (2 different dances). The dataset is useful for highly structured behavior understanding (Aizeboje, Fisher) [Before 28/12/19]
- EgoExoLearn - a large-scale video dataset comprising 120 hours of asynchronous egocentric (first-person) and exocentric (third-person) demonstration-following footage, complete with high-quality gaze data and detailed multimodal annotations, to support cross-view benchmarks such as association, planning, and skill assessment. (Y. Huang) [22/06/25]
- EPIC-KITCHENS-100 - The largest annotated egocentric action dataset composed of 100 hours, 20M frames, 90K actions capturing long-term unscripted activities in 45 environments (Damen et al.) [29/12/2020]
- EPIC-Kitchens-100-SPMV - A subset of EPIC-Kitchens-100 validation set in multiple verb labels for single positive multi-verb training which tackles verb ambiguity. (Kim, Moltisanti, Mac Aodha, Sevilla-Lara) [5/12/2022]
- EPIC-KITCHENS - egocentric video recorded by 32 participants in their native kitchen environments, non-scripted daily activities, 11.5M frames, 39.6K frame-level action segments and 454.2K object bounding boxes (Damen, Doughty, Fidler, et al) [Before 28/12/19]
- EPFL crepe cooking videos - 6 types of structured cooking activity (12) videos in 1920x1080 resolution (Lee, Ognibene, Chang, Kim and Demiris) [Before 28/12/19]
- ETS Hockey Game Event Data Set - This data set contains footage of two hockey games captured using fixed cameras. (M.-A. Carbonneau, A. J. Raymond, E. Granger, and G. Gagnon) [Before 28/12/19]
- Falling Detection dataset - Six subjects in two sceneries performed a series of actions continuously (University of Texas) [Before 28/12/19]
- FCVID: Fudan-Columbia Video Dataset - 91,223 Web videos annotated manually according to 239 categories (Jiang, Wu, Wang, Xue, Chang) [Before 28/12/19]
- FineGym - A hierarchical video dataset for fine-grained action understanding, providing coarse-to-fine annotations both temporally and semantically. It contains over 30k fine-grained action instances from more than 300 well-defined categories (Shao, Zhao, Dai, Lin) [27/12/2020]
- FPV-O - a first-person video dataset of everyday office activities captured with a chest-mounted camera, comprising ~3 hours of footage from 12 subjects performing 20 interaction and object-manipulation activities. (G. Abebe, A. Catala, A. Cavallaro et al.) [31/07/25]
- G3D - synchronised video, depth and skeleton data for 20 gaming actions captured with Microsoft Kinect (Victoria Bloom) [Before 28/12/19]
- G3Di - This dataset contains 12 subjects split into 6 pairs (Kingston University) [Before 28/12/19]
- Gaming 3D dataset - real-time action recognition in gaming scenario (Kingston University) [Before 28/12/19]
- Georgia Tech Egocentric Activities - Gaze(+) - videos of where people look at and their gaze location (Fathi, Li, Rehg) [Before 28/12/19]
- HiEve: Large-scale Human-centric Video Analysis in Complex Events - 1M+ poses, 56k+ complex-event action labels, and 2.6k+ trajectories, covering a wide range of human-centric analysis tasks (Lin, Qi, Sebe, Xu, Xiong, Shah) [28/12/2020]
- HMDB: A Large Human Motion Database (Serre Lab) [Before 28/12/19]
- Hollywood 3D dataset - 650 3D video clips, across 14 action classes (Hadfield and Bowden) [Before 28/12/19]
- Howto100m - 136M instructional videos, 23K domains (Miech, Zhukov) [29/2/24]
- Human Actions and Scenes Dataset (Marcin Marszalek, Ivan Laptev, Cordelia Schmid) [Before 28/12/19]
- Human Searches - search sequences of human annotators who were tasked to spot actions in the AVA and THUMOS14 datasets. (Alwassel, H., Caba Heilbron, F., Ghanem, B.) [Before 28/12/19]
- Hollywood Extended - 937 video clips with a total of 787720 frames containing sequences of 16 different actions from 69 Hollywood movies. (Bojanowski, Lajugie, Bach, Laptev, Ponce, Schmid, and Sivic) [Before 28/12/19]
- HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion (Brown University) [Before 28/12/19]
- I-LIDS video event image dataset (Imagery library for intelligent detection systems) (Paul Hosner) [Before 28/12/19]
- I3DPost Multi-View Human Action Datasets (Hansung Kim) [Before 28/12/19]
- IAS-lab Action dataset - contain sufficient variety of actions and number of people performing the actions (IAS Lab at the University of Padua) [Before 28/12/19]
- ICS-FORTH MHAD101 Action Co-segmentation - 101 pairs of long-term action sequences that share one or multiple common actions to be co-segmented, contains both 3d skeletal and video related frame-based features (University of Crete and FORTH-ICS, K. Papoutsakis) [Before 28/12/19]
- IIIT Extreme Sports - 160 first person (egocentric) sport videos from YouTube with frame level annotations of 18 action classes. (Suriya Singh, Chetan Arora, and C. V. Jawahar. Trajectory Aligned) [Before 28/12/19]
- IKEA assembly dataset - a multi-modal and multi-view video dataset of assembly tasks which contains 371 samples of furniture assemblies and their ground-truth annotations. Each sample includes 3 RGB views, one depth stream, atomic actions, human poses, object segments, object tracking, and extrinsic camera calibration (Ben-Shabat, Yu, Saleh, Campbell, Rodriguez, Li, Gould) [27/12/2020]
- InHAC - A 3D skeletal motion dataset on human-human and human-object interaction (Yijun Shen, Longzhi Yang, Edmond S. L. Ho and Hubert P. H. Shum) [1/2/21]
- INRIA Xmas Motion Acquisition Sequences (IXMAS) (INRIA) [Before 28/12/19]
- InfAR Dataset - Infrared Action Recognition at different times (Chenqiang Gao, Yinhe Du, Jiang Liu, Jing Lv, Luyu Yang, Deyu Meng, Alexander G. Hauptmann) [Before 28/12/19]
- Jena Action Recognition Dataset - Aibo dog actions (Korner and Denzler) [Before 28/12/19]
- Jester - 148092 video clips of people making gestures in front of laptop/webcam (27 labels) (Materzynska, Berger, Bax, Memisevic) [31/5/2020]
- JHMDB: Joints for the HMDB dataset (J-HMDB) based on 928 clips from HMDB51 comprising 21 action categories (Jhuang, Gall, Zuffi, Schmid and Black) [Before 28/12/19]
- JPL First-Person Interaction dataset - 7 types of human activity videos taken from a first-person viewpoint (Michael S. Ryoo, JPL) [Before 28/12/19]
- K3Da - Kinect 3D Active dataset - a realistic, clinically relevant human action dataset containing skeleton data, depth data and associated participant information (D. Leightley, M. H. Yap, J. Coulson, Y. Barnouin and J. S. McPhee) [Before 28/12/19]
- Kinetics Human Action Video Dataset - 300,000 video clips, 400 human action classes, 10 second clips, single action per clip (Kay, Carreira, et al) [Before 28/12/19]
- KIT Robo-Kitchen Activity Data Set - 540 clips of 17 people performing 12 complex kitchen activities. (L. Rybok, S. Friedberger, U. D. Hanebeck, R. Stiefelhagen) [Before 28/12/19]
- KTH human action recognition database (KTH CVAP lab) [Before 28/12/19]
- Karlsruhe Motion, Intention, and Activity Data set (MINTA) - 7 types of activities of daily living including fully motion primitive segments. (D. Gehrig, P. Krauthausen, L. Rybok, H. Kuehne, U. D. Hanebeck, T. Schultz, R. Stiefelhagen) [Before 28/12/19]
- LATTE-MV - Dataset of extracted human poses and 3D ball positions during professional table tennis gameplay, captured from monocular videos. (D. Etaat, D. Kalaria, N. Rahmanian, S. S. Sastry) [10/07/25]
- Leeds Activity Dataset--Breakfast (LAD--Breakfast) - composed of 15 annotated videos, representing five different people having breakfast or another simple meal. (John Folkesson et al.) [Before 28/12/19]
- LEGO - A dataset of over 150k paired egocentric images captured before and during daily tasks, designed for visual instruction generation. (B. Lai, X. Dai, L. Chen, G. Pang, et al.) [10/07/25]
- LEMMA - Multi-agent Multi-view Activities with FPV+TPV (Jia, Chen, Huang, Zhu, Zhu) [26/12/2020]
- LIRIS Human Activities Dataset - contains (gray/rgb/depth) videos showing people performing various activities (Christian Wolf, et al, French National Center for Scientific Research) [Before 28/12/19]
- ManiAc RGB-D action dataset: different manipulation actions, 15 different versions, 30 different objects manipulated, 20 long and complex chained manipulation sequences (Eren Aksoy) [Before 28/12/19]
- The MECCANO Dataset - The MECCANO dataset is the first dataset of egocentric videos to study human-object interactions in industrial-like settings. (F. Ragusa, A. Furnari, S. Livatino, G. M. Farinella) [1/2/21]
- MEXaction2 action detection and localization dataset - To support the development and evaluation of methods for 'spotting' instances of short actions in a relatively large video database: 77 hours, 117 videos (Michel Crucianu and Jenny Benois-Pineau) [Before 28/12/19]
- Mivia dataset - It consists of 7 high-level actions performed by 14 subjects. (Mivia Lab at the University of Salerno) [Before 28/12/19]
- MLB-YouTube - Dataset for activity recognition in baseball videos (AJ Piergiovanni, Michael Ryoo) [Before 28/12/19]
- Moments in Time Dataset - Moments in Time Dataset 1M 3-second videos annotated with action type, the largest dataset of its kind for action recognition and understanding in video. (Monfort, Oliva, et al.) [Before 28/12/19]
- MoVi: A Large Multipurpose Human Motion and Video Dataset - MoVi is the first human motion dataset to contain synchronized pose, body meshes, and video recordings from a large population of subjects (Ghorbani, Mahdaviani, Thaler, Kording, Cook, Blohm, Troje) [27/12/2020]
- MPII Cooking Activities Dataset for fine-grained cooking activity recognition, which also includes the continuous pose estimation challenge (Rohrbach, Amin, Andriluka and Schiele) [Before 28/12/19]
- MPII Cooking 2 Dataset - A large dataset of fine-grained cooking activities, an extension of the MPII Cooking Activities Dataset. (Rohrbach, Rohrbach, Regneri, Amin, Andriluka, Pinkal, Schiele) [Before 28/12/19]
- MPHOI-120 - A more extensive video benchmark dataset of multi-person human-object interaction for activity recognition, with three people interacting. (Qiao, Li, Li, Kubotani, Morishima, Shum) [13/8/25]
- MPHOI-72 - video benchmark dataset of multi-person human-object interaction for activity recognition (Qiao, Men, Li, Kubotani, Morishima, Shum) [13/8/25]
- MSR-Action3D - benchmark RGB-D action dataset (Microsoft Research Redmond and University of Wollongong) [Before 28/12/19]
- MSRActionPair dataset - Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences (University of Central Florida and Microsoft) [Before 28/12/19]
- MSRC-12 Kinect gesture data set - 594 sequences and 719,359 frames from people performing 12 gestures (Microsoft Research Cambridge) [Before 28/12/19]
- MSRC-12 dataset - sequences of human movements, represented as body-part locations, and the associated gesture (Microsoft Research Cambridge and University of Cambridge) [Before 28/12/19]
- MSRDailyActivity3D Dataset - There are 16 activities (Microsoft and the Northwestern University) [Before 28/12/19]
- MTL-AQA - Multitask learning dataset for assessing quality of Olympic Diving. More than 1500 samples. It contains videos of action samples, fine-grained action class, expert commentary (AQA-oriented captions), AQA scores from judges. Videos from multiple views included wherever available. Can be used for captioning, and fine-grained action recognition, apart from AQA. (Parmar, Morris) [29/12/19]
- MuHAVi - Multicamera Human Action Video Data (Hossein Ragheb) [Before 28/12/19]
- Multi-modal action detection (MAD) Dataset - It contains 35 sequential actions performed by 20 subjects. (Carnegie Mellon University) [Before 28/12/19]
- Multiview 3D Event dataset - This dataset includes 8 categories of events performed by 8 subjects (University of California at Los Angeles) [Before 28/12/19]
- Nagoya University Extremely Low-resolution FIR Image Action Dataset (Version 2018) - An action dataset obtained from a 16*16 far-infrared sensor array mounted on the ceiling (Yasutomo Kawanishi) [28/12/2020]
- NTU RGB+D Action Recognition Dataset - NTU RGB+D is a large-scale dataset for human action recognition (Amir Shahroudy) [Before 28/12/19]
- Northwestern-UCLA Multiview Action 3D - There are 10 action categories. (Northwestern University and University of California at Los Angeles) [Before 28/12/19]
- Office Activity Dataset - It consists of skeleton data acquired by Kinect 2.0 from different subjects performing common office activities. (A. Franco, A. Magnani, D. Maio) [Before 28/12/19]
- Olympic Sports Dataset - videos of athletes practicing 16 Olympic sports. (Niebles, Chen, Fei-Fei.) [9/6/2022]
- Opportunity Activity Recognition - a multimodal dataset of wearable, ambient, and object sensors recorded while users performed scripted and natural activities in a room setting, with annotations at multiple levels. (D. Roggen, R. Chavarriaga, D. Nguyen Dinh et al.) [31/07/25]
- Oxford TV based human interactions (Oxford Visual Geometry Group) [Before 28/12/19]
- PA-HMDB51 - human action video (592) dataset with potential privacy leak attributes annotated: skin color, gender, face, nudity, and relationship (Wang, Wu, Wang, Wang, Jin) [Before 28/12/19]
- PAMAP2 Physical Activity Monitoring - a multimodal, sensor-based dataset featuring recordings from 9 subjects performing 18 types of physical activities. (A. Reiss, D. Stricker) [31/07/25]
- Parliament - The Parliament dataset is a collection of 228 video sequences, depicting political speeches in the Greek parliament. (Michalis Vrigkas, Christophoros Nikou, Ioannis A. Kakadiaris) [Before 28/12/19]
- PKU-MMD - a large-scale continuous multimodal action detection dataset captured via Kinect v2. (C. Liu, Y. Hu, Y. Li et al.) [31/07/25]
- PLAICraft: Large-Scale Time-Aligned Vision-Speech-Action Dataset for Embodied AI - A 10,000-hour dataset capturing multiplayer Minecraft interactions across five time-aligned modalities (Y. He, C. D. Weilbach, M. E. Wojciechowska et al.) [10/07/25]
- Procedural Human Action Videos - This dataset contains about 40,000 videos for human action recognition that had been generated using a 3D game engine. The dataset contains about 6 million frames which can be used to train and evaluate models not only action recognition but also models for depth map estimation, optical flow, instance segmentation, semantic segmentation, 3D and 2D pose estimation, and attribute learning. (Cesar Roberto de Souza) [Before 28/12/19]
- RealVAD - The upper body detection of attendees in a panel discussion, associated voice activity detection ground-truth (speaking, not-speaking) for them and acoustic features extracted from the video (Cigdem Beyan, Muhammad Shahid, Vittorio Murino) [1/2/21]
- ReC(3+1) - video dataset with synchronized ego and exo videos for temporal action segmentation in long cooking videos (Sayed, Ghoddoosian, Trivedi, Athitsos) [23/06/25]
- RGB-D activity dataset - Each video in the dataset contains 2-7 actions involving interaction with different objects. (Cornell University and Stanford University) [Before 28/12/19]
- RGBD-Action-Completion-2016 - This dataset includes 414 complete/incomplete object interaction sequences, spanning six actions and presenting RGB, depth and skeleton data. (Farnoosh Heidarivincheh, Majid Mirmehdi, Dima Damen) [Before 28/12/19]
- RGB-D-based Action Recognition Datasets - Paper that includes the list and links of different rgb-d action recognition datasets. (Jing Zhang, Wanqing Li, Philip O. Ogunbona, Pichao Wang, Chang Tang) [Before 28/12/19]
- RGBD-SAR Dataset - RGBD-SAR Dataset (University of Electronic Science and Technology of China and Microsoft) [Before 28/12/19]
- Rochester Activities of Daily Living Dataset (Ross Messing) [Before 28/12/19]
- SARAS endoscopic vision challenge for surgeon action detection - 22,601 annotated training frames with 28,055 action instances from 21 different action classes (Cuzzolin, Singh Bawa, Skarga-Bandurova, Singh) [16/4/20]
- SBU Kinect Interaction Dataset - It contains eight types of interactions (Stony Brook University) [Before 28/12/19]
- SBU-Kinect-Interaction dataset v2.0 - It comprises RGB-D video sequences of humans performing interaction activities (Kiwon Yun et al.) [Before 28/12/19]
- SDHA Semantic Description of Human Activities 2010 contest - Human Interactions (Michael S. Ryoo, J. K. Aggarwal, Amit K. Roy-Chowdhury) [Before 28/12/19]
- SDHA Semantic Description of Human Activities 2010 contest - aerial views (Michael S. Ryoo, J. K. Aggarwal, Amit K. Roy-Chowdhury) [Before 28/12/19]
- SFU Volleyball Group Activity Recognition - 2 levels annotations dataset (9 players' actions and 8 scene's activity) for volleyball videos. (M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori / Simon Fraser University) [Before 28/12/19]
- ShakeFive Dataset - contains only two actions, namely hand shake and high five. (Universiteit Utrecht) [Before 28/12/19]
- ShakeFive2 - A dyadic human interaction dataset with limb level annotations on 8 classes in 153 HD videos (Coert van Gemeren, Ronald Poppe, Remco Veltkamp) [Before 28/12/19]
- SoccerNet - Scalable dataset for action spotting in soccer videos: 500 soccer games fully annotated with main actions (goal, cards, subs) and more than 13K soccer games annotated with 500K commentaries for event captioning and game summarization. (Silvio Giancola, Mohieddine Amine, Tarek Dghaily, Bernard Ghanem) [Before 28/12/19]
- Something-something v2 - 220847 video clips involving actions between 2 objects (30408 objects, 174 labels) (Goyal, Kahou, Michalski, Materzynska, Westphal, Kim, Haenel, Fruend, Yianilos, Mueller-Freitag, Hoppe, Thurau, Bax, Memisevic) [31/5/2020]
- Sports Videos in the Wild (SVW) - SVW is comprised of 4200 videos captured solely with smartphones by users of Coach Eye smartphone app, a leading app for sports training developed by TechSmith corporation. (Seyed Morteza Safdarnejad, Xiaoming Liu) [Before 28/12/19]
- STAIR Actions - dataset consisting of 100 everyday human action categories (Yoshikawa, Lin, Takeuchi) [26/12/2020]
- Stanford Sport Events dataset (Jia Li) [Before 28/12/19]
- Subtle Binary Activities - 3 new unbiased action recognition datasets that are challenging for state-of-the-art computer vision, but easily solved by humans (Jacquot, Ying, Kreiman) [27/12/2020]
- SYSU 3D Human-Object Interaction Dataset - Forty subjects perform 12 distinct activities (Sun Yat-sen University) [Before 28/12/19]
- TAPOS - dataset developed on sport videos with manual annotations of sub-actions, containing over 16K action instances in 21 Olympics sport classes (Shao, Zhao, Dai, Lin) [27/12/2020]
- THU-READ(Tsinghua University RGB-D Egocentric Action Dataset) - THU-READ is a large-scale dataset for action recognition in RGBD videos with pixel-layer hand annotation. (Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, Jie Zhou) [Before 28/12/19]
- THUMOS - Action Recognition in Temporally Untrimmed Videos! - 430 hours of video data and 45 million frames (Gorban, Idrees, Jiang, Zamir, Laptev, Shah, Sukthankar) [Before 28/12/19]
- TinyVIRAT - A dataset for tiny action recognition in videos (Demir, Ugur, Yogesh S. Rawat, and Mubarak Shah) [1/2/21]
- Toyota Smarthome dataset - Dataset for Real-world activities of Daily Living (Toyota Motors Europe & INRIA Sophia Antipolis) [30/12/19]
- TUM Kitchen Data Set of Everyday Manipulation Activities (Moritz Tenorth, Jan Bandouch) [Before 28/12/19]
- TV Human Interaction Dataset (Alonso Patron-Perez) [Before 28/12/19]
- TJU dataset - contains 22 actions performed by 20 subjects in two different environments; a total of 1760 sequences. (Tianjin University) [Before 28/12/19]
- UCF-iPhone Data Set - 9 Aerobic actions were recorded from (6-9) subjects using the Inertial Measurement Unit (IMU) on an Apple iPhone 4 smartphone. (Corey McCall, Kishore Reddy and Mubarak Shah) [Before 28/12/19]
- UCI Human Activity Recognition Using Smartphones Data Set - recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors (Anguita, Ghio, Oneto, Parra, Reyes-Ortiz) [Before 28/12/19]
- UNLV Dive & Gymvault - Dataset for assessing quality of Olympic Diving and Olympic Gymnastic Vault. It consists of videos of action samples and corresponding action quality scores. (Parmar, Morris) [29/12/19]
- UPCV action dataset - The dataset consists of 10 actions performed by 20 subjects twice. (University of Patras) [Before 28/12/19]
- UC-3D Motion Database - Available data types encompass high resolution Motion Capture, acquired with MVN Suit from Xsens and Microsoft Kinect RGB and depth images. (Institute of Systems and Robotics, Coimbra, Portugal) [Before 28/12/19]
- UCF 101 action dataset - 101 action classes, over 13k clips and 27 hours of video data; a loading sketch follows this list (Univ of Central Florida) [Before 28/12/19]
- UCF-Crime Dataset: Real-world Anomaly Detection in Surveillance Videos - A large-scale dataset for real-world anomaly detection in surveillance videos. It consists of 1900 long and untrimmed real-world surveillance videos (of 128 hours), with 13 realistic anomalies such as fighting, road accident, burglary, robbery, etc. as well as normal activities. (Center for Research in Computer Vision, University of Central Florida) [Before 28/12/19]
- UCFKinect - The dataset is composed of 16 actions (University of Central Florida Orlando) [Before 28/12/19]
- UCI HAR (Smartphones) - UCI Human Activity Recognition Using Smartphones is a wearable sensor dataset collected from 30 subjects performing six daily activities. (D. Anguita, L. Oneto, J. Reyes-Ortiz et al.) [31/07/25]
- UCLA Human-Human-Object Interaction (HHOI) Dataset Vn1 - Human interactions in RGB-D videos (Shu, Ryoo, and Zhu) [Before 28/12/19]
- UCLA Human-Human-Object Interaction (HHOI) Dataset Vn2 - Human interactions in RGB-D videos (version 2) (Shu, Gao, Ryoo, and Zhu) [Before 28/12/19]
- UCR Videoweb Multi-camera Wide-Area Activities Dataset (Amit K. Roy-Chowdhury) [Before 28/12/19]
- UTD-MHAD - Eight subjects performed 27 actions four times. (University of Texas at Dallas) [Before 28/12/19]
- UTKinect dataset - Ten types of human actions were performed twice by 10 subjects (University of Texas) [Before 28/12/19]
- UWA3D Multiview Activity Dataset - Thirty activities were performed by 10 individuals (University of Western Australia) [Before 28/12/19]
- Univ of Central Florida - 50 Action Category Recognition in Realistic Videos (3 GB) (Kishore Reddy) [Before 28/12/19]
- Univ of Central Florida - ARG Aerial camera, Rooftop camera and Ground camera (UCF Computer Vision Lab) [Before 28/12/19]
- Univ of Central Florida - Feature Films Action Dataset (Univ of Central Florida) [Before 28/12/19]
- Univ of Central Florida - Sports Action Dataset (Univ of Central Florida) [Before 28/12/19]
- Univ of Central Florida - YouTube Action Dataset (sports) (Univ of Central Florida) [Before 28/12/19]
- Unsegmented Sports News Videos - Database of 74 sports news videos tagged with 10 categories of sports. Designed to test multi-label video tagging. (T. Hospedales, Edinburgh/QMUL) [Before 28/12/19]
- UT Interaction Dataset - a benchmark dataset of continuous human-human interaction videos captured in realistic outdoor scenes. (M. Ryoo, J. Aggarwal et al.) [31/07/25]
- Utrecht Multi-Person Motion Benchmark (UMPM) - a collection of video recordings of people together with a ground truth based on motion capture data. (N.P. van der Aa, X. Luo, G.J. Giezeman, R.T. Tan, R.C. Veltkamp) [Before 28/12/19]
- VIRAT Video Dataset - event recognition from two broad categories of activities (single-object and two-objects) which involve both human and vehicles. (Sangmin Oh et al) [Before 28/12/19]
- Verona Social interaction dataset (Marco Cristani) [Before 28/12/19]
- ViHASi: Virtual Human Action Silhouette Data (userID: VIHASI password: virtual$virtual) (Hossein Ragheb, Kingston University) [Before 28/12/19]
- Videoweb (multicamera) Activities Dataset (B. Bhanu, G. Denina, C. Ding, A. Ivers, A. Kamal, C. Ravishankar, A. Roy-Chowdhury, B. Varda) [Before 28/12/19]
- Weizmann Space-Time Actions - The Weizmann dataset (Actions as Space-Time Shapes) consists of 90 low-resolution video sequences (180x144, 25 fps) showing 9 actors performing 10 basic actions. (M. Blank, L. Gorelick, E. Shechtman et al.) [31/07/25]
- WVU Multi-view action recognition dataset (Univ. of West Virginia) [Before 28/12/19]
- WorkoutSU-10 Kinect dataset for exercise actions (Ceyhun Akgul) [Before 28/12/19]
- WorkoutSU-10 dataset - contains exercise actions selected by professional trainers for therapeutic purposes. (Sabanci University) [Before 28/12/19]
- Wrist-mounted camera video dataset - object manipulation (Ohnishi, Kanehira, Kanezaki, Harada) [Before 28/12/19]
- YouCook - 88 open-source YouTube cooking videos with annotations (Zhou, Louis, Xu, Corso) [Before 28/12/19]
- YouCook2 - 2000 open-source YouTube cooking videos of 89 recipes, 176 hours with annotations (Zhou, Louis, Xu, Corso) [29/2/24]
- YouTube-8M Dataset - A Large and Diverse Labeled Video Dataset for Video Understanding Research (Google Inc.) [Before 28/12/19]
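Several of the action datasets listed above (e.g., UCF 101 and HMDB) have ready-made wrappers in common libraries. The sketch below is one possible way to iterate over UCF101 clips with torchvision; it assumes the videos and the official train/test split files have already been downloaded manually (the wrapper does not fetch them), that a video backend such as PyAV is installed, and that the paths shown are placeholders.

    # Minimal sketch: read UCF101 clips through torchvision's dataset wrapper.
    # Assumes the videos and the official split files were downloaded separately.
    from torchvision.datasets import UCF101

    dataset = UCF101(
        root="UCF101/videos",             # placeholder: folder with the extracted .avi files
        annotation_path="UCF101/splits",  # placeholder: folder with the official split files
        frames_per_clip=16,
        step_between_clips=16,
        train=True,
    )
    video, audio, label = dataset[0]      # video: (T, H, W, C) uint8 tensor
    print(len(dataset), video.shape, label)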
Agriculture
- 102 Category Flower Dataset - Consists of 102 flower categories commonly found in the UK. Each class consists of between 40 and 258 images; a loading sketch follows this list. (Nilsback and Zisserman) [03/06/25]
- Aberystwyth Leaf Evaluation Dataset - Timelapse plant images with hand marked up leaf-level segmentations for some time steps, and biological data from plant sacrifice. (Bell, Jonathan; Dee, Hannah M.) [Before 28/12/19]
- Cherry CO Dataset - High-resolution dataset of 3,006 labeled images for cherry detection and segmentation, accounting for ripeness and health (L. Cossio-Montefinale, R. Verschae, J. Ruiz-del-Solar) [10/07/25]
- COT-AD: Cotton Analysis Dataset - A large-scale cotton crop dataset of 25k+ images (5k annotated) and 140 dslr videos spanning the growth cycle, combining drone/aerial and high-resolution DSLR imagery with labels for detection, segmentation, and disease analysis to advance computer-vision research in precision agriculture. (Ali, Vyas, Debnath, Kamra, Khalane, Devanesan, Mastan, Sankaranarayanan, Khanna, Raman) [13/8/25]
- EBCA dataset - Dataset with images of plants for developing 3D vision system for robotic fruit harvesting based on stereo cameras and multiple cameras. (A. Kaczmarek) [22/06/25]
- Fieldsafe - A multi-modal dataset for obstacle detection in agriculture. (Aarhus University) [Before 28/12/19]
- Flower Datasets - The dataset consists of 17 flower categories commonly found in the UK. Each class consists of 80 images. (Nilsback and Zisserman) [05/06/25]
- Fruits-360 dataset - A dataset with 138704 images of 206 fruits, vegetables, nuts and seeds. (M. Oltean) [28/07/25]
- KOMATSUNA dataset - The dataset is designed for instance segmentation, tracking and reconstruction of leaves using both sequential multi-view RGB images and depth images. (Hideaki Uchiyama, Kyushu University) [Before 28/12/19]
- Leaf counting dataset - Dataset for estimating the growth stage of small plants. (Aarhus University) [Before 28/12/19]
- Leaf Segmentation Challenge - Tobacco and arabidopsis plant images (Hanno Scharr, Massimo Minervini, Andreas Fischbach, Sotirios A. Tsaftaris) [Before 28/12/19]
- MAD - Monastery Apple Dataset - 14,667 annotated apple instances and 4,440 unlabeled images taken in complex orchard environments for semi-supervised apple detection (Johanson, Wilms, Frintrop) [14/8/25]
- Multi-species fruit flower detection - This dataset consists of four sets of flower images, from three different tree species: apple, peach, and pear, and accompanying ground truth images. (Philipe A. Dias, Amy Tabb, Henry Medeiros) [Before 28/12/19]
- North American Mushrooms Dataset - Images of popular North American mushrooms, Chicken of the Woods and Chanterelle, differentiating between the two species. (Morrow) [09/06/25]
- Pasadena Urban Trees - Includes roughly 30,000 trees labeled by geo-location and tree species located within Pasadena. (J. Wegner, S. Branson, D. Hall, et al.) [28/07/25]
- PlantDoc - 2,569 images across 13 plant species and 30 classes (diseased and healthy) for image classification and object detection. There are 8,851 labels. (Singh, Jain, et al.) [05/06/25]
- Plants Image Analysis Datasets - a compilation of several image datasets featuring roughly 1 million images of plants from about 11 species. (Verstraeten, Lobet, Cast) [03/06/25]
- Plant Phenotyping Datasets - plant data suitable for plant and leaf detection, segmentation, tracking, and species recognition (M. Minervini, A. Fischbach, H. Scharr, S. A. Tsaftaris) [Before 28/12/19]
- Plant seedlings dataset - High-resolution images of 12 weed species. (Aarhus University) [Before 28/12/19]
- Synthetic Fruit Dataset - Synthetic images that are composed of a background and a number of fruits superimposed on top with a random orientation, scale, and color transformation. (Dwyer) [09/06/25]
- Thermal Dogs and People - a collection of 203 thermal infrared images captured at various distances from people and dogs in a park and near a home. (Nelson) [05/06/25]
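The 102 Category Flower Dataset at the top of this list also ships with torchvision, which can download it directly. A minimal loading sketch, assuming a recent torchvision version that provides the Flowers102 wrapper; the root directory is a placeholder.

    # Minimal sketch: load the 102 Category Flower Dataset via torchvision.
    # Assumes a torchvision version that includes the Flowers102 class.
    from torchvision import transforms
    from torchvision.datasets import Flowers102

    tfm = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    train_set = Flowers102(root="data", split="train", transform=tfm, download=True)
    image, label = train_set[0]           # label is an integer class index
    print(len(train_set), image.shape, label)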
Animals (including Insects) Datasets
- 3D-POP (Posture of Pigeons) - The dataset is designed to solve 2D-3D posture tracking, multi-object tracking (MOT) and animal identification problems with birds, using multiple-view (4) videos of different complexity (n = 1, 2, 5, 10) (Naik and Chan et al.) [23/06/25]
- Animal Parts dataset - contains eye and foot keypoint annotations for 15K images of 100 animal classes from the "vertebrate" subtree of ILSVRC2012. (D. Novotny, D. Larlus, A. Vedaldi) [05/06/25]
- AnimalTrack - the first dedicated benchmark for multi-animal tracking in the wild, comprising 58 manually annotated video sequences from 10 common species. (L. Zhang, J. Gao, Z. Xiao et al.) [31/07/25]
- AutoFish - 1500 high-quality images of fish on a conveyor belt featuring 454 unique fish with class labels, IDs, manual length measurements, and a total of 18,160 instance segmentation masks. (Bengtson, Lehotský, Ismiroglou, Madsen, Moeslund, Pedersen) [13/8/25]
- BIOSCAN Insect Dataset plus PyPi package - A comprehensive dataset containing multi-modal information for over 5 million insect specimens. (Gharaee, Lowe, Gong, Arias, Pellegrino, Wang, Haurum, Zarubiieva, Kari, Steinke, Taylor, Fieguth, Chang) [13/8/25]
- BuckTales - The dataset is aimed at solving Multi-object tracking and re-identification of wild antelopes with one or more simultaneously flying UAVs (drones). The dataset consists of 30-120 individuals interacting naturally over up to 3 min long sequences (total ~12min); the first in the MOT category with wild animals. (Naik et al) [23/06/25]
- CaltechCameraTraps - This dataset of camera trap images contains 243,100 images from 140 camera locations in the Southwestern United States. It includes labels for 21 animal categories (plus empty), along with 66,000 bounding box annotations. (S. Beery, G. Horn, P. Perona) [28/07/25]
- Caltech-UCSD Birds-200-2011 (CUB-200-2011) - an extended version of the CUB-200 dataset, with roughly double the number of images per class and new part location annotations. (C. Wah et al.) [28/07/25]
- ChimpACT - a comprehensive dataset for deciphering the longitudinal behavior and social relations of chimpanzees within a social group, intended to advance understanding of communication and sociality in non-human primates. (X. Ma, S. P. Kaufhold et al.) [10/07/25]
- DAZZLE: Drone-Acquired Zebra data for Large-scale Ecology research - with >30K frames and >160K annotations, consists of semi-automatically labeled behaviors of plains and Grévy's zebras holistically for all individuals in complete scenes, in addition to bounding boxes on individual animal. (Price, Khandelwal, Rubenstein, Ahmad) [24/8/25]
- Edinburgh Pig Behavior dataset - 23 days of daytime pig video captured from a nearly overhead RGBD camera, of a pen with 8 growing pigs. Some ground truth is available for pig detection, tracking, and behavior. (Bergamini, Pini, Simoni, Vezzani, Calderara, D'Eath, Fisher) [11/7/21]
- Event Penguins - first dataset of continuous monitoring of a penguin colony in Antarctica using an event camera (Hamann, Ghosh, Juarez-Martinez, Hart, Kacelnik, Gallego) [24/8/25]
- GeoLifeCLEF 2020 - A collection of 1.9 million species observations paired with high-resolution remote sensing imagery, land cover data, and altitude, in addition to traditional low-resolution climate and soil variables. (Cole E, Deneu B, Lorieul T, et al.) [28/07/25]
- Honeybee segmentation dataset - a dataset containing positions and orientation angles of hundreds of bees on a 2D surface of honeycomb. (Bozek K, Hebert L, Mikheyev AS, Stephens GJ) [Before 28/12/19]
- IIT MBADA mice - Mice behavioral data. FLIR A315, spatial resolution of 320x240px at 30fps, 50x50cm open arena, two experts for three different mice pairs, mice identities. (Italian Inst. of Technology, PAVIS lab) [Before 28/12/19]
- iPanda-50 - a fine-grained giant panda recognition/identification dataset with additional eye patches annotations (Le Wang, Rizhi Ding, Yuanhao Zhai, Qilin Zhang, Wei Tang, Nanning Zheng, and Gang Hua) [1/2/21]
- iWildCam - The iWildCam datasets contain sets of camera trap imagery with a focus on tackling challenges in applying computer vision to ecological monitoring with static sensors. (J. Wegner, S. Branson, D. Hall, et al.) [28/07/25]
- LILA BC - Labeled Information Library of Alexandria: Biology and Conservation. Labeled images that support machine learning research around ecology and environmental science. (LILA working group) [28/07/25]
- Linnaeus 5 Dataset - 5 classes: berry, bird, dog, flower, other (negative set); 1200 training images and 400 test images per class. (Chaladze G., Kalatozishvili L.) [29/06/25]
- MIT CBCL Automated Mouse Behavior Recognition datasets (Jhuang, Garrote, Yu, Khilnani, Poggio, Steele and Serre) [Before 28/12/19]
- Moth fine-grained recognition - 675 similar classes, 5344 images (Erik Rodner et al) [Before 28/12/19]
- Mouse Embryo Tracking Database - cell division event detection (Marcelo Cicconet, Kris Gunsalus) [Before 28/12/19]
- MouseSIS: Space-Time Instance Segmentation of Mice - First dataset of mice (articulated moving objects) recorded with an event camera and a grayscale camera, including pixel-accurate annotations. Useful for prototyping algorithms in: tracking, segmentation, etc. (Hamann, Li, Mieske, Lewejohann, Gallego) [24/8/25]
- Moving Camouflaged Animals Dataset - The largest video dataset for camouflaged animal discovery: 141 videos, 37k frames, and 67 categories of animals which have mastered camouflage (H. Lamdouar, C. Yang, W. Xie, A. Zisserman) [05/06/25]
- NeWT: Natural World Tasks - A new suite of challenging natural world visual benchmark tasks that are motivated by real-world image understanding use cases. The tasks are validated by experts and span a diverse range of visual concepts including behavior, age, health, and more. (M. Jia, M. Shi, M. Sirotenko, et al.) [28/07/25]
- Oxford-IIIT Pet Dataset - 37 category pet dataset with roughly 200 images for each class. All images have an associated ground truth annotation of breed, head ROI, and pixel level trimap segmentation; a loading sketch follows this list. (Parkhi, Vedaldi, Zisserman, Jawahar) [03/06/25]
- Oxford Pets Dataset - The Oxford Pets dataset (also known as the "dogs vs cats" dataset) is a collection of images and annotations labeling various breeds of dogs and cats. There are approximately 100 examples of each of the 37 breeds. (Visual Geometry Group at University of Oxford) [09/06/25]
- Raccoon Dataset - 196 images of raccoons and 213 bounding boxes (some images have two raccoons). (Tran) [09/06/25]
- RatSI: Rat Social Interaction Dataset - 9 fully annotated (11 class) videos (15 minute, 25 FPS) of two rats interacting socially in a cage (Malte Lorbach, Noldus Information Technology) [Before 28/12/19]
- Rodent3D - a multi-modal animal motion dataset capturing ~200 minutes (~4 million frames) of rodents freely exploring and interacting, recorded with synchronized RGB-D and thermal infrared cameras. (M. Patel, Y. Gu, L. Carstensen et al.) [31/07/25]
- Stonefly9 - This database contains 3826 images of 773 specimens of 9 taxa of Stoneflies (Tom etc.) [Before 28/12/19]
- SyDogVideo - a synthetic video dataset for fine-grained dog action recognition and pose estimation, featuring labeled video clips with 3D joint annotations and action categories. (Q. Ma, Y. Jiang, X. Geng et al.) [11/07/25]
- Thermal Cheetah - This is a collection of images and video frames of cheetahs at the Omaha Henry Doorly Zoo taken in October, 2020. (Roboflow) [05/06/25]
- VarroaDataset - The purpose of this dataset is to provide high resolution images (160x280px) of honeybees and the parasite Varroa destructor. (CVL, Schurischuster Stefan, Martin Kampel) [1/2/21]
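For the Oxford-IIIT Pet Dataset listed above, torchvision provides a wrapper that can return both the breed label and the pixel-level trimap segmentation mentioned in the entry. A minimal sketch, assuming a recent torchvision version that includes the OxfordIIITPet class; the root directory is a placeholder.

    # Minimal sketch: load Oxford-IIIT Pet images with breed labels and trimap masks.
    from torchvision.datasets import OxfordIIITPet

    pets = OxfordIIITPet(
        root="data",
        split="trainval",
        target_types=["category", "segmentation"],  # breed index + trimap mask
        download=True,
    )
    image, (breed, trimap) = pets[0]      # PIL images when no transform is given
    print(len(pets), breed, image.size, trimap.size)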
Attribute recognition
- Animals with Attributes 2 - 37322 (freely licensed) images of 50 animal classes with 85 per-class binary attributes. (Christoph H. Lampert, IST Austria) [Before 28/12/19]
- Attribute Dataset - contains 78,017 images of 230 classes which are annotated with 359 attributes of visual, semantic and subjective properties at the instance level (Zhao, Fu, Liang, Wu, Wang, Wang) [30/12/2020]
- Attribute Learning for Understanding Unstructured Social Activity - Database of videos containing 10 categories of unstructured social events to recognise, also annotated with 69 attributes. (Y. Fu Fudan/QMUL, T. Hospedales Edinburgh/QMUL) [Before 28/12/19]
- Banana Ripeness Classification Dataset - This dataset consists of 5616 images of bananas varying in ripeness (Roboflow) [05/06/25]
- Birds - This database contains 600 images (100 samples each) of six different classes of birds. (Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce) [Before 28/12/19]
- Butterflies - This database contains 619 images of seven different classes of butterflies. (Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce) [Before 28/12/19]
- CAER (Context-Aware Emotion Recognition) - Large scale image and video dataset for emotion recognition, and facial expression recognition (Lee, Kim, Kim, Park, and Sohn) [29/12/19]
- CALVIN research group datasets - object detection with eye tracking, imagenet bounding boxes, synchronised activities, stickman and body poses, youtube objects, faces, horses, toys, visual attributes, shape classes (CALVIN group) [Before 28/12/19]
- CelebA - Large-scale CelebFaces Attributes Dataset; an attribute-loading sketch follows this list (Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang) [Before 28/12/19]
- DukeMTMC-attribute - 23 pedestrian attributes for DukeMTMC-reID (Lin, Zheng, Zheng, Wu and Yang) [Before 28/12/19]
- EMOTIC (EMOTIons in Context) - Images of people (34357) embedded in their natural environments, annotated with 2 distinct emotion representations. (Ronak Kosti, Agata Lapedriza, Jose Alvarez, Adria Recasens) [Before 28/12/19]
- Flowers_Classification Dataset - an image classification dataset consisting of 1821 images of various flower species such as dandelions and daisies (Roboflow) [05/06/25]
- HAT database of 27 human attributes (Gaurav Sharma, Frederic Jurie) [Before 28/12/19]
- Juice Box Quality Assurance Dataset - This dataset consists of 37 images of juice boxes (Roboflow) [05/06/25]
- LFW-10 dataset for learning relative attributes - A dataset of 10,000 pairs of face images with instance-level annotations for 10 attributes. (CVIT, IIIT Hyderabad) [Before 28/12/19]
- Market-1501-attribute - 27 visual attributes for 1501 shoppers. (Lin, Zheng, Zheng, Wu and Yang) [Before 28/12/19]
- Multi-Class Weather Dataset - Our multi-class benchmark dataset contains 65,000 images from 6 common categories for sunny, cloudy, rainy, snowy, haze and thunder weather. This dataset benefits weather classification and attribute recognition. (Di Lin) [Before 28/12/19]
- Person Recognition in Personal Photo Collections - introduces three harder evaluation splits, long-term attribute annotations and per-photo timestamp metadata. (Oh, Seong Joon and Benenson, Rodrigo and Fritz, Mario and Schiele, Bernt) [Before 28/12/19]
- UT-Zappos50K Shoes - Large scale shoe dataset consisting of 50,000 catalog images and over 50,000 pairwise relative attribute labels on 11 fine-grained attributes (Aron Yu, Mark Stephenson, Kristen Grauman, UT Austin) [Before 28/12/19]
- Visual Attributes Dataset - visual attribute annotations for over 500 object classes (animate and inanimate) which are all represented in ImageNet. Each object class is annotated with visual attributes based on a taxonomy of 636 attributes (e.g., has fur, made of metal, is round). [Before 28/12/19]
- Visual Privacy (VISPR) Dataset - Privacy Multilabel Dataset (22k images, 68 privacy attributes) (Orekondy, Schiele, Fritz) [Before 28/12/19]
- WIDER Attribute Dataset - WIDER Attribute is a large-scale human attribute dataset, with 13789 images belonging to 30 scene categories, and 57524 human bounding boxes each annotated with 14 binary attributes. (Li, Yining and Huang, Chen and Loy, Chen Change and Tang, Xiaoou) [Before 28/12/19]
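CelebA (listed above) can be loaded together with its per-image binary attribute annotations through torchvision. A minimal sketch, assuming torchvision is installed; note that the automatic download pulls archives from Google Drive, which is sometimes rate-limited, in which case the files must be placed under the root directory manually. The root directory is a placeholder.

    # Minimal sketch: load CelebA with its 40 binary facial-attribute labels.
    from torchvision.datasets import CelebA

    celeba = CelebA(root="data", split="train", target_type="attr", download=True)
    image, attrs = celeba[0]              # attrs: 0/1 tensor with one entry per attribute
    print(attrs.shape)
    print(celeba.attr_names[:5])          # first few attribute names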
Autonomous Driving
- aiMotive Multimodal Dataset - a multimodal dataset for robust autonomous driving with long-range perception. The dataset consists of 176 scenes with synchronized and calibrated LiDAR, camera, and radar sensors covering a 360-degree field of view. (Matuszka, Barton, et al.) [05/06/25]
- A multimodal dataset for various forms of distracted driving - a multimodal dataset acquired in a controlled experiment on a driving simulator. The set includes data for n=68 volunteers that drove the same highway under four different conditions: No distraction, cognitive distraction, emotional distraction, and sensorimotor distraction. (S. Taamneh, P. Tsiamyrtzis, M. Dcosta, et al.) [31/07/25]
- AMUSE - The automotive multi-sensor (AMUSE) dataset, taken in real traffic scenes during multiple test drives. (Philipp Koschorrek et al.) [Before 28/12/19]
- ApolloCar3D - 5000 labelled images with 60K car instances (Song, Wang, Zhou, Zhu, Guan, Dai, Su, Li, Yang) [26/1/20]
- ApolloScape - high resolution cameras and a Riegl acquisition system. Our dataset is collected in different cities under various traffic conditions. 74555 video frames and their pixel-level and instance-level annotations (Peking University / Baido) [18/1/20]
- Argoverse - Two public datasets supported by highly detailed maps to test, experiment, and teach self-driving vehicles how to understand the world around them; more than 300,000 curated scenarios, 3D tracking annotations for 113 scenes and 324,557 interesting vehicle trajectories for motion forecasting (Chang, Lambert, Sangkloy, Singh, Bak, Hartnett, Wang, Carr, Lucey, Ramanan, Hays) [18/1/20]
- Autonomous Driving - Semantic segmentation, pedestrian detection, virtual-world data, far infrared, stereo, driver monitoring. (CVC research center and the UAB and UPC universities) [Before 28/12/19]
- BOREAS - The Boreas dataset includes over 350km of driving data featuring a 128-channel Velodyne Alpha Prime lidar, a 360 Navtech CIR304-H scanning radar, a 5MP FLIR Blackfly S camera, and centimetre-accurate post-processed ground truth poses. (K. Burnett, D. Yoon, Y. Wu, et al.) [29/06/25]
- Bosch Small Traffic Lights Dataset (BSTLD) - A dataset for traffic light detection, tracking, and classification. [Before 28/12/19]
- DrivingStereo - A Large-Scale Dataset for Stereo Matching in Autonomous Driving Scenarios. 180k stereo images covering a diverse set of driving scenarios (Yang, Song, Huang, Deng, Shi, Zhou) [Before 28/12/19]
- Boxy vehicle detection dataset - A vehicle detection dataset with 1.99 million annotated vehicles in 200,000 images. It contains AABB and keypoint labels. [Before 28/12/19]
- CASR: Cyclist Arm Sign Recognition - Small clips of ~10 seconds showing cyclists performing arm signs. The videos are acquired with a consumer-grade camera. There are 219 arm sign actions annotated. (Zhijie Fang, Antonio M. Lopez) [13/1/20]
- CULane - a large-scale challenging traffic lane detection dataset, which has 133,235 images with lane marking annotations (Pan, Shi, Luo, Wang, Tang) [3/12/2022]
- DADE dataset – Driving Agents in Dynamic Environments - A synthetic dataset containing video sequences (RGB images) acquired by vehicles navigating dynamic environments and weather conditions, with semantic segmentation ground truths, GNSS position data, and weather information. (Halin, Gérin, Cioppa, Henry, Ghanem, Macq, Vleeschouwer, Droogenbroeck) [13/8/25]
- DDD17 - DAVIS Driving Dataset 2017 - "Dataset contains recordings from DAVIS346 camera from driving scenarios primarily on highways along with ground truth car data such as speed, steering, GPS, etc." (Binas, Neil, Liu, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DDD20 - End-to-End Event Camera Driving Dataset - Addition to DDD17. An additional 41h of DAVIS E2E driving data has been collected and organized. (Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DET: A High-resolution DVS Dataset for Lane Extraction - A high-resolution DVS dataset for lane extraction. (Cheng, Luo, Yang, Yu, Chen, Li) [27/12/2020]
- Driving Event Camera Datasets - sequences that were recorded with a VGA (640x480) event camera (Samsung DVS Gen3) and a conventional RGB camera (Huawei P20 Pro) placed on the windshield of a car driving through Zurich. (Davide Scaramuzza, Henri Rebecq) [23/1/20]
- EVCS (Electric Vehicle Charging Stations) Dataset - An object detection dataset of parking scenario images with electric vehicle charging stations (EVCS), which are annotated with EVCS type, bounding boxes, instance segmentation masks and visibility levels. (L. Chen, C. Riggers et al.) [10/07/25]
- DurLAR - A High-fidelity 128-channel LiDAR Dataset with Panoramic Ambient and Reflectivity Imagery for Multi-modal Autonomous Driving Applications (Li, Ismail, Shum, Breckon) [5/10/2022]
- ESRORAD - A dataset of images and point clouds for urban road and rail scenes from Le Havre and Rouen. (R. Khemmar, A. Mauri, C. Dulompont, et al.) [29/06/25]
- Ford Campus Vision and Lidar Data Set - time-registered data from professional (Applanix POS LV) and consumer (Xsens MTI-G) Inertial Measuring Unit (IMU), Velodyne 3D-lidar scanner, two push-broom forward looking Riegl lidars, and a Point Grey Ladybug3 omnidirectional camera system (Pandey, McBride, Eustice) [Before 28/12/19]
- FRSign - French Railway Signalling Dataset. A large-scale and accurate dataset for vision-based railway traffic light detection and recognition. (J. Harb, N. Rebena, R. Chosidow, et al.) [29/06/25]
- FRIDA (Foggy Road Image DAtabase) Image Database - images for performance evaluation of visibility and contrast restoration algorithms. FRIDA: 90 synthetic images of 18 urban road scenes. FRIDA2: 330 synthetic images of 66 diverse road scenes, with viewpoint close to that of the vehicle's driver. (Tarel, Cord, Halmaoui, Gruyer, Hautiere) [Before 28/12/19]
- GEN1 Automotive Detection Dataset - "The dataset was recorded using a PROPHESEE GEN1 sensor with a resolution of 304x240 pixels, mounted on a car dashboard, including bounding box annotations for pedestrians and cars." (de Tournemire, Nitti, Perot, Sironi) [27/12/2020]
- GERALD - The dataset contains 5000 images from a wide variety of railway scenes as well as annotations for the most common types of German mainline railway signals. (P. Leibner, F. Hampel, C. Schindler, et al.) [29/06/25]
- H3D - Honda Research 3D dataset - 360 degree LiDAR dataset (dense pointcloud from Velodyne-64), 160 crowded and highly interactive traffic scenes, 1,071,302 3D bounding box labels, 8 common classes of traffic participants (Patil, Malla, Gang, Chen) [18/1/20]
- House3D - House3D is a virtual 3D environment which consists of thousands of indoor scenes equipped with a diverse set of scene types, layouts and objects sourced from the SUNCG dataset. It consists of over 45k indoor 3D scenes, ranging from studios to two-storied houses with swimming pools and fitness rooms. All 3D objects are fully annotated with category labels. Agents in the environment have access to observations of multiple modalities, including RGB images, depth, segmentation masks and top-down 2D map views. The renderer runs at thousands of frames per second, making it suitable for large-scale RL training. (Yi Wu, Yuxin Wu, Georgia Gkioxari, Yuandong Tian, Facebook Research) [Before 28/12/19]
- India Driving Dataset (IDD) - unstructured driving conditions from India with 50,000 frames (10,000 semantic, and 40,000 coarse annotations) for training autonomous cars to see using object detection, scene-level and instance-level semantic segmentation (CVIT, IIIT Hyderabad and Intel) [Before 28/12/19]
- Joint Attention in Autonomous Driving (JAAD) - The dataset includes instances of pedestrians and cars intended primarily for the purpose of behavioural studies and detection in the context of autonomous driving. (Iuliia Kotseruba, Amir Rasouli and John K. Tsotsos) [Before 28/12/19]
- LISA Vehicle Detection Dataset - colour first person driving video under various lighting and traffic conditions (Sivaraman, Trivedi) [Before 28/12/19]
- LLAMAS Unsupervised dataset - A lane marker detection and segmentation dataset of 100,000 images with 3D lines, pixel level dashed markers, and curves for individual lines. [Before 28/12/19]
- Lost and Found Dataset - The Lost and Found Dataset addresses the problem of detecting unexpected small road hazards (often caused by lost cargo) for autonomous driving applications. (Sebastian Ramos, Peter Pinggera, Stefan Gehrig, Uwe Franke, Rudolf Mester, Carsten Rother) [Before 28/12/19]
- Multi-cue onboard pedestrian detection - a dataset for onboard detection of pedestrians. (C. Wojek, S. Walk, B. Schiele) [29/06/25]
- Multi-View UAV Dataset - A comprehensive multi-view UAV dataset for visual navigation research in GPS-denied urban environments, collected using the CARLA simulator. (Z. Fang) [22/06/25]
- Multi Vehicle Stereo Event Camera Dataset - Multiple sequences containing a stereo pair of DAVIS 346b event cameras with ground truth poses, depth maps and optical flow. (Alex Zihao Zhu, Dinesh Thakur, Tolga Ozaslan, Bernd Pfrommer, Vijay Kumar, Kostas Daniilidis) [Before 28/12/19]
- nuTonomy scenes dataset (nuScenes) - The nuScenes dataset is a large-scale autonomous driving dataset. It features: Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, two diverse cities: Boston and Singapore, left versus right hand traffic, detailed map information, manual annotations for 25 object classes, 1.1M 3D bounding boxes annotated at 2Hz, attributes such as visibility, activity and pose. (Caesar et al) [Before 28/12/19]
- ODMS: Object Depth via Motion and Segmentation Dataset - dataset for learning Object Depth via Motion and Segmentation, which includes extensible training data and a benchmark evaluation across multiple application domains (Griffin, Corso) [26/12/2020]
- OSDaR23 - OSDaR23 is a multi-sensor dataset for detection of objects in the context of railways. (R. Tilly, R. Tagiew, P. Klasek, et al.) [29/06/25]
- Playing for Benchmarks (VIPER) - video sequences comprising 250K frames of urban scenes extracted from a photorealistic open world computer game. Ground truth annotations are available for several visual perception tasks (semantic, instance, panoptic segmentation, optical flow, 3D object detection, visual odometry) (Richter, Hayder, Koltun) [12/08/20]
- Playing for Data: Ground Truth from Computer Games - 25K synthetic images and semantic segmentation ground truth of urban scenes extracted from a photorealistic open world computer game (Richter, Vineet, Roth, Koltun) [12/08/20]
- RADIATE - autonomous driving dataset collected in a variety of weather scenarios to facilitate research on robust and reliable vehicle perception in adverse weather. It includes multiple sensor modalities from radar and optical images to 3D LiDAR point clouds and GPS. (Sheeny, de Pellegrin, Saptarshi, Ahrabian, Wang and Wallace) [26/12/2020]
- Rail3D - This dataset covers three countries: Hungary, France, and Belgium, with a total length of almost 5.8 kilometers and approximately 288 million points. (Kharroubi, A., Ballouch, Z., Hajji, R., et al.) [29/06/25]
- RailCloud-HdF - LiDAR dataset in the context of railways. The dataset is annotated semantically. 8060.3 million data points. (M. Abid, M. Teixeira, A. Mahtani, et al.) [29/06/25]
- RailFOD23 - RailFOD23 comprises a total of 14,615 high-resolution images, making it a valuable resource for training and evaluating foreign object detection models in the railway domain. (Z. Chen, J. Yang, Z. Feng, H. Zhu) [29/06/25]
- RailGoerl24 - RGB and LiDAR dataset in the context of railways, annotated with bounding boxes. 12,205 HD RGB frames and 383,922,305 colored LiDAR points. (DZSF, PECS-WORK GmbH, EYYES Deutschland GmbH, TU Dresden) [29/06/25]
- RailPC - LiDAR dataset in the context of railways. 3 billion data points. (T. Jiang, S. Li, Q. Zhang, et al.) [29/06/25]
- RailSem19 - RailSem19 offers 8500 unique images taken from the ego-perspective of a rail vehicle (trains and trams). (O. Zendel, M. Murschitz, M. Zeilinger, et al.) [29/06/25]
- RAWPED - Railway Pedestrian Dataset (RAWPED) for benchmarking and developing pedestrian detection methods for on-board driver assistance systems. (T. Tugce, A. Burak, B. Burak, et al.) [29/06/25]
- RESIDE (Realistic Single Image DEhazing) - The current largest-scale benchmark consisting of both synthetic and real-world hazy images, for image dehazing research. RESIDE highlights diverse data sources and image contents, and serves various training or evaluation purposes. (Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, Zhangyang Wang) [Before 28/12/19]
- RoadSaW - Large-scale dataset for camera-based road surface and wetness estimation; demonstration video: https://youtu.be/yhH_Qeyhwho (Cordes, Reinders, Hindricks, Lammers, Rosenhahn, Broszio) [5/10/2022]
- RoadSC (Road Snow Coverage) Dataset - Large-scale dataset for camera-based road snow coverage estimation, with demonstration video (K. Cordes, H. Broszio) [10/07/25]
- RUGD: Robot Unstructured Ground Driving - Video sequences collected from a mobile robot with benchmark of over 7,000 annotated frames for semantic segmentation to guide autonomous navigation in unstructured, off-road environments (Wigness, Eum, Rogers, Han, Kwon) [27/12/2020]
- semanticKITTI - A Dataset for Semantic Scene Understanding using LiDAR Sequences; a minimal scan-loading sketch follows this list (Behley, Garbade, Milioto, Quenzel, Behnke, Stachniss, Gall) [18/1/20]
- SVIRO - Synthetic Vehicle Interior Rear Seat Occupancy - 25,000 synthetic sceneries across ten different vehicle interiors with several simulated sensor inputs and ground truth data (Dias Da Cruz, Wasenmueller, Beise, Stifter, Stricker) [29/12/2020]
- SYNTHetic collection of Imagery and Annotations - intended to aid semantic segmentation and related scene understanding problems in the context of driving scenarios. (Computer Vision Center, UAB) [Before 28/12/19]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- TrimBot2020 Public Dataset for Garden Navigation - sensor data recorded from cameras and other sensors mounted on a robotic platform as well as additional external sensors capturing the robot in the garden, used for 3D Reconstruction Meets Semantics Challenge. Includes multicamera, 3D garden ground truth, semantic labels, sensor position. (TrimBot2020 team) [18/1/2022]
- TRoM: Tsinghua Road Markings - This is a dataset which contributes to the area of road marking segmentation for Automated Driving and ADAS. (Xiaolong Liu, Zhidong Deng, Lele Cao, Hongchao Lu) [Before 28/12/19]
- TUM City Campus - Urban point clouds taken by Mobile Laser Scanning (MLS) for classification, object extraction and change detection (Stilla, Hebel, Xu, Gehrung) [3/1/20]
- Udacity Self Driving Car - Relabelled dataset from the original Udacity Self Driving Car Dataset to add missing labels such as pedestrians, bikers, cars, and traffic lights. (Roboflow) [05/06/25]
- University of Michigan North Campus Long-Term Vision and LIDAR Dataset - 27 sessions spaced approximately biweekly over the course of 15 months, indoors and outdoors, varying trajectories, different times of the day across all four seasons. Includes: moving obstacles (e.g., pedestrians, bicyclists, and cars), changing lighting, varying viewpoint, seasonal and weather changes (e.g., falling leaves and snow), and long-term structural changes caused by construction. Includes ground-truth pose. (Carlevaris-Bianco, Ushani, Eustice) [Before 28/12/19]
- UZH-FPV Drone Racing Dataset - for visual inertial odometry and SLAM. 28 real-world first-person view sequences both indoors and outdoors, containing images, IMU, and events and ground truth (Delmerico, Cieslewski, Rebecq, Faessler, Scaramuzza) [Before 28/12/19]
- Vehicles-Open Images Dataset - 627 images of various vehicle classes for object detection. These images are derived from the Open Images open source computer vision datasets. (Solawetz) [09/06/25]
- VLMV (Vehicle Lane Merge Visual) Benchmark - Large scale data set with multi-view video (4 cameras) and object positioning (GNSS-RTK) for the observation of vehicle lane merge maneuvers with objectives (a) the evaluation of camera-based localization of vehicles and (b) learning cooperative maneuvers (K. Cordes and H. Broszio) [1/2/21]
- WHU-Railway3D - A diverse PCSS dataset specifically designed for railway scenes. WHU-Railway3D is categorized into urban, rural, and plateau railways based on scene complexity and semantic class distribution. (B. Qiu, Y. Zhou, L. Dai, et al.) [29/06/25]
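Many of the LiDAR-based driving datasets listed above (semanticKITTI among them) distribute each scan as a flat binary file of float32 values. The following is a minimal sketch, assuming the common KITTI-style convention of four floats (x, y, z, remission) per point; the file path is a hypothetical placeholder:

```python
import numpy as np

def load_kitti_style_scan(path):
    """Load one LiDAR scan stored as a flat float32 binary file.

    Assumes the KITTI/SemanticKITTI convention of four float32 values per
    point: x, y, z, remission.  Adjust the column count for other datasets.
    """
    scan = np.fromfile(path, dtype=np.float32).reshape(-1, 4)
    points = scan[:, :3]      # x, y, z coordinates in metres
    remission = scan[:, 3]    # per-point reflectance value
    return points, remission

# Hypothetical path following the SemanticKITTI folder layout:
# points, remission = load_kitti_style_scan("sequences/00/velodyne/000000.bin")
```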
Biological/Medical
- Cancer Imaging Archive - A dataset consisting of TCIA data related by a common disease (e.g. lung cancer), image modality or type (MRI, CT, digital histopathology, etc.) or research focus. (National Cancer Institute) [05/06/25]
- 2008 MICCAI MS Lesion Segmentation Challenge (National Institutes of Health Blueprint for Neuroscience Research) [Before 28/12/19]
- ALFI: Cell cycle phenotype annotations of label-free time-lapse imaging data from cultured human cells - Dataset of label-free time-lapse microscopy images with segmentation, bounding boxes, and tracking annotations (F. Polverino, L. Antonelli, A. Albu et al.) [10/07/25]
- ASU DR-AutoCC Data - a Multiple-Instance Learning feature space for a diabetic retinopathy classification dataset (Ragav Venkatesan, Parag Chandakkar, Baoxin Li - Arizona State University) [Before 28/12/19]
- Aberystwyth Leaf Evaluation Dataset - Timelapse plant images with hand marked up leaf-level segmentations for some time steps, and biological data from plant sacrifice. (Bell, Jonathan; Dee, Hannah M.) [Before 28/12/19]
- ADP: Atlas of Digital Pathology - 17,668 histological patch images extracted from 100 slides annotated with up to 57 hierarchical tissue types (HTTs) from different organs - the aim is to provide training data for supervised multi-label learning of tissue types in a digitized whole slide image (Hosseini, Chan, Tse, Tang, Deng, Norouzi, Rowsell, Plataniotis, Damaskinos) [14/1/20]
- Annotated Spine CT Database for Benchmarking of Vertebrae Localization - 125 patients, 242 scans (Ben Glocker) [Before 28/12/19]
- BCCD Dataset - This is a dataset of blood cells photos, originally open sourced by cosmicad and akshaylambda. (Cosmicad, akshaylambda, and Roboflow) [09/06/25]
- Bone Texture Characterization for Osteoporosis Diagnosis - Textured images from the bone microarchitecture of osteoporotic and healthy subjects show a high degree of similarity, thus drastically increasing the difficulty of classifying such textures (Rachid Jennane, Hospital of Orleans, France) [28/12/2020]
- Brain Stroke CT-Images - The data set has three categories of brain CT images named: train data, label data, and predict/output data. (V. Bandi, D. Bhattacharyya) [22/06/25]
- Brain Stroke CT Dataset - This dataset contains 6,653 CT brain scans across three categories. (A. Rahman) [31/07/25]
- BRATS - the identification and segmentation of tumor structures in multiparametric magnetic resonance images of the brain (TU Munchen etc.) [Before 28/12/19]
- Breast Ultrasound Dataset B - 2D Breast Ultrasound Images with 53 malignant lesions and 110 benign lesions. (UDIAT Diagnostic Centre, M.H. Yap, R. Marti) [Before 28/12/19]
- BugNIST - dataset for volumetric analysis - to advance methods for classification and detection in volumetric 3D images (Jensen, Dahl, Gundlach, Engberg, Kjer, Dahl) [18/10/24]
- Calgary-Campinas Public Brain MR Dataset: T1-weighted brain MRI volumes acquired in 359 subjects on scanners from three different vendors (GE, Philips, and Siemens) and at two magnetic field strengths (1.5 T and 3 T). The scans correspond to older adult subjects. (Souza, Roberto, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Leticia Rittner, Richard Frayne, and Roberto Lotufo) [Before 28/12/19]
- CAMEL colorectal adenoma dataset - image-level labels for weakly supervised learning containing 177 whole slide images (156 contain adenoma) gathered and labeled by pathologists (Song and Wang) [29/12/19]
- Cardiovascular Disease Dataset - This heart disease dataset was acquired from one of the multispecialty hospitals in India. It includes over 14 common features, making it one of the more comprehensive heart disease datasets available so far for research purposes. (B. Doppala, D. Bhattacharyya) [22/06/25]
- Chest-CT Images - DICOM images of 20 subjects have been collected for the study, in which 11 subjects are identified with cardiomegaly and 9 subjects are healthy (a DICOM-loading sketch follows this list). (B. Doppala, D. Bhattacharyya, M. Chakkravarthy) [22/06/25]
- ChestX-Det - chest X-Ray dataset with instance-level annotations, including instance-level annotations of 13 categories of disease/abnormality of 3,578 images (Deepwise AI Lab) [26/12/2020]
- CheXpert - a large dataset of chest X-rays and competition for automated chest x-ray interpretation, which features uncertainty labels and radiologist-labeled reference standard evaluation sets (Irvin, Rajpurkar et al) [Before 28/12/19]
- Cholec80: 80 gallbladder laparoscopic videos annotated with phase and tool information. (Andru Putra Twinanda) [Before 28/12/19]
- CRCHistoPhenotypes - Labeled Cell Nuclei Data - colorectal cancer histology images consisting of nearly 30,000 dotted nuclei with over 22,000 labeled with the cell type (Rajpoot + Sirinukunwattana) [Before 28/12/19]
- Cavy Action Dataset - 16 sequences with 640 x 480 resolution recorded at 7.5 frames per second (fps), with approximately 31,621,506 frames in total (272 GB), of interacting cavies (guinea pigs) (Al-Raziqi and Denzler) [Before 28/12/19]
- Cell Tracking Challenge Datasets - 2D/3D time-lapse video sequences with ground truth (Ma et al., Bioinformatics 30:1609-1617, 2014) [Before 28/12/19]
- Colon Polyp dataset - The dataset consists of the colonoscopy images of various patients. Along with the polyp images, ground truths and segmented masks of the polyps are attached. (M. Mahanty, D. Bhattacharyya, D. Midhunchakkaravarthy, et al) [22/06/25]
- Computed Tomography Emphysema Database (Lauge Sorensen) [Before 28/12/19]
- COPD Machine Learning Dataset - A collection of feature datasets derived from lung computed tomography (CT) images, which can be used in diagnosis of chronic obstructive pulmonary disease (COPD). The images in this database are weakly labeled, i.e. per image, a diagnosis (COPD or no COPD) is given, but it is not known which parts of the lungs are affected. Furthermore, the images were acquired at different sites and with different scanners. These problems are related to two learning scenarios in machine learning, namely multiple instance learning or weakly supervised learning, and transfer learning or domain adaptation. (Veronika Cheplygina, Isabel Pino Pena, Jesper Holst Pedersen, David A. Lynch, Lauge S., Marleen de Bruijne) [Before 28/12/19]
- CREMI: MICCAI 2016 Challenge - 6 volumes of electron microscopy of neural tissue, neuron and synapse segmentation, synaptic partner annotation. (Jan Funke, Stephan Saalfeld, Srini Turaga, Davi Bock, Eric Perlman) [Before 28/12/19]
- CRIM13 Caltech Resident-Intruder Mouse dataset - 237 10 minute videos (25 fps) annotated with actions (13 classes) (Burgos-Artizzu, Dollar, Lin, Anderson and Perona) [Before 28/12/19]
- CVC colon DB - annotated video sequences of colonoscopy video. It contains 15 short colonoscopy sequences, coming from 15 different studies. In each sequence one polyp is shown. (Bernal, Sanchez, Vilarino) [Before 28/12/19]
- Diabetic Foot Ulcers Classification Datasets - training data for diabetic foot ulcer classification (Goyal, Reeves, Davison, Rajbhandari, Spragg, Yap) [30/12/2020]
- Diabetic Foot Ulcers Object Detection Dataset - training data for diabetic foot ulcer detection (Cassidy, Reeves, Joseph, Gillespie, O'Shea, Rajbhandari, Maiya, Frank, Boulton, Armstrong, Najafi, Wu, Yap) [30/12/2020]
- DIADEM: Digital Reconstruction of Axonal and Dendritic Morphology Competition (Allen Institute for Brain Science et al) [Before 28/12/19]
- DIARETDB1 - Standard Diabetic Retinopathy Database (Lappeenranta Univ of Technology) [Before 28/12/19]
- DRIVE: Digital Retinal Images for Vessel Extraction (Univ of Utrecht) [Before 28/12/19]
- DeformIt 2.0 - Image Data Augmentation Tool: Simulate novel images with ground truth segmentations from a single image-segmentation pair (Brian Booth and Ghassan Hamarneh) [Before 28/12/19]
- Deformable Image Registration Lab dataset - for objective and rigorous evaluation of deformable image registration (DIR) spatial accuracy performance. (Richard Castillo et al.) [Before 28/12/19]
- DERMOFIT Skin Cancer Dataset - 1300 lesions from 10 classes captured under identical controlled conditions. Lesion segmentation masks are included (Fisher, Rees, Aldridge, Ballerini, et al) [Before 28/12/19]
- Dermoscopy images (Eric Ehrsam) [Before 28/12/19]
- EATMINT (Emotional Awareness Tools for Mediated INTeraction) database - The EATMINT database contains multi-modal and multi-user recordings of affect and social behaviors in a collaborative setting. (Guillaume Chanel, Gaelle Molinari, Thierry Pun, Mireille Betrancourt) [Before 28/12/19]
- EPT29 - This database contains 4842 images of 1613 specimens of 29 taxa of EPTs. (Tom et al.) [Before 28/12/19]
- EyePACS - retinal image database is comprised of over 3 million retinal images of diverse populations with various degrees of diabetic retinopathy (EyePACS) [Before 28/12/19]
- FIRE Fundus Image Registration Dataset - 134 retinal image pairs and ground truth for registration. (FORTH-ICS) [Before 28/12/19]
- FMD - Fluorescence Microscopy Denoising dataset - 12,000 real fluorescence microscopy images (Zhang, Zhu, Nichols, Wang, Zhang, Smith, Howard) [Before 28/12/19]
- FocusPath - Focus Quality Assessment for Digital Pathology (Microscopy) Images. 864 image patches are naturally blurred by 16 levels of out-of-focus lens, provided with GT scores of focus levels. (Hosseini, Zhang, Plataniotis) [Before 28/12/19]
- Histology Image Collection Library (HICL) - The HICL is a compilation of 3870 histopathological images (so far) from various diseases, such as brain cancer, breast cancer and HPV (Human Papilloma Virus)-Cervical cancer. (Medical Image and Signal Processing (MEDISP) Lab., Department of Biomedical Engineering, School of Engineering, University of West Attica) [Before 28/12/19]
- Indian Diabetic Retinopathy Image Dataset - This dataset consists of retinal fundus images annotated at pixel-level for lesions associated with Diabetic Retinopathy. Also, it provides the disease severity of diabetic retinopathy and diabetic macular edema. This dataset is useful for development and evaluation of image analysis algorithms for early detection of diabetic retinopathy. (Prasanna Porwal, Samiksha Pachade, Ravi Kamble, Manesh Kokare, Girish Deshmukh, Vivek Sahasrabuddhe, Fabrice Meriaudeau) [Before 28/12/19]
- IRMA (Image retrieval in medical applications) - This collection compiles anonymous radiographs (Deserno TM, Ott B) [Before 28/12/19]
- IVDM3Seg - 24 3D multi-modality MRI data sets of at least 7 IVDs of the lower spine, collected from 12 subjects in two different stages (Zheng, Li, Belavy) [Before 28/12/19]
- JIGSAWS - JHU-ISI Surgical Gesture and Skill Assessment Working Set - a surgical activity dataset for human motion modeling, captured using the da Vinci Surgical System from eight surgeons with different levels of skill performing five repetitions of three elementary surgical tasks. It contains kinematic and video data, plus manual annotations. (Carol Reiley and Balazs Vagvolgyi) [Before 28/12/19]
- KID - A capsule endoscopy database for medical decision support (Anastasios Koulaouzidis and Dimitris Iakovidis) [Before 28/12/19]
- KneeBones3Dify-Annotated-Dataset - MR images and knee bone annotations for validating segmentation and 3D reconstruction algorithms (E. Soscia, D. Romano, L. Maddalena et al.) [10/07/25]
- Leaf Segmentation Challenge - Tobacco and arabidopsis plant images (Hanno Scharr, Massimo Minervini, Andreas Fischbach, Sotirios A. Tsaftaris) [Before 28/12/19]
- LIDC-IDRI - Lung Image Database Consortium image collection (LIDC-IDRI) consists of diagnostic and lung cancer screening thoracic computed tomography (CT) scans with marked-up annotated lesions. [Before 28/12/19]
- Large-Scale Benchmark Dataset - The dataset is a large-scale benchmark comprising approximately 10,500 samples with corresponding annotations. (Khan, Habib, et al.) [05/06/25]
- LITS Liver Tumor Segmentation - 130 3D CT scans with segmentations of the liver and liver tumor. Public benchmark with leaderboard at Codalab.org (Patrick Christ) [Before 28/12/19]
- Mammographic Image Analysis Homepage - a collection of databases links [Before 28/12/19]
- MCC: Melanoma Cancer Cell Dataset - This multitemporal image dataset provides better understanding of the cancer cell migration and anti-migration promoted by specific drugs, classifying cells as treated or untreated, making it possible to characterize phenotypic and morphologic drug effects (V. F. Mota) [29/12/2020]
- Medical image database - Database of ultrasound images of breast abnormalities with the ground truth. (Prof. Stanislav Makhanov, biomedsiit.com) [Before 28/12/19]
- Micro and Macro vascular complications in Type_II diabetes - This data set contains information about micro and macro vascular complications in Type-II diabetes patients. (B. Vamsi, D. Bhattacharyya) [22/06/25]
- MiniMammographic Database (Mammographic Image Analysis Society) [Before 28/12/19]
- MitoEM - Two 4096x4096x1000 volumes for mitochondria instance segmentation from electron microscopy (EM) images of brain tissues. (Wei et al., Harvard University) [26/12/2020]
- MUCIC: Masaryk University Cell Image Collection - 2D/3D synthetic images of cells/tissues for benchmarking (Masaryk University) [Before 28/12/19]
- NIH Chest X-ray Dataset - 112,120 X-ray images with disease labels from 30,805 unique patients. (NIH) [Before 28/12/19]
- OCTA-500 - a large-scale multimodal retinal imaging dataset for segmentation, comprising OCT and OCTA volumetric data from 500 subjects. It includes six types of projection maps, four types of text labels (e.g., gender, disease), and seven segmentation annotations. (M. Li, Y. Zhang, Z. Ji et al.) [31/07/25]
- OASIS - Open Access Series of Imaging Studies - 500+ MRI data sets of the brain (Washington University, Harvard University, Biomedical Informatics Research Network) [Before 28/12/19]
- ORDS Dataset - Optic Disc Segmentation from Retinal Images (Sarhan, Abdullah, Ali Al-Khaz'Aly, Adam Gorner, Andrew Swift, Jon Rokne, Reda Alhajj, and Andrew Crichton) [1/2/21]
- ORVS Dataset - dataset for retinal vessel segmentation from retinal images (Abdullah Sarhan, Jon Rokne, Reda Alhajj and Andrew Crichton) [1/2/21]
- Plant Phenotyping Datasets - plant data suitable for plant and leaf detection, segmentation, tracking, and species recognition (M. Minervini, A. Fischbach, H. Scharr, S. A. Tsaftaris) [Before 28/12/19]
- PTB-XL: Large Public 12-Lead ECG Dataset - PTB-XL is a comprehensive, multi-label 12-lead electrocardiogram (ECG) dataset with 21,837 records from 18,885 patients (10 s each), annotated by cardiologists using SCP-ECG diagnostic, form, and rhythm statements. (P. Wagner, N. Strodthoff, R. D. Bousseljot, W. Samek, T. Schaeffter, et al.) [11/07/2025]
- RECOVERY-FA19 - ultra-widefield FA and annotated pixel-wise binary vessel maps that can be used for the development and evaluation of retinal vessel segmentation algorithms (Li Ding et al.) [29/12/2020]
- Retinal fundus images - Ground truth of vascular bifurcations and crossovers (Univ of Groningen) [Before 28/12/19]
- SARAS endoscopic vision challenge for surgeon action detection - 22,601 annotated training frames with 28,055 action instances from 21 different action classes (Cuzzolin, Singh Bawa, Skarga-Bandurova, Singh) [16/4/20]
- Simulated Brain Database - The SBD contains a set of realistic MRI data volumes produced by an MRI simulator. (R.K.-S. Kwan, A.C. Evans, et al.) [05/06/25]
- SCORHE - 1, 2 and 3 mouse behavior videos, 9 behaviors (Ghadi H. Salem, et al, NIH) [Before 28/12/19]
- SLP (Simultaneously-collected multimodal Lying Pose) - large scale dataset on in-bed poses includes: 2 Data Collection Settings: (a) Hospital setting: 7 participants, and (b) Home setting: 102 participants (29 females, age range: 20-40). 4 Imaging Modalities: RGB (regular webcam), IR (FLIR LWIR camera), DEPTH (Kinect v2) and Pressure Map (Tekscan Pressure Sensing Map). 3 Cover Conditions: uncover, bed sheet, and blanket. Fully labeled poses with 14 joints. (Ostadabbas and Liu) [2/1/20]
- SNEMI3D - 3D Segmentation of neurites in EM images [Before 28/12/19]
- Spine Dataset - All kinds of spinal images from X-ray to CT and MRI (Shuo Li) [30/12/2020]
- Stroke Analysis - This dataset contains primary attributes like Age, NIHSS, mRS, Systolic Blood pressure, Diastolic blood pressure, Glucose, Paralysis, Smoking, BMI, Cholesterol (V. Bandi, D. Midhunchakkaravarthy, D. Bhattacharyya) [22/06/25]
- STructured Analysis of the Retina - 400+ retinal images, with ground truth segmentations and medical annotations [Before 28/12/19]
- Spine and Cardiac data (Digital Imaging Group of London Ontario, Shuo Li) [Before 28/12/19]
- Synthetic Migrating Cells - Six artificial migrating cells (neutrophils) over 98 time frames, various levels of Gaussian/Poisson noise and different path characteristics with ground truth. (Dr Constantino Carlos Reyes-Aldasoro et al.) [Before 28/12/19]
- TMED2 (Tufts Medical Echocardiogram Dataset version 2) - A clinically-motivated benchmark dataset for computer vision and machine learning from limited labeled data, focused on diagnosing the severity of aortic stenosis (AS), a common valve disease, from ultrasound images of the heart (echocardiograms) (Huang, et al) [25/06/24]
- The Internet Brain Segmentation Repository - This dataset provides manually-guided expert segmentation results along with magnetic resonance brain image data. (IBSR) [05/06/25]
- Tryp - It is a dataset of microscopy images of unstained thick blood smears for trypanosome parasite detection. (E. Anzaku, M. Mohammed, U. Ozbulak, et al) [22/06/25]
- UBFC-RPPG Dataset - remote photoplethysmography (rPPG) video data and ground truth acquired with a CMS50E transmissive pulse oximeter (Bobbia, Macwan, Benezeth, Mansouri, Dubois) [Before 28/12/19]
- Uni Bremen Open, Abdominal Surgery RGB Dataset - Recording of a complete, open, abdominal surgery using a Kinect v2 that was mounted directly above the patient looking down at patient and staff. (Joern Teuber, Gabriel Zachmann, University of Bremen) [Before 28/12/19]
- Univ of Central Florida - DDSM: Digital Database for Screening Mammography (Univ of Central Florida) [Before 28/12/19]
- VascuSynth - 120 3D vascular tree like structures with ground truth (Mengliu Zhao, Ghassan Hamarneh) [Before 28/12/19]
- VascuSynth - Vascular Synthesizer generates vascular trees in 3D volumes. (Ghassan Hamarneh, Preet Jassi, Mengliu Zhao) [Before 28/12/19]
- York Cardiac MRI dataset (Alexander Andreopoulos) [Before 28/12/19]
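Several of the radiology collections above (the Chest-CT Images entry, for example) are distributed as per-slice DICOM files. Below is a minimal sketch for stacking such slices into a volume, assuming the third-party pydicom package and hypothetical file names:

```python
import numpy as np
import pydicom  # third-party package: pip install pydicom

def load_dicom_volume(paths):
    """Read single-slice DICOM files and stack them into a 3D volume.

    Slices are ordered by their InstanceNumber tag; real datasets may
    require ordering by ImagePositionPatient instead.
    """
    slices = [pydicom.dcmread(p) for p in paths]
    slices.sort(key=lambda s: int(s.InstanceNumber))
    return np.stack([s.pixel_array for s in slices]).astype(np.float32)

# Hypothetical file names; substitute the paths of an actual series:
# volume = load_dicom_volume(["slice_001.dcm", "slice_002.dcm", "slice_003.dcm"])
```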
Camera calibration
- Affine Covariant Features Dataset - The dataset contains images in PPM format and homographies between image pairs (a homography-estimation sketch follows this list). (T. Tuytelaars, K. Mikolajczyk et al.) [05/06/25]
- BabelCalib - Camera calibration data with various cameras: narrow/medium FOV, fisheye, catadioptric; includes an acquired dataset (OV) and several established datasets (Kalibr, OCamCalib, UZH) with corner detections provided. (Lochman, Liepieshov, Chen, Perdoch, Zach, Pritts) [8/9/25]
- Catadioptric camera calibration images (Yalin Bastanlar) [Before 28/12/19]
- GoPro-Gyro Dataset - This dataset consists of a number of wide-angle rolling shutter video sequences with corresponding gyroscope measurements (Hannes et al.) [Before 28/12/19]
- LO-RANSAC - LO-RANSAC library for estimation of homography and epipolar geometry (K. Lebeda, J. Matas and O. Chum) [Before 28/12/19]
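Datasets such as the Affine Covariant Features set above provide ground-truth homographies between image pairs, so an estimated homography can be checked against them. The sketch below shows one plausible estimation pipeline with OpenCV's RANSAC solver; the file names are placeholders, and ORB features are used purely for illustration rather than the detectors evaluated in the original benchmark:

```python
import cv2
import numpy as np

# Load an image pair (placeholder file names in the dataset's PPM format).
img1 = cv2.imread("img1.ppm", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("img2.ppm", cv2.IMREAD_GRAYSCALE)

# Detect and describe local features, then match them.
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Estimate the homography with RANSAC and compare it against the
# ground-truth homography supplied with the dataset.
src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
print(H)
```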
Character and Document Understanding Datasets
- Artificial Characters Dataset - Dataset artificially generated by using first order theory which describes structure of ten capital letters of English alphabet. (H. Guvenir, B. Acar, H. Muderrisoglu) [24/07/25]
- CASIA-OLHWDB - Online handwritten Chinese character database, collected using Anoto pen on paper. 3755 classes in the GB 2312 character set. (CASIA) [28/07/25]
- Character Trajectories Dataset - Labeled samples of pen tip trajectories for people writing simple characters. (B. Williams) [28/07/25]
- Extended MNIST Dataset - 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases. (Yann LeCun, Corinna Cortes, CJ Burges) [13/06/25]
- Gisette - GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusable digits '4' and '9'. This dataset is one of five datasets of the NIPS 2003 feature selection challenge. (I. Guyon, S. Gunn, A. Ben-Hur, G. Dror) [28/07/25]
- HASYv2 - Handwritten alphanumeric characters: numbers, letters, mathematical and scientific symbols. (M. Thoma) [28/07/25]
- Letter Recognition - Database of character image features; try to identify the letter. (D. Slate) [28/07/25]
- MNIST Dataset - 70,000 28x28 black-and-white images of handwritten digits extracted from two NIST databases (a loading sketch follows this list). (Yann LeCun, Corinna Cortes, CJ Burges) [13/06/25]
- Optical Recognition of Handwritten Digits - Normalized bitmaps of handwritten data. (E. Alpaydin, C. Kaynak) [28/07/25]
- Pen-Based Recognition of Handwritten Digits - Digit database of 250 samples from 44 writers. (E. Alpaydin, Fevzi Alimoglu) [28/07/25]
- Semeion Handwritten Digit - 1593 handwritten digits from around 80 persons were scanned, stretched into a 16x16 rectangular box in a gray scale of 256 values. (T. Srl) [28/07/25]
- Synthetic word and character datasets - A dataset consisting of 9 million images covering 90k English words, and includes the training, validation and test splits used in the authors' work (M. Jaderberg, K Simonyan, A. Vedaldi, A. Zisserman) [05/06/25]
- The noisy Bangla handwritten digit dataset - Includes Handwritten Numeral Dataset (10 classes) and Basic Character Dataset (50 classes), each dataset has three types of noise: white gaussian, motion blur, and reduced contrast. (M. Karki et al.) [28/07/25]
- UJI Pen Characters - Data consists of written characters in a UNIPEN-like format. (D. Llorens, F. Prat, A. Marzal, J. Vilar) [28/07/25]
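For quick experiments with the handwritten-digit collections above, MNIST can be downloaded directly through common libraries. A minimal sketch using torchvision (assuming the package is installed; the root directory is an arbitrary placeholder):

```python
from torchvision import datasets, transforms

# Download MNIST (if not already cached) and expose it as normalised tensors.
mnist = datasets.MNIST(
    root="./data",                     # placeholder cache directory
    train=True,
    download=True,
    transform=transforms.ToTensor(),   # 1x28x28 float tensors in [0, 1]
)

image, label = mnist[0]
print(image.shape, label)              # torch.Size([1, 28, 28]) and the digit class
```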
Event Camera Data
- The ATIS Planes dataset - The ATIS Planes dataset is an event-based dataset of free-hand dropped airplane models (a sketch for accumulating event streams into frames follows this list). (Afshar, Tapson, van Schaik, Cohen) [27/12/2020]
- CED: Color Event Camera Dataset - CED features 50 minutes of footage with both color frames and color events from the Color-DAVIS346. (Scheerlinck, Rebecq, Stoffregen, Barnes, Mahony, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- Combined Dynamic Vision / RGB-D Dataset - "This dataset consists of recordings of the three data streams (color, depth, events) from the D-eDVS - a depth-augmented embedded dynamic vision sensor - and corresponding ground truth data from an external tracking system." (Weikersdorfer, Adrian, Cremers, Conradt) [27/12/2020]
- DailyDVS-200 - A Comprehensive Benchmark Dataset for Event-Based Action Recognition (Wang, Zhu) [8/10/24]
- DDD17 - DAVIS Driving Dataset 2017 - "Dataset contains recordings from DAVIS346 camera from driving scenarios primarily on highways along with ground truth car data such as speed, steering, GPS, etc.." (Binas, Neil, Liu, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DDD20 - End-to-End Event Camera Driving Dataset - Addition to the DDD17. An additional 41h of DAVIS E2E driving data has been collected and organized. (Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DET: A High-resolution DVS Dataset for Lane Extraction - A high-resolution DVS dataset for lane extraction. (Cheng, Luo, Yang, Yu, Chen, Li) [27/12/2020]
- DHP19 - DAVIS Human Pose Estimation and Action Recognition - Dataset contains synchronized Recordings from 4 DAVIS346 cameras with Vicon marker ground truth from 17 subjects doing repeated motions. (Balgrist University Hospital, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- Driving Event Camera Datasets - sequences that were recorded with a VGA (640x480) event camera (Samsung DVS Gen3) and a conventional RGB camera (Huawei P20 Pro) placed on the windshield of a car driving through Zurich. (Davide Scaramuzza, Henri Rebecq) [23/1/20]
- DSEC-FLOW - an Optical Flow Benchmark that includes high-quality labels for event-based optical flow methods recorded over 10 km of trajectories with a car, plus automatic code evaluation and a leaderboard. (Gehrig, Millhausler, Gehrig, Scaramuzza) [19/10/2022]
- DSEC Semantic Dataset - large-scale event-based dataset with fine-grained labels (Sun, Messikommer, Gehrig, Scaramuzza). [19/10/2022]
- DVS09 - DVS128 Dynamic Vision Sensor Silicon Retina - Dataset containing sample DVS recordings. (Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DVSFLOW16 - DVS/DAVIS Optical Flow Dataset - "DVS optical flow dataset contains samples of a scene with boxes, moving sinusoidal gratings, and a rotating disk. The ground truth comes from the camera's IMU rate gyro." (Rueckauer, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DVSACT16 - DVS Datasets for Object Tracking, Action Recognition and Object Recognition - Dataset contains recordings from DVS on tracking datasets. (Hu, Liu, Pfeiffer, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DVSNOISE20 - This dataset is designed to evaluate event denoising algorithm performance against real sensor data and was collected using a DAVIS346 neuromorphic camera. (Almatrafi, Baldwin, Aizawa, Hirakawa) [27/12/2020]
- EBS-EKF - Event-based Sensing Extended Kalman Filter for Star Tracking. Event sensor (Prophesee EVKv4) data of moving star fields with ground truth from a conventional star tracker. (Reed, Hashemi, Melamed, Menon, Hirakawa, McCloskey) [13/8/25]
- ECRot: Event Camera Rotation Dataset - A VGA-resolution event camera dataset designed for the development of event-based rotational motion estimation algorithms (Guo, Gallego) [24/8/25]
- ESL: Event-based Structured Light - first dataset for high-speed depth sensing with data from an event camera and a laser-point projector. (Muglikar, Gallego, Scaramuzza) [24/8/25]
- Event-aided Direct Sparse Odometry (EDS) dataset - VGA-resolution dataset of events and RGB images for visual odometry and low-level vision tasks with hybrid event-RGB sensors. (Hidalgo-Carrió, Gallego, Scaramuzza) [24/8/25]
- Event-based Background-Oriented Schlieren - first real-world dataset for estimation of schlieren imaging in fluid dynamics from event camera data (Shiba, Hamann, Aoki, Gallego) [24/8/25]
- Event-based, Direct Camera Tracking Dataset - The dataset consists of one or more trajectories of an event camera (stored as a rosbag) and corresponding photometric map in the form of a point cloud for real data and a textured mesh for simulated scenes as well as ground truth pose. (Bryner, Gallego, Rebecq, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- The Event-Based Space Situational Awareness (EBSSA) Dataset - "The EBSSA dataset is a collection of event-based recordings of resident space objects, planets and stars." (Afshar, Nicholson, van Schaik, Cohen) [27/12/2020]
- Event Penguins - first dataset of continuous monitoring of a penguin colony in Antarctica using an event camera (Hamann, Ghosh, Juarez-Martinez, Hart, Kacelnik, Gallego) [24/8/25]
- EventKubric - Synthetic low-level vision dataset with pixel-wise annotations obtained from rendering 3D scenes. Includes ego-motion and independently moving objects. (Hamann, Gehrig, Febryanto, Daniilidis, Gallego) [24/8/25]
- EVIMO - Dataset for motion segmentation, egomotion estimation and tracking using an event camera; the dataset is collected with DAVIS 346C and provides 3D poses for camera and independently moving objects, and pixelwise motion segmentation masks. (Mitrokhin, Ye, Fermuller, Aloimonos, Delbruck) [14/1/20]
- Extreme Event Dataset - An event dataset with multiple moving objects in challenging conditions (low lighting conditions and extreme light variation including flashing strobe lights). (Mitrokhin, Fermuller, Parameshwara, Aloimonos) [27/12/2020]
- GEN1 Automotive Detection Dataset - "The dataset was recorded using a PROPHESEE GEN1 sensor with a resolution of 304x240 pixels, mounted on a car dashboard including bounding box annotations for pedestrian and cars." (de Tournemire, Nitti, Perot, Sironi) [27/12/2020]
- High Quality Frames (HQF) dataset - The dataset contains events and ground-truth frames from a DAVIS240C that are well-exposed and minimally motion-blurred. (Stoffregen, Scheerlinck, Scaramuzza, Drummond, Barnes, Kleeman, Mahony) [27/12/2020]
- High Speed and HDR Datasets - The sequences are used in the paper "High Speed and High Dynamic Range Video with an Event Camera" and include events from an event camera and images from a RGB camera. (Rebecq, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- MNIST-DVS and FLASH-MNIST-DVS Databases - The dataset is based on the original frame-based MNIST dataset and contains recordings from a DVS (Dynamic Vision Sensor). (Yousefzadeh, Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- MouseSIS: Space-Time Instance Segmentation of Mice - First dataset of mice (articulated moving objects) recorded with an event camera and a grayscale camera, including pixel-accurate annotations. Useful for prototyping algorithms in: tracking, segmentation, etc. (Hamann, Li, Mieske, Lewejohann, Gallego) [24/8/25]
- Multi Vehicle Stereo Event Camera Dataset - Multiple sequences containing a stereo pair of DAVIS 346b event cameras with ground truth poses, depth maps and optical flow. (Alex Zihao Zhu, Dinesh Thakur, Tolga Ozaslan, Bernd Pfrommer, Vijay Kumar, Kostas Daniilidis) [Before 28/12/19]
- N-Caltech101 (Neuromorphic-Caltech101) - The dataset is a spiking version of the original frame-based Caltech101 dataset. (Orchard, Cohen, Jayawant, Thakor) [27/12/2020]
- N-Cars - "The dataset is composed of 12,336 car samples and 11,693 non-cars samples (background) for classification recorded by an ATIS camera." (Sironi, Brambilla, Bourdis, Lagorce, Benosman) [27/12/2020]
- N-MNIST (Neuromorphic-MNIST) - The dataset is a spiking version of the original frame-based MNIST dataset of handwritten digits. (Orchard, Cohen, Jayawant, Thakor) [27/12/2020]
- "Neuromorphic Vision Datasets for Pedestrian Detection, Action Recognition, and Fall Detection" - "Neuromorphic vision datasets for pedestrian detection, action recognition and fall detection recorded with a DAVIS346redColor." (Miao, Chen, Ning, Zi, Ren, Bing, Knoll) [27/12/2020]
- N-SOD Dataset - "Neuromorphic Single Object Dataset (N-SOD), contains three objects with samples of varying length in time recorded with an event-based sensor." (Ramesh, Ussa, Vedovs, Yang, Orchard) [27/12/2020]
- POKER-DVS Database - "The POKER-DVS database consists of a set of 131 poker pip symbols tracked and extracted from 3 separate DVS recordings, while browsing very quickly poker cards." (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- PRED18 - VISUALISE Predator/Prey Dataset - Dataset contains recordings from a DAVIS240 camera mounted on a computer-controlled robot (the predator) that chases and attempts to capture another human-controlled robot (the prey). (Moeys, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- Recycling Video Datasets for Event Cameras - This repository contains code that implements video to events conversion as described in Gehrig et al. CVPR'20 and the used dataset. (D. Gehrig, M. Gehrig et al.) [05/06/25]
- RGB-DAVIS Dataset - The dataset contains indoor and outdoor sequences involving camera motion and/or scene motion collected with a RGB-DAVIS imaging system. (Wang, Duan, Cossairt, Katsaggelos, Huang, Shi) [27/12/2020]
- ROSHAMBO17 - RoShamBo Rock Scissors Paper game DVS dataset - "Dataset is recorded from ~20 persons each showing the rock, scissors and paper symbols for about 2m each with a variety of poses, distances, positions, left/right hand." (Lungu, Corradi, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- SL-ANIMALS-DVS Database - The SL-ANIMALS-DVS database consists of DVS recordings of humans performing sign language gestures of various animals as a continuous spike flow at very low latency. (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- SLOW-POKER-DVS Database - "The SLOW-POKER-DVS database consists of 4 separate DVS recordings, while slowly moving a poker symbol in front of the camera for about 3 minutes." (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- UZH-RPG stereo events dataset - First dataset for 3D Reconstruction with a Stereo Event Camera (DAVIS240C) (Zhou, Gallego, Rebecq, Kneip, Li, Scaramuzza) [24/8/25]
- ViViD : Vision for Visibility Dataset - "The dataset provides normal and poor illumination sequences recorded by thermal, depth, and temporal difference sensor for indoor and outdoor trajectories." (Lee, Cho, Yoon, Shin, Kim) [27/12/2020]
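Unlike the frame-based datasets elsewhere on this page, the recordings above consist of asynchronous events, usually unpacked as (x, y, timestamp, polarity) tuples. The sketch below, using synthetic arrays in place of a real recording, accumulates such a stream into a simple 2D image for visualisation; the array names are assumptions about how a given dataset is decoded:

```python
import numpy as np

def events_to_frame(x, y, polarity, height, width):
    """Accumulate an event stream into a signed 2D histogram image.

    Each event adds +1 (positive polarity) or -1 (negative polarity) to
    the pixel it fired at; timestamps are ignored in this simple sketch.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    np.add.at(frame, (y, x), np.where(polarity > 0, 1.0, -1.0))
    return frame

# Synthetic stand-in data: 1000 random events on a 480x640 sensor.
rng = np.random.default_rng(0)
x = rng.integers(0, 640, 1000)
y = rng.integers(0, 480, 1000)
polarity = rng.integers(0, 2, 1000)
print(events_to_frame(x, y, polarity, 480, 640).shape)
```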
Face and Eye/Iris Databases
- See also: Wikipedia's List of facial expression databases - 19 image and video datasets [30/12/2023]
- 2D-3D face dataset - This dataset includes pairs of 2D face image and its corresponding 3D face geometry model with geometry details. (Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, Jianmin Zheng) [Before 28/12/19]
- 300 Videos in the Wild (300-VW) - 68 Facial Landmark Tracking (Chrysos, Antonakos, Zafeiriou, Snape, Shen, Kossaifi, Tzimiropoulos, Pantic) [Before 28/12/19]
- 300W-Style - enhanced version of 300W by applying three style changes to the original images. It is used to facilitate the analysis of the facial landmark detection problem. (Xuanyi Dong) [29/12/19]
- 3D_RMA - 120 persons were asked to pose twice in front of the system: in Nov 97 (session 1) and in January 98 (session 2). For each session, 3 shots were recorded with different (but limited) orientations of the head: straight forward / left or right / upward or downward. (Royal Military Academy (Belgium)) [24/07/25]
- 3D Mask Attack Dataset - 76500 frames of 17 different people, facing the camera against a plain background. Two sets of the data are captured on the real subjects two weeks apart, while the final set consists of a single person wearing a fake face mask of the 17 different people. (N. Erdogmus, S. Marcel) [30/07/25]
- 3D facial expression - Binghamton University 3D Static and Dynamic Facial Expression Databases (Lijun Yin, Jeff Cohn, and teammates) [Before 28/12/19]
- 3DFE - Binghamton University 3D static facial expression database (Lijun Yin et al.) [28/12/2020]
- 4DFE - Binghamton University 3D dynamic facial expression database (Lijun Yin et al.) [28/12/2020]
- AFAD: Asian Face Age Dataset - a new dataset proposed for evaluating the performance of age estimation, which contains more than 160K facial images and the corresponding age and gender labels (Niu, Zhou, Gao, Hua) [27/12/2020]
- AffectNet - AffectNet contains more than 1M facial images collected from the Internet by querying three major search engines using 1250 emotion related keywords in six different languages. (A. Mollahosseini, B. Hasani, M. Mahoor) [31/07/25]
- Aff-Wild2 - The largest in-the-wild A/V database consisting of around 3M frames annotated for valence-arousal, 12 action units and 8 expression categories (D. Kollias) [10/07/25]
- AFLW-Style - enhanced version of AFLW by applying three style changes to the original images. It is used to facilitate the analysis of the facial landmark detection problem. (Xuanyi Dong) [29/12/19]
- AginG Faces in the Wild v2 (AGFW-v2) - AGFW-v2 consists of 36,299 facial images divided into 11 age groups with a span of five years between groups. On average, there are 3,300 images per group. The subjects in AGFW-v2 are not public figures and are less likely to have significant make-up or facial modifications, helping embed accurate aging effects during the learning process. (Chi Nhan Duong, Khoa Luu, Kha Gia Quach, Tien D. Bui) [Before 28/12/19]
- Audio-visual database for face and speaker recognition (Mobile Biometry MOBIO http://www.mobioproject.org/) [Before 28/12/19]
- Audiovisual Lombard grid speech corpus - a bi-view audiovisual Lombard speech corpus which can be used to support joint computational-behavioral studies in speech perception (Alghamdi, Maddock, Marxer, Barker and Brown) [31/12/19]
- BANCA face and voice database (Univ of Surrey) [Before 28/12/19]
- Binghamton Univ 3D static and dynamic facial expression database (Lijun Yin, Peter Gerhardstein and teammates) [Before 28/12/19]
- Binghamton-Pittsburgh 4D Spontaneous Facial Expression Database - consists of 2D spontaneous facial expression videos and FACS codes. (Lijun Yin et al.) [Before 28/12/19]
- BioID face database (BioID group) [Before 28/12/19]
- BioVid Heat Pain Database - This video (and biomedical signal) dataset contains facial and psychophysiological reactions of 87 study participants who were subjected to experimentally induced heat pain. (University of Magdeburg (Neuro-Information Technology group) and University of Ulm (Emotion Lab)) [Before 28/12/19]
- Biometric databases - biometric databases related to iris recognition (Adam Czajka) [Before 28/12/19]
- Biwi 3D Audiovisual Corpus of Affective Communication - 1000 high quality, dynamic 3D scans of faces, recorded while pronouncing a set of English sentences. [Before 28/12/19]
- BOBSL: BBC-Oxford British Sign Language Dataset - 1,962 episodes (approximately 1,400 hours) of BSL-interpreted BBC broadcast footage accompanied by written English subtitles. 39 signers. (Albanie, Varol, Momeni, Bull, et al) [22/11/21]
- Bosphorus 3D/2D Database of FACS annotated facial expressions, of head poses and of face occlusions (Bogazici University) [Before 28/12/19]
- BP4D - Binghamton-Pittsburgh 4D Spontaneous Facial Expression Database - 2D and 3D spontaneous facial expression videos and FACS codes (Lijun Yin et al.) [28/12/2020]
- BP4D+ - Multimodal Spontaneous Emotion database of 2D, 3D, thermal, and physiological spontaneous facial expression videos, FACS codes, and feature points. (Lijun Yin et al.) [28/12/2020]
- BU-3DFE - 100 subjects with 2500 facial expression models. (Binghamton University) [24/07/25]
- BUPT-Balancedface - A large-scale face database consisting of 1.2M in-the-wild images with equalized racial distribution. Also, a large-scale face database consisting of 2M in-the-wild images with the racial distribution according to world's population. (Wang, Zhang, Deng) [28/12/2020]
- CAER (Context-Aware Emotion Recognition) - Large scale image and video dataset for emotion recognition, and facial expression recognition (Lee, Kim, Kim, Park, and Sohn) [29/12/19]
- Cafca Synthetic Dataset - The full dataset contains 1,500 synthetic subjects, each of which is rendered in 3 environments with 13 random expressions from 30 views, resulting in 1.755 million images. The dataset is split into 15 chunks; each chunk is about 67 GB and contains 100 identities. (M. Buehler, G. Li, E. Wood, et al) [22/06/25]
- CALFW: Cross-age Labeled Faces in-the-Wild - A large-scale benchmark database designed to evaluate the accuracy of face-recognition models under variable age conditions (Zheng, Deng) [28/12/2020]
- Caricature/Photomates dataset - a dataset with frontal faces and corresponding Caricature line drawings (Tayfun Akgul) [Before 28/12/19]
- CASIA-IrisV3 (Chinese Academy of Sciences, T. N. Tan, Z. Sun) [Before 28/12/19]
- CASIA Web-Face Database - The CASIA-WebFace dataset is used for face verification and face identification tasks. The dataset contains 494,414 face images of 10,575 real identities collected from the web. (Yi et al.) [24/07/25]
- CASIR Gaze Estimation Database - RGB and depth images (from Kinect V1.0) and ground truth values of facial features corresponding to experiments for gaze estimation benchmarking (Filipe Ferreira et al.) [Before 28/12/19]
- Celeb-DF - A new large-scale and challenging DeepFake video dataset, Celeb-DF, for the development and evaluation of DeepFake detection algorithms (Li, Yang, Sun, Qi and Lyu) [30/12/19]
- CMU Facial Expression Database (CMU/MIT) [Before 28/12/19]
- CMU Multi-PIE Face Database - more than 750,000 images of 337 people recorded in up to four sessions over the span of five months. (Jeff Cohn et al.) [Before 28/12/19]
- CMU Pose, Illumination, and Expression (PIE) Database (Simon Baker) [Before 28/12/19]
- CMU/MIT Frontal Faces (CMU/MIT) [Before 28/12/19]
- CoMA 3D face dataset - 20,466 meshes (3D head scans and registrations in FLAME topology) of extreme facial expressions captured from 12 different subjects (Ranjan, Bolkart, Sanyal, Black) [Before 28/12/19]
- CPLFW: Cross-pose Labeled Faces in-the-Wild - A large-scale benchmark database designed to evaluate the accuracy of face-recognition models under variable pose conditions (Zheng, Deng) [28/12/2020]
- CSSE Frontal intensity and range images of faces (Ajmal Mian) [Before 28/12/19]
- CelebA - Large-scale CelebFaces Attributes Dataset (Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang) [Before 28/12/19]
- Celebrities in Frontal-Profile in the Wild - 500+ images of celebrities in frontal and profile views (Sengupta, Cheng, Castillo, Patel, Chellappa, Jacobs) [Before 28/12/19]
- CK+ Dataset -The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. (P. Lucey, J. Cohn, T. Kanade, et al.) [30/07/25]
- Cohn-Kanade AU-Coded Expression Database - 500+ expression sequences of 100+ subjects, coded by activated Action Units (Affect Analysis Group, Univ. of Pittsburgh) [Before 28/12/19]
- Cohn-Kanade AU-Coded Expression Database - for research in automatic facial image analysis and synthesis and for perceptual studies (Jeff Cohn et al.) [Before 28/12/19]
- Columbia Gaze Data Set - 5,880 images of 56 people over 5 head poses and 21 gaze directions (Brian A. Smith, Qi Yin, Steven K. Feiner, Shree K. Nayar) [Before 28/12/19]
- Computer Vision Laboratory Face Database (CVL Face Database) - Database contains 798 images of 114 persons, with 7 images per person and is freely available for research purposes. (Peter Peer etc.) [Before 28/12/19]
- DaiSEE: Dataset for Affective States in E-Environments - a multi-label video classification dataset comprising of 9068 video snippets captured from 112 users for recognizing user's affective states in e-environments (such as e-learning): boredom, confusion, engagement, and frustration "in the wild". (Gupta, D'Cunha, Awasthi, Balasubramanian) [27/12/2020]
- Deep future gaze - This dataset consists of 57 sequences on search and retrieval tasks performed by 55 subjects. Each video clip lasts around 15 minutes, with a frame rate of 10 fps and a frame resolution of 480 by 640. Each subject is asked to search for a list of 22 items (including lanyard, laptop) and move them to the packing location (dining table). (National University of Singapore, Institute for Infocomm Research) [Before 28/12/19]
- DISFA+: Extended Denver Intensity of Spontaneous Facial Action Database - an extension of DISFA (M.H. Mahoor) [Before 28/12/19]
- DISFA: Denver Intensity of Spontaneous Facial Action Database - a non-posed facial expression database for those who are interested in developing computer algorithms for automatic action unit detection and their intensities described by FACS. (M.H. Mahoor) [Before 28/12/19]
- DHF1K - 1000 elaborately selected video sequences with fixation annotations from 17 viewers. (Prof. Jianbing Shen) [Before 28/12/19]
- DiveFace - A dataset to train unbiased and discrimination-aware face recognition algorithms. It contains annotations equally distributed among six classes related to gender and ethnicity. (A. Morales, J. Fierrez, R. Vera-Rodriguez, R. Tolosana) [1/2/21]
- EarVN1.0 - ear images of 164 Asian subjects, comprising 28,412 colour images of 98 males and 66 females (Vinh Truong Hoang) [7/12/24]
- EB+ - expanded BP4D+ (Lijun Yin et al.) [28/12/2020]
- Edinburgh Daily Face Change Dataset - frontal images of 3 faces captured several times a day for several months (Sun, Fisher) [21/10/2021]
- ELFW: Extended Labeled Faces in-the-Wild - Additional face-related categories and segmented faces from the LFW dataset (Redondo, Gibert) [27/12/2020]
- ETH-XGaze - A large scale (over 1 million samples) gaze estimation dataset with high-resolution images under extreme head poses and gaze directions (AIT group at ETH Zurich and Google) [1/12/2021]
- EURECOM Facial Cosmetics Database - 389 images, 50 persons with/without make-up, annotations about the amount and location of applied makeup. (Jean-Luc DUGELAY et al) [Before 28/12/19]
- Eurecom Kinect Face Dataset - Images of faces captured under laboratory conditions, with different levels of occlusion and illumination, and with different facial expressions. (R. Min, N. Kose, J. Dugelay, et al.) [30/07/25]
- EVE - A video dataset consisting of time-synchronized screen recordings, user-facing camera views, and eye gaze data (AIT group at ETH Zurich) [1/12/2021]
- EYEDIAP dataset - The EYEDIAP dataset was designed to train and evaluate gaze estimation algorithms from RGB and RGB-D data. It contains a diversity of participants, head poses, gaze targets and sensing conditions. (Kenneth Funes and Jean-Marc Odobez) [Before 28/12/19]
- EyeMouseMap - gaze and mouse tracking data collected during the visual search of four different point target symbols (Vlachou, Pappa, Liaskos, Krassanakis) [17/10/24]
- Face2BMI Dataset - The Face2BMI dataset contains 2103 pairs of faces, with corresponding gender, height and previous and current body weights, which allows for training computer vision models that can predict body-mass index (BMI) from profile pictures. (Enes Kocabey, Ferda Ofli, Yusuf Aytar, Javier Marin, Antonio Torralba, Ingmar Weber) [Before 28/12/19]
- FaceForensics++ - a forensics dataset consisting of 1000 original video sequences that have been manipulated with four automated face manipulation methods: Deepfakes, Face2Face, FaceSwap and NeuralTextures. (Rossler, Cozzolino, Verdoliva, Riess, Thies, Niessner) [17/12/2021]
- FDDB: Face Detection Data set and Benchmark - studying unconstrained face detection (University of Massachusetts Computer Vision Laboratory) [Before 28/12/19]
- FDDB-360 - face detection in 360 degree fisheye images (Fu, Alvar, Bajic, and Vaughan) [29/12/19]
- FEI Face Database - The FEI face database is a Brazilian face database that contains a set of face images taken between June 2005 and March 2006 at the Artificial Intelligence Laboratory of FEI in Sao Bernardo do Campo, Sao Paulo, Brazil. There are 14 images for each of 200 individuals, a total of 2800 images. (C. Thomaz) [31/07/25]
- FERA dataset - partial BP4D-Spontaneous Facial Expression Database - consists of 2D spontaneous facial expression videos and FACS codes (Lijun Yin et al.) [28/12/2020]
- FERG-3D-DB - Facial Expression Research Group 3D Database (FERG-3D-DB) is a database of 3D rigs of stylized characters with annotated facial expressions. The database contains 39574 annotated examples for four stylized characters. (D. Aneja, B. Chaudhuri, A. Colburn, et al.) [30/07/25]
- FG-Net Aging Database of faces at different ages (Face and Gesture Recognition Research Network) [Before 28/12/19]
- Face Recognition Grand Challenge datasets (FRVT - Face Recognition Vendor Test) [Before 28/12/19]
- FMTV - Laval Face Motion and Time-Lapse Video Database. 238 thermal/video subjects with a wide range of poses and facial expressions acquired over 4 years (Ghiass, Bendada, Maldague) [Before 28/12/19]
- Face Super-Resolution Dataset - Ground truth HR-LR face images captured with a dual-camera setup (Chengchao Qu et al.) [Before 28/12/19]
- FaceScrub - A Dataset With Over 100,000 Face Images of 530 People (50:50 male and female) (H.-W. Ng, S. Winkler) [Before 28/12/19]
- FaceTracer Database - 15,000 faces (Neeraj Kumar, P. N. Belhumeur, and S. K. Nayar) [Before 28/12/19]
- Facial Expression Dataset - This dataset consists of 242 facial videos (168,359 frames) recorded in real world conditions. (Daniel McDuff et al.) [Before 28/12/19]
- Flickr-Faces-HQ Dataset - Collection of images, each containing a face, crawled from Flickr. (Karras et al.) [28/07/25]
- Florence 2D/3D Hybrid Face Dataset - bridges the gap between 2D, appearance-based recognition techniques, and fully 3D approaches (Bagdanov, Del Bimbo, and Masi) [Before 28/12/19]
- Face Painting Datasets - A dataset consisting of face paintings (E. J. Crowley, O. M. Parkhi, A. Zisserman) [05/06/25]
- Facial Recognition Technology (FERET) Database (USA National Institute of Standards and Technology) [Before 28/12/19]
- Gi4E Database - eye-tracking database with 1300+ images acquired with a standard webcam, corresponding to different subjects gazing at different points on a screen, including ground-truth 2D iris and corner points (Villanueva, Ponz, Sesma-Sanchez, Mikel Porta, and Cabeza) [Before 28/12/19]
- Google Facial Expression Comparison dataset - a large-scale facial expression dataset consisting of face image triplets along with human annotations that specify which two faces in each triplet form the most similar pair in terms of facial expression, which is different from datasets that focus mainly on discrete emotion classification or action unit detection (Vemulapalli, Agarwala) [Before 28/12/19]
- Hannah and her sisters database - a dense audio-visual person-oriented ground-truth annotation of faces, speech segments, shot boundaries (Patrick Perez, Technicolor) [Before 28/12/19]
- Headspace dataset - The Headspace dataset is a set of 3D images of the full human head, consisting of 1519 subjects wearing tight fitting latex caps to reduce the effect of hairstyles. (Christian Duncan, Rachel Armstrong, Alder Hey Craniofacial Unit, Liverpool, UK) [Before 28/12/19]
- Hong Kong Face Sketch Database [Before 28/12/19]
- HPGEN - Synthetic image data set using a generative model with explicit control over the head pose. HPGEN offers a promising solution to address data set bias in head pose estimation, as current benchmarks suffer from a limited number of images, imbalanced data distributions, the high cost of annotation, and ethical concerns. (R. Valle, J. Buenaposada, L. Baumela) [22/06/25]
- IDIAP Head Pose Database (IHPD) - The dataset contains a set of meeting videos along with the head pose ground truth of individual participants (around 128 min) (Sileye Ba and Jean-Marc Odobez) [Before 28/12/19]
- IARPA Janus Benchmark datasets - IJB-A, IJB-B, IJB-C, FRVT (NIST) [Before 28/12/19]
- IISc Indian Face Dataset (IISCIFD) - Dataset of Indian faces for fine-grained race classification (H. Katti, S. P. Arun) [10/07/25]
- IMDB-WIKI - 500k+ face images with age and gender labels (Rasmus Rothe, Radu Timofte, Luc Van Gool) [Before 28/12/19]
- IMPA-FACE3D - 2008 dataset showing neutral frontal, joy, sadness, surprise, anger, disgust, fear, opened, closed, kiss, left side, right side, neutral sagittal left, neutral sagittal right, nape and forehead (sometimes acquired). (VISGRAF) [31/07/25]
- Indian Movie Face database (IMFDB) - a large unconstrained face database consisting of 34512 images of 100 Indian actors collected from more than 100 videos (Vijay Kumar and C V Jawahar) [Before 28/12/19]
- Indian Semi-Acted Facial Expression Database (iSAFE) - Happy, Sad, Fear, Surprise, Angry, Neutral, Disgust emotions. (S. Singh, S. Benedict) [30/07/25]
- Indian Spontaneous Expression Database (ISED) - Near-frontal face video recorded from 50 participants while they watched emotional video clips. (A. Happy) [30/07/25]
- Iranian Face Database - IFDB is the first face image database from the Middle East, containing color facial images with age, pose, and expression for subjects aged 2-85. (Mohammad Mahdi Dehshibi) [Before 28/12/19]
- IRISSEG-EP - Iris Segmentation Ground Truth Database -- Elliptical/Polynomial Boundaries - Elliptical ground truth masks and polynomial boundary parameters for unrolling on various databases (Casia, Ubiris, Notredame IITD). (Hofbauer, Alonso-Fernandez, Wild, Bigun, Uhl) [24/06/25]
- Japanese Female Facial Expression (JAFFE) Database (Michael J. Lyons) [Before 28/12/19]
- LaPa - a large-scale dataset for face parsing (Liu, Shi, Mei) [27/12/2020]
- LEGAN Perceptual Dataset - synthetic face images generated by 5 different approaches and their human-annotated perceptual labels (i.e. mean and standard deviation of naturalness score of each image) (Banerjee, Joshi, Mahajan, Bhattacharya, Kyal, Mishra) [3/12/2022]
- LIRIS Children Spontaneous Facial Expression Video Database - Spontaneous / natural facial expressions of 12 children in diverse settings, with variable video recording scenarios, showing six universal or prototypic emotional expressions (happiness, sadness, anger, surprise, disgust and fear). Children were recorded in a constraint-free environment (no restriction on head or hand movement, free sitting, no restriction of any sort) while they watched specially built / selected stimuli, so that spontaneous / natural expressions could be recorded as they occur. The database has been validated by 22 human raters. (Khan, Crenn, Meyer, Bouakaz) [29/12/19]
- LFW: Labeled Faces in the Wild - unconstrained face recognition (a loading sketch appears after this list) [Before 28/12/19]
- LIRIS Children Spontaneous Facial Expression Video Database (LIRIS-CSE) Emotional video database of ethnically diverse children with spontaneous expressions across six basic emotions (R. A. Khan, A. Crenn, A. Meyer et al.) [10/07/25]
- Localized Audio Visual DeepFake Dataset (LAV-DF) - The first multimodal deepfake dataset for temporal localization. (Z. Cai, S. Ghosh, A. Dhall, et al.) [22/06/25]
- LS3D-W - a large-scale 3D face alignment dataset annotated with 68 points containing faces captured in an "in-the-wild" setting. (Adrian Bulat, Georgios Tzimiropoulos) [Before 28/12/19]
- LSDIR - LSDIR is a large-scale iris recognition dataset with over 460,000 images from more than 2,000 subjects, collected under diverse sensors and challenging conditions such as blur, occlusion, and lighting variation. (H. Nguyen, K. Cao, A. Ross et al.) [31/07/25]
- MAAD-Face: A Massively Annotated Attribute Dataset for Face Images - A large-scale face annotation dataset with 123.9M attribute annotations from 47 soft-biometric attributes. (Terhorst, Fahrmann, Kolf, Damer, Kirchbuchner, Kuijper) [3/12/2022]
- MAFA: MAsked FAces - 30,811 images with 35,806 labeled MAsked FAces, six main attributes of each masked face. (Shiming Ge, Jia Li, Qiting Ye, Zhao Luo) [Before 28/12/19]
- Makeup Induced Face Spoofing (MIFS) - 107 makeup-transformations attempting to spoof a target identity. Also other datasets. (Antitza Dantcheva) [Before 28/12/19]
- MaskedFace-Net - a dataset of human faces with a correctly or incorrectly worn mask (137,016 images) (Cabani, Hammoudi, Benhabiles, Melkemi) [26/12/2020]
- MERL-RAV - Dataset contains over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. (Abhinav Kumar, Tim K. Marks, Wenxuan Mou, Ye Wang, Michael Jones, Anoop Cherian, Toshiaki Koike-Akino, Xiaoming Liu and Chen Feng) [1/2/21]
- Mexculture142 - Mexican Cultural heritage objects and eye-tracker gaze fixations (Montoya Obeso, Benois-Pineau, Garcia-Vazquez, Ramirez Acosta) [Before 28/12/19]
- MIT CBCL Face Recognition Database (Center for Biological and Computational Learning) [Before 28/12/19]
- MIT Collation of Face Databases (Ethan Meyers) [Before 28/12/19]
- MIT eye tracking database (1003 images) (Judd et al) [Before 28/12/19]
- MMI Facial Expression Database - 2900 videos and high-resolution still images of 75 subjects, annotated for FACS AUs. [Before 28/12/19]
- MOBIUS (Mobile Ocular Biometrics In Unconstrained Settings dataset) - Contains over 16,000 eye images captured using three mobile devices, with manual segmentation masks on a subset (M. Vitek) [10/07/25]
- MORPH (Craniofacial Longitudinal Morphological Face Database) (University of North Carolina Wilmington) [Before 28/12/19]
- MPIIGaze dataset - 213,659 samples with eye images and gaze targets under different illumination conditions and natural head movement, collected from 15 participants using their laptops during everyday use. (Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling) [Before 28/12/19]
- Manchester Annotated Talking Face Video Dataset (Timothy Cootes) [Before 28/12/19]
- MegaFace - 1 million faces in bounding boxes (Kemelmacher-Shlizerman, Seitz, Nech, Miller, Brossard) [Before 28/12/19]
- Music video dataset - 8 music videos from YouTube for developing multi-face tracking algorithms in unconstrained environments (Shun Zhang, Jia-Bin Huang, Ming-Hsuan Yang) [Before 28/12/19]
- NIST Face Recognition Grand Challenge (FRGC) (NIST) [Before 28/12/19]
- NIST mugshot identification database (USA National Institute of Standards and Technology) [Before 28/12/19]
- Notre Dame Synthetic Face Image Dataset - 2M face images of 12K synthetic identities and 8K 3D real head models (Banerjee, Scheirer, Bowyer, and Flynn) [3/12/2022]
- NRC-IIT Facial Video Database - this database contains pairs of short video clips each showing a face of a computer user sitting in front of the monitor exhibiting a wide range of facial expressions and orientations (Dmitry Gorodnichy) [Before 28/12/19]
- Notre Dame Iris Image Dataset (Patrick J. Flynn) [Before 28/12/19]
- Notre Dame face, IR face, 3D face, expression, crowd, and eye biometric datasets (Notre Dame) [Before 28/12/19]
- ORL face database: 40 people with 10 views (AT&T Cambridge Labs) [Before 28/12/19]
- OUI-Adience Faces - unfiltered faces for gender and age classification plus 3D faces (OUI) [Before 28/12/19]
- Oulu-CASIA NIR&VIS facial expression database - The dataset contains 80 subjects, ranging in age from 23 to 58 years old; 73.81% of the subjects are male. (Chinese Academy of Sciences, University of Oulu) [30/07/25]
- Oxford: faces, flowers, multi-view, buildings, object categories, motion segmentation, affine covariant regions, misc (Oxford Visual Geometry Group) [Before 28/12/19]
- Paintings Datasets - A dataset consisting of paintings where objects have a variety of sizes, poses and depictive styles, and can be partially occluded or truncated. (E. J. Crowley, A. Zisserman) [05/06/25]
- Pandora - POSEidon: Face-from-Depth for Driver Pose (Borghi, Venturelli, Vezzani, Cucchiara) [Before 28/12/19]
- PubFig: Public Figures Face Database (Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar) [Before 28/12/19]
- QMUL-SurvFace - A large-scale face recognition benchmark dedicated for real-world surveillance face analysis and matching. (QMUL Computer Vision Group) [Before 28/12/19]
- RaFD Radboud Faces Database - RaFD is a set of pictures of 67 models (including Caucasian males and females, Caucasian children, both boys and girls, and Moroccan Dutch males) displaying 8 emotional expressions. (Langner, O., Dotsch, R., Bijlstra, G., et al.) [30/07/25]
- RAF-DB: Real-world Affective Faces Database - A large-scale facial expression database consisting of 30K in-the-wild facial images with accurately estimated expression labels (Li, Deng) [28/12/2020]
- RAF-ML: Real-world Affective Faces Multi-Label - A large-scale facial expression database consisting of 5K in-the-wild facial images with accurately estimated blended expression labels (Li, Deng) [28/12/2020]
- RAVDESS - The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced consisting of 24 professional actors, vocalizing lexically-matched statements in a neutral North American accent. (S. Livingstone, F. Russo) [24/07/25]
- Re-labeled Faces in the Wild - original images, but aligned using the "deep funneling" method. (University of Massachusetts, Amherst) [Before 28/12/19]
- RFW: Racial Face in-the-Wild - A large-scale benchmark database designed to evaluate the fairness of face-recognition models under unconstrained conditions (Wang, Zhang, Deng) [28/12/2020]
- RT-GENE: Real-Time Eye Gaze Estimation in Natural Environments - 122,531 images with the subjects' ground truth eye gaze and head pose labels under free-viewing conditions and large camera-subject distances (Fischer, Chang, Demiris, Imperial College London) [Before 28/12/19]
- S3DFM - Edinburgh Speech-driven 3D Facial Motion Database. 77 people with 10 repetitions of speaking a passphrase: 1 second of 500 frames-per-second 600x600 pixel {IR intensity video, registered depth images} plus synchronized 44.1 kHz audio. There are an additional 26 people (10 repetitions) moving their heads while speaking (Zhang, Fisher) [Before 28/12/19]
- Salient features in gaze-aligned recordings of human visual input - TB of human gaze-contingent data "in the wild" (Frank Schumann et al.) [Before 28/12/19]
- SAMM Dataset of Micro-Facial Movements - The dataset contains 159 spontaneous micro-facial movements obtained from 32 participants from 13 different ethnicities. (A. Davison, C. Lansley, N. Costen, K. Tan, M. H. Yap) [Before 28/12/19]
- SBVPI (Sclera Blood Vessels, Periocular, and Iris dataset) - Nearly 2,000 high-quality eye images with full and partial segmentation masks of the sclera, iris, pupil, and more (P. Rot) [10/07/25]
- Skin Segmentation - The Skin Segmentation dataset is constructed over the B, G, R color space. The skin and non-skin samples are generated using skin textures from face images of people of diverse age, gender, and race. (R. Bhatt, A. Dhall) [24/07/25]
- SLLFW: Similar-looking Labeled Faces in-the-Wild - A large-scale benchmark database designed to evaluate the accuracy of face-recognition models on discriminating similar-looking faces (Zhang, Deng) [28/12/2020]
- SCface - Surveillance Cameras Face Database (Mislav Grgic, Kresimir Delac, Sonja Grgic, Bozidar Klimpak) [Before 28/12/19]
- SiblingsDB - The SiblingsDB contains two datasets depicting images of individuals related by sibling relationships. (Politecnico di Torino/Computer Graphics & Vision Group) [Before 28/12/19]
- SoF dataset - 42,592 face images with glasses under different illumination conditions; provided with face region, facial landmarks, facial expression, subject ID, gender, and age information (Afifi, Abdelhamed) [29/12/19]
- Solving the Robot-World Hand-Eye(s) Calibration Problem with Iterative Methods - These datasets were generated for calibrating robot-camera systems. (Amy Tabb) [Before 28/12/19]
- Spontaneous Emotion Multimodal Database (SEM-db) - non-posed reactions to visual stimulus data recorded with HD RGB, depth and IR frames of the face, EEG signal and eye gaze data (Fernandez Montenegro, Gkelias, Argyriou) [Before 28/12/19]
- The Belfast Induced Natural Emotion Database - Video clips of natural emotion. (I. Sneddon, M. McRorie, G. Mckeown, et al.) [30/07/25]
- The FAce Semantic SEGmentation repository - The FASSEG repository comprises two datasets (frontal01 and frontal02) for frontal face segmentation, and one dataset (multipose01) with labeled faces in multiple poses. (K. Khan, M. Mauro et al.) [05/06/25]
- The MUG Facial Expression Database - The database consists of image sequences of 86 subjects performing facial expressions. The subjects were sitting in a chair in front of one camera. The background was a blue screen. (Multimedia Understanding Group) [30/07/25]
- UNBC-McMaster Shoulder Pain Expression Archive Database - Painful data: The UNBC-McMaster Shoulder Pain Expression Archive Database (Lucey et al.) [Before 28/12/19]
- VOCASET - 4D face dataset with about 29 minutes of 3D head scans captured at 60 fps and synchronized audio from 12 speakers (Cudeiro, Bolkart, Laidlaw, Ranjan, Black) [Before 28/12/19]
- Trondheim Kinect RGB-D Person Re-identification Dataset (Igor Barros Barbosa) [Before 28/12/19]
- UB KinFace Database - University of Buffalo kinship verification and recognition database [Before 28/12/19]
- UBIRIS: Noisy Visible Wavelength Iris Image Databases (University of Beira) [Before 28/12/19]
- UMDFaces - About 3.7 million annotated video frames from 22,000 videos and 370,000 annotated still images. (Ankan Bansal et al.) [Before 28/12/19]
- UoY 3D Face Dataset - The UoY 3D face dataset is a set of 3D images of the human face and consists of around 5000 3D images of approximately 350 people (15 models each). (University of York) [24/07/25]
- UPNA Head Pose Database - head pose database, with 120 webcam videos containing guided-movement sequences and free-movement sequences, including ground-truth head pose and automatically annotated 2D facial points. (Ariz, Bengoechea, Villanueva, Cabeza) [Before 28/12/19]
- UPNA Synthetic Head Pose Database - a synthetic replica of the UPNA Head Pose Database, with 120 videos with their 2D ground truth landmarks projections, their corresponding head pose ground truth, 3D head models and camera parameters. (Larumbe, Segura, Ariz, Bengoechea, Villanueva, Cabeza) [Before 28/12/19]
- UTIRIS cross-spectral iris image databank (Mahdi Hosseini) [Before 28/12/19]
- UvA-NEMO Smile Database - 1240 smile videos (597 spontaneous and 643 posed) from 400 subjects, including age, gender, and kinship annotations (Gevers, Dibeklioglu, Salah) [Before 28/12/19]
- VATE - Video-Audio-Text for Affective Evaluation Dataset. 21,871 raw videos along with corresponding audio recordings, and text transcriptions from emotion-inducing interviews. It is designed for self-supervised representation learning of human affective states. (Agnelli and Lanzarotti) [7/10/24]
- VGGFace2 - VGGFace2 is a large-scale face recognition dataset covering large variations in pose, age, illumination, ethnicity and profession. (Oxford Visual Geometry Group) [Before 28/12/19]
- VIPSL Database - VIPSL Database is for research on face sketch-photo synthesis and recognition, including 200 subjects (1 photo and 5 sketches per subject). (Nannan Wang) [Before 28/12/19]
- Visual Search Zero Shot Database - Collection of human eyetracking data in three increasingly complex visual search tasks: object arrays, natural images and Waldo images. (Kreiman lab) [Before 28/12/19]
- VT-KFER: A Kinect-based RGBD+Time Dataset for Spontaneous and Non-Spontaneous Facial Expression Recognition - 32 subjects, 1,956 sequences of RGBD, six facial expressions in 3 poses (Aly, Trubanova, Abbott, White, and Youssef) [Before 28/12/19]
- Washington Facial Expression Database (FERG-DB) - a database of 6 stylized (Maya) characters with 7 annotated facial expressions (Deepali Aneja, Alex Colburn, Gary Faigin, Linda Shapiro, and Barbara Mones) [Before 28/12/19]
- WebCaricature Dataset - The WebCaricature dataset is a large photograph-caricature dataset consisting of 6042 caricatures and 5974 photographs from 252 persons collected from the web. (Jing Huo, Wenbin Li, Yinghuan Shi, Yang Gao and Hujun Yin) [Before 28/12/19]
- WIDER FACE: A Face Detection Benchmark - 32,203 images with 393,703 labeled faces, 61 event classes (Shuo Yang, Ping Luo, Chen Change Loy, Xiaoou Tang) [Before 28/12/19]
- Wider-360 - Datasets for face and object detection in fisheye images (Fu, Bajic, and Vaughan) [29/12/19]
- XM2VTS Face video sequences (295): The extended M2VTS Database (XM2VTS) - (Surrey University) [Before 28/12/19]
- Yale Face Database - 11 expressions of 10 people (A. Georghiades) [Before 28/12/19]
- Yale Face Database B - 576 viewing conditions of 10 people (A. Georghiades) [Before 28/12/19]
- York 3D Ear Dataset - The York 3D Ear Dataset is a set of 500 3D ear images, synthesized from detailed 2D landmarking, and available in both Matlab format (.mat) and PLY format (.ply). (Nick Pears, Hang Dai, Will Smith, University of York) [Before 28/12/19]
- York Univ Eye Tracking Dataset (120 images) (Neil Bruce) [Before 28/12/19]
- YouTube Face database - supporting face verification and open/closed-set identification (Yoanna Martinez-Diaz, Heydi Mendez-Vazquez, Leyanis Lopez-Avila, Leonardo Chang, L. Enrique Sucar, Massimo Tistarelli) [1/2/21]
- YouTube Faces DB - 3,425 videos of 1,595 different people. (Wolf, Hassner, Maoz) [Before 28/12/19]
- Zurich Natural Image - the image material used for creating natural stimuli in a series of eye-tracking studies (Frey et al.) [Before 28/12/19]
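Several of the verification-oriented sets above can be scripted against directly; LFW (flagged above with a pointer to this sketch) even ships with a loader in scikit-learn. A minimal sketch, assuming scikit-learn is installed and that the funneled, grey-level LFW copy it downloads is adequate for prototyping; the other datasets in this list generally need to be downloaded from their project pages:

    # Minimal sketch: fetch LFW verification pairs via scikit-learn.
    # Assumption: the funneled, grey-level copy that scikit-learn downloads
    # is acceptable for prototyping.
    from sklearn.datasets import fetch_lfw_pairs

    lfw = fetch_lfw_pairs(subset="train", resize=0.5)  # downloads on first use
    pairs = lfw.pairs    # shape (n_pairs, 2, height, width)
    labels = lfw.target  # 1 = same person, 0 = different people
    print(pairs.shape, labels[:10])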
Fingerprints and Other Non-Face/Eye Biometric Datasets
- Biometrics Evaluation and Testing - Evaluation of identification technologies, including Biometrics (European computing e-infrastructure) [Before 28/12/19]
- CRL-Person - first continual learning benchmark for biometric identification which contains 90k images of 7k identities (Zhao, Tang, Chen, Bilen, Zhao) [30/12/2020]
- Extensive collation of fingerprint datasets - (Robert Vazan) [5/10/2022]
- FVC fingerprint verification competition 2002 dataset (University of Bologna) [Before 28/12/19]
- FVC fingerprint verification competition 2004 dataset (University of Bologna) [Before 28/12/19]
- Fingerprint Manual Minutiae Marker (FM3) Databases (Mehmet Kayaoglu, Berkay Topcu and Umut Uludag) [Before 28/12/19]
- L3-SF - Level Three Synthetic Fingerprint Generation - A public database of L3 synthetic fingerprint images with five subsets of 148 identities, with 10 samples per identity, totaling 7400 fingerprint images, including sweat pore annotations for 740 images to assist in pore detection research. (Andre Brasil Vieira Wyzykowski and Mauricio Pamplona Segundo and Rubisley de Paula Lemes) [1/2/21]
- NIST fingerprint databases (USA National Institute of Standards and Technology) [Before 28/12/19]
- OU-ISIR Gait Database - six video-based gait data sets, two inertial sensor-based gait datasets, and a gait-relevant biometric score data set. (Yasushi Makihara) [Before 28/12/19]
- PLUS-MSL-FP - The PLUS Multi-Sensor and Longitudinal Fingerprint Dataset. With this work a new fingerprint dataset (108,106 samples) is introduced which can be used for inter-session (longitudinal) as well as inter-sensor fingerprint investigations (Kirchgasser, Kauba, Uhl) [24/06/25]
- PLUS-SynFP - A collection of synthetic fingerprints (Sollinger, Kirchgasser, Makrushin, Dittmann, Uhl) [24/06/25]
- PLUSVein-Contactless - Contactless Finger and Hand Vein Data Set. The data set itself consists of 3 subsets: a palmar finger vein one, acquired using the light transmission illuminator and two hand vein ones, one acquired with the 850 nm reflected light illuminator and the other one with the 950 nm reflected light illuminator. (Prommegger, Uhl) [24/06/25]
- PLUSVein-FR - Finger Rotation Data Set. The data set contains finger images captured from 63 different subjects, 4 fingers (right and left index and middle finger, respectively) per subject, which sums up to a total of 252 unique fingers. (Prommegger, Uhl) [24/06/25]
- PLUSVein-FV3 - Finger Vein Data Set. The finger-vein data set itself consists of 4 subsets: one dorsal and one palmar finger-vein subset captured using transillumination with the LED and the laser module based scanner, respectively. (Prommegger, Uhl) [24/06/25]
- PTB-XL: Large Public ECG Dataset - PTB-XL is a comprehensive, multi-label 12-lead electrocardiogram (ECG) dataset with 21,837 records from 18,885 patients (10s each), annotated by cardiologists using SCP-ECG diagnostic, form, and rhythm statements (a loading sketch appears after this list). (P. Wagner, N. Strodthoff, R. D. Bousseljot, W. Samek, T. Schaeffter, et al.) [11/07/2025]
- SPD2010 Fingerprint Singular Points Detection Competition (SPD 2010 committee) [Before 28/12/19]
- Transient Biometrics Nails Dataset V01 (Igor Barros Barbosa) [Before 28/12/19]
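PTB-XL above (flagged with a pointer to this sketch) is distributed in WFDB format; a minimal sketch of reading one record, assuming the wfdb Python package is installed, the dataset has been downloaded locally, and using a hypothetical record path:

    # Minimal sketch: read one PTB-XL record with the wfdb package.
    # Assumptions: wfdb is installed, PTB-XL is downloaded locally,
    # and the record path below is hypothetical.
    import wfdb

    signals, fields = wfdb.rdsamp("ptbxl/records100/00000/00001_lr")
    print(signals.shape)       # e.g. (1000, 12): 10 s at 100 Hz, 12 leads
    print(fields["sig_name"])  # lead names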
General Images
- 8KDehaze - a large-scale ultra-high-resolution image dehazing dataset, consisting of diverse hazy/clear image pairs for evaluating and training haze removal algorithms. (Chen, Yan, et al) [24/06/25]
- A Dataset for Real Low-Light Image Noise Reduction - It contains pixel and intensity aligned pairs of images corrupted by low-light camera noise and their low-noise counterparts. (J. Anaya, A. Barbu) [Before 28/12/19]
- A database of paintings related to Vincent van Gogh - This is the dataset VGDB-2016 built for the paper "From Impressionism to Expressionism: Automatically Identifying Van Gogh's Paintings" (Guilherme Folego and Otavio Gomes and Anderson Rocha) [Before 28/12/19]
- AeroRIT - Hyperspectral semantic segmentation dataset (Rangnekar, Mokashi, Ientilucci, Kanan, Hoffman) [26/12/2020]
- AMOS: Archive of Many Outdoor Scenes (20+m) (Nathan Jacobs) [Before 28/12/19]
- Aerial images - Building detection from aerial images using invariant color features and shadow information. (Beril Sirmacek) [Before 28/12/19]
- Approximated overlap error dataset - Image pairs with sparse sets of ground-truth matches for evaluating local image descriptors (Fabio Bellavia) [Before 28/12/19]
- AutoDA (Automatic Dataset Augmentation) - An automatically constructed image dataset including 12.5 million images with relevant textual information for the 1000 categories of ILSVRC2012 (Bai, Yang, Ma, Zhao) [Before 28/12/19]
- BAPPS - BAPPS (Berkeley Adobe Perceptual Patch Similarity) is a perceptual similarity dataset containing 325k pairs of image patches spanning traditional distortions and CNN-based perturbations. (R. Zhang, P. Isola, A. Efros et al.) [31/07/25]
- BGU Hyperspectral Image Database of Natural Scenes (Ohad Ben-Shahar and Boaz Arad) [Before 28/12/19]
- Brown Univ Large Binary Image Database (Ben Kimia) [Before 28/12/19]
- Butterfly-200 - Butterfly-200 is an image dataset for fine-grained image classification, which contains 25,279 images and covers four category levels: 200 species, 116 genera, 23 subfamilies, and 5 families. (Tianshui Chen) [Before 28/12/19]
- CADB: Image Composition Assessment DataBase - 9497 real-world images, each image has composition quality scores annotated by five professional raters. Useful for image composition assessment, image aesthetic assessment, etc. (Bo Zhang, Li Niu, Liqing Zhang) [23/11/21]
- CIFAR-10 classes with different WB settings - 15,098 rendered images that reflect real in-camera white-balance settings (Afifi, Brown) [29/12/19]
- CIFAR-100 - 100 classes containing 600 images each, also grouped into 20 superclasses (a loading sketch appears after this list) (Alex Krizhevsky) [1/6/20]
- CLEVR - CLEVR is a diagnostic benchmark of 100,000 synthetic 3D-rendered images paired with nearly 1 million automatically generated reasoning questions. (J. Johnson, B. Hariharan, L. van der Maaten et al.) [31/07/25]
- CINIC10 - Widely used image classification dataset bridging CIFAR and ImageNet (Darlow L. N., Crowley E. J., Antoniou A. et al.) [10/07/25]
- CMP Facade Database - Includes 606 rectified images of facades from various places with 12 architectural classes annotated. (Radim Tylecek) [Before 28/12/19]
- Caltech-UCSD Birds-200-2011 (Catherine Wah) [Before 28/12/19]
- Color correction dataset - Homography-based registered images for evaluating color correction algorithms for image stitching. (Fabio Bellavia) [Before 28/12/19]
- Columbia Multispectral Image Database (F. Yasuma, T. Mitsunaga, D. Iso, and S.K. Nayar) [Before 28/12/19]
- DAQUAR (Visual Turing Challenge) - A dataset containing questions and answers about real-world indoor scenes. (Mateusz Malinowski, Mario Fritz) [Before 28/12/19]
- Darmstadt Noise Dataset - 50 pairs of real noisy images and corresponding ground truth images (RAW and sRGB) (Tobias Plotz and Stefan Roth) [Before 28/12/19]
- Dataset of American Movie Trailers 2010-2014 - Contains links to 474 Hollywood movie trailers along with associated metadata (genre, budget, runtime, release, MPAA rating, screens released, sequel indicator) (USC Signal Analysis and Interpretation Lab) [Before 28/12/19]
- DAVANet: Stereo Deblurring with View Aggregation - a large-scale multi-scene dataset for stereo deblurring in dynamic scenes (both indoor and outdoor). It contains 20,637 blurry-sharp stereo images from 135 diverse video clips (480 fps). (Zhou, Zhang, Zuo, Xie, Pan, Ren) [27/12/2020]
- DIML Multimodal Benchmark - 100 images of 1200 x 800 size for evaluating matching performance under photometric and geometric variations. (Yonsei University) [Before 28/12/19]
- DSLR Photo Enhancement Dataset (DPED) - 22K photos taken synchronously in the wild by three smartphones and one DSLR camera, useful for comparing inferred high quality images from multiple low quality images (Ignatov, Kobyshev, Timofte, Vanhoey, and Van Gool) [Before 28/12/19]
- Flickr-style - 80K Flickr photographs annotated with 20 curated style labels, and 85K paintings annotated with 25 style/genre labels (Sergey Karayev) [Before 28/12/19]
- Flickr1024: A Dataset for Stereo Image Super-resolution - 1024 high-quality image pairs covering diverse scenarios (Wang, Wang, Yang, An, Guo) [Before 28/12/19]
- FlickrUser Dataset - A large-scale dataset, featuring approximately 500,000 Flickr images across ~2,500 users, annotated with user metadata, image metadata, perceptual features, connotative features, as well as the presented "Commonly Interestingness (CI)" score. (F. Abdullahu, H. Grabner) [22/06/25]
- Forth Multispectral Imaging Datasets - images from 23 spectral bands each from 5 paintings. Images are annotated with ground truth data. (Karamaoynas Polykarpos et al) [Before 28/12/19]
- GDXray: xray images for nondestructive testing - 19,407 images in 167 series. (Mery, Riffo, Zscherpel, Mondragon, Lillo, Zuccar, Lobel, Carrasco) [17/10/24]
- General 100 Dataset - The General-100 dataset contains 100 bmp-format images (with no compression), which are well-suited for super-resolution training (Chao Dong, Chen Change Loy, Xiaoou Tang) [Before 28/12/19]
- GOPRO dataset - Blurred image dataset with sharp image ground truth (Nah, Kim, and Lee) [Before 28/12/19]
- HDRT: A Large-Scale Dataset for Infrared-Guided HDR Imaging - The HDRT dataset includes 50K aligned IR, SDR, and HDR images to facilitate research in multi-modal fusion, HDR imaging, and any other related research. (J. Peng, T. Bashford-Rogers, F. Banterle, et al) [22/06/25]
- HIPR2 Image Catalogue of different types of images (Bob Fisher et al) [Before 28/12/19]
- HPatches - A benchmark and evaluation of handcrafted and learned local descriptors (Balntas, Lenc, Vedaldi, Mikolajczyk) [Before 28/12/19]
- Hyperspectral images for spatial distributions of local illumination in natural scenes - Thirty calibrated hyperspectral radiance images of natural scenes with probe spheres embedded for local illumination estimation. (Nascimento, Amano & Foster) [Before 28/12/19]
- Hyperspectral images of natural scenes - 2002 (David H. Foster) [Before 28/12/19]
- Hyperspectral images of natural scenes - 2004 (David H. Foster) [Before 28/12/19]
- ISPRS multi-platform photogrammetry dataset - 1: Nadir and oblique aerial images plus 2: Combined UAV and terrestrial images (Francesco Nex and Markus Gerke) [Before 28/12/19]
- Image & Video Quality Assessment at LIVE - used to develop picture quality algorithms (the University of Texas at Austin) [Before 28/12/19]
- ImageNet-Bg - A modified version of the ImageNet test set (50,000 images) with class-relevant content removed, designed to evaluate model robustness against background noise as out-of-distribution samples. (Z. Xu, X. Xiang, Y. Liang) [10/07/25]
- ImageNet Large Scale Visual Recognition Challenges - Currently 200 object classes and 500+K images (Alex Berg, Jia Deng, Fei-Fei Li and others) [Before 28/12/19]
- ImageNet Linguistically organised (WordNet) Hierarchical Image Database - 10E7 images, 15K categories (Li Fei-Fei, Jia Deng, Hao Su, Kai Li) [Before 28/12/19]
- Improved 3D Sparse Maps for High-performance Structure from Motion with Low-cost Omnidirectional Robots - Evaluation Dataset - Data set used in research paper doi:10.1109/ICIP.2015.7351744 (Breckon, Toby P., Cavestany, Pedro) [Before 28/12/19]
- Intel Image Classification - The Intel Image Classification dataset, initially compiled by Intel, contains approximately 25,000 images of natural scenes from around the world. The images are divided into categories such as mountains, glaciers, seas, forests, buildings, and streets. (Intel) [28/07/25]
- Konstanz visual quality databases - Large-scale image and video databases for the development and evaluation of visual quality assessment algorithms. (MMSP group, University of Konstanz) [Before 28/12/19]
- Kodak McMaster demosaic dataset - (Zhang, Wu, Buades, Li) [Before 28/12/19]
- LabelMe - Image dataset featuring 187,240 images, 62,197 previously-annotated images across 658,992 labeled objects. (MIT CSAIL) [03/06/25]
- LabelMeFacade Database - 945 labeled building images (Erik Rodner et al) [Before 28/12/19]
- Local illumination hyperspectral radiance images - Thirty hyperspectral radiance images of natural scenes with embedded probe spheres for local illumination estimates (Sergio M. C. Nascimento, Kinjiro Amano, David H. Foster) [Before 28/12/19]
- McGill Calibrated Colour Image Database (Adriana Olmos and Fred Kingdom) [Before 28/12/19]
- Multiply Distorted Image Database - a database for evaluating the results of image quality assessment metrics on multiply distorted images. (Fei Zhou) [Before 28/12/19]
- NAS-Bench-201 - An algorithm-agnostic nas benchmark with detailed information (training/validation/test loss/accuracy etc) of 15,625 architectures on three datasets (Xuanyi Dong) [28/12/2020]
- NATS-Bench - An architecture dataset with the information of 15,625 neural cell candidates for architecture topology and 32,768 for architecture size on CIFAR-10/100 and ImageNet-16-120. (Xuanyi Dong) [30/12/2020]
- NPRgeneral - A standardized collection of images for evaluating image stylization algorithms. (David Mould, Paul Rosin) [Before 28/12/19]
- NPRportrait 1.0 Benchmark - three-level benchmark for non-photorealistic rendering of portraits (Rosin, Mould) [13/8/25]
- nuTonomy scenes dataset (nuScenes) - The nuScenes dataset is a large-scale autonomous driving dataset. It features: Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, two diverse cities: Boston and Singapore, left versus right hand traffic, detailed map information, manual annotations for 25 object classes, 1.1M 3D bounding boxes annotated at 2Hz, attributes such as visibility, activity and pose. (Caesar et al) [Before 28/12/19]
- NYU Symmetry Database - 176 single-symmetry and 63 multiple-symmetry images (Marcelo Cicconet and Davi Geiger) [Before 28/12/19]
- Open Images - A dataset consisting of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories (I. Krasin, T. Duerig) [05/06/25]
- Open Images Dataset V7 and Extensions - many images with bounding boxes and image level and object classes plus relationships (Krasin, Duerig, Alldrin, Ferrari, et al) [1/06/25]
- OTCBVS Thermal Imagery Benchmark Dataset Collection (Ohio State Team) [Before 28/12/19]
- PAnorama Sparsely STructured Areas Datasets - the PASSTA datasets used for evaluation of the image alignment (Andreas Robinson) [Before 28/12/19]
- PASS - 1.4m images that do not depict humans, human parts or other identifiable information, useful for self-supervised pretraining (Yuki Asano) [06/10/21]
- Photographic Defect Dataset - 12,853 photos from Flickr with one of the 3 levels of defect severity annotations: Severe, Mild, None (Yu, Shen, Lin, Mah, Barnes) [27/12/2020]
- QMUL-OpenLogo - A logo detection benchmark for testing the model generalisation capability in detecting a variety of logo objects in natural scenes with the majority logo classes unlabelled. (QMUL Computer Vision Group) [Before 28/12/19]
- RELLISUR: A Real Low-Light Image Super-Resolution Dataset - A dataset of real low-light low-resolution and normal-light high-resolution image pairs (Aakerberg, Nasrollahi, Moeslund) [3/12/2022]
- RealSR - RealSR is a real-world dataset for single image super-resolution, containing image pairs captured using DSLR cameras at different focal lengths to simulate realistic high- and low-resolution conditions. (J. Cai, H. Zeng, H. Yong et al.) [31/07/25]
- ReLoBlur - a real-world local motion deblurring dataset (Haoying Li) [1/8/245]
- RESIDE (Realistic Single Image DEhazing) - The current largest-scale benchmark consisting of both synthetic and real-world hazy images, for image dehazing research. RESIDE highlights diverse data sources and image contents, and serves various training or evaluation purposes. (Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, Zhangyang Wang) [Before 28/12/19]
- Rijksmuseum Challenge 2014 - It consists of 100K art objects from the Rijksmuseum and comes with extensive XML files describing each object. (Thomas Mensink and Jan van Gemert) [Before 28/12/19]
- See in the Dark - 77 Gb of dark images (Chen, Chen, Xu, and Koltun) [Before 28/12/19]
- Smartphone Image Denoising Dataset (SIDD) - The Smartphone Image Denoising Dataset (SIDD) consists of about 30,000 noisy images with corresponding high-quality ground truth in both raw-RGB and sRGB spaces obtained from 10 scenes with different lighting conditions using five representative smartphone cameras. (Abdelrahman Abdelhamed, Stephen Lin, Michael S. Brown) [Before 28/12/19]
- Rendered WB dataset - 100,000+ rendered sRGB images with different white balance (WB) settings (Afifi, Price, Cohen, Brown) [29/12/19]
- Spatially Variant Super-Resolution (SVSR) benchmarking dataset - 1119 real low-res images with complex noise of varying intensity and type, and their corresponding real noise free X2 and X4 high-res counterparts, for evaluation of the robustness of real-world super-resolution methods. (A. Aakerberg, M. Helou, K. Nasrollahi, T. Moeslund) [22/06/25]
- Spring - A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo, with about 6000 matched image sets for each based on rendered scenes from the open-source Blender movie "Spring". (Mehl, Schmalfuss, Jahedi, Nalivayko, Bruhn) [5/12/2023]
- Stanford Street View Image, Pose, and 3D Cities Dataset - a large scale dataset of street view images (25 million images and 118 matching image pairs) with their relative camera pose, 3D models of cities, and 3D metadata of images. (Zamir, Wekel, Agrawal, Malik, Savarese) [Before 28/12/19]
- Sushi-50 - A dataset contains 50 fine-grained classes of sushi (Jianing Qiu et al) [28/12/2020]
- TESTIMAGES - Huge and free collection of sample images designed for analysis and quality assessment of different kinds of displays (i.e. monitors, televisions and digital cinema projectors) and image processing techniques. (Nicola Asuni) [Before 28/12/19]
- Time-Lapse Hyperspectral Radiance Images of Natural Scenes - Four time-lapse sequences of 7-9 calibrated hyperspectral radiance images of natural scenes taken over the day. (Foster, D.H., Amano, K., & Nascimento, S.M.C.) [Before 28/12/19]
- Time-lapse hyperspectral radiance images - Four time-lapse sequences of 7-9 calibrated hyperspectral images of natural scenes, spectra at 10-nm intervals (David H. Foster, Kinjiro Amano, Sergio M. C. Nascimento) [Before 28/12/19]
- Tiny Images Dataset 79 million 32x32 color images (Fergus, Torralba, Freeman) [Before 28/12/19]
- TURBID Dataset - five different subsets of degraded images with its respective ground-truth. Subsets Milk and DeepBlue have 20 images each and the subset Chlorophyll has 42 images (Amanda Duarte) [Before 28/12/19]
- USEE DB - University of Salzburg Encryption Evaluation Database. A Database of Encrypted Images with Subjective Recognition Ground Truth (Hofbauer, Autrusseau, Uhl) [24/06/25]
- USEEQ DB - University of Salzburg Encryption Evaluation Quality Database. Database of mean observer quality scores for low quality images. (Hofbauer, Autrusseau, Uhl) [24/06/25]
- Urban100 - Urban100 is a dataset for benchmarking image super-resolution methods, consisting of 100 high-resolution urban images with complex structural patterns such as buildings and signs. (H. Raone) [31/07/25]
- UT Snap Angle 360˚ Dataset - A list of 360˚ videos of four activities (disney, parade, ski, concert) from YouTube (Kristen Grauman, UT Austin) [Before 28/12/19]
- UT Snap Point Dataset - Human judgement on snap point quality of a subset of frames from UT Egocentric dataset and a newly collected mobile robot dataset (frames are also included) (Bo Xiong, Kristen Grauman, UT Austin) [Before 28/12/19]
- UVA Intrinsic Images and Semantic Segmentation Dataset - RGB dataset with ground-truth albedo, shading, and semantic annotations (TrimBot2020 consortium) [26/2/20]
- VIREDA: A new VIdeo REal-world dataset for the comparison of Defogging Algorithms - Image and video datasets containing foggy scenes and associated ground truth with different density levels and illumination conditions (A. Duminil, J. P. Tarel, R. Bremond) [10/07/25]
- Visual Dialog - 120k human-human dialogs on COCO images, 10 rounds of QA per dialog (Das, Kottur, Gupta, Singh, Yadav, Moura, Parikh, Batra) [Before 28/12/19]
- Visual Question Answering - 254K images, 764K questions, ground truth (Agrawal, Lu, Antol, Mitchell, Zitnick, Batra, Parikh) [Before 28/12/19]
- Visual Question Generation - 15k images (including both object-centric and event-centric images), 75k natural questions asked about the images which can evoke further conversation (Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, Lucy Vanderwende) [Before 28/12/19]
- VQA Human Attention - 60k human attention maps for visual question answering i.e. where humans choose to look to answer questions about images (Das, Agrawal, Zitnick, Parikh, Batra) [Before 28/12/19]
- Wild Web tampered image dataset - A large collection of tampered images from Web and social media sources, including ground-truth annotation masks for tampering localization (Markos Zampoglou, Symeon Papadopoulos) [Before 28/12/19]
- YFCC100M: The New Data in Multimedia Research - This publicly available curated dataset of 100 million photos and videos is free and legal for all. (Bart Thomee, Yahoo Labs and Flickr in San Francisco, et al.) [Before 28/12/19]
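Several of the classification benchmarks above have ready-made loaders; CIFAR-100 (flagged above with a pointer to this sketch) can be pulled through torchvision. A minimal sketch, assuming torchvision is installed; note that this loader exposes only the 100 fine-grained labels, while the 20 superclasses live in the raw files:

    # Minimal sketch: load CIFAR-100 via torchvision.
    # Assumption: torchvision is installed; only the 100 fine labels are
    # returned here -- the 20 superclasses are stored in the raw files.
    from torchvision import datasets, transforms

    train = datasets.CIFAR100(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
    image, label = train[0]
    print(image.shape, train.classes[label])  # torch.Size([3, 32, 32]), class name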
General RGBD, 3D Point Cloud, and Depth Datasets
Note: there are 3D datasets elsewhere as well, e.g. in
Objects, Scenes, and Actions.
See also: List of RGBD datasets.
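Many of the RGB-D sets below pair a color image with a per-pixel depth map; a minimal sketch of back-projecting such a depth map to a 3D point cloud, assuming a pinhole camera model, depth stored in millimetres, and made-up Kinect-like intrinsics (use the calibration shipped with each dataset):

    # Minimal sketch: back-project a depth map to a point cloud.
    # Assumptions: pinhole model, depth in millimetres, and hypothetical
    # intrinsics -- real values come with each dataset's calibration files.
    import numpy as np

    def depth_to_points(depth_mm, fx, fy, cx, cy):
        # Convert an HxW depth map (mm) to an Nx3 point cloud (metres).
        h, w = depth_mm.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth_mm.astype(np.float64) / 1000.0   # mm -> m
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]            # drop zero/invalid depths

    cloud = depth_to_points(np.random.randint(500, 4000, (480, 640)),
                            fx=525.0, fy=525.0, cx=319.5, cy=239.5)
    print(cloud.shape)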
- 3D-FRONT 3D-FRONT is a large-scale synthetic indoor scene dataset containing professionally designed house layouts and richly textured furniture, encompassing nearly 19,000 rooms across thousands of distinct houses. (H. Fu, B. Cai, L. Gao et al.) [31/07/25]
- 3D60: 3D Vision Indoor Spherical Panoramas - A multimodal dataset of 360 spherical panoramas containing paired color images, depth and normal maps, as well as vertical and horizontal stereo pairs (with their assorted depth and normal maps as well) that can be used to train or evaluate a variety of 3D vision tasks. (Nikolaos Zioulis, Antonis Karakottas, Dimitrios Zarpalas, Petros Daras) [Before 28/12/19]
- 3D-Printed RGB-D Object Dataset - 5 objects with groundtruth CAD models and camera trajectories, recorded with various quality RGB-D sensors. (Siemens & TUM) [Before 28/12/19]
- 3DCOMET - 3DCOMET is a dataset for testing 3D data compression methods. (Miguel Cazorla, Javier Navarrete, Vicente Morell, Diego Viejo, Jose Garcia-Rodriguez, Sergio Orts) [Before 28/12/19]
- 3D articulated body - 3D reconstruction of an articulated body with rotation and translation. Single camera, varying focal length. Every scene may have an articulated body moving. There are four kinds of data sets included. A sample reconstruction result, which uses only four images of the scene, is included. (Prof Jihun Park) [Before 28/12/19]
- A Dataset for Non-Rigid Reconstruction from RGB-D Data - Eight scenes for reconstructing non-rigid geometry from RGB-D data, each containing several hundred frames along with our results. (Matthias Innmann, Michael Zollhoefer, Matthias Niessner, Christian Theobalt, Marc Stamminger) [Before 28/12/19]
- A Large Dataset of Object Scans - 392 objects in 9 classes, hundreds of frames each (Choi, Zhou, Miller, Koltun) [Before 28/12/19]
- Aria-Digital-Twin (ADT) ADT is an egocentric dataset captured via Aria glasses, comprising 236 real activity sequences in two fully digitized indoor spaces. (X. Pan, N. Charron, Y. Yang et al.) [31/07/25]
- Articulated Object Challenge - 4 articulated objects consisting of rigid parts connected by 1D revolute and prismatic joints, 7000+ RGBD images with annotations for 6D pose estimation (Frank Michel, Alexander Krull, Eric Brachmann, Michael Y. Yang, Stefan Gumhold, Carsten Rother) [Before 28/12/19]
- BigBIRD - 100 objects, each with 600 3D point clouds and 600 high-resolution color images spanning all views (Singh, Sha, Narayan, Achim, Abbeel) [Before 28/12/19]
- CAESAR Civilian American and European Surface Anthropometry Resource Project - 4000 3D human body scans (SAE International) [Before 28/12/19]
- CIN 2D+3D object classification dataset - segmented color and depth images of objects from 18 categories of common household and office objects (Bjorn Browatzki et al) [Before 28/12/19]
- CoRBS - an RGB-D SLAM benchmark, providing the combination of real depth and color data together with a ground truth trajectory of the camera and a ground truth 3D model of the scene (Oliver Wasenmuller) [Before 28/12/19]
- CSIRO synthetic deforming people - synthetic RGBD dataset for evaluating non-rigid 3D reconstruction: 2 subjects and 4 camera trajectories (Elanattil and Moghadam) [Before 28/12/19]
- CAD-Estate CAD-Estate is a large-scale RGB video dataset for 3D object annotation using CAD models. It contains ~101k CAD-aligned object instances (12k unique models) across ~20k real-estate videos, each placed in the 3D coordinate frame with full 9 DoF pose and room layout annotations. (K. K. Maninis, S. Popov, M. Niessner et al.) [31/07/25]
- Crowdbot (EPFL LASA) dataset This dataset captures pedestrian interactions from a mobile service robot (Qolo) navigating in crowds, recording over 250k frames (~200 minutes) with frontal and rear 3D LiDAR (Velodyne VLP-16 at 20 Hz). (D. Paez-Granados, Y. He, D. Gonon et al.) [31/07/25]
- CTU Garment Folding Photo Dataset - Color and depth images from various stages of garment folding. (Sushkov R., Melkumov I., Smutny V. (Czech Technical University in Prague)) [Before 28/12/19]
- CTU Garment Sorting Dataset - Dataset of garment images, detailed stereo images, depth images and weights. (Petrik V., Wagner L. (Czech Technical University in Prague)) [Before 28/12/19]
- Clothing part dataset - The clothing part dataset consists of image and depth scans, acquired with a Kinect, of garments laying on a table, with over a thousand part annotations (collar, cuffs, hood, etc) using polygonal masks. (Arnau Ramisa, Guillem Alenya, Francesc Moreno-Noguer and Carme Torras) [Before 28/12/19]
- Cornell-RGBD-Dataset - Office Scenes (Hema Koppula) [Before 28/12/19]
- CVSSP Dynamic RGBD Modelling - Eight RGBD sequences of general dynamic scenes captured using the Kinect V1/V2 as well as two synthetic sequences. Designed for non-rigid reconstruction. (C. Malleson, J. Guillemaut, A. Hilton) [30/07/25]
- CVSSP Dynamic RGBD Modelling 2015 - This dataset contains eight RGBD sequences of general dynamic scenes captured using the Kinect V1/V2 as well as two synthetic sequences. (Charles Malleson, CVSSP, University of Surrey) [Before 28/12/19]
- Deformable 3D Reconstruction Dataset - two single-stream RGB-D sequences of dynamically moving mechanical toys together with ground-truth 3D models in the canonical rest pose. (Siemens, TUM) [Before 28/12/19]
- Delft Windmill Interior and Exterior Laser Scanning Point Clouds (Beril Sirmacek) [Before 28/12/19]
- Diabetes60 - RGB-D images of 60 home-made western dishes. Data was recorded using a Microsoft Kinect V2. (Patrick Christ and Sebastian Schlecht) [Before 28/12/19]
- DiLiGenT-PI DiLiGenT-PI is a real-world photometric stereo benchmark focusing on near-planar surfaces with rich geometric details. It contains 30 objects fabricated with four material types (metallic, specular, rough, translucent), each captured under 100 calibrated directional lighting conditions, providing ground-truth normal maps measured via profilometry. (F. Wang, J. Ren, H. Guo et al.) [31/07/25]
- ETH3D - Benchmark for multi-view stereo and 3D reconstruction, covering a variety of indoor and outdoor scenes, with ground truth acquired by a high-precision laser scanner. (Thomas Schops, Johannes L. Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, Andreas Geiger) [Before 28/12/19]
- EURECOM Kinect Face Database - 52 people, 2 sessions, 9 variations, 6 facial landmarks. (Jean-Luc DUGELAY et al) [Before 28/12/19]
- FIORD: A Fisheye Indoor-Outdoor Dataset with LIDAR Ground Truth for 3D Scene Reconstruction and Benchmarking - FIORD features dual 200 degree fisheye images capturing 5 indoor and 5 outdoor scenes. (U. Gunes, M. Turkulainen, X. Ren, et al.) [10/07/2025]
- FoundationStereo Dataset (FSD) - A large-scale (1M) realistic synthetic dataset of stereo images and their ground-truth disparity maps. (B. Wen, M. Trepte, K. Aribido, et al) [22/06/25]
- G4S meta rooms - RGB-D data 150 sweeps with 18 images per sweep. (John Folkesson et al.) [Before 28/12/19]
- Georgiatech-Metz Symphony Lake Dataset - 5 million RGBD outdoor images over 4 years from 121 surveys of a lakeshore. (Griffith and Pradalier) [Before 28/12/19]
- GMU Kitchen Dataset - 9 video sequences captured from 4 different kitchens, each containing objects from the BigBIRD dataset. (G. Georgakis, M. Reza, A. Mousavian, et al.) [29/07/25]
- Goldfinch: GOogLe image-search Dataset for FINe grained CHallenges - a large-scale dataset for fine-grained bird (11K species), butterfly (14K species), aircraft (409 types), and dog (515 breeds) recognition. (Jonathan Krause, Benjamin Sapp, Andrew Howard, Howard Zhou, Alexander Toshev, Tom Duerig, James Philbin, Li Fei-Fei) [Before 28/12/19]
- H3DS H3DS is a high-resolution full-head 3D dataset comprising 60+ textured head scans and posed multi-view images with ground-truth masks and landmark alignments (10-70 views per subject). (E. Ramon, G. Triginer, J. Escur et al.) [31/07/25]
- HairNet Dataset HairNet Dataset comprises 40K synthetic 3D hair models paired with 160K rendered orientation-field images, used to train real-time dense hair reconstruction from 2D inputs. (L. Hu, Y. Zhou, J. Xing et al.) [31/07/25]
- Headspace dataset - The Headspace dataset is a set of 3D images of the full human head, consisting of 1519 subjects wearing tight fitting latex caps to reduce the effect of hairstyles. (Christian Duncan, Rachel Armstrong, Alder Hey Craniofacial Unit, Liverpool, UK) [Before 28/12/19]
- HM3D: Habitat-Matterport 3D Dataset HM3D is a large-scale dataset of high-quality 3D reconstructions from real-world indoor environments, used for embodied AI tasks. It contains over 1,000 scenes with RGB-D imagery, 3D meshes, semantic annotations, and physical realism. (S. Lai, M. Szot, E. Undersander et al.) [10/07/2025]
- House3D - House3D is a virtual 3D environment which consists of thousands of indoor scenes equipped with a diverse set of scene types, layouts and objects sourced from the SUNCG dataset. It consists of over 45k indoor 3D scenes, ranging from studios to two-storied houses with swimming pools and fitness rooms. All 3D objects are fully annotated with category labels. Agents in the environment have access to observations of multiple modalities, including RGB images, depth, segmentation masks and top-down 2D map views. The renderer runs at thousands frames per second, making it suitable for large-scale RL training. (Yi Wu, Yuxin Wu, Georgia Gkioxari, Yuandong Tian, facebook research) [Before 28/12/19]
- ICL-NUIM dataset - Eight synthetic RGBD video sequences: four from an office scene and four from a living room scene. Simulated camera trajectories are taken from a Kintinuous output from a sensor being moved around a real-world room. (A. Handa, T. Whelan, J.B. McDonald, A.J. Davison) [30/07/25]
- IMPART multi-view/multi-modal 2D+3D film production dataset - LIDAR, video, 3D models, spherical camera, RGBD, stereo, action, facial expressions, etc. (Univ. of Surrey) [Before 28/12/19]
- Industrial 3D Object Detection Dataset (MVTec ITODD) - depth and gray value data of 28 objects in 3500 labeled scenes for 3D object detection and pose estimation with a strong focus on industrial settings and applications (MVTec Software GmbH, Munich) [Before 28/12/19]
- JHU CoSTAR Block Stacking Dataset - A robot dynamically interacts with 5.1 cm colored blocks via real time RGBD data to complete an order-fulfillment style block stacking task, with over 12k stacking attempts and 2m frames with applications in deep learning, neural networks, reinforcement learning, and more. (Hundt, Jain, Lin, Paxton, Hager) [27/12/2020]
- Kinect v2 Dataset - Efficient Multi-Frequency Phase Unwrapping using Kernel Density Estimation (Felix etc.) [Before 28/12/19]
- KOMATSUNA dataset - The dataset is designed for instance segmentation, tracking and reconstruction of leaves, using both sequential multi-view RGB images and depth images. (Hideaki Uchiyama, Kyushu University) [Before 28/12/19]
- LUCES-MV: A Multi-View Dataset for Near-Field Point Light Source Photometric Stereo - LUCES-MV is the first real-world photometric stereo dataset supporting near-field lighting. It includes 15 diverse objects captured under 15 LEDs from 36 viewpoints (10 posed, 5 unposed). (F. Logothetis, I. Budvytis, S. Liwicki, R. Cipolla) [10/07/2025]
- Make3D Laser+Image data - about 1000 RGB outdoor images with aligned laser depth images (Saxena, Chung, Ng, Sun) [Before 28/12/19]
- McGill-Reparti Artificial Perception Database - RGBD data from four cameras and unfiltered Vicon skeletal data of two human subjects performing simulated assembly tasks on a car door (Andrew Phan, Olivier St-Martin Cormier, Denis Ouellet, Frank P. Ferrie). [Before 28/12/19]
- Meta rooms - RGB-D data comprised of 28 aligned depth camera images collected by having a robot go to a specific place and perform a 360 degree pan at various tilts. (John Folkesson et al.) [Before 28/12/19]
- METU Multi-Modal Stereo Datasets - Benchmark datasets for multi-modal stereo vision, composed of two parts: (1) synthetically altered stereo image pairs from the Middlebury Stereo Evaluation Dataset and (2) visible-infrared image pairs captured with a Kinect device. (Dr. Mustafa Yaman, Dr. Sinan Kalkan) [Before 28/12/19]
- MHT RGB-D - collected by a robot every 5 min over 16 days by the University of Lincoln. (John Folkesson et al.) [Before 28/12/19]
- ModelNet10 - Princeton 3D object dataset; a subset of ModelNet40 containing 4,899 pre-aligned shapes from 10 categories, split into 3,991 (80%) shapes for training and 908 (20%) shapes for testing. (Z. Wu, S. Song, A. Khosla, F. Yu, et al.) [10/07/25]
- ModelNet40 - Princeton 3D object dataset containing 12,311 pre-aligned shapes from 40 categories, split into 9,843 (80%) shapes for training and 2,468 (20%) shapes for testing. (Z. Wu, S. Song, A. Khosla, F. Yu, et al.) [10/07/25]
- Moving INfants In RGB-D (MINI-RGBD) - A synthetic, realistic RGB-D data set for infant pose estimation containing 12 sequences of moving infants with ground truth joint positions. (N. Hesse, C. Bodensteiner, M. Arens, U. G. Hofmann, R. Weinberger, A. S. Schroeder) [Before 28/12/19]
- Multi-sensor 3D Object Dataset for Object Recognition with Full Pose Estimation - a multi-sensor 3D object dataset for object recognition and pose estimation. (Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, etc.) [Before 28/12/19]
- MWAG: Multi-Season Wide-Area Air Ground Dataset MWAG is a dataset containing over 9,000 high-resolution RGB images captured from both aerial and ground views across multiple seasons. It includes precise geodetic camera pose metadata, supporting tasks such as 3D scene reconstruction, cross-view synthesis, and novel view generation under varying environmental conditions. (S. Zhu, Y. Han, J. Xu, et al.) [10/07/25]
- NPM3D - NPM3D, various 3d point clouds datasets (Paris-Lille-3D, Paris-CARLA-3D, KITTI-CARLA, SimKITTI32, Coins-Riedones3D) in autonomous driving, mobile laser scanning and also object pattern registration. (J. Deschaud, NPM3D Research Group) [22/06/25]
- NTU RGB+D Action Recognition Dataset - NTU RGB+D is a large scale dataset for human action recognition. (Amir Shahroudy) [Before 28/12/19]
- nuTonomy scenes dataset (nuScenes) - The nuScenes dataset is a large-scale autonomous driving dataset. It features: Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, two diverse cities: Boston and Singapore, left versus right hand traffic, detailed map information, manual annotations for 25 object classes, 1.1M 3D bounding boxes annotated at 2Hz, attributes such as visibility, activity and pose. (Caesar et al) [Before 28/12/19]
- NYU Depth Dataset V2 - Indoor Segmentation and Support Inference from RGBD Images [Before 28/12/19]
- NYU Depth V1 - Indoor Scene Segmentation using a Structured Light Sensor. (N. Silberman, R. Fergus) [29/07/25]
- Oakland 3-D Point Cloud Dataset (Nicolas Vandapel) [Before 28/12/19]
- Object Detection and Classification from Large-Scale Cluttered Indoor Scans - Faro lidar scans of ~40 academic offices, with 2-3 scans per office. Each scan is 0.25GB-2GB. Scans include depth and RGB. (Mattausch, Oliver, Panozzo, et al.) [29/07/25]
- Object Disappearance for Object Discovery - Three datasets: Small, with still images. Medium, video data from an office environment. Large, video over several rooms. Large dataset has 7 unique objects seen in 397 frames. Data is in ROS bag format. (J. Mason, B. Marthi, R. Parr) [29/07/25]
- Object Discovery in 3D scenes via Shape Analysis - KinFu meshes of 58 very cluttered indoor scenes. (A. Karpathy, S. Miller, F. Li) [29/07/25]
- OcMotion (CHOMP) OcMotion is the first video dataset explicitly designed for occluded human body motion capture, containing ~300 K frames across 43 motion sequences with accurate 3D joint-level annotations. (B. Huang, Y. Shu, J. Ju et al.) [31/07/25]
- Pacman project - Synthetic RGB-D images of 400 objects from 20 classes. Generated from 3D mesh models (Vladislav Kramarev, Umit Rusen Aktas, Jeremy L. Wyatt.) [Before 28/12/19]
- PKU-MMD PKU-MMD is a large-scale continuous multimodal action detection dataset captured via Kinect v2. (C. Liu, Y. Hu, Y. Li et al.) [31/07/25]
- Princeton Tracking Benchmark - 100 RGBD videos of moving objects such as humans, balls and cars. (Princeton University) [30/07/25]
- Procedural Human Action Videos - This dataset contains about 40,000 videos for human action recognition that were generated using a 3D game engine. The dataset contains about 6 million frames which can be used to train and evaluate models not only for action recognition but also for depth map estimation, optical flow, instance segmentation, semantic segmentation, 3D and 2D pose estimation, and attribute learning. (Cesar Roberto de Souza) [Before 28/12/19]
- RefRef: A Synthetic Dataset and Benchmark for Reconstructing Refractive & Reflective 3D Scenes - a synthetic dataset and benchmark comprising 150 unique scenes featuring objects of varying geometric and material complexity in diverse environments (Y. Yin, E. Tao, W. Deng et al.) [10/07/25]
- Replica Replica is a dataset of 18 highly photo-realistic 3D indoor scene reconstructions, featuring dense meshes, HDR textures, semantic class and instance segmentation, planar mirror and glass reflector information, and Habitat-SDK compatibility. (J. Straub, T. Whelan, L. Ma et al.) [31/07/25]
- RGB-D-based Action Recognition Datasets - Paper that includes the list and links of different rgb-d action recognition datasets. (Jing Zhang, Wanqing Li, Philip O. Ogunbona, Pichao Wang, Chang Tang) [Before 28/12/19]
- RGB-D Dataset 7-Scenes - The 7-Scenes dataset is a collection of tracked RGB-D camera frames. The dataset may be used for evaluation of methods for different applications such as dense tracking and mapping and relocalization techniques. (Microsoft) [29/07/25]
- RGB-D Part Affordance Dataset - RGB-D images and ground-truth affordance labels for 105 kitchen, workshop and garden tools, and 3 cluttered scenes (Myers, Teo, Fermuller, Aloimonos) [Before 28/12/19]
- RGB-D People Dataset - This dataset contains 3000+ RGB-D frames acquired in a university hall from three vertically mounted Kinect sensors. (L. Spinello, K. Arras, et al.) [30/07/25]
- RGB-D Scenes - This dataset contains 8 scenes annotated with objects that belong to the RGB-D Object Dataset. Each scene is a single video sequence consisting of multiple RGB-D frames. (K. Lai, L. Bo, X. Ren, D. Fox) [29/07/25]
- RGB-D Scenes v2 - The RGB-D Scenes Dataset v2 consists of 14 scenes containing furniture (chair, coffee table, sofa, table) and a subset of the objects in the RGB-D Object Dataset (bowls, caps, cereal boxes, coffee mugs, and soda cans). (K. Lai, L. Bo, D. Fox) [29/07/25]
- SBU Kinect Interaction Dataset - a two-person interaction video collection captured via Kinect, containing 8 human interaction types. (Yun, K., Honorio, J., Chattopadhyay, D., et al.) [31/07/25]
- Scan2CAD Scan2CAD is a large-scale RGB-D scan to CAD alignment dataset built upon 1,506 ScanNet scenes and 14,225 CAD model instances (3,049 unique models). It contains 97,607 pairwise keypoint correspondences and ground-truth 9 DoF object alignments for evaluating scan-to-model alignment methods. (A. Avetisyan, M. Dahnert, A. Dai et al.) [31/07/25]
- ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes - ScanNet is a dataset of richly-annotated RGB-D scans of real-world environments containing 2.5M RGB-D images in more than 1500 scans, annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations. (Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, Matthias Niessner) [Before 28/12/19]
- ScanNet++: A High-Fidelity Dataset of 3D Indoor Scene - 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames. Each scene is captured with a high-end laser scanner at sub-millimeter resolution, along with registered 33-megapixel images from a DSLR camera, and RGB-D streams from an iPhone. (Yeshwanth, Liu, Niessner, Dai) [Sept 18, 2023]
- ScanObjectNN - a benchmark dataset containing ~15,000 objects categorized into 15 categories, with 2,902 unique object instances. The raw objects are represented by a list of points with global and local coordinates, normals, color attributes and semantic labels. (M. Angelina Uy, Q. Pham et al.) [10/07/25]
- SceneNet RGB-D - 5 million rendered images of 16,895 indoor scenes. Room configurations are randomly generated with a physics simulator. (J. McCormac, A. Kanda, S. Leutenegger, A. Davison) [30/07/25]
- SceneNN - Videos of indoor scenes, registered into triangle meshes. (B. Hua, Q. Pham, D. Nguyen, et al.) [29/07/25]
- SceneNN: A Scene Meshes Dataset with aNNotations - RGB-D scene dataset with 100+ indoor scenes, labeled triangular mesh, voxel and pixel. (Hua, Pham, Nguyen, Tran, Yu, and Yeung) [Before 28/12/19]
- SCoDA (ScanSalon) SCoDA introduces ScanSalon, a domain-adaptive shape completion dataset containing 800 real-scan and artist-created mesh pairs across six object categories (chair, desk, sofa, bed, lamp, car). It provides point clouds, paired CAD shapes, and alignment annotations to support 3D shape completion from real scans. (Y. Wu, Z. Yan, C. Chen et al.) [31/07/25]
- Semantic-8: 3D point cloud classification with 8 classes (ETH Zurich) [Before 28/12/19]
- SeMantic InDustry (S.MID) - A dataset designed to advance the field of LiDAR semantic segmentation (hybrid-solid LiDAR), specifically for robotic applications and large-scale industrial scenes (Wang, Zhao, Cao, Deng, Wang, Chen) [5/10/24]
- Shading-based Refinement on Volumetric Signed Distance Functions - Four RGBD sequences of small statues and artefacts. (M. Zollhofer, A. Dai, M. Innmann, et al.) [30/07/25]
- ShapeNetPart Dataset - a dataset for 3D part-based segmentation, derived from the ShapeNetCore dataset. It contains 3D models of objects with part-level annotations; each object is represented by a point cloud (XYZ coordinates) along with part labels that indicate the specific part of the object. (MIT) [10/07/25]
- Small office data sets - Kinect depth images every 5 seconds beginning in April 2014 and on-going. (John Folkesson et al.) [Before 28/12/19]
- Stanford Large-Scale 3D Indoor Spaces Dataset (S3DIS) - S3DIS comprises 6 colored 3D point clouds from 6 large-scale indoor areas, along with semantic instance annotations for 12 object categories: wall, floor, ceiling, beam, column, window, door, sofa, desk, chair, bookcase, and board. (Armeni, I., Sener, O., Zamir, A., et al.) [10/07/25]
- Stereo and ToF dataset with ground truth - The dataset contains 5 different scenes acquired with a Time-of-flight sensor and a stereo setup. Ground truth information is also provided. (Carlo Dal Mutto, Pietro Zanuttigh, Guido M. Cortelazzo) [Before 28/12/19]
- STPLS3D - a large database of annotated ground truth 3D point clouds reconstructed using aerial photogrammetry for training and validating 3D semantic and instance segmentation algorithms (Chen, Hu, Yu, Thomas, Feng, Hou, McCullough, Ren, Soibelman) [3/12/2022]
- Structured3D Structured3D is a large-scale photo-realistic synthetic indoor scene dataset comprising 3,500 professional house designs rendered into 2D images. (J. Zheng, J. Zhang, J. Li et al.) [31/07/25]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- Taskonomy - Over 4.5 million real images each with ground truth for 25 semantic, 2D, and 3D tasks. (Zamir, Sax, Shen, Guibas, Malik, Savarese) [Before 28/12/19]
- TAU Agent dataset - A stereo RGB-D dataset created by the open-source 3D animation software blender and containing 525 image-pairs with a resolution of 512x1024 along with corresponding ground truth pixel-wise depth maps. (Haim, Elmalem, Gil, Giryes, Bronstein, Marom) [27/12/2020]
- THU-READ (Tsinghua University RGB-D Egocentric Action Dataset) - THU-READ is a large-scale dataset for action recognition in RGBD videos with pixel-level hand annotation. (Yansong Tang, Yi Tian, Jiwen Lu, Jianjiang Feng, Jie Zhou) [Before 28/12/19]
- TICaM - A Time-of-flight In-car Cabin Monitoring Dataset (Jigyasa Singh Katrolia, Bruno Mirbach, Ahmed El-Sherif, Hartmut Feld, Jason Rambach, Didier Stricker) [1/2/21]
- TIMo - Time-of-flight indoor monitoring dataset - a time-of-flight camera depth and IR dataset for indoor person detection and segmentation and anomaly detection (Schneider, Anisimov, Islam, Mirbach, Rambach, Grandidier, Stricker) [23/11/21]
- TrimBot2020 Dataset for Garden Navigation - sensor RGBD data recorded from cameras and other sensors mounted on a robotic platform, as well as additional external sensors capturing the garden (TrimBot2020 consortium) [26/2/20]
- Trimbot-Wageningen-SLAM-Dataset - a real outdoor garden dataset captured in Wageningen for the Trimbot2020 project. It can be used for depth estimation, pose estimation, SLAM, 3D reconstruction, etc. (Pu, Can and Yang, et al.) [05/06/25]
- TUM RGB-D Benchmark - Dataset and benchmark for the evaluation of RGB-D visual odometry and SLAM algorithms; see the depth back-projection sketch after this list. (Jorgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard and Daniel Cremers) [Before 28/12/19]
- UC-3D Motion Database - Available data types encompass high resolution Motion Capture, acquired with MVN Suit from Xsens and Microsoft Kinect RGB and depth images. (Institute of Systems and Robotics, Coimbra, Portugal) [Before 28/12/19]
- Uni Bremen Open, Abdominal Surgery RGB Dataset - Recording of a complete, open, abdominal surgery using a Kinect v2 that was mounted directly above the patient looking down at patient and staff. (Joern Teuber, Gabriel Zachmann, University of Bremen) [Before 28/12/19]
- UniG3D UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on Objaverse and ShapeNet datasets. (Q. Sun, Y. Li, Z. Liu et al.) [31/07/25]
- UniMod1K UniMod1K is a multimodal dataset for universal segmentation tasks, containing 1,000 high-quality images across RGB, depth, infrared, and thermal modalities with pixel-level annotations. (X. Zhu, Z. Yuan, J. Zhang et al.) [11/07/25]
- UNIST LS3DPC Dataset - 11 large-scale 3D point clouds captured with a terrestrial LiDAR scanner for reflection removal (Yun, Sim) [27/12/2020]
- URFD: University of Rzeszow Fall Detection Dataset URFD is an RGB-D video dataset for fall detection, capturing realistic simulated falls and daily activities using a Kinect sensor. It includes depth maps, skeleton data, and annotated video sequences for human action recognition. (K. Kwolek & M. Kepski) [10/07/2025]
- USF Range Image Database - 400+ laser range finder and structured light camera images, many with ground truth segmentations (Adam et al.) [Before 28/12/19]
- ViDSOD-100 - RGB-D Video Salient Object Detection. 9 classes, 43 sub-classes, 100 videos within a total of 9362 frames, acquired from diverse natural scenes (Lin, Zhu, Shen, Fu, Zhang, Wang) [28/10/24]
- Visual Odometry with Inertial and Depth (VOID) Dataset - VOID consists of 48K synchronized inertial, image and depth frames of indoor and outdoor scenes, along with sparse point clouds obtained by a visual inertial odometry system (XIVO), for the purpose of 3D reconstruction from images and sparse depth. (Wong, Alex and Fei, Xiaohan and Tsuei, Stephanie and Soatto, Stefano) [1/2/21]
- vsenseVVDB: V-SENSE Volumetric Video Quality Database #1 - A collection of 32 compressed point clouds (from 2 reference point clouds) and their mean opinion scores, collected from 19 participants. (Zerman, Gao, Ozcinar, Smolic) [22/11/2021]
- vsenseVVDB2: V-SENSE Volumetric Video Quality Database #2 - A collection of 152 compressed volumetric videos (both coloured point clouds and textured 3D meshes) and their mean opinion scores, collected from 23 participants. (Zerman, Ozcinar, Gao, Smolic) [22/11/2021]
- Washington RGB-D Object Dataset - 300 common household objects and 14 scenes. (University of Washington and Intel Labs Seattle) [Before 28/12/19]
- Witham Wharf - RGB-D of eight locations collected by a robot every 10 min over ~10 days by the University of Lincoln. (John Folkesson et al.) [Before 28/12/19]
- York 3D Ear Dataset - The York 3D Ear Dataset is a set of 500 3D ear images, synthesized from detailed 2D landmarking, and available in both Matlab format (.mat) and PLY format (.ply). (Nick Pears, Hang Dai, Will Smith, University of York) [Before 28/12/19]
- ZJU-MoCap (LightStage & Mirrored-Human) ZJU-MoCap is a multi-view human motion capture dataset featuring two parts. (S. Peng, X. Xu, Q. Shuai et al.) [31/07/25]
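Many of the RGB-D datasets above (the TUM RGB-D Benchmark is a typical example) distribute depth as 16-bit PNG images plus camera intrinsics. The Python sketch below back-projects one such depth frame to a 3D point cloud, assuming the documented TUM convention (depth in metres = pixel value / 5000) and nominal Kinect intrinsics; the file path is hypothetical, and the scale factor and intrinsics should be taken from whichever dataset is actually used.

    import numpy as np
    from PIL import Image

    def depth_to_points(depth_png, fx=525.0, fy=525.0, cx=319.5, cy=239.5, scale=5000.0):
        # Depth image: 16-bit PNG, value / scale = metres (TUM convention; 0 = no reading).
        depth = np.asarray(Image.open(depth_png), dtype=np.float32) / scale
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx          # pinhole back-projection
        y = (v - cy) * depth / fy
        pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
        return pts[pts[:, 2] > 0]          # drop pixels without a depth reading

    # Hypothetical path into a downloaded TUM RGB-D sequence:
    points = depth_to_points("rgbd_dataset_freiburg1_xyz/depth/1305031102.160407.png")
    print(points.shape)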
General Videos
- Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events - BlackSwan is a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. (A. Chinchure, S. Ravi et al.) [10/07/25]
- 360+x : A Panoptic Multi-modal Scene Understanding Dataset - 360+x dataset introduces a unique panoptic perspective to scene understanding, differentiating itself from existing datasets, by offering multiple viewpoints and modalities, captured from a variety of daily scenes. (H. Chen, Y. Hou, C. Qu, et al.) [22/06/25]
- AlignMNIST - An artificially extended version of the MNIST handwritten digit dataset. (Søren Hauberg) [Before 28/12/19]
- Audio-Visual Event (AVE) dataset - AVE dataset contains 4143 YouTube videos covering 28 event categories and videos in AVE dataset are temporally labeled with audio-visual event boundaries. (Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu) [Before 28/12/19]
- ChaLearn LAP First Impressions (Dataset-24) Dataset-24 ("First Impressions") comprises ~10,000 short (~15 sec) YouTube video clips annotated with apparent personality trait labels (Big-Five) via pairwise human judgments; it supports vision-based personality inference tasks. (V. Ponce-Lopez, B. Chen, M. Oliu et al.) [31/07/25]
- COMSATS Structured Video Tampering Evaluation Dataset (CSVTED) - three-level dataset for video tampering evaluation, structured by tampering quality and video complexity (Akhtar, Saddique, Rosin, Sun, Hussain, Habib) [13/8/25]
- CVQAD (Compressed Video Quality Assessment Dataset) - 1,022 compressed videos (including UGC), 32 encoders of 5 compression standards (H.265/HEVC, H.266/VVC, AV1, H.264/AVC, and VP9), 3 target bitrates (1,000 kbps, 2,000 kbps, and 4,000 kbps) (MSU Graphics and Media Lab) [8/3/2023]
- Dataset of Multimodal Semantic Egocentric Video (DoMSEV) - Labeled 80-hour Dataset of Multimodal Semantic Egocentric Videos (DoMSEV) covering a wide range of activities, scenarios, recorders, illumination and weather conditions. (UFMG, Michel Silva, Washington Ramos, Joao Ferreira, Felipe Chamone, Mario Campos, Erickson R. Nascimento) [Before 28/12/19]
- DAVIS: Video Object Segmentation dataset 2016 - A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung) [Before 28/12/19]
- DAVIS: Video Object Segmentation dataset 2017 - The 2017 DAVIS Challenge on Video Object Segmentation (J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool) [Before 28/12/19]
- EGO-CH - a large egocentric video dataset acquired by real visitors in two different cultural sites. The dataset includes more than 27 hours of video acquired by 70 different subjects. The overall dataset includes labels for 26 environments and over 200 Points of Interest (POIs). (Giovanni Maria Farinella) [31/12/19]
- Egocentric 4D Perception - a massive-scale egocentric dataset and benchmark suite collected across 74 worldwide locations and 9 countries, with over 3,670 hours of daily-life activity video. (K. Kitani, X. Liu, et al.) [05/06/25]
- FAIR-Play - 1,871 video clips (~5 hrs) and their corresponding binaural audio clips recorded in a music room (Gao and Grauman) [29/12/19]
- FLIGHTMARE - a photorealistic, customizable, and easy to use simulator for quadrotors! It is compatible with ROS, Gazebo, OpenAI Gym, and even Oculus #VR headsets. (Song, Naji, Kaufmann, Loquercio, Scaramuzza) [26/12/2020]
- FPV-O - FPV-O is a first-person video dataset of everyday office activities captured with a chest-mounted camera, comprising ~3 hours of footage from 12 subjects performing 20 interaction and object-manipulation activities. (G. Abebe, A. Catala, A. Cavallaro et al.) [31/07/25]
- GoPro-Gyro Dataset - ego centric videos (Linkoping Computer Vision Laboratory) [Before 28/12/19]
- HD-VILA-100M Dataset - HD-VILA-100M is a large-scale, high-resolution, and diversified video-language dataset that facilitates multimodal representation learning. (Xue, Hongwei and Hang, et al.) [05/06/25]
- HowTo100M HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. (A. Miech, M. Tapaswi et al.) [05/06/25]
- Image & Video Quality Assessment at LIVE - used to develop picture quality algorithms (the University of Texas at Austin) [Before 28/12/19]
- Large scale YouTube video dataset - 156,823 videos (2,907,447 keyframes) crawled from YouTube videos (Yi Yang) [Before 28/12/19]
- LoopNav - A loop-based navigation dataset in Minecraft aiming to boost spatial consistency over long horizons. (K. Lian) [22/06/25]
- Movie Memorability Dataset - memorable movie clips and ground truth of detail memorability, 660 short movie excerpts extracted from 100 Hollywood-like movies (Cohendet, Yadati, Duong and Demarty) [Before 28/12/19]
- MovieQA - teach machines to understand stories by answering questions about them. 15000 multiple choice QAs, 400+ movies. (M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler) [Before 28/12/19]
- Multispectral visible-NIR video sequences - Annotated multispectral video, visible + NIR (LE2I, Université de Bourgogne) [Before 28/12/19]
- Moments in Time Dataset - 1M 3-second videos annotated with action type, the largest dataset of its kind for action recognition and understanding in video. (Monfort, Oliva, et al.) [Before 28/12/19]
- MovieNet - A holistic dataset for movie understanding. (Q. Huang, Y. Xiong, A. Rao, et al.) [05/06/25]
- Near duplicate video retrieval dataset - This database consists of 156,823 videos sequences (2,907,447 keyframes), which were crawled from YouTube during the period of July 2010 to September 2010. (Jingkuan Song, Yi Yang, Zi Huang, Heng Tao Shen, Richang Hong) [Before 28/12/19]
- PHD2: Personalized Highlight Detection Dataset - PHD2 is a dataset with personalized highlight information, which allows to train highlight detection models that use information about the user, when making predictions. (Ana Garcia del Molino, Michael Gygli) [Before 28/12/19]
- Sports-1M - Dataset for sports video classification containing 487 classes and 1.2M videos. (Andrej Karpathy and George Toderici and Sanketh Shetty and Thomas Leung and Rahul Sukthankar and Li Fei-Fei.) [Before 28/12/19]
- nuTonomy scenes dataset (nuScenes) - The nuScenes dataset is a large-scale autonomous driving dataset. It features: Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, two diverse cities: Boston and Singapore, left versus right hand traffic, detailed map information, manual annotations for 25 object classes, 1.1M 3D bounding boxes annotated at 2Hz, attributes such as visibility, activity and pose. (Caesar et al) [Before 28/12/19]
- REDS (REalistic and Dynamic Scenes) - high-quality realistic blurry video dataset with reference sharp frames (improved version of GOPRO) (Nah, Baik, Hong, Moon, Son, Timofte and Lee) [4/1/20]
- Sapsucker Woods 60 Audiovisual Dataset - 60 species of birds and is comprised of images from existing datasets, and brand new, expert curated audio and video datasets. (Grant Van Horn) [1/12/2022]
- UCF101 Action Recognition Data Set - UCF101 is an action recognition data set of realistic action videos collected from YouTube, with 101 action categories. It is an extension of the UCF50 dataset, which has 50 action categories; see the frame-sampling sketch after this list. (K. Soomro, A. Zamir and M. Shah) [10/07/25]
- USEEVid DB - University of Salzburg Encryption Evaluation - Video Database. This is a database of videos with quality and recognizability scores by human observers. (Hofbauer, Autrusseau, Uhl) [24/06/25]
- Video Sequences - used for research on Euclidean upgrades based on minimal assumptions about the camera. (Kenton McHenry) [Before 28/12/19]
- Video Stacking Dataset - A Virtual Tripod for Hand-held Video Stacking on Smartphones (Erik Ringaby etc.) [Before 28/12/19]
- VideoMem Dataset - The VideoMem or Video Memorability Database is a collection of sound-less video excerpts and their corresponding ground-truth memorability files. The memorability scores are computed based on the measurement of short-term and long-term memory performances when recognizing small video excerpts a few minutes after viewing them for the short-term case, and 24 to 72 hours later, for the long-term case. It is accompanied with video features extracted from the video excerpts. It is intended to be used for understanding the memorability of videos and for assessing the quality of methods for predicting the memorability of multimedia content. (Cohendet, Demarty, Duong and Engilberge) [6/1/20]
- Video Object Instance Segmentation Dataset - a dataset consisting of Internet-collected video clips with average length around 10s and resolution over 1920 x 1080. (maadaa.ai) [05/06/25]
- YFCC100M videos - A benchmark on the video subset of YFCC100M which includes the videos, the video content features and the API to a state-of-the-art video content engine. (Lu Jiang) [Before 28/12/19]
- YFCC100M: The New Data in Multimedia Research - This publicly available curated dataset of 100 million photos and videos is free and legal for all. (Bart Thomee, Yahoo Labs and Flickr in San Francisco, etc.) [Before 28/12/19]
- Youtube-360 - A collection of 360 degree videos with spatial audio (first order ambisonics) from YouTube, containing clips from a diverse set of topics such as musical performances, vlogs, sports, and others. (Morgado, Li, Vasconcelos) [7/1/2021]
- YouTube-8M - Dataset for video classification in the wild, containing pre-extracted frame level features from 8M videos, and 4800 classes. (Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev,George Toderici, Balakrishnan Varadarajan, Sudheendra Vijayanarasimhan) [Before 28/12/19]
- YouTube-BoundingBoxes - 5.6 million accurate human-annotated BB from 23 object classes tracked across frames, from 240,000 YouTube videos, with a strong focus on the person class (1.3 million boxes) (Real, Shlens, Pan, Mazzocchi, Vanhoucke, Khan, Kakarla et al) [Before 28/12/19]
- YUP++ / Dynamic Scenes dataset - 20 outdoor scene classes, each with 60 colour videos (each 5 seconds, 480 pixels wide, 24-30 fps) from 60 different scenes. Half of the videos are with a static camera and half with a moving camera (Feichtenhofer, Pinz, Wildes) [Before 28/12/19]
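Most of the video datasets in this section (UCF101, for instance) are distributed as ordinary video files, so a typical first preprocessing step is to decode each clip and sample a fixed number of frames. The Python sketch below does this with OpenCV; the clip path is hypothetical, and uniform sampling over the clip is just one common choice, not part of any dataset's official tooling.

    import cv2
    import numpy as np

    def sample_frames(video_path, num_frames=16):
        # Uniformly sample up to num_frames frames over the whole clip.
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        wanted = set(np.linspace(0, max(total - 1, 0), num_frames).astype(int).tolist())
        frames, i = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if i in wanted:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            i += 1
        cap.release()
        return np.stack(frames) if frames else np.empty((0,))

    # Hypothetical path to one UCF101 clip:
    clip = sample_frames("UCF101/v_ApplyEyeMakeup_g01_c01.avi")
    print(clip.shape)   # (num_frames_or_fewer, height, width, 3)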
Hand, Hand Grasp, Hand Action and Gesture Databases
- 11k Hands - 11,076 hand images (1600 x 1200 pixels) of 190 subjects, of varying ages between 18 - 75, with metadata (id, gender, age, skin color, handedness, which hand, accessories, etc). (Mahmoud Afifi) [Before 28/12/19]
- 3D Articulated Hand Pose Estimation with Single Depth Images (Tang, Chang, Tejani, Kim, Yu) [Before 28/12/19]
- A-STAR Annotated Hand-Depth Image Dataset and its Performance Evaluation - depth data and data glove data, 29 images of 30 volunteers, Chinese number counting and American Sign Language (Xu and Cheng) [Before 28/12/19]
- American Sign Language Letters - Object detection dataset of each ASL letter with a bounding box. David Lee, a data scientist focused on accessibility, curated and released the dataset for public use. (Lee) [09/06/25]
- ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation - A dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. (Fan, Zicong, Taheri, et al) [22/06/25]
- A Dataset of Human Manipulation Actions - RGB-D of 25 objects and 6 actions (Alessandro Pieropan) [Before 28/12/19]
- AUTSL AUTSL is a large-scale multi-modal Turkish Sign Language dataset containing 38,336 isolated sign video samples (226 unique signs) performed by 43 signers, recorded in RGB, depth, and skeleton formats with challenging indoor/outdoor backgrounds. (O. M. Sincan, H. Y. Keles et al.) [31/07/25]
- BOBSL BOBSL is a large-scale British Sign Language dataset derived from BBC broadcast content, covering ~1,940 episodes (~1,400 hours) interpreted by 37 distinct signers.(S. Albanie, G. Varol, L. Momeni et al.) [31/07/25]
- BSLDict BSLDict is a machine-readable British Sign Language dictionary dataset featuring over 14,000 isolated-sign video clips covering approximately 9,000 unique words, collected from multiple signers and sources. (G. Varol, L. Momeni, S. Albanie et al.) [31/07/25]
- Bosphorus Hand Geometry Database and Hand-Vein Database (Bogazici University) [Before 28/12/19]
- ContactPose - A large-scale functional grasping dataset with hand-object contact, hand and object pose, and 2.9 M RGB-D grasp images (Brahmbhatt, Tang, Twigg, Kemp, Hays) [30/12/2020]
- DemCare dataset - DemCare dataset consists of a set of diverse data collection from different sensors and is useful for human activity recognition from wearable/depth and static IP camera, speech recognition for Alzheimer's disease detection and physiological data for gait analysis and abnormality detection. (K. Avgerinakis, A.Karakostas, S.Vrochidis, I. Kompatsiaris) [Before 28/12/19]
- Drone Gesture Control Dataset - Consists of hand and body gesture commands for directing a drone to 'take-off', 'land' or 'follow'. (Dwyer) [09/06/25]
- DVS128 Gesture Dataset - Event-based dataset, containing sequences of 11 hand gestures, performed by 29 subjects under several illumination conditions, captured using a DVS128 sensor. Each sequence is annotated with the start and stop times of each gesture. (Amir, Taba, Berg, Melano, McKinstry, Di Nolfo, Nayak, Andreopoulos, Garreau, Mendoza, Kusnitz, Debole, Esser, Delbruck, Flickner, and Modha) [7/1/20]
- EgoDaily - Egocentric hand detection dataset with variability on people, activities and places, to simulate daily life situations (Cruz, Chan) [30/12/2020]
- EgoGesture Dataset - First-person view gestures with 83 classes, 50 subjects, 6 scenes, 24161 RGB-D video samples (Zhang, Cao, Cheng, Lu) [Before 28/12/19]
- EgoHands - A large dataset with over 15,000 pixel-level-segmented hands recorded from egocentric cameras of people interacting with each other. (Sven Bambach) [Before 28/12/19]
- EgoHands Dataset - 4800 annotated images of human hands from a first-person view. (IU Computer Vision Lab) [09/06/25]
- Ego3DHands - RGB-D Synthetic large-scale egocentric two-hand dataset for 3D global pose estimation (Lin, Wilhelm) [28/12/2020]
- EgoYouTubeHands dataset - An egocentric hand segmentation dataset consists of 1290 annotated frames from YouTube videos recorded in unconstrained real-world settings. The videos have variation in environment, number of participants, and actions. This dataset is useful to study hand segmentation problem in unconstrained settings. (Aisha Urooj, A. Borji) [Before 28/12/19]
- Face-Touching-Behavior - 2 million video frames annotated as face-touch or no-face-touch for a dataset composed of audio-visual recordings of mall group social interactions. (C. Beyan, M. Bustreo, M. Shahid, G. Bailo, N. Carissimi, A. Del Bue) [1/2/21]
- FORTH Hand tracking library (FORTH) [Before 28/12/19]
- FreiHAND - FreiHAND is a large-scale RGB image dataset for 3D hand pose and shape estimation, featuring ~32,560 training and ~3,960 evaluation real-world frames annotated with full 3D hand joint keypoints and mesh-level hand shape; see the joint-error sketch after this list. (C. Zimmermann, D. Ceylan, J. Yang et al.) [31/07/25]
- General HANDS: general hand detection and pose challenge - 22 sequences with different gestures, activities and viewpoints (UC Irvine) [Before 28/12/19]
- GraspXL: Generating Grasping Motions for Diverse Objects at Scale - Contains 10M+ diverse grasping motions for 500k+ objects of different dexterous hands. (Zhang, Hui, Christen, et al) [22/06/25]
- GRASP MultiCam data set - combines video from a synchronized stereo monochrome camera and IMU with depth images from a Time-of-flight depth sensor, allowing for accurate Visual-Inertial Odometry (VIO) and recovery of 3D structure from the depth sensor point clouds (Pfrommer, Owens, Shariati, Skandan, Taylor, Daniilidis) [27/12/2020]
- Grasp UNderstanding (GUN-71) dataset - 12,000 first-person RGB-D images of object manipulation scenes annotated using a taxonomy of 71 fine-grained grasps. (Rogez, Supancic and Ramanan) [Before 28/12/19]
- A Hand Gesture Detection Dataset (Javier Molina et al) [Before 28/12/19]
- Hand gesture and marine silhouettes (Euripides G.M. Petrakis) [Before 28/12/19]
- HandNet: annotated depth images of articulated hands - 214,971 annotated depth images of hand poses captured by a RealSense RGBD sensor. Annotations: per-pixel classes, 6D fingertip pose, heatmap. Train: 202,198, Test: 10,000, Validation: 2,773. Recorded at GIP Lab, Technion. [Before 28/12/19]
- HandOverFace dataset - A hand segmentation dataset consists of 300 annotated frames from the web to study the hand-occluding-face problem. (Aisha Urooj, A. Borji) [Before 28/12/19]
- HO-3D HO-3D is a dataset for joint 3D hand and object pose estimation with single and multi-camera RGB-D sequences. (S. Hampali, M. Rad, V. Lepetit et al.) [31/07/25]
- IDIAP Hand pose/gesture datasets (Sebastien Marcel) [Before 28/12/19]
- Jester - densely-labeled video clips that show humans performing predefined hand gestures in front of a laptop camera or webcam (Twenty Billion Neurons GmbH) [Before 28/12/19]
- Kinect and Leap motion gesture recognition dataset - The dataset contains 1400 different gestures acquired with both the Leap Motion and the Kinect devices. (Giulio Marin, Fabio Dominio, Pietro Zanuttigh) [Before 28/12/19]
- Kinect and Leap motion gesture recognition dataset - The dataset contains several different static gestures acquired with the Creative Senz3D camera. (A. Memo, L. Minto, P. Zanuttigh) [Before 28/12/19]
- LEGO A dataset of over 150k paired egocentric images captured before and during daily tasks, designed for visual instruction generation. (B. Lai, X. Dai, L. Chen, G. Pang, et al.) [10/07/25]
- LISA CVRR-HANDS 3D - 19 gestures performed by 8 subjects as car driver and passengers (Ohn-Bar and Trivedi) [Before 28/12/19]
- MPI Dexter 1 Dataset for Evaluation of 3D Articulated Hand Motion Tracking - Dexter 1: 7 sequences of challenging, slow and fast hand motions, RGB + depth (Sridhar, Oulasvirta, Theobalt) [Before 28/12/19]
- MSR Realtime and Robust Hand Tracking from Depth - (Qian, Sun, Wei, Tang, Sun) [Before 28/12/19]
- Mobile and Webcam Hand images database - MOHI and WEHI - 200 people, 30 images each (Ahmad Hassanat) [Before 28/12/19]
- MS-ASL MS-ASL is a large-scale real-world American Sign Language dataset containing over 25,000 annotated videos spanning 1,000 signs performed by more than 200 distinct signers, captured in unconstrained everyday recording conditions. (H. Vaezi-Joze, O. Koller et al.) [31/07/25]
- NTU-Microsoft Kinect HandGesture Dataset - This is a RGB-D dataset of hand gestures, 10 subjects x 10 hand gestures x 10 variations. (Zhou Ren, Junsong Yuan, Jingjing Meng, and Zhengyou Zhang) [Before 28/12/19]
- NUIG_Palm1 - Database of palmprint images acquired in unconstrained conditions using consumer devices for palmprint recognition experiments. (Adrian-Stefan Ungureanu) [Before 28/12/19]
- NYU Hand Pose Dataset - 8252 test-set and 72757 training-set frames of captured RGBD data with ground-truth hand-pose, 3 views (Tompson, Stein, Lecun, Perlin) [Before 28/12/19]
- PRAXIS gesture dataset - RGB-D upper-body data from 29 gestures, 64 volunteers, several repetitions, many volunteers have some cognitive impairment (Farhood Negin, INRIA) [Before 28/12/19]
- Rendered Handpose Dataset - Synthetic dataset for 2D/ 3D Handpose Estimation with RGB, depth, segmentation masks and 21 keypoints per hand (Christian Zimmermann and Thomas Brox) [Before 28/12/19]
- RGB-D In-Hand Manipulation Dataset A dataset of high-fidelity RGB-D video sequences capturing in-hand object manipulation by humans.(M. Smith, L. Chen, R. Patel et al.) [11/07/25]
- ROSHAMBO17 - RoShamBo Rock Scissors Paper game DVS dataset - recorded from ~20 persons each showing the rock, scissors and paper symbols for about 2m each with a variety of poses, distances, positions, and left/right hand. (Lungu, Corradi, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- RWTH-Boston-50 and RWTH-Boston-104 - American Sign Language hand gesture video datasets, containing 201 annotated sentences captured by 4 cameras (2 B/W stereo, 1 color, one side-view B/W) at 30 fps and 312x242 pixels. The 50 dataset has 483 utterances of 50 words. (Dreuw, Keysers, Forster, Deselaers, Rybach, Zahedi, Ney) [14/3/20]
- Sahand LMC Sign Language Database - This database is collected with a webcam and Leap Motion Controller (LMC) and comprises 32 classes, including 24 American letters (J and Z are excluded because they are dynamic gestures) and the numbers 0 to 9 (the gestures for 6 and W are the same, as are those for 9 and F). Each class contains 2000 samples. (Mahdikhanlou, Ebrahimnezhad) [27/12/2020]
- Sahand Dynamic Hand Gesture Database - This database contains 11 Dynamic gestures designed to convey the functions of mouse and touch screens to computers. (Behnam Maleki, Hossein Ebrahimnezhad) [Before 28/12/19]
- Sheffield gesture database - 2160 RGBD hand gesture sequences, 6 subjects, 10 gestures, 3 postures, 3 backgrounds, 2 illuminations (Ling Shao) [Before 28/12/19]
- SL-ANIMALS-DVS Database - The SL-ANIMALS-DVS database consists of DVS recordings of humans performing sign language gestures of various animals as a continuous spike flow at very low latency. (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- UT Grasp Data Set - 4 subjects grasping a variety of objects with a variety of grasps (Cai, Kitani, Sato) [Before 28/12/19]
- WLASL - word-level American Sign Language dataset, containing 2,000 common words with 21k RGB videos performed by more than a hundred native signers (Li, Rodriguez, Yu, Li) [27/12/2020]
- Yale human grasping data set - 27 hours of video with tagged grasp, object, and task data from two housekeepers and two machinists (Bullock, Feix, Dollar) [Before 28/12/19]
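Several of the hand datasets above (FreiHAND, Rendered Handpose, NYU Hand Pose) annotate each hand with a fixed set of 3D keypoints, and methods are usually compared by the mean distance between predicted and ground-truth joints. The short Python sketch below computes that mean per-joint error; the (N, 21, 3) array layout and millimetre units are assumptions for illustration, not any dataset's official format.

    import numpy as np

    def mean_joint_error(pred_xyz, gt_xyz):
        # pred_xyz, gt_xyz: (N, 21, 3) keypoint arrays (assumed to be in millimetres).
        return float(np.linalg.norm(pred_xyz - gt_xyz, axis=-1).mean())

    # Synthetic example: ground truth plus roughly 2 mm of noise.
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(4, 21, 3)) * 10.0
    pred = gt + rng.normal(scale=2.0, size=gt.shape)
    print(f"mean joint error: {mean_joint_error(pred, gt):.2f} mm")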
Image, Video and Shape Database Retrieval
- 2D-to-3D Deformable Sketches - A collection of deformable 2D contours in pointwise correspondence with deformable 3D meshes of the same class; around 10 object classes are provided, including humans and animals. (Lahner, Rodola) [Before 28/12/19]
- 3D Deformable Objects in Clutter - A dataset for 3D deformable object-in-clutter, with point-wise ground truth correspondence across hundreds of scenes and spanning multiple classes (humans, animals). (Cosmo, Rodola, Masci, Torsello, Bronstein) [Before 28/12/19]
- 3D eye fixation dataset - A saliency dataset on 3D models (Shanfeng Hu, Xiaohui Liang, Hubert P. H. Shum, Fredrick W. B. Li and Nauman Aslam) [1/2/21]
- ANN_SIFT1M - 1M Flickr images encoded by 128D SIFT descriptors (Jegou et al) [Before 28/12/19]
- A Content-Driven Micro-Video Recommendation Dataset at Scale - MicroLens is a large-scale micro-video dataset with multimodal content for content-driven recommendation and video understanding (Representation Learning Lab, WestLake University) [10/07/25]
- Augmented ICL-NUIM Dataset - An augmentation of the ICL-NUIM dataset, with camera paths added to allow it to be used for scene reconstruction. (S. Choi, Q. Zhou, V. Koltun) [30/07/25]
- Brown Univ 25/99/216 Shape Databases (Ben Kimia) [Before 28/12/19]
- CIFAR-10 - 60K 32x32 images from 10 classes, with a 512D GIST descriptor (Alex Krizhevsky) [Before 28/12/19]
- CLEF-IP 2011 evaluation on patent images [Before 28/12/19]
- Contour Drawing Dataset - a dataset of 5,000 paired images and contour drawings for the study of visual understanding and sketch generation (Li, Lin, Mech, Yumer, and Ramanan) [9/1/20]
- DeepFashion - Large-scale Fashion Database. (Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, Xiaoou Tang) [Before 28/12/19]
- Ego4D: Around the World in 3,000 Hours of Egocentric Video - A massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. (K. Grauman, A. Westbury, E. Byrne, et al.) [29/06/25]
- EMODB - Thumbnails of images in the picsearch image search engine together with the picsearch emotion keywords (Reiner Lenz etc.) [Before 28/12/19]
- ETU10 Silhouette Dataset - The dataset consists of 720 silhouettes of 10 objects, with 72 views per object. (M. Akimaliev and M.F. Demirci) [Before 28/12/19]
- European Flood 2013 - 3,710 images of a flood event in central Europe, annotated with relevance regarding 3 image retrieval tasks (multi-label) and important image regions. (Friedrich Schiller University Jena, Deutsches GeoForschungsZentrum Potsdam) [Before 28/12/19]
- Fashion-MNIST - A MNIST-like fashion product database. (Han Xiao, Zalando Research) [Before 28/12/19]
- Fish Shape Database - It's a Fish Shape Database with 100, 2D point set shapes. (Adrian M. Peter) [Before 28/12/19]
- Flickr 30K - images, actions and captions (Peter Young et al) [Before 28/12/19]
- Flickr15k - Sketch based Image Retrieval (SBIR) Benchmark - Dataset of 330 sketches and 15,024 photos comprising 33 object categories; a benchmark dataset commonly used to evaluate Sketch based Image Retrieval (SBIR) algorithms. (Hu and Collomosse, CVIU 2013) [Before 28/12/19]
- GMU Kitchen Dataset - 9 video sequences captured from 4 different kitchens, each containing objects from the BigBIRD dataset. (G. Georgakis, M. Reza, A. Mousavian, et al.) [29/07/25]
- Hands in action (HIC) IJCV dataset - Data (images, models, motion) for tracking 1 hand or 2 hands with/without 1 object. Includes both *single-view RGB-D sequences (1 subject, >18 annotated sequences, 4 objects, complete RGB image), and *multi-view RGB sequences (1 subject, HD, 8 views, 8 sequences - 1 annotated, 2 objects). (Dimitrios Tzionas, Luca Ballan, Abhilash Srikantha, Pablo Aponte, Marc Pollefeys, Juergen Gall) [Before 28/12/19]
- IAPR TC-12 Image Benchmark (Michael Grubinger) [Before 28/12/19]
- IAPR-TC12 Segmented and annotated image benchmark (SAIAPR TC-12): (Hugo Jair Escalante) [Before 28/12/19]
- ICL-NUIM dataset - Eight synthetic RGBD video sequences: four from a office scene and four from a living room scene. Simulated camera trajectories are taken from a Kintinuous output from a sensor being moved around a real-world room. (A. Handa, T. Whelan, J.B. McDonald, A.J. Davison) [30/07/25]
- ImageCLEF 2010 Concept Detection and Annotation Task (Stefanie Nowak) [Before 28/12/19]
- ImageCLEF 2011 Concept Detection and Annotation Task - multi-label classification challenge in Flickr photos [Before 28/12/19]
- INRIA Copydays dataset - for evaluation of copy detection: JPEG, cropping and "strong" copy attacks. (INRIA) [Before 28/12/19]
- INRIA Holidays dataset - for evaluation of image search: 500 queries and 991 corresponding relevant images; see the descriptor-ranking sketch after this list. (Jegou, Douze and Schmid) [Before 28/12/19]
- MA14KD (Movie Attraction 14K Dataset) Dataset - 14K movie/TV trailers, 10 features each, links to a rating dataset (Elahi, Moghaddam, Hosseini, Trattner, Tkalcic) [Before 28/12/19]
- METU Trademark dataset - The METU Dataset is composed of more than 900K real logos belonging to companies worldwide. (Usta Bilgi Sistemleri A.S. and Grup Ofis Marka Patent A.S) [Before 28/12/19]
- McGill 3D Shape Benchmark (Siddiqi, Zhang, Macrini, Shokoufandeh, Bouix, Dickinson) [Before 28/12/19]
- MPEG-7 Core Experiment CE-Shape-1 - 1400 binary 2D shapes grouped into 70 classes with 20 shapes in each class (Latecki, Lakamper, Eckhardt) [29/12/2020]
- MPI MANO & SMPL+H dataset - Models, 4D scans and registrations for the statistical models MANO (hand-only) and SMPL+H (body+hands). For MANO there are ~2k static 3D scans of 31 subjects performing up to 51 poses. For SMPL+H we include 39 4D sequences of 11 subjects. (Javier Romero, Dimitrios Tzionas and Michael J Black) [Before 28/12/19]
- Multiview Stereo Evaluation - Each dataset is registered with a "ground-truth" 3D model acquired via a laser scanning process. (Steve Seitz et al) [Before 28/12/19]
- NIST SHREC - 2014 NIST retrieval contest databases and links (USA National Institute of Standards and Technology) [Before 28/12/19]
- NIST SHREC - 2013 NIST retrieval contest databases and links (USA National Institute of Standards and Technology) [Before 28/12/19]
- NIST SHREC 2010 - Shape Retrieval Contest of Non-rigid 3D Models (USA National Institute of Standards and Technology) [Before 28/12/19]
- NIST TREC Video Retrieval Evaluation Database (USA National Institute of Standards and Technology) [Before 28/12/19]
- NUS-WIDE - 269K Flickr images annotated with 81 concept tags, encoded as a 500D BoVW descriptor (Chua et al) [Before 28/12/19]
- Princeton Shape Benchmark (Princeton Shape Retrieval and Analysis Group) [Before 28/12/19]
- PairedFrames - evaluation of 3D pose tracking error - Synthetic and Real dataset to test 3D pose tracking/refinement with pose initialization close/far to/from minima. Establishes testing frame pairs of increasing difficulty, to measure the pose estimation error separately, without employing a full tracking pipeline. (Dimitrios Tzionas, Juergen Gall) [Before 28/12/19]
- Queensland cross media dataset - millions of images and text documents for "cross-media" retrieval (Yi Yang) [Before 28/12/19]
- Reconstructing Articulated Rigged Models from RGB-D Videos (RecArt-D) - Dataset of objects deforming during manipulation. Includes 4 RGB-D sequences (RGB image complete), result of deformable tracking for each object, as well as 3D mesh and Ground-Truth 3D skeleton for each object. (Dimitrios Tzionas, Juergen Gall) [Before 28/12/19]
- Reconstruction from Hand-Object Interactions (R-HOI) - Dataset of one hand interacting with an unknown object. Includes 4 RGB-D sequences, in total 4 objects, the RGB image is complete. Includes tracked 3D motion and Ground-Truth meshes for the objects. (Dimitrios Tzionas, Juergen Gall) [Before 28/12/19]
- Revisiting Oxford and Paris (RevisitOP) - Improved and more challenging version (fixed errors, new annotation and evaluation protocols, new query images) of the well known landmark/building retrieval datasets accompanied with 1M distractor images. (F. Radenovic, A. Iscen, G. Tolias, Y. Avrithis, O. Chum) [Before 28/12/19]
- SBU Captions Dataset - image captions collected for 1 million images from Flickr (Ordonez, Kulkarni and Berg) [Before 28/12/19]
- SHREC'16 Deformable Partial Shape Matching - A collection of around 400 3D deformable shapes undergoing strong partiality transformations, with point-to-point ground truth correspondence included. (Cosmo, Rodola, Bronstein, Torsello) [Before 28/12/19]
- SHREC 2016 - 3D Sketch-Based 3D Shape Retrieval - data to evaluate the performance of different 3D sketch-based 3D model retrieval algorithms using a hand-drawn 3D sketch query dataset on a generic 3D model dataset (Bo Li) [Before 28/12/19]
- SHREC'17 Deformable Partial Shape Retrieval - A collection of around 4000 deformable 3D shapes undergoing severe partiality transformations, in the form of irregular missing parts and range data; ground truth class information is provided. (Lahner, Rodola) [Before 28/12/19]
- SHREC Watertight Models Track (of SHREC 2007) - 400 watertight 3D models (Daniela Giorgi) [Before 28/12/19]
- SHREC Partial Models Track (of SHREC 2007) - 400 watertight 3D DB models and 30 reduced watertight query models (Daniela Giorgi) [Before 28/12/19]
- Sketch me That Shoe - Sketch-based object retrieval in a fine-grained setting. Match sketches to specific shoes and chairs. (Qian Yu, QMUL, T. Hospedales Edinburgh/QMUL). [Before 28/12/19]
- SmartphoneDataset - a dataset of personal photos taken by a mobile phone belonging to 40 subjects (Lonn S, Radeva P, Dimiccoli M.) [1/2/21]
- SPair-71k: A Large-scale Benchmark for Semantic Correspondence - a large-scale benchmark dataset of semantically paired images containing 70,958 image pairs with diverse variations in viewpoint and scale. (J. Min, J. Lee, J. Ponce, and M. Cho) [05/06/25]
- SPARE3D - contains various line-drawing-based spatial IQ tests (shape consistency, camera pose, and shape generation) designed for deep networks, where state-of-the-art networks perform like almost random guesses (NYU AI4CE Lab) [27/12/2020]
- TOSCA 3D shape database (Bronstein, Bronstein, Kimmel) [Before 28/12/19]
- Totally Looks Like - A benchmark for assessment of predicting human-based image similarity (Amir Rosenfeld, Markus D. Solbach, John Tsotsos) [Before 28/12/19]
- UCF-CrossView Dataset: Cross-View Image Matching for Geo-localization in Urban Environments - A new dataset of street view and bird's eye view images for cross-view image geo-localization. (Center for Research in Computer Vision, University of Central Florida) [Before 28/12/19]
- YouTube-8M Dataset - A Large and Diverse Labeled Video Dataset for Video Understanding Research. (Google Inc.) [Before 28/12/19]
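The retrieval benchmarks above (INRIA Holidays, ANN_SIFT1M, NUS-WIDE and others) are typically used by extracting one descriptor per image and ranking the database by similarity to a query descriptor. The Python sketch below shows that basic ranking step with cosine similarity; the random 128-D descriptors are placeholders for illustration, not features shipped with any of these datasets.

    import numpy as np

    def rank_database(query_desc, db_descs):
        # Return database indices sorted by cosine similarity to the query (best first).
        q = query_desc / np.linalg.norm(query_desc)
        db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
        return np.argsort(-(db @ q))

    # Placeholder descriptors: 1000 database images, 128-D each.
    rng = np.random.default_rng(1)
    database = rng.normal(size=(1000, 128))
    query = database[42] + rng.normal(scale=0.05, size=128)
    print(rank_database(query, database)[:5])   # index 42 should rank first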
Object Databases
- 2.5D/3D Datasets of various objects and scenes (Ajmal Mian) [Before 28/12/19]
- 3D Object Recognition Stereo Dataset - This dataset consists of 9 objects and 80 test images. (Akash Kushal and Jean Ponce) [Before 28/12/19]
- 3D Photography Dataset - a collection of ten multiview data sets captured in the authors' lab (Yasutaka Furukawa and Jean Ponce) [Before 28/12/19]
- 3D-Printed RGB-D Object Dataset - 5 objects with groundtruth CAD models and camera trajectories, recorded with various quality RGB-D sensors (Siemens & TUM) [Before 28/12/19]
- 3DNet Dataset - The 3DNet dataset is a free resource for object class recognition and 6DOF pose estimation from point cloud data. (John Folkesson et al.) [Before 28/12/19]
- 80 Million Tiny Images - A Large Data Set for Nonparametric Object and Scene Recognition. (A. Torralba, R. Fergus, W. Freeman) [29/06/25]
- ABC Dataset - A million CAD models, including ground analytical descriptions (spline patches), dense meshes, point clouds, normals. (Koch, Matveev, Jiang, Williams, Artemov, Burnaev, Alexa, Zorin, Panozzo) [2/1/20]
- Aligned 2.5D/3D datasets of various objects - Synthesized and real-world datasets for object reconstruction from a single depth view. (Bo Yang, Stefano Rosa, Andrew Markham, Niki Trigoni, Hongkai Wen) [Before 28/12/19]
- 6 Sided Dice Dataset - 359 total images from a few sets: 154 single dice of various styles on a white table, 388 Catan Dice (Red and Yellow, some rolled on a white table, 160 on top of or near the Catan board), 13 mass groupings of dice in various styles. (Roboflow) [09/06/25]
- Amsterdam Library of Object Images (ALOI): 100K views of 1K objects (University of Amsterdam/Intelligent Sensory Information Systems) [Before 28/12/19]
- ATRW - Amur Tiger Re-identification in the Wild - 8,000 Amur tiger video clips of 92 individuals (MakerCollider and WWF) [26/1/20]
- Animals with Attributes 2 - 37322 (freely licensed) images of 50 animal classes with 85 per-class binary attributes. (Christoph H. Lampert, IST Austria) [Before 28/12/19]
- ASU Office-Home Dataset - Object recognition dataset of everyday objects for domain adaptation (Venkateswara, Eusebio, Chakraborty, Panchanathan) [Before 28/12/19]
- The ATIS Planes dataset - The ATIS Planes dataset is an event-based dataset of free-hand dropped airplane models. (Afshar, Tapson, van Schaik, Cohen) [27/12/2020]
- B3DO: Berkeley 3-D Object Dataset - household object detection (Janoch et al) [Before 28/12/19]
- Boggle Boards - annotated photos of the popular board game, Boggle. Images are predominantly from 4x4 Boggle with about 30 images from Big Boggle (5x5). 357 images, 7110 annotated letter cubes. (Roboflow) [05/06/25]
- Bristol Egocentric Object Interactions Dataset - egocentric object interactions with synchronised gaze (Dima Damen) [Before 28/12/19]
- CholecTrack20 - A surgical video dataset for laparoscopic cholecystectomy, designed for tracking surgical tools with detailed annotations for multi-class, multi-tool, and multi-perspective trajectories (Nwoye, Elgohary, Srinivas, Zaid, Lavanchy, Padoy) [1/8/24]
- CIFAR-10H - a new dataset of soft labels reflecting human perceptual uncertainty for the 10,000-image CIFAR-10 test set (Peterson, Battleday, Griffiths, Russakovsky) [14/1/20]
- CINIC-10 - An augmented extension of CIFAR-10. It contains the images from CIFAR-10 (60,000 images, 32x32 RGB pixels) and a selection of ImageNet database images (210,000 images downsampled to 32x32). It was compiled as a 'bridge' between CIFAR-10 and ImageNet, for benchmarking machine learning applications. (Darlow, Crowley, Antoniou, Storkey) [29/06/25]
- CO3D - Common Objects in 3D (CO3D) is a dataset designed for learning category-specific 3D reconstruction and new-view synthesis using multi-view images of common object categories. (J. Reizenstein, R. Shapovalov, P. Henzler, et al.) [29/06/25]
- COD - Canonical Objaverse Dataset - over 1,000 object categories, aligned by orientation, position, and scale. (Li Jin) [10/7/25]
- COMPASS-XP: Photographic and X-Ray Matched Object Dataset - matched pairs of photographic (RGB) and multi-channel X-ray scans (low-energy, high-energy, density, greyscale, false-colour) of single objects in multiple poses. (L. Griffin, M. Caldwell, J. Andrews) [11/07/25]
- CORE image dataset - to help learn more detailed models and for exploring cross-category generalization in object recognition. (Ali Farhadi, Ian Endres, Derek Hoiem, and David A. Forsyth) [Before 28/12/19]
- COSMICA - A curated dataset of over 10,000 annotated telescope images of comets, galaxies, nebulae, globular clusters, and background regions for training and evaluating astronomical object detectors. (Piratinskii, E., Rabaev, I.) [10/07/25]
- Cottontail-Rabbits Dataset - Images of cottontail rabbits that are commonly found in North America. (Sahoo) [09/06/25]
- COYO-700M - 747M image-text pairs as well as many other meta-attributes to increase their usability for training various models. (Byeon, Minwoo, Park, et al.) [29/06/25]
- CTU Color and Depth Image Dataset of Spread Garments - Images of spread garments with annotated corners. (Wagner, L., Krejov D., and Smutn V. (Czech Technical University in Prague)) [Before 28/12/19]
- Caltech 101 (now 256) category object recognition database (Li Fei-Fei, Marco Andreeto, Marc'Aurelio Ranzato) [Before 28/12/19]
- CANDLE - Image dataset for causal analysis in disentangled representations (Reddy, Godfrey L, Balasubramanian) [29/12/2021]
- COCO - Common Objects in COntext - a large-scale object detection, segmentation, and captioning dataset: 330K images, 200K labeled, 1.5M object instances, 80 object categories, 91 stuff categories, 250K people; a pycocotools loading sketch appears at the end of this section (Lin, Patterson, Ronchi, Cui, Maire, Belongie, Bourdev, Girshick, Hays, Perona, Ramanan, Zitnick, Dollar) [12/08/20]
- COCO-Stuff dataset - 164K images labeled with 'things' and 'stuff' (Caesar, Uijlings, Ferrari) [Before 28/12/19]
- COCO-Tasks - 40k images from the coco dataset are annotated with the most appropriate object for solving 14 tasks (University of Bonn) [27/12/2020]
- Columbia COIL-100 3D object multiple views (Columbia University) [Before 28/12/19]
- CompCars - images of cars and parts. 136,726 images from web with 163 car makes with 1,716 car models. 50,000 front view surveillance images. (Yang, Luo, Loy, Tang) [1/6/20]
- CORe50 - Specifically designed for continuous object recognition, with baseline approaches for different continuous learning scenarios. (V. Lomonaco, D. Maltoni) [29/06/25]
- Country Flags in the Wild - 12,854 train images and 6,110 test images of the flags of 224 different countries manually cropped to loosely fit to the inlying flags. (Jetley) [Before 28/12/19]
- COWC - Cars Overhead with Context. 32,716 unique annotated cars. 58,247 unique negative examples. 15 cm per pixel resolution, from six distinct locations. (Lawrence Livermore National Laboratory) [Before 28/12/19]
- CURE-OR - Challenging Unreal and Real Environments for Object Recognition (D. Temel and J. Lee and G. AlRegib) [1/2/21]
- CURE-TSD - Challenging Unreal and Real Environments for Traffic Sign Detection (D. Temel and M. Chen and G. AlRegib) [1/2/21]
- CURE-TSR - Challenging unreal and real environments for traffic sign recognition (D. Temel and G. Kwon and M. Prabhushankar and G. AlRegib) [1/2/21]
- DAWN: Vehicle Detection in Adverse Weather Nature - a collection of 1000 images from real-traffic environments divided into four sets of weather conditions: fog, snow, rain and sandstorms (Kenk, Hassaballah) [28/12/2020]
- Deeper, Broader and Artier Domain Generalization - Domain generalisation task dataset. (Da Li, QMUL) [Before 28/12/19]
- Densely sampled object views: 2500 views of 2 objects, e.g. for view-based recognition and modeling (Gabriele Peters, Universität Dortmund) [Before 28/12/19]
- DET-COMPASS - A benchmark for open-vocabulary object detection in security X-ray scans. It contains bounding-box annotations across paired RGB and X‑ray (low-energy, high-energy, density, greyscale, and false-colour) modalities for 370 distinct object classes. (Garcia-Fernandez, Vaquero, Liu, Xue, Cores, Sebe, Mucientes, Ricci) [13/8/25]
- Edinburgh Kitchen Utensil Database - 897 raw and binary images of 20 categories of kitchen utensil, a resource for training future domestic assistance robots (D. Fullerton, A. Goel, R. B. Fisher) [Before 28/12/19]
- EDUB-Obj - Egocentric dataset for object localization and segmentation. (Marc Bolaños and Petia Radeva.) [Before 28/12/19]
- Ego4D: Around the World in 3,000 Hours of Egocentric Video - A massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. (K. Grauman, A. Westbury, E. Byrne, et al.) [29/06/25]
- Ellipse finding dataset (Dilip K. Prasad et al) [Before 28/12/19]
- FGVC-Aircraft Benchmark - 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants (Maji, Kannala, Rahtu, Blaschko, Vedaldi) [Before 28/12/19]
- FIN-Benthic - This is a dataset for automatic fine-grained classification of benthic macroinvertebrates. There are 15074 images from 64 categories. The number of images per category varies from 7 to 577. (Jenni Raitoharju, Ekaterina Riabchenko, Iftikhar Ahmad, Alexandros Iosifidis, Moncef Gabbouj, Serkan Kiranyaz, Ville Tirronen, Johanna Arje) [Before 28/12/19]
- FOD-A: Foreign Object Debris in Airports - images of 31 common Foreign Object Debris (FOD) with a runway or taxiway background (Munyer, Brinkman, Huang, Huang, Zhong) [11/8/23]
- Food 101 - This dataset consists of 101,000 images of diverse dishes for restaurant recommendation systems or dietary analysis. With 750 training and 250 test images for each category, the labels for test images have been manually cleaned. (L. Bossard, M. Guillaumin, L. Gool) [28/07/25]
- GERMS - The object set we use for GERMS data collection consists of 136 stuffed toys of different microorganisms. The toys are divided into 7 smaller categories, formed by semantic division of the toy microbes. The motivation for dividing the objects into smaller categories is to provide benchmarks with different degrees of difficulty. (Malmir M, Sikka K, Forster D, Movellan JR, Cottrell G.) [Before 28/12/19]
- GDXray: X-ray images for X-ray testing and Computer Vision - GDXray includes five groups of images: Castings, Welds, Baggages, Nature and Settings. (Domingo Mery, Catholic University of Chile) [Before 28/12/19]
- GMU Kitchens Dataset - instance level annotation of 11 common household products from BigBird dataset across 9 different kitchens (George Mason University) [Before 28/12/19]
- Grasping In The Wild - Egocentric video dataset of natural everyday life objects. 16 objects in 7 kitchens. (Benois-Pineau, Larrousse, de Rugy) [Before 28/12/19]
- GRAZ-02 Database (Bikes, cars, people) (A. Pinz) [Before 28/12/19]
- GREYC 3D - The GREYC 3D Colored mesh database is a set of 15 real objects with different colors, geometries and textures that were acquired using a 3D color laser scanner. (Anass Nouri, Christophe Charrier, Olivier Lezoray) [Before 28/12/19]
- GTSDB: German Traffic Sign Detection Benchmark and GTSRB: German Traffic Sign Recognition Benchmark (Ruhr-Universität Bochum) [Before 28/12/19]
- Home Objects - Annotated image dataset of household objects from the RoboFEI@Home team. (Meneghetti, Domingues, Perez, et al.) [03/06/25]
- ICubWorld - iCubWorld datasets are collections of images acquired by recording from the cameras of the iCub humanoid robot while it observes daily objects. (Giulia Pasquale, Carlo Ciliberto, Giorgio Metta, Lorenzo Natale, Francesca Odone and Lorenzo Rosasco.) [Before 28/12/19]
- Images of LEGO Bricks - A dataset consisting of 50 different LEGO bricks rendered from 800 different angles (Joost Hazelzet) [05/06/25]
- Images of Common Objects - common objects that one might find in images on the web (Everingham, M., Van Gool, et al.) [05/06/25]
- ImageNet - an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. 14,197,122 images, 21841 synsets indexed (Fei-Fei, Deng, Russakovsky, Berg, Li, et al) [11/2/25]
- Industrial 3D Object Detection Dataset (MVTec ITODD) - depth and gray value data of 28 objects in 3500 labeled scenes for 3D object detection and pose estimation with a strong focus on industrial settings and applications (MVTec Software GmbH, Munich) [Before 28/12/19]
- Instagram Food Dataset - A database of 800,000 food images and associated metadata posted to Instagram over a 6-week period. Supports food type recognition and social network analysis. (T. Hospedales. Edinburgh/QMUL) [Before 28/12/19]
- Keypoint-5 dataset - a dataset of five kinds of furniture with their 2D keypoint labels (Jiajun Wu, Tianfan Xue, Joseph Lim, Yuandong Tian, Josh Tenenbaum, Antonio Torralba, Bill Freeman) [Before 28/12/19]
- KTH-3D-TOTAL - RGB-D Data with objects on desktops annotated. 20 Desks, 3 times per day, over 19 days. (John Folkesson et al.) [Before 28/12/19]
- Laval 6 DOF Object Tracking Dataset - A Dataset of 297 RGB-D sequences with 11 objects for 6 DOF object Tracking. (Mathieu Garon, Denis Laurendeau, Jean-Francois Lalonde) [Before 28/12/19]
- Linnaeus 5 Dataset - 5 classes: berry, bird, dog, flower, other (negative set), 1200 training images, 400 test images per class. (Chaladze, G. Kalatozishvili L.) [29/06/25]
- LISA Traffic Light Dataset - 6 light classes in various lighting conditions (Jensen, Philipsen, Mogelmose, Moeslund, and Trivedi) [Before 28/12/19]
- LISA Traffic Sign Dataset - video of 47 US sign types with 7855 annotations on 6610 frames (Mogelmose, Trivedi, and Moeslund) [Before 28/12/19]
- Linkoping 3D Object Pose Estimation Database (Fredrik Viksten and Per-Erik Forssen) [Before 28/12/19]
- LVIS - A Dataset for Large Vocabulary Instance Segmentation. (A. Gupta, P. Dollar, R. Girshick, et al.) [29/06/25]
- Linkoping Traffic Signs Dataset - 3488 traffic signs in 20K images (Larsson and Felsberg) [Before 28/12/19]
- Longterm Labeled - This dataset contains a subset of the observations from the longterm dataset above. (John Folkesson et al.) [Before 28/12/19]
- Main Product Detection Dataset - Contains textual metadata of fashion products and their images with bounding boxes of the main product (the one referred by the text). (A. Rubio, L. Yu, E. Simo-Serra and F. Moreno-Noguer) [Before 28/12/19]
- MCIndoor20000 - 20,000 digital images from three different indoor object categories: doors, stairs, and hospital signs. (Bashiri, LaRose, Peissig, and Tafti) [Before 28/12/19]
- Mexculture142 - Mexican Cultural heritage objects and eye-tracker gaze fixations (Montoya Obeso, Benois-Pineau, Garcia-Vazquez, Ramirez Acosta) [Before 28/12/19]
- MinneApple: A Benchmark Dataset for Apple Detection and Segmentation - high-resolution images acquired in orchards with over 40000 annotated object instances in 1000 images. Useful for detection, clustering, yield estimation (Haeni, Roy, Isler) [30/12/2020]
- MIO-TCD - 786,702 vehicle images with 648,959 classification images and 137,743 localization images. Acquired at different times of the day and different periods of the year by thousands of traffic cameras. (Luo, Charron, Lemaire, Konrad, Li, Mishra, Achkar, Eichel, Jodoin) [1/6/20]
- MIT CBCL Car Data (Center for Biological and Computational Learning) [Before 28/12/19]
- MIT CBCL StreetScenes Challenge Framework: (Stan Bileschi) [Before 28/12/19]
- Microsoft COCO - Common Objects in Context (Tsung-Yi Lin et al) [Before 28/12/19]
- Microsoft Object Class Recognition image databases (Antonio Criminisi, Pushmeet Kohli, Tom Minka, Carsten Rother, Toby Sharp, Jamie Shotton, John Winn) [Before 28/12/19]
- Microsoft salient object databases (labeled by bounding boxes) (Liu, Sun, Zheng, Tang, Shum) [Before 28/12/19]
- MNIST-DVS and FLASH-MNIST-DVS Databases - The dataset is based on the original frame-based MNIST dataset and contains recordings made with a DVS (Dynamic Vision Sensor). (Yousefzadeh, Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- Mountain Dew Commercial Dataset - Three images per second of the commercial (91 images from ~30 seconds of commercial), with all visible bottles annotated. (Roboflow) [09/06/25]
- Moving Labeled - This dataset extends the longterm dataset with more locations within the same office environment at KTH. (John Folkesson et al.) [Before 28/12/19]
- N-Caltech101 (Neuromorphic-Caltech101) - The dataset is a spiking version of the original frame-based Caltech101 dataset. (Orchard, Cohen, Jayawant, Thakor) [27/12/2020]
- N-Cars - "The dataset is composed of 12,336 car samples and 11,693 non-cars samples (background) for classification recorded by an ATIS camera." (Sironi, Brambilla, Bourdis, Lagorce, Benosman) [27/12/2020]
- N-MNIST (Neuromorphic-MNIST) - The dataset is a spiking version of the original frame-based MNIST dataset of handwritten digits. (Orchard, Cohen, Jayawant, Thakor) [27/12/2020]
- N-SOD Dataset - "Neuromorphic Single Object Dataset (N-SOD), contains three objects with samples of varying length in time recorded with an event-based sensor." (Ramesh, Ussa, Vedovs, Yang, Orchard) [27/12/2020]
- NABirds Dataset - 70,000 annotated photographs of the 400 species of birds commonly observed in North America (Grant Van Horn) [Before 28/12/19]
- NEC Toy animal object recognition or categorization database (Hossein Mobahi) [Before 28/12/19]
- NORB: Generic Object Recognition in Images - stereo image pairs of 50 uniform-colored toys under 36 azimuths, 9 elevations, and 6 lighting conditions (for a total of 194,400 individual images). (Huang, LeCun, Bottou) [13/06/25]
- NORB 50 toy image database (NYU) [Before 28/12/19]
- NTU-VOI: NTU Video Object Instance Dataset - video clips with frame-level bounding box annotations of object instances for evaluating object instance search and localization in large scale videos. (Jingjing Meng, et al.) [Before 28/12/19]
- ObjectNet - 50,000 image test set, same as ImageNet, with controls for rotation, background, and viewpoint. (Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz) [1/2/21]
- ObjectNet3D - Consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. (Y. Xiang, W. Kim, W. Chen, et al.) [29/06/25]
- Object Pose Estimation Database - This database contains 16 objects, each sampled at 5 degree angle increments along two rotational axes (F. Viksten et al.) [Before 28/12/19]
- Object Recognition Database - This database features modeling shots of eight objects and 51 cluttered test shots containing multiple objects. (Fred Rothganger, Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce) [Before 28/12/19]
- Omniglot - 1623 different handwritten characters from 50 different alphabets (Lake, Salakhutdinov, Tenenbaum) [Before 28/12/19]
- OpenLORIS-Object - A Robotic Vision Dataset and Benchmark for Lifelong Deep Learning. (Q. She, F. Feng, X. Hao, et al.) [29/06/25]
- Open Images Dataset V7 and Extensions - many images with bounding boxes and image level and object classes plus relationships (Krasin, Duerig, Alldrin, Ferrari, et al) [1/06/25]
- Open Museum Identification Challenge (Open MIC) - Open MIC contains photos of exhibits captured in 10 distinct exhibition spaces (painting, sculptures, jewellery, etc.) of several museums and the protocols for the domain adaptation and few-shot learning problems. (P. Koniusz, Y. Tas, H. Zhang, M. Harandi, F. Porikli, R. Zhang) [Before 28/12/19]
- ORIDa - Real-world object-centric image composition dataset (Jinwoo Kim) [13/8/25]
- Osnabrück Synthetic Scalable Cube Dataset - 830000 different cubes captured from 12 different viewpoints for ANN training (Schöning, Behrens, Faion, Kheiri, Heidemann & Krumnack) [Before 28/12/19]
- Packages Dataset - Collection of packages located at the doors of various apartments and homes. Packages are flat envelopes, small boxes, and large boxes. Some images contain multiple annotated packages. (Roboflow Community) [09/06/25]
- Paintings Datasets - A dataset consisting of paintings where objects have a variety of sizes, poses and depictive styles, and can be partially occluded or truncated. (E. J. Crowley, A. Zisserman) [05/06/25]
- Pistols - 2986 images and 3448 labels across a single annotation class: pistols. Images are wide-ranging: pistols in-hand, cartoons, and staged studio quality images of guns. (University of Granada) [05/06/25]
- Princeton ModelNet - 127,915 CAD Models, 662 Object Categories, 10 Categories with Annotated Orientation (Wu, Song, Khosla, Yu, Zhang, Tang, Xiao) [Before 28/12/19]
- PacMan datasets - RGB and 3D synthetic and real data for graspable cookware and crockery (Jeremy Wyatt) [Before 28/12/19]
- PACS (Photo Art Cartoon Sketch) - An object category recognition dataset for testing domain generalisation: How well can a classifier trained on object images in one domain recognise objects in another domain? (Da Li QMUL, T. Hospedales. Edinburgh/QMUL) [Before 28/12/19]
- PASCAL 2007 Challenge Image Database (motorbikes, cars, cows) (PASCAL Consortium) [Before 28/12/19]
- PASCAL 2008 Challenge Image Database (PASCAL Consortium) [Before 28/12/19]
- PASCAL 2009 Challenge Image Database (PASCAL Consortium) [Before 28/12/19]
- PASCAL 2010 Challenge Image Database (PASCAL Consortium) [Before 28/12/19]
- PASCAL 2011 Challenge Image Database (PASCAL Consortium) [Before 28/12/19]
- PASCAL 2012 Challenge Image Database - Category classification, detection, and segmentation, and still-image action classification (PASCAL Consortium) [Before 28/12/19]
- PASCAL Image Database (motorbikes, cars, cows) (PASCAL Consortium) [Before 28/12/19]
- PASCAL Parts dataset - PASCAL VOC with segmentation annotation for semantic parts of objects (Alan Yuille) [Before 28/12/19]
- PASCAL-Context dataset - annotations for 400+ additional categories (Alan Yuille) [Before 28/12/19]
- PASCAL 3D/Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild - 12 class, 3000+ images each with 3D annotations (Yu Xiang, Roozbeh Mottaghi, Silvio Savarese) [Before 28/12/19]
- POKER-DVS Database - "The POKER-DVS database consists of a set of 131 poker pip symbols tracked and extracted from 3 separate DVS recordings, while browsing very quickly poker cards." (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- Physics 101 dataset - a video dataset of 101 objects in five different scenarios (Jiajun Wu, Joseph Lim, Hongyi Zhang, Josh Tenenbaum, Bill Freeman) [Before 28/12/19]
- Plant seedlings dataset - High-resolution images of 12 weed species. (Aarhus University) [Before 28/12/19]
- Raindrop Detection - Improved Raindrop Detection using Combined Shape and Saliency Descriptors with Scene Context Isolation - Evaluation Dataset (Breckon, Toby P., Webster, Dereck D.) [Before 28/12/19]
- ReferIt Dataset (IAPRTC-12 and MS-COCO) - referring expressions for objects in images from the IAPRTC-12 and MS-COCO datasets (Kazemzadeh, Matten, Ordonez, and Berg) [Before 28/12/19]
- Rock Paper Scissors Dataset - Rock Paper Scissors contains images from a variety of different hands, from different races, ages and genders, posed into Rock / Paper or Scissors and labelled as such. Contains 2925 images. (Roboflow) [05/06/25]
- Roboflow Chess Pieces object detection dataset - a dataset of chess board photos and various pieces. All photos were captured from a constant angle, with a tripod to the left of the board. All pieces are annotated with bounding boxes. There are 2894 labels across 292 images () [29/12/2020]
- RWDS: Benchmarking Real World Distribution Shifts in Object Detection - RWDS is a standardised suite of benchmark datasets for assessing the robustness of object detectors under realistic domain generalisation scenarios. (S. A. Al-Emadi, Y. Yang, F. Ofli) [10/07/25]
- SCoDA (ScanSalon) - SCoDA introduces ScanSalon, a domain-adaptive shape completion dataset containing 800 real-scan and artist-created mesh pairs across six object categories (chair, desk, sofa, bed, lamp, car). (Y. Wu, Z. Yan, C. Chen et al.) [31/07/25]
- SAIL-VOS - The Semantic Amodal Instance Level Video Object Segmentation (SAIL-VOS) dataset provides accurate ground truth annotations to develop methods for reasoning about occluded parts of objects while enabling to take temporal information into account (Hu, Chen, Hui, Huang, Schwing) [29/12/19]
- Scanned Objects by Google Research - A Dataset of 3D-Scanned Common Household Items. (Google AI) [29/06/25]
- SeaShips - 31455 side images of boats near land, from 7 classes, extracted from surveillance video (Shao, Wu, Wang, Du, Li) [Before 28/12/19]
- Separated COCO & Occluded COCO - Automatically generated large-scale occlusion-related dataset, collecting separated objects and partially occluded objects for a large variety of categories in a scalable manner, to benchmark a model's capability of detecting occluded objects. (Zhan, Xie, Zisserman) [3/12/2022]
- ShapeNet - 3D models of 55 common object categories with about 51K unique 3D models. Also 12K models over 270 categories. (Princeton, Stanford and TTIC) [Before 28/12/19]
- SHORT-100 dataset - 100 categories of products found on a typical shopping list. It aims to benchmark the performance of algorithms for recognising hand-held objects from either snapshots or videos acquired using hand-held or wearable cameras. (Jose Rivera-Rubio, Saad Idrees, Anil A. Bharath) [Before 28/12/19]
- Sileane: Object Detection and Pose Estimation - RGBD dataset of bulk objects in bin-picking scenarios (6D poses, instance segmentations, noisy and ground truth depth maps) (Bregier et al. - Sileane/INRIA Grenoble) [17/2/2022]
- SIXray: A Large-scale Security Inspection X-Ray Benchmark for Prohibited Item Discovery - SIXray contains 1,059,231 real-world X-ray baggage images from subway stations, with image-level labels and bounding boxes for six prohibited item categories (gun, knife, wrench, pliers, scissors, hammer). (C. Miao, L. Xie, F. Wan et al.) [11/07/25]
- SkelNetOn - The SkelNetOn Challenge is structured around shape understanding in four domains: shape silhouettes, RGB images, point clouds, and parametric representations. We provide shape datasets, some complementary resources (e.g., pre/post-processing, sampling, and data augmentation scripts), and the testing platform for skeleton extraction in four categories. (credits) [29/12/2020]
- SLOW-POKER-DVS Database - "The SLOW-POKER-DVS database consists of 4 separate DVS recordings, while slowly moving a poker symbol in front of the camera for about 3 minutes." (Serrano-Gotarredona, Linares-Barranco) [27/12/2020]
- SOR3D - The SOR3D dataset consists of over 20k instances of human-object interactions, 14 object types, and 13 object affordances. (Spyridon Thermos) [Before 28/12/19]
- Space Object Pose Estimation Challenge Dataset - 12000 synthetic images for training, 2998 similar synthetic test images, and 305 real images (Space Rendezvous Laboratory (SLAB)) [26/1/20]
- Stanford Dogs Dataset - The Stanford Dogs dataset contains images of 120 breeds of dogs from around the world. This dataset has been built using images and annotation from ImageNet for the task of fine-grained image categorization. (Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, Li Fei-Fei, Stanford University) [Before 28/12/19]
- Stream-51 - a dataset for streaming continual learning (classification) consisting of temporally correlated images from 51 distinct object categories and additional evaluation classes outside of the training distribution to test novelty (open set) recognition (Roady, Hayes, Vaidya, Kanan) [26/12/2020]
- SVHN: Street View House Numbers Dataset - like MNIST, but with an order of magnitude more labeled data (over 600,000 digit images), coming from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images); a torchvision loading sketch appears at the end of this section. (Netzer, Wang, Coates, Bissacco, Wu, Ng) [Before 28/12/19]
- Swedish Leaf Dataset - These images contain leaves from 15 tree classes (Oskar J. O. Söderkvist) [Before 28/12/19]
- T-LESS - An RGB-D dataset for 6D pose estimation of texture-less objects. (Tomas Hodan, Pavel Haluza, Stepan Obdrzalek, Jiri Matas, Manolis Lourakis, Xenophon Zabulis) [Before 28/12/19]
- Taobao Commodity Dataset - TCD contains 800 commodity images (dresses, jeans, T-shirts, shoes and hats) for image salient object detection from the shops on the Taobao website. (Keze Wang, Keyang Shi, Liang Lin, Chenglong Li ) [Before 28/12/19]
- TenCent open-source multi-label image database - 17,609,752 training and 88,739 validation image URLs, which are annotated with up to 11,166 categories (Wu, Chen, Fan, Zhang, Hou, Liu, Zhang) [16/4/20]
- tieredImageNet dataset - a larger subset of ILSVRC-12 with 608 classes (779,165 images) grouped into 34 higher-level nodes in the ImageNet human-curated hierarchy. (Ren, Triantafillou, Ravi, Snell, Swersky, Tenenbaum, Larochelle, Zemel) [17/1/20]
- ToolArtec point clouds - 50 kitchen tool 3D scans (ply) from an Artec EVA scanner. See also ToolKinect - 13 scans using a Kinect 2 and ToolWeb - 116 point clouds of synthetic household tools with mass and affordance groundtruth for 5 tasks. (Paulo Abelha) [Before 28/12/19]
- Trans2k - Trans2k is a large-scale synthetic dataset for transparent object segmentation, containing over 2,000 high-resolution images with pixel-level annotations of transparent objects like glass and plastic. (Z. Shu, Z. Wang, Q. Chen et al.) [11/07/25]
- TUW Object Instance Recognition Dataset - Annotations of object instances and their 6DoF pose for cluttered indoor scenes observed from various viewpoints and represented as Kinect RGB-D point clouds (Thomas, A. Aldoma, M. Zillich, M. Vincze) [Before 28/12/19]
- TUW data sets - Several RGB-D ground truth and annotated data sets from TUW. (John Folkesson et al.) [Before 28/12/19]
- TV News Channel Commercial Detection Dataset - TV commercials and news broadcasts. (P. Guha, R. Kannao, R. Soni) [29/06/25]
- UAH Traffic Signs Dataset (Arroyo et al.) [Before 28/12/19]
- UIUC Car Image Database (UIUC) [Before 28/12/19]
- UIUC Dataset of 3D object categories (S. Savarese and L. Fei-Fei) [Before 28/12/19]
- Uno Cards - 8,992 images of Uno cards and 26,976 labeled examples on various textured backgrounds. (Crawshaw) [09/06/25]
- USPS Handwritten Digits dataset - 7291 train and 2007 test images. The images are 16*16 grayscale pixels (Hull) [Before 28/12/19]
- VAIS - VAIS contains simultaneously acquired unregistered thermal and visible images of ships acquired from piers, and it was created to faciliate autonomous ship development. (Mabel Zhang, Jean Choi, Michael Wolf, Kostas Daniilidis, Christopher Kanan) [Before 28/12/19]
- Venezia 3D object-in-clutter recognition and segmentation (Emanuele Rodola) [Before 28/12/19]
- Visual Attributes Dataset - visual attribute annotations for over 500 object classes (animate and inanimate) which are all represented in ImageNet. Each object class is annotated with visual attributes based on a taxonomy of 636 attributes (e.g., has fur, made of metal, is round). [Before 28/12/19]
- Visual Hull Data Sets - a collection of visual hull datasets (Svetlana Lazebnik, Yasutaka Furukawa, and Jean Ponce) [Before 28/12/19]
- VOC-360 - Dataset for object detection and segmentation in fisheye images (Fu, Bajic, and Vaughan) [29/12/19]
- XrayVision: X-Ray Security Imaging Datasets & Benchmark Suite - A curated repository listing and benchmarking public X-ray security screening datasets (2D & 3D), including LD-Xray, DVXRAY, LPIXray, SIXray, PIDray, and more. (N. Bhowmik et al.) [11/07/25]
- YCB Benchmarks Object and Model Set - 77 objects in 5 categories (food, kitchen, tool, shape, task) each with 600 RGBD and high-res RGB images, calibration data, segmentation masks, mesh models (Calli, Dollar, Singh, Walsman, Srinivasa, Abbeel) [Before 28/12/19]
- xView - Features over 1 million objects across complex scenery and large images in one of the largest publicly available overhead image datasets. (Defense Innovation Unit Experimental (DIUx) and National Geospatial-Intelligence Agency (NGA)) [03/06/25]
- YouTube-BoundingBoxes - 5.6 million accurate human-annotated BB from 23 object classes tracked across frames, from 240,000 YouTube videos, with a strong focus on the person class (1.3 million boxes) (Real, Shlens, Pan, Mazzocchi, Vanhoucke, Khan, Kakarla et al) [Before 28/12/19]
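Note: several of the entries above (COCO, COCO-Stuff, COCO-Tasks, LVIS) ship their labels in the COCO JSON annotation format. The minimal sketch below, which assumes the pycocotools package and an already-downloaded annotation file at a hypothetical local path, illustrates how such annotations are commonly loaded; it is an illustrative example only, not part of any listed dataset's official tooling.

```python
# Minimal sketch: loading COCO-style instance annotations with pycocotools.
# The annotation path is an assumed example; point it at your own download.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")   # assumed local path

# Find all images containing the 'person' category and load their object annotations.
person_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=person_ids)
for img in coco.loadImgs(img_ids[:5]):
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img["id"], iscrowd=None))
    print(img["file_name"], len(anns), "annotated instances")
```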
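Likewise, a few of the smaller classification sets listed above (e.g. CIFAR-10 and its derivatives, SVHN, Omniglot, Caltech 101/256) have ready-made loaders in torchvision. The sketch below assumes torchvision is installed and uses a hypothetical ./data download directory; datasets such as CINIC-10 and CIFAR-10H are not bundled with torchvision and must be fetched from their own project pages.

```python
# Minimal sketch: loading two of the listed classification datasets via torchvision.
# './data' is an assumed download directory.
import torchvision
import torchvision.transforms as T

transform = T.ToTensor()
svhn = torchvision.datasets.SVHN("./data", split="train", download=True, transform=transform)
cifar = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)

image, label = svhn[0]   # a 3x32x32 tensor and an integer class label
print(image.shape, label, len(svhn), len(cifar))
```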
People (static and dynamic), human body pose
- 3D articulated body - 3D reconstruction of an articulated body with rotation and translation. Single camera, varying focal. Every scene may have an articulated body moving. There are four kinds of data sets included. A sample reconstruction result included which uses only four images of the scene. (Prof Jihun Park) [Before 28/12/19]
- 3dhumans - 180 meshes of people in diverse body shapes in various garments styles and sizes (Avinash Sharma) [1/2/2023]
- 4D-DRESS: High-quality human scans with garment segmentation - 4D-DRESS captures 32 subjects with 64 real-world human outfits in more than 520 motion sequences and 78k scan frames. (Wang, Wenbo, Ho, et al) [22/06/25]
- BRACE: The Breakdancing Competition Dataset for Dance Motion Synthesis - a dataset for audio-conditioned dance motion synthesis focusing on breakdancing sequences, with high quality annotations for complex body poses and dance movements (Moltisanti, Wu, Dai, Loy) [3/12/2022]
- BUFF dataset - About 10K scans of people in clothing and the estimated body shape of people underneath. Scans contain texture so synthetic videos/images are easy to generate. (Zhang, Pujades, Black and Pons-Moll) [Before 28/12/19]
- Aria-Digital-Twin (ADT) ADT is an egocentric dataset captured via Aria glasses, comprising 236 real activity sequences in two fully digitized indoor spaces. (X. Pan, N. Charron, Y. Yang et al.) [31/07/25]
- CAPE dataset - 140K SMPL mesh registration of 4D scans of people in clothing, including 15 subjects, ~600 motion sequences, along with registered scans of the ground truth body shapes under clothing (Ma, Yang, Ranjan, Pujades, Pons-Moll, Tang, Black) [28/12/2020]
- CarDA - Car door Assembly Activities Dataset A multi-modal dataset for car door assembly activities with synchronized RGB-D videos and pose data in a real manufacturing setting (K. Papoutsakis, N. Bakalos, K. Fragkoulis et al.) [10/07/25]
- CASR: Cyclist Arm Sign Recognition - Small clips of ~10 seconds showing cyclists performing arm signs. The videos are acquired with a consumer-grade camera. There are 219 arm sign actions annotated. (Zhijie Fang, Antonio M. Lopez) [13/1/20]
- ChaLearn Gesture Dataset (CGD 2011) - Originally designed for one-shot learning, for a Kaggle competition. Annotations of the action being performed in a subset of the videos, plus body part annotations. (ChaLearn) [30/07/25]
- CustomHumans - High-quality human scans. (H. Ho, L. Xue, J. Song, O. Hilliges) [22/06/25]
- Dataset of a human performing daily life activities in a scene with occlusions - 12 RGB-D video sequences of a person performing activities with obstacles occluding the view from the Kinect. (A. Dib, F. Charpillet) [30/07/25]
- Dataset of Celebrities - A dataset consisting of images of 'Sherlock' actors from freely available sources on the web (A. Nagrani, A. Zisserman) [05/06/25]
- DIP-IMU - a dataset consisting of 10 subjects wearing 17 IMUs for validation in 64 sequences with 330,000 time instants; this constitutes the largest IMU dataset publicly available. (Huang, Yinghao, Kaufmann, et al) [22/06/25]
- Dynamic Dyna - More than 40K 4D 60fps high resolution scans and models of people very accurately registered. Scans contain texture so synthetic videos/images are easy to generate. (Pons-Moll, Romero, Mahmood and Black) [Before 28/12/19]
- Dynamic Faust - More than 40K 4D 60fps high resolution scans of people very accurately registered. Scans contain texture so synthetic videos/images are easy to generate. (Bogo, Romero, Pons-Moll and Black) [Before 28/12/19]
- EHF dataset - 100 curated frames (+ code) of one subject in minimal clothing performing various expressive poses involving the body, hands and face. Each frame contains a full-body RGB image, detected 2D OpenPose features (body, hands, face), a 3D scan of the subject, and a 3D SMPL-X mesh as pseudo ground-truth (Pavlakos, Choutas, Ghorbani, Bolkart, Osman, Tzionas, Black) [Before 28/12/19]
- EMDB - EMDB contains SMPL poses with global body root and camera trajectories. (M. Kaufmann, J. Song, C. Guo, et al) [22/06/25]
- EmoReact - EmoReact is a multimodal dataset of 1,102 audiovisual clips featuring children aged 4-14 reacting to various stimuli, annotated for 17 affective states. (B. Nojavanasghari, T. Baltrusaitis, C. E. Hughes et al.) [31/07/25]
- ExPose - Dataset of expressive 3D humans. It contains ~32k pairs of RGB images and SMPL-X humans (parameters & meshes). It was created by applying SMPLify-X on the LSP, LSP extended and MPII datasets, and carefully curating the results to obtain pseudo ground truth. The dataset was used to train ExPose, a model that predicts expressive 3D humans from an RGB image. (Choutas, Pavlakos, Bolkart, Tzionas, Black) [1/2/21]
- Extended Chictopia dataset - 14K image Chictopia dataset with additional processed annotations (face) and SMPL body model fits to the images. (Lassner, Pons-Moll and Gehler) [Before 28/12/19]
- FALL DATABASE - The database consists of falls and activities of daily living performed by two persons (person1 and person2); each person performed all activities twice (CVL, Planinc Rainer, Martin Kampel) [1/2/21]
- Frames Labeled In Cinema (FLIC) - 20928 frames labeled with human pose (Sapp, Taskar) [Before 28/12/19]
- GPA: geometric pose affordance dataset - Dataset of real 3D people interacting with real 3D scenes. 300k static RGB frames of 13 subjects in 8 scenes with ground-truth scene meshes, and motion capture sequences focusing on the interaction between subject and scene geometry, human dynamics, and mimicry of human actions within the surrounding scene geometry. (Wang, Chen, Rathore, Shin, Fowlkes) [29/12/19]
- GRAB - Dataset of dynamic whole-body grasps. It contains sequences of 3D SMPL-X humans (articulated body + hands + face), interacting with rigid 3D object meshes with their whole body, e.g. lifting a cup with the hand and bringing it in contact with the lips to drink. We use it to train GrabNet, a model that predicts 3D hand grasps (MANO) for unseen 3D object shapes. (Taheri, Ghorbani, Black, Tzionas) [1/2/21]
- H3DS - H3DS is a high-resolution full-head 3D dataset comprising 60+ textured head scans and posed multi-view images with ground-truth masks and landmark alignments (10-70 views per subject). (E. Ramon, G. Triginer, J. Escur et al.) [31/07/25]
- Hard Hat Workers - Workers in workplace settings that require a hard hat. Annotations also include examples of just "person" and "head," for when an individual may be present without a hard hat. (Northeastern University - China) [05/06/25]
- Health & Gait: a dataset for gait-based analysis - first video-based dataset for gait analysis without specific sensors, combining visual features with gait and anthropometric data for research in health, sports, and biomechanics (Zafra-Palma, Marin-Jimenez, Castro-Pinero, Cuenca-Garcia, Munoz-Salinas, Marin-Jimenez) [23/06/25]
- Hi4D: 4D Instance Segmentation of Close Human Interaction - Hi4D is the first dataset containing rich interaction centric annotations and high-quality 4D textured geometry of closely interacting humans. (Yin, Yifei, Guo, et al) [22/06/25]
- Human3.6M - 11 different humans performing 17 different activities. Data comes from four calibrated video cameras, 1 time-of-flight camera and (static) 3D laser scans of the actors. (C. Ionescu, D. Papava, V. Olaru, et al.) [30/07/25]
- Identity Preserved Tracking - The IPT (Identity Preserved Tracking) dataset consisting of 10 sequences of depth data recorded using an Orbbec Astra depth sensor. (CVL, Thomas Heitzinger, Martin Kampel) [1/2/21]
- Indoor Skeleton Tracking Dataset - This database contains skeleton tracking information obtained by the Asus Xtion Pro Live sensor in combination with OpenNI. (CVL, Rainer Planinc, Martin Kampel) [1/2/21]
- KIDS dataset - A collection of 30 high-resolution 3D shapes undergoing nearly-isometric and non-isometric deformations, with point-to-point ground truth as well as ground truth for left-to-right bilateral symmetry. (Rodola, Rota Bulo, Windheuser, Vestner, Cremers) [Before 28/12/19]
- Kinect2 Human Pose Dataset (K2HPD) - Kinect2 Human Pose Dataset (K2HPD) includes about 100K depth images with various human poses under challenging scenarios. (Keze Wang, Liang Lin, Shengfu Zhai, Dengke Dong) [Before 28/12/19]
- Kinect Gesture Dataset - The Microsoft Research Cambridge-12 Kinect gesture data set consists of sequences of human movements, represented as body-part locations, and the associated gesture to be recognized by the system. (Microsoft Research) [30/07/25]
- LATTE-MV - Dataset of extracted human poses and 3D ball positions during professional table tennis gameplay, captured from monocular videos. (D. Etaat, D. Kalaria, N. Rahmanian, S. S. Sastry) [10/07/25]
- Leeds Sports Pose Dataset - 2000 pose annotated images of mostly sports people (Johnson, Everingham) [Before 28/12/19]
- Look into Person Dataset - 50,000 images with elaborated pixel-wise annotations with 19 semantic human part labels and 2D poses with 16 key points. (Gong, Liang, Zhang, Shen, Lin) [Before 28/12/19]
- LWIRPOSE - a RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, comprising over 2,400 high-quality LWIR (thermal) images. Each image is meticulously annotated with 2D human poses from seven actors performing diverse everyday activities like sitting, eating, and walking. (Upadhyay, Dhupar, Sharma, Shukla, Abraham) [9/10/24]
- M3Act - Large-scale photorealistic synthetic video datasets with complex human group activities beneficial for enhancing multi-person and multi-group perception tasks, from paper titled "Learning from Synthetic Human Group Activities" CVPR 2024 (Chang, Che-Jui) [25/06/24]
- Maadata AI Photo-Video Editing Open Dataset - Bodies, faces, and image details (Maadata) [17/11/2023]
- Maadata Fashion & E-commerce Open Dataset - Faces, Bodies, Clothing, and E-Commerce images (Maadata) [17/11/2023]
- Manga109: manga (comic) dataset - 109 volumes, more than 21,000 pages (Kiyoharu Aizawa) [29/12/19]
- Mannequin in-bed pose datasets via RGB webcam - This in-bed pose dataset is collected via regular webcam in a simulated hospital room at Northeastern University. (Shuangjun Liu and Sarah Ostadabbas, ACLab) [Before 28/12/19]
- Mannequin IRS in-bed dataset - This in-bed pose dataset is collected via our infrared selective (IRS) system in a simulated hospital room at Northeastern University. (Shuangjun Liu and Sarah Ostadabbas, ACLab) [Before 28/12/19]
- Mask Wearing Dataset - Individuals wearing various types of masks and those without masks. (Nelson) [05/06/25]
- MMM - Monocular Multi-huMan (MMM) dataset made by using a hand-held smartphone, which contains six sequences with two to four persons in each sequence. (Jiang, Zeren, Guo, et al) [22/06/25]
- Montalbano gesture dataset - 13858 sequences each depicting one of 27 humans performing one of 20 Italian gestures. (ChaLearn) [30/07/25]
- MuPoTS-3D - Multi-person 3D body pose benchmark for monocular RGB based methods, with 20 sequences in indoor and outdoor settings (MPI For Informatics) [Before 28/12/19]
- MoVi: A Large Multipurpose Human Motion and Video Dataset - MoVi is the first human motion dataset to contain synchronized pose, body meshes, and video recordings from a large population of subjects (Ghorbani, Mahdaviani, Thaler, Kording, Cook, Blohm, Troje) [27/12/2020]
- MPI-INF-3DHP - Single-person 3D body pose dataset and evaluation benchmark, with extensive pose coverage across a broad set of activities, and extensive scope of appearance augmentation. Multi-view RGB frames are available for the training set, and monocular view frames for the test set. (MPI For Informatics) [Before 28/12/19]
- MPI MANO & SMPL+H dataset - Models, 4D scans and registrations for the statistical models MANO (hand-only) and SMPL+H (body+hands). For MANO there are ~2k static 3D scans of 31 subjects performing up to 51 poses. For SMPL+H we include 39 4D sequences of 11 subjects. (Javier Romero, Dimitrios Tzionas and Michael J Black) [Before 28/12/19]
- MPII Human Pose Dataset - 25K images containing over 40K people with annotated body joints, covering 410 human activities; a de-facto standard benchmark for evaluation of articulated human pose estimation (a PCK evaluation sketch appears at the end of this section) (Andriluka, Pishchulin, Gehler, Schiele) [Before 28/12/19]
- MSR 3D Online Action Dataset - Videos of human-object interaction, in seven categories, plus a negative class. (Microsoft Research) [30/07/25]
- MSR Action3D Dataset - Videos of 10 humans performing 20 action types. Each subject performs each action 2 or 3 times. (Microsoft Research) [30/07/25]
- MSRDailyActivity3D - 10 humans performing 16 activities, e.g. read book, play guitar. Each activity performed in sitting and standing positions. (Microsoft Research) [30/07/25]
- MSRGesture3D - 10 humans performing 12 American Sign Language gestures, each gesture being performed 2-3 times. The hands have been segmented. (Microsoft Research) [30/07/25]
- MuCo-3DHP - Large scale dataset of composited multi-person RGB images with 3D pose annotations, generated from MPI-INF-3DHP dataset (MPI For Informatics) [Before 28/12/19]
- MVOR: A Multi-view Multi-person RGB-D Operating Room Dataset for 2D and 3D Human Pose Estimation - multi-view images captured by 3 RGB-D cameras during real clinical interventions (Padoy) [Before 28/12/19]
- Northwestern-UCLA Multiview Action 3D Dataset - Three Kinects used to simultaneously record 10 actions each being performed by 10 humans. (J. Wang) [30/07/25]
- Objaverse-XL - A dataset of over 10 million 3D objects. (M. Deitke, R. Liu, M. Wallingford, et al.) [29/06/25]
- OcMotion (CHOMP) - OcMotion is the first video dataset explicitly designed for occluded human body motion capture, containing ~300K frames across 43 motion sequences with accurate 3D joint-level annotations. (B. Huang, Y. Shu, J. Ju et al.) [31/07/25]
- OmniObject3D - A large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects to facilitate the development of 3D perception, reconstruction, and generation in the real world. (T. Wu, J. Zhang, X. Fu, et al.) [29/06/25]
- People In Photo Albums - Social media photo dataset with images from Flickr, and manual annotations on person heads and their identities. (Ning Zhang and Manohar Paluri and Yaniv Taigman and Rob Fergus and Lubomir Bourdev) [Before 28/12/19]
- People Snapshot Dataset - Monocular video of 24 subjects rotating in front of a fixed camera. Annotation in form of segmentation and 2D joint positions is provided. (Alldieck, Magnor, Xu, Theobalt, Pons-Moll) [Before 28/12/19]
- Person Recognition in Personal Photo Collections - introduces three harder evaluation splits, long-term attribute annotations, and per-photo timestamp metadata. (Oh, Seong Joon and Benenson, Rodrigo and Fritz, Mario and Schiele, Bernt) [Before 28/12/19]
- PKU-MMD - PKU-MMD is a large-scale continuous multimodal action detection dataset captured via Kinect v2. (C. Liu, Y. Hu, Y. Li et al.) [31/07/25]
- Pointing'04 ICPR Workshop Head Pose Image Database [Before 28/12/19]
- Pose estimation - This dataset has a total of 155,530 images. These images were obtained through the recording of members of CIDIS, in 4 sessions. In total, 10 videos with a duration of 4 minutes each were obtained. The participants were asked to bring different clothes, in order to give variety to the images. After this, the frames of the videos were separated at a rate of 5 frames per second. All these images were captured from a top view perspective. The original images have a resolution of 1280x720 pixels. (CIDIS) [Before 28/12/19]
- PROX dataset - Dataset (+code) of real 3D people interacting with real 3D scenes. "Quantitative PROX": 180 static RGB-D frames of 1 subject in 1 scene with ground-truth SMPL-X meshes. "Qualitative PROX": 100K dynamic RGB-D sequences of 20 subjects in 12 scenes with pseudo ground-truth SMPL-X meshes. (Hassan, Choutas, Tzionas, Black) [Before 28/12/19]
- RefGTA - Synthesized dataset for referring expression generation including the time required to locate the referred objects by humans. (Mikihiro Tanaka, Takayuki Itamochi, Kenichi Narioka, Ikuro Sato, Yoshitaka Ushiku and Tatsuya Harada) [1/2/21]
- RGB-D People Dataset - This dataset contains 3000+ RGB-D frames acquired in a university hall from three vertically mounted Kinect sensors. (L. Spinello, K. Arras, et al.) [30/07/25]
- SBU Kinect Interaction Dataset - a two-person interaction video collection captured via Kinect, containing 8 human interaction types. (T. Vicente, L. Hou, C.-P. Yu et al.) [31/07/25]
- SHOT7M2 - SHOT7M2 is a large-scale video dataset for evaluating real-world short-video reasoning and temporal grounding tasks, sourced from TikTok. It contains over 7 million videos annotated with temporal event information, captions, and questions for evaluation. (A. Mathis, R. Zhang, R. W. T. Schick et al.) [11/07/25]
- SHREC'16 Topological KIDS - A collection of 40 high-resolution and low-resolution 3D shapes undergoing nearly-isometric deformations in addition to strong topological artifacts, self-contacts and mesh gluing, with point-to-point ground truth. (Lahner, Rodola) [Before 28/12/19]
- SIZER - A dataset of 3D scans, clothing segmentations, labels and SMPL+G registrations of 100 subjects (~2000 scans) in various garment styles and sizes in A-pose (Tiwari, Pons-Moll) [27/12/2020]
- Social Distancing Dataset - The upper body detection of attendees in a panel discussion, associated voice activity detection ground-truth (speaking, not-speaking) for them and acoustic features extracted from the video (M. Aghaei, M. Bustreo, Y. Wang, G. Bailo, P. Morerio, A. Del Bue) [1/2/21]
- SURREAL - 60,000 synthetic videos of people under large variations in shape, texture, view-point and pose. (Varol, Romero, Martin, Mahmood, Black, Laptev, Schmid) [Before 28/12/19]
- Synthetic Depth & Thermal (SDT) Dataset - The Synthetic Depth & Thermal (SDT) dataset consists of 40k synthetic and 8k real depth and thermal stereo images, depicting human behavior in indoor environments. (CVL, Strohmayer Julian, Pramerdorfer Christopher, Kampel Martin) [1/2/21]
- SynWild - A new dataset called SynWild to evaluate human surface reconstruction from monocular videos in the wild. (Guo, Chen, Jiang, et al) [22/06/25]
- TNT 15 dataset - Several sequences of video synchronised by 10 Inertial Sensors (IMU) worn at the extremities. (von Marcard, Pons-Moll and Rosenhahn) [Before 28/12/19]
- uCO3D - UnCommon Objects in 3D. A dataset designed for training and benchmarking deep learning models aiding tasks like 3D generation and reconstruction. The dataset contains 170k videos from 1k LVIS categories. (Liu, Xingchen, Tayal, et al.) [29/06/25]
- UC-3D Motion Database - Available data types encompass high resolution Motion Capture, acquired with MVN Suit from Xsens and Microsoft Kinect RGB and depth images. (Institute of Systems and Robotics, Coimbra, Portugal) [Before 28/12/19]
- United People (UP) Dataset - ~8,000 images with keypoint and foreground segmentation annotations as well as 3D body model fits. (Lassner, Romero, Kiefel, Bogo, Black, Gehler) [Before 28/12/19]
- VGG Human Pose Estimation datasets including the BBC Pose (20 videos with an overlaid sign language interpreter), Extended BBC Pose (72 additional training videos), Short BBC Pose (5 one hour videos with sign language signers), and ChaLearn Pose (23 hours of Kinect data of 27 persons performing 20 Italian gestures). (Charles, Everingham, Pfister, Magee, Hogg, Simonyan, Zisserman) [Before 28/12/19]
- VRLF: Visual Lip Reading Feasibility - audio-visual corpus of 24 speakers recorded in Spanish (Fernandez-Lopez, Martinez and Sukno) [Before 28/12/19]
- Weizmann Space-Time Actions - The Weizmann dataset (Actions as Space-Time Shapes) consists of 90 low-resolution video sequences (180x144, 25 fps) showing 9 actors performing 10 basic actions. (M. Blank, L. Gorelick, E. Shechtman et al.) [31/07/25]
- Werewolf Among Us - The first multi-modal, multi-party dataset of social interactions, including synchronized video, audio, transcripts, and detailed annotations for social behavior analysis. (B. Lai, H. Zhang, M. Liu, et al.) [10/07/25]
- X-Humans: Expressive Human Avatars - 20 subjects, 233 sequences, 35,427 frames. High-quality textured scans, SMPL[-X] registrations. Body pose + hand gesture + facial expression. Various clothing types, hair styles, genders and ages. (Shen, Kaiyue, Guo, et al) [22/06/25]
- xR-EgoPose - 3D Human pose estimation from an egocentric perspective (Denis Tome) [27/12/2020]
- ZJU-MoCap (LightStage & Mirrored-Human) - ZJU-MoCap is a multi-view human motion capture dataset featuring two parts. (S. Peng, X. Xu, Q. Shuai et al.) [31/07/25]
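Note: many of the 2D pose datasets above (e.g. MPII Human Pose, FLIC, Leeds Sports Pose) are evaluated with a PCK-style metric, in which a predicted keypoint counts as correct when it falls within a fraction alpha of a reference length (the head-segment size for MPII's PCKh) of the ground-truth location. The NumPy sketch below is a generic, hedged illustration of that idea; the array shapes and the normalisation choice are this sketch's own assumptions, not any benchmark's official evaluation code.

```python
# Generic PCK (Percentage of Correct Keypoints) sketch; not an official evaluation script.
import numpy as np

def pck(pred, gt, ref_scale, visible, alpha=0.5):
    """pred, gt:  (N, K, 2) predicted / ground-truth 2D keypoints.
    ref_scale: (N,)      per-sample reference length (e.g. head size or bbox diagonal).
    visible:   (N, K)    boolean mask of annotated keypoints.
    Returns the fraction of visible keypoints within alpha * ref_scale of ground truth."""
    dist = np.linalg.norm(pred - gt, axis=-1)                  # (N, K) pixel distances
    correct = (dist <= alpha * ref_scale[:, None]) & visible   # per-keypoint hits
    return correct.sum() / max(int(visible.sum()), 1)
```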
People Detection and Tracking Databases
- 3D KINECT Gender Walking data base (L. Igual, A. Lapedriza, R. Borràs from UB, CVC and UOC, Spain) [Before 28/12/19]
- AAU VAP Trimodal People Segmentation Dataset - People detection and segmentation dataset captured with depth, RGB, and thermal sensors (Palmero, Clapes, Bahnsen, Mogelmose, Moeslund, Escalera) [Before 28/12/19]
- Aerial Gait Dataset - people walking as viewed from an aerial (moving) platform (Perera, Law, Chahl) [Before 28/12/19]
- AGORASET: a dataset for crowd video analysis (Nicolas Courty et al) [Before 28/12/19]
- ARIC: Activity Recognition In Classroom Dataset - ARIC is a multimodal classroom surveillance image dataset featuring 32 distinct activity categories captured from multiple perspectives. (L. Xu, F. Meng, Q. Wu et al.) [11/07/2025]
- Berkeley Multimodal Human Action Database (MHAD) - MHAD is a controlled multimodal human action dataset capturing 11 actions performed by 12 subjects (seven male, five female) with five repetitions per action. (F. Ofli, R. Chaudhry, G. Kurillo et al.) [31/07/25]
- C2A - Combination to Application (C2A) Dataset for Human Detection in Disaster Scenarios. A synthetic dataset of human poses overlaid on UAV imagery of diverse disaster scenes for training deep learning models. (Nihal, Yen, Itoyama, Nakadai) [8/12/24]
- CarDA - Car door Assembly Activities Dataset A multi-modal dataset for car door assembly activities with synchronized RGB-D videos and pose data in a real manufacturing setting (K. Papoutsakis, N. Bakalos, K. Fragkoulis et al.) [10/07/25]
- CASIA gait database (Chinese Academy of Sciences) [Before 28/12/19]
- CAVIAR project video sequences with tracking and behavior ground truth (CAVIAR team/Edinburgh University - EC project IST-2001-37540) [Before 28/12/19]
- CEPDOF: Challenging Events for Person Detection from Overhead Fisheye images - Eight 1kx1k or 2kx2k overhead fisheye RGB videos (over 25k frames) showing up to 13 people in a small classroom recorded in diverse, challenging scenarios (including no light), annotated with bounding box for each person, including angle (Tezcan, Duan, Ishwar, Konrad) [27/12/2020]
- CMU-MMAC (CMU Kitchen) - CMU-MMAC is a multi-modal activity dataset captured in a fully instrumented kitchen, featuring 25 subjects performing 5 cooking recipes. (F. De la Torre, J. Hodgins, J. Montano et al.) [31/07/25]
- CMU Panoptic Studio Dataset - Multiple people social interaction dataset captured by 500+ synchronized video cameras, with 3D full body skeletons and calibration data. (H. Joo, T. Simon, Y. Sheikh) [Before 28/12/19]
- COCO-WholeBody - large-scale 2D whole-body pose estimation dataset (Jin, Xu, Xu, Wang, Liu, Qian, Ouyang, Luo) [26/12/2020]
- Crowdbot (EPFL LASA) dataset - This dataset captures pedestrian interactions from a mobile service robot (Qolo) navigating in crowds, recording over 250k frames (~200 minutes) with frontal and rear 3D LiDAR (Velodyne VLP-16 at 20 Hz). (D. Paez-Granados, Y. He, D. Gonon et al.) [31/07/25]
- CUHK Crowd Dataset - 474 video clips from 215 crowded scenes (Shao, Loy, and Wang) [Before 28/12/19]
- CUHK01 Dataset : Person re-id dataset with 3,884 images of 972 pedestrians (Rui Zhao et al) [Before 28/12/19]
- CUHK02 Dataset : Person re-id dataset with five camera view settings. (Rui Zhao et al) [Before 28/12/19]
- CUHK03 Dataset : Person re-id dataset with 13,164 images of 1,360 pedestrians (Rui Zhao et al) [Before 28/12/19]
- Caltech Pedestrian Dataset (P. Dollar, C. Wojek, B. Schiele and P. Perona) [Before 28/12/19]
- City1M - A large synthetic group re-identification dataset containing over 1M images (Zhang, Dang, Lai, Feng, Xie) [6/12/2022]
- CLOTH3D - RGB videos of 3D dressed humans with high quality rendering and rich cloth dynamics. 8K different sequences, body shapes and outfits. (Bertiche, Madadi, Escalera) [26/12/2020]
- Daimler Pedestrian Detection Benchmark 21790 images with 56492 pedestrians plus empty scenes. (D. M. Gavrila et al) [Before 28/12/19]
- Datasets (Color & Infrared) for Fusion - A series of images in color and infrared captured from a parallel two-camera setup under different environmental conditions. (Juan Serrano-Cuerda, Antonio Fernandez-Caballero, Maria T. Lopez) [Before 28/12/19]
- Dataset of Celebrities - A dataset consisting of images of 'Sherlock' actors from freely available sources on the web (A. Nagrani, A. Zisserman) [05/06/25]
- DHP19 - DAVIS Human Pose Estimation and Action Recognition - The dataset contains synchronized recordings from 4 DAVIS346 cameras with Vicon marker ground truth from 17 subjects performing repeated motions. (Balgrist University Hospital, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DEPOF - Distance Estimation between People from Overhead Fisheye cameras. Training and testing data for distance estimation between people captured by overhead fisheye cameras (Lu, Cokbas, Ishwar, Konrad) [25/06/25]
- Driver Monitoring Video Dataset (RobeSafe + Jesus Nuevo-Chiquero) [Before 28/12/19]
- DukeMTMC: Duke Multi-Target Multi-Camera tracking dataset - 8 cameras, 85 min of video, 2M frames, 2000 people (Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, Carlo Tomasi) [Before 28/12/19]
- Edinburgh overhead camera person tracking dataset (Bob Fisher, Bashia Majecka, Gurkirt Singh, Rowland Sillito) [Before 28/12/19]
- EDVARS - a dataset for domain generalizable person re-identification. 763K images, 23K identities, 98 cameras, 8 scenes (Hu, Liu, Zheng, Zheng, Zha) [28/10/24]
- FRIDA - Fisheye Re-Identification Dataset with Annotations. 4 videos from 3 time-synchronized overhead fisheye cameras with fully-overlapping fields of view - 20 people moving around, 18,318 annotated frames with 242,809 bounding-box labels (Cokbas, Bolognino, Ishwar, Konrad) [25/06/25]
- FVessel - a benchmark dataset for maritime vessel detection, tracking, and (multi-sensor) data fusion; 26+ videos. (Guo, Ryan Wen Liu, Jingxiang Qu, Yuxu Lu, Fenghua Zhu, Yisheng Lv) [4/3/24]
- GMDCSA-24 - A dataset for human fall detection in videos. Four actors, 3 home settings, different lighting and clothes (Alam, Sufian, Dutta, Leo, Hameed) [15/12/24]
- GVVPerfcapEva - Repository of human shape and performance capture data, including full body skeletal, hand tracking, body shape, face performance, interactions (Christian Theobalt) [Before 28/12/19]
- HABBOF: Human-Aligned Bounding Boxes from Overhead Fisheye cameras - Four 2kx2k overhead fisheye RGB videos (almost 6k frames) showing up to 4 people in conference room and lab, annotated with bounding box for each person, including angle (Tezcan, Li, Ishwar, Konrad) [27/12/2020]
- HAT Database of 27 human attributes (Gaurav Sharma, Frederic Jurie) [Before 28/12/19]
- Immediacy Dataset - This dataset is designed for estimating personal relationships. (Xiao Chu et al.) [Before 28/12/19]
- Inria Dressed human bodies in motion benchmark - Benchmark containing 3D motion sequences of different subjects, motions, and clothing styles that allows one to quantitatively measure the accuracy of body shape estimates. (Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer) [Before 28/12/19]
- INRIA Person Dataset (Navneet Dalal) [Before 28/12/19]
- IU ShareView - IU ShareView dataset consists of nine sets of synchronized (two first-person) videos with a total of 1,227 pixel-level ground truth segmentation maps of 2,654 annotated person instances. (Mingze Xu, Chenyou Fan, Yuchen Wang, Michael S. Ryoo, David J. Crandall) [Before 28/12/19]
- Izmir - omnidirectional and panoramic image dataset (with annotations) to be used for human and car detection (Yalin Bastanlar) [Before 28/12/19]
- Joint Attention in Autonomous Driving (JAAD) - The dataset includes instances of pedestrians and cars intended primarily for the purpose of behavioural studies and detection in the context of autonomous driving. (Iuliia Kotseruba, Amir Rasouli and John K. Tsotsos) [Before 28/12/19]
- JTL Stereo Tracking Dataset for Person Following Robots - 11 different indoor and outdoor places for the task of robots following people under challenging situations (Chen, Sahdev, Tsotsos) [Before 28/12/19]
- KAIST Multispectral Pedestrian Detection Benchmark - 95k color-thermal pairs (640x480, 20Hz) images, with 103,128 dense annotations and 1,182 unique pedestrians (Hwang, Park, Kim, Choi, Kweon) [Before 28/12/19]
- MAHNOB: MHI-Mimicry database - A 2 person, multiple camera and microphone database for studying mimicry in human-human interaction scenarios. (Sun, Lichtenauer, Valstar, Nijholt, and Pantic) [Before 28/12/19]
- ME-ReID - a real surveillance multi-environment person ReID dataset, containing various environments like daytime and night, sunny and snowy weather, indoors and outdoors, etc (Liu, Li, Wang) [3/12/2022]
- MIT CBCL Pedestrian Data (Center for Biological and Computational Learning) [Before 28/12/19]
- MPI DYNA - A Model of Dynamic Human Shape in Motion (Max Planck Tubingen) [Before 28/12/19]
- MPI FAUST Dataset A data set containing 300 real, high-resolution human scans, with automatically computed ground-truth correspondences (Max Planck Tubingen) [Before 28/12/19]
- MPI JHMDB dataset - Joint-annotated Human Motion Data Base - 21 actions, 928 clips, 33183 frames (Jhuang, Gall, Zuffi, Schmid and Black) [Before 28/12/19]
- MPI MOSH Motion and Shape Capture from Markers. MOCAP data, 3D shape meshes, 3D high resolution scans. (Max Planck Tubingen) [Before 28/12/19]
- MVHAUS-PI - a multi-view human interaction recognition dataset (Saeid et al.) [Before 28/12/19]
- Market-1501 Dataset - 32,668 annotated bounding boxes of 1,501 identities from up to 6 cameras (Liang Zheng et al) [Before 28/12/19]
- Modena and Reggio Emilia first person head motion videos (Univ of Modena and Reggio Emilia) [Before 28/12/19]
- Multimodal Activities of Daily Living - including video, audio, physiological, sleep, motion and plug sensors. (Alexia Briasouli) [Before 28/12/19]
- Multiple Object Tracking Benchmark - A collection of datasets with ground truth, plus a performance league table (ETHZ, U. Adelaide, TU Darmstadt) [Before 28/12/19]
- Multispectral visible-NIR video sequences - Annotated multispectral video, visible + NIR (LE2I, Université de Bourgogne) [Before 28/12/19]
- NYU Multiple Object Tracking Benchmark (Konrad Schindler et al) [Before 28/12/19]
- "Neuromorphic Vision Datasets for Pedestrian Detection, Action Recognition, and Fall Detection" - "Neuromorphic vision datasets for pedestrian detection, action recognition and fall detection recorded with a DAVIS346redColor."(Miao, Chen, Ning, Zi, Ren, Bing, Knoll) [27/12/2020]
- Occluded Articulated Human Body Dataset - Body pose extraction and tracking under occlusions, 6 RGB-D sequences in total (3500 frames) with one, two and three users, marker-based ground truth data. (Markos Sigalas, Maria Pateraki, Panos Trahanias) [Before 28/12/19]
- Opportunity Activity Recognition Opportunity is a multimodal dataset of wearable, ambient, and object sensors recorded while users performed scripted and natural activities in a room setting with annotations at multiple levels. (D. Roggen, R. Chavarriaga, D. Nguyen Dinh et al.) [31/07/25]
- OxUva - A large-scale long-term tracking dataset composed of 366 long videos of about 14 hours in total, with separate dev (public annotations) and test sets (hidden annotations), featuring target object disappearance and continuous attributes. (Jack Valmadre, Luca Bertinetto, Joao F. Henriques, Ran Tao, Andrea Vedaldi, Arnold Smeulders, Philip Torr, Efstratios Gavves) [Before 28/12/19]
- PAMAP2 Physical Activity Monitoring PAMAP2 is a multimodal, sensor-based dataset featuring recordings from 9 subjects performing 18 types of physical activities. (A. Reiss, D. Stricker) [31/07/25]
- PARSE Dataset Additional Data - facial expression, gaze direction, and gender (Antol, Zitnick, Parikh) [Before 28/12/19]
- PARSE Dataset of Articulated Bodies - 300 images of humans and horses (Ramanan) [Before 28/12/19]
- PathTrack dataset: a large-scale MOT dataset - PathTrack is a large scale multi-object tracking dataset of more than 15,000 person trajectories in 720 sequences. (Santiago Manen, Michael Gygli, Dengxin Dai, Luc Van Gool) [Before 28/12/19]
- PDbm: People Detection benchmark repository - realistic sequences, manually annotated people detection ground truth and a complete evaluation framework (Garcia-Martin, Martinez, Bescos) [Before 28/12/19]
- PDds: A Person Detection dataset - several annotated surveillance sequences of different levels of complexity (Garcia-Martin, Martinez, Bescos) [Before 28/12/19]
- PETS 2009 Crowd Challenge dataset (Reading University & James Ferryman) [Before 28/12/19]
- PETS Winter 2009 workshop data (Reading University & James Ferryman) [Before 28/12/19]
- PETS: 2015 Performance Evaluation of Tracking and Surveillance (Reading University & James Ferryman) [Before 28/12/19]
- PETS: 2015 Performance Evaluation of Tracking and Surveillance (Reading University & Luis Patino) [Before 28/12/19]
- PETS 2016 datasets - multi-camera (including thermal cameras) video recordings of human behavior around a stationary vehicle and around a boat (Thomas Cane) [Before 28/12/19]
- PIROPO - People in Indoor ROoms with Perspective and Omnidirectional cameras, with more than 100,000 annotated frames (GTI-UPM, Spain) [Before 28/12/19]
- People-Art - a database containing people labelled in photos and artwork (Qi Wu and Hongping Cai) [Before 28/12/19]
- Photo-Art-50 - a database containing 50 object classes annotated in photos and artwork (Qi Wu and Hongping Cai) [Before 28/12/19]
- Pixel-based change detection benchmark dataset (Goyette et al) [Before 28/12/19]
- Precarious Dataset - unusual people detection dataset (Huang) [Before 28/12/19]
- Princeton Tracking Benchmark - 100 RGBD videos of moving objects such as humans, balls and cars. (Princeton University) [30/07/25]
- RAiD - Re-Identification Across Indoor-Outdoor Dataset: 43 people, 4 cameras, 6920 images (Abir Das et al) [Before 28/12/19]
- RPIfield - Person re-identification dataset containing 4108 person images with timestamps. (Meng Zheng, Srikrishna Karanam, Richard J. Radke) [Before 28/12/19]
- Singapore Maritime Dataset - Visible range videos and Infrared videos. (Dilip K. Prasad) [Before 28/12/19]
- SkiTB - the largest and most extensively annotated dataset for computer vision-based performance analytics in skiing. It features bounding-box tracks for athletes across their complete performance. (Dunnhofer, Sordi, Martinel, Micheloni) [13/8/25]
- SLP (Simultaneously-collected multimodal Lying Pose) - large scale dataset on in-bed poses includes: 2 Data Collection Settings: (a) Hospital setting: 7 participants, and (b) Home setting: 102 participants (29 females, age range: 20-40). 4 Imaging Modalities: RGB (regular webcam), IR (FLIR LWIR camera), DEPTH (Kinect v2) and Pressure Map (Tekscan Pressure Sensing Map). 3 Cover Conditions: uncover, bed sheet, and blanket. Fully labeled poses with 14 joints. (Ostadabbas and Liu) [2/1/20]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- Shinpuhkan 2014 - A Person Re-identification dataset containing 22,000 images of 24 people captured by 16 cameras. (Yasutomo Kawanishi et al.) [Before 28/12/19]
- Stanford Structured Group Discovery dataset - Discovering Groups of People in Images (W. Choi et al) [Before 28/12/19]
- TIDOS: Thermal Images for Door-based Occupancy Sensing - Six low-resolution (32x24) thermal sequences with over 100k frames captured by sensors mounted above two doors of a room to count people, annotated with a person's time of entry/exit (Cokbas, Ishwar, Konrad) [27/12/2020]
- TrackingNet - Large-scale dataset for tracking in the wild: more than 30k annotated sequences for training, more than 500 sequestered sequences for testing, evaluation server and leaderboard for fair ranking. (Matthias Muller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi and Bernard Ghanem) [Before 28/12/19]
- Temple Color 128 - Color Tracking Benchmark - Encoding Color Information for Visual Tracking (P. Liang, E. Blasch, H. Ling) [Before 28/12/19]
- TUM Gait from Audio, Image and Depth (GAID) database - containing tracked RGB video, tracked depth video, and audio for 305 subjects (Babaee, Hofmann, Geiger, Bachmann, Schuller, Rigoll) [Before 28/12/19]
- TVPR (Top View Person Re-identification) dataset - person re-identification using an RGB-D camera in a Top-View configuration: indoor 23 sessions, 100 people, 8 days (Liciotti, Paolanti, Frontoni, Mancini and Zingaretti) [Before 28/12/19]
- A Survey of UAV-Based Re-Identification: Datasets, Methods, and Challenges A comprehensive survey of UAV (drone) re-identification research covering publicly available datasets, key methodological approaches, and open challenges in aerial person tracking from UAV viewpoints. (X. Zhang, Y. Li, B. Chen et al.) [11/07/25]
- UCI HAR (Smartphones) UCI Human Activity Recognition Using Smartphones is a wearable sensor dataset collected from 30 subjects performing six daily activities. (D. Anguita, L. Oneto, J. Reyes-Ortiz et al.) [31/07/25]
- UCLA Aerial Event Dataset - Human activities in aerial videos with annotations of people, objects, social groups, activities and roles (Shu, Xie, Rothrock, Todorovic, and Zhu) [Before 28/12/19]
- Univ of Central Florida - Crowd Dataset (Saad Ali) [Before 28/12/19]
- Univ of Central Florida - Crowd Flow Segmentation datasets (Saad Ali) [Before 28/12/19]
- URFD: University of Rzeszow Fall Detection Dataset URFD is an RGB-D video dataset for fall detection, capturing realistic simulated falls and daily activities using a Kinect sensor. It includes depth maps, skeleton data, and annotated video sequences for human action recognition. (K. Kwolek, M. Kepski) [10/07/2025]
- UT Interaction Dataset UT Interaction is a benchmark dataset of continuous human-human interaction videos captured in realistic outdoor scenes.(M. Ryoo, J. Aggarwal et al.) [31/07/25]
- UTA-RLDD - UTA Real-Life Drowsiness Detection Dataset: 30 hours of RGB videos of 60 subjects for multi-stage and realistic drowsiness detection (Ghoddoosian, Galib, Athitsos) [26/12/2020]
- VIPeR: Viewpoint Invariant Pedestrian Recognition - 632 pedestrian image pairs taken from arbitrary viewpoints under varying illumination conditions. (Gray, Brennan, and Tao) [Before 28/12/19]
- Visual object tracking challenge datasets - The VOT datasets are a collection of fully annotated visual object tracking datasets used in the single-target, short-term visual object tracking challenges. (The VOT committee) [Before 28/12/19]
- vsenseVVDB: V-SENSE Volumetric Video Quality Database #1 - A collection of 32 compressed point clouds (from 2 reference point clouds) and their mean opinion scores, collected from 19 participants. (Zerman, Gao, Ozcinar, Smolic) [22/11/2021]
- vsenseVVDB2: V-SENSE Volumetric Video Quality Database #2 - A collection of 152 compressed volumetric video (both coloured point cloud and textured 3D meshes) and their mean opinion scores, collected from 23 participants. (Zerman, Ozcinar, Gao, Smolic) [22/11/2021]
- WEPDTOF - In-the-Wild Events for People Detection and Tracking from Overhead Fisheye Cameras. 16 overhead fisheye videos from YouTube with real-life challenges, up to 35 people per frame and 188 person identities consistently annotated across time (Tezcan, Duan, Cokbas, Ishwar, Konrad) [25/06/25]
- WIDER Attribute Dataset - WIDER Attribute is a large-scale human attribute dataset, with 13789 images belonging to 30 scene categories, and 57524 human bounding boxes each annotated with 14 binary attributes. (Li, Yining and Huang, Chen and Loy, Chen Change and Tang, Xiaoou) [Before 28/12/19]
- WUds: Wheelchair Users Dataset - wheelchair users detection data, to extend people detection, providing a more general solution to detect people in environments such as independent and assisted living, hospitals, healthcare centers and senior residences (Martin-Nieto, Garcia-Martin, Martinez) [Before 28/12/19]
- xR-EgoPose - Photorealistic synthetic dataset for 3D human pose estimation from an ego-centric perspective (Tome, Peluse, Agapito and Badino) [4/1/20]
- YouTube-BoundingBoxes - 5.6 million accurate human-annotated bounding boxes from 23 object classes tracked across frames, from 240,000 YouTube videos, with a strong focus on the person class (1.3 million boxes) (Real, Shlens, Pan, Mazzocchi, Vanhoucke, Khan, Kakarla et al) [Before 28/12/19]
Remote Sensing
- Aerial Imagery for Roof Segmentation (AIRS) - 457 km2 coverage of orthorectified aerial images with over 220,000 buildings for roof segmentation. (Lei Wang, Qi Chen) [Before 28/12/19]
- Aerial Maritime Drone Dataset - 74 aerial maritime photographs taken via a Mavic Air 2 drone, with 1,151 bounding boxes covering docks, boats, lifts, jetskis, and cars. (Solawetz) [05/06/25]
- AIDER: Aerial Image Database for Emergency Response applications - RGB images for four disaster events Fire/Smoke, Flood, Collapsed Building/Rubble, and for Traffic Accidents, as well as for normal class that does not signal the presence of a disaster. Used for aerial remote sensing and classification applications with UAVs (Kyrkou) [26/12/2020]
- Brazilian Cerrado-Savanna Scenes Dataset - Composition of IR-R-G scenes taken by RapidEye sensor for vegetation classification in Brazilian Cerrado-Savanna. (K. Nogueira, J. A. dos Santos, T. Fornazari, T. S. Freire, L. P. Morellato, R. da S. Torres) [Before 28/12/19]
- Brazilian Coffee Scenes Dataset - Composition of IR-R-G scenes taken by SPOT sensor for identification of coffee crops in Brazilian mountains. (O. A. B. Penatti, K. Nogueira, J. A. dos Santos.) [Before 28/12/19]
- Building Detection Benchmark -14 images acquired from IKONOS (1 m) and QuickBird (60 cm)(Ali Ozgun Ok and Caglar Senaras) [Before 28/12/19]
- C2A - Combination to Application (C2A) Dataset for Human Detection in Disaster Scenarios. A synthetic dataset of human poses overlaid on UAV imagery of diverse disaster scenes for training deep learning models. (Nihal, Yen, Itoyama, Nakadai) [8/12/24]
- CBERS-2B, Landsat 5 TM, Geoeye, Ikonos-2 MS and ALOS-PALSAR - land-cover classification using optical images (D. Osaku et al.) [Before 28/12/19]
- COSMICA - A curated dataset of over 10,000 annotated telescope images of comets, galaxies, nebulae, globular clusters, and background regions for training and evaluating astronomical object detectors (Piratinskii, E., Rabaev, I.) [10/07/25]
- Data Fusion Contest 2015 (Zeebruges) - This dataset provides a RGB aerial dataset (5cm) and a Lidar point cloud (65pts/m2) over the harbor of the city of Zeebruges (Belgium). It also provided a DSM derived from the point cloud and a semantic segmentation ground truth of five of the seven 10000 x 10000 pixels tiles. An evaluation server is used to evaluate the results on the two other tiles. (Image analysis and Data Fusion Technical Committee, IEEE Geoscience, Remote Sensing Society) [Before 28/12/19]
- Data Fusion Contest 2017 - This dataset provides satellite (Landsat, Sentinel 2) and vector GIS layers (e.g. buildings and road footprint) for nine cities worldwide. The task is to predict land use classes useful for climate models at a 100m prediction grid, given data of different resolution and types of features. 5 cities come with labels, 4 others are kept hidden for scoring on an evaluation server. (Image analysis and Data Fusion Technical Committee, IEEE Geoscience, Remote Sensing Society) [Before 28/12/19]
- deepGlobe challenge - This dataset comprises three challenges: road extraction, building detection, and semantic segmentation of land cover. A series of satellite images from Digital Globe (RGB, 50 cm resolution) and labels over several countries worldwide are provided. The results were presented at the DeepGlobe workshop at CVPR 2018. (Facebook, Digital Globe) [Before 28/12/19]
- DeepGlobe Satellite Image Understanding Challenge - Datasets and evaluation platforms for three deep learning tasks on satellite images: road extraction, building detection, and land type classification. (Demir, Ilke and Koperski, Krzysztof and Lindenbaum, David and Pang, Guan and Huang, Jing and Basu, Saikat and Hughes, Forest and Tuia, Devis and Raskar, Ramesh) [Before 28/12/19]
- DeepSat (SAT-4) Airborne Dataset - 500,000 image patches covering four broad land cover classes. (S. Basu et al.) [28/07/25]
- DeepSat (SAT-6) Airborne Dataset - 405,000 image patches, each of size 28x28, covering 6 land cover classes. (S. Basu et al.) [28/07/25]
- DOTA - 2806 large aerial images with 188,282 object instances over 15 categories (Xia, Bai, Ding, Zhu, Belongie, Luo, Datcu, Pelillo, Zhang) [Before 28/12/19]
- DublinCity: Annotated LiDAR Point Cloud and its Applications - Annotated (13 labels) aerial lidar scan of central Dublin (Zolanvari, Ruano, Rana, Cummins, da Silva, Rahbar, Smolic) [Before 28/12/19]
- EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision EarthView comprises approximately 15 terapixels of global remote-sensing imagery from Sentinel-1/2, NEON, and high-resolution (1 m) Satellogic data (2017-2022). (D. Velazquez, P. R. Lopez, S. Alonso, et al.) [10/07/25]
- EORSSD: Extended Optical Remote Sensing Saliency Detection dataset - salient object detection in optical remote sensing images (Zhang, Cong, Li, Cheng, Fang, Cao, Zhao, Kwong) [27/12/2020]
- F3 Facies - Fully-annotated 3D geological model of the Netherlands F3 Block for facies classification benchmark (Yazeed Alaudah, Patrycja Michałowicz, Motaz Alfarraj and Ghassan AlRegib) [1/2/21]
- FineAir - It is a high-resolution optical satellite imagery dataset specifically designed for fine-grained airplane classification by leveraging transponders. (M. Osswald, L. Niederloehner, S. Koejer, et al) [22/06/25]
- Forest Type Mapping - Multi-temporal remote sensing data of a forested area in Japan. The goal is to map different forest types using spectral data. (Brian Johnson) [28/07/25]
- FORTH Multispectral Imaging (MSI) datasets - 5 datasets for Multispectral Imaging (MSI), annotated with ground truth data (Polykarpos Karamaoynas) [Before 28/12/19]
- Furnas and Tiete - sediment yield classification (Pisani et al.) [Before 28/12/19]
- H2OPM Image Registration - The H2OPM Image Registration Dataset is a dataset for the evaluation of (groupwise) registration methods. (CVL, Zambanini, Sebastian) [1/2/21]
- HSRC - High Resolution Optical Satellite Image Dataset for Ship Recognition. 1061 ship images over 3 subclass levels (Liu, Yuan, Weng, Yang) [Before 28/12/19]
- iSAID - Instance Segmentation in Aerial Images Dataset. Precise instance-level annotation carried out by professional annotators, cross-checked and validated by expert annotators complying with well-defined guidelines. (S. Zamir, A. Arora, A. Gupta, et al.) [28/07/25]
- ISPRS 2D semantic labeling - Height models and true ortho-images with a ground sampling distance of 5cm have been prepared over the city of Potsdam/Germany (Franz Rottensteiner, Gunho Sohn, Markus Gerke, Jan D. Wegner) [Before 28/12/19]
- ISPRS 3D semantic labeling - nine class airborne laser scanning data (Franz Rottensteiner, Gunho Sohn, Markus Gerke, Jan D. Wegner) [Before 28/12/19]
- Inria Aerial Image Labeling Dataset - 9000 square kilometers of color aerial imagery over U.S. and Austrian cities. (Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, Pierre Alliez) [Before 28/12/19]
- KIT AIS Data Set - Multiple labeled training and evaluation datasets of aerial images of crowds. (M. Butenuth et al.) [28/07/25]
- LArge North-Sea Dataset of Migrated Aggregated Seismic Structures (Y. Alaudah, M. Alfarraj, and G. AlRegib) [1/2/21]
- Linkoping Thermal InfraRed dataset - The LTIR dataset is a thermal infrared dataset for evaluation of Short-Term Single-Object (STSO) tracking (Linkoping University) [Before 28/12/19]
- MASATI: MAritime SATellite Imagery dataset - MASATI is a dataset composed of optical aerial imagery with 6212 samples which were obtained from Microsoft Bing Maps. They were labeled and classified into 7 classes of maritime scenes: land, coast, sea, coast-ship, sea-ship, sea with multi-ship, sea-ship in detail. (University of Alicante) [Before 28/12/19]
- MUUFL Gulfport Hyperspectral and LiDAR data set - Co-registered aerial hyperspectral and lidar data over the University of Southern Mississippi Gulfpark campus containing several sub-pixel targets. (Gader, Zare, Close, Aitken, Tuell) [Before 28/12/19]
- NWPU-RESISC45 - A large-scale benchmark dataset used for remote sensing image scene classification containing 31500 images covered by 45 scene classes. (Cheng, Han, Lu) [Before 28/12/19]
- NWPU VHR-10 dataset - 800 high resolution satellite images of 10 classes (airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle) (Cheng, Han, Zhou, Guo) [Before 28/12/19]
- Oil Spill Detection Dataset - A set of SAR images and their corresponding ground truth masks, depicting oil spills and other relevant classes (e.g. look-alikes, ships, etc.) for oil spill detection/segmentation. (Krestenitis, M., Orfanidis, G., Ioannidis, K., Avgerinakis, K., Vrochidis, S., Kompatsiaris, I.) [1/2/21]
- Overhead Imagery Datasets for Object Detection - Annotated overhead imagery; images with multiple objects. (F. Tanner et al.) [28/07/25]
- PaCaBa - Parking Cars Barcelona Dataset - a WorldView-3 stereo satellite image dataset with labeled parking cars in the city of Barcelona (Zambanini, Loghin, Pfeifer, Soley, Sablatnig) [27/12/2020]
- RIT-18 - a high-resolution multispectral dataset for semantic segmentation. (Ronald Kemker, Carl Salvaggio, Christopher Kanan) [Before 28/12/19]
- SAR SHIP DATASET - 43 Synthetic Aperture Radar images (Schwegmann, Kleynhans, Salmon, Mdakane, Meyer) [Before 28/12/19]
- SatUAV - Aerial photos collected by UAVs and corresponding satellite images (paired images), including 13 places from Asia and Europe, e.g., Suzhou, Kunshan, Weihai, Shennongjia, Wuxi, Birmingham, Coventry, Liverpool, Peak District, Merlischachen, Renens, Lausanne, Le Bourget Airport (Paris). Among them, the original aerial photos from Merlischachen, Renens, Lausanne and Le Bourget Airport (Paris) belong to senseFly (Nian Xue, Liang Niu, Xianbin Hong, Zhen Li, Larissa Hoffaeller, Christina Pöpper) [1/2/21]
- Semantic Drone Dataset - 20 houses from nadir (bird's eye) view acquired at 5 to 30 meters above ground. 400 public and 200 private high resolution images of 6000x4000px (24Mpx). [Before 28/12/19]
- SkyScenes - A Synthetic Dataset for Aerial Scene Understanding. A synthetic, densely annotated UAV aerial image dataset with 33,600 diverse images designed to overcome limitations in real-world aerial scene understanding (Khose, Pal, Agarwal, Deepanshi, Hoffman, Chattopadhyay) [7/12/24]
- SpaceNet - The SpaceNet Dataset is hosted as an Amazon Web Services (AWS) Public Dataset. It contains ~67,000 square km of very high-resolution imagery, >11M building footprints, and ~20,000 km of road labels to ensure that there is adequate open source data available for geospatial machine learning research. (DigitalGlobe) [28/07/25]
- STPLS3D - a large database of annotated ground truth 3D point clouds reconstructed using aerial photogrammetry for training and validating 3D semantic and instance segmentation algorithms (Chen, Hu, Yu, Thomas, Feng, Hou, McCullough, Ren, Soibelman) [3/12/2022]
- uav-search-and-rescue - A people detection dataset specifically conceived for search-and-rescue operations from drones with computer vision. As it is small-sized, the dataset is currently intended for testing and evaluation purposes only (Castellano; Carone; Scigliuto; Vessio) [28/12/2020]
- UC Merced Land Use Dataset 21 class land use image dataset with 100 images per class, largely urban, 256x256 resolution, 1 foot pixels (Yang and Newsam) [Before 28/12/19]
- UCF-CrossView Dataset: Cross-View Image Matching for Geo-localization in Urban Environments - A new dataset of street view and bird's eye view images for cross-view image geo-localization. (Center for Research in Computer Vision, University of Central Florida) [Before 28/12/19]
- Wilt - High-resolution remote sensing data set (Quickbird). Small number of training samples of diseased trees, large number for other land cover. Testing data set from a stratified random sample of the image. (Brian Johnson) [28/07/25]
- Zurich Summer dataset - It is intended for semantic segmentation of very high resolution satellite images of urban scenes, with incomplete ground truth (Michele Volpi and Vittorio Ferrari) [Before 28/12/19]
- Zurich Urban Micro Aerial Vehicle Dataset - time synchronized aerial high-resolution images of 2 km of Zurich, with associated other data (Majdik, Till, Scaramuzza) [Before 28/12/19]
Robotics
- Anki Vector Robot - This dataset has been designed to allow one to train a model which can detect a Vector robot in the camera feed of another Vector robot. (Banerjee) [05/06/25]
- DISC - A large-scale virtual dataset for simulating disaster scenarios (Jeon, Im, Lee, Choi, Hebert, Kweon) [28/12/2020]
- Crowdbot (EPFL LASA) dataset This dataset captures pedestrian interactions from a mobile service robot (Qolo) navigating in crowds, recording over 250k frames (~200 minutes) with frontal and rear 3D LiDAR (Velodyne VLP-16 at 20 Hz). (D. Paez-Granados, Y. He, D. Gonon et al.) [31/07/25]
- Edinburgh Kitchen Utensil Database - 897 raw and binary images of 20 categories of kitchen utensil, a resource for training future domestic assistance robots (D. Fullerton, A. Goel, R. B. Fisher) [Before 28/12/19]
- Event-Camera Dataset - This presents the world's first collection of datasets with an event-based camera for high-speed robotics (E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, D. Scaramuzza) [Before 28/12/19]
- FLYBO - A Unified Benchmark Environment for Autonomous Flying Robots. FLYBO provides datasets, references and a framework to benchmark autonomous exploration MAV systems w.r.t their volumetric exploration and online surface reconstruction capabilities. (Brunel, Bourki, Strauss, Demonceaux) [1/12/2021]
- Improved 3D Sparse Maps for High-performance Structure from Motion with Low-cost Omnidirectional Robots - Evaluation Dataset - Data set used in research paper doi:10.1109/ICIP.2015.7351744 (Breckon, Toby P., Cavestany, Pedro) [Before 28/12/19]
- Indoor Place Recognition Dataset for localization of Mobile Robots - The dataset contains 17 different places built from 2 different robots (virtualMe and pioneer) (Raghavender Sahdev, John K. Tsotsos.) [Before 28/12/19]
- JHU CoSTAR Block Stacking Dataset - A robot dynamically interacts with 5.1 cm colored blocks via real time RGBD data to complete an order-fulfillment style block stacking task, with over 12k stacking attempts and 2m frames with applications in deep learning, neural networks, reinforcement learning, and more. (Hundt, Jain, Lin, Paxton, Hager) [27/12/2020]
- JTL Stereo Tracking Dataset for Person Following Robots - 11 different indoor and outdoor places for the task of robots following people under challenging situations (Chen, Sahdev, Tsotsos) [Before 28/12/19]
- LATTE-MV - Dataset of extracted human poses and 3D ball positions during professional table tennis gameplay, captured from monocular videos. (D. Etaat, D. Kalaria, N. Rahmanian, S. S. Sastry) [10/07/25]
- M3VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation A segmentation benchmark capturing phase-aware dynamics and transitions across robotic scenarios (J. Li, et al.) [10/07/25]
- Meta rooms - RGB-D data comprised of 28 aligned depth camera images collected by having the robot go to a specific place and perform a 360-degree pan with various tilts. (John Folkesson et al.) [Before 28/12/19]
- PanoNavi dataset - A panoramic dataset for robot navigation, consisting of 5 videos lasting about 1 hour. (Lingyan Ran) [Before 28/12/19]
- PanoraMIS - thousands of ultra-wide field of view images acquired with cameras (catadioptric, dual-fisheye) on robots (wheeled, aerial and arm) with accurate 3D position and orientation ground truth (robot encoders, GNSS, IMU) indoors and outdoors (Benseddik, Morbidi, Caron) [12/08/20]
- PRED18 - VISUALISE Predator/Prey Dataset - Dataset contains recordings from a DAVIS240 camera mounted on a computer-controlled robot (the predator) that chases and attempts to capture another human-controlled robot (the prey).(Moeys, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- Robotic 3D Scan Repository - 3D point clouds from robotic experiments of scenes (Osnabruck and Jacobs Universities) [Before 28/12/19]
- Swiss3DCities - Urban aerial photogrammetric 3D pointclouds at three point-density levels and with per-point semantic labels from three cities in Switzerland (Nomoko AG) [30/12/2020]
- Solving the Robot-World Hand-Eye(s) Calibration Problem with Iterative Methods - These datasets were generated for calibrating robot-camera systems. (Amy Tabb) [Before 28/12/19]
- TrimBot2020 Public Dataset for Garden Navigation - sensor data recorded from cameras and other sensors mounted on a robotic platform as well as additional external sensors capturing the robot in the garden, used for 3D Reconstruction Meets Semantics Challenge. Includes multicamera, 3D garden ground truth, semantic labels, sensor position. (TrimBot2020 team) [18/1/2022]
- ViDRILO - ViDRILO is a dataset containing 5 sequences of annotated RGB-D images acquired with a mobile robot in two office buildings under challenging lighting conditions. (Miguel Cazorla, J. Martinez-Gomez, M. Cazorla, I. Garcia-Varea and V. Morell.) [Before 28/12/19]
- Witham Wharf - RGB-D data of eight locations collected by a robot every 10 minutes over ~10 days at the University of Lincoln. (John Folkesson et al.) [Before 28/12/19]
Scenes or Places, Scene Segmentation or Classification
- 3D-FRONT 3D-FRONT is a large-scale synthetic indoor scene dataset containing professionally designed house layouts and richly textured furniture, encompassing nearly 19,000 rooms across thousands of distinct houses. (H. Fu, B. Cai, L. Gao et al.) [31/07/25]
- 3DRMS Challenge Dataset 2017 - real garden stereo image pairs with camera poses and semantic annotation captured by a small mobile robot (TrimBot2020 consortium) [26/2/20]
- 3DRMS Challenge Dataset 2018 - synthetic garden stereo image pairs with depths, camera poses and semantic annotation (TrimBot2020 consortium) [26/2/20]
- Aria-Digital-Twin (ADT) ADT is an egocentric dataset captured via Aria glasses, comprising 236 real activity sequences in two fully digitized indoor spaces. (X. Pan, N. Charron, Y. Yang et al.) [31/07/25]
- Augmented ICL-NUIM Dataset - An augmentation of the ICL-NUIM dataset, with camera paths added to allow it to be used for scene reconstruction. (S. Choi, Q. Zhou, V. Koltun) [30/07/25]
- Background Models Challenge - provides videos for testing your background subtraction algorithm (A. Vacavant, T. Chateau, A. Wilhelm, L. Lequievre) [1/2/21]
- Barcelona - 15,150 images, urban views of Barcelona (Tighe and Lazebnik) [Before 28/12/19]
- CAD-Estate CAD-Estate is a large-scale RGB video dataset for 3D object annotation using CAD models. It contains ~101k CAD-aligned object instances (12k unique models) across ~20k real-estate videos, each placed in the 3D coordinate frame with full 9 DoF pose and room layout annotations. (K. K. Maninis, S. Popov, M. Niessner et al.) [31/07/25]
- Cross-modal Landmark Identification Benchmark - Landmark-identification benchmark consisting of 17 landmark images taken under several weather conditions, e.g., sunny, cloudy, snowy, and sunset. (Yonsei University) [Before 28/12/19]
- CMU Visual Localization Data Set - Dataset collected over the period of a year using the Navlab 11 equipped with IMU, GPS, INS, Lidars and cameras. (Hernan Badino, Daniel Huber and Takeo Kanade) [Before 28/12/19]
- COLD (COsy Localization Database) - place localization (Ullah, Pronobis, Caputo, Luo, and Jensfelt) [Before 28/12/19]
- DAVIS: Video Object Segmentation dataset 2016 - A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation (F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung) [Before 28/12/19]
- DAVIS: Video Object Segmentation dataset 2017 - The 2017 DAVIS Challenge on Video Object Segmentation (J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool) [Before 28/12/19]
- EDEN: multimodal synthetic dataset of Enclosed garDEN scenes - more than 300K images captured from more than 100 garden models. Each image is annotated with various low/high-level vision modalities, including semantic segmentation, depth, surface normals, intrinsic colors, and optical flow (Le, Das, Mensink, Karaoglu, Gevers) [7/1/2021]
- EDUB-Seg - Egocentric dataset for event segmentation. (Mariella Dimiccoli, Marc Bolaños, Estefania Talavera, Maedeh Aghaei, Stavri G. Nikolov, and Petia Radeva) [Before 28/12/19]
- ESSEX3IN1-Dataset ESSEX3IN1 is the first dataset that consists of images of places that are confusing for both VPR and human-cognition. It contains confusing and challenging dynamic objects, natural scenes and low-informative frames. (Zaffar, Mubariz and Ehsan, et al.) [05/06/25]
- European Flood 2013 - 3,710 images of a flood event in central Europe, annotated with relevance regarding 3 image retrieval tasks (multi-label) and important image regions. (Friedrich Schiller University Jena, Deutsches GeoForschungsZentrum Potsdam) [Before 28/12/19]
- Fieldsafe - A multi-modal dataset for obstacle detection in agriculture. (Aarhus University) [Before 28/12/19]
- Fifteen Scene Categories - A dataset of fifteen natural scene categories. (Fei-Fei Li and Aude Oliva) [Before 28/12/19]
- FIGRIM (Fine Grained Image Memorability Dataset) - A subset of images from the SUN database used for human memory experiments, and provided along with memorability scores. (Bylinskii, Isola, Bainbridge, Torralba, Oliva) [Before 28/12/19]
- Fukuoka Datasets - We present several multimodal 3D datasets for the categorization of places. In addition to 3D depth information, other modalities like RGB or reflectance images are included (O. M. Mozos, K. Nakashima, H. Jung, Y. Iwashita, and R. Kurazume) [1/2/21]
- FVessel - a benchmark dataset for maritime vessel detection, tracking, and (multi-sensor) data fusion; 26+ videos. (Guo, Ryan Wen Liu, Jingxiang Qu, Yuxu Lu, Fenghua Zhu, Yisheng Lv) [4/3/24]
- Geometric Context - scene interpretation images (Derek Hoiem) [Before 28/12/19]
- GLDv2: Google Landmarks Dataset v2 - 4,132,914 training images, 761,757 index images, and 117,577 test images annotated with labels representing human-made and natural landmarks (Weyand, Araujo, Cao, Sim) [16/4/20]
- HyKo: A Spectral Dataset for Scene Understanding - The HyKo dataset was captured with compact, low-cost, snapshot mosaic (SSM) imaging cameras, which are able to capture a whole spectral cube in one shot recorded from a moving vehicle enabling hyperspectral scene analysis for road scene understanding. (Active Vision Group, University of Koblenz-Landau) [Before 28/12/19]
- ICL-NUIM dataset - Eight synthetic RGBD video sequences: four from an office scene and four from a living room scene. Simulated camera trajectories are taken from a Kintinuous output from a sensor being moved around a real-world room. (A. Handa, T. Whelan, J.B. McDonald, A.J. Davison) [30/07/25]
- iNaturalist Species Classification and Detection Dataset - The iNaturalist 2017 species classification and detection dataset has been collected and annotated by citizen scientists and contains 859,000 images from over 5,000 different species of plants and animals. (Caltech) [Before 28/12/19]
- Incidents - a dataset for detecting natural disasters, damage, and incidents in the wild - A large-scale dataset of scene-centric images annotated by humans that cover 43 disaster or incident categories and 49 place categories (Weber, Marzo, Papadopoulos, Biswas, Lapedriza, Ofli, Imran, Torralba) [27/12/2020]
- Indoor Place Recognition Dataset for localization of Mobile Robots - The dataset contains 17 different places built from 2 different robots (virtualMe and pioneer) (Raghavender Sahdev, John K. Tsotsos.) [Before 28/12/19]
- Indoor Scene Recognition - 67 Indoor categories, 15620 images (Quattoni and Torralba) [Before 28/12/19]
- Intrinsic Images in the Wild (IIW) - Intrinsic Images in the Wild is a large-scale public dataset for evaluating intrinsic image decompositions of indoor scenes (Sean Bell, Kavita Bala, Noah Snavely) [Before 28/12/19]
- IRS: Large Synthetic Indoor Robotics Stereo Dataset - 103,316 samples covering a wide range of indoor scenes, such as home, office, store and restaurant (Wang, Zheng, Yan, Deng, Zhao, Chu) [Before 28/12/19]
- LM+SUN - 45,676 images, mainly urban or human related scenes (Tighe and Lazebnik) [Before 28/12/19]
- Mallscape dataset - a collection of 33K localized and time-stamped images captured in two large shopping malls during two different sessions temporally separated by several months, enabling to evaluate Point-of-Interests (POI) change detection methods in realistic conditions (Revaud, Sampaio De Rezende, Heo, You, Jeong) [2/1/20]
- Maritime Imagery in the Visible and Infrared Spectrums - VAIS contains simultaneously acquired unregistered thermal and visible images of ships acquired from piers (Zhang, Choi, Daniilidis, Wolf, & Kanan) [Before 28/12/19]
- MASATI: MAritime SATellite Imagery dataset - MASATI is a dataset composed of optical aerial imagery with 6212 samples which were obtained from Microsoft Bing Maps. They were labeled and classified into 7 classes of maritime scenes: land, coast, sea, coast-ship, sea-ship, sea with multi-ship, sea-ship in detail. (University of Alicante) [Before 28/12/19]
- Materials in Context (MINC) - The Materials in Context Database (MINC) builds on OpenSurfaces, but includes millions of point annotations of material labels. (Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala) [Before 28/12/19]
- MIT Intrinsic Images - 20 objects (Roger Grosse, Micah K. Johnson, Edward H. Adelson, and William T. Freeman) [Before 28/12/19]
- NYU V2 Mixture of Manhattan Frames Dataset - We provide the Mixture of Manhattan Frames (MMF) segmentation and MF rotations on the full NYU depth dataset V2 by Silberman et al. (Straub, Julian and Rosman, Guy and Freifeld, Oren and Leonard, John J. and Fisher III, John W.) [Before 28/12/19]
- OFFSED/OPEDD Dataset - Off-Road Semantic Segmentation & Pedestrian Detection Dataset (Peter Neigel, Jason Rambach, Didier Stricker) [1/2/21]
- OpenSurfaces - OpenSurfaces consists of tens of thousands of examples of surfaces segmented from consumer photographs of interiors, and annotated with material parameters, texture information, and contextual information. (Kavita Bala et al.) [Before 28/12/19]
- Oxford Audiovisual Segmentation Dataset - segmentation dataset including audio recordings of the objects being struck (Arnab, Sapienza, Golodetz, Miksik and Torr) [Before 28/12/19]
- Places 2 Scene Recognition database - 365 scene categories and 8 million images (Zhou, Khosla, Lapedriza, Torralba and Oliva) [Before 28/12/19]
- Places Scene Recognition database - 205 scene categories and 2.5 million images (Zhou, Lapedriza, Xiao, Torralba, and Oliva) [Before 28/12/19]
- Replica Replica is a dataset of 18 highly photo-realistic 3D indoor scene reconstructions, featuring dense meshes, HDR textures, semantic class and instance segmentation, planar mirror and glass reflector information, and Habitat-SDK compatibility. (J. Straub, T. Whelan, L. Ma et al.) [31/07/25]
- RGB-NIR Scene Dataset - 477 images in 9 categories captured in RGB and Near-infrared (NIR) (Brown and Susstrunk) [Before 28/12/19]
- RMS2017 - Reconstruction Meets Semantics outdoor dataset - 500 semantically annotated images with poses and point cloud from a real garden (Tylecek, Sattler) [Before 28/12/19]
- RMS2018 - Reconstruction Meets Semantics virtual dataset - 30k semantically annotated images with poses and point cloud from 6 virtual gardens (Le, Tylecek) [Before 28/12/19]
- Scan2CAD Scan2CAD is a large-scale RGB-D scan to CAD alignment dataset built upon 1,506 ScanNet scenes and 14,225 CAD model instances (3,049 unique models). It contains 97,607 pairwise keypoint correspondences and ground-truth 9 DoF object alignments for evaluating scan-to-model alignment methods. (A. Avetisyan, M. Dahnert, A. Dai et al.) [31/07/25]
- SceneNet RGB-D - 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth including RGB and depth (McCormac, Handa, Leutenegger, Davison) [Before 28/12/19]
- Scene Understanding (SUN) dataset - a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. (J. Xiao, K. Ehinger, J. Hays, et al.) [28/07/25]
- SEMFIRE forest dataset - for semantic segmentation and data augmentation, including camera images and their labeled images (about 1700 image pairs in total), corresponding ROS .bag files with multispectral images, 3D LiDAR point clouds, thermal images, GPS and IMU data, and depth and RGB images. (Andrada, Portugal) [24/1/2022]
- Sift Flow (also known as LabelMe Outdoor, LMO) - 2688 images, mainly outdoor natural and urban (Tighe and Lazebnik) [Before 28/12/19]
- Southampton-York Natural Scenes Dataset - 90 scenes, 25 indoor and outdoor scene categories, with spherical LiDAR, HDR intensity, and stereo intensity panoramas. (Adams, Elder, Graf, Leyland, Lugtigheid, Muryy) [Before 28/12/19]
- Stanford Background Dataset - 715 images of outdoor scenes containing at least one foreground object (Gould et al) [Before 28/12/19]
- Structured3D Structured3D is a large-scale photo-realistic synthetic indoor scene dataset comprising 3,500 professional house designs rendered into 2D images. (J. Zheng, J. Zhang, J. Li et al.) [31/07/25]
- SUN 2012 - 16,873 fully annotated scene images for scene categorization (Xiao et al) [Before 28/12/19]
- SUN 397 - 397 scene categories for scene classification (Xiao et al) [Before 28/12/19]
- SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite - 10,000 RGB-D images, 146,617 2D polygons and 58,657 3D bounding boxes (Song, Lichtenberg, and Xiao) [Before 28/12/19]
- Surface detection - Real-time traversable surface detection by colour space fusion and temporal analysis - Evaluation Dataset (Breckon, Toby P., Katramados, Ioannis) [Before 28/12/19]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- Taskonomy - Over 4.5 million real images each with ground truth for 25 semantic, 2D, and 3D tasks. (Zamir, Sax, Shen, Guibas, Malik, Savarese) [Before 28/12/19]
- TB-Places - a data set of garden images for benchmarking algorithms for image retrieval and visual place recognition (Maria Leyva-Vallina, TrimBot2020 consortium) [26/2/20]
- Thermal Road Dataset - Our thermal-road dataset provides around 6000 thermal-infrared images captured in the road scene with manually annotated ground-truth. (3500: general road, 1500: complicated road, 1000: off-road). (Jae Shin Yoon) [Before 28/12/19]
- TrimBot2020 Dataset for Garden Navigation - sensor RGBD data recorded from cameras and other sensors mounted on a robotic platform as well as additional external sensors capturing the garden (TrimBot2020 consortium) [26/2/20]
- TrimBot2020 Public Dataset for Garden Navigation - sensor data recorded from cameras and other sensors mounted on a robotic platform as well as additional external sensors capturing the robot in the garden, used for 3D Reconstruction Meets Semantics Challenge. Includes multicamera, 3D garden ground truth, semantic labels, sensor position. (TrimBot2020 team) [18/1/2022]
- TUM City Campus - Urban point clouds taken by Mobile Laser Scanning (MLS) for classification, object extraction and change detection (Stilla, Hebel, Xu, Gehrung) [3/1/20]
- UVA Intrinsic Images and Semantic Segmentation Dataset - RGB dataset with ground-truth albedo, shading, and semantic annotations (TrimBot2020 consortium) [26/2/20]
- ViDRILO - ViDRILO is a dataset containing 5 sequences of annotated RGB-D images acquired with a mobile robot in two office buildings under challenging lighting conditions. (Miguel Cazorla, J. Martinez-Gomez, M. Cazorla, I. Garcia-Varea and V. Morell.) [Before 28/12/19]
- Virtual Gallery - a synthetic dataset that targets multiple challenges such as varying lighting conditions and different occlusion levels for various tasks such as depth estimation, instance segmentation and visual localization (Weinzaepfel, Csurka, Cabon, Humenberger) [7/1/20]
- Wireframe dataset - A set of RGB images of man-made scenes are annotated with junctions and lines, which describes the large-scale geometry of the scenes. (Huang et al.) [Before 28/12/19]
Segmentation (General)
- A Dataset for Sky Segmentation - This Sky dataset was used to evaluate the IFT-SLIC method and other superpixel algorithms, using the superpixel-based sky segmentation method proposed by Juraj Kostolansky. It contains a collection of 60 images based on the Caltech Airplanes Side dataset by R. Fergus, with ground truth for sky segmentation. (Eduardo B. Alexandre, Paulo A. V. Miranda, R. Fergus) [Before 28/12/19]
- Aberystwyth Leaf Evaluation Dataset - Timelapse plant images with hand marked up leaf-level segmentations for some time steps, and biological data from plant sacrifice. (Bell, Jonathan; Dee, Hannah M.) [Before 28/12/19]
- ADE20K - 22+K hierarchically segmented and labeled scene images (900 scene categories, 3+K classes and subpart classes) (Zhou, Zhao, Puig, Fidler, Barriuso, Torralba) [Before 28/12/19]
- AeroRIT - Hyperspectral semantic segmentation dataset (Rangnekar, Mokashi, Ientilucci, Kanan, Hoffman) [26/12/2020]
- Alpert et al. Segmentation evaluation database (Sharon Alpert, Meirav Galun, Ronen Basri, Achi Brandt) [Before 28/12/19]
- BMC (Background Model Challenge) - A dataset for comparing background subtraction algorithms, composed of real and synthetic videos (Antoine Vacavant) [Before 28/12/19]
- Berkeley Segmentation Dataset and Benchmark (David Martin and Charless Fowlkes) [Before 28/12/19]
- CAD 120 affordance dataset - Pixelwise affordance annotation in human context (Sawatzky, Srikantha, Gall) [Before 28/12/19]
- COLT - The dataset contains 40 imagenet categories with manually annotated per-pixel object masks. (Jia Li) [Before 28/12/19]
- CO-SKEL dataset - This dataset consists of categorized skeleton and segmentation masks for evaluating co-skeletonization methods. (Koteswar Rao Jerripothula, Jianfei Cai, Jiangbo Lu, Junsong Yuan) [Before 28/12/19]
- COSMICA - A curated dataset of over 10,000 annotated telescope images of comets, galaxies, nebulae, globular clusters, and background regions for training and evaluating astronomical object detectors (Piratinskii, E., Rabaev, I.) [10/07/25]
- Crack detection on 2D pavement images - five sets of pavement images that contain cracks with the manual ground truth associated and 5 automatic segmentations obtained with existing approaches (Sylvie Chambon) [Before 28/12/19]
- CTU Color and Depth Image Dataset of Spread Garments - Images of spread garments with annotated corners. (Wagner, L., Krejov D., and Smutný V. (Czech Technical University in Prague)) [Before 28/12/19]
- CTU Garment Folding Photo Dataset - Color and depth images from various stages of garment folding. (Sushkov R., Melkumov I., Smutný V. (Czech Technical University in Prague)) [Before 28/12/19]
- DADE dataset - Driving Agents in Dynamic Environments - A synthetic dataset containing video sequences (RGB images) acquired by vehicles navigating dynamic environments and weather conditions, with semantic segmentation ground truths, GNSS position data, and weather information. (Halin, Gérin, Cioppa, Henry, Ghanem, Macq, Vleeschouwer, Droogenbroeck) [13/8/25]
- DeformIt 2.0 - Image Data Augmentation Tool: Simulate novel images with ground truth segmentations from a single image-segmentation pair (Brian Booth and Ghassan Hamarneh) [Before 28/12/19]
- EVIMO - Dataset for motion segmentation, egomotion estimation and tracking using an event camera; the dataset is collected with DAVIS 346C and provides 3D poses for camera and independently moving objects, and pixelwise motion segmentation masks. (Mitrokhin, Ye, Fermuller, Aloimonos, Delbruck) [14/1/20]
- Extreme Event Dataset - An event dataset with multiple moving objects in challenging conditions (low lighting conditions and extreme light variation including flashing strobe lights).(Mitrokhin, Fermuller, Parameshwara, Aloimonos) [27/12/2020]
- Food50Seg - A dataset for semantic segmentation of food images - pixel-wise semantic segmentation annotations of 5000 food images of 50 categories. To cope with the lack of variability in the imaging conditions of existing food datasets, images with different visual distortions that could result during food acquisition are also provided: illuminant casts, JPEG compression distortions, Gaussian noise, and Gaussian blur. The final dataset is composed of 120,000 images. (Aslan, Ciocca, Mazzini, Schettini) [7/1/2021]
- GrabCut Image database (C. Rother, V. Kolmogorov, A. Blake, M. Brown) [Before 28/12/19]
- Histology Image Collection Library (HICL) - The HICL is a compilation of 3870 histopathological images (so far) from various diseases, such as brain cancer, breast cancer and HPV (Human Papilloma Virus)-Cervical cancer. (Medical Image and Signal Processing (MEDISP) Lab., Department of Biomedical Engineering, School of Engineering, University of West Attica) [Before 28/12/19]
- ICDAR'15 Smartphone document capture and OCR competition - challenge 1 - videos of documents filmed by a user with a smartphone to simulate mobile document capture, and ground truth coordinates of the document corners to detect. (Burie, Chazalon, Coustaty, Eskenazi, Luqman, Mehri, Nayef, Ogier, Prum and Rusinol) [Before 28/12/19]
- Interactive Image Segmentation Datasets A dataset consisting of 151 images (V. Gulshan, C. Rother, et al.) [05/06/25]
- Intrinsic Images in the Wild (IIW) - Intrinsic Images in the Wild is a large-scale public dataset for evaluating intrinsic image decompositions of indoor scenes (Sean Bell, Kavita Bala, Noah Snavely) [Before 28/12/19]
- LabelMe images database and online annotation tool (Bryan Russell, Antonio Torralba, Kevin Murphy, William Freeman) [Before 28/12/19]
- LITS Liver Tumor Segmentation - 130 3D CT scans with segmentations of the liver and liver tumor. Public benchmark with leaderboard at Codalab.org (Patrick Christ) [Before 28/12/19]
- Materials in Context (MINC) - The Materials in Context Database (MINC) builds on OpenSurfaces, but includes millions of point annotations of material labels. (Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala) [Before 28/12/19]
- MOBIUS (Mobile Ocular Biometrics In Unconstrained Settings dataset) Contains over 16,000 eye images captured using three mobile devices, with manual segmentation masks on a subset (M. Vitek) [10/07/25]
- MSeg - A composite dataset that unifies semantic, instance, and panoptic segmentation datasets from different domains, evaluated via zero-shot cross-dataset generalization (Lambert, Liu, Sener, Hays, Koltun) [27/12/2020]
- Multi-species fruit flower detection - This dataset consists of four sets of flower images, from three different tree species: apple, peach, and pear, and accompanying ground truth images. (Philipe A. Dias, Amy Tabb, Henry Medeiros) [Before 28/12/19]
- OCTA-500 OCTA-500 is a large-scale multimodal retinal imaging dataset for segmentation, comprising OCT and OCTA volumetric data from 500 subjects. It includes six types of projection maps, four types of text labels (e.g., gender, disease), and seven segmentation annotations. (M. Li, Y. Zhang, Z. Ji et al.) [31/07/25]
- ODMS: Object Depth via Motion and Segmentation Dataset - dataset for learning Object Depth via Motion and Segmentation, which includes extensible training data and a benchmark evaluation across multiple application domains (Griffin,Corso) [26/12/2020]
- Objects with thin and elongated parts - The three datasets used to evaluate our method Oriented Image Foresting Transform with Connectivity Constraints, which contain objects with thin and elongated parts. These databases are composed of 280 public images of birds and insects with ground truths. (Lucy A. C. Mansilla (IME-USP), Paulo A. V. Miranda) [Before 28/12/19]
- OpenSurfaces - OpenSurfaces consists of tens of thousands of examples of surfaces segmented from consumer photographs of interiors, and annotated with material parameters, texture information, and contextual information. (Kavita Bala et al.) [Before 28/12/19]
- Osnabrück gaze tracking data - 318 video sequences from several different gaze tracking data sets with polygon based object annotation. (Schöning, Faion, Heidemann, Krumnack, Gert, Açik, Kietzmann, Heidemann & König) [Before 28/12/19]
- PASCAL-Scribble Dataset - Our PASCAL-Scribble Dataset provides scribble-annotations on 59 object/stuff categories. (Di Lin) [Before 28/12/19]
- PetroSurf3D - 26 high resolution (sub-millimeter accuracy) 3D scans of rock art with pixelwise labeling of petroglyphs for segmentation. (Poier, Seidl, Zeppelzauer, Reinbacher, Schaich, Bellandi, Marretta, Bischof) [Before 28/12/19]
- Ref-AVS - Refer and Segment Objects in Audio-Visual Scenes: segment objects in audio-visual scenes with language expression reference (Wang) [25/06/25]
- Simulated Brain Database The SBD contains a set of realistic MRI data volumes produced by an MRI simulator. (R.K.-S. Kwan, A.C. Evans, et al.) [05/06/25]
- The FAce Semantic SEGmentation repository - The FASSEG repository is composed of two datasets (frontal01 and frontal02) for frontal face segmentation, and one dataset (multipose01) with labeled faces in multiple poses. (K. Khan, M. Mauro et al.) [05/06/25]
- The Internet Brain Segmentation Repository This dataset provides manually-guided expert segmentation results along with magnetic resonance brain image data. (IBSR) [05/06/25]
- SAIL-VOS - The Semantic Amodal Instance Level Video Object Segmentation (SAIL-VOS) dataset provides accurate ground truth annotations to develop methods for reasoning about occluded parts of objects while enabling to take temporal information into account (Hu, Chen, Hui, Huang, Schwing) [29/12/19]
- SBVPI (Sclera Blood Vessels, Periocular, and Iris dataset) Nearly 2,000 high-quality eye images with full and partial segmentation masks of the sclera, iris, pupil, and more (P. Rot) [10/07/25]
- Semantic Segmentation of Human Skin - Semantic Segmentation of Skin in NIR Images or low light settings (Pandey, Aayush Tyagi, Ambekar, Prathosh AP) [26/12/2020]
- Shadow Detection/Texture Segmentation Computer Vision Dataset - Video based sequences for shadow detection/suppression, with ground truth (Newey, C., Jones, O., & Dee, H. M.) [Before 28/12/19]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- Stony Brook University Shadow Dataset (SBU-Shadow5k) - Large scale shadow detection dataset from a wide variety of scenes and photo types, with human annotations (Tomas F.Y. Vicente, Le Hou, Chen-Ping Yu, Minh Hoai, Dimitris Samaras) [Before 28/12/19]
- TB-roses-v1 - data set of rose bush images with ground truth for evaluation of rose stem segmentation (TrimBot2020 consortium) [26/2/20]
- Trans2k Trans2k is a large-scale synthetic dataset for transparent object segmentation, containing over 2,000 high-resolution images with pixel-level annotations of transparent objects like glass and plastic. (Z. Shu, Z. Wang, Q. Chen et al.) [11/07/25]
- TRoM: Tsinghua Road Markings - This is a dataset which contributes to the area of road marking segmentation for Automated Driving and ADAS. (Xiaolong Liu, Zhidong Deng, Lele Cao, Hongchao Lu) [Before 28/12/19]
- TSP6K - a specialized traffic dataset for the task of traffic monitoring scene parsing; it collects images spanning various scenes from an urban road shooting platform, with pixel-level semantic and instance annotations provided (Jiang, Yang, Cao, Hou, Cheng, Shen); such scene-parsing sets are typically scored with mean IoU, as in the sketch after this list [1/8/24]
- UniMod1K - UniMod1K is a multimodal dataset for universal segmentation tasks, containing 1,000 high-quality images across RGB, depth, infrared, and thermal modalities with pixel-level annotations. (X. Zhu, Z. Yuan, J. Zhang et al.) [11/07/25]
- UVA Intrinsic Images and Semantic Segmentation Dataset - RGB dataset with ground-truth albedo, shading, and semantic annotations (Baslamisli, Groenestege, Das, Le, Karaoglu, Gevers) [Before 28/12/19]
- Video Object Instance Segmentation Dataset - A dataset consisting of Internet-collected video clips with average length around 10s and resolution over 1920 x 1080. (maadaa.ai) [05/06/25]
- VOS - A dataset with 200 Internet videos for video-based salient object detection and segmentation. (Jia Li, Changqun Xia) [Before 28/12/19]
- XPIE - An image dataset with 10000 images containing manually annotated salient objects and 8596 containing no salient objects. (Jia Li, Changqun Xia) [Before 28/12/19]
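Most of the semantic segmentation and scene parsing sets above (e.g. TSP6K, TRoM, PASCAL-Scribble, SAIL-VOS) are scored with mean intersection-over-union (mIoU): per-class IoU between predicted and ground-truth label maps, averaged over the classes present. The following is a minimal illustrative sketch of that metric in Python/NumPy, not any benchmark's official tool; the ignore-label value and the per-image (rather than dataset-accumulated) averaging are assumptions.

    import numpy as np

    def mean_iou(pred, gt, num_classes, ignore_label=255):
        """Mean intersection-over-union between two integer label maps."""
        valid = gt != ignore_label          # skip pixels marked as 'ignore' in the ground truth
        ious = []
        for c in range(num_classes):
            p = (pred == c) & valid
            g = (gt == c) & valid
            union = np.logical_or(p, g).sum()
            if union == 0:                  # class absent from both maps: do not count it
                continue
            inter = np.logical_and(p, g).sum()
            ious.append(inter / union)
        return float(np.mean(ious)) if ious else 0.0

    # toy example: two 2x3 label maps with 3 classes
    pred = np.array([[0, 1, 1], [2, 2, 0]])
    gt   = np.array([[0, 1, 2], [2, 2, 0]])
    print(mean_iou(pred, gt, num_classes=3))   # ~0.72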
Simultaneous Localization and Mapping
- ADVIO - A handheld smartphone dataset for Visual-Inertial Odometry. (Cortes, Solin, Rahtu, Kannala) [5/10/24]
- Air Ground Dataset - Air-Ground Matching of Airborne images with Google Street View data (András L. Majdik, Yves Albers-Schoenberg, Davide Scaramuzza) [1/2/21]
- Bramante39M - A dataset for camera localisation in a highly challenging deformable environment comprising about 39 million camera trajectories in a room-sized scene. (Bartoli, Sengupta) [6/9/25]
- Collaborative SLAM Dataset (CSD) - The dataset consists of four different subsets - Flat, House, Priory and Lab - each containing several RGB-D sequences that can be reconstructed and successfully relocalised against each other to form a combined 3D model. Each sequence was captured using an Asus ZenFone AR, and we provide an accurate local 6D pose for each RGB-D frame in the dataset. We also provide the calibration parameters for the depth and colour sensors, optimised global poses for the sequences in each subset, and a pre-built mesh of each sequence. (Golodetz, Cavallari, Lord, Prisacariu, Murray, Torr) [Before 28/12/19]
- Combined Dynamic Vision / RGB-D Dataset - "This dataset consists of recordings of the three data streams (color, depth, events) from the D-eDVS, a depth-augmented embedded dynamic vision sensor, and corresponding ground truth data from an external tracking system." (Weikersdorfer, Adrian, Cremers, Conradt) [27/12/2020]
- CSE Dataset: A Benchmark Dataset for Collaborative SLAM in Service Environments - A synthetic dataset for evaluating collaborative SLAM (C-SLAM) with multiple service robots in realistic dynamic indoor environments (Hospital, Office, Warehouse). (H. Park, I. Lee, M. Kim, et al) [22/06/25]
- CVSSP Dynamic RGBD Modelling - Eight RGBD sequences of general dynamic scenes captured using the Kinect V1/V2 as well as two synthetic sequences. Designed for non-rigid reconstruction. (C. Malleson, J. Guillemaut, A. Hilton) [30/07/25]
- DROT Dataset - Depth Reconstruction Occlusionless Temporal (DROT) Dataset. Five stop-motion sequences of 11-30 frames each. (D. Rotman, G. Gilboa) [30/07/25]
- EuRoC MAV Datasets - visual-inertial datasets collected on-board a Micro Aerial Vehicle (MAV). The datasets contain stereo images, synchronized IMU measurements, and accurate motion and structure ground-truth. (Burri, Nikolic, Gohl, Schneider, Rehder, Omari, Achtelik, Siegwart) [5/10/24]
- Event-aided Direct Sparse Odometry (EDS) dataset - VGA-resolution dataset of events and RGB images for visual odometry and low-level vision tasks with hybrid event-RGB sensors. (Hidalgo-Carrió, Gallego, Scaramuzza) [24/8/25]
- "Event-based, Direct Camera Tracking Dataset" - The dataset consists of one or more trajectories of an event camera (stored as a rosbag) and corresponding photometric map in the form of a point cloud for real data and a textured mesh for simulated scenes as well as ground truth pose.(Bryner, Gallego, Rebecq, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- Event-Camera Data for Pose Estimation, Visual Odometry, and SLAMThe data also include intensity images, inertial measurements, and ground truth from a motion-capture system. (ETH) [Before 28/12/19]
- EVIMO - Dataset for motion segmentation, egomotion estimation and tracking using an event camera; the dataset is collected with DAVIS 346C and provides 3D poses for camera and independently moving objects, and pixelwise motion segmentation masks. (Mitrokhin, Ye, Fermuller, Aloimonos, Delbruck) [14/1/20]
- House3D - House3D is a virtual 3D environment which consists of thousands of indoor scenes equipped with a diverse set of scene types, layouts and objects sourced from the SUNCG dataset. It consists of over 45k indoor 3D scenes, ranging from studios to two-storied houses with swimming pools and fitness rooms. All 3D objects are fully annotated with category labels. Agents in the environment have access to observations of multiple modalities, including RGB images, depth, segmentation masks and top-down 2D map views. The renderer runs at thousands of frames per second, making it suitable for large-scale RL training. (Yi Wu, Yuxin Wu, Georgia Gkioxari, Yuandong Tian, Facebook Research) [Before 28/12/19]
- Indoor Dataset of Quadrotor with Down-Looking Camera - This dataset contains the recording of the raw images, IMU measurements as well as the ground truth poses of a quadrotor flying a circular trajectory in an office size environment. (Scaramuzza, ETH Zurich, University of Zurich) [Before 28/12/19]
- InLoc - Benchmark for evaluating the accuracy of 6DoF visual localization algorithms in challenging indoor scenarios. (Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, Akihiko Torii) [Before 28/12/19]
- IROS 2011 Paper Kinect Dataset - Lab-based setup; the aim seems to be to track the motion of the camera. (F. Pomerleau, S. Magnenat, F. Colas, et al.) [30/07/25]
- LaMAR VISLAM benchmark - a comprehensive capture and GT pipeline that co-registers realistic trajectories and sensor streams captured by heterogeneous AR devices in large, unconstrained scenes (Sarlin, Dusmanu, Schoenberger, Speciale, Gruber, Larsson, Miksik, Pollefeys) [5/10/24]
- Long-term visual localization - Benchmark for evaluating visual localization and mapping algorithms under various illumination and seasonal conditions. (Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, Tomas Pajdla) [Before 28/12/19]
- Multi Vehicle Stereo Event Camera Dataset - Multiple sequences containing a stereo pair of DAVIS 346b event cameras with ground truth poses, depth maps and optical flow. (Alex Zihao Zhu, Dinesh Thakur, Tolga Ozaslan, Bernd Pfrommer, Vijay Kumar, Kostas Daniilidis) [Before 28/12/19]
- PanoNavi dataset - A panoramic dataset for robot navigation, consisting of 5 videos lasting about 1 hour. (Lingyan Ran) [Before 28/12/19]
- RAWSEEDS SLAM benchmark datasets (Rawseeds Project) [Before 28/12/19]
- RGB-D Dataset 7-Scenes - The 7-Scenes dataset is a collection of tracked RGB-D camera frames. The dataset may be used for evaluation of methods for different applications such as dense tracking and mapping and relocalization techniques. (Microsoft) [29/07/25]
- Rijksmuseum Challenge 2014 - It consists of 100K art objects from the Rijksmuseum and comes with extensive XML files describing each object. (Thomas Mensink and Jan van Gemert) [Before 28/12/19]
- RSM dataset of Visual Paths - Visual dataset of indoor spaces to benchmark localisation/navigation methods. It consists of 1.5 km of corridors and indoor spaces with ground truth for every frame, measured as distance in centimetres from starting point. Includes a synthetically generated corridor for benchmark. (Jose Rivera-Rubio, Ioannis Alexiou, Anil A. Bharath) [Before 28/12/19]
- SenseTime VISLAM Benchmark - ZJU sensetime SLAM competition (SenseTime and Zhejiang University) [5/10/24]
- Shading-based Refinement on Volumetric Signed Distance Functions - Four RGBD sequences of small statues and artefacts. (M. Zollhofer, A. Dai, M. Innman, et al.) [30/07/25]
- Stanford 3D Scenes Data - RGBD videos of six indoor and outdoor scenes, together with a dense reconstruction of each scene. (Q. Zhou) [30/07/25]
- Trimbot-Wageningen-SLAM-Dataset - A real outdoor garden dataset captured in Wageningen for the Trimbot2020 project. The dataset can be used for depth estimation, pose estimation, SLAM, 3D reconstruction, etc. (Pu, Can and Yang, et al.) [05/06/25]
- TUM Benchmark Dataset - Many different scenes and scenarios for tracking and mapping, including reconstruction, robot kidnap etc. (TUM) [29/07/25]
- TUM RGB-D Benchmark - Dataset and benchmark for the evaluation of RGB-D visual odometry and SLAM algorithms; trajectories are stored in a simple text format, as in the evaluation sketch after this list (Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard and Daniel Cremers) [Before 28/12/19]
- TUM VI Benchmark - 28 sequences, indoor and outdoor, sensor data from stereo camera and IMU, accurate ground truth at beginning and end segments. (David Schubert, Thore Goll, Nikolaus Demmel, Vladyslav Usenko, Joerg Stueckler, Daniel Cremers) [Before 28/12/19]
- UZH-RPG stereo events dataset - First dataset for 3D Reconstruction with a Stereo Event Camera (DAVIS240C) (Zhou, Gallego, Rebecq, Kneip, Li, Scaramuzza) [24/8/25]
- Visual Odometry / SLAM Evaluation - The odometry benchmark consists of 22 stereo sequences (Andreas Geiger and Philip Lenz and Raquel Urtasun) [Before 28/12/19]
- Visual Odometry Dataset with Plenoptic and Stereo Data - The dataset contains 11 sequences recorded by a hand-held platform consisting of a plenoptic camera and a pair of stereo cameras. The sequences comprise different indoor and outdoor scenes, with trajectory lengths ranging from 25 meters up to several hundred meters. The recorded sequences show moving objects as well as changing lighting conditions. (Niclas Zeller and Franz Quint, Hochschule Karlsruhe, Karlsruhe University of Applied Sciences) [Before 28/12/19]
- ViViD: Vision for Visibility Dataset - "The dataset provides normal and poor illumination sequences recorded by thermal, depth, and temporal difference sensors for indoor and outdoor trajectories." (Lee, Cho, Yoon, Shin, Kim) [27/12/2020]
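Several of the visual odometry / SLAM benchmarks above, notably the TUM RGB-D Benchmark, store estimated and ground-truth camera trajectories as plain-text files with one pose per line in the form "timestamp tx ty tz qx qy qz qw". The sketch below is a simplified, unofficial illustration of loading two such files and computing a translational RMSE over timestamp-matched pairs; the official evaluation scripts additionally perform a rigid-body alignment, which is omitted here, and the file names are placeholders.

    import numpy as np

    def load_tum_trajectory(path):
        """Parse a TUM-format trajectory file: 'timestamp tx ty tz qx qy qz qw' per line."""
        poses = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                vals = [float(v) for v in line.split()]
                poses[vals[0]] = np.array(vals[1:4])    # keep the translation only
        return poses

    def ate_rmse(est, gt, max_dt=0.02):
        """Root-mean-square translational error over timestamp-matched pose pairs."""
        gt_times = np.array(sorted(gt))
        errors = []
        for t, p in est.items():
            i = int(np.argmin(np.abs(gt_times - t)))
            if abs(gt_times[i] - t) <= max_dt:          # accept matches within 20 ms
                errors.append(np.linalg.norm(p - gt[gt_times[i]]))
        return float(np.sqrt(np.mean(np.square(errors)))) if errors else float('nan')

    # usage with placeholder file names:
    # est = load_tum_trajectory('estimated_trajectory.txt')
    # gt  = load_tum_trajectory('groundtruth.txt')
    # print('ATE RMSE [m]:', ate_rmse(est, gt))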
Surveillance and Tracking
- A collection of challenging motion segmentation benchmark datasets - These datasets enclose real-life long and short sequences, with increased number of motions and frames per sequence, and also real distortions with missing data. The ground truth is provided on all the frames of all the sequences. (Muhammad Habib Mahmood, Yago Diez, Joaquim Salvi, Xavier Llado) [Before 28/12/19]
- A Day on Campus (ADOC) - A dataset for anomaly detection with 24 hours of continuous video and 875 spatio-temporal annotations of events. (Mantini, Li, Shah) [06/1/2021]
- ATOMIC GROUP ACTIONS dataset - (Ricky J. Sethi et al.) [Before 28/12/19]
- AUT MULTIDRONE video dataset for racing bicycle detection/tracking from UAV footage - 7 Youtube videos (resolution: 1920 x 1080) at 25fps (Mademlis) [Before 28/12/19]
- AVisT - Visual Object Tracking dataset that covers a variety of adverse scenarios highly relevant to real-world applications. Importantly, it poses additional challenges to the tracker design due to adverse visibility (CVL MBZUAI and CVL ETH Zurich) [3/12/2022]
- AVSS07: Advanced Video and Signal based Surveillance 2007 datasets (Andrea Cavallaro) [Before 28/12/19]
- Activity modeling and abnormality detection dataset - The dataset contains a 45-minute video with annotated anomalies. (Jagan Varadarajan and Jean-Marc Odobez) [Before 28/12/19]
- Background subtraction - a list of datasets about background subtraction (Thierry Bouwmans) [Before 28/12/19]
- C2A - Combination to Application (C2A) Dataset for Human Detection in Disaster Scenarios. A synthetic dataset of human poses overlaid on UAV imagery of diverse disaster scenes for training deep learning models. (Nihal, Yen, Itoyama, Nakadai) [8/12/24]
- CAMO-UOW Dataset - 10 high resolution videos captured in real scenes for camouflaged background subtraction (Shuai Li and Wanqing Li) [Before 28/12/19]
- City1M - A large synthetic group re-identification dataset containing over 1M images (Zhang, Dang, Lai, Feng, Xie) [6/12/2022]
- CCTV-Fights - 1,000 videos picturing real-world fights, recorded from CCTVs or mobile cameras, and temporally annotated at the frame level. (Mauricio Perez, ROSE Lab, NTU) [Before 28/12/19]
- CMUSRD: Surveillance Research Dataset - multi-camera video for indoor surveillance scenario (K. Hattori, H. Hattori, et al) [Before 28/12/19]
- DukeMTMC: Duke Multi-Target Multi-Camera tracking dataset - 8 cameras, 85 min of video, 2M frames, 2,000 people (Ergys Ristani, Francesco Solera, Roger S. Zou, Rita Cucchiara, Carlo Tomasi) [Before 28/12/19]
- DukeMTMC-reID - A subset of the DukeMTMC for image-based person re-identification (8 cameras, 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images.) (Zheng, Zheng, and Yang) [Before 28/12/19]
- ETISEO Video Surveillance Download Datasets (INRIA Orion Team and others) [Before 28/12/19]
- Evening Bridge Pedestrians (EBP) Dataset - Images collected during a large-scale traditional celebration in the evening in Foshan (a major city in China), challenging for research on crowd counting. (Huicheng Zheng, Zijian Lin, Jiepeng Cen, Zeyu Wu, and Yadan Zhao) [1/2/21]
- Firearm-related action recognition and object detection dataset for video surveillance systems - A video dataset of 398 CCTV videos with annotated firearm actions and object detections in COCO JSON format (VISILAB, UCLM) [10/07/25]
- FMO dataset - FMO dataset contains annotated video sequences with Fast Moving Objects - objects which move over a projected distance larger than their size in one frame. (Denys Rozumnyi, Jan Kotera, Lukas Novotny, Ales Hrabalik, Filip Sroubek, Jiri Matas) [Before 28/12/19]
- HDA+ Multi-camera Surveillance Dataset - video from a network of 18 heterogeneous cameras (different resolutions and frame rates) distributed over 3 floors of a research institute with 13 fully labeled sequences, 85 persons, and 64028 bounding boxes of persons. (D. Figueira, M. Taiana, A. Nambiar, J. Nascimento and A. Bernardino) [Before 28/12/19]
- Human click data - 20K human clicks on a tracking target (including click errors) (Zhu and Porikli) [Before 28/12/19]
- IITB Corridor - group activities such as protest, chasing, fighting, sudden running as well as single person activities such as hiding face, loitering, unattended baggage, carrying a suspicious object and cycling (in a pedestrian area) (Royston Rodriguez, et al.) [26/12/2020]
- Immediacy Dataset - This dataset is designed for estimating personal relationships. (Xiao Chu et al.) [Before 28/12/19]
- Long-Term Crowd Flow - 87,430 crowd configurations in procedurally generated environments and corresponding simulated long-term crowd flows (Sohn, Zhou, Moon, Yoon, Pavlovic, Kapadia) [26/12/2020]
- MAHNOB Databases - including Laughter Database, HCI-tagging Database, MHI-Mimicry Database (M. Pantic et al.) [Before 28/12/19]
- Moving INfants In RGB-D (MINI-RGBD) - A synthetic, realistic RGB-D data set for infant pose estimation containing 12 sequences of moving infants with ground truth joint positions. (N. Hesse, C. Bodensteiner, M. Arens, U. G. Hofmann, R. Weinberger, A. S. Schroeder) [Before 28/12/19]
- MSMT17 - Person re-identification dataset. 180 hours of videos, 12 outdoor cameras, 3 indoor cameras, and 12 time slots. (Wei Longhui, Zhang Shiliang, Gao Wen, Tian Qi) [Before 28/12/19]
- MULTIDRONE boat detection/tracking - 3 HD videos (720p - 1280 x 720) subsampled at 25 fps (Mademlis) [Before 28/12/19]
- MVHAUS-PI - a multi-view human interaction recognition dataset (Saeid et al.) [Before 28/12/19]
- Multispectral visible-NIR video sequences - Annotated multispectral video, visible + NIR (LE2I, Université de Bourgogne) [Before 28/12/19]
- Openvisor - Video surveillance Online Repository (Univ of Modena and Reggio Emilia) [Before 28/12/19]
- Parking-Lot dataset - Parking-Lot dataset is a car dataset which focuses on moderate and heavy occlusions of cars in the parking lot scenario. (B. Li, T.F. Wu and S.C. Zhu) [Before 28/12/19]
- PKLot Dataset - 12,416 images of parking lots extracted from surveillance camera frames. (Universidade Federal do Parana) [09/06/25]
- Pornography Database - The Pornography database is a pornography detection dataset containing nearly 80 hours of 400 pornographic and 400 non-pornographic videos extracted from pornography websites and Youtube. (Avila, Thome, Cord, Valle, de Araujo) [Before 28/12/19]
- Princeton Tracking Benchmark - 100 RGBD tracking datasets (Song and Xiao) [Before 28/12/19]
- QMUL Junction Dataset 1 and 2 - Videos of busy road junctions. Supports anomaly detection tasks. (T. Hospedales Edinburgh/QMUL) [Before 28/12/19]
- Queen Mary Multi-Camera Distributed Traffic Scenes Dataset (QMDTS) - The QMDTS is collected from urban surveillance environment for the study of surveillance behaviours in distributed scenes. (Dr. Xun Xu. Prof. Shaogang Gong and Dr. Timothy Hospedales) [Before 28/12/19]
- Road Anomaly Detection - 22km, 11 vehicles, normal + 4 defect categories (Hameed, Mazhar, Hassan) [Before 28/12/19]
- S-Hock dataset - A new Benchmark for Spectator Crowd Analysis. (Francesco Setti, Davide Conigliaro, Paolo Rota, Chiara Bassetti, Nicola Conci, Nicu Sebe, Marco Cristani) [Before 28/12/19]
- SALSA: Synergetic sociAL Scene Analysis - A Novel Dataset for Multimodal Group Behavior Analysis (Xavier Alameda-Pineda et al.) [Before 28/12/19]
- SBMnet (Scene Background Modeling.NET) - A dataset for testing background estimation algorithms (Jodoin, Maddalena, and Petrosino) [Before 28/12/19]
- SBM-RGBD dataset - 35 Kinect indoor RGBD videos to evaluate and compare scene background modelling methods for moving object detection (Camplani, Maddalena, Moya Alcover, Petrosino, Salgado) [Before 28/12/19]
- SCOUTER - video surveillance ground truthing (shifting perspectives, different setups/lighting conditions, large variations of subject). 30 videos and approximately 36,000 manually labeled frames. (Catalin Mitrea) [Before 28/12/19]
- SJTU-BESTOne - surveillance-specific dataset platform with a realistic, diverse set of surveillance images and videos captured by cameras in active use (Shanghai Jiao Tong University) [Before 28/12/19]
- SPEVI: Surveillance Performance EValuation Initiative (Queen Mary University London) [Before 28/12/19]
- Shinpuhkan 2014 - A Person Re-identification dataset containing 22,000 images of 24 people captured by 16 cameras. (Yasutomo Kawanishi et al.) [Before 28/12/19]
- Stanford Drone Dataset - 60 images and videos of various types of agents (not just pedestrians, but also bicyclists, skateboarders, cars, buses, and golf carts) that navigate in a real world outdoor environment such as a university campus (Robicquet, Sadeghian, Alahi, Savarese) [Before 28/12/19]
- Stuttgart Artificial Background Subtraction Dataset [Before 28/12/19]
- Tracking in extremely cluttered scenes - this single object tracking dataset has 28 highly cluttered sequences with per frame annotation (Jingjing Xiao, Linbo Qiao, Rustam Stolkin, Ales Leonardis) [Before 28/12/19]
- TrackingNet - Large-scale dataset for tracking in the wild: more than 30k annotated sequences for training, more than 500 sequestered sequences for testing, evaluation server and leaderboard for fair ranking. (Matthias Muller, Adel Bibi, Silvio Giancola, Salman Al-Subaihi and Bernard Ghanem) [Before 28/12/19]
- TREK-150 - egocentric videos to study bounding-box-based object tracking in first-person vision. (Dunnhofer, Furnari, Farinella, Micheloni) [13/8/25]
- UAV Detection and Tracking Benchmark Dataset - A collated image dataset for benchmarking UAV detection and tracking performance. (Isaac-Medina, Poyser, Organisciak, Willcocks, Breckon, Shum) [13/8/25]
- UAV-ReID Benchmark Dataset - A collated image dataset for benchmarking UAV re-identification (ReID) performance. (Organisciak, Poyser, Alsehaim, Hu, Isaac-Medina, Breckon, Shum) [13/8/25]
- UCF-Crime Dataset: Real-world Anomaly Detection in Surveillance Videos - A large-scale dataset for real-world anomaly detection in surveillance videos. It consists of 1900 long and untrimmed real-world surveillance videos (of 128 hours), with 13 realistic anomalies such as fighting, road accident, burglary, robbery, etc. as well as normal activities. (Center for Research in Computer Vision, University of Central Florida) [Before 28/12/19]
- UCLA Aerial Event Dataset - Human activities in aerial videos with annotations of people, objects, social groups, activities and roles (Shu, Xie, Rothrock, Todorovic, and Zhu) [Before 28/12/19]
- UCSD Anomaly Detection Dataset - a stationary camera mounted at an elevation, overlooking pedestrian walkways, with unusual pedestrian or non-pedestrian motion. [Before 28/12/19]
- UCSD trajectory clustering and analysis datasets - (Morris and Trivedi) [Before 28/12/19]
- USC Information Sciences Institute's ATOMIC PAIR ACTIONS dataset - (Ricky J. Sethi et al.) [Before 28/12/19]
- Udine Trajectory-based anomalous event detection dataset - synthetic trajectory datasets with outliers (Univ of Udine Artificial Vision and Real Time Systems Laboratory) [Before 28/12/19]
- VISTA - synchronized first-person and third-person videos to study viewpoint bias in object tracking and segmentation algorithms. (Dunnhofer, Manigrasso, Micheloni) [13/8/25]
- Visual Tracker Benchmark - 100 object tracking sequences with per-frame ground-truth boxes and the Visual Tracker Benchmark evaluation, including tracking results from a number of trackers (Wu, Lim, Yang); trackers are commonly scored by overlap success, as in the sketch after this list [Before 28/12/19]
- WATB: Wild Animal Tracking Benchmark - 203,000 frames and 206 video sequences, and covers different kinds of animals from land, sea and sky (Wang, Cao, Li, Wang, He, Sun) [5/3/2023]
- WIDER Attribute Dataset - WIDER Attribute is a large-scale human attribute dataset, with 13789 images belonging to 30 scene categories, and 57524 human bounding boxes each annotated with 14 binary attributes. (Li, Yining and Huang, Chen and Loy, Chen Change and Tang, Xiaoou) [Before 28/12/19]
- Wildfire Smoke Dataset - This dataset is released by AI for Mankind in collaboration with HPWREN under a Creative Commons by Attribution Non-Commercial Share Alike license. (AI For Mankind) [09/06/25]
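Single-object tracking sets such as the Visual Tracker Benchmark and TrackingNet above provide per-frame ground-truth boxes (commonly comma-separated x,y,w,h) and usually report an overlap "success rate": the fraction of frames in which the predicted box overlaps the ground truth above an IoU threshold. The following is a minimal, generic sketch of that computation, not any benchmark's official toolkit; boxes are assumed to be (x, y, w, h) tuples in pixels.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
        ax, ay, aw, ah = box_a
        bx, by, bw, bh = box_b
        x1, y1 = max(ax, bx), max(ay, by)
        x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    def success_rate(pred_boxes, gt_boxes, threshold=0.5):
        """Fraction of frames whose predicted box overlaps the ground truth above the threshold."""
        overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
        return sum(o >= threshold for o in overlaps) / len(overlaps)

    # toy example with two frames: the first prediction is good, the second has drifted
    gt   = [(10, 10, 50, 50), (12, 12, 50, 50)]
    pred = [(12, 11, 48, 52), (40, 40, 50, 50)]
    print(success_rate(pred, gt, threshold=0.5))   # 0.5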
Textures
- Animal, Earth, and Plant Texture Dataset - Around 1,000 images for each category from either Flickr or Adobe Stock (Yu, Barnes, Shechtman, et al) [27/12/2020]
- Brodatz Texture, Normalized Brodatz Texture, Colored Brodatz Texture, Multiband Brodatz Texture - 154 new images plus 112 original images with various transformations (A. Safia, D. He) [Before 28/12/19]
- BTF Database Bonn - material scans, each consisting of radiometrically calibrated, registered images, illuminated and observed from varying directions, images represented in Bidirectional Texture Functions (Reinhard Klein) [27/12/2020]
- Color texture images by category (textures.forrest.cz) [Before 28/12/19]
- Columbia-Utrecht Reflectance and Texture Database (Columbia & Utrecht Universities) [Before 28/12/19]
- CoMMonS: Challenging Microscopic Material Surface - Dataset for material characterization with high-resolution images of fabric surfaces under varying controlled imaging conditions (Y. Hu and Z. Long and A. Sunderasan and M. Alfarraj, G. AlRegib, Sungmee Park, and Sundaresan Jayaraman) [1/2/21]
- DynTex: Dynamic texture database (Renaud Péteri, Mark Huiskes and Sandor Fazekas) [Before 28/12/19]
- FASHIONPEDIA - (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, 294 fine-grained attributes and their relationships; (2) a dataset with 48k everyday and celebrity event fashion images annotated with segmentation masks and their associated per-mask fine-grained attributes. (M. Jia, M. Shi, M. Sirotenko, et al.) [28/07/25]
- Houses dataset - Benchmark dataset for house prices that contains both visual and textual information about 535 houses. (Ahmed, Eman and Moustafa, Mohamed) [Before 28/12/19]
- HD-VILA-100M Dataset - HD-VILA-100M is a large-scale, high-resolution, and diversified video-language dataset to facilitate multimodal representation learning. (Xue, Hongwei and Hang, et al.) [05/06/25]
- HowTo100M - HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. (A. Miech, M. Tapaswi et al.) [05/06/25]
- Intrinsic Images in the Wild (IIW) - Intrinsic Images in the Wild, is a large-scale, public dataset for evaluating intrinsic image decompositions of indoor scenes (Sean Bell, Kavita Bala, Noah Snavely) [Before 28/12/19]
- KTH TIPS & TIPS2 textures - pose/lighting/scale variations (Eric Hayman) [Before 28/12/19]
- Materials in Context (MINC) - The Materials in Context Database (MINC) builds on OpenSurfaces, but includes millions of point annotations of material labels. (Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala) [Before 28/12/19]
- OpenSurfaces - OpenSurfaces consists of tens of thousands of examples of surfaces segmented from consumer photographs of interiors, and annotated with material parameters, texture information, and contextual information. (Kavita Bala et al.) [Before 28/12/19]
- Oulu Texture Database (Oulu University) [Before 28/12/19]
- Oxford Describable Textures Dataset - 5640 images in 47 categories (M.Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, A. Vedaldi) [Before 28/12/19]
- Prague Texture Segmentation Data Generator and Benchmark (Mikes, Haindl) [Before 28/12/19]
- Salzburg Texture Image Database (STex) - a large collection of 476 color texture images that have been captured around Salzburg, Austria. (Roland Kwitt and Peter Meerwald) [Before 28/12/19]
- SVBRDF Database Bonn - material scans, each consisting of radiometrically calibrated, registered images, illuminated and observed from varying directions, along with SVBRDF fits (Reinhard Klein) [27/12/2020]
- Synthetic SVBRDFs and renderings - The dataset contains 200000 renderings of 20000 different materials associated with their ground truth representation in the Cook-Torrance model. Distributed under a research-only, non-commercial use license. ("GraphDeco" team, Inria) [Before 28/12/19]
- Texture Database - The texture database features 25 texture classes, 40 samples each (Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce) [Before 28/12/19]
- Uppsala texture dataset of surfaces and materials - fabrics, grains, etc. [Before 28/12/19]
- Vision Texture (MIT Media Lab) [Before 28/12/19]
Underwater Datasets
- Aquariums - This dataset consists of 638 images collected by Roboflow from two aquariums in the United States (Roboflow) [05/06/25]
- Brackish Dataset - The first publicly available European underwater image dataset with bounding box annotations of fish, crabs, and other marine organisms (Pedersen, Haurum, Gade, Moeslund, Madsen) [27/12/2020]
- BrackishMOT - A multi-object tracking expansion to the Brackish Dataset, including 9 additional sequences and annotations following the MOTChallenge standard; see the annotation-parsing sketch after this list (Pedersen, Lehotský, Nikolov, Moeslund) [13/8/25]
- Caltech Fish Counting (CFC) - The Caltech Fish Counting Dataset (CFC) is a large-scale dataset for detecting, tracking, and counting fish in sonar videos. (J. Kay, P. Kulits, S. Stathatos, et al.) [28/07/25]
- Catania Fish Species Recognition - 15 fish species, with about 20,000 sample training images and additional test images (Concetto Spampinato) [Before 28/12/19]
- CoralVOS - CoralVOS is the first large-scale dense coral video segmentation dataset and benchmark, comprising 150 underwater video sequences (60,456 densely annotated frames) from 17 coral reef sites with pixel-level coral masks. (Z. Zheng, Y. Xie, H. Liang et al.) [31/07/25]
- Fishnet Open Images Dataset - features 35,000 fishing images that each contain 5 bounding boxes. (The Nature Conservancy) [03/06/25]
- JAMBO - A multi-annotator image dataset for benthic habitat classification captured by an ROV in temperate waters off the North West coast of Denmark. (Humblot-Renaux, Johansen, Schmidt, Irlind, Madsen, Moeslund, Pedersen) [13/8/25]
- Lampert's Spectrogram Analysis - Passive sonar spectrogram images derived from time-series data. These spectrograms are generated from recordings of acoustic energy radiated from propeller and engine machinery in underwater sea recordings. (Thomas Lampert) [Before 28/12/19]
- LIACi - Lifecycle Inspection, Analysis and Condition information system - Images have been collected during underwater ship inspections and annotated by human domain experts. (Waszak et al.) [28/07/25]
- MARIS Portofino dataset - A dataset of underwater stereo images depicting cylindrical pipe objects and collected to test object detection and pose estimation algorithms. (RIMLab (Robotics and Intelligent Machines Laboratory), University of Parma.) [Before 28/12/19]
- Marine Video Kit (MVK) - Single-shot videos from moving underwater cameras for content-based analysis and retrieval (Q. T. Truong, T. A. Vu, T. S. Ha et al.) [10/07/25]
- MSC marine dataset - 396 high-quality video-text-mask triplets annotated from real-world marine videos (Truong, Kwan, Dang, Gotama, Nguyen, Yeung) [13/8/25]
- OceanDark dataset - 100 low-lighting underwater images from underwater sites in the Northeast Pacific Ocean. 1400x1000 pixels, varying lighting and recording conditions (Ocean Networks Canada) [Before 28/12/19]
- RUOD - underwater dataset contains 14,000 high-resolution images, 74,903 labeled objects, and 10 common aquatic categories. (Fu, Liu, Xin, Chen, et al) [1/06/25]
- Shellfish-OpenImages - 581 images of various shellfish classes for object detection. These images are derived from the Open Images open source computer vision datasets. (Solawetz) [05/06/25]
- ShrimpView: A Versatile Dataset for Shrimp Detection and Recognition - A collection of 10,000 samples (each with 11 attributes) designed to facilitate the training of deep learning models for shrimp detection and classification. (D. Bhattacharyya) [22/06/25]
- SUIM dataset - Dataset for semantic segmentation of underwater imagery. (Md Jahidul Islam et al.) [28/07/25]
- Underwater Image Enhancement Benchmark Dataset and Beyond - An underwater image enhancement benchmark (UIEB) including 950 real-world underwater images, 890 of which have the corresponding reference images (Li, Guo, Ren, Cong, Hou, Kwong, Tao) [27/12/2020]
- Underwater Single Image Color Restoration - A dataset of forward-looking underwater images, enabling a quantitative evaluation of color restoration using color charts at different distances and ground truth distances using stereo imaging. (Berman, Levy, Avidan, Treibitz) [Before 28/12/19]
- URPC2021 Underwater Images dataset with extraction code 2021 - URPC2021 dataset comprises 7600 images split into training and validation sets in a 4:1 ratio (Baidu) [10/07/25]
- URPC2021 - The DUO dataset includes 6671 training images and 1111 test images, with extraction code DDUO (Baidu) [10/07/25]
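BrackishMOT above follows the MOTChallenge annotation convention, in which ground truth is a comma-separated text file with one object per line, typically frame, track id, x, y, w, h, confidence, class, visibility. The sketch below is a minimal reader written against that general convention rather than the dataset's own documentation (the column meanings and the gt.txt path are assumptions); it simply groups boxes by frame.

    import csv
    from collections import defaultdict

    def load_mot_ground_truth(path):
        """Group MOTChallenge-style rows (frame,id,x,y,w,h,conf,class,vis) by frame number."""
        boxes_per_frame = defaultdict(list)
        with open(path, newline='') as f:
            for row in csv.reader(f):
                if not row:
                    continue
                frame, track_id = int(row[0]), int(row[1])
                x, y, w, h = (float(v) for v in row[2:6])
                boxes_per_frame[frame].append({'id': track_id, 'bbox': (x, y, w, h)})
        return boxes_per_frame

    # usage with a hypothetical sequence directory:
    # gt = load_mot_ground_truth('some_sequence/gt/gt.txt')
    # print(len(gt), 'annotated frames,', len(gt[1]), 'objects in frame 1')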
Urban Datasets
- Ai3Dr - Benchmark that provides two types of aerial images (oblique and nadir) and the densest classified ALS point cloud of two areas of the city of Dublin to evaluate image-based 3D reconstructions from aerial images including a per urban element category evaluation (Ruano and Smolic) [23/11/21]
- Barcelona - 15,150 images, urban views of Barcelona (Tighe and Lazebnik) [Before 28/12/19]
- Cityscapes - a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames. (Cityscapes Team) [Before 28/12/19]
- CMP Facade Database - Includes 606 rectified images of facades from various places with 12 architectural classes annotated. (Radim Tylecek) [Before 28/12/19]
- DeepGlobe Satellite Image Understanding Challenge - Datasets and evaluation platforms for three deep learning tasks on satellite images: road extraction, building detection, and land type classification. (Demir, Ilke and Koperski, Krzysztof and Lindenbaum, David and Pang, Guan and Huang, Jing and Basu, Saikat and Hughes, Forest and Tuia, Devis and Raskar, Ramesh) [Before 28/12/19]
- DroNet: Learning to Fly by Driving - Videos from a bicycle with labeled collision data used for learning to predict potentially dangerous situations for vehicles. (Loquercio, Maqueda, Del Blanco, Scaramuzza) [Before 28/12/19]
- European Flood 2013 - 3,710 images of a flood event in central Europe, annotated with relevance regarding 3 image retrieval tasks (multi-label) and important image regions. (Friedrich Schiller University Jena, Deutsches GeoForschungsZentrum Potsdam) [Before 28/12/19]
- Fishyscapes Benchmark of Anomaly Detection for Semantic Segmentation - Anomaly Detection in Cityscapes-like urban driving images (Blum, Sarlin, Nieto, Siegwart, Cadena) [27/12/2020]
- Houses dataset - Benchmark dataset for house prices that contains both visual and textual information about 535 houses. (Ahmed, Eman and Moustafa, Mohamed) [Before 28/12/19]
- KITTI Vision Benchmark Suite - synchronized and calibrated RGB and grey video plus LiDAR point clouds and GPS/IMU data for about 50K frames, captured over 5 days in a mostly urban environment (Geiger, Lenz, Stiller, Urtasun, et al); the raw LiDAR sweeps can be read as in the sketch after this list [11/2/25]
- LM+SUN - 45,676 images, mainly urban or human related scenes (Tighe and Lazebnik) [Before 28/12/19]
- MIT CBCL StreetScenes Challenge Framework: (Stan Bileschi) [Before 28/12/19]
- Playing for Benchmarks (VIPER) - video sequences comprising 250K frames of urban scenes extracted from a photorealistic open world computer game. Ground truth annotations are available for several visual perception tasks (semantic, instance, panoptic segmentation, optical flow, 3D object detection, visual odometry) (Richter, Hayder, Koltun) [12/08/20]
- Playing for Data: Ground Truth from Computer Games - 25K synthetic images and semantic segmentation ground truth of urban scenes extracted from a photorealistic open world computer game (Richter, Vineet, Roth, Koltun) [12/08/20]
- Pothole Dataset - Collection of 665 images of roads with the potholes labeled. (Chitholian) [09/06/25]
- Queen Mary Multi-Camera Distributed Traffic Scenes Dataset (QMDTS) - The QMDTS is collected from urban surveillance environment for the study of surveillance behaviours in distributed scenes. (Dr. Xun Xu. Prof. Shaogang Gong and Dr. Timothy Hospedales) [Before 28/12/19]
- Robust Global Translations with 1DSfM - the numerical data describing global structure from motion problems for each dataset (Kyle Wilson and Noah Snavely) [Before 28/12/19]
- Sift Flow (also known as LabelMe Outdoor, LMO) - 2688 images, mainly outdoor natural and urban (Tighe and Lazebnik) [Before 28/12/19]
- Street-View Change Detection with Deconvolutional Networks - Database with aligned image pairs from street-view imagery with structural, lighting, weather and seasonal changes. (Pablo F. Alcantarilla, Simon Stent, German Ros, Roberto Arroyo and Riccardo Gherardi) [Before 28/12/19]
- SydneyHouse - Streetview house images with accurate 3D house shape, facade object label, dense point correspondence, and annotation toolbox. (Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler) [Before 28/12/19]
- Traffic Signs Dataset - recording sequences from over 350 km of Swedish highways and city roads (Fredrik Larsson) [Before 28/12/19]
- nuTonomy scenes dataset (nuScenes) - The nuScenes dataset is a large-scale autonomous driving dataset. It features: Full sensor suite (1x LIDAR, 5x RADAR, 6x camera, IMU, GPS), 1000 scenes of 20s each, 1,440,000 camera images, 400,000 lidar sweeps, two diverse cities: Boston and Singapore, left versus right hand traffic, detailed map information, manual annotations for 25 object classes, 1.1M 3D bounding boxes annotated at 2Hz, attributes such as visibility, activity and pose. (Caesar et al) [Before 28/12/19]
- TUM City Campus - Urban point clouds taken by Mobile Laser Scanning (MLS) for classification, object extraction and change detection (Stilla, Hebel, Xu, Gehrung) [3/1/20]
- UrbAM-ReID - a long-term geo-positioned urban ReID dataset. It is composed of four sub-datasets recording the same trajectory at the UAM Campus, each one recorded in a different season and including an inverse-direction recording (Moral, Garcia-Martin, Martinez) [22/1/25]
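The KITTI releases listed above (and several other LiDAR-equipped driving sets) store each laser sweep as a flat binary file of 32-bit floats, four values (x, y, z, reflectance) per point. A minimal sketch for loading one sweep under that assumption, with a placeholder file path, is:

    import numpy as np

    def load_kitti_velodyne(path):
        """Read a KITTI-style .bin LiDAR sweep into an (N, 4) array of x, y, z, reflectance."""
        scan = np.fromfile(path, dtype=np.float32)
        return scan.reshape(-1, 4)

    # usage with a placeholder path inside an odometry sequence:
    # points = load_kitti_velodyne('sequences/00/velodyne/000000.bin')
    # print(points.shape)        # roughly (120000, 4) per sweep
    # xyz = points[:, :3]        # drop reflectance when only geometry is needed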
Vision and Natural Language
- AVTrustBench - the dataset comprises 600K samples over 9 meticulously crafted tasks, evaluating the capabilities of Audio Visual LLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency (Sanjoy Chowdhury) [01/07/25]
- Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events - BlackSwan is a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. (A. Chinchure, S. Ravi et al.) [10/07/25]
- A Dataset & Benchmark for Understanding Human-Centric Situations A dataset called MovieGraphs which provides detailed graph-based annotations of social situations de- picted in movie clips. (P. Vicol and M. Tapaswi, et al.) [05/06/25]
- Cardiff Conversation Database (CCDb) - audio-visual natural dyadic conversation data (Rosin) [13/8/25]
- CCv1 - over 45,000 videos (3,011 participants) and intended to be used for assessing the performance of already trained models in computer vision and audio applications. The subjects are a diverse set of adults in various age, gender and apparent skin tone groups. Spoken words in all videos were also manually transcribed by human annotators and are available with the dataset. It is useful for measuring algorithmic fairness in terms of age, gender, apparent skin tone, ambient lighting conditions, and speech recognition. (Hazirbas, Bitton, Dolhansky, Pan, Gordo, Canton Ferrer) [13/8/25]
- CCv2 - 5,567 participants (26,467 videos) and designed to help researchers evaluate their computer vision, audio and speech models for accuracy across a diverse set of ages, genders, language/dialects, geographies, disabilities, physical adornments, physical attributes, voice timbres, skin tones, activities, and recording setups. (Porgali, Albiero, Ryda, Canton Ferrer, Hazirbas) [13/8/25]
- CLEVR - CLEVR is a diagnostic benchmark of 100,000 synthetic 3D-rendered images paired with nearly 1 million automatically generated reasoning questions. (J. Johnson, B. Hariharan, L. van der Maaten et al.) [31/07/25]
- CrisisMMD: Multimodal Twitter Datasets from Natural Disasters - The CrisisMMD multimodal Twitter dataset consists of several thousands of manually annotated tweets and images collected during seven major natural disasters including earthquakes, hurricanes, wildfires, and floods that happened in the year 2017 across different parts of the World. (Firoj Alam, Ferda Ofli, Muhammad Imran) [Before 28/12/19]
- DAQUAR - A dataset of human question answer pairs about images, which manifests the authors' vision of a Visual Turing Test. (Mateusz Malinowski, Mario Fritz) [Before 28/12/19]
- Dataset of Structured Queries and Spatial Relations - Dataset of structured queries about images with an emphasis on spatial relations. (Mateusz Malinowski, Mario Fritz) [Before 28/12/19]
- DVQA: Understanding Data Visualization through Question Answering - a dataset for VQA about bar charts: 3 types of questions, 300,000 Images, 3,487,194 question-answer pairs, detailed metadata (Kafle, Cohen, Price, Kanan) [Before 28/12/19]
- Ego-QA - a long-form video dataset whose average length is 18 minutes and domain is egocentric scenes (Nguyen, Hu, Wu, Nguyen, Ng, Luu) [5/10/24]
- FigureQA - a dataset for VQA about bar and pie charts, and numerical graphs: 100,000 images, 1,327,368 question-answer pairs, 100 colors and figure plot element names, 15 question types (Kahou, Michalski, Atkinson, Kadar, Trischler, Bengio) [Before 28/12/19]
- GLAMI-1M: A Multilingual Image-Text Fashion Dataset - images of fashion products with item descriptions, each in 1 of 13 languages, and a category label (191 classes) (Kosar, Hoskovec, Sulc, Bartyzal) [5/12/2022]
- Hannah and her sisters database - a dense audio-visual person-oriented ground-truth annotation of faces, speech segments, shot boundaries (Patrick Perez, Technicolor) [Before 28/12/19]
- HowToDIV - dialogues, instructions and video-steps for procedural task assistance across diverse domains in cooking, mechanics, and planting with 507 conversations, 6636 dialogue turns and 24 hours of videoclips. (Aggarwal, Colaco) [13/8/25]
- INRIA BL-database - an audio-visual speech corpus for multimodal automatic speech recognition, audio/visual synchronization, or speech-driven lip animation systems (Benezeth, Bachman, Lejan, Souviraa-Labastie, Bimbot) [Before 28/12/19]
- Large scale Movie Description Challenge (LSMDC) - A large scale dataset and challenge for movie description, including over 128K video-sentence pairs, mainly sourced from Audio Description (also known as DVS). (Rohrbach, Torabi, Rohrbach, Tandon, Pal, Larochelle, Courville and Schiele) [Before 28/12/19]
- LEGO - A dataset of over 150k paired egocentric images captured before and during daily tasks, designed for visual instruction generation. (B. Lai, X. Dai, L. Chen, G. Pang, et al.) [10/07/25]
- Lip Reading Datasets - LRW, LRS2 and LRS3 are audio-visual speech recognition datasets collected from in-the-wild videos (J. S. Chung, A. Zisserman, et al.) [05/06/25]
- LRW-1000 - Audio-Visual Speech Recognition Dataset, including 1000 word/phrases and 70,000+ samples, which could be used for AVSR, VSR (lip reading) or other related tasks (Shuang Yang) [27/12/2020]
- M2E2: MultiMedia Event Extraction - a set of news articles collected from the web, each containing full text and several accompanying images, annotated with events and event arguments that appear in text or images, as well as coreferential links between visual and textual events (Li, Zareian, Zeng, Whitehead, Lu, Ji, Chang) [28/12/2020]
- MAD-QA - a long-form video dataset whose average length is 18 minutes and domain is movies (Nguyen, Hu, Wu, Nguyen, Ng, Luu) [5/10/24]
- MeerkatBench - the dataset comprises 3M instruction tuning samples for fine grained audio visual tasks (Sanjoy Chowdhury) [01/07/25]
- Melfusion - the dataset contains (image, text, music) pairs for multimodal conditioned audio generation (Sanjoy Chowdhury) [01/07/25]
- MPII dataset - A dataset about correcting inaccurate sentences based on the videos. (Amir Mazaheri) [Before 28/12/19]
- MPI Movie Description dataset - text and video - A dataset of movie clips associated with natural language descriptions sourced from movie scripts and Audio Description. (Rohrbach, Rohrbach, Tandon and Schiele) [Before 28/12/19]
- Multimodal Ferramenta dataset - 88010 images belonging to 52 classes described using more than 20K different words (Gallo, Calefati, Nawaz) [Before 28/12/19]
- MultiVENT 2.0 - a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and over 3,900 queries targeting specific world events (Human Language Technology Center of Excellence at Johns Hopkins University) [23/06/25]
- nocaps - a large-scale benchmark for novel object captioning; the task of describing images containing visual concepts not seen in paired image-caption training data (Agrawal, Desai, Wang, Chen, Jain, Johnson, Batra, Parikh, Lee, Anderson) [2/1/20]
- OCR-VQA - 207,572 images and 1M associated question-answer pairs (Mishra, Shekhar, Singh, Chakraborty) [12/08/20]
- Open Poetry Vision Dataset - Synthetic dataset created by Roboflow for OCR tasks. Combines a random image with text. Each image in the dataset contains strings in a variety of fonts and colors randomly positioned in the canvas. (Dwyer) [09/06/25]
- Panda-70M - A Large-Scale Dataset with 70M High-Quality Video-Caption Pairs (T. Chen, A. Siarohin, W. Menapace, et al.) [10/07/25]
- PlotQA: Reasoning over Scientific Plots - A VQA dataset with 28.9 million QA pairs grounded over 224,377 scientific plots (bar, line, and dotline) on data from real-world sources and questions based on crowd-sourced question templates. (N. Methani, P. Ganguly, M. M. Khapra and P. Kumar) [1/2/21]
- Qualcomm Interactive Video Dataset (QIVD) - facilitate research on audio-visual understanding using Large Multi-modal Models, enabling systems to answer questions about events occurring in real-world, situated settings. Contains 2,900 videos of users posing real-world questions about the audio-visual content of scenes, designed to support research in multi-modal systems (Pourreza, Dagli, Bhattacharyya, Panchal, Berger, Memisevic) [16/8/25]
- Recipe1M - A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images - Recipe1M is a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. (Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, Antonio Torralba) [Before 28/12/19]
- Room to Room (R2R) dataset for Vision-and-Language Navigation - A corpus of visually-grounded natural language navigation instructions paired with trajectories in reconstructed indoor buildings from the Matterport3D dataset (Anderson, Wu, Teney, Bruce, Johnson, Sunderhauf, Reid, Gould, van den Hengel) [2/1/20]
- SemArt dataset - A dataset for semantic art understanding, including 21,384 fine-art painting images with attributes and artistic comments. (Noa Garcia, George Vogiatzis) [Before 28/12/19]
- SHOT7M2 - SHOT7M2 is a large-scale video dataset for evaluating real-world short-video reasoning and temporal grounding tasks, sourced from TikTok. It contains over 7 million videos annotated with temporal event information, captions, and questions for evaluation. (A. Mathis, R. Zhang, R. W. T. Schick et al.) [11/07/25]
- SpatialSense - a dataset of spatial relations in 2D images, which is constructed with the goal of reducing dataset bias and sampling more challenging relations in the long tail (Yang, Russakovsky, Deng) [2/1/20]
- STAIR Actions Captions - a video caption dataset consisting of 400,000 Japanese captions for 80,000 video clips (Shigeto, Yoshikawa, Lin, Takeuchi) [26/12/2020]
- STAIR Captions - dataset containing 820,310 Japanese captions for MS-COCO (Yoshikawa, Shigeto, Takeuchi) [26/12/2020]
- STVD-FC: Large-Scale TV Dataset - Fact Checking - STVD-FC is the largest public dataset on the political content analysis and factchecking tasks. It consists of more than 1,200 fact-checked claims that have been scraped from a fact-checking service with associated metadata. For the video counterpart, the dataset contains nearly 6,730 TV programs, having a total duration of 6,540 hours, with metadata. (Rayar, Delalandre, Le) [4/12/2022]
- TACoS Multi-Level Corpus - Dataset of cooking videos associated with natural language descriptions at three levels of detail (long, short and single sentence). (Rohrbach, Rohrbach, Qiu, Friedrich, Pinkal and Schiele) [Before 28/12/19]
- TallyQA - The largest dataset for open-ended counting as of 2018, and it includes test sets that evaluate both simple and more advanced capabilities. (Manoj Acharya, Kushal Kafle, Christopher Kanan) [Before 28/12/19]
- TDIUC (Task-driven image understanding) - As of 2018, this is the largest VQA dataset and it facilitates analysis for 12 kinds of questions. (Kushal Kafle, Christopher Kanan) [Before 28/12/19]
- TextCaps 0.1 - 125K captions, 25K images, Rosetta OCR tokens for joint visual and text recognition (Rohrbach team) [12/08/20]
- TextVQA - A dataset to benchmark visual reasoning based on text in images. 28K images, 45K questions, 453K answers (Singh, Natarjan, Shah, Jiang, Chen, Parikh, Rohrbach) [12/08/20]
- TGIF - 100K animated GIFs from Tumblr and 120K natural language descriptions. (Li, Song, Cao, Tetreault, Goldberg, Jaimes, Luo) [Before 28/12/19]
- Toronto COCO-QA Dataset - Automatically generated from image captions. 123,287 images, 78,736 training questions, 38,948 test questions; 4 types of questions (object, number, color, location); answers are all one word. (Mengye Ren, Ryan Kiros, Richard Zemel) [Before 28/12/19]
- Totally Looks Like - A benchmark for assessment of predicting human-based image similarity (Amir Rosenfeld, Markus D. Solbach, John Tsotsos) [Before 28/12/19]
- Twitter for Sentiment Analysis (T4SA) - About 1 million tweets (text and associated images) labelled according to the sentiment polarity of the text; the data can be used for sentiment analysis as well as other analysis in the wild since the tweets were randomly sampled tweets from the stream of all globally produced tweets. (Lucia Vadicamo, Fabio Carrara, Andrea Cimino, Stefano Cresci, Felice Dell'Orletta, Fabrizio Falchi, Maurizio Tesconi) [Before 28/12/19]
- UCF-CrossView Dataset: Cross-View Image Matching for Geo-localization in Urban Environments - A new dataset of street view and bird's eye view images for cross-view image geo-localization. (Center for Research in Computer Vision, University of Central Florida) [Before 28/12/19]
- VATEX - A large-scale multilingual video description dataset containing over 41,250 videos and 825,000 captions in both English and Chinese; among the captions, there are over 206,000 English-Chinese parallel translation pairs (X. Wang, J. Wu, J. Chen, et al.) [10/07/25]
- ViCaS - Large-scale video dataset containing detailed video descriptions with grounded object segmentation masks (Athar, Deng, Chen) [13/8/25]
- Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations - Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language. (Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David Ayman Shamma, Michael Bernstein, Li Fei-Fei) [Before 28/12/19]
- VISCO - Dataset for evaluating the self-reflection, critique, and correction capabilities of vision-language models (X. Wu, Y. Ding, B. Li et al.) [10/07/25]
- Visual Relationship Detection with Language Priors - 5,000 images, 37,993 relationships, 100 object categories, 70 predicate categories (Lu, Krishna, Bernstein, Fei-Fei) [Before 28/12/19]
- VizWiz - captioning datasets, VQA/Visual Question Answering, image quality assessment, private image detection databases with ground truth. Datasets are particularly suited for algorithms to help blind people (Jeffrey P. Bigham,Erin Brady, Danna Gurari,Kristen Grauman, Qing Li, Anhong Guo, and others) [12/08/20]
- VQA: Visual Question Answering - a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer. (Yash Goyal, Tejas Khot, Georgia Institute of Technology, Army Research Laboratory, Virginia Tech) [Before 28/12/19]
- VQA v1 - VQA: Visual Question Answering - For every image, we collected 3 free-form natural-language questions with 10 concise open-ended answers each. We provide two formats of the VQA task. (Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu) [Before 28/12/19]
- VQA 2.0 - Visual Question Answering. COCO-based: 200K images, 6M annotations, 1M questions ... (Goyal, Khot, Summers-Stay, Batra, Parikh); answers are scored with the soft accuracy shown in the sketch after this list [12/08/20]
- VQDv1 - Visual Query Detection (VQD) is a task in which given a query in natural language and an image the system must produce 0 - N boxes that satisfy that query (Acharya, Jariwala, Kanan) [26/12/2020]
- YouCook2 - 2000 long YouTube cooking videos, where each recipe step is temporally localized and described by an imperative English sentence. Bounding box annotations are available for the validation & test splits. (Luowei Zhou, Chenliang Xu, and Jason Corso) [Before 28/12/19]
- YouTube Movie Summaries - movie summary videos from YouTube, annotated with the correspondence between the video segments and the movie synopsis text at the sentence level and the phrase level. (Pelin Dogan, Boyang Li, Leonid Sigal, Markus Gross) [Before 28/12/19]
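The VQA datasets above (VQA v1, VQA 2.0) collect ten human answers per question and score a predicted answer softly: it receives min(number of matching humans / 3, 1) credit, so an answer given by at least three annotators counts as fully correct. The sketch below shows that core formula only; the official evaluation additionally normalizes answer strings and averages the score over subsets of nine annotators, which is omitted here.

    def vqa_soft_accuracy(predicted, human_answers):
        """Soft VQA accuracy: min(number of matching human answers / 3, 1)."""
        matches = sum(answer == predicted for answer in human_answers)
        return min(matches / 3.0, 1.0)

    # toy example: four of the ten annotators typed exactly '2',
    # so predicting '2' already earns full credit
    humans = ['2', '2', 'two', '2', '2', '3', '3', 'four', '2 dogs', '2 or 3']
    print(vqa_soft_accuracy('2', humans))    # 1.0
    print(vqa_soft_accuracy('3', humans))    # 0.666... (two matches)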
Other Collections
- 4D Light Field Dataset - 24 synthetic scenes with 9x9x512x512x3 input images, depth and disparity ground truth, camera parameters, and evaluation masks. (Katrin Honauer, Ole Johannsen, Daniel Kondermann, Bastian Goldluecke) [Before 28/12/19]
- AMADI_LontarSet - Balinese Palm Leaf Manuscript Images Dataset for Binarization, Query-by-Example Word Spotting, and Isolated Character Recognition of Balinese Script. (The AMADI Project et al.) [Before 28/12/19]
- Annotated Web Ears Dataset (AWE Dataset) - All images were acquired by cropping ears from internet images of known persons. (Ziga Emersic, Vitomir Struc and Peter Peer) [Before 28/12/19]
- CALVIN research group datasets - object detection with eye tracking, imagenet bounding boxes, synchronised activities, stickman and body poses, youtube objects, faces, horses, toys, visual attributes, shape classes (CALVIN group) [Before 28/12/19]
- CANTATA Video and Image Database Index site (Multitel) [Before 28/12/19]
- Chinese University of Hong Kong datasets - Face sketch, face alignment, image search, public square observation, occlusion, central station, MIT single and multiple camera trajectories, person re-identification (Multimedia lab) [Before 28/12/19]
- Computer Vision Homepage list of test image databases (Carnegie Mellon Univ) [Before 28/12/19]
- Computer Vision Lab OCR DataBase (CVL OCR DB) - CVL OCR DB is a public annotated image dataset of 120 binary annotated images of text in natural scenes. (Andrej Ikica and Peter Peer.) [Before 28/12/19]
- ETHZ various datasets - including ETH 3D head pose, BIWI audiovisual data, ETHZ shape classes, BIWI walking pedestrians, pedestrians, buildings, 4D MRI, personal events, liver ultrasound, Food 101. (ETH Zurich, Computer Vision Lab) [Before 28/12/19]
- Event-Camera Dataset - This presents the world's first collection of datasets with an event-based camera for high-speed robotics (E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, D. Scaramuzza) [Before 28/12/19]
- Finger Vein USM (FV-USM) Database - An infrared finger image database consisting of finger vein and finger geometry information. (Bakhtiar Affendi Rosdi, Universiti Sains Malaysia) [Before 28/12/19]
- Frequency-cue-conflict - The dataset contains 1200 images with class-specific correlations in low- and high-frequency information across 16 classes, designed to assess the frequency bias of vision models. (P. Gavrikov, J. Lukasik, S. Jung, et al) [22/06/25]
- FVI: Free-form video inpainting dataset - videos from YouTube-VOS and YouTube-BoundingBox and could be used for training and evaluation of video inpainting models (Chang, Ya-Liang, et al) [28/12/2020]
- General 100 Dataset - General-100 dataset contains 100 bmp-format images (with no compression), which are well-suited for super-resolution training (Dong, Chao and Loy, Chen Change and Tang, Xiaoou) [Before 28/12/19]
- GPDS Bengali and Devanagari Synthetic Signature Databases - Dual Off line and On line signature databases of Bengali and Devanagari signatures. (Miguel A. Ferrer, GPDS, ULPGC) [Before 28/12/19]
- GPDS Synthetic OnLine and OffLine Signature database - Dual Off line and On line Latin signature database. (Miguel A. Ferrer, GPDS, ULPGC) [Before 28/12/19]
- HKU-IS - 4447 images with pixel labeling groundtruth for salient object detection. (Guanbin Li, Yizhou Yu) [Before 28/12/19]
- High-res 3D-Models - includes high-resolution renderings of these datasets. (Hubert et al.) [Before 28/12/19]
- Human3.6M - Human3.6M is a large-scale motion capture dataset featuring 3.6 million accurate 3D joint poses and synchronized RGB images recorded from 11 professional actors performing 15 activities under four calibrated cameras. (C. Ionescu, D. Papava, V. Olaru et al.) [31/07/25]
- I3 - Yahoo Flickr Creative Commons 100M - This dataset contains metadata for 100 million Creative Commons photos and videos from Flickr. (B. Thomee, D.A. Shamma, G. Friedland et al.) [Before 28/12/19]
- Int. Assoc. for Pattern Recognition's Technical Committee TC11 on Reading Systems index of datasets concerning document text reading [Before 28/12/19]
- IDIAP dataset collection - 26 different datasets - multimodal, attack, biometric, cursive characters, discourse, eye gaze, posters, maya codex, MOBIO, face spoofing, game playing, finger vein, youtube-personality traits (IDIAP team) [Before 28/12/19]
- Image sequences and datasets - Images under severe illumination changes / Color images under severe illumination changes / Omnidirectional images under severe illumination changes (G. Silveira and E. Malis) [1/2/21]
- Kinect v2 dataset - Dataset for evaluating unwrapping in Kinect v2 depth decoding (Felix et al.) [Before 28/12/19]
- Laval HDR Sky Database - The database contains 800 hemispherical, full HDR photos of the sky that can be used for outdoor lighting analysis. (Jean-Francois Lalonde et al.) [Before 28/12/19]
- Leibe's Collection of people/vehicle/object databases (Bastian Leibe) [Before 28/12/19]
- Lotus Hill Image Database Collection with Ground Truth (Sealeen Ren, Benjamin Yao, Michael Yang) [Before 28/12/19]
- MIT Saliency Benchmark dataset - collection (pointers to 23 datasets) (Bylinskii, Judd, Borji, Itti, Durand, Oliva, Torralba) [Before 28/12/19]
- Michael Firman's List of RGBD datasets [Before 28/12/19]
- Msspoof: 2D multi-spectral face spoofing - Presentation attack (spoofing) dataset with samples from both real and spoofed data subjects, with attacks performed using paper presented to NIR and VIS cameras (Idiap Research Institute) [Before 28/12/19]
- Multiview Stereo Evaluation - Each dataset is registered with a "ground-truth" 3D model acquired via a laser scanning process (Steve Seitz et al) [Before 28/12/19]
- Oxford Misc, including Buffy, Flowers, TV characters, Buildings, etc (Oxford Visual Geometry Group) [Before 28/12/19]
- PEIPA Image Database Summary (Pilot European Image Processing Archive) [Before 28/12/19]
- PalmVein spoofing - Presentation attack (spoofing) dataset with samples from spoofed data subjects (corresponding to VERA Palmvein), with attacks performed using paper (Idiap Research Institute) [Before 28/12/19]
- RSBA dataset - Sequences for evaluating rolling shutter bundle adjustment (Per-Erik Forssen et al.) [Before 28/12/19]
- Replay Attack: 2D face spoofing - Presentation attack (spoofing) dataset with samples from both real and spoofed data subjects, with attacks performed using paper, photos and videos from a mobile device presented to a laptop. (Idiap Research Institute) [Before 28/12/19]
- Replay Mobile: 2D face spoofing - Presentation attack (spoofing) dataset with samples from both real and spoofed data subjects, with attacks performed using paper, photos and videos to/from a mobile device. (Idiap Research Institute) [Before 28/12/19]
- Synthetic Sequence Generator (G. Hamarneh) [Before 28/12/19]
- U-DIADS-Bib - a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts (S. Zottin, A. De Nardin, E. Colombi et al.) [10/07/25]
- USC Annotated Computer Vision Bibliography database publication summary (Keith Price) [Before 28/12/19]
- USC-SIPI image databases: texture, aerial, favorites (eg. Lena) (USC Signal and Image Processing Institute) [Before 28/12/19]
- Univ of Bern databases on handwriting, online documents, string edit and graph matching (Univ of Bern, Computer Vision and Artificial Intelligence) [Before 28/12/19]
- VERA Fingervein spoofing - Presentation attack (spoofing) dataset with samples from spoofed data subjects (corresponding to VERA Fingervein), with attacks performed using paper (Idiap Research Institute) [Before 28/12/19]
- VERA Fingervein - Fingervein dataset with data subjects recorded with an open fingervein sensor (Idiap Research Institute) [Before 28/12/19]
- VERA PalmVein - Palmvein dataset with data subjects recorded with an open palmvein sensor (Idiap Research Institute) [Before 28/12/19]
- Vehicle Detection in Aerial Imagery - VEDAI is a dataset for Vehicle Detection in Aerial Imagery, provided as a tool to benchmark automatic target recognition algorithms in unconstrained environments. (Sebastien Razakarivony and Frederic Jurie) [Before 28/12/19]
- Video Stacking Dataset - Dataset for evaluating video stacking on cell-phones (Erik Ringaby et al.) [Before 28/12/19]
- VISCO - Dataset for evaluating the self-reflection, critique, and correction capabilities of vision-language models (X. Wu, Y. Ding, B. Li et al.) [10/07/25]
- World from a cat perspective - videos recorded from the head of a freely behaving cat (Belinda Y. Betsch, Wolfgang Einhäuser) [Before 28/12/19]
- Wrist-mounted camera video dataset - Activities of Daily Living videos captured from a wrist-mounted camera and a head-mounted camera (Katsunori Ohnishi, Atsushi Kanehira, Asako Kanezaki, Tatsuya Harada) [Before 28/12/19]
- Yummly-10k dataset - The goal was to understand human perception, in this case of food taste similarity. (SE(3) Computer Vision Group at Cornell Tech) [Before 28/12/19]
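As an aside on the 4D Light Field Dataset listed above: its 9x9x512x512x3 layout is a grid of 9x9 angular views, each a 512x512 RGB image. The following minimal Python/NumPy sketch shows how such an array can be sliced into a single sub-aperture view and a horizontal epipolar-plane image; it assumes the light field has already been loaded into an array with that layout (a placeholder array is used here, since the dataset's file format is not described in this entry).

    import numpy as np

    # Placeholder light field with the layout given in the entry:
    # (angular_v, angular_u, height, width, channels) = (9, 9, 512, 512, 3)
    lf = np.zeros((9, 9, 512, 512, 3), dtype=np.float32)

    # Central sub-aperture view: fix both angular coordinates.
    center_view = lf[4, 4]        # shape (512, 512, 3)

    # Horizontal epipolar-plane image for one image row: fix the vertical
    # angular coordinate and the row, vary the horizontal angular coordinate.
    row = 256
    epi = lf[4, :, row, :, :]     # shape (9, 512, 3)

    print(center_view.shape, epi.shape)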
Miscellaneous
- 3D mesh watermarking benchmark dataset (Guillaume Lavoue) [Before 28/12/19]
- 3D-ZeF - A challenging 3D multi-object tracking benchmark dataset with zebrafish recorded in a laboratory environment (Pedersen, Haurum, Bengtson, Moeslund) [27/12/2020]
- 4D Light Field Dataset - 24 synthetic scenes with 9x9x512x512x3 input images, depth and disparity ground truth, camera parameters, and evaluation masks. (Katrin Honauer, Ole Johannsen, Daniel Kondermann, Bastian Goldluecke) [Before 28/12/19]
- Active Appearance Models datasets (Mikkel B. Stegmann) [Before 28/12/19]
- AF 4D dataset - 10 representative scenes in three categories: (1) scenes containing no face (NF), (2) scenes with a face in the foreground (FF), and (3) scenes with faces in the background (FB); each scene varies in textured background, camera motion, and the number of object motion switches. (Abdullah Abuolaim, York University) [Before 28/12/19]
- Aircraft tracking (Ajmal Mian) [Before 28/12/19]
- AMADI_LontarSet - Balinese Palm Leaf Manuscript Images Dataset for Binarization, Query-by-Example Word Spotting, and Isolated Character Recognition of Balinese Script. (The AMADI Project et al.) [Before 28/12/19]
- Annotated Web Ears Dataset (AWE Dataset) - All images were acquired by cropping ears from internet images of known persons. (Ziga Emersic, Vitomir Struc and Peter Peer) [Before 28/12/19]
- Astypalaia rock carving image dataset - annotated photographs of prehistoric rock carvings, taken in situ and under differing poses and lighting parameters (Tsigkas, Sfikas, Pasialis, Vlachopoulos, Nikou) [7/1/2021]
- Autonomous Helicopter Landing (AHL) - Unconstrained Vision Guided UAV Based Safe Helicopter Landing (Arindam Sikdar, Abhimanyu Sahu, Debajit Sen, Rohit Mahajan, Ananda Chowdhury) [1/2/21]
- BAPPS - Berkeley-Adobe Perceptual Patch Similarity, a perceptual similarity dataset containing 325k pairs of image patches spanning traditional distortions and CNN-based perturbations. (R. Zhang, P. Isola, A. Efros et al.) [31/07/25]
- BeDDE: BEnchmark Dataset for Dehazing Evaluation - a real-world benchmark dataset for single image dehazing, including 208 image pairs collected in different weather conditions, together with full-reference metrics (the VI and RI measures) especially designed for the dehazing task (Zhao; Zhang; Huang; Shen; Zhao) [27/12/2020]
- BODAIR Bogazici University-DeviantArt Image Reuse Database - 144 stock images in six categories (animals, food, nature, places, plants and premade backgrounds) and 1056 artistic images created with them. (Isikdogan, Adiyaman, Akdag Salah, Salah) [1/2/21]
- California-ND - 701 photos from a personal photo collection, including many challenging real-life non-identical near-duplicates (Vassilios Vonikakis) [Before 28/12/19]
- Cambridge Motion-based Segmentation and Recognition Dataset (Brostow, Shotton, Fauqueur, Cipolla) [Before 28/12/19]
- Catadioptric camera calibration images (Yalin Bastanlar) [Before 28/12/19]
- CED: Color Event Camera Dataset - CED features 50 minutes of footage with both color frames and color events from the Color-DAVIS346. (Scheerlinck, Rebecq, Stoffregen, Barnes, Mahony, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- Chars74K dataset - 74K images of English and Kannada characters (Teo de Campos - t.decampos@surrey.ac.uk) [Before 28/12/19]
- CITIUS Video Database - A database of 72 videos with eye-tracking data for evaluating dynamic visual saliency models. (Xose) [Before 28/12/19]
- CO2FISHEYE - CO2 Measurements versus Fisheye Cameras. CO2 levels, air inflow, and people counts (ground truth and estimates from two overhead fisheye cameras), enabling development of CO2-based people-count estimation (Cokbas, Pyltsov, Zolkos, Gevelber, Konrad) [25/06/25]
- COIN IMAGE DATASET - The coin image dataset is a dataset of 60 classes of Roman Republican coins. Each class is represented by three coin images of the reverse side, acquired at the Coin Cabinet of the Museum of Fine Arts in Vienna, Austria (CVL, Museum of Fine Arts Vienna) [1/2/21]
- Columbia Camera Response Functions: Database (DoRF) and Model (EMOR) (M.D. Grossberg and S.K. Nayar) [Before 28/12/19]
- Columbia Database of Contaminants' Patterns and Scattering Parameters (Jinwei Gu, Ravi Ramamoorthi, Peter Belhumeur, Shree Nayar) [Before 28/12/19]
- Competition on Baseline Detection - This dataset contains the training, evaluation, and test set for the ICDAR 2019 Competition on Baseline Detection (cBAD) (CVL, NCSR Demokritos) [1/2/21]
- Comprehensive Disaster Dataset (CDD) - datasets for 5 disaster categories and one non-disaster category (Fahim Faisal Niloy) [16/1/2021]
- Conflict Escalation Resolution (CONFER) Database - 120 audio-visual episodes (~142 mins) of naturalistic interactions from televised political debates, annotated frame-by-frame in terms of real-valued conflict intensity. (Christos Georgakis, Yannis Panagakis, Stefanos Zafeiriou,Maja Pantic) [Before 28/12/19]
- Corel Image Features - This dataset contains image features extracted from a Corel image collection. Four sets of features are available, based on the color histogram, color histogram layout, color moments, and co-occurrence. (M. Ortega-Binderberger) [28/07/25]
- Corrosion Image Data Set for Automating Scientific Assessment of Materials - an expertly annotated material corrosion dataset investigated with deep learning classification models (Worcester Polytechnic Institute, and US Army Research Lab) [22/11/2021]
- COVERAGE - copy-move forged (CMFD) images and their originals with similar but genuine objects (SGOs), which highlight and address tamper detection ambiguity of popular methods, caused by self-similarity within natural images (Wen, Zhu, Subramanian, Ng, Shen, and Winkler) [Before 28/12/19]
- CREStereo synthetic image stereo dataset - using ShapeNet shapes in Blender with varying scene complexity, lighting, and disparity (Li, Wang, et al) [1/10/2024]
- Crime Scene Footwear Impression Database - crime scene and reference footwear impression images (Adam Kortylewski) [Before 28/12/19]
- CrowdFlow - Optical flow dataset and benchmark for crowd analytics (Gregory Schroeder, Tobias Senst, Erik Bochinski, Thomas Sikora) [Before 28/12/19]
- Curve tracing database for an automatic grading system - The ground truth database of 70 public images used to evaluate the Bandeirantes method and other curve tracing methods in an automatic grading system. (Marcos A. Tejada Condori, Paulo A. V. Miranda) [Before 28/12/19]
- CVL Database - The CVL Database is a public database for writer retrieval, writer identification and word spotting. The database consists of 7 different handwritten texts (1 German and 6 English Texts). (CVL, Kleber, Florian; Fiel, Stefan; Diem, Markus; Sablatnig, Robert) [1/2/21]
- CVL Ruling Database - The CVL ruling dataset was synthetically generated to allow for comparing different ruling removal methods. It is based on the ICDAR 2013 Handwriting Segmentation database. (CVL, Diem, Markus, Kleber, Florian, Sablatnig, Robert) [1/2/21]
- CVSSP 3D data repository - The datasets are designed to evaluate general multi-view reconstruction algorithms. (Armin Mustafa, Hansung Kim, Jean-Yves Guillemaut and Adrian Hilton) [Before 28/12/19]
- D-HAZY - a dataset to quantitatively evaluate dehazing algorithms (Cosmin Ancuti et al.) [Before 28/12/19]
- Director's Cut - A Combined Dataset for Visual Attention Analysis in Cinematic VR Content - 8 omnidirectional videos, the scan-paths of the professional filmmakers (Director's Cut) and 20 test subjects, the video details with the directional cues and plot points provided by the filmmakers, and the analysis results with 5 metrics. (Knorr, Ozcinar, Fearghail, Smolic) [23/11/21]
- DR(eye)VE - A driver's attention dataset (University of Modena and Reggio Emilia) [Before 28/12/19]
- DTU controlled motion and lighting image dataset (135K images) (Henrik Aanaes) [Before 28/12/19]
- Database for Visual Eye Movements (DOVES) - A set of eye movements collected from 29 human observers as they viewed 101 natural calibrated images. (van der Linde, I., Rajashekar, U., Bovik, A. C. et al.) [Before 28/12/19]
- DeformIt 2.0 - Image Data Augmentation Tool: Simulate novel images with ground truth segmentations from a single image-segmentation pair (Brian Booth and Ghassan Hamarneh) [Before 28/12/19]
- Dense outdoor correspondence ground truth datasets, for optical flow and local keypoint evaluation (Christoph Strecha) [Before 28/12/19]
- DVS09 - DVS128 Dynamic Vision Sensor Silicon Retina - Dataset containing sample DVS recordings. (Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DVSFLOW16 - DVS/DAVIS Optical Flow Dataset - The DVS optical flow dataset contains samples of a scene with boxes, moving sinusoidal gratings, and a rotating disk; the ground truth comes from the camera's IMU rate gyro. (Rueckauer, Delbruck, Institute of Neuroinformatics, UZH and ETH Zurich) [27/12/2020]
- DVSNOISE20 - This dataset is designed to evaluate event denoising algorithm performance against real sensor data and was collected using a DAVIS346 neuromorphic camera. (Almatrafi, Baldwin, Aizawa, Hirakawa) [27/12/2020]
- EgoSocialRelation - a dataset of image sequences (2fpm) taken by a wearable camera capturing people's social interactions in the wild (Aimar ES, Radeva P, Dimiccoli M.) [1/2/21]
- EISATS: .enpeda.. Image Sequence Analysis Test Site (Auckland University Multimedia Imaging Group) [Before 28/12/19]
- Epic Sounds Dataset - A large scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC-KITCHENS-100. (J. Huh, J. Chalk, et al.) [05/06/25]
- The Event-Based Space Situational Awareness (EBSSA) Dataset - a collection of event-based recordings of resident space objects, planets and stars. (Afshar, Nicholson, van Schaik, Cohen) [27/12/2020]
- Falling Things synthetic image dataset - 60k annotated RGB photos of 21 household objects with dense depth, object poses, bounding boxes, per-pixel class segmentation. (Tremblay, To, Birchfield) [1/10/2024]
- FAMOS Dataset - 5,000 unique microstructures; all samples have been acquired 3 times with two different cameras. (S. Voloshynovskiy, et al.) [28/07/25]
- Featureless object tracking - This dataset contains several video sequences with limited texture, intended for visual tracking, including manually annotated per-frame pose. (Lebeda, Hadfield, Matas, Bowden) [Before 28/12/19]
- FlickrLogos-32 - 8240 images of 32 product logos (Stefan Romberg) [Before 28/12/19]
- FORENSIC FOOTWEAR IMPRESSION - Multiple footwear impressions of 300 different pairs of shoes under varying conditions captured using an acquisition line with the help of the Austrian Police. (CVL, BKA) [1/2/21]
- General 100 Dataset - General-100 dataset contains 100 bmp-format images (with no compression), which are well-suited for super-resolution training (Dong, Chao and Loy, Chen Change and Tang, Xiaoou) [Before 28/12/19]
- Geometry2view - This dataset contains image pairs for 2-view geometry computation, including manually annotated point coordinates. (Lebeda, Matas, Chum) [Before 28/12/19]
- HairNet Dataset - 40K synthetic 3D hair models paired with 160K rendered orientation-field images, used to train real-time dense hair reconstruction from 2D inputs. (L. Hu, Y. Zhou, J. Xing et al.) [31/07/25]
- Handwritten Digit and Digit String Recognition Competition - The CVL Single Digit dataset consists of 7000 single digits (700 digits per class) written by approximately 60 different writers. The validation set has the same size but different writers (CVL, Diem, Markus; Fiel, Stefan; Garz, Angelika; Keglevic, Manuel; Kleber, Florian; Sablatnig, Robert) [1/2/21]
- Hannover Region Detector Evaluation Data Set - Feature detector evaluation sequences in multiple image resolutions from 1.5 up to 8 megapixels (Kai Cordes) [Before 28/12/19]
- High Quality Frames (HQF) dataset - The dataset contains events and ground-truth frames from a DAVIS240C that are well-exposed and minimally motion-blurred. (Stoffregen, Scheerlinck, Scaramuzza, Drummond, Barnes, Kleeman, Mahony) [27/12/2020]
- High Speed and HDR Datasets - The sequences are used in the paper "High Speed and High Dynamic Range Video with an Event Camera" and include events from an event camera and images from an RGB camera. (Rebecq, Scaramuzza, RPG UZH and ETH Zurich) [27/12/2020]
- Hillclimb and CubicGlobe datasets - a video of a rally car, separated into several independent shots (for visual tracking and modelling). (Lebeda, Hadfield, Bowden) [Before 28/12/19]
- Houston Multimodal Distracted Driving Dataset - 68 volunteers that drove the same simulated highway under four different conditions (Dcosta, Buddharaju, Khatri, and Pavlidis) [Before 28/12/19]
- HyperSpectral Salient Object Detection Dataset (HS-SOD Dataset) - Hyperspectral (visible spectrum) image data for benchmarking on salient object detection with a collection of 60 hyperspectral images with their respective ground-truth binary images and representative rendered colour images (rendered in sRGB). (Nevrez Imamoglu, Yu Oishi, Xiaoqiang Zhang, Guanqun Ding, Yuming Fang, Toru Kouyama, Ryosuke Nakamura) [Before 28/12/19]
- I3 - Yahoo Flickr Creative Commons 100M - This dataset contains metadata for 100 million Creative Commons photos and videos from Flickr. (B. Thomee, D.A. Shamma, G. Friedland et al.) [Before 28/12/19]
- ICDAR'15 Smartphone document capture and OCR competition - challenge 2 - pictures of documents captured with smartphones under various conditions of perspective, lighting, etc. The ground truth is the textual content which should be extracted. (Burie, Chazalon, Coustaty, Eskenazi, Luqman, Mehri, Nayef, Ogier, Prum and Rusinol) [Before 28/12/19]
- I-HAZE - A dehazing benchmark with real hazy and haze-free indoor images. (ethz) [Before 28/12/19]
- INTEL-TAU Dataset - The largest available dataset for illumination estimation (color constancy) research; it can also be used to study the color shading effect. (Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Jarno Nikkanen, and Moncef Gabbouj) [1/2/21]
- Intrinsic Images in the Wild (IIW) - a large-scale, public dataset for evaluating intrinsic image decompositions of indoor scenes (Sean Bell, Kavita Bala, Noah Snavely) [Before 28/12/19]
- IISc - Dissimilarity between Isolated Objects (IISc-DIO) - The dataset has a total of 26,675 perceived dissimilarity measurements made on 269 human subjects using a Visual Search task with a diverse set of objects. (RT Pramod & SP Arun, IISc) [Before 28/12/19]
- Image/video quality assessment database summary (Stefan Winkler) [Before 28/12/19]
- INRIA feature detector evaluation sequences (Krystian Mikolajczyk) [Before 28/12/19]
- INRIA's PERCEPTION's database of images and videos gathered with several synchronized and calibrated cameras (INRIA Rhone-Alpes) [Before 28/12/19]
- InStereo2K synthetic image stereo dataset - 2050 stereo pairs with high accuracy disparity maps (Bao, Wang, Xu, Guo, Hong, Zhang) [1/10/2024]
- Karlsruhe: multispectral light field dataset for light field deep learning - A synthetic multispectral light field dataset with resolution (9, 9, 512, 512, 13) of 500 randomly generated as well as 6 hand-crafted scenes (Schambach and Heizmann) [1/12/2021]
- Karlsruhe: textured multispectral light field dataset - A real-world multispectral light field dataset with resolution (9, 9, 400, 400, 13) of three real-world highly textured scenes. (Schambach and Heizmann) [1/12/2021]
- KITTI dataset for stereo, optical flow and visual odometry (Geiger, Lenz, Urtasun) [Before 28/12/19]
- Kodak Lossless True Color Image Suite - RGB images for testing image compression. (Kodak) [28/07/25]
- LabelMe images database and online annotation tool (Bryan Russell, Antonio Torralba, Kevin Murphy, William Freeman) [Before 28/12/19]
- Large scale 3D point cloud data from terrestrial LiDAR scanning (Andreas Nuechter) [Before 28/12/19]
- LFW-10 dataset for learning relative attributes - A dataset of 10,000 pairs of face images with instance-level annotations for 10 attributes. (CVIT, IIIT Hyderabad) [Before 28/12/19]
- Light-field Material Dataset - 1.2k annotated images of 12 material classes taken with the Lytro ILLUM camera (Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei Efros, Ravi Ramamoorthi) [Before 28/12/19]
- Linkoping Rolling Shutter Rectification Dataset (Per-Erik Forssen and Erik Ringaby) [Before 28/12/19]
- LIRIS-ACCEDE Dataset - a collection of video excerpts with a large content diversity annotated along affective dimensions (Technicolor) [Before 28/12/19]
- Materials in Context (MINC) - The Materials in Context Database (MINC) builds on OpenSurfaces, but includes millions of point annotations of material labels. (Sean Bell, Paul Upchurch, Noah Snavely, Kavita Bala) [Before 28/12/19]
- Mathematical Mathematics Memes - Collection of 10,000 memes on mathematics. (A. Belgaid) [28/07/25]
- MASSVIS (Massive Visualization Dataset) - Over 5K different information visualizations from a variety of sources, a subset of which have been categorized, segmented, and come with memorability and eye tracking recordings. (Borkin, Bylinskii, Kim, Oliva, Pfister) [Before 28/12/19]
- MIT/Tuebingen Saliency Benchmark - hosts the MIT300 and the CAT2000 datasets and benchmark (Borji et al) [23/2/22]
- MPI Sintel Flow Dataset A data set for the evaluation of optical flow derived from the open source 3D animated short film, Sintel. It has been extended for Stereo and disparity, Depth and camera motion, and Segmentation. (Max Planck Tubingen) [Before 28/12/19]
- MPI-Sintel optical flow evaluation dataset (Michael Black) [Before 28/12/19]
- MSR-VTT - video to text database of 200K+ video clip/sentence pairs [Before 28/12/19]
- Middlebury College stereo vision research datasets (Daniel Scharstein and Richard Szeliski) [Before 28/12/19]
- Modelling of 2D Shapes with Ellipses - The dataset contains 4,526 2D shapes included in standard as well as in home-built datasets. (Costas Panagiotakis and Antonis Argyros) [Before 28/12/19]
- MSBin - MultiSpectral Document Binarization - The dataset is dedicated to the (document image) binarization of multispectral images. (CVL, Hollaus Fabian, Brenner Simon, Sablatnig Robert) [1/2/21]
- Multi-FoV - Photo-realistic video sequences that allow benchmarking of the impact of the Field-of-View (FoV) of the camera on various vision tasks. (Zhang, Rebecq, Forster, Scaramuzza) [Before 28/12/19]
- Kubric-NK - Multi-resolution (1K, 2K, 4K, 8K) dataset based on Kubric containing annotations for optical flow, depth, and other tasks. (H. Morimitsu, X. Zhu, R. Cesar-Jr, et al.) [22/06/25]
- Multiview Stereo Evaluation - Each dataset is registered with a "ground-truth" 3D model acquired via a laser scanning process (Steve Seitz et al) [Before 28/12/19]
- Multiview stereo images with laser based groundtruth (ESAT-PSI/VISICS,FGAN-FOM,EPFL/IC/ISIM/CVLab) [Before 28/12/19]
- Open Video Project (Gary Marchionini, Barbara M. Wildemuth, Gary Geisler, Yaxiao Song) [Before 28/12/19]
- NCI Cancer Image Archive - prostate images (National Cancer Institute) [Before 28/12/19]
- NIST 3D Interest Point Detection (Helin Dutagaci, Afzal Godil) [Before 28/12/19]
- NRC-GAMMA - a novel large benchmark dataset of real-life gas meter images. (A. Ebadi, P. Paul, S. Auer, S. Tremblay) [28/07/25]
- NRCS natural resource/agricultural image database (USDA Natural Resources Conservation Service) [Before 28/12/19]
- O-HAZE - A dehazing benchmark with real hazy and haze-free outdoor images. (ethz) [Before 28/12/19]
- Object recognition dataset for domain adaptation - Consists of images from 4 different domains: Artistic images, Clip Art, Product images and Real-World images. For each domain, the dataset contains images of 65 object categories found typically in Office and Home settings. (Venkateswara Hemanth, Eusebio Jose, Chakraborty Shayok, Panchanathan Sethuraman) [Before 28/12/19]
- Object Removal - Generalized Dynamic Object Removal for Dense Stereo Vision Based Scene Mapping using Synthesised Optical Flow - Evaluation Dataset (Hamilton, O.K., Breckon, Toby P.) [Before 28/12/19]
- Occlusion detection test data (Andrew Stein) [Before 28/12/19]
- Online Video Characteristics and Transcoding Time Dataset - The dataset contains a million randomly sampled video instances listing 10 fundamental video characteristics along with the YouTube video ID. (T. Deneke) [28/07/25]
- OpenSurfaces - OpenSurfaces consists of tens of thousands of examples of surfaces segmented from consumer photographs of interiors, and annotated with material parameters, texture information, and contextual information. (Kavita Bala et al.) [Before 28/12/19]
- OSIE - Object and Semantic Images and Eye-tracking - 700 images, 5551 segmented objects, eye tracking data (Xu, Jiang, Wang, Kankanhalli, Zhao) [Before 28/12/19]
- Osnabrück gaze tracking data - 318 video sequences from several different gaze tracking data sets with polygon based object annotation (Schöning, Faion, Heidemann, Krumnack, Gert, Açik, Kietzmann, Heidemann & König) [Before 28/12/19]
- OTIS: Open Turbulent Image Set - several sequences (either static or dynamic) of long distance imaging through a turbulent atmosphere (Jerome Gilles, Nicholas B. Ferrante) [Before 28/12/19]
- PanoNavi dataset - A panoramic dataset for robot navigation, consisting of 5 videos lasting about 1 hour. (Lingyan Ran) [Before 28/12/19]
- PAVIS Leadership Corpus - Automatic detection of emergent leaders and their leadership styles in a meeting environment. (C. Beyan, N. Carissimi, S.Vascon, M. Bustreo, F. Capozzi, A. Pierro, C. Becchio, and V. Murino) [1/2/21]
- PCB DSLR DATASET - The PCB DSLR dataset is meant to facilitate research on computer-vision-based Printed Circuit Board (PCB) analysis, with a focus on recycling-related applications (CVL, Martin Kampel, Christopher Pramerdorfer) [1/2/21]
- PetroSurf3D - 26 high resolution (sub-millimeter accuracy) 3D scans of rock art with pixelwise labeling of petroglyphs for segmentation (Poier, Seidl, Zeppelzauer, Reinbacher, Schaich, Bellandi, Marretta, Bischof) [Before 28/12/19]
- PharmaPack - Public dataset of pharmaceutical packages enrolled by mobile phones. (O. Taran and S. Rezaeifar, et al.) [28/07/25]
- PHOS (illumination invariance dataset) - 15 scenes, each captured under 15 different illumination conditions (Vassilios Vonikakis) [Before 28/12/19]
- PIRM - perceptual quality of super-resolution benchmark (Blau, Y., Mechrez, R., Timofte, R., Michaeli, T., Zelnik-Manor, L) [Before 28/12/19]
- PittsStereo-RGBNIR - A Large RGB-NIR Stereo Dataset Collected in Pittsburgh with challenging Materials. (Tiancheng Zhi, Bernardo R. Pires, Martial Hebert and Srinivasa G. Narasimhan) [Before 28/12/19]
- PRINTART: Artistic images of prints of well known paintings, including detail annotations. A benchmark for automatic annotation and retrieval tasks with this database was published at ECCV. (Nuno Miguel Pinho da Silva) [Before 28/12/19]
- Pics 'n' Trails - Dataset of Continuously archived GPS and digital photos (Gamhewage Chaminda de Silva) [Before 28/12/19]
- Pitt Image and Video Advertisement Understanding - rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer (Hussain, Zhang, Zhang, Ye, Thomas, Agha, Ong, Kovashka, University of Pittsburgh) [Before 28/12/19]
- Quantum simulations of an electron in a two dimensional potential well - Labelled images of raw input to a simulation of 2D quantum mechanics. (K. Mills, M.A. Spanner, I. Tamblyn) [28/07/25]
- RAWSEEDS SLAM benchmark datasets (Rawseeds Project) [Before 28/12/19]
- READ ABP WI Dataset - Writer Identification over decades - A person's handwriting is usually considered a unique characteristic; however, it may change slightly over their lifespan. This dataset covers the evolution of the handwriting of a single person (CVL, Bistum Passau, READ) [1/2/21]
- Real Low-Light Image Noise Reduction - It contains pixel and intensity aligned pairs of images corrupted by low-light camera noise and their low-noise counterparts. (J. Anaya, A. Barbu) [Before 28/12/19]
- Real-World Federated Visual Classification Dataset (Landmarks-User and iNaturalist-User) - Realistic per-user data for practical real-world federated learning (Hsu, Qi, Brown) [27/12/2020]
- RGB-DAVIS Dataset - The dataset contains indoor and outdoor sequences involving camera motion and/or scene motion collected with a RGB-DAVIS imaging system. (Wang, Duan, Cossairt, Katsaggelos, Huang, Shi) [27/12/2020]
- Roaming Panda - a set of images with ground-truth explanations for explaining image classifiers (Sun, Chockler, Huang, Kroening) [26/12/2020]
- ROMA (ROad MArkings) : Image database for the evaluation of road markings extraction algorithms (Jean-Philippe Tarel, et al) [Before 28/12/19]
- Robotic 3D Scan Repository - 3D point clouds from robotic experiments of scenes (Osnabruck and Jacobs Universities) [Before 28/12/19]
- Rolling Shutter Rectification Dataset - Rectifying rolling shutter video from hand-held devices (Per-Erik etc.) [Before 28/12/19]
- RRC-60 Roman Republican Coins dataset - contains 6000 images of obverse and corresponding 6000 images of reverse sides of 60 coin types from Roman Republican period (Sinem Aslan) [3/1/20]
- SALAMI - A novel dataset of Subjective Assessments of Legibility in Ancient Manuscript Images (SALAMI), serving as a ground truth for the development of quantitative evaluation metrics in the field of digital text restoration. (CVL, Simon Brenner) [1/2/21]
- SALICON - Saliency in Context eye tracking dataset, c. 1000 images with eye-tracking data in 80 image classes. (Jiang, Huang, Duan, Zhao) [Before 28/12/19]
- SceneFlow synthetic stereo dataset - 39000+ stereo frames in 960x540 pixel resolution (Mayer, Ilg, et al) [1/10/2024]
- Scripps Plankton Camera System - thousands of images of c. 50 classes of plankton and other small marine objects (Jaffe et al) [Before 28/12/19]
- ScriptNet: ICDAR2017 Competition on Historical Document Writer Identification (Historical-WI) - The dataset consists of 4782 handwritten pages written by more than 1100 writers and dating from the 13th to 20th century. (Fiel Stefan, Kleber Florian, Diem Markus, Christlein Vincent, Louloudis Georgios, Stamatopoulos Nikos, Gatos Basili) [Before 28/12/19]
- Seam Carving JPEG Image Database - Our seam-carving-based forgery database contains 500 untouched JPEG images and 500 JPEG images that were manipulated by seam-carving, both at the quality of 75 (Qingzhong Liu) [Before 28/12/19]
- SELECT - large-scale benchmark of curation strategies for image classification (Feuer, Xu, Cohen, Yubeaton, Mittal, Hegde) [28/10/24]
- SIDIRE - SIDIRE is a freely available image dataset of synthetically generated images that allows investigation of the influence of illumination changes on object appearance. (CVL, Sebastian Zambanini) [1/2/21]
- Smartphone document capture and OCR 2015 - Quality Assessment - pictures of documents captured with smartphones under various conditions of perspective, lighting, etc. It also features text ground truth and OCR accuracies to train and test document image quality assessment systems. (Nayef, Luqman, Prum, Eskenazi, Chazalon, and Ogier) [Before 28/12/19]
- Smartphone document capture and OCR 2017 - mobile video capture - video recording of documents, along with the reference ground truth image to reconstruct using the video stream. (Chazalon, Gomez-Kramer, Burie, Coustaty, Eskenazi, Luqman, Nayef, Rusinol, Sidere, and Ogier) [Before 28/12/19]
- Stony Brook University Real-World Clutter Dataset (SBU-RwC90) - Images with different levels of clutter, ranked by humans (Chen-Ping Yu, Dimitris Samaras, Gregory Zelinsky) [Before 28/12/19]
- Street-View Change Detection with Deconvolutional Networks - Database with aligned image pairs from street-view imagery with structural, lighting, weather and seasonal changes. (Pablo F. Alcantarilla, Simon Stent, German Ros, Roberto Arroyo and Riccardo Gherardi) [Before 28/12/19]
- STVD/PVCD - partial video copy detection - 83K videos, more than 10k hours, more than 420k video copy pairs. (Le, Delalandre and Conte) [3/10/2022]
- SydneyHouse - Streetview house images with accurate 3D house shape, facade object label, dense point correspondence, and annotation toolbox. (Hang Chu, Shenlong Wang, Raquel Urtasun, Sanja Fidler) [Before 28/12/19]
- SYNTHIA - Large set (~half million) of virtual-world images for training autonomous cars to see. (ADAS Group at Computer Vision Center) [Before 28/12/19]
- Stony Brook University Shadow Dataset (SBU-Shadow5k) - Large scale shadow detection dataset from a wide variety of scenes and photo types, with human annotations (Tomas F.Y. Vicente, Le Hou, Chen-Ping Yu, Minh Hoai, Dimitris Samaras) [Before 28/12/19]
- Technicolor Interestingness Dataset - a collection of movie excerpts and key-frames and their corresponding ground-truth files based on the classification into interesting and non-interesting samples (Technicolor) [Before 28/12/19]
- Technicolor Hannah Dataset - 153,825 frames from the movie "Hannah and her sisters" annotated for several types of audio and visual information (Technicolor) [Before 28/12/19]
- Technicolor HR-EEG4EMO Dataset - EEG and other physiological recordings of 40 subjects collected during the viewing of neutral and emotional videos (Technicolor) [Before 28/12/19]
- Technicolor VSD Violent Scenes Dataset - a collection of ground-truth files based on the extraction of violent events in movies (Technicolor) [Before 28/12/19]
- The SUPATLANTIQUE dataset - Images of scanned official and Wikipedia documents. (C. Ben Rabah et al.) [28/07/25]
- TMAGIC dataset - Several video sequences for visual tracking, containing strong out-of-plane rotation (Lebeda, Hadfield, Bowden) [Before 28/12/19]
- Totally Looks Like - A benchmark for assessment of predicting human-based image similarity (Amir Rosenfeld, Markus D. Solbach, John Tsotsos) [Before 28/12/19]
- Toulouse Vanishing Points Dataset - a dataset of Manhattan scenes for vanishing point estimation which also provides, for each image, the IMU data of the camera orientation. (Vincent Angladon and Simone Gasparini) [Before 28/12/19]
- TUM RGB-D Benchmark - Dataset and benchmark for the evaluation of RGB-D visual odometry and SLAM algorithms (Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard and Daniel Cremers) [Before 28/12/19]
- UCL Ground Truth Optical Flow Dataset (Oisin Mac Aodha) [Before 28/12/19]
- Univ of Genoa Datasets for disparity and optic flow evaluation (Manuela Chessa) [Before 28/12/19]
- Validation and Verification of Neural Network Systems (Francesco Vivarelli) [Before 28/12/19]
- Very Long Baseline Interferometry Image Reconstruction Dataset (MIT CSAIL) [Before 28/12/19]
- VIDIT: Virtual Image Dataset for Illumination Transfer - 390 1024x1024 virtual scenes, each captured with 40 illumination settings that are all the combinations of 5 color temperatures (2500K, 3500K, 4500K, 5500K and 6500K) and 8 light directions (N, NE, E, SE, S, SW, W, NW), resulting in 15,600 images split into train/validation/test sets (a small enumeration sketch of these settings appears after this list). (El Helou, Zhou, Barthas, Susstrunk) [27/12/2020]
- Virtual KITTI - 40 high-resolution videos (17,008 frames) generated from five different virtual worlds, for : object detection and multi-object tracking, scene-level and instance-level semantic segmentation, optical flow, and depth estimation (Gaidon, Wang, Cabon, Vig) [Before 28/12/19]
- VIST - First dataset for sequential vision-to-language. The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST was previously known as SIND, the Sequential Image Narrative Dataset. (Microsoft Research) [28/07/25]
- Visual Object Tracking challenge - This challenge is held annually as an ICCV/ECCV workshop, with a new dataset and an updated evaluation kit every year. (Kristan et al.) [Before 28/12/19]
- vsenseVVAR: V-SENSE Volumetric Video in Augmented Reality Database - A collection of movement patterns of 20 participants for 2 volumetric videos shown in an augmented reality scenario. (Zerman, Kulkarni, Smolic) [22/11/2021]
- vsenseLFVA: V-SENSE Light Field Visual Attention Dataset - A collection of eye tracking data of 21 participants for 17 different light fields which includes various refocus renders. (Gill, Zerman, Ozcinar, Smolic) [22/11/2021]
- Website Screenshots - synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. (Dwyer) [05/06/25]
- WHOI-Plankton - 3.5 million images of microscopic marine plankton on 103 categories (Olson, Sosik) [Before 28/12/19]
- WILD: Weather and Illumination Database (S. Narasimhan, C. Wang, S. Nayar, D. Stolyarov, K. Garg, Y. Schechner, H. Peri) [Before 28/12/19]
- YACCLAB dataset - YACCLAB dataset includes both synthetic and real binary images (Grana, Costantino; Bolelli, Federico; Baraldi, Lorenzo; Vezzani, Roberto) [Before 28/12/19]
- YtLongTrack - This dataset contains two video sequences with challenges such as low quality, extreme length and full occlusions, including manually annotated per-frame pose. (Lebeda, Hadfield, Matas, Bowden) [Before 28/12/19]
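To make the image count in the VIDIT entry above concrete, the short Python sketch below enumerates the 40 illumination settings (5 color temperatures x 8 light directions) and the resulting 15,600-image total over 390 scenes; the values are taken directly from the entry, and no assumption is made about the dataset's file naming or directory layout.

    from itertools import product

    color_temps_k = [2500, 3500, 4500, 5500, 6500]               # color temperatures (Kelvin)
    directions = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]    # light directions

    # Every (temperature, direction) combination is one illumination setting.
    settings = list(product(color_temps_k, directions))
    assert len(settings) == 40

    scenes = 390
    total_images = scenes * len(settings)
    print(total_images)   # 15600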
Acknowledgements: Many thanks to all of the contributors for their suggestions of databases.
Will B.Y. Kim, Zeal Rathod, Can Pu and Hanz Cuevas Velasquez were very helpful with the updating of this web page.

© 2025 Robert Fisher, School of Informatics, Univ. of Edinburgh