NVIDIA Releases 2+ Petabytes of Open AI Training Data on HuggingFace
Summary
- NVIDIA has published 2+ petabytes of open AI training data across 180+ datasets
- Physical AI Collection includes 500K+ robotics trajectories and 1,700+ hours of multi-sensor AV data
- Synthetic Nemotron Personas drove CrowdStrike NL-to-CQL accuracy from 50.7% to 90.4%
- Open data initiative aims to cut the data-acquisition bottleneck that costs organizations millions of dollars before any training begins
Details
2+ petabytes across 180+ datasets and 650+ open models on HuggingFace and GitHub
The scale positions NVIDIA as one of the largest institutional contributors of open AI training data, rivaling academic consortia and hyperscaler open-data programs.
Physical AI Collection: 500K+ robotics trajectories, 57M grasps, 15TB multimodal data — downloaded 10M+ times
This data underpins the GR00T reasoning vision-language-action model. Runway used GR00T data to build GWM-Robotics, a generative world model for robotics — demonstrating real downstream commercial use. The full Physical AI Collection has been downloaded over 10 million times.
AV dataset: 1,700+ hours across 25 countries, 2,500+ cities with 7-camera, LiDAR, and radar fusion
Multi-sensor, multi-geography coverage at this scale is rare in open data. The breadth supports perception benchmarking across varied driving environments and complements academic datasets with broader commercial usability.
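A typical first step in working with this kind of multi-sensor data is aligning the camera, LiDAR, and radar streams by timestamp, since each sensor samples at its own rate. The sketch below is purely illustrative and assumes nothing about NVIDIA's actual data format; the sensor rates and helper names are hypothetical.

```python
from bisect import bisect_left

def nearest(timestamps: list[float], t: float) -> float:
    """Return the timestamp in a sorted list closest to t."""
    i = bisect_left(timestamps, t)
    candidates = timestamps[max(0, i - 1):i + 1]
    return min(candidates, key=lambda x: abs(x - t))

# Synthetic example: camera frames at 30 Hz, LiDAR sweeps at 10 Hz, 3 seconds each.
camera_ts = [round(i / 30, 4) for i in range(90)]
lidar_ts = [round(i / 10, 4) for i in range(30)]

# For each LiDAR sweep, pick the closest camera frame; with these rates the
# nearest frame is always within half a camera period (~16.7 ms).
pairs = [(lt, nearest(camera_ts, lt)) for lt in lidar_ts]
assert all(abs(lt - ct) <= 1 / 60 + 1e-9 for lt, ct in pairs)
```

Real fusion pipelines go further (extrinsic calibration, motion compensation), but nearest-timestamp pairing is the common starting point for benchmarking perception across streams like these.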
Nemotron Personas: synthetic demographic datasets at population scale across 5 countries
Persona counts range from 888K (Singapore) to 21M (India). Synthetic personas grounded in real demographic distributions allow fine-tuning for culturally and linguistically specific tasks without collecting sensitive personal data.
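The persona-grounding idea described above can be sketched in a few lines: sample persona attributes from real demographic marginals, then condition a fine-tuning prompt on the sampled persona. The field names, weights, and prompt template below are hypothetical illustrations, not NVIDIA's actual schema or methodology.

```python
import random

random.seed(0)

# Hypothetical marginal distributions (values and weights are made up).
DEMOGRAPHICS = {
    "age_band": (["18-29", "30-44", "45-64", "65+"], [0.25, 0.30, 0.30, 0.15]),
    "occupation": (["nurse", "farmer", "engineer", "teacher"], [0.2, 0.3, 0.2, 0.3]),
    "language": (["English", "Hindi", "Tamil"], [0.5, 0.35, 0.15]),
}

def sample_persona() -> dict:
    """Draw one synthetic persona from the marginal distributions."""
    return {field: random.choices(values, weights=weights, k=1)[0]
            for field, (values, weights) in DEMOGRAPHICS.items()}

def to_training_prompt(persona: dict, task: str) -> str:
    """Condition a task instruction on the sampled persona."""
    desc = ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in persona.items())
    return f"You are role-playing a person ({desc}). {task}"

for persona in (sample_persona() for _ in range(3)):
    print(to_training_prompt(persona, "Describe your typical workday."))
```

Because sampling follows the population-level distributions rather than copying real individuals, the resulting training examples carry demographic and linguistic diversity without containing sensitive personal data.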
CrowdStrike: NL→CQL accuracy jumped from 50.7% to 90.4% using 2M Nemotron personas
A 39.7 percentage point accuracy gain on a security-critical NL-to-query translation task, validating the synthetic persona approach for enterprise fine-tuning at scale.
NTT Data and APTO: legal QA from 15.3% to 79.3%; attack success rate from 7% to 0%
Dual improvement — domain accuracy and adversarial robustness — from a single synthetic dataset suggests the persona grounding adds both task-specific signal and safety-relevant diversity.
Nemotron-Nano-9B-v2-Japanese topped the Nejumi leaderboard using NVIDIA's Japanese persona dataset
Topping a national-language benchmark with a sub-10B parameter model suggests that high-quality synthetic data can substitute for large-scale human-annotated corpora, lowering compute and annotation costs for sovereign AI development.
La Proteina: fully synthetic protein biology dataset
Protein data is notoriously expensive and slow to generate experimentally. A synthetic alternative could accelerate drug discovery and protein engineering for teams without wet-lab resources.
NVIDIA frames the initiative as solving the data bottleneck that currently costs organizations millions of dollars and months of work before any training begins
By commoditizing training data, NVIDIA reduces friction for organizations to start AI projects — which in practice means more workloads running on NVIDIA hardware and more adoption of NVIDIA's model and tooling ecosystem.
What This Means
NVIDIA is making a deliberate push to own not just the compute layer of AI but the data layer as well, releasing over 2 petabytes of open training data spanning robotics, autonomous vehicles, synthetic personas, and biology. The documented results from CrowdStrike, NTT Data, and Runway show this is not theoretical — enterprises are already using the data to achieve substantial accuracy gains in production use cases. For the broader AI ecosystem, this lowers the barrier to entry for organizations that previously faced months of expensive data acquisition before any model training could begin. It also deepens lock-in to NVIDIA's stack: teams that build on NVIDIA's open data are more likely to adopt NVIDIA's models, tools, and ultimately its hardware.
Sources
- How NVIDIA Builds Open Data for AI (Hugging Face)
