NVIDIA Releases 2+ Petabytes of Open AI Training Data on HuggingFace
Summary
- NVIDIA has published 2+ petabytes of open AI training data across 180+ datasets
- Physical AI Collection includes 500K+ robotics trajectories and 1,700+ hours of multi-sensor AV data
- Synthetic Nemotron Personas drove CrowdStrike NL-to-CQL accuracy from 50.7% to 90.4%
- Open data initiative aims to cut the data-acquisition bottleneck that costs organizations millions of dollars before any training begins
Details
2+ petabytes across 180+ datasets and 650+ open models on HuggingFace and GitHub
The scale positions NVIDIA as one of the largest institutional contributors of open AI training data, rivaling academic consortia and hyperscaler open-data programs.
Physical AI Collection: 500K+ robotics trajectories, 57M grasps, 15TB multimodal data — downloaded 10M+ times
This data underpins the GR00T reasoning vision-language-action model. Runway used GR00T data to build GWM-Robotics, a generative world model for robotics — demonstrating real downstream commercial use. The full Physical AI Collection has been downloaded over 10 million times.
AV dataset: 1,700+ hours across 25 countries, 2,500+ cities with 7-camera, LiDAR, and radar fusion
Multi-sensor, multi-geography coverage at this scale is rare in open data. The breadth supports perception benchmarking across varied driving environments and complements academic datasets with broader commercial usability.
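A typical first step in working with this kind of multi-sensor data is aligning the camera, LiDAR, and radar streams by timestamp, since each sensor samples at its own rate. The sketch below is purely illustrative and assumes nothing about NVIDIA's actual data format; the sensor rates and helper names are hypothetical.

```python
from bisect import bisect_left

def nearest(timestamps: list[float], t: float) -> float:
    """Return the timestamp in a sorted list closest to t."""
    i = bisect_left(timestamps, t)
    candidates = timestamps[max(0, i - 1):i + 1]
    return min(candidates, key=lambda x: abs(x - t))

# Synthetic example: camera frames at 30 Hz, LiDAR sweeps at 10 Hz, 3 seconds each.
camera_ts = [round(i / 30, 4) for i in range(90)]
lidar_ts = [round(i / 10, 4) for i in range(30)]

# For each LiDAR sweep, pick the closest camera frame; with these rates the
# nearest frame is always within half a camera period (~16.7 ms).
pairs = [(lt, nearest(camera_ts, lt)) for lt in lidar_ts]
assert all(abs(lt - ct) <= 1 / 60 + 1e-9 for lt, ct in pairs)
```

Real fusion pipelines go further (extrinsic calibration, motion compensation), but nearest-timestamp pairing is the common starting point for benchmarking perception across streams like these.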
Nemotron Personas: synthetic demographic datasets at population scale across 5 countries
Persona counts range from 888K (Singapore) to 21M (India). Synthetic personas grounded in real demographic distributions allow fine-tuning for culturally and linguistically specific tasks without collecting sensitive personal data.
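The persona-grounding idea described above can be sketched in a few lines: sample persona attributes from real demographic marginals, then condition a fine-tuning prompt on the sampled persona. The field names, weights, and prompt template below are hypothetical illustrations, not NVIDIA's actual schema or methodology.

```python
import random

random.seed(0)

# Hypothetical marginal distributions (values and weights are made up).
DEMOGRAPHICS = {
    "age_band": (["18-29", "30-44", "45-64", "65+"], [0.25, 0.30, 0.30, 0.15]),
    "occupation": (["nurse", "farmer", "engineer", "teacher"], [0.2, 0.3, 0.2, 0.3]),
    "language": (["English", "Hindi", "Tamil"], [0.5, 0.35, 0.15]),
}

def sample_persona() -> dict:
    """Draw one synthetic persona from the marginal distributions."""
    return {field: random.choices(values, weights=weights, k=1)[0]
            for field, (values, weights) in DEMOGRAPHICS.items()}

def to_training_prompt(persona: dict, task: str) -> str:
    """Condition a task instruction on the sampled persona."""
    desc = ", ".join(f"{k.replace('_', ' ')}: {v}" for k, v in persona.items())
    return f"You are role-playing a person ({desc}). {task}"

for persona in (sample_persona() for _ in range(3)):
    print(to_training_prompt(persona, "Describe your typical workday."))
```

Because sampling follows the population-level distributions rather than copying real individuals, the resulting training examples carry demographic and linguistic diversity without containing sensitive personal data.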
CrowdStrike: NL→CQL accuracy jumped from 50.7% to 90.4% using 2M Nemotron personas
A 39.7 percentage point accuracy gain on a security-critical NL-to-query translation task, validating the synthetic persona approach for enterprise fine-tuning at scale.
NTT Data and APTO: legal QA from 15.3% to 79.3%; attack success rate from 7% to 0%
Dual improvement — domain accuracy and adversarial robustness — from a single synthetic dataset suggests the persona grounding adds both task-specific signal and safety-relevant diversity.
Nemotron-Nano-9B-v2-Japanese topped the Nejumi leaderboard using NVIDIA's Japanese persona dataset
Topping a national-language benchmark with a sub-10B parameter model suggests that high-quality synthetic data can substitute for large-scale human-annotated corpora, lowering compute and annotation costs for sovereign AI development.
La Proteina: fully synthetic protein biology dataset
Protein data is notoriously expensive and slow to generate experimentally. A synthetic alternative could accelerate drug discovery and protein engineering for teams without wet-lab resources.
NVIDIA frames the initiative as solving the data bottleneck that currently costs organizations millions of dollars and months of work before any training begins
By commoditizing training data, NVIDIA reduces friction for organizations to start AI projects — which in practice means more workloads running on NVIDIA hardware and more adoption of NVIDIA's model and tooling ecosystem.
What This Means
NVIDIA is making a deliberate push to own not just the compute layer of AI but the data layer as well, releasing over 2 petabytes of open training data spanning robotics, autonomous vehicles, synthetic personas, and biology. The documented results from CrowdStrike, NTT Data, and Runway show this is not theoretical — enterprises are already using the data to achieve substantial accuracy gains in production use cases. For the broader AI ecosystem, this lowers the barrier to entry for organizations that previously faced months of expensive data acquisition before any model training could begin. It also deepens lock-in to NVIDIA's stack: teams that build on NVIDIA's open data are more likely to adopt NVIDIA's models, tools, and ultimately its hardware.
Sources
- How NVIDIA Builds Open Data for AI (Hugging Face)
