Building a Code Dataset Pipeline from NVIDIA Nemotron-Pretraining-Code-v3 Metadata with Streaming, Pandas, and tiktoken
Summary
This tutorial shows how to stream NVIDIA's Nemotron-Pretraining-Code-v3 metadata index without downloading the full multi-gigabyte dataset. It draws a shuffled 30,000-record sample, derives features such as file extension and directory depth, and visualizes the top languages, extensions, and repositories along with the distribution of directory nesting. The workflow then reconstructs raw GitHub URLs from three metadata fields (repo, commit_id, rel_path) and attempts to fetch the actual source files, handling missing or deleted repositories gracefully. Finally, it filters the sample down to Python files and estimates token counts with tiktoken; for scale, the full dataset holds approximately 173 billion tokens across 146 million files. Processed outputs are saved as Parquet and JSONL for reuse.
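The sampling and feature-derivation steps might look roughly like the sketch below. It is not the tutorial's own code: the function names `stream_sample` and `derive_features`, the seed, the shuffle buffer size, the `train` split, and the Hub identifier shown in the usage comment are all assumptions. Only the `rel_path` field name comes from the summary.

```python
from itertools import islice
from pathlib import PurePosixPath


def derive_features(rel_path: str) -> dict:
    """Derive simple path features from a repository-relative file path."""
    path = PurePosixPath(rel_path)
    return {
        "extension": path.suffix.lower(),  # "" when the file has no extension
        "dir_depth": len(path.parts) - 1,  # number of parent directories
    }


def stream_sample(dataset_id: str, n: int = 30_000, seed: int = 42,
                  buffer_size: int = 10_000, split: str = "train"):
    """Yield up to n shuffled records without downloading the whole dataset."""
    # Imported lazily so derive_features works without the `datasets` package.
    from datasets import load_dataset

    stream = load_dataset(dataset_id, split=split, streaming=True)
    # Streaming shuffle is approximate: records are drawn from a rolling buffer.
    stream = stream.shuffle(seed=seed, buffer_size=buffer_size)
    yield from islice(stream, n)


# Hypothetical usage (dataset ID and any required config name are assumptions):
#   import pandas as pd
#   df = pd.DataFrame(stream_sample("nvidia/Nemotron-Pretraining-Code-v3"))
#   features = pd.DataFrame(df["rel_path"].map(derive_features).tolist())
#   df = pd.concat([df, features], axis=1)
#   print(df["extension"].value_counts().head(10))
```

Because the shuffle works through a fixed-size buffer, the 30,000 rows are a convenient sample rather than a strictly uniform one; a larger `buffer_size` trades memory for better mixing.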
Key takeaways
Streaming avoids downloading the full multi-gigabyte Nemotron-Pretraining-Code-v3 dataset; the analysis runs on a shuffled 30,000-row sample instead.
Metadata analysis reveals the dominant languages (e.g., Python, JavaScript), the most common file extensions (.py, .js), how often each repository appears, and the typical directory depth.
Raw GitHub URLs are reconstructed from repo, commit_id, and rel_path, and the tutorial attempts to fetch a few real source files, handling missing or deleted repositories along the way.
The sample is filtered down to Python files, and tiktoken is used to estimate token counts for the fetched code; for comparison, the full dataset contains roughly 173 billion tokens.
Processed outputs are saved as Parquet and JSONL files, so they can be reused without re-streaming the dataset.
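The URL reconstruction, fetching, token estimation, and saving steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tutorial's implementation: it assumes `repo` has the form `owner/name`, that files are served from `raw.githubusercontent.com`, and that the `cl100k_base` encoding is an acceptable proxy for token counts. The helper names and output file stem are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen


def raw_github_url(repo: str, commit_id: str, rel_path: str) -> str:
    """Build a raw-content URL, assuming repo looks like 'owner/name'."""
    # quote() percent-encodes spaces and similar characters but keeps "/".
    return (
        "https://raw.githubusercontent.com/"
        f"{repo.strip('/')}/{commit_id}/{quote(rel_path)}"
    )


def fetch_source(url: str, timeout: float = 10.0):
    """Return the file text, or None if the repo or file is gone."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # HTTPError (e.g. 404), URLError, and socket timeouts all derive
        # from OSError, so deleted or private repos end up here.
        return None


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Estimate the token count; the encoding choice is an assumption."""
    import tiktoken  # imported lazily so the URL helpers work without it

    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() treats special-token strings as ordinary text.
    return len(enc.encode(text, disallowed_special=()))


def save_outputs(df, stem: str = "nemotron_sample") -> None:
    """Write a pandas DataFrame as Parquet and as JSON Lines."""
    df.to_parquet(f"{stem}.parquet", index=False)  # needs pyarrow or fastparquet
    df.to_json(f"{stem}.jsonl", orient="records", lines=True)
```

Returning `None` from `fetch_source` instead of raising keeps a batch loop simple: rows whose source is no longer available can be dropped with a single filter before tokens are counted.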