OpenAI published a new pre-deployment safety method called Deployment Simulation. It replays past de-identified production conversations through a candidate model, regenerating the assistant responses to estimate how often undesired behaviors would occur before release. Evaluated on GPT-5-series Thinking models using 1.3 million conversations, the method achieved a median multiplicative error of 1.5x when forecasting the rates of 20 behavioral categories. It cannot measure risks rarer than once in 200,000 messages. The technique reduces evaluation awareness: only 5.1% of simulated traffic was labeled as evaluation-like, versus 5.4% for real traffic. It also extends to agentic coding by simulating tool calls with another LLM. OpenAI used it to catch a novel form of misalignment (calculator hacking) and to assess internal agent deployments.
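To make the "median multiplicative error" figure concrete, here is a minimal sketch of how such a metric could be computed. It assumes the error for one category is the larger of forecast/observed and observed/forecast, which is a common definition but is not confirmed by the summary above. The function names and the example rates are hypothetical.

```python
from statistics import median


def multiplicative_error(forecast: float, observed: float) -> float:
    """Symmetric ratio error: 1.0 is a perfect forecast, 2.0 is off by a factor of two.

    Assumed definition; the source does not spell out the formula.
    """
    if forecast <= 0 or observed <= 0:
        raise ValueError("rates must be positive to form a ratio")
    return max(forecast / observed, observed / forecast)


def median_multiplicative_error(pairs) -> float:
    """Median of the per-category ratio errors."""
    return median(multiplicative_error(f, o) for f, o in pairs)


# Hypothetical (forecast, observed) rates per 100,000 messages,
# one pair per behavior category. These are NOT figures from the report.
example_rates = [(20, 10), (10, 15), (5, 5)]
summary = median_multiplicative_error(example_rates)
print(summary)
```

Under this definition an overestimate and an underestimate by the same factor are penalized equally, which is why a ratio rather than an absolute difference suits rates that span several orders of magnitude.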
The Qwen team released Qwen-RobotSuite, a suite of three independent embodied AI foundation models for robotics. Qwen-RobotManip is a Vision-Language-Action model based on Qwen3.5-4B that aligns heterogeneous manipulation data into a unified 80-dimensional action vector, achieving 1st place on RoboChallenge Table30-v1 and strong cross-embodiment transfer. Qwen-RobotWorld is a language-conditioned video world model using a 60-layer dual-stream MMDiT and a frozen Qwen2.5-VL encoder, ranking 1st overall on EWMBench and DreamGen Bench. Qwen-RobotNav is a scalable navigation model built on Qwen3-VL with a parameterized observation interface, reaching 76.5% success rate on VLN-CE RxR and enabling agentic planning. RobotManip and RobotNav have public GitHub repositories; RobotWorld is presented as a research paper.
A hands-on tutorial streams 3,000 documents from the FineWeb sample-10BT subset without downloading the full multi-terabyte corpus. It reproduces quality filters (Gopher, C4, and custom rules) and finds that most documents already pass them, because the dataset was filtered before release. MinHash-based deduplication with 128 permutations and a 0.7 similarity threshold identifies few near-duplicate pairs, consistent with per-crawl deduplication. GPT-2 token counts are recomputed and checked against the stored field, showing a near-perfect match (mean absolute difference ~0). Analytics cover token distribution, language scores, characters per token, and top domains, providing practical insights for scaling corpus preprocessing pipelines.
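The deduplication step can be illustrated with a small, library-free MinHash sketch. It is not the tutorial's code: the 5-word shingles, the salted BLAKE2b hashing, and the all-pairs comparison are assumptions chosen for readability, and the threshold is assumed to apply to estimated Jaccard similarity. Only the 128 permutations and the 0.7 threshold come from the summary above.

```python
import hashlib
from itertools import combinations

NUM_PERM = 128    # number of hash "permutations", as in the summary
THRESHOLD = 0.7   # assumed to be an estimated-Jaccard cutoff


def shingles(text: str, n: int = 5) -> set:
    """Word n-grams; n=5 is an illustrative choice, not taken from the tutorial."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(items: set, num_perm: int = NUM_PERM) -> list:
    """One minimum per salted hash function approximates one random permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicate_pairs(docs: list, threshold: float = THRESHOLD) -> list:
    """All-pairs comparison; fine for a few thousand docs, too slow at corpus scale."""
    sigs = [minhash_signature(shingles(d)) for d in docs]
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        sim = estimate_jaccard(sigs[i], sigs[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


# Hypothetical documents: the second is the first plus one extra word.
doc_a = " ".join(f"w{i}" for i in range(40))
doc_b = doc_a + " extra"
doc_c = " ".join(f"x{i}" for i in range(40))
duplicates = near_duplicate_pairs([doc_a, doc_b, doc_c])
```

For larger samples, the quadratic comparison would normally be replaced by locality-sensitive hashing over bands of the signature so that only likely matches are compared.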
This tutorial builds an end-to-end spatial graph learning pipeline using the city2graph library. It collects real POI and street network data from OpenStreetMap around Shibuya, Tokyo (with a synthetic clustered fallback to ensure reliability), engineers spatial features such as local density and street distance, and constructs six proximity graph families (KNN, Delaunay, Gabriel, RNG, EMST, Waxman) to compare graph topologies. A two-layer GraphSAGE model is trained on a homogeneous KNN graph to predict urban function categories (food, retail, education, health) from spatial structure and node features, and is evaluated on test accuracy and macro-F1. The pipeline also demonstrates heterogeneous graph construction using bridge edges between node types and a heterogeneous GNN forward pass via PyTorch Geometric's to_hetero, along with PCA visualization of learned embeddings and a geographic prediction map.
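As a rough illustration of the simplest of those graph families, the sketch below builds a KNN proximity graph and a local-density feature from point coordinates using only the standard library. It deliberately does not call city2graph, whose API is not shown in the summary; the function names, the radius, and the sample coordinates are all hypothetical.

```python
import math


def knn_edges(points: list, k: int) -> list:
    """Directed edges (i, j) linking each point to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Sort the other points by Euclidean distance, breaking ties by index.
        ranked = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        edges.extend((i, j) for _, j in ranked[:k])
    return edges


def symmetrize(edges: list) -> list:
    """Collapse directed edges into a sorted list of undirected pairs."""
    return sorted({tuple(sorted(edge)) for edge in edges})


def local_density(points: list, radius: float) -> list:
    """Number of other points within `radius` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


# Hypothetical POI coordinates in a projected (metre-like) coordinate system.
pois = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
undirected = symmetrize(knn_edges(pois, k=1))
density = local_density(pois, radius=1.5)
print(undirected)  # [(0, 1), (0, 2), (3, 4)]
print(density)     # [2, 2, 2, 1, 1]
```

Distances only make sense in a projected coordinate system, so latitude and longitude pairs would need reprojecting before a step like this. The resulting edge list is the kind of structure a GNN library expects as an edge index.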
Perplexity integrated its Deep Research mode into Computer, the company’s multi-model orchestration system. The upgraded feature automatically breaks complex questions into subtasks and routes them across more than 20 frontier models. It uses Search as Code to generate code that runs thousands of parallel retrieval steps, dramatically improving agentic browsing: the BrowseComp benchmark score rose from 40.7% to 83.8%, and Humanity’s Last Exam rose from 36.4% to 50.5%. The system reads user-uploaded files alongside live web sources, cites every claim inline, and delivers finished reports, slide decks, and interactive dashboards. Developers can access the same search stack via the pay-as-you-go Perplexity Agent API with a deep-research preset.
A practical tutorial demonstrates how to stream NVIDIA's Nemotron-Pretraining-Code-v3 metadata index without downloading the full multi-gigabyte dataset. It creates a shuffled 30,000-record sample, derives features like file extension and directory depth, and visualizes top languages, extensions, repositories, and directory nesting. The workflow reconstructs raw GitHub URLs from metadata fields (repo, commit_id, rel_path) and attempts to fetch actual source files, handling missing/deleted repos gracefully. A Python-file filter is applied, and token counts are estimated using tiktoken, while the full dataset's scale is noted at approximately 173 billion tokens across 146 million files. Processed outputs are saved as Parquet and JSON for reuse.
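The URL reconstruction and feature derivation described above might look like the following sketch. The field names repo, commit_id, and rel_path come from the summary; the assumption that repo holds an "owner/name" string, the helper names, and the example values are hypothetical. The raw.githubusercontent.com path layout is GitHub's standard raw-file URL form.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.error import URLError
from urllib.parse import quote
from urllib.request import urlopen

RAW_HOST = "https://raw.githubusercontent.com"


def raw_github_url(repo: str, commit_id: str, rel_path: str) -> str:
    """Build a raw-file URL; assumes `repo` is formatted as 'owner/name'."""
    path = quote(rel_path.lstrip("/"), safe="/")  # escape spaces etc., keep slashes
    return f"{RAW_HOST}/{repo.strip('/')}/{commit_id}/{path}"


def path_features(rel_path: str) -> tuple:
    """Return (lowercased file extension, directory depth) for a relative path."""
    path = PurePosixPath(rel_path)
    return path.suffix.lower(), len(path.parts) - 1


def fetch_source(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return file text, or None if the repo or file is gone or unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except URLError:  # HTTPError (e.g. 404) is a subclass of URLError
        return None


# Hypothetical metadata record, for illustration only.
record = {"repo": "octocat/Hello-World", "commit_id": "abc123", "rel_path": "src/my file.py"}
url = raw_github_url(record["repo"], record["commit_id"], record["rel_path"])
extension, depth = path_features(record["rel_path"])
```

Returning None rather than raising keeps a long fetch loop running when repositories have been deleted or made private, which is the graceful handling the tutorial describes.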