Together AI has optimized its serving of DeepSeek V4 Pro and now ranks #1 on the Artificial Analysis benchmark for both output speed (tokens per second) and latency. The inference optimizations covered KV cache efficiency, prefix reuse, custom kernel implementations, and endpoint profiling. On that benchmark, this makes Together AI's endpoint the fastest DeepSeek V4 Pro API currently available. The company shared a detailed breakdown of its systems work in a linked blog post.
Social | Source: X | Importance: 3/5
DeepSeek V4 Pro, when deployed on Together Compute's inference platform, has been ranked first in both latency and speed benchmarks. The announcement, originating from a tweet by Vipul Ved and retweeted by Together Compute, positions the model as the current leader in inference performance on the service. No specific metrics or comparative figures were disclosed in the social media post.
Social | Source: V2EX | Importance: 2/5
Krill, an AI relay service, launched a 618 promotion from June 15–18, 2026, reducing base Codex model rates to as low as 0.15 and offering a 66% discount coupon on Codex plans. With a 10-person group buy, the effective rate reaches 0.1 Chinese yuan per US dollar. Existing Codex plan holders on June 15 will have their quotas adjusted to the 0.1 level. Claude model access is discounted only via balance top-ups, not plans. The service uses Pro accounts and emphasizes cost transparency.
Social | Source: V2EX | Importance: 1/5
A V2EX user reported that a friend, who primarily uses OpenAI's Codex and ChatGPT, had purchased a GLM annual subscription as a backup. After recent policy-driven access restrictions (possibly a reference to “Fable” or similar incidents), that backup proved valuable. The user warns against depending solely on providers such as OpenAI or Anthropic, whose policies can cut off access without notice, and plans to buy a GLM annual plan as well. The post reflects growing community concern about API dependency and the importance of fallback options.
MiniMax Sparse Attention (MSA) is a new method for efficient processing of ultra-long contexts (hundreds of thousands to millions of tokens) in large language models. It uses blockwise sparsity and an optimized GPU execution path to achieve significant speedups in both training and inference while maintaining performance. The method is built on Grouped Query Attention (GQA), introducing a lightweight Index Branch for group-specific sparse token retrieval and a Main Branch for exact block-sparse attention. MSA is co-designed with GPU kernels for cross-GPU scalability and has been deployed in a production-grade multimodal model, reducing per-token attention compute. Its inference kernel and model are openly available online.
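The two-branch idea described above can be illustrated with a toy sketch. This is not MiniMax's implementation: the function names (`select_blocks`, `block_sparse_attention`), the max-dot-product block scoring, and the way the index query and index keys are supplied are all assumptions made for illustration, and the real method runs in fused GPU kernels rather than Python lists. The sketch only shows the general shape: a cheap index step picks a few key blocks for a query group, then exact softmax attention runs over just those blocks, with every query head in the group sharing the same keys, values, and block selection as in GQA.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(index_q, index_keys, block_size, top_k):
    """Index step (assumed form): score each key block by the largest
    dot product between one index query for the group and the block's
    index keys, then keep the top_k highest-scoring blocks."""
    n_blocks = math.ceil(len(index_keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = index_keys[b * block_size:(b + 1) * block_size]
        scores.append(max(dot(index_q, k) for k in block))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def block_sparse_attention(queries, keys, values, block_ids, block_size):
    """Main step: exact softmax attention restricted to the selected
    blocks. All queries belong to one GQA group, so they share the
    same keys, values, and block selection."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, keys[t]) * scale for t in token_ids])
        outputs.append([
            sum(w * values[t][d] for w, t in zip(weights, token_ids))
            for d in range(dim)
        ])
    return outputs


# Toy data: 4 tokens in 2 blocks of 2; one group with 2 query heads.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
group_queries = [[0.0, 2.0], [0.0, 3.0]]
index_q = [0.0, 1.0]  # hypothetical index query for the group

blocks = select_blocks(index_q, keys, block_size=2, top_k=1)
out = block_sparse_attention(group_queries, keys, values, blocks, block_size=2)
```

In this toy setup the index step keeps only the second block, so each query head attends to two tokens instead of four. With `top_k` equal to the number of blocks, the function reduces to ordinary dense attention, which is a useful sanity check for any sparse variant.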
Developer Knok0932 updated an open-source C++ implementation of PaddleOCR to support text detection and recognition models from PP-OCR v3 through the latest v6. The project uses the ncnn inference framework instead of the official Paddle C++ runtime, which is described as complex and heavy with many dependencies. The ncnn-based approach reportedly offers faster inference for the author's tasks and greatly simplifies deployment. The code is available on GitHub at https://github.com/Avafly/PaddleOCR-ncnn-CPP.