Proposed open-source edge semantic cache architecture for LLMs, built on Rust/WASM: architecture sanity check
Summary
The author proposes an open-source edge semantic cache architecture for LLMs, aimed at reducing latency and API costs. Rust compiled to WebAssembly runs on CDN edge nodes (e.g., Cloudflare Workers) and intercepts user prompts. On a cache hit (cosine similarity ≥ 0.88), the cached response is returned in about 5 ms without calling the LLM; on a miss, the request is proxied to the LLM provider and the cache is updated asynchronously. Key components are a lightweight embedding model such as bge-small-en-v1.5, a cosine similarity check against an edge vector database, and an edge KV store holding the response texts. The author asks the community about realistic semantic cache hit rates in production, pitfalls of caching at the edge, and interest in adopting an open-source template.
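The hit test in the summary comes down to a cosine similarity compared against the 0.88 threshold. The following is a minimal sketch of that check in plain Rust, not the author's implementation: the `Entry` struct, the `best_match` helper, and the in-memory slice standing in for the edge vector database are all hypothetical names introduced for illustration.

```rust
/// Similarity threshold quoted in the post; at or above it counts as a hit.
const HIT_THRESHOLD: f32 = 0.88;

/// Cosine similarity between two vectors.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm (similarity is undefined).
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical cache entry: a prompt embedding plus the KV key
/// under which the response text would be stored.
struct Entry {
    embedding: Vec<f32>,
    response_key: String,
}

/// Returns the most similar entry that clears the threshold, if any.
/// A linear scan stands in for the edge vector database query.
fn best_match<'a>(query: &[f32], entries: &'a [Entry]) -> Option<(&'a Entry, f32)> {
    entries
        .iter()
        .filter_map(|e| cosine_similarity(query, &e.embedding).map(|s| (e, s)))
        .filter(|(_, s)| *s >= HIT_THRESHOLD)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

A real deployment would delegate the nearest-neighbour search to the vector database rather than scanning, and the right threshold likely depends on the embedding model and the traffic, which is one of the questions the author raises.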
Key points
Proposes a low-latency semantic cache for LLMs deployed at the CDN edge using Rust compiled to WebAssembly, avoiding Python overhead and GC pauses.
The architecture uses a lightweight embedding model (bge-small-en-v1.5) and an edge vector database (Cloudflare Vectorize) for similarity checks, returning cached responses in about 5 ms on a hit.
On a cache miss, the request is proxied to the upstream LLM provider and the edge cache is populated asynchronously, with the aim of cutting API costs for repetitive queries.
The author asks community members about real-world semantic cache hit rates, edge caching pitfalls, and whether they would adopt an open-source template for this setup.
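The hit and miss paths described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the proposed template: `SemanticCache`, `Outcome`, `handle`, and `flush` are hypothetical names, the vector index and KV store are in-memory stand-ins, embeddings are assumed to be unit-length so that a dot product equals cosine similarity, and the asynchronous cache update is modelled as a queue that is flushed after the response has been returned.

```rust
use std::collections::HashMap;

/// Result of handling one prompt: served from cache or from upstream.
#[derive(Debug, PartialEq)]
enum Outcome {
    Hit(String),
    Miss(String),
}

/// Hypothetical in-memory stand-in for the edge vector index and KV store.
struct SemanticCache {
    threshold: f32,
    index: Vec<(Vec<f32>, String)>,   // (embedding, KV key)
    kv: HashMap<String, String>,      // KV key -> response text
    pending: Vec<(Vec<f32>, String)>, // writes deferred until after the response
}

/// Dot product; equals cosine similarity only for unit-length inputs (assumed).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        SemanticCache { threshold, index: Vec::new(), kv: HashMap::new(), pending: Vec::new() }
    }

    /// Best-scoring cached response at or above the threshold, if any.
    fn lookup(&self, query: &[f32]) -> Option<&String> {
        self.index
            .iter()
            .map(|(emb, key)| (dot(query, emb), key))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .and_then(|(_, key)| self.kv.get(key))
    }

    /// Hit: return the cached text without calling upstream.
    /// Miss: call upstream, queue the cache write, return the fresh text.
    fn handle<F>(&mut self, query: Vec<f32>, upstream: F) -> Outcome
    where
        F: FnOnce() -> String,
    {
        if let Some(cached) = self.lookup(&query) {
            return Outcome::Hit(cached.clone());
        }
        let response = upstream();
        self.pending.push((query, response.clone()));
        Outcome::Miss(response)
    }

    /// Applies queued writes; models the asynchronous cache population.
    fn flush(&mut self) {
        for (embedding, response) in self.pending.drain(..) {
            let key = format!("resp-{}", self.index.len());
            self.kv.insert(key.clone(), response);
            self.index.push((embedding, key));
        }
    }
}
```

Deferring the write keeps the cache update off the response path, which matches the asynchronous update the post describes. On an actual edge runtime the deferred work would be handed to the platform's background-task mechanism instead of an explicit `flush` call.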