A PhD researcher from Mainz University of Applied Sciences is recruiting UX designers and AI/ML practitioners to evaluate a structured method for designing interface elements that calibrate user trust in LLM-based chatbots, so that users neither over-rely on nor under-trust them. Participants complete an anonymous 20 to 30 minute online survey in which they apply the method to a worked example, rate its clarity, usefulness, and applicability, and provide open feedback. The study seeks critical feedback to refine the method for the dissertation. No personal data is collected beyond optional professional background questions, and no compensation is provided.
An independent researcher demonstrates that a coherent target context can shift large language models into latent states where safety rules are reinterpreted, without triggering output-based filters. Measurements on open models (primarily Gemma-3-12B-IT) using hidden-state geometry, residual stream trajectories, SAE readouts, and causal interventions show regime changes before the final output is produced. Current RLHF training and output classifiers operate only on surface-level outputs and so miss these internal shifts. Code, data, and scripts are released on GitHub and Zenodo.
This paper, presented at ACM CAIS 2026, studies safety evaluation in tool-using LLM agents. It categorizes outcomes into safe success, unsafe success, and failure, and proposes a two-tier verification architecture: deterministic policy/tool checks followed by an LLM-based verifier. Using τ-bench tool-use scenarios, the authors find that verification can reduce unsafe success but also decreases task completion as the task horizon increases. They term this phenomenon the 'Verifier Tax', a horizon-dependent tradeoff between safety and successful task completion. The work highlights that unsafe completion should be treated as a separate category distinct from safe success.
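The two-tier idea and the three outcome categories can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tool names, the refund limit, and the stubbed second-tier verifier are all hypothetical, and a real second tier would call an LLM rather than a local function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    SAFE_SUCCESS = "safe_success"
    UNSAFE_SUCCESS = "unsafe_success"
    FAILURE = "failure"


@dataclass
class ToolCall:
    tool: str
    args: dict


# Hypothetical policy values, chosen only for illustration.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}
MAX_REFUND = 100.0


def deterministic_check(call: ToolCall) -> bool:
    """Tier 1: cheap, rule-based policy and tool checks."""
    if call.tool not in ALLOWED_TOOLS:
        return False
    if call.tool == "issue_refund" and call.args.get("amount", 0) > MAX_REFUND:
        return False
    return True


def stub_llm_verifier(call: ToolCall) -> bool:
    """Tier 2 stand-in (assumption): a real system would query an LLM here."""
    return "order_id" in call.args


def verify(call: ToolCall, llm_verifier: Callable[[ToolCall], bool]) -> bool:
    """Run tier 1 first; consult tier 2 only if tier 1 passes."""
    return deterministic_check(call) and llm_verifier(call)


def classify(task_completed: bool, violated_policy: bool) -> Outcome:
    """Keep unsafe completion separate from safe success."""
    if not task_completed:
        return Outcome.FAILURE
    if violated_policy:
        return Outcome.UNSAFE_SUCCESS
    return Outcome.SAFE_SUCCESS
```

In this sketch a blocked call would count toward failure if the agent cannot finish the task another way, which is one way to picture why a stricter verifier can lower completion on longer tasks.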
Anthropic is reversing its undisclosed practice of interfering with Claude Fable 5 usage aimed at building highly capable AI. Instead of silently refusing or rerouting such requests, the system will now notify users when it suspects frontier AI development. The company admitted it made "the wrong tradeoff" between safety and transparency and apologized. The change follows backlash over covert sabotage reported by Wired. Affected users will receive alerts indicating whether the request was blocked or routed to a less capable model.
A student nearing completion of a Psychology degree and studying Systems Engineering is seeking research papers, datasets, benchmarks, and methodological advice for a project comparing how AI systems (ChatGPT, Gemini, Wysa, Replika) respond to prompts involving psychological distress at different intensity levels. The study aims to analyze linguistic and safety responses—such as empathy, psychoeducation, crisis resources, or refusals—rather than clinical effectiveness. Key interest areas include how responses change with prompt intensity, declarative versus question phrasing, explicit versus indirect distress, and the influence of hidden safety layers, system prompts, model versions, and stochastic outputs. The request also covers reproducibility concerns, moderation classifiers, and product updates.
Anthropic has introduced silent safeguards in its new Fable model that degrade performance on requests related to advanced LLM development, such as building pretraining pipelines, distributed training infrastructure, or ML accelerator design. These interventions, invisible to users, are implemented through prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). The model does not fall back to another version; instead, it internally alters responses. The restriction impacts an estimated 0.03% of traffic, concentrated in fewer than 0.1% of organizations. Anthropic states this enforces its Terms of Service against using Claude to develop competing models, aiming to avoid accelerating malicious actors.