AI Research · AI Info Forge

Research Wired AI Jul 28, 2026

Hugging Face Has a Deepfake Nudes Problem

Researchers discovered that popular image editing models available on Hugging Face can be readily used to generate explicit deepfake content, with analysis of 1,000 prompts revealing how users are actually exploiting this capability.

Read on Wired AI →

Research arXiv cs.AI Jul 28, 2026

Concept-based Visual Counterfactual Explanations with Diffusion Models

Researchers introduced C-VCE, a diffusion-based framework that generates visual counterfactual explanations by embedding a classifier directly into the generative model using concept bottleneck layers. This approach produces more realistic, minimally altered counterfactual images than existing methods while eliminating reliance on separate noise-robust classifiers, making it more practical for safety-critical applications like medical imaging.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

SeT-Diff: Towards Semantic Foundation Models for HPC Telemetry and Time-Series

Researchers introduced SeT-Diff, a foundational diffusion-based model designed for high-performance computing telemetry and time-series data. Unlike traditional approaches that rely on fixed sensor configurations, SeT-Diff uses semantic descriptions to condition its generative process, enabling it to handle varying sensor arrangements and perform multiple tasks including forecasting, data imputation, and thermal inference with minimal performance loss.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

QFoldAgent: An Autonomous Quantum Optimization Multi-Agent System for Protein Structure Prediction

Researchers developed QFoldAgent, a multi-agent system combining quantum and classical computing to predict protein structures on a lattice. The framework uses AI agents to automatically adjust optimization parameters across iterations, achieving improved structural accuracy and validity on tested protein fragments without relying on ground-truth data during optimization.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Same Question, Different Answers: Evaluating LLM Reliability Beyond Accuracy

Researchers evaluated how consistently large language models answer the same questions when phrased differently. Across multiple models and benchmarks, they found that over 23% of answers flip between correct and incorrect depending on wording, despite modest overall accuracy changes. Models demonstrated inconsistent knowledge retrieval, though a self-paraphrasing strategy partially improved performance.
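
The gap between accuracy and consistency can be illustrated with a small sketch. It assumes a "flip" means that one question's paraphrases yield mixed correctness; the paper's exact definition may differ, and the `results` data and function names here are hypothetical.

```python
from typing import Dict, List

# Hypothetical data: for each question, whether the model answered each
# paraphrase correctly. Values are illustrative, not taken from the paper.
results: Dict[str, List[bool]] = {
    "q1": [True, False],
    "q2": [False, True],
    "q3": [True, True],
    "q4": [False, False],
}


def flip_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of questions whose paraphrases give mixed correctness."""
    if not outcomes:
        return 0.0
    flipped = sum(1 for marks in outcomes.values() if len(set(marks)) > 1)
    return flipped / len(outcomes)


def accuracy_by_phrasing(outcomes: Dict[str, List[bool]]) -> List[float]:
    """Accuracy computed separately for each paraphrase position."""
    if not outcomes:
        return []
    n_phrasings = min(len(marks) for marks in outcomes.values())
    return [
        sum(marks[i] for marks in outcomes.values()) / len(outcomes)
        for i in range(n_phrasings)
    ]


print(flip_rate(results))             # 0.5
print(accuracy_by_phrasing(results))  # [0.5, 0.5]
```

In this toy data the accuracy is identical under both phrasings, yet half of the questions change outcome, which is the kind of instability an accuracy-only report would hide.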

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DeepLens Diagnosis Agent: Agentic Workflow Design Lets a Small Reasoning Model Compete with Frontier LLMs

Researchers developed DeepLens Diagnosis Agent, a structured workflow system that enables a smaller 7B medical model to match or exceed frontier LLMs on diagnostic reasoning tasks. The multi-stage pipeline achieved 60.14% accuracy on DiagnosisArena while costing 35-45% less than Claude Sonnet or Gemini, demonstrating that disciplined process design can compensate for smaller model size.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

MIITA: Memory-Induced Inference-Time Adaptation for Continual Learning with Small Language Models

Researchers introduced MIITA, a framework enabling small language models to adapt to new tasks while retaining previous knowledge without catastrophic forgetting. The approach stores compact memory prototypes and applies them during inference through temporary hidden-state adjustments, achieving improved performance under storage constraints typical of resource-limited deployments.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Codifying the Judge: Scalable Evaluation via Program Distillation

Researchers developed PAJAMA, a system that replaces expensive LLM-based evaluation judges with synthesized programs that assess model outputs directly. The approach matches LLM judge performance while reducing cost and latency and improving transparency, with applications to both automated evaluation and reward model training.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

SF-AMS: Strategic Forgetting for Structured Memory in LLM Agent

Researchers introduced SF-AMS, a framework that improves how language model agents manage memory by dynamically prioritizing important information and filtering irrelevant data. The approach uses utility-driven scoring to maintain compact, high-quality memory for better long-context reasoning, showing significant performance improvements across multiple benchmarks and models.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Synthetic Scenario Generation for Evaluation of Industry 4.0 Agents

Researchers extended an industrial agent benchmark by adding a Smart Grid Transformer asset class and introducing ScenarioGeneratorAgent, a system that automatically generates realistic synthetic evaluation scenarios for testing industrial AI agents. The pipeline incorporates domain standards and multiple optimization techniques to produce high-quality scenarios at scale while reducing computational time by 8x.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Loss-Aware Feature-Map Pruning in Convolutional Neural Networks Using Multi-Armed Bandits

Researchers developed a pruning method for convolutional neural networks that uses multi-armed bandit algorithms to identify and remove redundant feature maps while maintaining model accuracy. The approach treats each feature map as a bandit arm, evaluates them based on loss changes, and outperforms traditional pruning methods across multiple datasets.
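
The arm-per-feature-map idea can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the summary does not say which bandit strategy is used, so UCB1 is an assumption, and `select_maps_to_prune`, `loss_delta`, and the toy numbers are hypothetical. Pulling an arm stands in for masking one feature map on a mini-batch and measuring how much the loss rises.

```python
import math
import random
from typing import Callable, List


def select_maps_to_prune(
    num_maps: int,
    loss_delta: Callable[[int], float],
    num_prune: int,
    rounds: int = 200,
    c: float = 0.5,
) -> List[int]:
    """Pick feature maps whose removal raises the loss least (UCB1 sketch)."""
    if rounds < num_maps:
        raise ValueError("rounds must be >= num_maps so every arm is tried once")
    counts = [0] * num_maps
    means = [0.0] * num_maps  # running mean reward per arm
    for t in range(1, rounds + 1):
        if t <= num_maps:
            arm = t - 1  # initial pass: try each feature map once
        else:
            arm = max(
                range(num_maps),
                key=lambda i: means[i] + c * math.sqrt(math.log(t) / counts[i]),
            )
        reward = -loss_delta(arm)  # a small loss increase is a high reward
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    ranked = sorted(range(num_maps), key=lambda i: means[i], reverse=True)
    return sorted(ranked[:num_prune])


# Toy stand-in for "mask map i on a mini-batch and measure the loss change".
rng = random.Random(0)
true_delta = [0.9, 0.05, 0.6, 0.02, 0.7]


def noisy_loss_delta(arm: int) -> float:
    return true_delta[arm] + rng.gauss(0.0, 0.05)


pruned = select_maps_to_prune(len(true_delta), noisy_loss_delta, num_prune=2)
print(pruned)
```

The bandit spends most of its evaluation budget on the maps that look cheapest to remove, so the estimates that decide the pruning set are the ones measured most often.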

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DSTFView: Multi-View Cloud-Edge Workload Forecasting with Dual-Input Spatio-Temporal-Frequency Modeling

Researchers introduced DSTFView, a forecasting framework designed to predict workload patterns in cloud-edge computing environments by analyzing spatial, temporal, and frequency-domain data simultaneously. The method adapts to sudden demand changes and outperformed existing approaches on benchmark datasets.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

MedLoCoMo: A Long-Context Multi-Session Medical Dialogue Benchmark for Large Language Models

Researchers introduced MedLoCoMo, a benchmark for evaluating large language models on medical dialogue tasks requiring longitudinal patient history analysis across multiple hospital admissions. The dataset, built from MIMIC medical records, contains 100 patient timelines with multi-session conversations and tests whether models can reason across long clinical contexts, revealing that cross-admission reasoning remains challenging even for models with extended context windows.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Keyword Matters: Unveiling the Energy Sensitivity of On-Device LLM Prompting

Researchers measured how prompt wording affects energy consumption when running large language models on mobile devices. They found that linguistic features, particularly the choice of verbs and instruction phrasing, meaningfully impact decoding length and battery usage, suggesting prompt design is a practical optimization technique for on-device inference.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Execution-Grounded Security Testing for Coding Agents in Software Engineering Pipelines

Researchers developed a security testing framework that evaluates coding agents' actual execution behavior rather than just their stated intentions. The framework successfully induced agents to perform unsafe system operations 53-73% of the time by embedding malicious intent within routine software engineering tasks, revealing significant security vulnerabilities in current coding agent implementations.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Reference Feature Atlases for Mechanistic Auditing of Language Models

Researchers introduced reference feature atlases, a method for auditing language models by reusing a pre-trained sparse feature library across different models rather than analyzing each from scratch. This approach uses linear decoders to interpret target models and identifies both known features and novel behaviors, demonstrating effectiveness at detecting injected mechanisms and discovering model-specific behaviors like political framing.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

SCAIR: Schema-Conditioned Agentic Iterative Reasoning for Enterprise Knowledge Graphs

Researchers introduced SCAIR, a framework for improving how AI agents interact with enterprise knowledge graphs by incorporating structural constraints and schema-aware reasoning. The approach demonstrates better performance on real-world enterprise databases compared to existing methods, without requiring model retraining.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Schema-Aware Localisation (SAL): Live Schema Grounding and Hallucination Validation for Oracle NL2SQL

Researchers developed Schema-Aware Localisation (SAL), a middleware layer that improves LLM-generated SQL for Oracle databases by grounding models in actual database schemas and validating outputs against live catalogs. The approach eliminates hallucinated column references without retraining, achieving 62.6% execution success compared to a 2.2% baseline on test queries.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

PhononBench-MP40: a spectrum-resolved benchmark dataset for phonon stability

Researchers released PhononBench-MP40, a benchmark dataset containing nearly 47,000 crystal structure records with phonon stability labels and spectral data from the Materials Project. The dataset addresses a key challenge in computational materials screening by providing detailed stability classifications and enabling researchers to study dynamic instability in crystal structures.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Too much evidence, too little time: From text to actionable recommendations through multi-objective evidence reasoning

Researchers developed SCEPTER, a framework that helps clinicians manage overwhelming medical literature by automatically retrieving relevant PubMed papers, extracting key claims, detecting contradictions, and generating evidence-based recommendations. The system compressed typical search results from hundreds of papers to a manageable set of recommendations while maintaining evidence diversity.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Temporal Context Reinstatement Drives Episodic-Like Order Memory in Long-Context Language Models

Researchers studied how long-context language models handle episodic memory tasks involving temporal order recall. They discovered that LLMs use a one-dimensional temporal code reinstated through specific attention mechanisms, mirroring behavioral patterns observed in humans and suggesting similar computational approaches to long-term memory retrieval.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

cMoLLM at Scale: Horizontal Scaling Laws for Mixture-of-LLMs

Researchers introduced cMoLLM, a mixture-of-experts approach that scales language models by routing across multiple parallel streams using dynamic convolution rather than dense parameters. The method addresses computational bottlenecks in trillion-parameter models and demonstrated improvements in perplexity and downstream task performance compared to existing scaling approaches.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

HeraSys: Collaborative Serving of Multiple LLM Workflows via Fine-Grained End-to-End Optimization

HeraSys is a new LLM serving system that optimizes performance for concurrent multi-tenant workflows through cross-workflow optimization and fine-grained orchestration. It reduces latency by up to 2.17× and increases throughput by 1.85× through structural node merging, adaptive scheduling, and load-aware resource management.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Multi-Objective Structured Pruning of LLMs for Latency and Model Size Optimization

Researchers developed a hardware-aware pruning framework that removes redundant components from large language models to optimize them for edge devices. The two-stage method combines block-level pruning with Bayesian optimization to balance model accuracy, latency, and size, demonstrating effective deployment on resource-constrained platforms.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Source-Aware Reranking for Retrieval-Augmented Generation: A Reliability Prior Approach

Researchers propose incorporating source credibility priors into retrieval-augmented generation systems by weighting retrieved documents based on their source reliability in addition to semantic similarity. Testing on a health domain corpus showed the approach improved precision and reduced retrieval of low-credibility sources compared to similarity-only ranking.
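
A reliability prior of this kind might be combined with similarity as in the sketch below. The linear blend, the `alpha` weight, the default prior for unknown sources, and all names and scores are assumptions for illustration; the summary does not specify the paper's scoring formula.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str
    similarity: float  # semantic similarity to the query, assumed in [0, 1]


def rerank(
    docs: List[Doc],
    reliability: Dict[str, float],
    alpha: float = 0.3,
    default_prior: float = 0.5,
) -> List[Doc]:
    """Order documents by a blend of similarity and source reliability.

    alpha = 0 reproduces similarity-only ranking; larger alpha gives the
    (hypothetical) reliability prior more influence.
    """
    def score(doc: Doc) -> float:
        prior = reliability.get(doc.source, default_prior)
        return (1.0 - alpha) * doc.similarity + alpha * prior

    return sorted(docs, key=score, reverse=True)


# Illustrative inputs only.
docs = [
    Doc("a", source="anonymous_forum", similarity=0.80),
    Doc("b", source="medical_journal", similarity=0.75),
]
reliability = {"anonymous_forum": 0.2, "medical_journal": 0.9}

print([d.doc_id for d in rerank(docs, reliability, alpha=0.0)])  # ['a', 'b']
print([d.doc_id for d in rerank(docs, reliability, alpha=0.3)])  # ['b', 'a']
```

With the prior switched on, the slightly less similar document from the more reliable source moves to the top, which matches the behaviour the summary describes.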

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

The Scaffold Effect in Coding Agents: Harness Choice as a Hidden Variable in Coding-Agent Evaluation

Researchers found that the software framework (harness) used to evaluate coding agents significantly impacts performance metrics, causing up to 40x differences in token usage while model-to-model pass rate differences remain small. The study recommends evaluating harness-model pairs together and reporting detailed specifications rather than comparing models in isolation.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

MM-ShiftKV: Decode-Aware Prefill-Stage KV Selection for Multimodal Large Language Models

Researchers developed MM-ShiftKV, a method that improves key-value cache efficiency in multimodal large language models by better predicting which visual tokens matter during generation. The approach uses variance-expanded query proxies during the prefilling stage to more accurately estimate which cached information will be needed during decoding, outperforming existing selection methods under memory constraints.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

TriSP: Tri-Signal Structured Pruning for Large Language Models

Researchers introduced TriSP, a structured pruning method for large language models that combines weight magnitude, activation norms, and gradient sensitivity to identify which model components to remove. The approach achieves significant inference speedups (82% at 50% pruning) while maintaining performance competitive with unpruned models.
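
One plausible way to fuse three importance signals is sketched here. The min-max normalisation, the equal weighting, and every name and number are assumptions made for illustration; the summary does not describe how TriSP actually combines the signals.

```python
from typing import List, Sequence, Tuple


def _minmax(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def tri_signal_scores(
    weight_mag: Sequence[float],
    act_norm: Sequence[float],
    grad_sens: Sequence[float],
    mix: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> List[float]:
    """Per-unit importance as a weighted sum of three normalised signals."""
    if not (len(weight_mag) == len(act_norm) == len(grad_sens)):
        raise ValueError("all three signals must cover the same units")
    w, a, g = _minmax(weight_mag), _minmax(act_norm), _minmax(grad_sens)
    return [mix[0] * wi + mix[1] * ai + mix[2] * gi for wi, ai, gi in zip(w, a, g)]


def units_to_prune(scores: Sequence[float], sparsity: float) -> List[int]:
    """Indices of the lowest-scoring units, up to the requested fraction."""
    k = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])


# Illustrative per-unit statistics (e.g. one value per attention head).
scores = tri_signal_scores(
    weight_mag=[0.9, 0.1, 0.5, 0.2],
    act_norm=[2.0, 0.3, 1.5, 0.4],
    grad_sens=[0.8, 0.05, 0.6, 0.1],
)
print(units_to_prune(scores, sparsity=0.5))  # [1, 3]
```

Normalising each signal first keeps one statistic with a large numeric range from dominating the combined score.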

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

ParBench: A Benchmark for Reliable Evaluation of LLM Parallel Code Translation

Researchers introduced ParBench, a benchmark framework for evaluating how well large language models translate parallel code across different programming APIs like CUDA and OpenMP. The framework provides standardized testing conditions and reveals that current state-of-the-art models struggle with preserving computational semantics, thread synchronization, and handling source variations during translation.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Lexical discovery in unknown environments orchestrated by Large Language Models

Researchers developed a framework enabling autonomous LLM-based agents to collectively create shared vocabularies for unknown objects in unexplored environments. The system combines vision encoding and language models to establish consensus on new terms for out-of-distribution entities, with applications to space and deep-sea exploration missions.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Structure Over Scale: Schema-Constrained Causal Graphs for RAG

Researchers introduced HCG-RAG, a method that uses schema-constrained causal graphs instead of exhaustive entity extraction for retrieval-augmented generation. The approach cuts computational cost, using 3-20x fewer nodes and 8-135x fewer LLM calls, while maintaining or improving answer quality on medical and clinical benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

An Agentic Orchestration of Atomistic Simulations

Researchers developed an AI agent system within the URSA framework that automates atomistic simulations for materials design using LAMMPS. The agent independently selects interatomic potentials, executes simulations, and recovers from errors, reducing human expertise requirements while improving reproducibility and scalability compared to manual workflows.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

HyCE-RAG: Hypergraph Chain-of-Evidence Retrieval-Augmented Generation for Explainable Multi-hop Question Answering

Researchers introduced HyCE-RAG, a retrieval-augmented generation framework using hypergraph structures to improve multi-hop question answering. The system organizes evidence into hyperedges and performs confidence propagation to select and rank evidence paths, outperforming standard and graph-based RAG approaches on multiple benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Differencing the Diffusion Trajectory toward Uncertain Components for Time Series Forecasting

Researchers propose DiffDiff, a diffusion-based framework for time series forecasting that adapts the corruption process to focus modeling effort on uncertain future components while leveraging historical continuity. The method uses step-dependent forward operators and adaptive conditioning to outperform existing diffusion baselines across multiple benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Chart Deception in Vision-Language Models: From Vulnerability to Mitigation

Researchers introduced VisDeception, a benchmark testing how well vision-language models resist misleading chart designs like distorted axes and manipulated colors. Testing 10 advanced models on 1,600 paired charts revealed significant vulnerabilities to deceptive visualizations, and the team proposed a mitigation approach using structured metadata to improve robustness.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DeepLook: Deeper Thinking with Lookahead

Researchers introduced DeepLook, a training-free decoding method that strategically allocates computational resources during language model reasoning by detecting uncertainty points and applying lookahead exploration only where needed. Testing across multiple models and mathematics benchmarks showed the approach improved accuracy while reducing token generation by 87% on average compared to existing inference-scaling methods.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Group Preference Collapse in Personalized Multimodal Large Language Models

Researchers identified a problem in personalized multimodal language models where individual user preferences get overshadowed by dominant population trends. They developed PrefMoE, a framework that better preserves individual preferences by separating them from user profile information and using specialized learning techniques to maintain personalization across different user groups.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Evaluating LLMs as Interpretable Controllers for Dynamical Systems

Researchers evaluated five large language models of varying sizes as controllers for a thermal system, finding that larger models like GPT-4o successfully maintained temperature setpoints with coherent reasoning, while smaller models struggled with actuator dynamics. Incorporating physics-based tools improved control performance, suggesting LLMs can serve as interpretable controllers when sufficiently capable and equipped with domain knowledge.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Tokengeist: Multi-Turn Attribution Tracing in Agentic Conversations

Researchers introduced Tokengeist, a framework for tracing how tokens from previous conversation turns influence language model responses across multiple steps. The method recursively maps dependencies across dialogue turns and outperforms existing attribution techniques, achieving 90% accuracy compared to under 20% for single-pass approaches. The work also contributes a new benchmark of 3,845 annotated examples.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Decentralized Granular Access Control for Agentic AI Systems in Critical Infrastructure

Researchers have developed a decentralized access control system designed to manage autonomous AI agents operating in critical cloud infrastructure. The framework uses compound identity models, hierarchical permissions, and progressive trust escalation to prevent unauthorized operations, with a production deployment at a major cloud provider showing zero unauthorized write operations over eight months.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DynaResize: Runtime GPU Reallocation for Disaggregated LLM Post-Training

Researchers introduced DynaResize, a system that dynamically reallocates GPU resources between rollout and training phases during reinforcement learning-based LLM post-training. The approach reduces execution time by 33% compared to static GPU partitioning by minimizing pipeline delays from uneven workload distribution.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Opti-Q: A Constraint-Based Optimization Framework for Multi-LLM Question Planning

Researchers introduced Opti-Q, an optimization framework that orchestrates multiple LLMs for question answering by planning execution paths that balance answer quality against constraints like cost, latency, and energy. The system models LLM operations as a directed acyclic graph and uses a statistics catalog to estimate performance without executing each candidate plan, achieving significantly higher quality than baseline approaches within resource budgets.
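
The planning step can be sketched as estimating each candidate plan from catalog statistics and keeping the best one that fits the budget. Everything below is an assumed simplification: the catalog layout, the plan-level quality estimates, and the operation names are hypothetical, and energy is left out for brevity. Cost is taken as the sum over operations, and latency as the longest path through the DAG.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Set, Tuple

Catalog = Dict[str, Tuple[float, float]]  # operation -> (est. cost, est. latency)
Deps = Dict[str, Set[str]]                # operation -> operations it waits for


def estimate_plan(deps: Deps, catalog: Catalog) -> Tuple[float, float]:
    """Estimate (total cost, end-to-end latency) without running the plan."""
    finish: Dict[str, float] = {}
    for op in TopologicalSorter(deps).static_order():
        # An operation starts once all of its predecessors have finished.
        start = max((finish[p] for p in deps.get(op, ())), default=0.0)
        finish[op] = start + catalog[op][1]
    total_cost = sum(catalog[op][0] for op in finish)
    return total_cost, max(finish.values(), default=0.0)


def choose_plan(
    plans: List[Tuple[str, float, Deps]],
    catalog: Catalog,
    max_cost: float,
    max_latency: float,
) -> Optional[str]:
    """Return the highest-quality plan that satisfies both budgets, if any."""
    best_name, best_quality = None, float("-inf")
    for name, quality, deps in plans:
        cost, latency = estimate_plan(deps, catalog)
        if cost <= max_cost and latency <= max_latency and quality > best_quality:
            best_name, best_quality = name, quality
    return best_name


# Illustrative statistics and candidate plans.
catalog: Catalog = {
    "small_a": (1.0, 2.0),
    "small_b": (1.5, 3.0),
    "merge": (0.5, 1.0),
    "large": (5.0, 6.0),
}
plans = [
    ("large_only", 0.9, {"large": set()}),
    ("ensemble", 0.8, {"small_a": set(), "small_b": set(),
                       "merge": {"small_a", "small_b"}}),
]

print(estimate_plan(plans[1][2], catalog))                          # (3.0, 4.0)
print(choose_plan(plans, catalog, max_cost=4.0, max_latency=5.0))    # ensemble
print(choose_plan(plans, catalog, max_cost=10.0, max_latency=10.0))  # large_only
```

Because the two small models run in parallel, the ensemble's latency is set by the slower branch plus the merge step rather than by the sum of all three operations.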

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

CHS-SQL: A Text-to-SQL approach based on Confidence-Guided Heuristic Search Schema Linking process

Researchers introduced CHS-SQL, a framework for converting natural language queries to SQL code using smaller language models. The approach uses confidence-guided heuristic search to better balance precision and recall when selecting relevant database schema elements, achieving state-of-the-art results while requiring only a single GPU.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

TokenMem: Faithful Knowledge Injection for Frozen LLMs

Researchers introduced TokenMem, a lightweight system that addresses knowledge conflicts in retrieval-augmented generation by injecting external knowledge into frozen LLMs through a dedicated attention pathway rather than competing with the model's existing parameters. The method uses a minimal gating adapter trained in two phases and achieved significantly higher knowledge compliance rates (69-70%) compared to standard RAG approaches (20-52%) when handling contradictory information.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Masked Distillation: Internalizing the Chain-of-Thought in Language Models

Researchers proposed masked distillation, a knowledge-distillation technique that trains language models to produce answers directly without lengthy intermediate reasoning traces. The method uses a reasoning teacher to supervise a student model on final solutions while treating intermediate steps as optional scaffolding, reducing inference latency and computational costs.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

VlogReward: Learning Multi-Dimensional Evaluation for Vlog Editing

Researchers introduced VlogReward, an AI system designed to evaluate vlog editing across multiple dimensions including creativity, cinematography, and pacing. The team created a 100k vlog dataset and benchmark to train and test multimodal language models on providing detailed feedback for video improvement.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Evolving from Lessons: Skill-Augmented Table Graph Reasoning for Operation-wise Table Question Answering

Researchers introduced a new evaluation framework for table question answering that breaks down questions by operation type, revealing that language models perform well on simple lookups but struggle with complex operations. They proposed SkillTGR, a method using graph-based table representations and reusable reasoning skills that improves accuracy while reducing computational costs.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

PRESTO: Prefix-Aligned Tree Drafting for Diffusion Speculative Decoding

Researchers introduced PRESTO, a framework that improves speculative decoding by applying tree-based drafting to diffusion language models. The method addresses a fundamental mismatch between how diffusion models generate candidates and how autoregressive models verify them, achieving up to 1.5× throughput speedup on existing diffusion-based speculative decoding systems.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

CallBench: A Benchmark for Dual-Goal Coordination in Phone Call Assistants

Researchers introduced CallBench, a Chinese benchmark dataset with 50,000 phone call conversations designed to evaluate dialogue systems managing dual goals simultaneously: the device owner's preset objective and the caller's dynamic objective. The benchmark covers six scenarios and proposes evaluation metrics for assessing semantic understanding, context usage, and goal coordination, revealing that current dialogue methods struggle with this coordination task.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Answering Path Queries under Linear and Guarded Existential Rules

Researchers analyzed the computational complexity of answering path queries over knowledge bases with ontologies expressed as existential rules. They demonstrated that for linear rules, path query answering matches the complexity of queries on plain databases, while guarded rules maintain the same complexity bounds as standard conjunctive queries.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Fast Cross-Scenario Adaptation of CSI Models via Channel Conditional Parameter Generation

Researchers propose Channel Conditional Parameter Generation (CCPG), a method for rapidly adapting deep learning models for wireless channel estimation to new environments without retraining. The approach generates lightweight parameters in seconds using feature compression and diffusion-based generation, achieving performance comparable to traditional fine-tuning methods on standard benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

TRACE: Business Rule-Grounded Reasoning Curriculum for Knowledge-Preserving Parametric Tool Retrieval in Enterprise LLMs

Researchers introduced TRACE, a two-stage training approach that enables large language models to retrieve enterprise tools more efficiently while preserving knowledge. The method combines memorization training with business rule-based reasoning, allowing models to use fast single-beam decoding instead of slow beam search while achieving significantly higher tool retrieval accuracy across 8,300+ enterprise tools.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

CRAFT: Learn the Schema, Execute the Plan

Researchers developed CRAFT, a post-training method for enterprise coding agents that learn to work with APIs and schemas through fine-tuning and reinforcement learning rather than requiring exhaustive documentation in each prompt. The approach improves agent reliability and consistency while significantly reducing computational overhead and schema discovery errors in multi-turn analytics tasks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Reason Before You Retrieve: Agentic Planning for Multi-modal RAG

Researchers introduced MM-R2, a multimodal RAG system that uses agentic reasoning to plan retrieval before searching. The framework models retrieval intent and searches structured knowledge maps rather than flat document collections, achieving improved performance on visual question-answering benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DocHRL: A Hierarchical Reinforcement Learning Framework for Cost-Optimised Document Classification

Researchers developed DocHRL, a hierarchical reinforcement learning framework that dynamically selects the most cost-effective classification approach for each document. The system learns to choose between vision models, LLMs, OCR, and human review based on document complexity, achieving high accuracy while reducing per-document processing costs compared to fixed classifier pipelines.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Extracting Algorithms in Pre-trained LLMs: A Case on Hidden Markov Models

Researchers developed a method to identify the internal algorithms that enable large language models to perform in-context learning on Hidden Markov Models. Using a technique called Principal Activations Probe, they traced low-dimensional representations across model layers that causally drive predictions, revealing how different computational stages are distributed throughout the network.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

PTStore (Prefix Tensor Store): Distributed Prefix Caching and Replication for High Throughput Inference Serving

PTStore introduces a distributed caching system for LLM inference that replicates frequently used KV cache prefixes across multiple nodes, similar to a CDN architecture. This approach reduces latency, balances server loads, and enables significantly larger cache sizes, achieving 5-6x better efficiency on long-context tasks compared to existing methods.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

STAIF: A Stage-wise Optimization for Complex Instruction Following

Researchers introduced STAIF, a two-stage optimization framework that improves how language models follow complex instructions with multiple constraints. The method separates soft constraints (preference-based) from hard constraints (verifiable), using a new bilingual dataset of 31,000 complex instructions to achieve better compliance on benchmark tests.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

ARdena: Scenario-driven control of real-time LLM agents

Researchers introduced ARdena, a framework for controlling LLM agent behavior in real-time through structured prompting rather than model fine-tuning. The system combines persistent context with scenario-specific constraints to modify agent behavior during interaction, and was tested on a multimodal embodied agent with speech and visual capabilities.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

KG2Code: Bridging Knowledge Graphs and Large Language Models via Executable Code for Question Answering

Researchers introduced KG2Code, a method that converts knowledge graphs into executable code representations to improve how language models answer knowledge-based questions. This approach generates verifiable reasoning traces and reduces hallucinations, outperforming existing retrieval and SPARQL-based methods while generalizing well to unfamiliar knowledge graphs.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Do Language Models Converge to Themselves? Recursive Self-Refinement as Textual Relaxation

Researchers studied how language models behave when repeatedly refining their own outputs, finding that iterative refinement converges quickly to a stable textual form rather than improving indefinitely. Using GPT-4.5 on academic abstracts, they observed that most meaningful edits occur in early iterations before reaching a fixed-point region with only minor changes, suggesting models settle into preferred equilibrium states.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

EventOD: Event-Aware OD Flow Generation via LLM-Guided Semantic Modulation

Researchers introduced EventOD, a framework that adapts existing origin-destination flow models to predict mobility patterns during disruptive events like hurricanes and pandemics. The approach uses large language models to infer functional changes in regions from event descriptions, then applies lightweight adaptation modules to adjust a pretrained generator without retraining.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

StanceBench: A Benchmark for Audio LLM-Based Interpersonal Stance Evaluation from Speech

Researchers introduced StanceBench, a new benchmark for evaluating how audio language models assess interpersonal attitudes like empathy and politeness in conversational speech. The benchmark tests LLM performance across nine stance dimensions and reveals that models handle some social cues well (empathy, politeness) but struggle with others (honesty), while showing sensitivity to prompt ordering and context.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

TRE: Training-Free Hallucination Detection for Diffusion Language Models

Researchers introduced TRE, a training-free method for detecting hallucinations in diffusion language models by analyzing entropy signals during text generation. Unlike existing approaches that require training detectors, TRE operates without additional parameters or repeated sampling, showing competitive performance across multiple models and datasets while offering better generalizability.
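
As a rough illustration of what an entropy signal looks like, the sketch below computes Shannon entropy over per-step token distributions. It is not TRE's actual scoring rule, which the summary does not specify; the function names and the idea of averaging over steps are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(step_distributions):
    """Average entropy over the distributions seen during generation (assumed aggregate)."""
    entropies = [token_entropy(p) for p in step_distributions]
    return sum(entropies) / len(entropies)

confident = [[0.97, 0.01, 0.01, 0.01], [0.99, 0.01, 0.0, 0.0]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
```

Higher mean entropy means the model spread probability over more candidate tokens, which is the kind of signal a training-free detector could threshold without extra parameters or repeated sampling.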

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

CuraWeb: Joint Optimization of Quality, Redundancy, and Diversity for Web-Scale Pretraining Data

Researchers introduced CuraWeb, a 2-trillion-token English corpus created using a novel data curation approach that jointly optimizes quality, redundancy, and diversity for large language model pretraining. The method combines rule-based and model-driven filtering with dual deduplication techniques, demonstrating an average performance improvement of 1.8% over existing curated datasets across multiple benchmarks, especially for knowledge and reasoning tasks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Beyond Block Boundaries: Multi-Block Editing for Diffusion Large Language Models

Researchers proposed Multi-Block Editing (MBE), a technique that improves discrete diffusion language models by allowing tokens to be edited across block boundaries using cross-block context. The method includes both a training-free decoding algorithm and a supervised fine-tuning strategy, achieving performance gains of 2.7 points on LLaDA2.1-Mini while maintaining comparable generation speed.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Obliviate: Efficient Unlearning in Recommender Systems

Researchers introduced Obliviate, a two-stage unlearning framework for recommendation systems that efficiently removes user interaction data from trained models while maintaining recommendation quality. The approach uses lightweight adapters and calibration techniques to achieve data removal at a fraction of the computational cost of full model retraining.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Reinforcement Learning for Heterogeneous Sensor Selection in Maritime Surveillance

Researchers developed a reinforcement learning system that intelligently selects which sensor to activate for tracking ships in maritime networks. Using Bayesian filtering and a trained policy agent, the approach achieves tracking performance comparable to using all sensors simultaneously while reducing computational costs and activating only one sensor per decision step.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

AIR-BENCH Live: An Evolving Safety Benchmark for Foundation Models

Researchers introduced AIR-BENCH Live, an automatically updating safety benchmark for AI models that monitors regulatory changes and generates multilingual test prompts to keep pace with evolving governance and model capabilities. The benchmark expanded from 314 to 335 risk categories based on new policies across seven jurisdictions, revealing significant safety variations among 14 tested models.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

How LLM Task-Adaptation Reshapes Alignment: A Multi-dimensional Study of Behavioral and Representational Drift

Researchers conducted a comprehensive study examining how different fine-tuning methods affect language model alignment across safety, factuality, and other domains. They found that supervised fine-tuning causes significant alignment drift, while reinforcement learning with verifiable rewards better preserves alignment properties, with KL regularization offering a middle ground.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

DOSA: A Tree-Guided, Self-Regressive Framework for Long Document Structure Analysis

Researchers introduced DOSA, a framework for analyzing document structure by predicting relationships between page elements across multiple pages. The system uses visual, textual, and layout features to build hierarchical semantic trees incrementally, showing significant performance improvements on document understanding benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

A Vocabulary for Multi-Agent Automated Research Systems

Researchers propose a standardized vocabulary for describing and comparing multi-agent automated research systems, specifying key design elements like agent roles, operations, communication methods, and evaluation approaches. The framework distinguishes between generative taste (proposing novel trajectories) and evaluative taste (alignment between proxy scores and true quality), enabling more systematic analysis of system design choices.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Imprompt: A Language Framework for Prompt Programming

Researchers introduced Imprompt, a language framework that treats prompts as programmable interfaces for language models. The framework separates task descriptions from execution details and incorporates concepts like compilation and type checking to improve prompt programming structure and effectiveness.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 28, 2026

Co-Harness: Co-Evolving Harnesses and Model Weights for LLM Agents

Researchers introduced Co-Harness, a framework that simultaneously optimizes both the runtime environment (prompts, tools, memory) and model parameters when training AI agents for research tasks. The approach uses an LLM-based critic to identify failures and propose harness improvements, then fine-tunes the model on better trajectories, showing improvements in efficiency and autonomous capability.

Read on arXiv cs.AI →

Research OpenAI Jul 27, 2026

How AI is expanding what people do at work

OpenAI research indicates that ChatGPT is enabling workers to perform tasks beyond their traditional job descriptions, expanding the scope of what employees accomplish across various roles.

Read on OpenAI →

Research TechCrunch AI Jul 27, 2026

Are brain waves the next unlock for physical AI?

Researchers are exploring brain wave data as an additional input signal for training physical AI models, moving beyond traditional video-based approaches. This development suggests future embodied AI systems may incorporate neurological signals alongside multi-angle visual data and detailed annotations to improve learning capabilities.

Read on TechCrunch AI →

Research arXiv cs.AI Jul 24, 2026

AINTMA: Agentic AI Architecture for Autonomous Test Management with Generative Intelligence, Secure Cloud Communication and Adaptive Quality Analytics

Researchers introduced AINTMA, a multi-agent AI system for automated software testing that uses six specialized agents coordinated through secure cloud infrastructure. The system achieved 88.4% test prioritization accuracy, reduced test cycle time by 43%, and lowered defect escape rates from 8.3% to 2.1% across 12 software projects over 18 months.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Marking the Wrong Symptoms: Evaluating LLM Watermarks in Medical Texts

Researchers evaluated how watermarking techniques affect large language model performance in medical applications across multiple models and tasks. Their analysis revealed that watermarks can degrade clinical reasoning quality, introduce terminology errors, and cause hallucinations. Standard evaluation metrics mask these problems, which highlights the need for domain-specific assessment before deploying watermarked models in healthcare.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Stochastic Sampling is Epistemically Shallow: The Dimensionality Gap Between Temperature Variation and Model Diversity in LLMs

Researchers compared temperature-based sampling from a single large language model with ensembles of different models, finding that repeated high-temperature runs of one model give limited insight into what the model doesn't know across different questions. Only diverse ensembles of different models revealed meaningful structure in knowledge gaps, suggesting that self-consistency methods work for per-question uncertainty but not for identifying systematic knowledge limitations.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

JAXBench: Benchmarking Autonomous TPU Kernel Optimization

Researchers introduced JAXBench, a benchmark suite for evaluating AI-driven TPU kernel optimization using 50 JAX workloads from production ML models. Testing showed that providing target-specific documentation and search techniques improved AI-generated kernel performance, achieving up to 1.36x speedup over standard optimization while addressing a previously unmeasured gap in TPU kernel tuning.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

DC-Leap: Training-Free Acceleration of dLLMs via Draft-Guided Contiguous Leaping Decoding

Researchers introduce DC-Leap, a training-free method to accelerate diffusion-based language models by addressing confidence threshold limitations through dynamic token verification and draft-guided decoding. The approach achieves significant speedups (up to 53x, or up to 105x with optimization) while maintaining generation quality on standard benchmarks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

InferenceBench: A Benchmark for Open-Ended LLM Inference Optimization by AI Agents

Researchers introduced InferenceBench, a benchmark that tests AI agents' ability to optimize large language model inference speed on a single GPU within a two-hour time limit across different performance scenarios. Results show agents can improve over baseline approaches but lack exploration diversity compared to systematic hyperparameter search, suggesting their limitation lies in proposing varied configurations rather than domain knowledge.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

DecodeShare: Tracing the Shared Subspace of LLM Decode-Time Decisions

Researchers developed DecodeShare, a method that identifies a shared low-dimensional subspace in language model decode-time operations across different tasks. By removing this subspace during decoding, they demonstrated it plays a causal role in model decisions and has practical applications for activation steering and model interpretability.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

PlanE: Meta Planning of Data, Tuning, and Inference for Extractive-based LLMs

Researchers introduced PlanE, a framework that optimizes the creation of task-specific LLMs by strategically managing data decomposition, instruction tuning, and inference. The approach includes a planning system that selects optimal base models and configurations to reduce annotation costs and improve efficiency across different datasets.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Benchmarking the Personalization Capabilities of Large Language Models

Researchers released SDR-Bench, a benchmark dataset of 6,279 customer success stories across 22 industries, to evaluate how well large language models can personalize sales outreach to persuade third parties. Testing frontier LLMs revealed a consistent personalization plateau, with no model outperforming the others by a statistically significant margin on Fortune 100 tech companies, though field validation showed 48% of generated content was rated immediately useful.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Robust Critics: Defending LLMs Against Multi-Turn Attacks

Researchers propose Dialogue Critic Guided Sampling (DCGS), a framework that defends language models against multi-turn attacks by inferring user intent across conversation history rather than evaluating safety turn-by-turn. The method uses reinforcement learning critics to score responses and demonstrates improved robustness across multiple adversarial benchmarks compared to existing approaches.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Incomplete Prompt Jailbreaks in Large Language Models

Researchers identified a vulnerability in open-weight language models where incomplete harmful prompts can elicit unsafe completions through sentence continuation. The study shows that standard refusal training fails to generalize across different prompt types, and it proposes neuron-level interventions targeting specific completion mechanisms as a more robust defense.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

VeriSimpl: Robust Optimization Modeling from Natural Language using Simplification-based Verification

Researchers introduced VeriSimpl, a framework that uses simplification-based verification to help large language models accurately translate natural language descriptions into optimization solver formulations. The approach leverages optimization solvers to generate diagnostic queries that allow LLMs to verify their formulations are correct, showing improved accuracy over existing methods.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

SonicSampler: Unified Tile-Aware Kernels for LLM Sampling and Speculative Verification

Researchers introduced SonicSampler, a set of optimized kernels that combine multiple sampling and verification operations for language model inference into a single efficient process. The approach supports diverse sampling configurations like grammar constraints and token filtering while achieving significant speedups and maintaining compatibility with GPU execution optimization techniques.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Benchmarking Large Language Models on Multi-Sensor Physical Hazard Assessment

Researchers benchmarked five major large language models on their ability to assess physical hazards using multi-sensor data. The study found all tested models failed to issue safety warnings when multiple sensors showed elevated readings below individual thresholds, despite performing well on single-sensor violations, raising concerns for safety-critical applications.
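
The failure case can be made concrete with a small sketch. Everything here is hypothetical: the sensor names, the limits, and the combined rule (warn when at least two sensors exceed 80% of their limits) are illustrative assumptions, not the benchmark's definitions.

```python
# Hypothetical per-sensor alarm limits.
LIMITS = {"co_ppm": 50.0, "temp_c": 60.0, "smoke_idx": 0.30}

def single_sensor_alarm(readings):
    """True if any one sensor is above its own limit."""
    return any(readings[name] > limit for name, limit in LIMITS.items())

def combined_alarm(readings, frac=0.8, min_sensors=2):
    """Also warn when several sensors are elevated but each is below its limit."""
    elevated = sum(readings[name] > frac * limit for name, limit in LIMITS.items())
    return single_sensor_alarm(readings) or elevated >= min_sensors

# Two sensors are close to their limits, but none crosses it.
readings = {"co_ppm": 45.0, "temp_c": 55.0, "smoke_idx": 0.10}
```

For these readings the per-sensor check stays silent while the combined rule raises a warning, which is the kind of case the study reports models missing.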

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Semi-Supervised Text-Attributed Graph Distillation

Researchers propose a semi-supervised framework for compressing text-attributed graphs that combines graph and textual information while maintaining performance on downstream tasks. The method uses collaborative dual-pathway encoding, Wasserstein distance-based graph compression, and LLM-based text synthesis to create smaller, more efficient graph representations without sacrificing quality.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Beyond Liars' Bench: The Impact of Lie Typology, Depth, and Sparsity on Deception Detection in LLMs

Researchers investigated how various factors affect the ability of detection probes to identify deceptive outputs from large language models. By testing multiple probe types on datasets containing different types of deception (fabrication, omission, exaggeration), they found that training data composition and lie types significantly impact detection performance, with no single optimal approach across all scenarios.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Enabling Scalable Topology Inference in Distribution Systems via Constrained Multi-Source Inference

Researchers developed a constrained inference framework for identifying distribution system topology using multiple data sources, addressing challenges from incomplete utility records. Validated on operational data from a major U.S. utility, the method achieves over 95% accuracy while being computationally efficient compared to existing global inference approaches.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Routing Without Training: Controllable-Ratio LLM Offloading via Reliability Gating

Researchers introduced CARGO, a training-free routing method for local-cloud LLM collaboration that uses the local model's response agreement to decide when to offload tasks to a stronger cloud model. The approach employs prompt-varied sampling and Bayesian early stopping to control uncertainty and support flexible collaboration ratios without requiring trained routers or specialized finetuning.
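
Agreement-based gating can be sketched in a few lines. This is an assumption-laden illustration rather than CARGO itself: it uses a simple majority-share agreement score and a fixed `threshold`, and it leaves out the prompt-varied sampling and Bayesian early stopping mentioned above.

```python
from collections import Counter

def agreement(responses):
    """Share of sampled local responses that match the most common answer."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

def route(responses, threshold=0.7):
    """Keep the query local when the local model agrees with itself, else offload."""
    return "local" if agreement(responses) >= threshold else "cloud"

consistent = ["Paris", "paris ", "Paris"]
scattered = ["4", "5", "6"]
```

Raising `threshold` sends more queries to the cloud model and lowering it keeps more of them local, which is one simple way to get a controllable offloading ratio without a trained router.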

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

PersonaTrail: Benchmarking Personalized Web Agents through Browsing Trails

Researchers introduced PersonaTrail, a benchmark for evaluating web agents that must infer user preferences from browsing history to complete underspecified tasks. They also proposed PACMem, a memory framework that organizes browsing data into factual and preference memories to improve agent performance on personalized navigation tasks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Tractable Hierarchical Control of Autoregressive Language Models

Researchers developed a method to efficiently constrain language model generation to follow formal grammar rules like SQL or JSON by distilling LLMs into tractable probabilistic models. The approach enables polynomial-time verification of grammar compliance, improving upon previous exponential-time methods and ensuring syntactically valid outputs for tasks like program synthesis.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

The Devil is in the Spectrum: Mitigating Representation Collapse in LLMs via Topologically Regularized Side-Path

Researchers propose Topologically Regularized Side-Path (TRSP), an architectural modification that addresses representation collapse in large language models by balancing token mixing efficiency with information capacity. The method improves long-context performance significantly, achieving 83% accuracy at 8 times the training length on benchmarks where competing approaches drop substantially.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Expectation Alignment of Language Models for Real-World User Expectations

Researchers introduced ExpectBench, a benchmark based on actual user expectations from real-world LLM interactions, and found that current language models struggle to meet what users truly need. They developed LENS, a framework that helps models better understand and satisfy user expectations, demonstrating improved alignment between model outputs and genuine user goals.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

OPTScientist: Multi-Agent Discovery of Typed Optimizer Programs for Transformer Pretraining

Researchers introduced OPTScientist, a multi-agent framework that automatically designs optimizers for transformer training using a typed domain-specific language. The system combines four specialized agents with evolutionary search to discover new optimizers, resulting in RS-MR, which outperforms existing optimizers on transformer pretraining tasks.

Read on arXiv cs.AI →

Research arXiv cs.AI Jul 24, 2026

Directional Hallucinations: Ideological Drift in News-Grounded LLM Question Answering

Researchers developed a framework to measure hallucinations in LLM responses to political questions, finding that while hallucination rates vary by model, the errors consistently demonstrate leftward ideological bias across multiple models, including those using right-leaning sources. The study analyzed over 21,000 political news articles and suggests this drift occurs in high-uncertainty generation contexts.

Read on arXiv cs.AI →
