Publications

Selected papers

This page highlights representative papers on LLM systems, hybrid infrastructure, security, and trustworthy AI systems. For the complete, up-to-date list, see my Google Scholar profile.

EuroSys 2026 Trustworthy LLM systems

TrustWeave: Integrity Measurement and Attestation For Multi-Cloud LLMs

A system for integrity measurement and attestation in multi-cloud LLM deployments, aimed at making model execution more trustworthy across heterogeneous cloud environments.

DOI

arXiv 2026 Deterministic inference

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

A verification policy that checks only low-margin decode steps, aiming to restore deterministic decoding at much lower overhead than always-on verification.

arXiv

arXiv 2026 Agent systems

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

A routing system for same-function tool providers that treats latency as service capacity and answer quality as the main target under changing load and provider heterogeneity.

arXiv

ICDCS 2025 LLM serving

MCaM: Efficient LLM Inference with Multi-tier KV Cache Management

A multi-tier KV-cache management system for improving large-model inference efficiency under memory pressure.

IEEE

arXiv 2025 MoE systems

ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference

A runtime system for MoE inference that jointly optimizes expert scheduling and memory coordination instead of treating them as separate problems.

arXiv

arXiv 2025 MoE memory efficiency

Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

A runtime-aware approach to dynamic expert precision control for MoE serving, designed to adapt expert bit-widths under strict GPU memory budgets.

arXiv

arXiv 2025 LLM security

Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

A security-focused approach that shares KV-cache entries selectively to reduce timing side-channel leakage in multi-tenant LLM serving while keeping the performance benefits of shared caching.

arXiv