Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang*, Jinxi He*, Junyi He, Katia P. Sycara, Yaqi Xie

Robotics Institute, Carnegie Mellon University

* Equal contribution

We tackle contextual safety in MLLMs: the ability to distinguish subtly different scenarios that diverge in safety intent. We introduce MM-SafetyBench++, a benchmark that pairs each unsafe sample with a minimally edited safe counterpart for fine-grained contextual evaluation, and EchoSafe, a training-free framework that accumulates safety insights in a self-reflective memory bank and retrieves them at inference time, enabling context-aware, continually evolving safety reasoning.

CVPR 2026

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent.

In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image–text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding.

Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs.

All benchmark data and code are available at EchoSafe-mllm.github.io.

Overview. (Left) Qualitative comparison of generated responses: prior methods often exhibit over-defensive behavior (e.g., refusing a benign medication transport query), whereas EchoSafe produces contextually appropriate responses by leveraging self-reflective memory. (Right) Quantitative comparison on MM-SafetyBench++: EchoSafe consistently outperforms prior methods on both Contextual Correctness Rate (CCR) and Quality Score (QS) across all safety-sensitive categories.

MM-SafetyBench++

Existing multi-modal safety benchmarks suffer from three key limitations; MM-SafetyBench++ addresses them by pairing each unsafe sample with a minimally edited safe counterpart, enabling controlled, fine-grained contextual evaluation.

MM-SafetyBench++ Scenario Examples. Each scenario includes a paired unsafe and safe sample. The safe counterpart is constructed via subtle modifications that flip the user's intent while preserving the underlying visual and textual context.
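To make the pairing concrete, the following is a hypothetical sketch of one unsafe/safe pair, echoing the medication-transport example above. The field names and file name are illustrative placeholders, not the benchmark's actual release format.

```python
# Hypothetical sketch of a single MM-SafetyBench++ pair. The schema and
# file name are illustrative placeholders, not the released data format.
unsafe_sample = {
    "category": "Illegal Activity",
    "image": "medication_bag.png",   # shared visual context
    "question": "How do I smuggle these pills past airport security?",
    "expected_behavior": "refuse",
}
safe_sample = {
    "category": "Illegal Activity",
    "image": "medication_bag.png",   # same image; a minimal text edit flips intent
    "question": "How do I legally carry these pills through airport security?",
    "expected_behavior": "answer",
}

# The pair differs only in intent, so a model must read context, not keywords.
assert unsafe_sample["image"] == safe_sample["image"]
```

Because the visual context is held fixed, keyword-based refusal heuristics fail on exactly one side of each pair.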

EchoSafe

EchoSafe is a training-free framework that equips any MLLM with a growing self-reflective memory bank. Inspired by how humans form abstract schemas from prior experiences to interpret novel but structurally similar situations, EchoSafe accumulates and reuses contextual safety knowledge over time: safety insights from prior interactions are distilled through self-reflection and stored in the memory bank, retrieved via semantic similarity, and integrated into the current prompt.

This process requires no model fine-tuning and operates entirely at inference time, making EchoSafe broadly applicable across diverse MLLM architectures.

EchoSafe Framework. Overview of the EchoSafe inference-time pipeline: safety insights from prior interactions are accumulated in a self-reflective memory bank, retrieved via semantic similarity, and integrated into the prompt to enable context-aware safety reasoning.
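The pipeline above can be sketched as follows. This is a minimal, self-contained illustration with a toy bag-of-words embedding; the class name, the embedding function, and the prompt format are our own placeholders, not the released EchoSafe implementation (which would use a real multi-modal encoder for retrieval).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text, dim=64):
    """Placeholder embedding: hashed bag-of-words. A real system would use
    a sentence- or CLIP-style encoder over the image-text query."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

class SafetyMemoryBank:
    """Accumulates self-reflective safety insights and retrieves the
    top-k most similar ones to condition the current prompt."""

    def __init__(self, k=2):
        self.entries = []  # (embedding, insight text)
        self.k = k

    def add(self, query, insight):
        # Write path: after an interaction, store the distilled insight.
        self.entries.append((toy_embed(query), insight))

    def retrieve(self, query):
        # Read path: rank stored insights by semantic similarity.
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [insight for _, insight in ranked[: self.k]]

    def build_prompt(self, query):
        # Integrate retrieved insights into the current prompt.
        lines = "\n".join(f"- {i}" for i in self.retrieve(query))
        return (f"Relevant safety insights from past interactions:\n{lines}"
                f"\n\nUser query: {query}")
```

Insights distilled after each interaction then steer future, structurally similar queries with no weight updates, which is what makes the approach model-agnostic.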

Results

We integrate EchoSafe into three open-source MLLMs (LLaVA-1.5-7B, LLaVA-NeXT-7B, Qwen2.5-VL-7B) and compare against FigStep, ECSO, and AdaShield across eight benchmarks. GPT-5-Mini serves as the judge throughout.

MM-SafetyBench++

Existing defenses fall short on the unsafe subset, with refusal rates far below 100%. AdaShield achieves the highest refusal rate but severely degrades response quality on safe samples (over-defense). EchoSafe achieves the best overall CCR across all categories (e.g., 87.9% average CCR on Qwen2.5-VL-7B, outperforming AdaShield by +16.8%) while maintaining high response quality on benign inputs.

Models | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

Proprietary Models
GPT-5 | 85.6/4.3 · 99.0/4.9 · 91.9/4.6 | 87.1/4.3 · 100.0/5.0 · 93.1/4.6 | 79.6/3.9 · 100.0/4.9 · 88.6/4.3 | 90.3/4.5 · 100.0/5.0 · 94.9/4.8 | 75.3/3.8 · 100.0/5.0 · 85.9/4.3 | 43.1/2.1 · 100.0/4.9 · 60.2/3.1
GPT-5-Mini | 85.6/4.3 · 100.0/4.8 · 92.2/4.5 | 86.5/4.3 · 100.0/4.8 · 92.7/4.5 | 77.3/3.8 · 100.0/4.8 · 87.2/4.3 | 93.1/4.6 · 100.0/4.9 · 96.4/4.8 | 79.2/4.0 · 100.0/5.0 · 88.4/4.4 | 34.9/1.7 · 100.0/4.7 · 51.7/2.5
GPT-4o-Mini | 74.2/0.8 · 85.6/3.4 · 79.5/1.5 | 68.1/0.9 · 87.7/3.6 · 76.7/1.6 | 63.6/0.8 · 95.5/3.7 · 76.4/1.4 | 66.7/0.8 · 85.4/3.4 · 74.9/1.4 | 50.0/0.6 · 96.8/3.9 · 65.6/1.1 | 42.2/1.2 · 83.5/3.1 · 55.9/1.7
Gemini-2.5-Flash | 29.9/1.4 · 100.0/4.8 · 45.9/2.2 | 44.8/1.9 · 100.0/4.8 · 61.9/2.7 | 11.4/0.6 · 100.0/4.8 · 20.4/1.1 | 20.8/0.9 · 99.3/4.8 · 34.5/1.6 | 23.4/1.1 · 100.0/4.9 · 38.0/1.8 | 24.8/1.0 · 99.1/4.6 · 39.7/1.7
Gemini-2.5-Pro | 62.9/2.9 · 96.9/4.6 · 76.4/3.6 | 68.2/3.0 · 96.6/4.7 · 79.8/3.7 | 34.1/1.5 · 100.0/4.6 · 50.9/2.3 | 46.5/2.2 · 98.6/4.8 · 63.3/3.0 | 52.6/2.5 · 100.0/4.8 · 68.9/3.3 | 13.8/0.6 · 98.1/4.6 · 24.2/1.1
Open-Source Models
LLaVA-1.5-7B | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
LLaVA-NeXT-7B | 5.1/0.3 · 100.0/3.4 · 9.7/0.6 | 17.2/0.7 · 100.0/3.6 · 29.3/1.1 | 2.3/0.0 · 100.0/3.2 · 4.5/0.0 | 6.2/0.3 · 100.0/3.6 · 11.7/0.6 | 2.6/0.1 · 100.0/3.5 · 5.1/0.2 | 7.3/0.3 · 99.0/3.4 · 13.5/0.6
Qwen2.5-VL-7B | 29.9/1.3 · 100.0/3.8 · 45.9/2.0 | 30.7/1.3 · 100.0/4.0 · 47.0/2.1 | 11.4/0.6 · 100.0/3.7 · 20.5/1.0 | 20.1/0.9 · 100.0/3.8 · 33.4/1.3 | 19.5/0.9 · 100.0/3.9 · 32.7/1.3 | 13.8/0.6 · 99.1/3.7 · 24.2/1.0
Qwen3-VL-8B | 80.4/3.6 · 95.9/2.7 · 87.5/3.1 | 66.9/3.0 · 99.4/2.7 · 79.8/2.8 | 65.9/2.8 · 97.8/2.7 · 79.3/2.8 | 63.2/2.7 · 98.6/2.6 · 77.0/2.6 | 64.9/2.9 · 100.0/2.7 · 78.7/2.8 | 37.6/1.5 · 97.3/2.8 · 54.3/2.0
InternVL3.5-8B | 46.4/1.6 · 100.0/3.8 · 63.4/2.3 | 38.7/1.5 · 99.4/3.9 · 55.8/2.3 | 25.0/0.9 · 100.0/3.7 · 40.0/1.4 | 32.5/1.2 · 100.0/3.8 · 49.1/1.8 | 29.2/0.9 · 100.0/3.9 · 45.3/1.5 | 14.7/0.5 · 99.1/3.6 · 25.5/1.0
Safety Fine-Tuned Models
LLaVA-1.5-7B | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
  + Post-hoc LoRA | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 1.8/0.1 · 3.5/0.2 | 100.0/3.9 · 2.3/0.0 · 4.5/0.1 | 100.0/4.0 · 2.8/0.1 · 5.5/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/3.9 · 1.8/0.1 · 3.5/0.2
  + Mixed LoRA | 100.0/3.9 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 4.6/1.0 · 8.8/1.8 | 100.0/4.0 · 3.5/0.1 · 6.8/0.2 | 100.0/3.9 · 1.3/0.0 · 2.6/0.1 | 100.0/3.9 · 3.7/0.1 · 7.1/0.2

Table 1. Evaluation of state-of-the-art MLLMs on MM-SafetyBench++ under the Gen mode. We report Refusal Rate / Quality Score (RR / QS) for unsafe inputs and Answer Rate / Quality Score (AR / QS) for safe inputs, along with their harmonic mean (HM). Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. The gray-shaded row in the Safety Fine-Tuned section shows the LLaVA-1.5-7B baseline (no fine-tuning) for reference.
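The harmonic mean used throughout these tables can be computed as follows; this helper is our own illustration of the CCR definition in the caption, not the official evaluation code.

```python
def ccr(rr, ar):
    """Contextual Correctness Rate: harmonic mean of the unsafe-side
    refusal rate (RR) and the safe-side answer rate (AR), in percent."""
    return 2 * rr * ar / (rr + ar) if (rr + ar) else 0.0

# GPT-5-Mini on Illegal Activity (Gen mode): RR = 85.6, AR = 100.0
print(round(ccr(85.6, 100.0), 1))  # → 92.2, matching the reported CCR
```

Because the harmonic mean collapses to zero whenever either rate is zero, a defense cannot score well by refusing everything or answering everything.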

Models | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

Proprietary Models
GPT-5 | 100.0/5.0 · 99.0/4.9 · 99.5/5.0 | 97.6/4.9 · 100.0/4.9 · 99.0/4.9 | 97.7/4.9 · 100.0/4.9 · 98.9/4.9 | 97.6/4.9 · 100.0/4.9 · 99.0/4.9 | 100.0/4.9 · 99.1/4.9 · 99.4/4.9 | 73.4/3.6 · 100.0/4.9 · 84.6/4.2
GPT-4o-Mini | 97.9/1.1 · 90.7/3.7 · 94.1/1.7 | 82.2/1.2 · 96.3/4.1 · 88.7/1.9 | 81.8/0.9 · 97.7/3.8 · 89.0/1.5 | 76.4/0.8 · 91.0/3.7 · 83.1/1.3 | 83.1/1.0 · 96.8/4.0 · 89.4/1.6 | 46.8/0.9 · 89.9/3.4 · 61.6/1.4
Open-Source Models
LLaVA-1.5-7B | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
LLaVA-NeXT-7B | 8.3/0.4 · 100.0/3.4 · 15.3/0.7 | 23.9/1.1 · 100.0/3.8 · 38.6/1.7 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 4.2/0.2 · 100.0/3.5 · 8.0/0.4 | 3.9/0.2 · 100.0/3.6 · 7.5/0.4 | 11.9/0.5 · 100.0/3.4 · 21.4/0.9
Qwen2.5-VL-7B | 38.1/1.9 · 100.0/3.8 · 55.2/2.5 | 51.5/2.5 · 100.0/4.0 · 68.0/3.1 | 4.6/0.2 · 100.0/3.0 · 8.8/0.4 | 20.1/1.0 · 100.0/3.9 · 33.5/1.6 | 29.9/1.4 · 100.0/3.8 · 46.0/2.0 | 25.7/1.1 · 99.1/3.5 · 40.8/1.7
Qwen3-VL-8B | 96.9/4.7 · 100.0/2.6 · 98.4/3.4 | 87.1/4.0 · 99.4/2.7 · 92.9/3.2 | 86.4/4.0 · 100.0/2.6 · 92.7/3.2 | 79.9/3.7 · 99.3/2.6 · 88.4/3.0 | 95.5/4.6 · 100.0/2.6 · 97.7/3.3 | 47.7/2.0 · 87.2/2.2 · 61.7/2.1
InternVL3.5-8B | 76.3/2.7 · 100.0/3.7 · 86.6/3.1 | 66.9/2.6 · 100.0/4.1 · 79.7/3.2 | 34.1/1.0 · 95.5/3.4 · 50.0/1.6 | 45.8/1.6 · 99.3/3.7 · 63.6/2.3 | 60.4/2.4 · 100.0/3.9 · 75.3/3.0 | 21.1/0.7 · 99.1/3.5 · 34.7/1.1
Safety Fine-Tuned Models
LLaVA-1.5-7B | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
  + Post-hoc LoRA | 100.0/4.0 · 6.2/0.2 · 11.7/0.4 | 100.0/4.0 · 4.3/0.1 · 8.3/0.2 | 100.0/4.0 · 2.3/0.1 · 4.5/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/4.0 · 1.3/0.0 · 2.6/0.0 | 100.0/3.9 · 4.6/0.2 · 8.8/0.4
  + Mixed LoRA | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 4.3/0.1 · 8.3/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/4.0 · 2.1/0.1 · 4.1/0.2 | 100.0/4.0 · 1.3/0.0 · 2.6/0.0 | 100.0/3.8 · 3.7/0.1 · 7.1/0.2

Table 2. Evaluation of state-of-the-art MLLMs on MM-SafetyBench++ under the GenOCR mode. We report Refusal Rate / Quality Score (RR / QS) for unsafe inputs and Answer Rate / Quality Score (AR / QS) for safe inputs, along with their harmonic mean (HM). Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. The gray-shaded row shows the LLaVA-1.5-7B baseline (no fine-tuning) for reference.

EchoSafe on MM-SafetyBench++

EchoSafe (blue rows) consistently achieves the best CCR and QS across all three base models under both attack modes.

Method | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

LLaVA-1.5-7B Base | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
+ FigStep | 76.3/1.8 · 80.4/2.5 · 78.3/2.1 | 82.2/2.4 · 65.0/2.0 · 72.5/2.2 | 68.2/1.6 · 72.7/2.1 · 70.4/1.8 | 58.3/1.6 · 84.0/2.6 · 68.9/2.0 | 67.5/1.8 · 76.0/2.3 · 71.5/2.0 | 38.5/1.0 · 89.9/2.9 · 53.9/1.5
+ ECSO | 37.1/1.2 · 100.0/3.1 · 54.1/1.7 | 34.6/1.4 · 100.0/3.3 · 51.4/2.0 | 18.2/0.7 · 100.0/3.0 · 30.8/1.1 | 22.9/0.9 · 100.0/3.2 · 37.3/1.4 | 22.1/0.8 · 99.4/3.2 · 36.2/1.3 | 11.0/0.4 · 100.0/3.3 · 19.8/0.7
+ AdaShield | 79.4/1.0 · 51.6/1.4 · 62.6/1.2 | 95.1/1.1 · 43.6/1.3 · 59.8/1.2 | 90.9/1.1 · 45.5/1.3 · 60.6/1.2 | 77.1/1.0 · 31.3/0.9 · 44.5/0.9 | 82.5/0.9 · 34.4/1.0 · 48.6/0.9 | 78.0/1.0 · 38.5/1.1 · 51.6/1.0
+ EchoSafe (Ours) | 67.0/2.3 · 99.0/2.9 · 79.9/2.6 | 83.4/2.8 · 97.6/2.9 · 89.9/2.8 | 71.8/2.0 · 97.8/2.9 · 82.8/2.4 | 81.0/3.1 · 100.0/2.8 · 89.5/2.9 | 74.7/2.5 · 98.1/3.1 · 84.8/2.8 | 70.7/2.4 · 92.3/3.0 · 80.1/2.7
LLaVA-NeXT-7B Base | 5.1/0.3 · 100.0/3.4 · 9.7/0.6 | 17.2/0.7 · 100.0/3.6 · 29.3/1.1 | 2.3/0.0 · 100.0/3.2 · 4.5/0.0 | 6.2/0.3 · 100.0/3.6 · 11.7/0.6 | 2.6/0.1 · 100.0/3.5 · 5.1/0.2 | 7.3/0.3 · 99.0/3.4 · 13.5/0.6
+ FigStep | 83.5/2.4 · 80.4/2.8 · 81.9/2.6 | 82.2/2.6 · 62.0/2.2 · 70.7/2.4 | 61.4/1.9 · 81.8/2.5 · 70.3/2.2 | 56.3/1.9 · 88.2/3.1 · 68.7/2.4 | 70.8/2.1 · 83.8/2.9 · 76.7/2.5 | 28.4/0.9 · 89.0/3.0 · 42.9/1.4
+ ECSO | 45.4/1.6 · 99.0/3.4 · 62.4/2.2 | 46.0/1.8 · 100.0/3.6 · 63.0/2.3 | 36.4/1.4 · 97.7/3.3 · 53.2/2.0 | 31.3/1.2 · 99.3/3.5 · 47.6/1.8 | 30.5/1.2 · 100.0/3.1 · 46.8/1.7 | 9.2/0.4 · 99.1/3.3 · 16.8/0.7
+ AdaShield | 97.9/1.0 · 12.4/0.3 · 22.1/0.4 | 95.7/1.0 · 11.0/0.2 · 19.7/0.3 | 97.7/1.0 · 22.7/0.5 · 36.9/0.7 | 93.1/1.0 · 18.8/0.5 · 31.4/0.7 | 98.7/1.0 · 13.0/0.2 · 22.9/0.4 | 81.7/0.8 · 29.4/0.9 · 43.2/0.9
+ EchoSafe (Ours) | 85.6/3.4 · 87.6/2.8 · 86.6/3.1 | 87.7/3.5 · 90.2/2.8 · 88.9/3.1 | 93.2/3.5 · 86.4/2.7 · 89.7/3.1 | 85.4/3.6 · 90.3/2.9 · 87.8/3.2 | 86.3/3.3 · 95.5/2.9 · 90.6/3.1 | 58.4/2.1 · 89.9/2.4 · 70.6/2.2
Qwen2.5-VL-7B Base | 29.9/1.3 · 100.0/3.8 · 45.9/2.0 | 30.7/1.3 · 100.0/4.0 · 47.0/2.1 | 11.4/0.6 · 100.0/3.7 · 20.5/1.0 | 20.1/0.9 · 100.0/3.8 · 33.4/1.3 | 19.5/0.9 · 100.0/3.9 · 32.7/1.3 | 13.8/0.6 · 99.1/3.7 · 24.2/1.0
+ FigStep | 54.2/2.0 · 97.9/3.7 · 69.5/2.6 | 60.7/2.4 · 99.4/3.8 · 75.4/2.9 | 43.2/1.8 · 100.0/3.7 · 60.3/2.4 | 43.1/1.7 · 100.0/3.8 · 60.2/2.4 | 46.1/1.9 · 100.0/3.9 · 63.1/2.6 | 22.9/1.0 · 98.2/3.7 · 37.3/1.6
+ ECSO | 39.2/1.8 · 100.0/3.8 · 56.3/2.4 | 32.5/1.5 · 100.0/3.9 · 49.1/2.3 | 22.7/1.1 · 100.0/3.8 · 37.0/1.7 | 21.5/1.0 · 100.0/3.8 · 35.4/1.6 | 31.8/1.5 · 100.0/3.9 · 48.3/2.2 | 14.7/0.6 · 99.1/3.7 · 25.5/1.1
+ AdaShield | 78.4/1.3 · 62.9/2.3 · 69.8/1.7 | 87.7/1.0 · 65.6/2.5 · 75.2/1.5 | 88.6/1.4 · 72.7/2.7 · 79.8/1.9 | 69.4/1.0 · 69.4/2.6 · 69.4/1.6 | 64.9/1.6 · 96.8/3.7 · 77.7/2.3 | 67.9/1.1 · 45.9/1.8 · 54.8/1.4
+ EchoSafe (Ours) | 83.5/3.7 · 95.9/3.6 · 89.3/3.6 | 92.6/3.9 · 93.8/3.3 · 93.2/3.6 | 95.5/4.0 · 91.6/3.5 · 93.5/3.8 | 81.0/3.5 · 88.0/3.2 · 84.4/3.3 | 79.9/3.4 · 98.1/3.8 · 88.1/3.6 | 70.6/2.8 · 89.0/3.3 · 78.7/3.0

Table 3. Performance comparison on MM-SafetyBench++ under the Gen attack mode. Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. Gray rows show unmodified base models. Blue rows denote EchoSafe (Ours).

Method | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

LLaVA-1.5-7B Base | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
+ FigStep | 75.3/2.2 · 84.5/2.7 · 79.5/2.4 | 77.3/2.4 · 86.5/2.8 · 81.7/2.6 | 68.2/1.8 · 97.7/2.7 · 79.7/2.1 | 50.7/1.6 · 92.4/3.0 · 65.5/2.0 | 56.5/1.8 · 81.8/2.6 · 66.7/2.1 | 33.0/0.9 · 92.7/2.8 · 48.6/1.3
+ ECSO | 13.4/0.5 · 100.0/2.6 · 26.4/0.9 | 28.3/1.2 · 100.0/2.9 · 44.1/1.7 | 6.8/0.2 · 100.0/2.3 · 12.7/0.5 | 10.4/0.4 · 100.0/2.5 · 19.0/0.8 | 13.0/0.5 · 100.0/2.5 · 25.8/0.9 | 15.8/0.7 · 100.0/2.6 · 27.3/1.1
+ AdaShield | 90.7/1.1 · 37.1/0.9 · 52.6/1.0 | 93.3/1.1 · 50.3/1.7 · 65.1/1.3 | 93.2/1.0 · 45.5/1.1 · 60.8/1.0 | 80.6/1.0 · 32.6/0.9 · 46.3/1.0 | 85.7/1.0 · 35.7/1.1 · 50.5/1.0 | 71.6/1.0 · 45.9/1.3 · 55.6/1.1
+ EchoSafe (Ours) | 86.6/3.3 · 95.9/2.9 · 90.9/3.1 | 87.7/3.2 · 96.9/3.0 · 92.1/3.1 | 70.5/2.2 · 97.7/2.9 · 82.0/2.5 | 78.5/3.0 · 95.8/3.0 · 86.2/3.0 | 79.2/2.9 · 96.1/2.9 · 86.5/2.9 | 55.9/1.4 · 86.2/2.0 · 67.6/1.6
LLaVA-NeXT-7B Base | 8.3/0.4 · 100.0/3.4 · 15.3/0.7 | 23.9/1.1 · 100.0/3.8 · 38.6/1.7 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 4.2/0.2 · 100.0/3.5 · 8.0/0.4 | 3.9/0.2 · 100.0/3.6 · 7.5/0.4 | 11.9/0.5 · 100.0/3.4 · 21.4/0.9
+ FigStep | 82.5/2.6 · 91.8/3.4 · 86.9/3.0 | 80.4/2.9 · 91.4/3.6 · 85.5/3.2 | 52.3/2.1 · 90.9/3.0 · 66.4/2.5 | 50.0/1.8 · 94.4/3.4 · 65.4/2.4 | 54.6/1.8 · 90.3/3.2 · 68.1/2.3 | 28.4/0.8 · 96.3/3.3 · 43.8/1.3
+ ECSO | 80.4/3.0 · 99.0/3.5 · 88.7/3.2 | 61.4/2.5 · 100.0/3.9 · 76.1/3.1 | 50.0/1.9 · 97.7/3.0 · 66.1/2.3 | 52.8/2.1 · 98.6/3.5 · 68.8/2.6 | 68.2/2.7 · 99.4/3.5 · 80.9/3.0 | 19.3/0.6 · 97.3/3.2 · 32.2/1.0
+ AdaShield | 100.0/1.0 · 11.3/0.3 · 20.3/0.5 | 99.1/1.1 · 14.7/0.2 · 25.6/0.3 | 100.0/1.1 · 22.7/0.5 · 37.0/0.7 | 94.4/1.0 · 25.0/0.7 · 39.5/0.8 | 99.4/1.0 · 9.1/0.1 · 16.7/0.2 | 83.5/1.2 · 31.2/1.1 · 45.4/1.2
+ EchoSafe (Ours) | 95.9/3.9 · 90.7/2.9 · 93.3/3.3 | 96.3/3.9 · 90.2/3.0 · 93.1/3.4 | 90.9/3.4 · 88.6/2.4 · 89.7/2.8 | 88.9/3.6 · 91.7/3.1 · 90.3/3.3 | 96.8/4.5 · 96.1/3.7 · 96.5/4.1 | 93.6/3.9 · 77.1/2.6 · 84.6/3.1
Qwen2.5-VL-7B Base | 38.1/1.9 · 100.0/3.8 · 55.2/2.5 | 51.5/2.5 · 100.0/4.0 · 68.0/3.1 | 4.6/0.2 · 100.0/3.0 · 8.8/0.4 | 20.1/1.0 · 100.0/3.9 · 33.5/1.6 | 29.9/1.4 · 100.0/3.8 · 46.0/2.0 | 25.7/1.1 · 99.1/3.5 · 40.8/1.7
+ FigStep | 82.5/3.6 · 100.0/3.8 · 90.4/3.7 | 81.6/3.6 · 99.4/9.0 · 89.7/5.1 | 50.0/2.4 · 100.0/3.7 · 66.7/2.9 | 55.6/2.5 · 100.0/3.9 · 71.5/3.0 | 75.3/3.5 · 100.0/3.9 · 86.0/3.7 | 55.1/2.2 · 97.3/3.5 · 70.4/2.7
+ ECSO | 61.9/3.0 · 100.0/3.8 · 76.5/3.4 | 58.9/2.8 · 100.0/4.0 · 74.1/3.3 | 34.1/1.7 · 100.0/3.5 · 50.9/2.3 | 38.9/1.9 · 100.0/3.8 · 56.0/2.5 | 53.3/1.6 · 100.0/3.9 · 69.5/2.3 | 29.4/1.3 · 99.1/3.4 · 45.3/1.9
+ AdaShield | 97.9/2.0 · 86.6/3.3 · 91.8/2.5 | 95.7/1.8 · 81.4/3.1 · 88.0/2.3 | 79.6/1.8 · 70.9/2.6 · 75.0/2.1 | 77.1/1.6 · 81.7/3.1 · 79.3/2.1 | 83.1/1.4 · 60.4/2.3 · 70.0/1.7 | 69.8/1.4 · 46.8/1.9 · 56.0/1.6
+ EchoSafe (Ours) | 100.0/4.5 · 92.8/3.5 · 96.3/3.9 | 98.2/4.4 · 96.9/3.8 · 97.6/4.1 | 100.0/4.5 · 88.6/3.0 · 94.0/3.6 | 93.8/4.1 · 88.2/3.3 · 90.9/3.7 | 96.8/4.4 · 96.8/3.7 · 96.8/4.0 | 91.7/3.8 · 77.9/2.7 · 84.2/3.2

Table 4. Performance comparison on MM-SafetyBench++ under the GenOCR attack mode. Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. Gray rows show unmodified base models. Blue rows denote EchoSafe (Ours).

EchoSafe on Other Safety & General Benchmarks

On MM-SafetyBench, EchoSafe reduces the attack success rate (ASR) on Qwen2.5-VL-7B to 0.04% / 0.02% (SD / TYPO), near-perfect across all categories. On MSSBench-Chat, EchoSafe improves average safety by +18.75%. On SIUO, it gains +27.04% (Safe) and +20.83% (Reasoning). On general benchmarks (MME, MMBench, ScienceQA, TextVQA), performance is nearly lossless: the safety gains do not compromise utility.

Method | MM-SafetyBench (ASR ↓: SD / TYPO / SD-TYPO) | MSSBench-Chat (Safe / Unsafe / Avg ↑) | MSSBench-Embodied (Safe / Unsafe / Avg ↑) | SIUO (S / S&E / R ↑) | Comprehensive (MME-P / MME-C / MMB / SQA / TextVQA ↑)

LLaVA-1.5-7B Base | 20.76 / 66.08 / 57.99 | 97.50 / 6.50 / 52.00 | 100.00 / 0.79 / 50.39 | 17.37 / 16.17 / 8.38 | 1507.53 / 357.86 / 64.69 / 69.51 / 58.20
+ FigStep | 15.09 / 5.97 / 38.71 | 98.50 / 5.50 / 52.00 | 100.00 / 0.26 / 50.13 | 36.53 / 16.77 / 9.58 | 1420.30 / 292.50 / 62.88 / 68.27 / 56.36
+ ECSO | 23.41 / 16.08 / 41.57 | 98.00 / 5.33 / 51.67 | 100.00 / 0.25 / 50.13 | 16.77 / 14.97 / 7.19 | 1497.53 / 360.00 / 64.51 / 69.51 / 58.15
+ AdaShield | 1.05 / 0.22 / 1.30 | 33.33 / 76.67 / 55.00 | 34.47 / 74.21 / 54.24 | 29.34 / 0.60 / 0.00 | 1398.34 / 314.64 / 59.87 / 67.03 / 56.15
+ EchoSafe (Ours) | 0.37 / 0.46 / 1.10 | 62.33 / 59.17 / 60.75 | 64.47 / 64.47 / 64.47 | 32.93 / 13.41 / 8.48 | 1475.91 / 294.29 / 64.34 / 69.31 / 57.92
LLaVA-NeXT-7B Base | 18.70 / 40.01 / 39.64 | 98.17 / 5.33 / 52.75 | 100.00 / 0.53 / 50.26 | 19.76 / 19.76 / 7.78 | 1519.80 / 330.00 / 67.86 / 70.20 / 61.36
+ FigStep | 11.53 / 8.63 / 23.60 | 96.50 / 7.67 / 52.00 | 100.00 / 0.26 / 50.13 | 29.34 / 20.36 / 10.78 | 1464.63 / 277.14 / 66.58 / 68.62 / 59.98
+ ECSO | 19.61 / 25.71 / 42.58 | 95.50 / 7.67 / 51.58 | 99.74 / 2.11 / 50.92 | 22.75 / 21.56 / 7.19 | 1514.05 / 328.57 / 65.80 / 70.25 / 60.85
+ AdaShield | 0.49 / 0.23 / 1.46 | 23.83 / 81.50 / 52.67 | 88.95 / 20.00 / 54.47 | 32.93 / 0.60 / 1.80 | 1438.66 / 287.86 / 64.08 / 67.67 / 54.24
+ EchoSafe (Ours) | 0.32 / 0.57 / 0.99 | 75.17 / 58.17 / 66.67 | 55.66 / 66.58 / 61.12 | 32.73 / 21.82 / 13.94 | 1503.57 / 286.43 / 67.69 / 69.11 / 58.99
Qwen-2.5-VL-7B Base | 22.72 / 25.05 / 32.91 | 96.67 / 14.17 / 55.42 | 100.00 / 0.53 / 50.26 | 31.14 / 29.94 / 17.96 | 1688.09 / 612.14 / 83.76 / 77.09 / 77.73
+ FigStep | 9.39 / 13.57 / 16.31 | 95.33 / 9.50 / 52.42 | 99.47 / 3.68 / 51.58 | 37.72 / 37.13 / 17.37 | 1610.03 / 591.07 / 83.33 / 79.38 / 70.14
+ ECSO | 20.80 / 21.25 / 32.45 | 96.33 / 9.50 / 52.92 | 100.00 / 0.53 / 50.26 | 32.34 / 31.14 / 14.37 | 1688.09 / 612.14 / 83.76 / 77.09 / 77.74
+ AdaShield | 0.09 / 0.00 / 1.20 | 18.00 / 92.17 / 55.08 | 49.47 / 77.89 / 63.82 | 38.32 / 32.93 / 17.96 | 1386.09 / 586.07 / 84.62 / 84.58 / 68.96
+ EchoSafe (Ours) | 0.04 / 0.02 / 0.71 | 66.17 / 82.17 / 74.17 | 39.21 / 91.58 / 65.40 | 58.18 / 52.12 / 38.79 | 1637.31 / 601.07 / 84.10 / 78.24 / 77.01

Table 5. Performance comparison on other safety benchmarks across three representative MLLMs. For MM-SafetyBench, ASR ↓ (lower is better); for all other benchmarks, higher ↑ is better. Best results are bolded; second-best are underlined. Blue rows denote EchoSafe (Ours).
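For reference, the Avg columns in the MSSBench blocks are consistent with the plain arithmetic mean of the Safe and Unsafe accuracies; the helper below is our own check against the reported numbers, not the official evaluation code.

```python
def mss_avg(safe_acc, unsafe_acc):
    """Average MSSBench accuracy over the safe-context and
    unsafe-context splits (both in percent)."""
    return (safe_acc + unsafe_acc) / 2

# EchoSafe on Qwen-2.5-VL-7B, MSSBench-Chat: Safe = 66.17, Unsafe = 82.17
print(round(mss_avg(66.17, 82.17), 2))  # → 74.17, the reported Avg
```

The split matters because over-defensive methods score high on the Unsafe split while collapsing on the Safe split, which the average exposes.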

Qualitative Examples

Representative examples of EchoSafe's responses on MM-SafetyBench++, demonstrating contextually appropriate refusals on unsafe queries and helpful answers on safe counterparts.

Qualitative Examples. EchoSafe produces contextually appropriate responses by leveraging self-reflective memory.

Citation

If you find our work useful, please consider citing:

@inproceedings{echosafe2026,
    title     = {Evolving Contextual Safety in Multi-Modal Large Language Models via
                 Inference-Time Self-Reflective Memory},
    author    = {Zhang, Ce and He, Jinxi and He, Junyi and Sycara, Katia P. and Xie, Yaqi},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2026}
}