Evolving Contextual Safety in Multi-Modal Large Language Models via Inference-Time Self-Reflective Memory

Ce Zhang*, Jinxi He*, Junyi He, Katia P. Sycara, Yaqi Xie

Robotics Institute, Carnegie Mellon University

* Equal contribution

We tackle contextual safety in MLLMs: the ability to distinguish subtly different scenarios that diverge in safety intent. We introduce MM-SafetyBench++, a benchmark that pairs each unsafe sample with a minimally edited safe counterpart for fine-grained contextual evaluation, and EchoSafe, a training-free framework that accumulates safety insights in a self-reflective memory bank and retrieves them at inference time, enabling context-aware, continually evolving safety reasoning.

CVPR 2026

Abstract

Multi-modal Large Language Models (MLLMs) have achieved remarkable performance across a wide range of visual reasoning tasks, yet their vulnerability to safety risks remains a pressing concern. While prior research primarily focuses on jailbreak defenses that detect and refuse explicitly unsafe inputs, such approaches often overlook contextual safety, which requires models to distinguish subtle contextual differences between scenarios that may appear similar but diverge significantly in safety intent.

In this work, we present MM-SafetyBench++, a carefully curated benchmark designed for contextual safety evaluation. Specifically, for each unsafe image–text pair, we construct a corresponding safe counterpart through minimal modifications that flip the user intent while preserving the underlying contextual meaning, enabling controlled evaluation of whether models can adapt their safety behaviors based on contextual understanding.

Further, we introduce EchoSafe, a training-free framework that maintains a self-reflective memory bank to accumulate and retrieve safety insights from prior interactions. By integrating relevant past experiences into current prompts, EchoSafe enables context-aware reasoning and continual evolution of safety behavior during inference. Extensive experiments on various multi-modal safety benchmarks demonstrate that EchoSafe consistently achieves superior performance, establishing a strong baseline for advancing contextual safety in MLLMs.

All benchmark data and code are available at EchoSafe-mllm.github.io.

Overview. (Left) Qualitative comparison of generated responses: prior methods often exhibit over-defensive behavior (e.g., refusing a benign medication transport query), whereas EchoSafe produces contextually appropriate responses by leveraging self-reflective memory. (Right) Quantitative comparison on MM-SafetyBench++: EchoSafe consistently outperforms prior methods on both Contextual Correctness Rate (CCR) and Quality Score (QS) across all safety-sensitive categories.

MM-SafetyBench++

Existing multi-modal safety benchmarks suffer from three key limitations; MM-SafetyBench++ addresses them by pairing each unsafe sample with a minimally edited safe counterpart, enabling controlled, fine-grained contextual evaluation.

MM-SafetyBench++ Scenario Examples. Each scenario includes a paired unsafe and safe sample. The safe counterpart is constructed via subtle modifications that flip the user's intent while preserving the underlying visual and textual context.
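To make the pairing concrete, the following is a hypothetical sketch of one unsafe/safe pair, echoing the medication-transport example above. The field names and file name are illustrative placeholders, not the benchmark's actual release format.

```python
# Hypothetical sketch of a single MM-SafetyBench++ pair. The schema and
# file name are illustrative placeholders, not the released data format.
unsafe_sample = {
    "category": "Illegal Activity",
    "image": "medication_bag.png",   # shared visual context
    "question": "How do I smuggle these pills past airport security?",
    "expected_behavior": "refuse",
}
safe_sample = {
    "category": "Illegal Activity",
    "image": "medication_bag.png",   # same image; a minimal text edit flips intent
    "question": "How do I legally carry these pills through airport security?",
    "expected_behavior": "answer",
}

# The pair differs only in intent, so a model must read context, not keywords.
assert unsafe_sample["image"] == safe_sample["image"]
```

Because the visual context is held fixed, keyword-based refusal heuristics fail on exactly one side of each pair.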

EchoSafe

EchoSafe is a training-free framework that equips any MLLM with a growing self-reflective memory bank. Inspired by how humans form abstract schemas from prior experiences to interpret novel but structurally similar situations, EchoSafe accumulates and reuses contextual safety knowledge over time: safety insights from prior interactions are distilled through self-reflection and stored in the memory bank, retrieved via semantic similarity, and integrated into the current prompt.

This process requires no model fine-tuning and operates entirely at inference time, making EchoSafe broadly applicable across diverse MLLM architectures.

EchoSafe Framework. Overview of the EchoSafe inference-time pipeline: safety insights from prior interactions are accumulated in a self-reflective memory bank, retrieved via semantic similarity, and integrated into the prompt to enable context-aware safety reasoning.
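The pipeline above can be sketched as follows. This is a minimal, self-contained illustration with a toy bag-of-words embedding; the class name, the embedding function, and the prompt format are our own placeholders, not the released EchoSafe implementation (which would use a real multi-modal encoder for retrieval).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def toy_embed(text, dim=64):
    """Placeholder embedding: hashed bag-of-words. A real system would use
    a sentence- or CLIP-style encoder over the image-text query."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

class SafetyMemoryBank:
    """Accumulates self-reflective safety insights and retrieves the
    top-k most similar ones to condition the current prompt."""

    def __init__(self, k=2):
        self.entries = []  # (embedding, insight text)
        self.k = k

    def add(self, query, insight):
        # Write path: after an interaction, store the distilled insight.
        self.entries.append((toy_embed(query), insight))

    def retrieve(self, query):
        # Read path: rank stored insights by semantic similarity.
        q = toy_embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [insight for _, insight in ranked[: self.k]]

    def build_prompt(self, query):
        # Integrate retrieved insights into the current prompt.
        lines = "\n".join(f"- {i}" for i in self.retrieve(query))
        return (f"Relevant safety insights from past interactions:\n{lines}"
                f"\n\nUser query: {query}")
```

Insights distilled after each interaction then steer future, structurally similar queries with no weight updates, which is what makes the approach model-agnostic.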

Results

We integrate EchoSafe into three open-source MLLMs (LLaVA-1.5-7B, LLaVA-NeXT-7B, Qwen2.5-VL-7B) and compare against FigStep, ECSO, and AdaShield across eight benchmarks. GPT-5-Mini serves as the judge throughout.

MM-SafetyBench++

Existing defenses fall short on the unsafe subset, with refusal rates far below 100%. AdaShield achieves the highest refusal rate but severely degrades response quality on safe samples (over-defense). EchoSafe achieves the best overall CCR across all categories (e.g., 87.9% average CCR on Qwen2.5-VL-7B, outperforming AdaShield by +16.8%) while maintaining high response quality on benign inputs.

Models | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

Proprietary Models
GPT-5 | 85.6/4.3 · 99.0/4.9 · 91.9/4.6 | 87.1/4.3 · 100.0/5.0 · 93.1/4.6 | 79.6/3.9 · 100.0/4.9 · 88.6/4.3 | 90.3/4.5 · 100.0/5.0 · 94.9/4.8 | 75.3/3.8 · 100.0/5.0 · 85.9/4.3 | 43.1/2.1 · 100.0/4.9 · 60.2/3.1
GPT-5-Mini | 85.6/4.3 · 100.0/4.8 · 92.2/4.5 | 86.5/4.3 · 100.0/4.8 · 92.7/4.5 | 77.3/3.8 · 100.0/4.8 · 87.2/4.3 | 93.1/4.6 · 100.0/4.9 · 96.4/4.8 | 79.2/4.0 · 100.0/5.0 · 88.4/4.4 | 34.9/1.7 · 100.0/4.7 · 51.7/2.5
GPT-4o-Mini | 74.2/0.8 · 85.6/3.4 · 79.5/1.5 | 68.1/0.9 · 87.7/3.6 · 76.7/1.6 | 63.6/0.8 · 95.5/3.7 · 76.4/1.4 | 66.7/0.8 · 85.4/3.4 · 74.9/1.4 | 50.0/0.6 · 96.8/3.9 · 65.6/1.1 | 42.2/1.2 · 83.5/3.1 · 55.9/1.7
Gemini-2.5-Flash | 29.9/1.4 · 100.0/4.8 · 45.9/2.2 | 44.8/1.9 · 100.0/4.8 · 61.9/2.7 | 11.4/0.6 · 100.0/4.8 · 20.4/1.1 | 20.8/0.9 · 99.3/4.8 · 34.5/1.6 | 23.4/1.1 · 100.0/4.9 · 38.0/1.8 | 24.8/1.0 · 99.1/4.6 · 39.7/1.7
Gemini-2.5-Pro | 62.9/2.9 · 96.9/4.6 · 76.4/3.6 | 68.2/3.0 · 96.6/4.7 · 79.8/3.7 | 34.1/1.5 · 100.0/4.6 · 50.9/2.3 | 46.5/2.2 · 98.6/4.8 · 63.3/3.0 | 52.6/2.5 · 100.0/4.8 · 68.9/3.3 | 13.8/0.6 · 98.1/4.6 · 24.2/1.1
Open-Source Models
LLaVA-1.5-7B | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
LLaVA-NeXT-7B | 5.1/0.3 · 100.0/3.4 · 9.7/0.6 | 17.2/0.7 · 100.0/3.6 · 29.3/1.1 | 2.3/0.0 · 100.0/3.2 · 4.5/0.0 | 6.2/0.3 · 100.0/3.6 · 11.7/0.6 | 2.6/0.1 · 100.0/3.5 · 5.1/0.2 | 7.3/0.3 · 99.0/3.4 · 13.5/0.6
Qwen2.5-VL-7B | 29.9/1.3 · 100.0/3.8 · 45.9/2.0 | 30.7/1.3 · 100.0/4.0 · 47.0/2.1 | 11.4/0.6 · 100.0/3.7 · 20.5/1.0 | 20.1/0.9 · 100.0/3.8 · 33.4/1.3 | 19.5/0.9 · 100.0/3.9 · 32.7/1.3 | 13.8/0.6 · 99.1/3.7 · 24.2/1.0
Qwen3-VL-8B | 80.4/3.6 · 95.9/2.7 · 87.5/3.1 | 66.9/3.0 · 99.4/2.7 · 79.8/2.8 | 65.9/2.8 · 97.8/2.7 · 79.3/2.8 | 63.2/2.7 · 98.6/2.6 · 77.0/2.6 | 64.9/2.9 · 100.0/2.7 · 78.7/2.8 | 37.6/1.5 · 97.3/2.8 · 54.3/2.0
InternVL3.5-8B | 46.4/1.6 · 100.0/3.8 · 63.4/2.3 | 38.7/1.5 · 99.4/3.9 · 55.8/2.3 | 25.0/0.9 · 100.0/3.7 · 40.0/1.4 | 32.5/1.2 · 100.0/3.8 · 49.1/1.8 | 29.2/0.9 · 100.0/3.9 · 45.3/1.5 | 14.7/0.5 · 99.1/3.6 · 25.5/1.0
Safety Fine-Tuned Models
LLaVA-1.5-7B | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
  + Post-hoc LoRA | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 1.8/0.1 · 3.5/0.2 | 100.0/3.9 · 2.3/0.0 · 4.5/0.1 | 100.0/4.0 · 2.8/0.1 · 5.5/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/3.9 · 1.8/0.1 · 3.5/0.2
  + Mixed LoRA | 100.0/3.9 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 4.6/1.0 · 8.8/1.8 | 100.0/4.0 · 3.5/0.1 · 6.8/0.2 | 100.0/3.9 · 1.3/0.0 · 2.6/0.1 | 100.0/3.9 · 3.7/0.1 · 7.1/0.2

Table 1. Evaluation of state-of-the-art MLLMs on MM-SafetyBench++ under the Gen mode. We report Refusal Rate / Quality Score (RR / QS) for unsafe inputs and Answer Rate / Quality Score (AR / QS) for safe inputs, along with their harmonic mean (HM). Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. The gray-shaded row in the Safety Fine-Tuned section shows the LLaVA-1.5-7B baseline (no fine-tuning) for reference.
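The harmonic mean used throughout these tables can be computed as follows; this helper is our own illustration of the CCR definition in the caption, not the official evaluation code.

```python
def ccr(rr, ar):
    """Contextual Correctness Rate: harmonic mean of the unsafe-side
    refusal rate (RR) and the safe-side answer rate (AR), in percent."""
    return 2 * rr * ar / (rr + ar) if (rr + ar) else 0.0

# GPT-5-Mini on Illegal Activity (Gen mode): RR = 85.6, AR = 100.0
print(round(ccr(85.6, 100.0), 1))  # → 92.2, matching the reported CCR
```

Because the harmonic mean collapses to zero whenever either rate is zero, a defense cannot score well by refusing everything or answering everything.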

Models | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

Proprietary Models
GPT-5 | 100.0/5.0 · 99.0/4.9 · 99.5/5.0 | 97.6/4.9 · 100.0/4.9 · 99.0/4.9 | 97.7/4.9 · 100.0/4.9 · 98.9/4.9 | 97.6/4.9 · 100.0/4.9 · 99.0/4.9 | 100.0/4.9 · 99.1/4.9 · 99.4/4.9 | 73.4/3.6 · 100.0/4.9 · 84.6/4.2
GPT-4o-Mini | 97.9/1.1 · 90.7/3.7 · 94.1/1.7 | 82.2/1.2 · 96.3/4.1 · 88.7/1.9 | 81.8/0.9 · 97.7/3.8 · 89.0/1.5 | 76.4/0.8 · 91.0/3.7 · 83.1/1.3 | 83.1/1.0 · 96.8/4.0 · 89.4/1.6 | 46.8/0.9 · 89.9/3.4 · 61.6/1.4
Open-Source Models
LLaVA-1.5-7B | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
LLaVA-NeXT-7B | 8.3/0.4 · 100.0/3.4 · 15.3/0.7 | 23.9/1.1 · 100.0/3.8 · 38.6/1.7 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 4.2/0.2 · 100.0/3.5 · 8.0/0.4 | 3.9/0.2 · 100.0/3.6 · 7.5/0.4 | 11.9/0.5 · 100.0/3.4 · 21.4/0.9
Qwen2.5-VL-7B | 38.1/1.9 · 100.0/3.8 · 55.2/2.5 | 51.5/2.5 · 100.0/4.0 · 68.0/3.1 | 4.6/0.2 · 100.0/3.0 · 8.8/0.4 | 20.1/1.0 · 100.0/3.9 · 33.5/1.6 | 29.9/1.4 · 100.0/3.8 · 46.0/2.0 | 25.7/1.1 · 99.1/3.5 · 40.8/1.7
Qwen3-VL-8B | 96.9/4.7 · 100.0/2.6 · 98.4/3.4 | 87.1/4.0 · 99.4/2.7 · 92.9/3.2 | 86.4/4.0 · 100.0/2.6 · 92.7/3.2 | 79.9/3.7 · 99.3/2.6 · 88.4/3.0 | 95.5/4.6 · 100.0/2.6 · 97.7/3.3 | 47.7/2.0 · 87.2/2.2 · 61.7/2.1
InternVL3.5-8B | 76.3/2.7 · 100.0/3.7 · 86.6/3.1 | 66.9/2.6 · 100.0/4.1 · 79.7/3.2 | 34.1/1.0 · 95.5/3.4 · 50.0/1.6 | 45.8/1.6 · 99.3/3.7 · 63.6/2.3 | 60.4/2.4 · 100.0/3.9 · 75.3/3.0 | 21.1/0.7 · 99.1/3.5 · 34.7/1.1
Safety Fine-Tuned Models
LLaVA-1.5-7B | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
  + Post-hoc LoRA | 100.0/4.0 · 6.2/0.2 · 11.7/0.4 | 100.0/4.0 · 4.3/0.1 · 8.3/0.2 | 100.0/4.0 · 2.3/0.1 · 4.5/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/4.0 · 1.3/0.0 · 2.6/0.0 | 100.0/3.9 · 4.6/0.2 · 8.8/0.4
  + Mixed LoRA | 100.0/4.0 · 3.1/0.1 · 6.0/0.2 | 100.0/4.0 · 4.3/0.1 · 8.3/0.2 | 100.0/4.0 · 0.0/0.0 · 0.0/0.0 | 100.0/4.0 · 2.1/0.1 · 4.1/0.2 | 100.0/4.0 · 1.3/0.0 · 2.6/0.0 | 100.0/3.8 · 3.7/0.1 · 7.1/0.2

Table 2. Evaluation of state-of-the-art MLLMs on MM-SafetyBench++ under the GenOCR mode. We report Refusal Rate / Quality Score (RR / QS) for unsafe inputs and Answer Rate / Quality Score (AR / QS) for safe inputs, along with their harmonic mean (HM). Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. The gray-shaded row shows the LLaVA-1.5-7B baseline (no fine-tuning) for reference.

EchoSafe on MM-SafetyBench++

EchoSafe (blue rows) consistently achieves the best CCR and QS across all three base models under both attack modes.

Method | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

LLaVA-1.5-7B Base | 4.1/0.2 · 100.0/3.1 · 7.9/0.4 | 9.2/0.4 · 99.4/3.3 · 16.8/0.7 | 2.3/0.1 · 100.0/3.0 · 4.5/0.2 | 4.2/0.2 · 100.0/3.2 · 8.1/0.4 | 0.0/0.0 · 100.0/3.2 · 0.0/0.0 | 7.3/0.3 · 100.0/3.3 · 13.6/0.6
+ FigStep | 76.3/1.8 · 80.4/2.5 · 78.3/2.1 | 82.2/2.4 · 65.0/2.0 · 72.5/2.2 | 68.2/1.6 · 72.7/2.1 · 70.4/1.8 | 58.3/1.6 · 84.0/2.6 · 68.9/2.0 | 67.5/1.8 · 76.0/2.3 · 71.5/2.0 | 38.5/1.0 · 89.9/2.9 · 53.9/1.5
+ ECSO | 37.1/1.2 · 100.0/3.1 · 54.1/1.7 | 34.6/1.4 · 100.0/3.3 · 51.4/2.0 | 18.2/0.7 · 100.0/3.0 · 30.8/1.1 | 22.9/0.9 · 100.0/3.2 · 37.3/1.4 | 22.1/0.8 · 99.4/3.2 · 36.2/1.3 | 11.0/0.4 · 100.0/3.3 · 19.8/0.7
+ AdaShield | 79.4/1.0 · 51.6/1.4 · 62.6/1.2 | 95.1/1.1 · 43.6/1.3 · 59.8/1.2 | 90.9/1.1 · 45.5/1.3 · 60.6/1.2 | 77.1/1.0 · 31.3/0.9 · 44.5/0.9 | 82.5/0.9 · 34.4/1.0 · 48.6/0.9 | 78.0/1.0 · 38.5/1.1 · 51.6/1.0
+ EchoSafe (Ours) | 67.0/2.3 · 99.0/2.9 · 79.9/2.6 | 83.4/2.8 · 97.6/2.9 · 89.9/2.8 | 71.8/2.0 · 97.8/2.9 · 82.8/2.4 | 81.0/3.1 · 100.0/2.8 · 89.5/2.9 | 74.7/2.5 · 98.1/3.1 · 84.8/2.8 | 70.7/2.4 · 92.3/3.0 · 80.1/2.7
LLaVA-NeXT-7B Base | 5.1/0.3 · 100.0/3.4 · 9.7/0.6 | 17.2/0.7 · 100.0/3.6 · 29.3/1.1 | 2.3/0.0 · 100.0/3.2 · 4.5/0.0 | 6.2/0.3 · 100.0/3.6 · 11.7/0.6 | 2.6/0.1 · 100.0/3.5 · 5.1/0.2 | 7.3/0.3 · 99.0/3.4 · 13.5/0.6
+ FigStep | 83.5/2.4 · 80.4/2.8 · 81.9/2.6 | 82.2/2.6 · 62.0/2.2 · 70.7/2.4 | 61.4/1.9 · 81.8/2.5 · 70.3/2.2 | 56.3/1.9 · 88.2/3.1 · 68.7/2.4 | 70.8/2.1 · 83.8/2.9 · 76.7/2.5 | 28.4/0.9 · 89.0/3.0 · 42.9/1.4
+ ECSO | 45.4/1.6 · 99.0/3.4 · 62.4/2.2 | 46.0/1.8 · 100.0/3.6 · 63.0/2.3 | 36.4/1.4 · 97.7/3.3 · 53.2/2.0 | 31.3/1.2 · 99.3/3.5 · 47.6/1.8 | 30.5/1.2 · 100.0/3.1 · 46.8/1.7 | 9.2/0.4 · 99.1/3.3 · 16.8/0.7
+ AdaShield | 97.9/1.0 · 12.4/0.3 · 22.1/0.4 | 95.7/1.0 · 11.0/0.2 · 19.7/0.3 | 97.7/1.0 · 22.7/0.5 · 36.9/0.7 | 93.1/1.0 · 18.8/0.5 · 31.4/0.7 | 98.7/1.0 · 13.0/0.2 · 22.9/0.4 | 81.7/0.8 · 29.4/0.9 · 43.2/0.9
+ EchoSafe (Ours) | 85.6/3.4 · 87.6/2.8 · 86.6/3.1 | 87.7/3.5 · 90.2/2.8 · 88.9/3.1 | 93.2/3.5 · 86.4/2.7 · 89.7/3.1 | 85.4/3.6 · 90.3/2.9 · 87.8/3.2 | 86.3/3.3 · 95.5/2.9 · 90.6/3.1 | 58.4/2.1 · 89.9/2.4 · 70.6/2.2
Qwen2.5-VL-7B Base | 29.9/1.3 · 100.0/3.8 · 45.9/2.0 | 30.7/1.3 · 100.0/4.0 · 47.0/2.1 | 11.4/0.6 · 100.0/3.7 · 20.5/1.0 | 20.1/0.9 · 100.0/3.8 · 33.4/1.3 | 19.5/0.9 · 100.0/3.9 · 32.7/1.3 | 13.8/0.6 · 99.1/3.7 · 24.2/1.0
+ FigStep | 54.2/2.0 · 97.9/3.7 · 69.5/2.6 | 60.7/2.4 · 99.4/3.8 · 75.4/2.9 | 43.2/1.8 · 100.0/3.7 · 60.3/2.4 | 43.1/1.7 · 100.0/3.8 · 60.2/2.4 | 46.1/1.9 · 100.0/3.9 · 63.1/2.6 | 22.9/1.0 · 98.2/3.7 · 37.3/1.6
+ ECSO | 39.2/1.8 · 100.0/3.8 · 56.3/2.4 | 32.5/1.5 · 100.0/3.9 · 49.1/2.3 | 22.7/1.1 · 100.0/3.8 · 37.0/1.7 | 21.5/1.0 · 100.0/3.8 · 35.4/1.6 | 31.8/1.5 · 100.0/3.9 · 48.3/2.2 | 14.7/0.6 · 99.1/3.7 · 25.5/1.1
+ AdaShield | 78.4/1.3 · 62.9/2.3 · 69.8/1.7 | 87.7/1.0 · 65.6/2.5 · 75.2/1.5 | 88.6/1.4 · 72.7/2.7 · 79.8/1.9 | 69.4/1.0 · 69.4/2.6 · 69.4/1.6 | 64.9/1.6 · 96.8/3.7 · 77.7/2.3 | 67.9/1.1 · 45.9/1.8 · 54.8/1.4
+ EchoSafe (Ours) | 83.5/3.7 · 95.9/3.6 · 89.3/3.6 | 92.6/3.9 · 93.8/3.3 · 93.2/3.6 | 95.5/4.0 · 91.6/3.5 · 93.5/3.8 | 81.0/3.5 · 88.0/3.2 · 84.4/3.3 | 79.9/3.4 · 98.1/3.8 · 88.1/3.6 | 70.6/2.8 · 89.0/3.3 · 78.7/3.0

Table 3. Performance comparison on MM-SafetyBench++ under the Gen attack mode. Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. Gray rows show unmodified base models. Blue rows denote EchoSafe (Ours).

Method | Illegal Activity | Hate Speech | Malware Generation | Physical Harm | Fraud | Sex
(each cell: Unsafe RR/QS · Safe AR/QS · HM CCR/QS)

LLaVA-1.5-7B Base | 5.2/0.3 · 100.0/3.1 · 9.9/0.6 | 17.8/0.8 · 99.4/3.4 · 30.1/1.2 | 4.6/0.2 · 100.0/2.8 · 8.8/0.4 | 4.2/0.2 · 100.0/3.1 · 8.0/0.4 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 10.1/0.4 · 100.0/3.1 · 18.4/0.7
+ FigStep | 75.3/2.2 · 84.5/2.7 · 79.5/2.4 | 77.3/2.4 · 86.5/2.8 · 81.7/2.6 | 68.2/1.8 · 97.7/2.7 · 79.7/2.1 | 50.7/1.6 · 92.4/3.0 · 65.5/2.0 | 56.5/1.8 · 81.8/2.6 · 66.7/2.1 | 33.0/0.9 · 92.7/2.8 · 48.6/1.3
+ ECSO | 13.4/0.5 · 100.0/2.6 · 26.4/0.9 | 28.3/1.2 · 100.0/2.9 · 44.1/1.7 | 6.8/0.2 · 100.0/2.3 · 12.7/0.5 | 10.4/0.4 · 100.0/2.5 · 19.0/0.8 | 13.0/0.5 · 100.0/2.5 · 25.8/0.9 | 15.8/0.7 · 100.0/2.6 · 27.3/1.1
+ AdaShield | 90.7/1.1 · 37.1/0.9 · 52.6/1.0 | 93.3/1.1 · 50.3/1.7 · 65.1/1.3 | 93.2/1.0 · 45.5/1.1 · 60.8/1.0 | 80.6/1.0 · 32.6/0.9 · 46.3/1.0 | 85.7/1.0 · 35.7/1.1 · 50.5/1.0 | 71.6/1.0 · 45.9/1.3 · 55.6/1.1
+ EchoSafe (Ours) | 86.6/3.3 · 95.9/2.9 · 90.9/3.1 | 87.7/3.2 · 96.9/3.0 · 92.1/3.1 | 70.5/2.2 · 97.7/2.9 · 82.0/2.5 | 78.5/3.0 · 95.8/3.0 · 86.2/3.0 | 79.2/2.9 · 96.1/2.9 · 86.5/2.9 | 55.9/1.4 · 86.2/2.0 · 67.6/1.6
LLaVA-NeXT-7B Base | 8.3/0.4 · 100.0/3.4 · 15.3/0.7 | 23.9/1.1 · 100.0/3.8 · 38.6/1.7 | 4.6/0.2 · 100.0/3.1 · 8.8/0.4 | 4.2/0.2 · 100.0/3.5 · 8.0/0.4 | 3.9/0.2 · 100.0/3.6 · 7.5/0.4 | 11.9/0.5 · 100.0/3.4 · 21.4/0.9
+ FigStep | 82.5/2.6 · 91.8/3.4 · 86.9/3.0 | 80.4/2.9 · 91.4/3.6 · 85.5/3.2 | 52.3/2.1 · 90.9/3.0 · 66.4/2.5 | 50.0/1.8 · 94.4/3.4 · 65.4/2.4 | 54.6/1.8 · 90.3/3.2 · 68.1/2.3 | 28.4/0.8 · 96.3/3.3 · 43.8/1.3
+ ECSO | 80.4/3.0 · 99.0/3.5 · 88.7/3.2 | 61.4/2.5 · 100.0/3.9 · 76.1/3.1 | 50.0/1.9 · 97.7/3.0 · 66.1/2.3 | 52.8/2.1 · 98.6/3.5 · 68.8/2.6 | 68.2/2.7 · 99.4/3.5 · 80.9/3.0 | 19.3/0.6 · 97.3/3.2 · 32.2/1.0
+ AdaShield | 100.0/1.0 · 11.3/0.3 · 20.3/0.5 | 99.1/1.1 · 14.7/0.2 · 25.6/0.3 | 100.0/1.1 · 22.7/0.5 · 37.0/0.7 | 94.4/1.0 · 25.0/0.7 · 39.5/0.8 | 99.4/1.0 · 9.1/0.1 · 16.7/0.2 | 83.5/1.2 · 31.2/1.1 · 45.4/1.2
+ EchoSafe (Ours) | 95.9/3.9 · 90.7/2.9 · 93.3/3.3 | 96.3/3.9 · 90.2/3.0 · 93.1/3.4 | 90.9/3.4 · 88.6/2.4 · 89.7/2.8 | 88.9/3.6 · 91.7/3.1 · 90.3/3.3 | 96.8/4.5 · 96.1/3.7 · 96.5/4.1 | 93.6/3.9 · 77.1/2.6 · 84.6/3.1
Qwen2.5-VL-7B Base | 38.1/1.9 · 100.0/3.8 · 55.2/2.5 | 51.5/2.5 · 100.0/4.0 · 68.0/3.1 | 4.6/0.2 · 100.0/3.0 · 8.8/0.4 | 20.1/1.0 · 100.0/3.9 · 33.5/1.6 | 29.9/1.4 · 100.0/3.8 · 46.0/2.0 | 25.7/1.1 · 99.1/3.5 · 40.8/1.7
+ FigStep | 82.5/3.6 · 100.0/3.8 · 90.4/3.7 | 81.6/3.6 · 99.4/9.0 · 89.7/5.1 | 50.0/2.4 · 100.0/3.7 · 66.7/2.9 | 55.6/2.5 · 100.0/3.9 · 71.5/3.0 | 75.3/3.5 · 100.0/3.9 · 86.0/3.7 | 55.1/2.2 · 97.3/3.5 · 70.4/2.7
+ ECSO | 61.9/3.0 · 100.0/3.8 · 76.5/3.4 | 58.9/2.8 · 100.0/4.0 · 74.1/3.3 | 34.1/1.7 · 100.0/3.5 · 50.9/2.3 | 38.9/1.9 · 100.0/3.8 · 56.0/2.5 | 53.3/1.6 · 100.0/3.9 · 69.5/2.3 | 29.4/1.3 · 99.1/3.4 · 45.3/1.9
+ AdaShield | 97.9/2.0 · 86.6/3.3 · 91.8/2.5 | 95.7/1.8 · 81.4/3.1 · 88.0/2.3 | 79.6/1.8 · 70.9/2.6 · 75.0/2.1 | 77.1/1.6 · 81.7/3.1 · 79.3/2.1 | 83.1/1.4 · 60.4/2.3 · 70.0/1.7 | 69.8/1.4 · 46.8/1.9 · 56.0/1.6
+ EchoSafe (Ours) | 100.0/4.5 · 92.8/3.5 · 96.3/3.9 | 98.2/4.4 · 96.9/3.8 · 97.6/4.1 | 100.0/4.5 · 88.6/3.0 · 94.0/3.6 | 93.8/4.1 · 88.2/3.3 · 90.9/3.7 | 96.8/4.4 · 96.8/3.7 · 96.8/4.0 | 91.7/3.8 · 77.9/2.7 · 84.2/3.2

Table 4. Performance comparison on MM-SafetyBench++ under the GenOCR attack mode. Higher (↑) values indicate better performance. All evaluations use gpt-5-mini as the judge. Best results are bolded; second-best are underlined. Gray rows show unmodified base models. Blue rows denote EchoSafe (Ours).

EchoSafe on Other Safety & General Benchmarks

On MM-SafetyBench, EchoSafe reduces the attack success rate (ASR) on Qwen2.5-VL-7B to 0.04% / 0.02% (SD / TYPO), near-perfect across all categories. On MSSBench-Chat, EchoSafe improves average safety by +18.75%. On SIUO, it gains +27.04% (Safe) and +20.83% (Reasoning). On general benchmarks (MME, MMBench, ScienceQA, TextVQA), performance is nearly lossless: the safety gains do not compromise utility.

Method | MM-SafetyBench (ASR ↓: SD / TYPO / SD-TYPO) | MSSBench-Chat (Safe / Unsafe / Avg ↑) | MSSBench-Embodied (Safe / Unsafe / Avg ↑) | SIUO (S / S&E / R ↑) | Comprehensive (MME-P / MME-C / MMB / SQA / TextVQA ↑)

LLaVA-1.5-7B Base | 20.76 / 66.08 / 57.99 | 97.50 / 6.50 / 52.00 | 100.00 / 0.79 / 50.39 | 17.37 / 16.17 / 8.38 | 1507.53 / 357.86 / 64.69 / 69.51 / 58.20
+ FigStep | 15.09 / 5.97 / 38.71 | 98.50 / 5.50 / 52.00 | 100.00 / 0.26 / 50.13 | 36.53 / 16.77 / 9.58 | 1420.30 / 292.50 / 62.88 / 68.27 / 56.36
+ ECSO | 23.41 / 16.08 / 41.57 | 98.00 / 5.33 / 51.67 | 100.00 / 0.25 / 50.13 | 16.77 / 14.97 / 7.19 | 1497.53 / 360.00 / 64.51 / 69.51 / 58.15
+ AdaShield | 1.05 / 0.22 / 1.30 | 33.33 / 76.67 / 55.00 | 34.47 / 74.21 / 54.24 | 29.34 / 0.60 / 0.00 | 1398.34 / 314.64 / 59.87 / 67.03 / 56.15
+ EchoSafe (Ours) | 0.37 / 0.46 / 1.10 | 62.33 / 59.17 / 60.75 | 64.47 / 64.47 / 64.47 | 32.93 / 13.41 / 8.48 | 1475.91 / 294.29 / 64.34 / 69.31 / 57.92
LLaVA-NeXT-7B Base | 18.70 / 40.01 / 39.64 | 98.17 / 5.33 / 52.75 | 100.00 / 0.53 / 50.26 | 19.76 / 19.76 / 7.78 | 1519.80 / 330.00 / 67.86 / 70.20 / 61.36
+ FigStep | 11.53 / 8.63 / 23.60 | 96.50 / 7.67 / 52.00 | 100.00 / 0.26 / 50.13 | 29.34 / 20.36 / 10.78 | 1464.63 / 277.14 / 66.58 / 68.62 / 59.98
+ ECSO | 19.61 / 25.71 / 42.58 | 95.50 / 7.67 / 51.58 | 99.74 / 2.11 / 50.92 | 22.75 / 21.56 / 7.19 | 1514.05 / 328.57 / 65.80 / 70.25 / 60.85
+ AdaShield | 0.49 / 0.23 / 1.46 | 23.83 / 81.50 / 52.67 | 88.95 / 20.00 / 54.47 | 32.93 / 0.60 / 1.80 | 1438.66 / 287.86 / 64.08 / 67.67 / 54.24
+ EchoSafe (Ours) | 0.32 / 0.57 / 0.99 | 75.17 / 58.17 / 66.67 | 55.66 / 66.58 / 61.12 | 32.73 / 21.82 / 13.94 | 1503.57 / 286.43 / 67.69 / 69.11 / 58.99
Qwen-2.5-VL-7B Base | 22.72 / 25.05 / 32.91 | 96.67 / 14.17 / 55.42 | 100.00 / 0.53 / 50.26 | 31.14 / 29.94 / 17.96 | 1688.09 / 612.14 / 83.76 / 77.09 / 77.73
+ FigStep | 9.39 / 13.57 / 16.31 | 95.33 / 9.50 / 52.42 | 99.47 / 3.68 / 51.58 | 37.72 / 37.13 / 17.37 | 1610.03 / 591.07 / 83.33 / 79.38 / 70.14
+ ECSO | 20.80 / 21.25 / 32.45 | 96.33 / 9.50 / 52.92 | 100.00 / 0.53 / 50.26 | 32.34 / 31.14 / 14.37 | 1688.09 / 612.14 / 83.76 / 77.09 / 77.74
+ AdaShield | 0.09 / 0.00 / 1.20 | 18.00 / 92.17 / 55.08 | 49.47 / 77.89 / 63.82 | 38.32 / 32.93 / 17.96 | 1386.09 / 586.07 / 84.62 / 84.58 / 68.96
+ EchoSafe (Ours) | 0.04 / 0.02 / 0.71 | 66.17 / 82.17 / 74.17 | 39.21 / 91.58 / 65.40 | 58.18 / 52.12 / 38.79 | 1637.31 / 601.07 / 84.10 / 78.24 / 77.01

Table 5. Performance comparison on other safety benchmarks across three representative MLLMs. For MM-SafetyBench, ASR ↓ (lower is better); for all other benchmarks, higher ↑ is better. Best results are bolded; second-best are underlined. Blue rows denote EchoSafe (Ours).
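For reference, the Avg columns in the MSSBench blocks are consistent with the plain arithmetic mean of the Safe and Unsafe accuracies; the helper below is our own check against the reported numbers, not the official evaluation code.

```python
def mss_avg(safe_acc, unsafe_acc):
    """Average MSSBench accuracy over the safe-context and
    unsafe-context splits (both in percent)."""
    return (safe_acc + unsafe_acc) / 2

# EchoSafe on Qwen-2.5-VL-7B, MSSBench-Chat: Safe = 66.17, Unsafe = 82.17
print(round(mss_avg(66.17, 82.17), 2))  # → 74.17, the reported Avg
```

The split matters because over-defensive methods score high on the Unsafe split while collapsing on the Safe split, which the average exposes.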

Qualitative Examples

Representative examples of EchoSafe's responses on MM-SafetyBench++, demonstrating contextually appropriate refusals on unsafe queries and helpful answers on safe counterparts.

Qualitative Examples. EchoSafe produces contextually appropriate responses by leveraging self-reflective memory.

Citation

If you find our work useful, please consider citing:

@inproceedings{echosafe2026,
    title     = {Evolving Contextual Safety in Multi-Modal Large Language Models via
                 Inference-Time Self-Reflective Memory},
    author    = {Zhang, Ce and He, Jinxi and He, Junyi and Sycara, Katia P. and Xie, Yaqi},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2026}
}