EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits

The Hebrew University of Jerusalem¹, Tel Aviv University², Google Research³
ACL 2025

EditInspector Benchmark

Example of models evaluating the edit “Let the floor be made of wood.” Only a few models identified that the edit was executed correctly. Gemini 1.5 failed to detect any differences between the images. GPT-4o recognized the main floor change but missed additional edits, such as alterations to the fridge, door, and text on the yellow box.

Abstract

Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision-and-language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.

Human Evaluation Framework

Evaluation Metrics

Comparison of traditional metrics (BLEU, ROUGE, METEOR) against our proposed evaluation metric (MP). In the first example, the traditional metrics assign high scores despite the caption missing the edited object; in the second, they penalize a correct but longer caption; in the third, they fail to detect a reversed edit. Our metric captures all of these issues.

We propose two novel evaluation metrics tailored for difference caption comparisons: Model Precision (MP) and Hallucination Rate (HR). MP is the percentage of human-annotated differences matching model-detected ones, while HR is the percentage of model-detected differences that do not correspond to any human-annotated difference.
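As a minimal illustration (our own sketch, not the benchmark's official implementation), the following Python computes MP and HR from lists of human-annotated and model-detected differences; the matches(human, model) helper, which decides whether two difference descriptions refer to the same change, is a hypothetical placeholder.

from typing import Callable

def compute_mp_hr(
    human_diffs: list[str],
    model_diffs: list[str],
    matches: Callable[[str, str], bool],
) -> tuple[float, float]:
    # MP: share of human-annotated differences matched by at least one
    # model-detected difference (per the definition above).
    matched_human = sum(
        any(matches(h, m) for m in model_diffs) for h in human_diffs
    )
    # HR: share of model-detected differences that match no
    # human-annotated difference.
    hallucinated = sum(
        not any(matches(h, m) for h in human_diffs) for m in model_diffs
    )
    mp = matched_human / len(human_diffs) if human_diffs else 0.0
    hr = hallucinated / len(model_diffs) if model_diffs else 0.0
    return mp, hr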

Benchmark Table

Combined performance on the EditInspector questions and the Difference Caption Generation task. GPT-4o demonstrates the best performance on the EditInspector questions. Qwen2.5-VL achieves the highest precision in predicting differences, with the lowest hallucination rate.

Avg. Diff indicates the average number of differences detected per edit, while No Diffs represents the percentage of edits for which no differences were predicted.
Human annotators identified an average of 6 differences per edit. The main difference row reports the percentage of predicted main-difference captions that correctly describe the main difference.

New Methods: Difference Caption Generation

Example of our pipeline generating an instruction-grounded difference caption with rich metadata. The edited images are split into three zoom levels, and Gemini extracts and prioritizes captions at each level to generate the metadata; a sketch of this staged approach appears below.
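For concreteness, here is a rough sketch of such a staged pipeline under our assumptions: the zoom factors, the crop_around helper, the prompt wording, and the query_vlm(prompt, images) wrapper standing in for the Gemini call are all illustrative placeholders rather than the exact implementation.

from PIL import Image

ZOOM_FACTORS = (1.0, 2.0, 4.0)  # full view plus two progressively tighter crops

def crop_around(image, box, zoom):
    # Crop centered on the edit bounding box; higher zoom -> tighter crop.
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = image.width / (2.0 * zoom), image.height / (2.0 * zoom)
    return image.crop((
        max(0, int(cx - half_w)), max(0, int(cy - half_h)),
        min(image.width, int(cx + half_w)), min(image.height, int(cy + half_h)),
    ))

def generate_difference_caption(pre, post, edit_box, instruction, query_vlm):
    # query_vlm(prompt, images) is a placeholder for a VLM call (e.g. Gemini).
    per_zoom_captions = []
    for zoom in ZOOM_FACTORS:
        crops = [crop_around(pre, edit_box, zoom), crop_around(post, edit_box, zoom)]
        prompt = (
            f"The edit instruction was: '{instruction}'. "
            "List every visual difference between these two crops."
        )
        per_zoom_captions.append(query_vlm(prompt, crops))
    # Final pass: merge and prioritize the per-zoom captions into a single
    # instruction-grounded difference caption plus metadata.
    merge_prompt = (
        "Combine and prioritize these difference lists into one caption "
        "describing the main change and any additional changes:\n"
        + "\n".join(per_zoom_captions)
    )
    return query_vlm(merge_prompt, [pre, post])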

New Methods: Artifact Detection

Edit: "turn the stop sign to a lollipop".

We propose two novel methods for detecting artifacts using the Detic model.
The first method compares segmentation probabilities for objects intersecting the turquoise in-painting mask between the pre-edit (left) and post-edit (right) images; here it reveals two artifacts, the truck and the small car, whose probability drops exceed our threshold. The second method identifies elements that intersect with the mask area, have disappeared from the image, and do not overlap with the edited object’s bounding box.
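The sketch below illustrates both checks under simplifying assumptions: detections are given as label -> (bounding box, confidence score) dicts produced by some detector wrapper around Detic, and the 0.3 probability-drop threshold and the axis-aligned overlap test are illustrative values, not the tuned ones.

def box_intersects(a, b):
    # True if two (x0, y0, x1, y1) boxes overlap.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def probability_drop_artifacts(pre_dets, post_dets, mask_box, drop_threshold=0.3):
    # Method 1 (sketched): flag objects near the in-painting mask whose
    # detection confidence drops sharply after the edit.
    # pre_dets / post_dets map object label -> (box, score).
    artifacts = []
    for label, (box, pre_score) in pre_dets.items():
        if not box_intersects(box, mask_box):
            continue  # only objects touching the edited region matter
        post_score = post_dets.get(label, (box, 0.0))[1]
        if pre_score - post_score > drop_threshold:
            artifacts.append(label)
    return artifacts

def disappearance_artifacts(pre_dets, post_dets, mask_box, edited_object_box):
    # Method 2 (sketched): flag objects that touched the mask, vanished after
    # the edit, and do not overlap the edited object's own bounding box.
    return [
        label
        for label, (box, _) in pre_dets.items()
        if box_intersects(box, mask_box)
        and label not in post_dets
        and not box_intersects(box, edited_object_box)
    ]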

BibTeX

@misc{yosef2025editinspectorbenchmarkevaluationtextguided,
      title={EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits},
      author={Ron Yosef and Moran Yanuka and Yonatan Bitton and Dani Lischinski},
      year={2025},
      eprint={2506.09988},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.09988},
}