This is the official repository of the paper "Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach". The paper has been accepted to IEEE ISMAR 2025 (Best Paper Award, Top 1%) and has also been accepted for publication in a special issue of IEEE TVCG. It introduces AR-VIM, a dataset of 452 raw-AR video pairs, all collected in real-world settings.
The dataset is now available on Hugging Face Datasets.
See Download Instructions for detailed usage and access options.
The dataset can also be accessed through Google Drive.
The paper can be found here: pdf/ArXiv.
An example usage tutorial Jupyter notebook for the dataset can be found here.
- Dataset Description
- Data Collection Pipelines
- Dataset Structure
- User-Based Data Validation
- Results
- Use AR-VIM For Other Purposes
This dataset contains paired Raw and Augmented videos collected to evaluate visual information manipulation (VIM) in augmented reality (AR). Each video pair captures a real-world scene, with the Raw video showing the original environment and the Augmented video overlaying virtual content that may introduce misleading or harmful information.
The dataset is structured around two key taxonomies:
- Attack Format: what aspect of visual information is altered
- Attack Purpose: the intention or effect of the manipulation
- Character Manipulation: Virtual content changes individual characters in existing textual elements.
- Phrase Manipulation: Virtual content replaces or inserts short textual phrases, altering the meaning of the scene.
- Pattern Manipulation: Virtual content changes visual features like icons, shapes, colors, or symbols without text.
- Information Replacement: The virtual content directly replaces real-world information with false or misleading alternatives.
- Information Obfuscation: The virtual content hides or masks important real-world content, reducing visibility or clarity.
- Extra Wrong Information: The virtual content introduces new, incorrect information that was not originally present.
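For convenience, the taxonomy above can be written down as a small set of Python constants. The snippet below is only an illustrative sketch (the enum names are ours, not part of the released code); the seven format/purpose combinations match the category folders listed in the Dataset Structure section.

from enum import Enum

class AttackFormat(Enum):
    CHARACTER = "Character Manipulation"
    PHRASE = "Phrase Manipulation"
    PATTERN = "Pattern Manipulation"

class AttackPurpose(Enum):
    REPLACEMENT = "Information Replacement"
    OBFUSCATION = "Information Obfuscation"
    EXTRA_WRONG_INFO = "Extra Wrong Information"

# The seven format/purpose combinations present in AR-VIM,
# matching the dataset's category folder names.
ARVIM_CATEGORIES = [
    (AttackFormat.CHARACTER, AttackPurpose.REPLACEMENT),
    (AttackFormat.PATTERN, AttackPurpose.EXTRA_WRONG_INFO),
    (AttackFormat.PATTERN, AttackPurpose.OBFUSCATION),
    (AttackFormat.PATTERN, AttackPurpose.REPLACEMENT),
    (AttackFormat.PHRASE, AttackPurpose.EXTRA_WRONG_INFO),
    (AttackFormat.PHRASE, AttackPurpose.OBFUSCATION),
    (AttackFormat.PHRASE, AttackPurpose.REPLACEMENT),
]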
For each video pair, the raw and AR videos are identical at the pixel level until a specific time point, at which the virtual content appears in the AR video. This virtual content may or may not result in a VIM attack.
Left: Raw video without AR content. Right: AR video with virtual content (the U-turn mark), which may mislead users into believing that this intersection only allows U-turns, resulting in a VIM attack.
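Because each pair is pixel-identical until the virtual content appears, the onset of the augmentation can be located with a simple per-frame difference. The sketch below is a minimal illustration using OpenCV and NumPy (neither is required by this repository), and the threshold value is an arbitrary assumption you may need to tune.

import cv2
import numpy as np

def find_augmentation_onset(raw_path, aug_path, threshold=5.0):
    """Return the index of the first frame where the augmented video
    visibly differs from the raw video, or None if no difference is found."""
    raw = cv2.VideoCapture(raw_path)
    aug = cv2.VideoCapture(aug_path)
    idx = 0
    onset = None
    while True:
        ok_r, frame_r = raw.read()
        ok_a, frame_a = aug.read()
        if not (ok_r and ok_a):
            break
        # Mean absolute pixel difference between corresponding frames.
        if np.mean(cv2.absdiff(frame_r, frame_a)) > threshold:
            onset = idx
            break
        idx += 1
    raw.release()
    aug.release()
    return onset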
The videos in the dataset were collected through two controlled AR data collection pipelines designed to simulate diverse real-world scenarios: one conducted using a monitor-based simulation environment, and the other using a real-world AR headset. Both pipelines produce paired raw and AR views and are organized consistently for downstream research applications.
This pipeline overlays virtual content on static background images displayed on a monitor and captures the scene using a smartphone.
- Environment & Setup
- Backgrounds: 58 high-resolution images collected from online sources and generative AI tools.
- Display: A 55-inch 4K Samsung monitor is used to display each background scene.
- AR App Implementation
- Platform: Unity 2022.3.28f1 with ARCore image tracking.
- Mechanism: Each background image acts as a reference marker. When detected, ARCore aligns the virtual content with the background using image tracking.
- Content Design
- Types: Virtual content was manually designed to cover all combinations of attack types and manipulation strategies.
- Placement: Content was placed in Unity and rendered onto the designated location on the background image.
- Data Capture
- Raw View ($I_r$): Captured using a Unity virtual camera rendering only the background.
- Augmented View ($I_a$): Captured using a Unity virtual camera rendering both the background and the virtual content.
- Device: Samsung Galaxy S25 smartphone.
- Output: Video pairs.
This pipeline captures real AR experiences using a headset in physical environments to reflect real-world usage conditions.
- Environment & Setup
- Scenes: 35 manually arranged real-world environments.
- Device: Meta Quest 3 was used for AR deployment and video capture.
- AR App Implementation
- Platform: Unity 2022.3.61f1 with support for main camera access.
- Mechanism: The data collector interactively grabs and places virtual content using the Quest controller.
- Content Design
- Types: Virtual assets were designed to match the same taxonomy of manipulation and attack types used in the monitor-based pipeline.
- Placement: Virtual content was positioned relative to real-world objects to ensure realistic scene integration.
- Data Capture
- Raw View ($I_r$): Captured using the Quest's main RGB camera.
- Augmented View ($I_a$): Generated by overlaying Unity-rendered virtual content onto the raw camera view.
- Device: Meta Quest 3.
- Output: Video pairs.
- Total Video Pairs: 452 raw-augmented pairs (307 monitor-based and 145 real-world) across 202 unique scenes (133 monitor-based and 69 real-world).
- Labeling: Each pair is annotated as either A (Attacked) or N (Non-attack); the label is encoded in the video file names.
- Format: .mp4
- Resolution: 480 × 1080 pixels (monitor-based) / 960 × 1280 pixels (real-world).
- Frame Rate: 15 FPS
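These properties can be sanity-checked on a local copy with OpenCV (not a dependency of this repository); the path below is only a placeholder and should be replaced with a file from your download.

import cv2

# Replace with any video from your local copy of the dataset.
path = "AR-VIM/Monitor-Based Data/Pattern Manipulation + Extra Wrong Information/Raw_Recordings_A_001.mp4"

cap = cv2.VideoCapture(path)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()

print(f"{width} x {height} @ {fps:.1f} FPS, {frame_count} frames")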
The AR-VIM dataset consists of 452 video pairs, each containing a raw video and its corresponding augmented version. These pairs span 202 unique scenes of AR experiences. Specifically, 307 video pairs covering 133 scenes were collected with the monitor-based pipeline, while 145 video pairs covering 69 scenes were collected using the real-world pipeline. The detailed data distribution is shown in the table below.
The dataset can be accessed through this link.
The dataset is organized under a root directory named AR-VIM/, which contains two main subdirectories:
- Monitor-Based Data/: Contains data collected using a screen-based simulation pipeline.
- Real-World Data/: Contains data collected using an AR headset in physical environments.
Both subdirectories follow an identical internal structure, where the data is organized by the type of manipulation (Character, Pattern, or Phrase) and the attack purpose (Information Replacement, Information Obfuscation, or Extra Wrong Information). The structure is as follows:
ARVIM/
├── Monitor-Based Data/
│ ├── Character Manipulation + Information Replacement/
│ ├── Pattern Manipulation + Extra Wrong Information/
│ ├── Pattern Manipulation + Information Obfuscation/
│ ├── Pattern Manipulation + Information Replacement/
│ ├── Phrase Manipulation + Extra Wrong Information/
│ ├── Phrase Manipulation + Information Obfuscation/
│ └── Phrase Manipulation + Information Replacement/
└── Real-World Data/
├── Character Manipulation + Information Replacement/
├── Pattern Manipulation + Extra Wrong Information/
├── Pattern Manipulation + Information Obfuscation/
├── Pattern Manipulation + Information Replacement/
├── Phrase Manipulation + Extra Wrong Information/
├── Phrase Manipulation + Information Obfuscation/
└── Phrase Manipulation + Information Replacement/
Inside each folder, video files follow the naming convention:
{VideoType}_Recordings_{AttackLabel}_{XXX}.mp4
where:
- {VideoType}: Either Raw or Augmented
- {AttackLabel}: A for attack, N for non-attack
- {XXX}: A 3-digit index, starting from 001
For example:
Pattern Manipulation + Extra Wrong Information/
├── Augmented_Recordings_A_001.mp4
├── Augmented_Recordings_A_002.mp4
├── ...
├── Augmented_Recordings_N_001.mp4
├── ...
├── Raw_Recordings_A_001.mp4
├── Raw_Recordings_A_002.mp4
├── ...
├── Raw_Recordings_N_001.mp4
└── ...
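Given this layout and naming convention, raw/augmented pairs can be enumerated with standard-library Python alone. The sketch below is illustrative and independent of the provided dataset.py loader; adjust the root path to match your local folder name.

import re
from pathlib import Path

NAME_PATTERN = re.compile(r"(Raw|Augmented)_Recordings_([AN])_(\d{3})\.mp4")

def iter_pairs(root):
    """Yield (raw_path, augmented_path, category, label, index) tuples."""
    for aug in Path(root).rglob("Augmented_Recordings_*.mp4"):
        match = NAME_PATTERN.fullmatch(aug.name)
        if match is None:
            continue
        _, label, index = match.groups()
        raw = aug.with_name(f"Raw_Recordings_{label}_{index}.mp4")
        if raw.exists():
            yield raw, aug, aug.parent.name, label, index

for raw, aug, category, label, index in iter_pairs("AR-VIM"):
    print(category, label, index, raw.name, aug.name)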
[Important Update]
The dataset is also available on Hugging Face Datasets for easier access and reproducibility:
https://huggingface.co/datasets/HarbingerKX/AR-VIM
You can directly download the entire dataset (including all subfolders) using the official Hugging Face Hub API:
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="HarbingerKX/AR-VIM",
repo_type="dataset",
local_dir="./AR-VIM",
local_dir_use_symlinks=False, # copy files instead of symlinks
resume_download=True # support resume
)

Install the Hugging Face CLI first:
pip install huggingface_hub

Then download the dataset with:
huggingface-cli download HarbingerKX/AR-VIM --repo-type dataset --local-dir ./AR-VIM --resume-download

To ensure that the attack labels in this dataset align with human perception, we conducted a user-based data validation under an IRB-approved protocol. Participants were asked to evaluate whether the augmented videos introduced misleading or harmful visual content compared to the corresponding raw videos. They were asked to rate, on a scale from 1 (strongly disagree) to 5 (strongly agree), how much they agreed that the AR video contained an attack.
The overall agreement level is 4.53, indicating that the dataset's labels generally align with human perception.
User agreement with attack labels in the AR-VIM dataset. (a): The overall distribution of Likert-scale responses. (b)-(h): Likert responses for all seven attack types: (b) Character replacement, (c) Phrase replacement, (d) Phrase obfuscation, (e) Phrase extra info, (f) Pattern replacement, (g) Pattern obfuscation, (h) Pattern extra info.

We tested our proposed system, VIM-Sense, on AR-VIM. VIM-Sense is a system designed to detect the visual information manipulation attacks defined in AR-VIM. The results are provided below. Details of VIM-Sense and all baselines can be found in the paper.
AR-VIM was originally created for the visual information manipulation detection task. However, it can also be used for other tasks, such as attack classification, scene understanding, and visual question answering.
We provide a standardized JSON metadata file and a dataloader for AR-VIM. To use them, place these files in the ARVIM folder, so that it looks like this:
ARVIM/
├── Monitor-Based Data/
├── Real-World Data/
├── metadata.json
└── dataset.py
Then you can load the videos along with their IDs and other information, such as attack format and attack purpose. A usage example is shown below:
from ARVIM.dataset import ARVIMDataset
dataset = ARVIMDataset(root="ARVIM")
print(len(dataset))
print(dataset[0])

The study was approved by the Duke University Campus Institutional Review Board (protocol number: 2020-0292).
If you use the AR-VIM dataset in academic work, please cite:
@misc{xiu2025detectingvisualinformationmanipulation,
title={Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach},
journal={IEEE Transactions on Visualization and Computer Graphics},
author={Yanming Xiu and Maria Gorlatova},
year={2025},
}
The authors of this repository are Yanming Xiu and Maria Gorlatova. Contact information of the authors:
- Yanming Xiu (yanming.xiu AT duke.edu)
- Maria Gorlatova (maria.gorlatova AT duke.edu)
Please contact Yanming if you have any questions when using the AR-VIM dataset.
We thank the participants of our user-based label validation for their invaluable effort and assistance in this research. This work was supported in part by NSF grants CSR-2312760, CNS-2112562, and IIS-2231975, NSF CAREER Award IIS-2046072, NSF NAIAD Award 2332744, a Cisco Research Award, a Meta Research Award, Defense Advanced Research Projects Agency Young Faculty Award HR0011-24-1-0001, and the Army Research Laboratory under Cooperative Agreement Number W911NF-23-2-0224.





