This is the official repository for the following paper:
Mizuho Aoki*, Kohei Honda, Yasuhiro Yoshimura, Takeshi Ishita, Ryo Yonetani, "VLG-Loc: Vision-Language Global Localization from Labeled Footprint Maps", arXiv, 2025. paper | project page | dataset | video
Vision-Language Global Localization (VLG-Loc) is a global localization method that uses camera images and a human-readable labeled footprint map containing only names and areas of distinctive visual landmarks.

This setup provides a GPU-enabled Ubuntu 22.04 environment using Docker.
Prerequisites

- git
  - For Ubuntu users:

        sudo apt install git

- git-lfs
  - For Ubuntu users:

        sudo apt install git-lfs
        git lfs install

- docker
  - For Ubuntu users:

        curl -fsSL https://get.docker.com -o get-docker.sh
        sudo sh get-docker.sh
        sudo groupadd docker
        sudo usermod -aG docker $USER
        reboot

- make
  - For Ubuntu users:

        sudo apt install make

- NVIDIA Container Toolkit
  - This is required to allow Docker containers to access the host's GPU.
- NVIDIA GPU & Driver
  - An NVIDIA GPU and a compatible driver for the base image (nvidia/cuda:12.4.1-devel-ubuntu22.04) are required.
- Azure OpenAI API KEY
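Once Docker and the NVIDIA Container Toolkit are installed, a quick optional sanity check (not part of the official setup) is to confirm that containers can see the GPU:

```bash
# Should print the host GPU table via nvidia-smi from inside a container.
# Uses the same CUDA image as this project; the first run downloads it.
docker run --rm --gpus all nvidia/cuda:12.4.1-devel-ubuntu22.04 nvidia-smi
```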
- Clone the repository.

      git clone git@github.com:CyberAgentAILab/VLG-Loc.git
- Download the dataset to the project root on your host machine. The `dataset` directory will be mounted as `~/dev_ws/dataset` in the container.

      cd VLG-Loc
      mkdir -p dataset
      cd dataset
      git clone https://huggingface.co/datasets/cyberagent/VLG-Loc-Dataset vlg_loc_dataset
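  If the clone succeeded, the environment folders used by the evaluation commands below (for example env_dg_da) should appear inside the new directory; a quick check from inside `dataset/`:

  ```bash
  ls vlg_loc_dataset   # expect environment folders such as env_dg_da
  ```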
- Build the docker container.

      cd VLG-Loc
      make setup_docker
- Get inside the docker container.

      cd VLG-Loc
      make launch_docker
- Set the VLM API key and endpoint as environment variables.

  Copy the example configuration file `.env.example` to create your own `.env` file:

      cd ~/dev_ws
      cp .env.example .env

  Open `.env` and replace `<vlm_api_key>` and `<vlm_api_endpoint>` with your actual credentials.
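  For illustration only, a filled-in `.env` might look like the sketch below; the variable names here are assumptions, so follow the ones actually defined in `.env.example`:

  ```bash
  # Illustrative sketch: the variable names are assumptions; use those in .env.example.
  VLM_API_KEY=<vlm_api_key>            # your Azure OpenAI API key
  VLM_API_ENDPOINT=<vlm_api_endpoint>  # your Azure OpenAI endpoint URL
  ```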
- Set up the workspace.

      cd ~/dev_ws
      source setup_workspace.sh
Use the following command to run an evaluation. Replace <EVAL_MODE>, <ENV_NAME>, and <CONFIG_PATH> with the appropriate values from the sections below.
    python3 scripts/run_eval.py \
        --dataset-root dataset \
        --dataset_name vlg_loc_dataset/<ENV_NAME> \
        --mode <EVAL_MODE> \
        --config_filename=<CONFIG_PATH> \
        --overwrite \
        --create_video

Descriptions of Arguments:
- `--dataset-root`: Path to the root directory of the dataset.
- `--dataset_name`: Name of the dataset.
- `--mode`: Evaluation mode.
- `--config_filename`: Path to the configuration file of the localizer.
- `--overwrite`: If specified, existing results will be overwritten.
- `--create_video`: If specified, a video summarizing the evaluation will be created. Be aware that this takes more time and requires more disk space.
Evaluation Modes (<EVAL_MODE>):
- `eval_scan_localizer`: Evaluate the scan localizer.
- `eval_vision_localizer`: Evaluate the vision localizer.
- `eval_vision_and_scan_localizer`: Evaluate multimodal localization using both vision and scan data.
- `clean_all_logs`: Clean evaluation outputs.
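For example, assuming `clean_all_logs` takes the same arguments as the evaluation modes, clearing the outputs of a previous DG/DA run might look like the following sketch (adjust the arguments to your setup):

```bash
# Hypothetical cleanup run; mirrors the evaluation command with a different --mode.
python3 scripts/run_eval.py \
    --dataset-root dataset \
    --dataset_name vlg_loc_dataset/env_dg_da \
    --mode clean_all_logs \
    --config_filename=env_dg_da/loc_eval_env_dg_da.yaml
```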
Environments (<ENV_NAME> and <CONFIG_PATH>):
Select the <ENV_NAME> and <CONFIG_PATH> for your desired environment from the table below.
| Environment | <ENV_NAME> | <CONFIG_PATH> |
|---|---|---|
| UG/UA (Uniform Geometry, Uniform Appearance) | env_ug_ua | env_ug_ua/loc_eval_env_ug_ua.yaml |
| UG/DA (Uniform Geometry, Diverse Appearance) | env_ug_da | env_ug_da/loc_eval_env_ug_da.yaml |
| DG/UA (Diverse Geometry, Uniform Appearance) | env_dg_ua | env_dg_ua/loc_eval_env_dg_ua.yaml |
| DG/DA (Diverse Geometry, Diverse Appearance) | env_dg_da | env_dg_da/loc_eval_env_dg_da.yaml |
| Retail Store (Real) | env_retail_store_real | retail_store_real/loc_eval_env_retail_store_real.yaml |
| Retail Store (Sim) | env_retail_store_sim | env_retail_store_sim/loc_eval_env_retail_store_sim.yaml |
Note
The included configuration file uses gpt-4.1 by default.
To change the model, specify `[vlm_config][model_name]` in the configuration file.
Example Command:
To run the vision localizer evaluation in the DG/DA environment, use the following command:
    python3 scripts/run_eval.py --dataset-root dataset --dataset_name vlg_loc_dataset/env_dg_da --mode eval_vision_localizer --config_filename=env_dg_da/loc_eval_env_dg_da.yaml --overwrite

Note
While we strive for reproducible outputs by using seed and temperature parameters, please be aware that minor variations in LLM outputs can lead to slight differences in localization results.
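If you want to run the same evaluation across several environments in one go, a plain shell loop works, since the simulated environments share the <ENV_NAME>/loc_eval_<ENV_NAME>.yaml config pattern (a convenience sketch, not a script shipped with this repository; env_retail_store_real is excluded because its config lives under a different path):

```bash
# Sweep the vision localizer over the simulated environments.
for ENV in env_ug_ua env_ug_da env_dg_ua env_dg_da env_retail_store_sim; do
  python3 scripts/run_eval.py \
      --dataset-root dataset \
      --dataset_name vlg_loc_dataset/${ENV} \
      --mode eval_vision_localizer \
      --config_filename=${ENV}/loc_eval_${ENV}.yaml \
      --overwrite
done
```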
Visualize the Results
After running the evaluation, you can visualize the results using the web visualizer. Use the following command to start the visualizer:
    python3 scripts/web_visualizer.py --target_dir <PATH_TO_DATASET_DIR>

Replace <PATH_TO_DATASET_DIR> with the path to the dataset directory (e.g., dataset/vlg_loc_dataset/env_dg_da).
Then, open a web browser and navigate to http://127.0.0.1:8080 to view the visualizer.
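For example, to inspect the DG/DA results:

```bash
python3 scripts/web_visualizer.py --target_dir dataset/vlg_loc_dataset/env_dg_da
```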
Dataset Generation in the Simulation Environments
You can generate datasets by running simulations in the provided Gazebo Classic environments. This requires running commands in separate terminals inside the Docker container.
- Build the workspace.

      cd ~/dev_ws
      make build
- Terminal 1: Launch Gazebo World

  In your first Docker terminal, launch the desired simulation environment.

      cd ~/dev_ws
      source ~/.bashrc
      ros2 launch mobile_robot_ros2 vmegarover_world.launch.py world_fname:=<ENV_NAME>

  - Note: Replace <ENV_NAME> with one of the environment names listed in the table above, except for env_retail_store_real (see the example below).
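  For example, to launch the DG/DA world:

  ```bash
  cd ~/dev_ws
  source ~/.bashrc
  ros2 launch mobile_robot_ros2 vmegarover_world.launch.py world_fname:=env_dg_da
  ```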
- Terminal 2: Launch Manual Controller

  In a separate terminal inside the Docker container, run the joypad controller to operate the robot. For more information, please refer to the joy_controller documentation.

      cd ~/dev_ws
      source ~/.bashrc
      ros2 launch joy_controller joy_controller_launch.py

  Alternatively, you can operate the robot with `teleop_twist_keyboard` by running the following command:

      ros2 run teleop_twist_keyboard teleop_twist_keyboard
- Terminal 3: Launch Dataset Maker

  In another terminal inside the Docker container, run the dataset maker to record data while operating the robot.

  The screenshot below shows RViz (left) and the Gazebo Simulator (right) while data is being recorded.
  In the RViz visualization:
  - The red point cloud shows the current, live sensor scan.
  - The blue point cloud and the three images represent the most recently saved dataset.
  - The blue arrows indicate the sequence of ground truth positions for that saved dataset.

  After you finish recording, you can stop the process by pressing CTRL + C in the terminal.

      cd ~/dev_ws
      source ~/.bashrc
      export HYDRA_CONFIG_PATH=<ENV_NAME>/loc_eval_<ENV_NAME>
      ros2 launch launch/make_dataset.launch.py map_file_path:=<ENV_NAME>/occupancy_grid_map/<ENV_NAME>.yaml
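  For example, with <ENV_NAME> set to env_dg_da:

  ```bash
  cd ~/dev_ws
  source ~/.bashrc
  export HYDRA_CONFIG_PATH=env_dg_da/loc_eval_env_dg_da
  ros2 launch launch/make_dataset.launch.py map_file_path:=env_dg_da/occupancy_grid_map/env_dg_da.yaml
  ```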
- Visualize Recorded Data

  After the dataset is recorded, the data will be saved in the `results` directory. You can visualize the recorded data using the web visualizer.

      cd ~/dev_ws
      python3 scripts/web_visualizer.py --target_dir results
This project is licensed under the Apache-2.0 License. Please note that certain components may be distributed under different licenses; refer to the corresponding directories for detailed information.
This project makes use of several open-source libraries. We would like to express our gratitude to the developers and contributors of these projects.
-
  - License: Apache-2.0
  - Included in: `src/joy_controller`
-
  - License: LGPL-3.0
  - Included in: `src/emcl2_ros2`
- AWS RoboMaker World Assets
  - Assets
  - License: MIT-0
  - Included in: `src/mobile_robot_ros2/models`, `src/mobile_robot_ros2/photos`
-
  - License: Apache-2.0
  - Included in: `lib/vlg_loc/recognition/azure.py` - The code used to call the GPT model was adapted from this repository.
-
  - License: Apache-2.0
  - While this repository does not include the tool's source code, the occupancy grid maps used for the simulation worlds were generated with this tool.