
VLG-Loc

This is the official repository for the following paper:

Mizuho Aoki*, Kohei Honda, Yasuhiro Yoshimura, Takeshi Ishita, Ryo Yonetani, "VLG-Loc: Vision-Language Global Localization from Labeled Footprint Maps", arXiv, 2025. paper | project page | dataset | video

Vision-Language Global Localization (VLG-Loc) is a global localization method that uses camera images and a human-readable labeled footprint map containing only the names and areas of distinctive visual landmarks.

[Figure: VLG-Loc architecture]

Setup

This setup provides a GPU-enabled Ubuntu 22.04 environment using Docker.

  • Prerequisites

    • git
      • For Ubuntu users:
        sudo apt install git
    • git-lfs
      • For Ubuntu users:
        sudo apt install git-lfs
        git lfs install
    • docker
      • For Ubuntu users:
        curl -fsSL https://get.docker.com -o get-docker.sh
        sudo sh get-docker.sh
        sudo groupadd docker
        sudo usermod -aG docker $USER
        reboot
    • make
      • For Ubuntu users:
        sudo apt install make
    • NVIDIA Container Toolkit
      • This is required to allow Docker containers to access the host's GPU.
    • NVIDIA GPU & Driver
      • An NVIDIA GPU and a compatible driver for the base image (nvidia/cuda:12.4.1-devel-ubuntu22.04) are required.
    • Azure OpenAI API key and endpoint
  • Clone the repository

    git clone git@github.com:CyberAgentAILab/VLG-Loc.git
  • Download the dataset to the project root on your host machine. The dataset directory will be mounted as ~/dev_ws/dataset in the container.

    cd VLG-Loc
    mkdir -p dataset
    cd dataset
    git clone https://huggingface.co/datasets/cyberagent/VLG-Loc-Dataset vlg_loc_dataset
    
  • Build the docker container.

    cd VLG-Loc
    make setup_docker
  • Get inside the docker container.

    cd VLG-Loc
    make launch_docker
  • Set the VLM API key and endpoint as environment variables.
    Copy the example configuration file .env.example to create your own .env file:

    cd ~/dev_ws
    cp .env.example .env

    Open .env and replace <vlm_api_key> and <vlm_api_endpoint> with your actual credentials (a sketch of the finished file is shown after this list).

  • Setup the workspace.

    cd ~/dev_ws
    source setup_workspace.sh
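
As referenced in the environment-variable step above, a completed .env simply holds the API key and endpoint values. A minimal sketch is shown below; the variable names here are only illustrative, and the exact keys are defined in .env.example:

    # Hypothetical variable names -- copy the exact keys from .env.example
    VLM_API_KEY=<your Azure OpenAI API key>
    VLM_API_ENDPOINT=<your Azure OpenAI endpoint URL>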

Evaluation

Use the following command to run an evaluation. Replace <EVAL_MODE>, <ENV_NAME>, and <CONFIG_PATH> with the appropriate values from the sections below.

python3 scripts/run_eval.py \
    --dataset-root dataset \
    --dataset_name vlg_loc_dataset/<ENV_NAME> \
    --mode <EVAL_MODE> \
    --config_filename=<CONFIG_PATH> \
    --overwrite \
    --create_video

Descriptions of Arguments:

  • --dataset-root: Path to the root directory of the dataset.
  • --dataset_name: Name of the dataset.
  • --mode: Evaluation mode.
  • --config_filename: Path to the configuration file of the localizer.
  • --overwrite: If specified, existing results will be overwritten.
  • --create_video: If specified, a video summarizing the evaluation will be created. Be aware that this takes longer and requires more disk space.

Evaluation Modes (<EVAL_MODE>):

  • eval_scan_localizer: Evaluate the scan localizer.
  • eval_vision_localizer: Evaluate the vision localizer.
  • eval_vision_and_scan_localizer: Evaluate multimodal localization using both vision and scan data.
  • clean_all_logs: Clean evaluation outputs.

Environments (<ENV_NAME> and <CONFIG_PATH>):

Select the <ENV_NAME> and <CONFIG_PATH> for your desired environment from the table below.

Environment <ENV_NAME> <CONFIG_PATH>
UG/UA (Uniform Geometry, Uniform Appearance) env_ug_ua env_ug_ua/loc_eval_env_ug_ua.yaml
UG/DA (Uniform Geometry, Diverse Appearance) env_ug_da env_ug_da/loc_eval_env_ug_da.yaml
DG/UA (Diverse Geometry, Uniform Appearance) env_dg_ua env_dg_ua/loc_eval_env_dg_ua.yaml
DG/DA (Diverse Geometry, Diverse Appearance) env_dg_da env_dg_da/loc_eval_env_dg_da.yaml
Retail Store (Real) env_retail_store_real retail_store_real/loc_eval_env_retail_store_real.yaml
Retail Store (Sim) env_retail_store_sim env_retail_store_sim/loc_eval_env_retail_store_sim.yaml

Note

The included configuration files use gpt-4.1 by default.
To use a different model, set [vlm_config][model_name] in the configuration file.
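
For reference, the relevant part of the YAML might look like the sketch below; the nesting follows the [vlm_config][model_name] key mentioned above, but check the provided configuration files for the exact layout:

    vlm_config:
      model_name: gpt-4.1  # replace with the model/deployment you want to use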

Example Command:

To run the vision localizer evaluation in the DG/DA environment, use the following command:

python3 scripts/run_eval.py --dataset-root dataset --dataset_name vlg_loc_dataset/env_dg_da --mode eval_vision_localizer --config_filename=env_dg_da/loc_eval_env_dg_da.yaml --overwrite

Note

While we strive for reproducible outputs by fixing the seed and temperature parameters, please be aware that minor variations in LLM outputs can lead to slight differences in localization results.

Visualize the Results

After running the evaluation, you can visualize the results using the web visualizer. Use the following command to start the visualizer:

python3 scripts/web_visualizer.py --target_dir <PATH_TO_DATASET_DIR>

Replace <PATH_TO_DATASET_DIR> with the path to the dataset directory (e.g., dataset/vlg_loc_dataset/env_dg_da).
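
For example, to browse the DG/DA results:

    python3 scripts/web_visualizer.py --target_dir dataset/vlg_loc_dataset/env_dg_da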

Then, open a web browser and navigate to http://127.0.0.1:8080 to view the visualizer.

Dataset Generation in the Simulation Environments

You can generate datasets by running simulations in the provided Gazebo Classic environments. This requires running commands in separate terminals inside the Docker container.

  1. Build the workspace.

    cd ~/dev_ws
    make build
  2. Terminal 1: Launch Gazebo World

    In your first Docker terminal, launch the desired simulation environment.

    cd ~/dev_ws
    source ~/.bashrc
    ros2 launch mobile_robot_ros2 vmegarover_world.launch.py world_fname:=<ENV_NAME>
    • Note: Replace <ENV_NAME> with one of the environment names listed in the table above, except env_retail_store_real. A worked example with the placeholders filled in is given after these steps.
  3. Terminal 2: Launch Manual Controller

    In a separate terminal inside the Docker container, run the joypad controller to operate the robot. For more information, please refer to the joy_controller documentation.

    cd ~/dev_ws
    source ~/.bashrc
    ros2 launch joy_controller joy_controller_launch.py

    Alternatively, you can operate the robot with teleop_twist_keyboard by running the following command:

    ros2 run teleop_twist_keyboard teleop_twist_keyboard
  4. Terminal 3: Launch Dataset Maker

    In another terminal inside the Docker container, run the dataset maker to record data while operating the robot.
    The screenshot below shows RViz (left) and the Gazebo Simulator (right) while data is being recorded.
    In the RViz visualization:

    • The red point cloud shows the current, live sensor scan.
    • The blue point cloud and the three images represent the most recently saved dataset.
    • The blue arrows indicate the sequence of ground truth positions for that saved dataset.

    After you finish recording, you can stop the process by pressing CTRL + C in the terminal.

    cd ~/dev_ws
    source ~/.bashrc
    export HYDRA_CONFIG_PATH=<ENV_NAME>/loc_eval_<ENV_NAME>
    ros2 launch launch/make_dataset.launch.py map_file_path:=<ENV_NAME>/occupancy_grid_map/<ENV_NAME>.yaml
    [Screenshot: RViz (left) and the Gazebo simulator (right) during dataset recording]
  5. Visualize Recorded Data

    After the dataset is recorded, the data will be saved in the results directory. You can visualize the recorded data using the web visualizer.

    cd ~/dev_ws
    python3 scripts/web_visualizer.py --target_dir results
    [Screenshot: web visualizer displaying the recorded dataset]
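
As noted in step 2, here is how the placeholder commands look once <ENV_NAME> is filled in, using env_dg_da as an example (a sketch assembled by substituting values from the environment table into the commands above):

    # Terminal 1: launch the Gazebo world for the DG/DA environment
    cd ~/dev_ws
    source ~/.bashrc
    ros2 launch mobile_robot_ros2 vmegarover_world.launch.py world_fname:=env_dg_da

    # Terminal 3: record a dataset for the DG/DA environment
    cd ~/dev_ws
    source ~/.bashrc
    export HYDRA_CONFIG_PATH=env_dg_da/loc_eval_env_dg_da
    ros2 launch launch/make_dataset.launch.py map_file_path:=env_dg_da/occupancy_grid_map/env_dg_da.yaml

The Terminal 2 controller commands contain no placeholders and are unchanged.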

License and Acknowledgements

This project is licensed under the Apache-2.0 License. Please note that certain components may be distributed under different licenses; refer to the corresponding directories for detailed information.

This project makes use of several open-source libraries. We would like to express our gratitude to the developers and contributors of these projects.
