Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu and Jinwei Chen

vivo Mobile Communication Co., Ltd

ECCV 2024 [arxiv]

Description

This is the official PyTorch implementation for Mono-ViFI, which is built on the codebase of BDEdepth. If you find our work useful in your research, please consider citing our paper:

@misc{liu2024,
      title={Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation}, 
      author={Jinfeng Liu and Lingtong Kong and Bo Li and Zerong Wang and Hong Gu and Jinwei Chen},
      year={2024},
      eprint={2407.14126},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.14126}, 
}

Setup

Install the dependencies with:

pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

pip install -r requirements.txt
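
As an optional sanity check, you can confirm that the CUDA build of PyTorch is picked up before proceeding:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"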

Preparing datasets

KITTI

For the KITTI dataset, you can prepare it as done in Monodepth2. Note that we train directly on the raw png images and do not convert them to jpgs. You also need to generate the ground-truth depth maps before training, since the code evaluates after each epoch. For the raw KITTI ground truth (eigen eval split), run the following command. This will generate a gt_depths.npz file in the folder splits/kitti/eigen/.

python export_gt_depth.py --data_path /home/datasets/kitti_raw_data --split eigen

For the improved KITTI ground truth (eigen_benchmark eval split), please download it directly from this link, and then move the downloaded file (gt_depths.npz) to the folder splits/kitti/eigen_benchmark/.
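
As an optional check that the ground-truth files are in place, you can load them with NumPy. The snippet below is only a sketch and assumes the Monodepth2-style format where the depth maps are stored under the data key:

# Optional sanity check (assumes a Monodepth2-style gt_depths.npz with a "data" key)
import numpy as np

gt = np.load("splits/kitti/eigen/gt_depths.npz", allow_pickle=True)["data"]
print(f"{len(gt)} ground-truth depth maps, first map shape: {gt[0].shape}")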

Make3D

For the Make3D dataset, you can download it from here.

Cityscapes

For the Cityscapes dataset, we follow the instructions in ManyDepth. First, download leftImg8bit_sequence_trainvaltest.zip and camera_trainvaltest.zip from its website and unzip them into a folder /path/to/cityscapes/. Then preprocess the Cityscapes dataset using the following command:

python prepare_cityscapes.py \
--img_height 512 \
--img_width 1024 \
--dataset_dir /path/to/cityscapes \
--dump_root /path/to/cityscapes_preprocessed \
--seq_length 3 \
--num_threads 8

Remember to modify --dataset_dir and --dump_root to your own paths. The ground-truth depth files are provided by ManyDepth at this link; they were converted from pixel disparities using the intrinsics and the known baseline. Download them and unzip into splits/cityscapes/.
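
If you want to double-check the preprocessing step, a small script like the sketch below can count the dumped frames; it only assumes that the preprocessed images end up as image files somewhere under --dump_root:

# Optional sketch: count preprocessed Cityscapes frames under --dump_root
from pathlib import Path

dump_root = Path("/path/to/cityscapes_preprocessed")
frames = sorted(dump_root.rglob("*.jpg")) + sorted(dump_root.rglob("*.png"))
print(f"Found {len(frames)} preprocessed frames under {dump_root}")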

VFI Pre-training

Download the following 6 VFI-related checkpoints from this link:

  • small IFRNet pretrained on Vimeo90K dataset : IFRNet_S_Vimeo90K.pth
  • large IFRNet pretrained on Vimeo90K dataset : IFRNet_L_Vimeo90K.pth
  • small IFRNet pretrained on KITTI dataset : IFRNet_S_KITTI.pth
  • large IFRNet pretrained on KITTI dataset : IFRNet_L_KITTI.pth
  • small IFRNet pretrained on Cityscapes dataset : IFRNet_S_CS.pth
  • large IFRNet pretrained on Cityscapes dataset : IFRNet_L_CS.pth

To save time, you can skip VFI pre-training and directly use our provided checkpoints: just create a folder Mono-ViFI/weights/ and move IFRNet_S_KITTI.pth, IFRNet_L_KITTI.pth, IFRNet_S_CS.pth and IFRNet_L_CS.pth into it.
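
As an optional check, the sketch below verifies that the expected VFI checkpoints are present in Mono-ViFI/weights/ and load as ordinary PyTorch checkpoint files:

# Optional sketch: verify the provided VFI checkpoints exist and load
import os
import torch

for name in ["IFRNet_S_KITTI.pth", "IFRNet_L_KITTI.pth",
             "IFRNet_S_CS.pth", "IFRNet_L_CS.pth"]:
    path = os.path.join("weights", name)
    if not os.path.isfile(path):
        print(f"missing: {path}")
        continue
    ckpt = torch.load(path, map_location="cpu")
    print(f"{name}: loaded ({len(ckpt)} top-level entries)")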

If you want to train the VFI models yourself, move IFRNet_L_Vimeo90K.pth and IFRNet_S_Vimeo90K.pth to the folder Mono-ViFI/weights/; the Vimeo90K checkpoints serve as initialization for training on KITTI/Cityscapes. All VFI training configs are in the folder configs/vfi/. For example, the command for training the large IFRNet on KITTI is:

### Training large IFRNet on KITTI
# single-gpu
CUDA_VISIBLE_DEVICES=0 python train_vfi.py -c configs/vfi/IFRNet_L_KITTI.txt

# multi-gpu
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_vfi.py -c configs/vfi/IFRNet_L_KITTI.txt

Mono-ViFI Training

Before training, download the following 2 checkpoints from this link and move them to the folder Mono-ViFI/weights/:

  • HRNet18 backbone pretrained on ImageNet : HRNet_W18_C_cosinelr_cutmix_300epoch.pth.tar
  • LiteMono backbone pretrained on ImageNet : lite-mono-pretrain.pth

You can refer to the config files for the training settings, parameters and paths. All training configs are in the following folders:

  • ResNet18 backbone : configs/resnet18
  • LiteMono backbone : configs/litemono
  • D-HRNet backbone : configs/dhrnet

Remember to modify the related paths to your own. Taking ResNet18 as an example, the training commands are as follows.

Note: you can adjust batch_size in the config files according to your maximum GPU memory.

### Training with ResNet18 backbone (KITTI, 640x192)
# single-gpu
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_KITTI_MR.txt

# multi-gpu
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py -c configs/resnet18/ResNet18_KITTI_MR.txt


### Training with ResNet18 backbone (KITTI, 1024x320)
# For the 1024x320 resolution, we load the 640x192 model and train for 5 epochs with a learning rate of 1e-5.
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_KITTI_HR.txt


### Training with ResNet18 backbone (Cityscapes, 512x192)
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_CS.txt

Evaluation

Evaluate with single-frame model

### KITTI 640x192 model, ResNet18

CUDA_VISIBLE_DEVICES=0 python evaluate_depth.py \
--pretrained_path our_models/ResNet18_KITTI_MR.pth \
--backbone ResNet18 \
--batch_size 16 \
--width 640 \
--height 192 \
--kitti_path /data/juicefs_sharing_data/public_data/Datasets/KITTI/kitti_raw_data \
--make3d_path /data/juicefs_sharing_data/public_data/Datasets/make3d \
--cityscapes_path /data/juicefs_sharing_data/public_data/Datasets/cityscapes \
# --post_process

This script evaluates on KITTI (both raw and improved GT), Make3D and Cityscapes together. If you don't want to evaluate on some of these datasets, for example KITTI, simply do not specify the corresponding --kitti_path flag. It only evaluates on the datasets for which you have specified a path flag.

If you want to evaluate with post-processing, add the --post_process flag (disabled by default).
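
The flip-based post-processing popularized by Monodepth and used in Monodepth2-style codebases (which this repository builds on) averages the disparities of the image and of its horizontally flipped copy with an occlusion-aware blending mask. The snippet below is only a sketch of that general scheme, not necessarily the exact implementation used here:

# Sketch of Monodepth2-style flip post-processing (the implementation in this
# repo may differ in detail). l_disp: disparity of the original image,
# r_disp: disparity of the horizontally flipped image, flipped back again.
# Both have shape (B, H, W).
import numpy as np

def post_process_disparity(l_disp, r_disp):
    _, h, w = l_disp.shape
    mean_disp = 0.5 * (l_disp + r_disp)
    grid, _ = np.meshgrid(np.linspace(0, 1, w), np.linspace(0, 1, h))
    l_mask = (1.0 - np.clip(20 * (grid - 0.05), 0, 1))[None, ...]
    r_mask = l_mask[:, :, ::-1]
    # blend so that each prediction is trusted away from its own occlusion border
    return r_mask * l_disp + l_mask * r_disp + (1.0 - l_mask - r_mask) * mean_disp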

Evaluate with multi-frame model

### KITTI 640x192 model, ResNet18

CUDA_VISIBLE_DEVICES=0 python evaluate_depth_mf.py \
--pretrained_path our_models/ResNet18_KITTI_MR.pth \
--backbone ResNet18 \
--vfi_scale small \
--training_data kitti \
--batch_size 16 \
--width 640 \
--height 192 \
--kitti_path /data/juicefs_sharing_data/public_data/Datasets/KITTI/kitti_raw_data \
--cityscapes_path /data/juicefs_sharing_data/public_data/Datasets/cityscapes \

Prediction

Prediction for a single image (only single-frame model)

You can predict the disparity (inverse depth) for a single image with:

python test_simple.py --image_path folder/test_image.png --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --height 192 --width 640 --save_npy

The --image_path flag can also be a directory containing several images. In this case, the script predicts depth for all images in the directory (use --ext to specify png or jpg):

python test_simple.py --image_path folder --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --height 192 --width 640 --ext png --save_npy
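
If you pass --save_npy, you can load and visualize the raw prediction yourself. The sketch below assumes the script writes a *_disp.npy file next to the input image (check the script's output for the exact file name):

# Optional sketch: visualize a saved disparity map.
# The "_disp.npy" suffix is an assumption; check test_simple.py for the exact output path.
import numpy as np
import matplotlib.pyplot as plt

disp = np.load("folder/test_image_disp.npy").squeeze()   # (H, W) inverse depth
plt.imshow(disp, cmap="magma")
plt.axis("off")
plt.savefig("folder/test_image_disp_vis.png", bbox_inches="tight", pad_inches=0)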

Prediction for a video (both single- and multi-frame model)

python test_video.py --image_path folder --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --vfi_scale small --training_data kitti --height 192 --width 640 --ext png --save_npy

Here the --image_path flag should be a directory containing several video frames. Note that these frame files must be named in ascending numerical order, e.g. 0000.png for the first frame, 0001.png for the second, and so on. This command will also output a GIF file.
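
If your frames are not already numbered like this (for example, frames dumped by another tool), a small script such as the sketch below copies them into a correctly numbered folder; it assumes the existing file names already sort in temporal order:

# Optional sketch: copy frames into a folder named 0000.png, 0001.png, ...
# Assumes the existing file names sort in temporal order; paths are placeholders.
import shutil
from pathlib import Path

src = Path("raw_frames")        # hypothetical folder with the original frames
dst = Path("folder")            # folder passed to --image_path
dst.mkdir(exist_ok=True)
for i, frame in enumerate(sorted(src.glob("*.png"))):
    shutil.copy(frame, dst / f"{i:04d}.png")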

Mono-ViFI Weights

We provide the pretrained weights of our depth models at this link, including 9 checkpoints:

  • ResNet18 backbone trained on KITTI with 640x192 : ResNet18_KITTI_MR.pth
  • ResNet18 backbone trained on KITTI with 1024x320 : ResNet18_KITTI_HR.pth
  • ResNet18 backbone trained on Cityscapes with 512x192 : ResNet18_CS.pth
  • Lite-Mono backbone trained on KITTI with 640x192 : LiteMono_KITTI_MR.pth
  • Lite-Mono backbone trained on KITTI with 1024x320 : LiteMono_KITTI_HR.pth
  • Lite-Mono backbone trained on Cityscapes with 512x192 : LiteMono_CS.pth
  • D-HRNet backbone trained on KITTI with 640x192 : DHRNet_KITTI_MR.pth
  • D-HRNet backbone trained on KITTI with 1024x320 : DHRNet_KITTI_HR.pth
  • D-HRNet backbone trained on Cityscapes with 512x192 : DHRNet_CS.pth

Note that these are newly trained checkpoints, so their evaluation metrics differ slightly from those reported in the paper.

Related Projects
