3-line summary
Features learned via self-supervision are not 'ready for use': they must be adapted to a given downstream task by fine-tuning.
In NLP, prompting lets a model be employed for a new task without any fine-tuning (additional training).
→ Is it possible to build a single general model that can perform a wide range of tasks without any fine-tuning?
Visual prompt task → look at the two images on top, infer the rule, then apply that rule to the bottom-left image and inpaint the hole in the bottom-right
MAE (Masked Autoencoder), VQGAN (a transformer-based image generation model)
image inpainting model using visual prompting
input: image $(x)$ and binary mask $(m)$
output: new image with the masked regions filled $(y)$
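The visual prompt itself can be sketched as a 2×2 grid: the example pair on top, the query on the bottom-left, and a masked hole on the bottom-right that the inpainting model fills. Below is a minimal sketch in numpy of building the input pair $(x, m)$; `make_visual_prompt` is a hypothetical helper, not the paper's actual code.

```python
import numpy as np

def make_visual_prompt(example_in, example_out, query, h=224, w=224):
    """Hypothetical sketch: arrange three images into a 2x2 canvas x
       [example_in | example_out]
       [query      |   (hole)   ]
    and return a binary mask m that marks the bottom-right quadrant,
    the region the inpainting model must fill to produce y."""
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.float32)
    canvas[:h, :w] = example_in    # top-left: task input example
    canvas[:h, w:] = example_out   # top-right: task output example
    canvas[h:, :w] = query         # bottom-left: new query image
    mask = np.zeros((2 * h, 2 * w), dtype=np.uint8)
    mask[h:, w:] = 1               # bottom-right: masked hole
    return canvas, mask

# Usage: random stand-ins for real images
h = w = 224
imgs = [np.random.rand(h, w, 3).astype(np.float32) for _ in range(3)]
x, m = make_visual_prompt(*imgs, h=h, w=w)
```

An inpainting model would then take `(x, m)` and predict pixels for the masked quadrant; reading that quadrant back out gives the task result for the query.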