image

paper, demo

TL;DR

  • why I read this : I wasn’t planning to, but SAM is so powerful that it seems to be widely used to build VLM datasets.
  • task : promptable segmentation
  • problem : given a prompt (e.g., a single point), it is ambiguous which segment the user actually wants.
  • idea : encode the prompt with a simple prompt encoder and use it as a query in a MaskFormer-style decoder; train by backpropagating the loss only for the most confident (lowest-loss) mask.
  • input/output : image + prompt (points, box, mask, text) -> mask (no class label)
  • architecture : a variant of MaskFormer. A strong backbone (ViT-H) encodes the image; prompts pass through positional encodings or a text encoder and are added to the attention tokens. In the decoder, prompt tokens attend to image features (as in the original MaskFormer), and image features additionally attend to prompt tokens (newly added). Masks are generated internally by upsampling pixel features and dotting them with mask embeddings; the model also returns a predicted IoU score so that only confident masks are kept.
  • objective : focal loss + dice loss
  • baseline : RITM, an interactive segmentation model
  • data : SA-1B, the dataset proposed in this paper
  • evaluation : mIoU
  • result : nearly matches or beats RITM. Does not beat semantic-segmentation SOTA on existing benchmarks. Performance on text prompts is not very good.
  • contribution : semantic-segmentation benchmarks seem quite subjective, but the paper tackles this with a new dataset + model architecture, yielding a general-purpose segmentation model.
  • etc. : open question: how were the bbox / mask / text prompts trained, or were they not…
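The objective above (focal loss + dice loss) can be sketched as follows; `prob` is a per-pixel mask probability map and `target` the binary ground-truth mask. The function names and the 20:1 focal-to-dice weighting are my reading of the paper, not code from it:

```python
import numpy as np

def focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-6):
    # binary focal loss, averaged over pixels: down-weights easy examples
    p_t = np.where(target == 1, prob, 1 - prob)
    alpha_t = np.where(target == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps)))

def dice_loss(prob, target, eps=1e-6):
    # 1 - Dice coefficient between soft prediction and binary target
    inter = np.sum(prob * target)
    return float(1 - (2 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))

def seg_loss(prob, target, focal_weight=20.0):
    # SAM reportedly combines focal and dice losses in a 20:1 ratio
    return focal_weight * focal_loss(prob, target) + dice_loss(prob, target)
```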

Details

Preliminaries

image

Ambiguity in interactive segmentation

image
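A minimal sketch of the ambiguity-handling trick from the TL;DR: predict several candidate masks per prompt, backpropagate the loss only for the lowest-loss candidate during training, and pick the mask with the highest predicted IoU at inference. The helper names here are illustrative, not from the paper's code:

```python
import numpy as np

def min_loss_over_masks(mask_losses):
    """Given per-candidate losses for one prompt (e.g., 3 masks in SAM),
    return the index and value of the lowest-loss candidate; only that
    mask would receive gradient during training."""
    idx = int(np.argmin(mask_losses))
    return idx, float(np.asarray(mask_losses)[idx])

def pick_by_iou(masks, iou_scores):
    """At inference, return the candidate mask with the highest
    predicted IoU score."""
    return masks[int(np.argmax(iou_scores))]
```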

Model

image
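The decoder change described in the TL;DR, where the original prompt-to-image cross-attention is kept and an image-to-prompt direction is added, can be sketched with plain scaled dot-product attention. Dimensions and names are my own, and normalization/MLP sublayers are omitted for brevity:

```python
import numpy as np

def attention(q, k, v):
    # scaled dot-product attention: softmax(q k^T / sqrt(d)) v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def two_way_block(prompt_tok, img_tok):
    # prompt tokens query the image features (MaskFormer-style, original)
    prompt_tok = prompt_tok + attention(prompt_tok, img_tok, img_tok)
    # image features query the updated prompt tokens (the added direction)
    img_tok = img_tok + attention(img_tok, prompt_tok, prompt_tok)
    return prompt_tok, img_tok
```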

Result

image