TL;DR
- I read this because : I wasn’t planning to read it, but SAM is used so widely to build VLM datasets that it seemed worth understanding.
- task : promptable segmentation
- problem : Given a prompt (e.g. a single point), the target segment is ambiguous: several plausible masks could match what the user wants.
- idea : encode the prompt with a simple prompt encoder and use it as a query in a MaskFormer-style decoder; predict multiple masks and apply the loss only to the most confident (lowest-loss) one.
- input/output : image + prompt (points, box, mask, text) -> mask (no class label)
- architecture : A variant of MaskFormer. A strong backbone (ViT-H) plus a prompt encoder (positional encodings for points/boxes, a text encoder for text) whose outputs feed the decoder tokens. The decoder uses two-way attention: image -> prompt cross attention (as in the original) and prompt -> image cross attention (added). Masks are produced internally by upsampling the pixel features and taking their dot product with a mask embedding. The head is also changed to return an IoU score so only confident masks are kept.
- objective : focal loss + dice loss
- baseline : Interactive segmentation model called RITM
- data : SA-1B (proposed in this paper)
- evaluation : mIoU
- result : Nearly matches RITM, beating it on most but not all settings. Does not beat semantic segmentation SOTA on those benchmarks, and performance with text prompts is weak.
- contribution : Semantic segmentation benchmarks are subjective about which mask is “right”; the paper addresses this ambiguity with its dataset + model architecture, yielding a general promptable segmentation model.
- etc. : It is unclear how the box / mask / text prompts were trained, or whether some were not trained at all…
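The two-way decoder described in the architecture bullet can be sketched in a few lines. This is a minimal single-head numpy illustration of the idea (prompt tokens attend to image tokens and vice versa, then a mask token is dotted against per-pixel embeddings), not SAM's actual implementation: learned projections, multi-head attention, layer norms, MLPs, and the upsampling path are all omitted, and the function names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections:
    # each query row becomes a weighted mix of the key/value rows.
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def two_way_decoder_step(prompt_tokens, image_tokens):
    # prompt -> image cross attention: prompt tokens read the image
    prompt_tokens = prompt_tokens + cross_attention(prompt_tokens, image_tokens)
    # image -> prompt cross attention: image tokens read the prompts
    image_tokens = image_tokens + cross_attention(image_tokens, prompt_tokens)
    return prompt_tokens, image_tokens

def predict_mask(mask_token, image_tokens, h, w):
    # Mask logits = dot product of the mask-token embedding with each
    # (flattened) pixel embedding, reshaped back to a 2D grid.
    return (image_tokens @ mask_token).reshape(h, w)
```

In the real model this step is repeated for a couple of decoder layers and the pixel embeddings are upsampled before the dot product; the sketch only shows the token-mixing pattern.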
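The objective bullet (focal loss + dice loss) can be sketched as follows. This is a simplified numpy version for per-pixel binary mask logits; the 20:1 focal-to-dice weighting follows the paper, but the exact smoothing constants and reductions here are my assumptions.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    # Binary focal loss: down-weights easy pixels by (1 - p_t)^gamma.
    p = 1.0 / (1.0 + np.exp(-logits))
    pt = np.where(targets == 1, p, 1.0 - p)
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + 1e-8)))

def dice_loss(logits, targets, eps=1.0):
    # 1 - Dice coefficient between soft prediction and binary target.
    p = 1.0 / (1.0 + np.exp(-logits))
    inter = 2.0 * (p * targets).sum()
    return float(1.0 - (inter + eps) / (p.sum() + targets.sum() + eps))

def mask_loss(logits, targets, w_focal=20.0, w_dice=1.0):
    # SAM combines focal and dice loss in a 20:1 ratio.
    return w_focal * focal_loss(logits, targets) + w_dice * dice_loss(logits, targets)
```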
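For reference, the mIoU metric used in the evaluation bullet reduces to averaging intersection-over-union across prediction/ground-truth mask pairs. A minimal sketch for binary masks (the empty-union convention of 1.0 is my assumption):

```python
import numpy as np

def binary_iou(pred, gt):
    # Intersection over union of two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def miou(preds, gts):
    # Mean IoU over a set of (prediction, ground truth) mask pairs.
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))
```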