TL;DR
- I read this because : I wasn’t planning to read it, but SAM is used so widely to build VLM datasets that it seemed worth understanding.
- task : promptable segmentation
- problem : Given a prompt (e.g. a single point), the target segment is ambiguous: several plausible masks could match what the user wants.
- idea : encode the prompt with a simple prompt encoder and use it as a query in a MaskFormer-style decoder; predict multiple masks and apply the loss only to the most confident (lowest-loss) one.
- input/output : image + prompt (points, box, mask, text) -> mask (no class label)
- architecture : A variant of MaskFormer. A strong backbone (ViT-H) plus a prompt encoder (positional encodings for points/boxes, a text encoder for text) whose outputs feed the decoder tokens. The decoder uses two-way attention: image -> prompt cross attention (as in the original) and prompt -> image cross attention (added). Masks are produced internally by upsampling the pixel features and taking their dot product with a mask embedding. The head is also changed to return an IoU score so only confident masks are kept.
- objective : focal loss + dice loss
- baseline : Interactive segmentation model called RITM
- data : SA-1B (proposed in this paper)
- evaluation : mIoU
- result : Nearly matches RITM, beating it on most but not all settings. Does not beat semantic segmentation SOTA on those benchmarks, and performance with text prompts is weak.
- contribution : Semantic segmentation benchmarks are subjective about which mask is “right”; the paper addresses this ambiguity with its dataset + model architecture, yielding a general promptable segmentation model.
- etc. : It is unclear how the box / mask / text prompts were trained, or whether some were not trained at all…
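The two-way decoder described in the architecture bullet can be sketched in a few lines. This is a minimal single-head numpy illustration of the idea (prompt tokens attend to image tokens and vice versa, then a mask token is dotted against per-pixel embeddings), not SAM's actual implementation: learned projections, multi-head attention, layer norms, MLPs, and the upsampling path are all omitted, and the function names are my own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections:
    # each query row becomes a weighted mix of the key/value rows.
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values

def two_way_decoder_step(prompt_tokens, image_tokens):
    # prompt -> image cross attention: prompt tokens read the image
    prompt_tokens = prompt_tokens + cross_attention(prompt_tokens, image_tokens)
    # image -> prompt cross attention: image tokens read the prompts
    image_tokens = image_tokens + cross_attention(image_tokens, prompt_tokens)
    return prompt_tokens, image_tokens

def predict_mask(mask_token, image_tokens, h, w):
    # Mask logits = dot product of the mask-token embedding with each
    # (flattened) pixel embedding, reshaped back to a 2D grid.
    return (image_tokens @ mask_token).reshape(h, w)
```

In the real model this step is repeated for a couple of decoder layers and the pixel embeddings are upsampled before the dot product; the sketch only shows the token-mixing pattern.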
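The objective bullet (focal loss + dice loss) can be sketched as follows. This is a simplified numpy version for per-pixel binary mask logits; the 20:1 focal-to-dice weighting follows the paper, but the exact smoothing constants and reductions here are my assumptions.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    # Binary focal loss: down-weights easy pixels by (1 - p_t)^gamma.
    p = 1.0 / (1.0 + np.exp(-logits))
    pt = np.where(targets == 1, p, 1.0 - p)
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + 1e-8)))

def dice_loss(logits, targets, eps=1.0):
    # 1 - Dice coefficient between soft prediction and binary target.
    p = 1.0 / (1.0 + np.exp(-logits))
    inter = 2.0 * (p * targets).sum()
    return float(1.0 - (inter + eps) / (p.sum() + targets.sum() + eps))

def mask_loss(logits, targets, w_focal=20.0, w_dice=1.0):
    # SAM combines focal and dice loss in a 20:1 ratio.
    return w_focal * focal_loss(logits, targets) + w_dice * dice_loss(logits, targets)
```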
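For reference, the mIoU metric used in the evaluation bullet reduces to averaging intersection-over-union across prediction/ground-truth mask pairs. A minimal sketch for binary masks (the empty-union convention of 1.0 is my assumption):

```python
import numpy as np

def binary_iou(pred, gt):
    # Intersection over union of two boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def miou(preds, gts):
    # Mean IoU over a set of (prediction, ground truth) mask pairs.
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))
```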