
TL;DR
- I read this because : it's a GPT-series paper that has been talked about in various places.
- task : ChatGPT with visual input/output
- problem : ChatGPT only communicates through language. It would be nice to handle image input/output, but training a new model that fuses ChatGPT with a vision model would take too long.
- idea : instead, build a system that can call external Vision Foundation Models -> use chain-of-thought to decide which vision model to call, then act -> make the system rephrase ambiguous queries to fit the chat interface and reference the image files it creates.
- architecture : vision models collected from Hugging Face etc. + ChatGPT (InstructGPT-based) + the whole system wired together with LangChain
- objective : LM cross-entropy loss (no new training objective)
- baseline : x
- data : no new training appears to be done
- evaluation : qualitative only
- result : Working
- contribution : possibly the first visual ChatGPT
- limitation / things I cannot understand : I expected a Flamingo-like model, but it isn't one.. it feels more like an instruction manual (a prompting system) than a model.. not fancy, but this approach may well become the mainstream in the future…
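The idea above can be sketched as a minimal think -> act -> observe loop: the LLM first decides whether a vision tool is needed, calls it, and repeats the observation in its final answer. The tool names and `fake_llm` below are hypothetical stand-ins for illustration; the paper wires real Hugging Face models through LangChain.

```python
from typing import Callable, Dict

# Hypothetical vision tools keyed by name (the paper registers ~20 HF models).
TOOLS: Dict[str, Callable[[str], str]] = {
    "image_captioning": lambda path: f"a photo described from {path}",
    "segmentation": lambda path: f"segmentation mask saved next to {path}",
}

def fake_llm(prompt: str) -> str:
    """Stand-in for ChatGPT: decides whether a tool call is needed."""
    if "describe" in prompt.lower():
        return ("Thought: Do I need to use a tool? Yes\n"
                "Action: image_captioning\n"
                "Action Input: image/cat.png")
    return "Thought: Do I need to use a tool? No\nFinal Answer: done"

def run_agent(query: str) -> str:
    """One think -> act -> observe step, then a final answer."""
    decision = fake_llm(query)
    if "Action:" in decision:
        fields = dict(line.split(": ", 1)
                      for line in decision.splitlines() if ": " in line)
        observation = TOOLS[fields["Action"]](fields["Action Input"])
        # Observations are invisible to the human, so the agent must
        # repeat the important information in the final response.
        return f"Final Answer: {observation}"
    return decision.split("Final Answer: ")[-1]
```

For example, `run_agent("please describe image/cat.png")` routes through the captioning tool, while a plain greeting skips tool use entirely.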
Details

visual foundation models : various models from Hugging Face (e.g. MaskFormer for segmentation)

Prompt-manager rules quoted from the paper:
- "Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather than imagination."
- "The thoughts and observations are only visible for Visual ChatGPT; Visual ChatGPT should remember to repeat important information in the final response for Human."
- Each user query is prefixed with "Thought: Do I need to use a tool?"
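A rough sketch of how a prompt manager might stitch these rules and the tool-decision prefix around a user query. The rule wording is quoted from the paper; the `build_prompt` function itself is my guess at the assembly, not the paper's actual code.

```python
# System rules quoted from the Visual ChatGPT prompt (as excerpted above).
SYSTEM_RULES = (
    "Since Visual ChatGPT is a text language model, Visual ChatGPT must use "
    "tools to observe images rather than imagination. The thoughts and "
    "observations are only visible for Visual ChatGPT; Visual ChatGPT should "
    "remember to repeat important information in the final response for Human."
)

def build_prompt(user_query: str, history: str = "") -> str:
    """Prepend the system rules and the tool-decision prefix to a query.

    `history` would carry prior Human/AI turns and tool observations;
    its exact format here is a hypothetical simplification.
    """
    return (
        f"{SYSTEM_RULES}\n\n"
        f"{history}"
        f"Human: {user_query}\n"
        "Thought: Do I need to use a tool?"
    )
```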
