
paper
TL;DR#
- why I read this : table-related research; wondering whether SGG has any ideas I can borrow.
- task : Table Structure Recognition (TSR)
- problem : Tables can be represented through three modalities: coordinates, images, and text content. Coordinates are usually the most informative, but when the image is distorted the other modalities have to compensate. How can the modalities be trained to help each other?
- An architecture that is neither early fusion nor late fusion: each modality is processed on its own, then the modalities are fused, and this alternation is repeated over multiple stacked blocks.
- architecture : treat the bboxes in the image as nodes and all pairwise connections as edges. For each modality, an Ego Context Extractor (MHA where Q is the node feature and K=V is the edge representation) extracts intra-modality features, followed by three Cross Context Synthesizers, each querying with one modality and attending to the other modalities as K,V.
- objective : for every node pair (i, j), predict whether there is an edge between them and whether they are connected by the same row, column, or cell, with a BCE loss for each relation.
- baseline : FLAG-Net, TabStr, DGCNN
- data : ICDAR-2013, ICDAR-2019, WTW, UNLV, SciTSR, SciTSR-COMP
- evaluation : Tree Edit Distance (TED), BLEU
- result : SOTA
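The ECE/CCS design in the architecture bullet can be sketched as a minimal single-head numpy toy. The module names come from the note; the dimensions, the single-head attention, and the choice of concatenating the other two modalities as K=V are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def attention(Q, K, V):
    # plain scaled dot-product attention, single head for brevity
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d = 4, 8                                              # 4 bbox nodes, toy dim 8
node = {m: rng.normal(size=(n, d)) for m in ("coord", "image", "content")}
edge = {m: rng.normal(size=(n * n, d)) for m in node}    # all pairwise edge reps

# Ego Context Extractor: per modality, Q = node features, K = V = edge reps
ego = {m: attention(node[m], edge[m], edge[m]) for m in node}

# Cross Context Synthesizer (one per modality): query with one modality,
# attend to the other modalities as K = V (concatenation is an assumption here)
def cross(m):
    others = np.concatenate([ego[o] for o in ego if o != m], axis=0)
    return attention(ego[m], others, others)

fused = {m: cross(m) for m in ego}
```

Stacking several such ECE-then-CCS blocks gives the iterative intra/inter-modality interaction described above.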
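The objective bullet amounts to pairwise relation classification with one BCE term per node pair per relation type. A toy numpy version, where random predictions and labels stand in for model outputs and annotations:

```python
import numpy as np

def bce(p, y):
    # binary cross-entropy for one predicted probability p and 0/1 label y
    eps = 1e-7
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

n = 4                                    # number of bbox nodes
rng = np.random.default_rng(0)
relations = ("row", "col", "cell")
# predicted probability that nodes i, j are linked by each relation
pred = {r: rng.uniform(size=(n, n)) for r in relations}
# ground-truth adjacency per relation (e.g. the same-row matrix)
gt = {r: rng.integers(0, 2, size=(n, n)).astype(float) for r in relations}

# total loss: sum BCE over every (i, j) pair and every relation type
loss = sum(bce(pred[r][i, j], gt[r][i, j])
           for r in relations for i in range(n) for j in range(n))
```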
Details#
Motivation#

Architecture#

Uses Compressed MHA (CMHA), adapted from [Pyramid ViT](https://arxiv.org/abs/2102.12122), because full MHA is too computationally expensive.
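The rough idea of the compression, sketched in numpy: shorten the K/V sequence before attending, which cuts the attention cost by the reduction ratio R. PVT's actual spatial-reduction attention performs this with a strided convolution over 2D feature maps; the mean-pooling here is a simplified stand-in:

```python
import numpy as np

def attention(Q, K, V):
    # plain scaled dot-product attention
    d = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def compressed_attention(X, R=4):
    # pool every R tokens of the K/V sequence into one before attending,
    # so the score matrix is N x (N/R) instead of N x N
    n, d = X.shape
    kv = X[: n - n % R].reshape(-1, R, d).mean(axis=1)
    return attention(X, kv, kv)

X = np.random.default_rng(0).normal(size=(16, 8))
out = compressed_attention(X, R=4)       # K/V reduced from 16 to 4 tokens
```

With R=1 nothing is pooled and this reduces to ordinary self-attention.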
Result#
