An Approach to Visual Text Correction Using RoBERTa
Updated: Jan 30, 2022
Visual Text Correction (VTC) denotes a family of methods that detect and replace an inaccurate word in a sentence given both textual and visual information. In this paper, we propose a novel solution to the VTC problem based on stacked generalization, combining the outputs of several neural networks built on RoBERTa and MnasNet. We feed image features extracted by VGG19 together with English natural-language text into two backbone models: MnasNet serves as the image backbone, and RoBERTa as the NLP Transformer. We train these models separately (RoBERTa alone, and RoBERTa with MnasNet) and then combine them into a weighted ensemble. Evaluated on a realistic falsified dataset, this ensemble demonstrates the strength of ensembling neural networks: our experiments show that a linear combination of the two models, RoBERTa alone and RoBERTa with MnasNet, achieves higher accuracy than either model on its own.
KEYWORDS: NLP, BERT, RoBERTa, MnasNet, Stacked Generalization, Visual Text Correction
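The weighted ensemble described in the abstract can be sketched as a simple linear blend of the two models' per-candidate probabilities. The following is a minimal illustration, not the paper's implementation: the function name, candidate scores, and the weight `alpha` (which would be tuned on validation data) are all assumptions.

```python
def blend(p_text_only, p_text_visual, alpha=0.4):
    """Linearly combine two probability vectors over candidate replacement
    words; alpha is a hypothetical ensemble weight tuned on held-out data."""
    return [alpha * a + (1 - alpha) * b
            for a, b in zip(p_text_only, p_text_visual)]

# Hypothetical scores over three candidate replacement words.
p_roberta = [0.60, 0.30, 0.10]          # RoBERTa-only model
p_roberta_mnasnet = [0.20, 0.70, 0.10]  # RoBERTa + MnasNet model

scores = blend(p_roberta, p_roberta_mnasnet)
# scores == [0.36, 0.54, 0.10]; the ensemble picks candidate 1,
# even though the two base models disagree on the top choice.
best = max(range(len(scores)), key=scores.__getitem__)
```

The single weight `alpha` is the simplest form of stacked generalization; a full stacking setup would instead train a small meta-learner on the base models' validation outputs.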