Answer-Aware Question Generation from Tabular and Textual Data using T5

Automatic Question Generation (AQG) systems are applied in many domains to generate questions from sources such as documents, images, and knowledge graphs. With the rising interest in such AQG systems, it is equally important to account for structured data such as tables when generating questions from documents. In this paper, we propose a single model architecture for question generation from tables along with text using the “Text-to-Text Transfer Transformer” (T5), a fully end-to-end model that does not rely on any intermediate planning steps, delexicalization, or copy mechanisms. We also present our systematic approach to modifying the ToTTo dataset and release the augmented dataset as TabQGen, along with the scores achieved using T5 as a baseline, to aid further research.


Introduction
The development of end-to-end supervised Question-Answering (QA) models has been accelerated by the advent of large-scale datasets. The Stanford Question Answering Dataset (SQuAD) [5] is a reading comprehension dataset composed of questions on Wikipedia articles, with the answer to each question being a part of the corresponding reading passage. Microsoft Machine Reading Comprehension (MS MARCO) [6] is a large-scale dataset focused on reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. TriviaQA [7] is a realistic text-based question-answer dataset with 950K question-answer pairs extracted from Wikipedia and the internet. Since the answers to its questions cannot always be obtained through span prediction, TriviaQA is more challenging than traditional QA benchmark datasets such as SQuAD. DuoRC [8] comprises 186K distinct question-answer pairs derived from 7680 pairs of movie plots, each pair representing two different versions of the same film, and highlights the challenges of combining knowledge and reasoning in neural architectures for reading comprehension.
Neural-network based solutions can generally benefit from more training data, especially in domains that the existing datasets do not cover. Augmenting existing datasets or creating new datasets for specific domains is a time-consuming, tedious, and expensive task. To alleviate this problem and create more training data, there is a growing interest in developing new techniques that can automatically generate questions from a given source such as a document [9] [10]. This task is referred to as Automatic Question Generation.
Given the practical importance of AQG systems, it is crucial to consider all forms of data in a document while generating questions. However, current AQG systems are ineffective in generating a large number of high-quality question-answer pairs from structured data such as tables. A tabular structure represents complex and vital information in a format that is much easier to interpret, and ignoring it leads to a loss of potential high-quality questions. Consider an organization that wants to create a QA database for its policies. Financial policies such as 'Per Diem' are likely to contain a substantial amount of information, such as the 'per day' expenses for the company across all branches, represented as tables. In this case, a typical AQG system that ignores tabular data would struggle to generate a comprehensive QA dataset. AQG also offers great value in computer-aided assessments [37,38], allowing instructors to generate a plethora of questions from existing materials such as presentations and documents.
In this paper, we emphasize the need to improve existing AQG systems and address the challenges involved in working with tabular data. To the best of our knowledge, Question Generation from tables has not been studied intensively, leading to a dearth of academic datasets to explore. However, recent advancements in NLP have made it possible to represent a section of tabular data as natural language, a task also referred to as Table-to-Text Generation. We propose a single model architecture for question generation from textual and tabular data using T5. We also release TabQGen, a modified version of the ToTTo [4] dataset that can be used to generate questions from tables based on the highlighted cells, along with its baseline scores to aid further research.

Question Generation from Text
Early work on question generation relied on heuristic algorithms to produce questions using manually constructed templates. The authors of [14] proposed automatically generating self-questioning instructions for a given text by decomposing strategy instruction into describing, modeling, scaffolding, and prompting the strategy. A template-based approach to online learning [15] incorporated semantic role labels into a system that generated natural language questions automatically. Later approaches also employed semantic pattern recognition to produce questions of various depths and types, as well as semantic role labeling of source sentences to produce both questions and responses in a domain-independent way [16]. Other approaches generated deep comprehension questions by breaking the task down into an ontology crowd-relevance workflow that included representing the original text in a low-dimensional ontology, crowdsourcing candidate question templates aligned with that space, and ranking potentially relevant templates for a novel region of text [17]. Although [18] used an over-generate-and-rank strategy followed by learning to rank the questions, the performance of the system still relies heavily on the manually constructed generation rules. To examine the relationship between language and image, [19] created a visual question generation task.
Attention-based neural networks are also used for question generation, for example by training a sequence learning model with a seq2seq learning objective [9]. Formerly overlooked approaches such as a hierarchical neural sentence-level sequence tagging model [10] helped bolster the use of attention-based neural networks for question generation tasks. In [13], high-quality question-answer pairs were generated using a system consisting of an information extractor, a neural question generator, and a neural quality controller. Some models [20] feed the generated questions to a QA system and use that system's performance as a measure of question quality. A few models consider question answering (QA) and question generation (QG) to be complementary tasks and focus on training the two jointly. A generative machine comprehension model that encodes the text and creates a question based on the response using a seq2seq framework was proposed in [21]. Some approaches rely on probabilistic correlation to direct the training of both the QA and QG models simultaneously [22]. Other models focus only on the performance of the QA task and not explicitly on the quality of the generated questions. The Generative Domain-Adaptive Nets training framework [23] trains a generative model to produce questions from unlabeled text and integrates the model-generated questions with human-generated questions to train a question answering model.
Textual Question and Answer Generation has been well studied in recent years following the introduction of transformer-based architectures, which cope better with long passages. The recent success of large-scale transformer-based architectures such as BERT [24], RoBERTa [25], T5 [1], and PEGASUS [26] has further accelerated the research. Quiz-Style Question Generation for News Stories [11] formulated the problem as two distinct seq2seq tasks: question-answer generation (QAG) using PEGASUS [26] and distractor, or incorrect answer, generation (DG) using T5 [1]. The authors of [12] proposed a Rough Answer and Key Sentence Tagging approach to discover answer-related content and an Answer-guided graph to collect answer-focused structural information that supplements seq2seq models in constructing exam-like questions, using a dataset extracted from RACE [31].
Table-to-Text Generation is dominated by attention-based neural networks. While [3] presented an LSTM-based seq2seq model that augments the seq2seq attentional model with a hybrid "pointer-generator" network, [2] applied the seq2seq model in two stages: first generating a content plan that outlines the information to be included along with its structure, and then generating a document that takes the content plan into consideration. BERT-to-BERT [27] is a transformer-based encoder-decoder model in which both the encoder and decoder are initialized with publicly available checkpoints of BERT [24]. In particular, [28] used T5 [1] for a variety of Data-to-Text tasks, including Table-to-Text generation.

Research Method
As with any machine learning problem, we need a reliable dataset to come up with a solution. While there are many techniques and datasets available for Table-to-Text Generation, which focuses on generating descriptive text based on the highlighted cells, there are few reliable datasets for question generation from tables. So, rather than creating an entirely new dataset, we modified an existing Table-to-Text dataset into a Table-to-Question dataset to enhance existing AQG systems, allowing them to generate questions from tables along with the text.
The ToTTo dataset serves as a great baseline for Table-to-Text tasks because it covers a significant variety of domains, contains highlighted cells that the model can attend to, and includes metadata for each table, which provides additional context for the model to better formulate the output. In this section, we explain our entire pipeline in two parts, one focusing on the dataset creation and the other on question generation from tables using the generated dataset, as shown in Figure 1.

Table-to-Question Dataset Generation
ToTTo offers over 120K training examples and puts forward a controlled generation task: constructing a one-sentence description given a Wikipedia table and a set of highlighted cells. Each training example has around 2-3 different descriptions, totaling ~240K-360K descriptions that could potentially be converted into question form. Given the size of the dataset, it would not be viable to write a question for each description manually. To tackle this problem, we fine-tuned a model to generate a wide variety of questions given a context and an answer.
Transfer Learning for Multi-Type Question Generation. We rely on the Stanford Question Answering Dataset [5] (SQuAD1.1), which has over 100K question-answer pairs on 500+ articles, for generating semantically accurate questions with high variance, with answers ranging from a single word to a full sentence. For Boolean-style questions, we rely on the BoolQ [36] (Boolean Questions) dataset, which consists of 9427 labeled training samples and 3270 labeled validation samples. The ability of T5 to perform multiple tasks based on the input prefix allows us to use a single model for multiple question-generation tasks. Using this approach, we fine-tune a single T5 model to generate an extensive range of questions with task-specific prefixes as shown below.
For questions with Boolean answers:
Input: boolqgen answer: True context: In 2020, among the top 5 leagues in Europe, the Portuguese forward Cristiano Ronaldo scored 33 goals for the Italian club Juventus making him the top goal scorer of the season.
Generated Question: did cristiano ronaldo score more goals than anyone else in europe?
For questions with one-word answers:
Input: qgen answer: Cristiano Ronaldo context: In 2020, among the top 5 leagues in Europe, the Portuguese forward Cristiano Ronaldo scored 33 goals for the Italian club Juventus making him the top goal scorer of the season.
Generated Question: Who was the top goal scorer in 2020?
For questions with sentence-length answers:
Input: qgen answer: The Portuguese forward Cristiano Ronaldo scored 33 goals. context: In 2020, among the top 5 leagues in Europe, the Portuguese forward Cristiano Ronaldo scored 33 goals for the Italian club Juventus making him the top goal scorer of the season.
Generated Question: How many goals did Cristiano Ronaldo score for Juventus in 2020?
For questions with summary answers:
Input: qgen answer: Among top 5 leagues in Europe, Cristiano Ronaldo is the top goal scorer with 33 goals in 2020. context: In 2020, among the top 5 leagues in Europe, the Portuguese forward Cristiano Ronaldo scored 33 goals for the Italian club Juventus making him the top goal scorer of the season.
Generated Question: Who is the top goal scorer in the 2020 European football season?
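The prefixed inputs in the examples above can be assembled with a small helper. The following is a minimal sketch: the `qgen` and `boolqgen` prefixes and the `answer: ... context: ...` layout are taken from the examples, while the function name `build_qg_input` and the whitespace normalization are assumptions made for illustration.

```python
def build_qg_input(answer: str, context: str, boolean: bool = False) -> str:
    """Format one answer-aware question-generation input string.

    `build_qg_input` is a hypothetical helper; only the prefixes and the
    "answer: ... context: ..." layout come from the examples in the text.
    """
    prefix = "boolqgen" if boolean else "qgen"
    # Collapse stray whitespace so the prompt stays on a single line
    # (an assumption; the text does not describe any normalization).
    answer = " ".join(answer.split())
    context = " ".join(context.split())
    return f"{prefix} answer: {answer} context: {context}"


context = (
    "In 2020, among the top 5 leagues in Europe, the Portuguese forward "
    "Cristiano Ronaldo scored 33 goals for the Italian club Juventus "
    "making him the top goal scorer of the season."
)
print(build_qg_input("Cristiano Ronaldo", context))
print(build_qg_input("True", context, boolean=True))
```

Keeping the task choice in the prefix means the same fine-tuned model can serve both Boolean and free-form question generation without any change to the model itself.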
Applying Multi-Type Question Generator to ToTTo. We apply our Multi-Type Question Generator model to the descriptions in the ToTTo dataset to generate questions. Each final_sentence in the ToTTo dataset is passed as both the context and the answer, so that the model generates a question whose answer is the entire sentence. We added the generated question to its respective entry under the question key. This modified ToTTo dataset is referred to as TabQGen in this paper. Figure 2 shows the sample metadata of the TabQGen dataset, where "…." represents values inherited from ToTTo. Although ToTTo features a test set containing Overlap and Non-Overlap samples, it is not publicly available, so we use ToTTo's dev set as TabQGen's test set.
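The augmentation step can be sketched as follows. This is an illustration under stated assumptions, not the released script: the `final_sentence` and `question` keys come from the text, the `sentence_annotations` list is assumed from the public ToTTo JSON layout, placing `question` inside each annotation is a guess, and `generate` is a hypothetical callable standing in for the fine-tuned T5 model.

```python
from typing import Callable, Dict


def add_questions(entry: Dict, generate: Callable[[str], str]) -> Dict:
    """Return a copy of a ToTTo-style entry with a question per description.

    `generate` is a hypothetical stand-in for the fine-tuned question
    generator; it maps a prefixed input string to a question string.
    """
    augmented = dict(entry)
    annotations = []
    for annotation in entry.get("sentence_annotations", []):
        sentence = annotation["final_sentence"]
        # The whole description serves as both the answer and the context.
        prompt = f"qgen answer: {sentence} context: {sentence}"
        # Assumption: the question is stored next to its description.
        annotations.append({**annotation, "question": generate(prompt)})
    augmented["sentence_annotations"] = annotations
    return augmented


# Toy entry and a dummy generator, for illustration only.
toy_entry = {
    "sentence_annotations": [
        {"final_sentence": "Cristiano Ronaldo scored 33 goals in 2020."}
    ]
}
dummy_generate = lambda prompt: "<generated question>"
print(add_questions(toy_entry, dummy_generate))
```

The function builds a new entry instead of mutating its input, so all values inherited from ToTTo remain unchanged.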

Question Generation from Tables
Now that we have our dataset containing tables, corresponding highlighted cells, and questions, the challenge is to make T5 understand the tabular data. Encoding the entire table would require special embeddings as in TAPAS [29], but since TAPAS is based on the BERT architecture, adding a decoder model to generate text would not only add complexity but also deviate from our aim of using a single model architecture. We instead followed the linearization approach of [28] and used special tags to represent the tabular data and metadata. We also resized the embeddings of the T5 model to accommodate the newly added tags, whose weights are learned during training.
As mentioned in [4], the full-table approach uses the entire table as the source along with additional tokens for highlighted cells, but it performs poorly due to large table sizes. The sub-table approach instead focuses solely on the highlighted cells along with their respective column and row headers, leading to better performance as the model focuses only on the relevant content. Figure 3 shows the disparity between the tokenized lengths of the full-table and sub-table representations. Only 36% of the training samples have at most 600 tokens in the full-table representation, whereas almost all samples have fewer than 600 tokens in the sub-table representation. This is a considerable benefit for contemporary transformer architectures, most of which were trained with a maximum input length of 512.
Furthermore, we observed that using a sub-table with metadata achieved far better results than using it without metadata, which corroborates the findings of [4]. The metadata gives the model the necessary context about the table, reducing noisy predictions caused by hallucination [30]. It also allows the model to focus on a set of highlighted cells rather than the entire table, yielding more directed and precise questions as shown in Table 1.
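A sub-table linearization with metadata might look like the sketch below. The tag names (`<page_title>`, `<section_title>`, `<cell>`, `<col_header>`, `<row_header>`), the input layout, and the function name are assumptions for illustration; the text does not list the exact tags used, and the toy values are made up.

```python
from typing import Dict, List


def linearize_subtable(page_title: str, section_title: str,
                       cells: List[Dict]) -> str:
    """Flatten highlighted cells plus table metadata into one tagged string.

    All tag names are hypothetical placeholders for the special tags
    mentioned in the text.
    """
    parts = [
        f"<page_title> {page_title} </page_title>",
        f"<section_title> {section_title} </section_title>",
    ]
    for cell in cells:
        inner = [cell["value"]]
        # Attach the headers that give each highlighted cell its meaning.
        inner += [f"<col_header> {h} </col_header>"
                  for h in cell.get("col_headers", [])]
        inner += [f"<row_header> {h} </row_header>"
                  for h in cell.get("row_headers", [])]
        parts.append("<cell> " + " ".join(inner) + " </cell>")
    return " ".join(parts)


# Toy highlighted cell, for illustration only.
toy_cells = [{"value": "33", "col_headers": ["Goals"],
              "row_headers": ["Cristiano Ronaldo"]}]
print(linearize_subtable("Example page", "Example section", toy_cells))
```

If the Hugging Face Transformers library is used (the text does not name one), such tags would typically be registered with the tokenizer's `add_tokens` method, followed by the model's `resize_token_embeddings`, so that each tag maps to a single learnable embedding.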

Results
We trained the T5-small, T5-base, and T5-large models using a single NVIDIA Tesla T4 machine with batch sizes of 32, 8, and 4 respectively. We used the AdamW optimizer with a learning rate of 1e-4 and a linear schedule with warmup. We used 91.72% of the TabQGen dataset for training and the remaining 8.28% (~10K samples) for validation. Table 2 shows the performance of the T5 models on the hold-out test dataset of TabQGen.
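For readers unfamiliar with a linear schedule with warmup, the sketch below shows the common form: the learning rate rises linearly from zero to the base rate during warmup and then decays linearly to zero. The base rate of 1e-4 comes from the text; the warmup length and total step count are not reported, so the values used here are placeholders, and the function name is hypothetical.

```python
def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float = 1e-4) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale down linearly from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Placeholder schedule: 100 total steps, 10 warmup steps (assumed values).
for step in (0, 5, 10, 55, 100):
    print(step, linear_warmup_lr(step, total_steps=100, warmup_steps=10))
```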
The following metrics are used to assess the performance of question generation from tables:
─ NIST [32] measures the information gain from each n-gram considered. This gives more credit when a system matches a difficult n-gram and less credit for an easy n-gram match.
─ BLEU [33] measures the quality of machine-translated text by comparing its n-grams against reference translations. We used a cumulative 4-gram BLEU score (B4) as an evaluation metric.
─ ROUGE-L [34] uses statistics based on the Longest Common Subsequence (LCS) to evaluate recall, that is, how many words in the reference sentences appear in the predicted sentences.
─ METEOR [35] is a machine-translation metric based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
The TabQGen dataset along with relevant scripts can be found at https://github.com/saichandrapandraju/TabQGen or https://github.com/msakthiganesh/TabQGen.
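To make the LCS-based idea behind ROUGE-L concrete, the sketch below computes LCS recall, precision, and a balanced F-score for a single reference and prediction. It is a simplified illustration, not the scoring script behind the reported results: lowercased whitespace tokenization and the equal weighting of precision and recall are assumptions.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            # Extend the match on equal tokens, else carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, prediction: str) -> Dict[str, float]:
    """Simplified ROUGE-L for one reference/prediction pair."""
    ref = reference.lower().split()
    pred = prediction.lower().split()
    lcs = lcs_length(ref, pred)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(pred) if pred else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


print(rouge_l("Who was the top goal scorer in 2020",
              "Who is the top scorer in 2020"))
```

Because the LCS need not be contiguous, the prediction is rewarded for keeping reference words in the right order even when other words are inserted or dropped between them.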

Conclusion
In this paper, we emphasize the need for AQG systems to effectively utilize all the available data in source documents and propose an Answer-Aware Question Generation system using T5 to generate questions from both tabular and textual data. To do so, we augmented each entry of the ToTTo dataset with its respective questions and named this augmented dataset TabQGen. TabQGen is then used to fine-tune a T5 model to generate questions from tables. Combined with existing AQG approaches, this allows questions to be generated from source documents while considering both the textual and tabular data. The findings demonstrate that the model can generate a wide range of high-quality questions from tabular data. Since we concentrated on automated metrics such as BLEU, NIST, ROUGE, and METEOR, confirming our findings through human inspection is a critical next step.

Authors
Saichandra Pandraju graduated with a bachelor's degree in Electronics and Communications Engineering from QIS College of Engineering and Technology, Ongole, India. At present, he is working in the Research & Development wing at Infosys Ltd., mainly engaged in the research of Natural Language Processing.
Sakthi Ganesh Mahalingam graduated from VIT University, Vellore, India with a bachelor's degree in Electronics and Communications Engineering. At present, he is