Transformer Networks with Adversarial Discriminator for Domain Adaptation in Text Summarization
This blogpost describes the master’s project that one of our 2019 interns, Thijs Brouwers, undertook at BrainCreators.
My research subject was automatic text summarization, a challenging subfield of Natural Language Processing. In this post I first introduce you to the problem, then discuss the current advances in text summarization and describe the approach I took. Finally, I discuss the results of my work.
The two main approaches to text summarization are extractive and abstractive summarization. The first finds the most important parts of the original text and combines these into a summary. No new text is generated using this strategy: the summary consists entirely of copied text snippets.
Abstractive text summarization methods, on the other hand, employ more powerful Natural Language Processing techniques to generate summaries through paraphrasing. Since deep learning took off, this approach has gained considerable attention, and the results in this area have improved a lot. Abstractive summarization is generally viewed as a sequence-to-sequence problem, where the model learns a mapping between an input and an output sequence. In this area, and especially in machine translation, big improvements have been made lately due to the rise of the attention mechanism in encoder-decoder recurrent neural networks. This mechanism adds a context vector to the network, allowing the decoder to look over all the information in the source sentence at every step. This improves its ability to generate the right word given the word it is currently working on. Being able to enlarge the window of context has resulted in large improvements.
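To make the context vector concrete, here is a minimal pure-Python sketch of attention for a single decoder step. It assumes plain dot-product scoring and tiny hand-written vectors; the function names and numbers are illustrative only, and real models use learned projections and often a different scoring function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Each score is the dot product of the decoder state with one encoder
    state; the context vector is the weighted sum of the encoder states.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three source positions with 2-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
```

In this toy case the first and third source positions receive the same, highest weight, because they align best with the decoder state. The context vector is therefore dominated by those two positions.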
Though these models also improved on state-of-the-art performance in summarization, they are prone to reproducing factual details inaccurately and to repeating themselves. To overcome these problems, researchers extended the attention networks with new methods, like pointers and content selectors. Simply put, these methods are able to identify important words or information in the input during the encoding phase. During decoding, this notion of importance is leveraged by the decoder to generate good summaries.
In 2017 Google Brain dispensed with recurrence entirely and published a new type of network for sequence modelling that was based solely on attention mechanisms. By using only attention, the model is able to process the whole input at once, so that each input representation carries global information about every other token in the sequence. This led to good performance in machine translation. The model is called the Transformer, and below I explain how it works and which of its strengths are useful in summarization.
The left figure below shows the Transformer’s architecture. On its left side is the encoder of the network, which reads the input and generates a representation of it. On its right side is the decoder, which generates the output word by word while attending to the previously generated words and to the final representations generated by the encoder. The Transformer stacks N encoder and decoder layers on top of each other, each learning its own weights. With the use of multi-head attention (displayed in the right figure below), the model’s ability to focus on different positions is expanded, so that it can capture different aspects of the input and becomes more expressive. Because the multi-head attention ends in only a linear projection to the outputs, a feed-forward network is added after it to introduce nonlinearity, which improves the model’s expressiveness even more. Around both sub-layers, the Transformer uses layer normalization and residual connections to make optimization easier and to make sure it does not lose important information from lower layers. Because the whole input is processed at once, attention by itself has no notion of the order of the inputs. To solve this, the Transformer adds an explicit position encoding to the input embedding.
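As an illustration of the position encoding, the sketch below uses the sinusoidal scheme from the original Transformer paper. Whether this exact scheme was used in the project is not stated here, so treat it as an assumption. The function names and the toy embeddings are made up for the example.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the position encoding to each token embedding element-wise."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, row)]
            for emb, row in zip(embeddings, pe)]

# Two identical token embeddings become distinguishable once positions are added.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(embeddings)
```

The same word at two different positions now has two different input vectors, which gives attention something to tell the positions apart.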
In machine learning, methods for domain adaptation are popular, as data often comes from heterogeneous distributions, which hinders traditional supervised learning. The heterogeneity can be observed across different classes, but also within the same class. Take for instance a picture of a dog. Features that can vary include the lighting or the angle of the camera. Changes in these features result in shifts in the data distribution. In text summarization this is also likely to occur. Think for example of news articles. Within this domain many subdomains exist, and even within those subdomains features are likely to differ between articles. Therefore I used a method in my project that extends the Transformer network and tries to learn representations that mitigate these shifts in the data distributions. The goal is to make the representations generalize better to unseen domains as well.
I’ve used an adversarial domain discriminator to induce the desired behavior. The discriminator is a simple fully connected neural network that learns to discriminate between different input domains. The final representation of the source input is fed into both the decoder and the discriminator. The goal of the decoder is to produce a summary similar to the target summary. The goal of the discriminator is to classify the domain based on the output of the encoder. The different parts of the model are optimized simultaneously by adding the two loss functions together. However, when backpropagating from the discriminator network to the encoder, the gradients are reversed by multiplying them by -1, while the discriminator itself still uses the non-reversed gradients to update its parameters. In this way the discriminator still learns to distinguish between domains, while the encoder is pushed to generate representations that are not useful to the discriminator. This can be seen as a two-player game in which each player tries to outdo its opponent. The encoder, however, still needs to generate representations that are useful for the decoder to generate correct summaries.
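The gradient reversal can be illustrated with a deliberately tiny example in which the encoder and the discriminator are each a single scalar weight and the gradients are written out by hand. This is a sketch under those assumptions, not the project's implementation. In a deep learning framework the same idea is usually a layer that is the identity in the forward pass and negates the gradient in the backward pass. All names and numbers below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator_gradients(x, domain, w_enc, w_disc, lam=1.0):
    """Gradients of the domain loss for a scalar encoder and discriminator.

    Forward pass: h = w_enc * x, p = sigmoid(w_disc * h), binary
    cross-entropy against the domain label (0.0 or 1.0).
    """
    h = w_enc * x
    p = sigmoid(w_disc * h)
    d_logit = p - domain               # dLoss/dlogit for sigmoid + cross-entropy
    grad_w_disc = d_logit * h          # the discriminator gets the normal gradient
    grad_h = d_logit * w_disc          # gradient arriving at the encoder output
    grad_w_enc = (-lam * grad_h) * x   # reversed (and scaled) before the encoder
    return grad_w_disc, grad_w_enc

grad_disc, grad_enc = discriminator_gradients(
    x=1.0, domain=1.0, w_enc=0.5, w_disc=2.0)

# What the encoder gradient would have been without the reversal.
plain_grad_enc = (sigmoid(2.0 * 0.5) - 1.0) * 2.0 * 1.0
```

A gradient-descent step on `grad_disc` lowers the domain loss, while a step on `grad_enc` raises it, which is exactly the tug-of-war described above. In the full model the encoder gradient is the sum of this reversed gradient and the ordinary gradient of the summarization loss.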
I’ve trained three types of models in my project. First, I’ve trained baseline models on a single domain. This gave me an understanding of the behavior of the Transformer in every domain, and also an idea of how difficult the different domains are to summarize.
Next, I’ve trained the baseline models in a mixed setting. For every domain pair I’ve trained a model on the concatenation of the datasets of those two domains. To give the model a notion of the fact that it is learning two domains, the inputs are prepended with a domain tag. This has proven to be effective in machine translation.
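A domain tag is simply an extra token at the start of the input. The sketch below shows the idea; the tag format (`<cnn_dm>`, `<reddit>`), the function names and the toy data are assumptions made for illustration.

```python
def add_domain_tag(tokens, domain):
    """Prepend a tag token such as '<reddit>' to a tokenized input."""
    return [f"<{domain}>"] + list(tokens)

def mix_domains(datasets):
    """Concatenate {domain: [token lists]} datasets, tagging every input."""
    mixed = []
    for domain, examples in datasets.items():
        mixed.extend(add_domain_tag(tokens, domain) for tokens in examples)
    return mixed

mixed = mix_domains({
    "cnn_dm": [["a", "robot", "stalled"]],
    "reddit": [["so", "this", "happened"]],
})
```

In practice the mixed examples would also be shuffled before training, so that every batch contains inputs from both domains.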
Finally, there are my proposed models, the discriminator models. These models are trained on the same domain pairs as the mixed baseline models, without the use of the tags, but with an adversarial discriminator.
I compare the mixed baseline models and the discriminator models to see whether adding a discriminator improves the performance in both seen and unseen domains, and thus whether the discriminator is able to make the network generalize across domains. I also compare the single baseline models with the discriminator models, to see whether multi-domain training can improve on single-domain training.
So, how do these models work, and how do they optimize their goal?
The overall architecture is shown in the picture below. On the left there is an input, which flows through the 6 encoder layers. Each encoder layer uses 8 attention heads, which together are able to learn various aspects of the text. After the encoder we have its final hidden representations and its attention states. The decoder uses an attention distribution over the encoder hidden representations to be able to focus on all parts of the input before producing the next token. The output is a summary. The goal is to learn a function that maximizes the probability of generating the correct summaries, and this is done by using a cross-entropy training objective. This loss is used to update the weights in both the encoder and decoder.
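The cross-entropy objective can be sketched in a few lines of plain Python. This is only an illustration: the function name, the toy probabilities and the choice to average over tokens rather than sum are assumptions made for the example.

```python
import math

def sequence_cross_entropy(step_probs, target_ids):
    """Mean negative log-probability assigned to the reference tokens.

    step_probs[t] is the decoder's probability distribution over the
    vocabulary at step t; target_ids[t] is the reference token id.
    """
    losses = [-math.log(probs[target])
              for probs, target in zip(step_probs, target_ids)]
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens; the reference summary is the token ids [0, 2].
targets = [0, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
uncertain = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]
loss_confident = sequence_cross_entropy(confident, targets)
loss_uncertain = sequence_cross_entropy(uncertain, targets)
```

The model that puts more probability on the reference tokens gets the lower loss, so minimizing this loss is the same as maximizing the probability of the correct summary.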
Next comes the discriminator. It takes as input the attention context of the final encoder layer, which is the dot product of the final representations and the attention state of the encoder. Its goal is to classify the domain correctly. Its loss function is also the cross-entropy loss.
When backpropagating from the discriminator to the encoder, the gradients are reversed. The gradients are not only flipped; I also scale them by a factor 𝛌 that gradually increases as training progresses. This prevents the model from getting stuck in local optima at the beginning of training, when it hasn’t even learned to summarize yet. The goal is that, by adding this discriminator, the encoder starts to represent the input in such a way that the discriminator can no longer classify the domains correctly, while the representations remain useful for the decoder to summarize correctly. This should allow the model to become domain-invariant and thus adaptable to other domains as well.
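One common way to let the scaling factor grow is the smooth schedule proposed in the domain-adversarial training work of Ganin and Lempitsky. Whether this exact schedule was used in the project is not stated here, so the sketch below, including the default `gamma=10.0` and the function name, is an assumption.

```python
import math

def grl_lambda(step, total_steps, gamma=10.0):
    """Scaling factor that rises smoothly from 0 towards 1 during training."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

# Sample the schedule at the start, at 10%, halfway and at the end of training.
schedule = [grl_lambda(s, 1000) for s in (0, 100, 500, 1000)]
```

At step 0 the factor is exactly zero, so the encoder initially only receives the summarization gradient. The adversarial signal is then phased in as training progresses.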
The datasets I’ve used come from four different domains. CNN / Daily Mail contains news articles. WikiHow is a knowledge base. TitleGen is a dataset of scientific articles where the goal is to generate the title of the article based on the abstract. The final one is a big dataset containing social media posts from Reddit.
The commonly used metric for evaluating the quality of generated summaries is ROUGE. It compares the generated summary against a reference summary. ROUGE-1 measures the overlap of single words between the generated and the reference summary. ROUGE-2 measures the overlap of pairs of consecutive words, whereas ROUGE-L is based on the longest common subsequence of the generated and the reference summary, that is, the longest sequence of words that appears in both in the same order, though not necessarily consecutively. Together these metrics give an understanding of how well the generated summaries reflect the reference summaries.
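The following is a simplified sketch of the three metrics, reporting the F1 variant on whitespace-tokenized text. ROUGE-L here uses the longest common subsequence, as in the standard ROUGE definition. Official ROUGE implementations add options such as stemming and sentence splitting that are left out, and the example sentences are made up.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(matches, cand_total, ref_total):
    if matches == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    """Length of the longest common subsequence (words need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

reference = "the robot got stuck inside the reactor".split()
candidate = "the robot was stuck in the reactor".split()
r1 = rouge_n(candidate, reference, 1)   # 5 of 7 unigrams match
r2 = rouge_n(candidate, reference, 2)   # 2 of 6 bigrams match
rl = rouge_l(candidate, reference)      # LCS: "the robot stuck the reactor"
```

Note how ROUGE-2 drops sharply as soon as a few words differ, even though the two sentences say nearly the same thing. This is one reason why ROUGE scores should be read together with example summaries.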
My results in terms of ROUGE scores show that the diversity of different domains can be overcome, as all discriminator models outperform their respective mixed baseline models in unseen domains. They also outperform the mixed baseline models in-domain, which suggests that adding a discriminator enhances performance in the domains the models are trained on. However, models trained with a discriminator only outperformed the single baseline models in two out of four domains, namely Reddit and WikiHow. This probably has to do with the fact that these datasets are written by individuals with no guidelines whatsoever on how to write summaries. In CNN / Daily Mail and TitleGen, there are probably better-defined rules to follow, or at least these datasets serve a more well-defined purpose. This is probably why, for each of these two domains, a model trained solely on that domain performs best.
This is an example of a summary from the CNN / Daily Mail dataset, about the nuclear disaster in Fukushima. We can see that the model trained on only CNN / Daily Mail performed well. It does not name the reason why the robot got stranded, but on the other hand it gives some more information about the nuclear disaster itself. The mixed model trained on CNN / Daily Mail and Reddit makes several mistakes. First, it misses a part of the company’s name. It also forgets to mention the robot itself. And the end of the summary makes no sense either. The model trained on the same data, but with a discriminator, performs really well! The only thing it does not mention is where this happened; it just starts with ‘The robot’. But then again, the single baseline model does not mention the location either.
But unfortunately there are many summaries that are not correct, for both seen and unseen domains. Take for instance the second example below, from WikiHow, about how to shave. The model trained on this domain adds a weird sentence at the end that does not make sense at all. And the summary from a model trained with a discriminator on two other domains is just completely weird.
So despite the promising ROUGE scores of the proposed method, many summaries still contain errors, ranging from factual mistakes to complete nonsense. Besides that, the complexity of the task makes it hard for models trained on one domain to perform well in other domains, as a result of the big differences in content and writing style. But the method proved effective when compared to the baseline models, which suggests it could be beneficial in other NLP research areas as well.
What did my research add to the existing research done in this field? It showed that the Transformer model is suitable for text summarization and that adding a discriminator leads to better performance in both seen and unseen domains when compared to the mixed baseline models. It also showed that we can manipulate the way the Transformer learns to represent inputs, which makes the approach interesting for settings where the amount of available data is limited or where the mixing of similar data can be leveraged to enhance performance.
In the future, automatic text summarization will probably become more and more embedded in our daily lives. The advances in language understanding and generation in the last few years have been immense, with GPT-2 and XLNet as examples. These models were not yet optimized for summarization, but their results on other tasks are very promising. Furthermore, adversarial learning has gained a lot of attention over the last few years, and its applications are spreading to all fields of machine learning. It allows users to add constraints to the learning process that have proven to be very useful in many tasks and areas.