The development of new architectures allows long input windows of text to be processed at once, overcoming both memory and computational constraints. Recent developments have pushed maximum input windows to 65k+ tokens, compared to BERT's 512-token limit. We explore, compare, and improve state-of-the-art long-window architectures for summarizing long texts. We consider BERT (512 tokens), GPT-3 (2,048 tokens), and BigBird (4,096 tokens), focusing on the financial narrative domain and the summarization of 100- to 200-page documents. Testing models with different maximum input sizes lets us examine their benefits and limitations. Long input windows allow wider context to be included in the summarization process, avoiding out-of-context sentence extraction that can alter sentence-level semantics. We compare extractive and abstractive methods on aspects that are key in the financial context, such as numerical accuracy and semantic fidelity.
We show that extractive methods (BERT-based) can retain sentence-by-sentence accuracy with respect to the source text; however, the extraction process can produce fragmented summaries that lead to misleading interpretations. We also show that abstractive methods can produce fluent summaries, introducing BigBirdFLY, a wide-context summarization method based on BigBird. Through human evaluation, we find that BigBirdFLY produces summaries more similar to human-generated ones and excels on the human evaluation criteria, whereas extractive methods score highly on automatic metrics (ROUGE). Finally, we explore how enhanced greedy sentence-selection methods that exploit the long input window in a single step compare to recursive solutions based on Reinforcement Learning.