Mastering Long Document Insights: Advanced Summarization with Amazon Bedrock and Anthropic Claude 2 Foundation Model
Unleash the Power of Generative AI for Comprehensive Document Analysis and Summarization
Text summarization is a standard Natural Language Processing (NLP) task that creates a concise summary of a longer text. According to LangChain, with foundation models, if the text is too long to stuff into a single prompt, we could use techniques like Map Reduce to summarize each document in a "map" step and then "reduce" the summaries into a final summary. This technique is commonly referred to as a "summary of summaries." In this post, we will move beyond simple summarization and explore advanced techniques to analyze long texts using Amazon Bedrock and the Anthropic Claude 2 Foundation Model.
In the spirit of the season, we will analyze Bram Stoker's original 1897 Victorian-era Gothic novel, Dracula, freely available through Project Gutenberg. Written in the epistolary style, the novel is based on diary entries, telegrams, letters, and memos, providing an additional level of complexity when analyzing the longer text. We will explore elements such as the novel's context, setting, characters, plot, literary devices, and themes aided by Generative AI.
Source Code
The source code used in this post's demonstration is open-sourced and available on GitHub. I suggest starting with the project's Jupyter Notebook, which contains all the examples herein plus additional examples not included in the post.
Amazon Bedrock and Anthropic Claude
AWS announced the general availability of Amazon Bedrock in late September 2023. Amazon Bedrock is a fully managed serverless service that makes foundation models (FMs) from leading AI companies, such as Cohere, Anthropic, Stability AI, Meta, and AI21 Labs, available through an application programming interface (API). The Anthropic Claude 2 foundation model employed in this post was accessed through Amazon Bedrock using the Boto3 SDK for Python.
What is a Prompt?
Quoting Wikipedia, prompt engineering is the process of structuring text that can be interpreted and understood by a generative AI model. A prompt is natural language text describing the task that an AI should perform. Further, citing Prompt Engineering Guide's Elements of a Prompt, a prompt can contain any of these four types of elements: an instruction, context, input data, and an output indicator.
You do not need all four elements for a prompt, and the format depends on the task and the model provider.
Model-specific Prompt Engineering
Generally, each model provider has some required or recommended methods for constructing an optimal prompt for their particular foundation model. For example, according to Anthropic's guide, Using the correct format, Claude was trained as a conversational agent using the special tokens \n\nHuman: and \n\nAssistant: to mark who is speaking. The \n\nHuman: (you) asks a question or gives instructions and the \n\nAssistant: (Claude) responds. Additionally, according to Anthropic's guide, Constructing a prompt, XML tags, like <tag></tag>, help demarcate some essential parts of your prompt, such as rules, examples, or input text to process. Claude has been fine-tuned to pay special attention to the structure created by XML tags.
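As a minimal sketch of the format described above, the helper below wraps instructions and any XML-tagged sections in the \n\nHuman: and \n\nAssistant: markers. The function name build_claude_prompt and its parameters are hypothetical and not part of the post's source code; they only illustrate the structure.

```python
def build_claude_prompt(instructions: str, tagged_sections: dict[str, str]) -> str:
    """Wrap instructions and XML-tagged sections in the Human/Assistant format.

    Hypothetical helper for illustration; not part of any library.
    """
    parts = [instructions.strip()]
    for tag, content in tagged_sections.items():
        # Each section is demarcated by a matching pair of XML-style tags
        parts.append(f"<{tag}>\n{content.strip()}\n</{tag}>")
    body = "\n".join(parts)
    # The prompt starts with the Human turn and ends with an open Assistant turn
    return f"\n\nHuman: {body}\n\nAssistant:"


prompt = build_claude_prompt(
    "Summarize the text inside the <text></text> XML tags below.",
    {"text": "It was a dark and stormy night."},
)
print(prompt)
```

Keeping the formatting in one place makes it harder to forget the leading \n\nHuman: or the trailing \n\nAssistant: when prompts grow longer.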
Text Splitting
Long texts whose length exceeds the context window (aka token limit) of a foundation model must be split into smaller pieces, commonly referred to as chunks, before they can be summarized. Even Anthropic Claude 2, with its massive 100K context window, cannot ingest our entire novel, whose 27 chapters total approximately 833K characters, 153K words, or 206K tokens.
Splitting or chunking a long text into smaller chunks is more challenging than it may sound. You could somewhat mindlessly set an arbitrary character chunk size and the chunk overlap. However, as LangChain's Text splitter documentation states, ideally, you want to keep the semantically related pieces of text together. What semantically related means could depend on the type of text. Luckily enough, our novel contains 27 chapters with an average length of 75 paragraphs, 5,683 words, 30,874 characters, or 7,628 tokens. A chapter is defined as "a main division of a book." Thus, each chapter should contain semantically related pieces of text.
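For contrast, the naive fixed-size approach mentioned above can be sketched in a few lines. This is an illustration only and is not used elsewhere in the post; the function name chunk_text and the sizes in the usage example are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the final chunk already reaches the end of the text
        start += step
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

Note that the boundaries fall wherever the character count dictates, which can cut a sentence or a diary entry in half. That is the weakness that splitting on chapters avoids.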
We can use simple Regular Expressions to split the chapters and remove any extraneous content before or after the novel's chapters:
import re

def split_book(book_text):
    # Use regular expressions to split the book into chapters
    # Specific to this Gutenberg eBooks format
    chapters = re.split(r"^CHAPTER [IVXLCDM]+$", book_text, flags=re.MULTILINE)
    # remove everything prior to chapter 1
    chapters.pop(0)
    # Split the last chapter into two parts and remove everything after "THE END"
    chapter26 = re.split(r"^.*THE END.*$", chapters[26], flags=re.MULTILINE)[0]
    chapters.pop(26)
    chapters.append(chapter26)
    return chapters
What are Tokens?
Anthropic defines tokens as the smallest individual "atoms" of a language model, which can variously correspond to words, subwords, characters, or even bytes in the case of Unicode. For Claude, the average token is about 3.5 characters. Using the client.count_tokens() method in Anthropic's Python library, we can compute the average input token length of the chapters in the novel to be 7,628, with an average character-to-token ratio of 4.05:1, within the ballpark of the Anthropic estimate of 3.5:1.
from anthropic import Anthropic
client = Anthropic()
chapter_token_count = client.count_tokens(chapter.strip())
Based on that average input token count, we could easily pass the contents of several chapters at one time into the Anthropic Claude 2 foundation model due to its 100K token context window. However, instead of figuring out other ways to split the novel and maintain the semantic relationship within chunks, we will use the individual chapters for all examples in this post.
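To illustrate the point about fitting several chapters into one request, here is a rough sketch that greedily groups consecutive chapters under a token budget. The pack_chapters function and the 20,000-token budget are hypothetical, chosen only to keep the example small. The token counts are those of the first six chapters in the table that follows.

```python
def pack_chapters(token_counts: list[int], budget: int) -> list[list[int]]:
    """Group consecutive chapter indices so each group's token total fits the budget."""
    groups: list[list[int]] = []
    current: list[int] = []
    used = 0
    for index, tokens in enumerate(token_counts):
        if tokens > budget:
            raise ValueError(f"chapter at index {index} exceeds the budget on its own")
        if current and used + tokens > budget:
            # Adding this chapter would overflow the budget, so start a new group
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += tokens
    if current:
        groups.append(current)
    return groups


# Token counts for chapters 1-6 (zero-based indices 0-5 in the result)
first_six = [7218, 6833, 7075, 7338, 4650, 7524]
groups = pack_chapters(first_six, budget=20_000)
print(groups)
```

In practice, the budget would also need to leave room for the instructions in the prompt and for the model's response.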
chpt paras words chrs tokens ratio cost
--------------------------------------------------------
1 39 5,547 30,624 7,218 4.24 0.080
2 62 5,305 28,510 6,833 4.17 0.075
3 46 5,571 29,805 7,075 4.21 0.078
4 86 5,703 30,267 7,338 4.12 0.081
5 28 3,390 18,019 4,650 3.88 0.051
6 64 5,299 29,195 7,524 3.88 0.083
7 62 5,424 29,964 7,120 4.21 0.078
8 59 6,044 32,637 7,970 4.09 0.088
9 64 5,709 30,180 7,477 4.04 0.082
10 100 5,623 30,817 7,706 4.00 0.085
11 78 4,754 26,991 7,014 3.85 0.077
12 95 6,993 37,944 9,372 4.05 0.103
13 107 6,242 34,198 8,490 4.03 0.094
14 97 6,053 32,612 8,277 3.94 0.091
15 101 5,485 29,787 7,509 3.97 0.083
16 62 4,381 23,928 5,895 4.06 0.065
17 80 5,264 29,074 7,140 4.07 0.079
18 84 6,615 35,948 8,924 4.03 0.098
19 46 5,505 29,462 7,041 4.18 0.078
20 104 5,467 31,241 7,883 3.96 0.087
21 69 5,905 32,220 7,940 4.06 0.087
22 65 5,249 28,130 6,822 4.12 0.075
23 84 5,403 29,551 7,351 4.02 0.081
24 75 6,057 32,123 7,924 4.05 0.087
25 87 5,907 32,612 8,153 4.00 0.090
26 106 6,818 37,084 9,266 4.00 0.102
27 78 7,733 40,677 10,055 4.05 0.111
---
Raw book text
---
sum paras: 2,040
sum chars: 856,545
sum words: 658,827
sum tokens: 211,209
Chapters
---
chpt count: 27
---
min paras: 28
max paras: 107
mean paras: 75
sum paras: 2,028
---
min words: 3,390
max words: 7,733
mean words: 5,683
sum words: 153,446
---
min chars: 18,019
max chars: 40,677
mean chars: 30,874
sum chars: 833,600
---
min tokens: 4,650
max tokens: 10,055
mean tokens: 7,628
sum tokens: 205,967
---
min chrs/tkn: 3.85
max chrs/tkn: 4.24
mean chrs/tkn: 4.05
---
min cost: $ 0.051
max cost: $ 0.111
mean cost: $ 0.084
sum cost: $ 2.270
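The ratio and cost columns above can be reproduced from the character and token counts. The sketch below assumes an input price of $0.01102 per 1,000 tokens. That rate is not stated in this post; it is an assumption that happens to reproduce the cost column, so treat it as illustrative and check current pricing before relying on it.

```python
# Assumed price in USD per 1,000 input tokens (not stated in the post)
PRICE_PER_1K_INPUT_TOKENS = 0.01102


def chapter_stats(chars: int, tokens: int,
                  price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> dict[str, float]:
    """Return the characters-per-token ratio and estimated input cost for a chapter."""
    return {
        "ratio": round(chars / tokens, 2),
        "cost": round(tokens / 1000 * price_per_1k, 3),
    }


chapter_1 = chapter_stats(chars=30_624, tokens=7_218)
chapter_27 = chapter_stats(chars=40_677, tokens=10_055)
print(chapter_1, chapter_27)
```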
Analysis Techniques
In the following section, we will examine several methods for analyzing the long text of the novel. As we progress through the examples, we will increase the complexity of the prompts, adding techniques to shape the output more precisely.
1. Plot Summarization
A simple summary of summaries is one of the most common examples of text summarization of long texts. In this example, we first loop through each chapter and ask the foundation model to write a concise single-paragraph summary of the main points, events, and ideas contained in the chapter.
First, let's look at a simple prompt that we could pass to the Anthropic Claude model, which should return a summary:
prompt = f"""\n\nHuman: Write a short summary of the following chapter:
{chapter}.\n\nAssistant:"""
The challenge with this basic prompt is that we don't know how long the response will be or how it will be formatted. By adding more instructions to the prompt, we can increase the probability of getting back the exact type of response we expect from the model:
prompt = f"""\n\nHuman: Write a concise, grammatically correct, single-paragraph summary of the following chapter:
{chapter}\n\nAssistant:"""
In addition to adding even more instructions to the prompt, we can also use XML-style tags, such as <chapter></chapter>, which Anthropic recommends for their Claude models to demarcate sections of the prompt better.
prompt = f"""\n\nHuman: Write a concise, grammatically correct, single-paragraph summary of the chapter's main points, events, and ideas contained inside the <chapter></chapter> XML tags below.
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
This prompt is then passed into the following function, which calls the Anthropic Claude 2 foundation model through Amazon Bedrock:
import json
import boto3
from botocore.exceptions import ClientError

def create_bedrock_connection(self):
    client_bedrock = boto3.client(
        "bedrock-runtime",
        "us-east-1"
    )
    return client_bedrock

def create_summary(self, client_bedrock, prompt):
    try:
        body = json.dumps(
            {
                "prompt": prompt,
                "max_tokens_to_sample": self.max_tokens_to_sample,
                "temperature": self.temperature,
                "top_k": self.top_k,
                "top_p": self.top_p,
                "stop_sequences": self.stop_sequences,
            }
        )
        accept = "application/json"
        content_type = "application/json"
        response = client_bedrock.invoke_model(
            body=body,
            modelId=self.model,
            accept=accept,
            contentType=content_type
        )
        response_body = json.loads(response.get("body").read())
        return response_body
    except ClientError as ex:
        self.logger.error(ex)
        exit(1)
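The function above returns the whole parsed response body rather than just the generated text. As a minimal sketch, assuming the body follows the Claude 2 text-completion shape with the generated text under a "completion" key, the text could be pulled out as follows. The extract_completion helper and the sample dictionary are hypothetical and stand in for a real response.

```python
def extract_completion(response_body: dict) -> str:
    """Return the generated text from a parsed response body.

    Assumes the text lives under the "completion" key; returns an
    empty string if the key is missing.
    """
    return response_body.get("completion", "").strip()


# Hand-written stand-in for a parsed response, not actual model output
sample_response_body = {
    "completion": " Jonathan Harker travels to Transylvania.",
    "stop_reason": "stop_sequence",
}
summary = extract_completion(sample_response_body)
print(summary)
```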
Summary of Summaries
Here is an example of the model's summary of Chapter 1 of the novel:
"Jonathan Harker travels by train from Munich to Bistritz, where he stays at the Golden Krone Hotel before departing for the Borgo Pass. He is met by a mysterious driver in a calèche who takes him through the Pass at night. Harker sees strange sights along the way, including blue flames and a ring of wolves surrounding them. Despite feeling afraid, Harker continues on to a ruined castle, arriving just before midnight. The chapter covers Harker's journey to Transylvania and the foreboding events he witnesses that hint at the supernatural horrors that await him at Count Dracula's castle."
Using a nearly identical prompt, we can then pass the individual chapter summaries, stored in a single text file, to the same function to derive a summary of summaries: a summary of the complete novel. Note that we have added directions for the Assistant (Claude) to further ensure the expected output:
prompt = f"""\n\nHuman: Write a concise grammatically correct summary using only the information within the <text></text> XML tags below.
The Assistant will refrain from using bullet-point lists.
<text>
{input_text}
</text>\n\nAssistant:"""
We should receive a response similar to the following based on a summarization of the individual chapter summaries:
"The novel follows the harrowing journey of Jonathan Harker and his companions to defeat Count Dracula after realizing he is an immortal vampire. It begins with Harker's ominous imprisonment in Dracula's remote Transylvanian castle, where he first encounters the supernatural horror. Back in England, his fiancée Mina and her friend Lucy become targets of Dracula's evil, as signs point to his arrival in London. Led by Professor Van Helsing, Harker joins Lucy's suitors and doctors, John Seward and Van Helsing, to share information and hunt Dracula. After Lucy tragically dies and returns as a vampire, the men heroically destroy her. They race to find Dracula's lairs across London and sterilize his boxes of native earth to weaken him. Their pursuit of Dracula leads them back to Transylvania, where they finally defeat him through a violent battle at his castle. But not before a bitten Mina faces her own dark transformation, and Quincey Morris sacrifices his life, striking the final blow to Dracula's heart. United by love and courage, the characters embrace the reality of supernatural evil in the world, finding the strength to conquer it."
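Putting the two steps together, the summary-of-summaries flow can be sketched as a small map-and-reduce loop. To keep the example runnable without AWS credentials, the model call is passed in as a generate callable. Both summarize_book and fake_generate are hypothetical names; in practice, generate would wrap the Bedrock call shown earlier.

```python
from typing import Callable


def summarize_book(chapters: list[str], generate: Callable[[str], str]) -> str:
    """Summarize each chapter (map), then summarize the summaries (reduce)."""
    chapter_summaries = []
    for chapter in chapters:
        map_prompt = (
            "\n\nHuman: Write a concise, grammatically correct, single-paragraph "
            "summary of the chapter's main points, events, and ideas contained "
            "inside the <chapter></chapter> XML tags below.\n"
            f"<chapter>\n{chapter}\n</chapter>\n\nAssistant:"
        )
        chapter_summaries.append(generate(map_prompt).strip())

    input_text = "\n\n".join(chapter_summaries)
    reduce_prompt = (
        "\n\nHuman: Write a concise grammatically correct summary using only "
        "the information within the <text></text> XML tags below.\n"
        "The Assistant will refrain from using bullet-point lists.\n"
        f"<text>\n{input_text}\n</text>\n\nAssistant:"
    )
    return generate(reduce_prompt).strip()


# Stand-in for the model: records each prompt and returns a numbered placeholder
received_prompts: list[str] = []


def fake_generate(prompt: str) -> str:
    received_prompts.append(prompt)
    return f"summary {len(received_prompts)}"


final_summary = summarize_book(["Chapter one text.", "Chapter two text."], fake_generate)
print(final_summary)
```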
We can modify the prompt if we desire a bullet-point list-format summary of each chapter versus the previous paragraph format. Note we are further reinforcing the use of the context contained between the <text></text> tags in the instructions:
prompt = f"""\n\nHuman: Write a bullet-point list of 3-4 main points, events, and ideas using only the information within the <text></text> XML tags below.
<text>
{input_text}
</text>\n\nAssistant:"""
For Chapter 1, the results should look similar to the following:
Here are the main points from the chapter:
2. Character Analysis
In addition to a simple plot summary, we can analyze the context, setting, characters, literary devices, and themes aided by Generative AI. Next, we will focus on the main characters from the novel. To reduce the output size, we have asked the model only to return 3-4 main characters from each chapter. Note how we have added two additional sections, the <template> and <example> sections. These two sections will help reinforce the format of the model's response. In the template, we use the square brackets to indicate placeholders the model will replace (e.g., [Character]: [Description]). Again, we will provide specific instructions to the Assistant to ensure the output is as expected.
prompt = f"""\n\nHuman: Provide a list of the 3-4 main characters and a brief description of each main character, in the chapter contained in the <chapter> tags below.
The Assistant will order the main characters by how many times they are mentioned.
The Assistant will number the characters.
Follow the template contained in the <template> tags below and replace the placeholders with the relevant information:
<template>
[Number]. [Character]: [Description]
</template>
Here is an example contained in the <example> tags below:
<example>
1. Pink Panther: A suave and smooth-talking anthropomorphic animated panther.
</example>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our prompt to the same function as before, we should get back character descriptions similar to the following for each chapter:
Here are the main characters from the chapter:
Using a nearly identical prompting strategy to the previous summary of summaries, we then pass the individual character descriptions to the same function as before to derive an overall character description for Count Dracula based on the individual chapter-level observations:
prompt = f"""\n\nHuman: Write a concise, grammatically correct, single-paragraph description of the main character, Dracula (aka Count Dracula), based on the following individual character descriptions.
The Assistant will refrain from using bullet-point lists.
<summaries>
{character_summaries}
</summaries>\n\nAssistant:"""
We should receive a response similar to the following overall character description based on the individual chapter-level descriptions of Count Dracula:
"Here is a single-paragraph description of Count Dracula:
Count Dracula is an ancient vampire of nobility from the remote mountains of Transylvania who resides alone in a decaying yet magnificent castle. He initially appears charming and welcoming but gradually reveals his true sinister and threatening nature. With his gaunt figure, waxen skin, high aquiline nose, pointed ears, sharp teeth, and burning red eyes, Dracula exhibits a mysterious magnetism and power over humans, able to control animals and the weather. He attacks the innocent Jonathan Harker, later turning Lucy Westenra into a vampire. Persistent with his vampiric evil, Dracula possesses supernatural abilities and cunning intelligence, escaping capture multiple times. Though eloquent and refined on the surface, at his core, Dracula is a ruthless predator who spreads his undead curse, requiring the combined efforts of Dr. Van Helsing and his allies to defeat him finally."
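One way to assemble the per-chapter descriptions of a single character for the prompt above is to parse each numbered response and keep only the matching entries. The sketch below assumes the model followed the "[Number]. [Character]: [Description]" template. The function names and the sample responses are hypothetical and written by hand for illustration.

```python
import re

# Matches lines such as "1. Pink Panther: A suave panther."
CHARACTER_LINE = re.compile(r"^\s*(\d+)\.\s*([^:]+):\s*(.+)$")


def parse_characters(response_text: str) -> list[tuple[str, str]]:
    """Return (character, description) pairs from a numbered-list response."""
    characters = []
    for line in response_text.splitlines():
        match = CHARACTER_LINE.match(line)
        if match:
            characters.append((match.group(2).strip(), match.group(3).strip()))
    return characters


def collect_descriptions(responses: list[str], name: str) -> str:
    """Join every description whose character name contains the given name."""
    descriptions = [
        description
        for response in responses
        for character, description in parse_characters(response)
        if name.lower() in character.lower()
    ]
    return "\n".join(descriptions)


# Hand-written sample responses, not actual model output
sample_responses = [
    "Here are the main characters from the chapter:\n"
    "1. Pink Panther: A suave and smooth-talking anthropomorphic animated panther.\n"
    "2. Inspector: A determined detective.",
    "1. The Pink Panther: Calm under pressure.",
]
character_summaries = collect_descriptions(sample_responses, "Pink Panther")
print(character_summaries)
```

Lines that do not match the template, such as the model's introductory sentence, are simply skipped.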
3. Character Types
To quote Essentials in Writing, without characters, there would be no plot or conflicts. Readers should take note of the author's methods of character development, the characters' traits and actions, and how the characters change (or don't change) throughout the story. There are several typical character types used throughout fictional literature, including the Protagonist, Antihero, Antagonist, Guide, Contagonist, Sidekicks (Deuteragonist), Henchmen, Love Interest, Temptress, Confidant, and Foil.
Here is an example prompt to extract those character types from each chapter. Note how we use the <character_types> section to define a list of character types to be considered.
prompt = f"""\n\nHuman: The following list of character types, contained in the <character_types> tags below, are often found in fictional literature:
<character_types>
- Protagonist
- Antihero
- Antagonist
- Guide
- Contagonist
- Sidekicks (Deuteragonist)
- Henchmen
- Love Interest
- Temptress
- Confidant
- Foil
</character_types>
Based on this list of character types, give 3-4 examples of character types found in the chapter in the <chapter> tags below, including the character name and an explanation of why.
The Assistant will use a bullet-point list.
The Assistant will not use square brackets in its response.
Follow the template contained in the <template> tags below and replace the placeholders with the character name, character type, and explanation:
<template>
- [Character_Name] - [Character_Type]: [Explanation]
</template>
Here is an example contained in the <example> tags below:
<example>
- Minnie Mouse - Love Interest: Mickey Mouse's lifelong romantic interest.
</example>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our prompt to the same function as before, we should get back a list of character types similar to the following for each chapter:
Here are some examples of character types from the chapter:
4. Literary Devices
According to Literary-Devices.com, literary devices are techniques and structures writers use to convey a message, create an effect, and evoke emotions in the reader. These devices can emphasize a literary work, make it more interesting, or convey a deeper meaning. Standard literary devices include Metaphor, Simile, Alliteration, Personification, and Symbolism.
We will use the following prompt to extract literary devices from the chapters and generate explanations for each device. We have provided a list of literary devices to the model in the <literary_devices> tags. Again, we have included additional instructions to the Assistant to ensure the output is as expected.
prompt = f"""\n\nHuman: The following list of literary devices inside the <literary_devices></literary_devices> XML tags below, are often found in fictional literature:
<literary_devices>
Allegory, Alliteration, Allusion, Amplification, Anagram,
Analogy, Anthropomorphism, Antithesis, Chiasmus, Colloquialism,
Circumlocution, Epigraph, Euphemism, Foreshadowing, Hyperbole,
Imagery, Metaphor, Mood, Motif, Onomatopoeia, Oxymoron, Paradox,
Personification, Portmanteau, Puns, Satire, Simile, Symbolism, Tone
</literary_devices>
Based on the list of literary devices, give 2-3 examples of literary devices found inside the <chapter></chapter> XML tags below, and explain why.
The Assistant will use a bullet-point list.
The Assistant will refrain from including square brackets and XML tags in the response.
Follow the template inside the <template></template> XML tags below for your response. Replace the placeholders, in square brackets, with the literary device and the explanation:
<template>
- [Literary_Device]: [Explanation]
</template>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our prompt to the same function again, we should get back a list of literary devices similar to the following for each chapter. Note how the model has incorporated quotes from the chapter into its explanation:
Here are some examples of literary devices found in the provided chapter:
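Once each chapter has produced a list in the "- [Literary_Device]: [Explanation]" format, the results can be tallied across chapters to see which devices the model identifies most often. This is a rough sketch; count_devices and the sample responses are hypothetical and written by hand for illustration.

```python
import re
from collections import Counter

# Matches lines such as "- Imagery: Vivid description of the landscape."
DEVICE_LINE = re.compile(r"^\s*-\s*([^:]+):\s*(.+)$")


def count_devices(chapter_responses: list[str]) -> Counter:
    """Count how often each literary device appears across chapter responses."""
    counts: Counter = Counter()
    for response in chapter_responses:
        for line in response.splitlines():
            match = DEVICE_LINE.match(line)
            if match:
                # Normalize capitalization so "imagery" and "Imagery" are counted together
                counts[match.group(1).strip().title()] += 1
    return counts


# Hand-written sample responses, not actual model output
sample_device_responses = [
    "- Imagery: Vivid description of the landscape.\n- Foreshadowing: Hints of danger.",
    "- imagery: Detailed sensory language.\n- Simile: A comparison using 'like'.",
]
device_counts = count_devices(sample_device_responses)
print(device_counts.most_common())
```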
5. Settings
According to Britannica, setting, in literature, is the location and time frame in which the action of a narrative takes place. The makeup and behavior of fictional characters often depend on their environment just as much as on their personal characteristics.
We will use the following prompt to generate a numbered list of no more than three settings from each chapter, in order according to how often they are mentioned in the chapter. Again, we use the <template> and <example> tags, along with specific instructions to the Assistant to ensure the output is as expected.
prompt = f"""\n\nHuman: Provide a list of no more than three settings and a brief description of each setting inside the <chapter></chapter> XML tags below.
The Assistant will order the settings by how many times they are mentioned in the chapter.
The Assistant will number the list of settings.
The Assistant will refrain from including square brackets and XML tags in the response.
Follow the template inside the <template></template> XML tags below and replace the placeholders, in square brackets, with the relevant information:
<template>
[Number]. [Setting]: [Description]
</template>
Here is an example inside the <example></example> XML tags below:
<example>
1. Hoboken, New Jersey: Part of the New York metropolitan area on the banks of the Hudson River across from lower Manhattan, where the story takes place.
</example>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our prompt to the same function again, we should get back a numbered list of settings similar to the following for each chapter:
Here are the top 3 settings mentioned in the chapter, with brief descriptions:
6. Questions and Answers
In addition to analysis, we can leverage the large language model to produce questions and answers based on the content of each chapter. The questions could be used to develop training and evaluation material or converted to dense vector embeddings and used for semantic similarity searches. Note the requested question-and-answer format of the <template> section in this prompt. This is reinforced by providing an example in the <example> section. Lastly, we have added even more instructions to the "Assistant" to ensure the output is as expected.
prompt = f"""\n\nHuman: Generate a list of 6 questions that the following chapter, inside the <chapter></chapter> XML tags below, could be used to answer.
The Assistant will provide both the question and the answer.
The Assistant will refrain from asking overly broad questions.
The Assistant will refrain from using bullet-point lists.
The Assistant will refrain from including square brackets and XML tags in the response.
Follow the template inside the <template></template> XML tags below and replace the placeholders, in square brackets, with the relevant information:
<template>
Q: [Question]
A: [Answer]
</template>
Here is an example inside the <example></example> XML tags below:
<example>
Q: What is the weather like in Spain?
A: The rain in Spain stays mainly in the plain.
</example>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our prompt to the same function as before, we should get back a list of questions and answers similar to the following for each chapter:
Here are three questions that could be answered by the provided chapter:
Q: What happened when Jonathan Harker woke up back in his room after his encounter with Dracula?
A: He was confused about how he got back to his room, indicating Dracula must have carried him there. His clothes were folded neatly, which was unlike him, and his watch was unwound, suggesting Dracula hurriedly undressed him.
Q: How does Jonathan plan to try to escape from Castle Dracula?
A: He decides his only chance is to climb down the castle wall, sneak into Dracula's room through the window, and search for the key that will let him out of the castle.
Q: What does Jonathan discover when he finds Dracula's body in a box in the underground crypt?
A: He finds Dracula appearing partially rejuvenated, with dark hair instead of white, and a flushed, blood-stained face. This horrifies Jonathan, who sees Dracula as a monster. In a rage, Jonathan tries to kill Dracula but fails before fleeing the crypt.
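To use the generated questions and answers downstream, for example as evaluation material, the "Q:" and "A:" lines can be paired up in code. This is a minimal sketch that assumes each question and each answer sits on a single line, as in the output above. The parse_qa function is a hypothetical name, and the sample text reuses the example from the prompt.

```python
def parse_qa(response_text: str) -> list[dict[str, str]]:
    """Pair each 'Q:' line with the 'A:' line that follows it."""
    pairs = []
    question = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append({"question": question, "answer": line[2:].strip()})
            question = None  # wait for the next question
    return pairs


sample_qa_text = (
    "Here are three questions that could be answered by the provided chapter:\n"
    "Q: What is the weather like in Spain?\n"
    "A: The rain in Spain stays mainly in the plain.\n"
)
qa_pairs = parse_qa(sample_qa_text)
print(qa_pairs)
```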
By modifying the question-and-answer format prompt above, we can generate multiple-choice questions. Note the increased complexity of the <template> section in this prompt. We can indicate precisely how many choices we want to generate.
prompt = f"""\n\nHuman: Generate a list of 6 multiple-choice questions that the following chapter, inside the <chapter></chapter> XML tags below, could be used to answer.
The Assistant will provide the question, four possible answers lettered a,b,c, and d, and the correct answer.
The Assistant will ask brief, specific questions.
The Assistant will refrain from using bullet-point lists.
The Assistant will refrain from including square brackets and XML tags in the response.
Follow the template inside the <template></template> XML tags below and replace the placeholders, in square brackets, with the relevant information:
<template>
Q: [Question]
(a) [Choice_1]
(b) [Choice_2]
(c) [Choice_3]
(d) [Choice_4]
A: ([Letter]) [Correct_Answer]
</template>
Here is an example inside the <example></example> XML tags below:
<example>
Q: What color is fresh grass?
(a) Red
(b) Blue
(c) Green
(d) Yellow
A: (c) Green
</example>
<chapter>
{chapter}
</chapter>\n\nAssistant:"""
Passing our modified prompt to the same function as before, we should get back multiple-choice questions similar to the following for each chapter:
Here are three multiple-choice questions based on the chapter:
Q: What was the name of the hotel where Jonathan Harker stayed in Bistritz?
(a) The Golden Lion
(b) The Silver Stag
(c) The Golden Krone
(d) The Red Bear
A: (c) The Golden Krone
Q: What did Jonathan order for dinner at the hotel that was described as a national dish?
(a) Goulash
(b) Schnitzel
(c) Paprika Hendl
(d) Sauerbraten
A: (c) Paprika Hendl
Q: What did the old lady at the hotel give Jonathan as a charm against evil before he departed?
(a) A four-leaf clover
(b) A crucifix
(c) A rabbit's foot
(d) An amulet
A: (b) A crucifix
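The multiple-choice output can be turned into structured records in a similar way, which makes it easier to load into a quiz or an evaluation harness. The sketch below assumes the model followed the template, with one choice per line and the answer line starting with the letter in parentheses. The parse_multiple_choice function is a hypothetical name, and the sample text reuses the example from the prompt.

```python
import re

CHOICE_LINE = re.compile(r"^\(([a-d])\)\s*(.+)$")
ANSWER_LINE = re.compile(r"^A:\s*\(([a-d])\)")


def parse_multiple_choice(response_text: str) -> list[dict]:
    """Parse 'Q:' lines, lettered choices, and 'A:' lines into dictionaries."""
    questions = []
    current = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            current = {"question": line[2:].strip(), "choices": {}, "answer": None}
            questions.append(current)
        elif current is not None and line.startswith("A:"):
            match = ANSWER_LINE.match(line)
            if match:
                current["answer"] = match.group(1)
        elif current is not None:
            match = CHOICE_LINE.match(line)
            if match:
                current["choices"][match.group(1)] = match.group(2).strip()
    return questions


sample_mc_text = (
    "Q: What color is fresh grass?\n"
    "(a) Red\n"
    "(b) Blue\n"
    "(c) Green\n"
    "(d) Yellow\n"
    "A: (c) Green\n"
)
mc_questions = parse_multiple_choice(sample_mc_text)
print(mc_questions)
```

A simple sanity check on each record, such as confirming that the answer letter is one of the parsed choices, helps catch responses that drifted from the template.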
Conclusion
In this post, we learned how to move beyond simple long text summarization and discovered how to analyze different aspects of the document. Further, we built progressively refined prompts, which help precisely shape the output from the Anthropic Claude 2 foundation model, accessed through Amazon Bedrock.
This blog represents my viewpoints and not those of my employer, Amazon Web Services (AWS). All product names, logos, and brands are the property of their respective owners.