Example Use Case: 10K Processing
Type of Use Case: Summarize
Management discussion and analysis (MD&A) is a section of a public company’s annual report or quarterly filing. The MD&A addresses the company’s performance. In this section, the company’s management and executives, also known as the C-suite, present an analysis of the company’s performance with qualitative and quantitative measures. source
Create a Summarization Pipeline
Full script here
import os
import re
from pathlib import Path


def main(
    cache_dir: Path = Path("/projects/kellogg/.cache"),
    input_dir: Path = Path("/kellogg/data/EDGAR/10-K/2023"),
    output_file: Path = Path("/projects/kellogg/output/annual_report_output.csv"),
    model_checkpoint: str = "Falconsai/text_summarization",
    num_files: int = 10,
):
    # validate input parameters
    assert cache_dir.exists() and cache_dir.is_dir()
    assert input_dir.exists() and input_dir.is_dir()
    assert num_files > 0
    output_file.touch(exist_ok=True)

    # set the huggingface model directory; the libraries are imported
    # afterwards so that they pick up this setting
    os.environ["HF_HOME"] = str(cache_dir)
    from datasets import Dataset
    from transformers import pipeline

    # get listing of 10K files (sort before slicing so the selection is reproducible)
    files = sorted(input_dir.glob("*.txt"))[:num_files]

    # load and clean text, extract the MD&A section
    data_dict = {"doc": [], "text": []}
    for f in files:
        print(f"loading: {f.name}")
        mda_text = extract_mda(clean_html(f.read_text()))
        if mda_text is None:
            continue
        data_dict["doc"].append(f.name)
        data_dict["text"].append(mda_text)

    # create a dataset object
    dataset_10k = Dataset.from_dict(data_dict)
    print(f"created dataset: {dataset_10k}")

    # apply summarization pipeline to dataset
    summarizer = pipeline("summarization", model=model_checkpoint)
    dataset_10k = dataset_10k.map(
        lambda batch: {
            # keep only the summary string from each pipeline result
            "summary": [
                result["summary_text"]
                for result in summarizer(
                    batch["text"],
                    max_length=50,
                    min_length=30,
                    do_sample=False,
                    truncation=True,
                )
            ]
        },
        batched=True,
    )

    # output to file
    dataset_10k.to_csv(output_file)

def clean_html(html):
    # First we remove inline JavaScript/CSS:
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    # Then we remove html comments. This has to be done before removing regular
    # tags since comments can contain '>' characters.
    cleaned = re.sub(r"(?s)<!--(.*?)-->[\n]?", "", cleaned)
    # Next we can remove the remaining tags:
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)
    # Finally, we deal with whitespace: replace non-breaking space entities,
    # then collapse runs of whitespace into a single space
    cleaned = re.sub(r"&nbsp;", " ", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned.strip()

def extract_mda(text):
    mda_text = None
    # find the second occurrence of "Discussion and Analysis of Financial Condition",
    # allowing whitespace or punctuation between the words
    pattern = r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of[\s,.-]*Financial[\s,.-]*Condition"
    mda_matches = list(re.finditer(pattern, text, re.IGNORECASE))
    if len(mda_matches) >= 2:
        m = mda_matches[1]
        mda_text = text[m.end():]
        # keep only the first 250 words after the heading
        return " ".join(mda_text.split()[:250])
    return mda_text
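
To see how the heading search behaves, here is a minimal sketch on a made-up HTML fragment. The sample text and the `strip_tags` and `second_match_tail` helpers are hypothetical and written only for illustration. `strip_tags` is a simplified stand-in for `clean_html`, and `second_match_tail` mirrors the second-occurrence logic of `extract_mda` with a configurable word limit. The sample assumes that the first match is a table-of-contents entry and the second is the actual section heading, which is a plausible reason for skipping the first match but is not stated in the script.

```python
import re

# Same pattern as in extract_mda.
MDA_PATTERN = (
    r"Discussion[\s,.-]*and[\s,.-]*Analysis[\s,.-]*of[\s,.-]*Financial[\s,.-]*Condition"
)


def strip_tags(html: str) -> str:
    """Hypothetical, simplified stand-in for clean_html: drop tags, collapse whitespace."""
    text = re.sub(r"(?s)<.*?>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def second_match_tail(text: str, max_words: int = 250):
    """Return up to max_words words that follow the second heading match, or None."""
    matches = list(re.finditer(MDA_PATTERN, text, re.IGNORECASE))
    if len(matches) < 2:
        return None
    return " ".join(text[matches[1].end():].split()[:max_words])


# Made-up filing fragment: the heading appears once in a table of contents
# and once as the section heading itself.
sample_html = """
<html><body>
<p>Item 7. Management's Discussion and Analysis of Financial Condition ..... 31</p>
<h2>ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION</h2>
<p>Revenue increased during the year.</p>
</body></html>
"""

excerpt = second_match_tail(strip_tags(sample_html), max_words=4)
print(excerpt)  # prints: Revenue increased during the
```

A document that mentions the heading fewer than two times yields `None`, which is why the main loop skips files for which no MD&A text is returned.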
Note
Screen video here of executing script on Quest GPU node with Slurm script.
Original text snippet:
In addition, the spread of COVID-19 has caused us to modify our business practices (including restricting employee travel, developing social distancing plans for our employees and cancelling physical participation in meetings, events and conferences), and we may take further actions as may be required by government authorities or as we determine is in the best interests of our employees, partners and customers. The outbreak has adversely impacted and may further adversely impact our workforce and operations and the operations of our partners, customers, suppliers and third-party vendors, throughout the time period during which the spread of COVID-19 continues and related restrictions remain in place, and even after the COVID-19 outbreak has subsided. Even after the COVID-19 outbreak has subsided and despite the formal declaration of the end of the COVID-19 global health emergency by the World Health Organization in May 2023, our business may continue to experience materially adverse impacts as a result of the virus’s economic impact, including the availability and cost of funding and any recession that has occurred or may occur in the future. There are no comparable recent events that provide guidance as to the effect COVID-19 as a global pandemic may have, and, as a result, the ultimate impact of the outbreak is highly uncertain and subject to change. Additionally, many of the other risk factors described below are heightened by the effects of the COVID-19 pandemic and related economic conditions, which in turn could materially adversely affect…
Summary:
the spread of COVID-19 has caused us to modify our business practices . The outbreak has adversely impacted and may further adversely impact our workforce and operations and the operations of our partners, customers, suppliers and third-party vendors
