Text Analysis
Research Idea
Identify LLM researchers, which sub-topics they are working on, and whether there is geographical specialization.
Who are the researchers?
Where are they located?
What topics do they work on? (future work)
Identify a dataset and specify precisely what you want to do with it
Engineer a prompt (use OpenAI Playground)
Evaluate performance on a sample
If performance is unacceptable, try further prompt engineering, functions, etc.
If performance is still not good enough, try fine-tuning
Deploy at scale
Gather Data
Promising dataset: arXiv
Almost all LLM papers of note are posted here
PDF files are freely available (~2 million covering STEM fields)
A limited amount of metadata is also available
Downloaded ~20K recent papers and filtered for those containing the phrase "Large Language Model" in the abstract. The resulting dataset: ~1.7K PDFs plus metadata. Extracted the first page of text from each PDF, as sketched below.
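A minimal sketch of that filtering and first-page extraction step, assuming the raw metadata is already in a DataFrame and the PDFs are on disk. The file paths are illustrative, and pypdf is just one option for reading page 1; the original pipeline may have used something else:

import pandas as pd
from pypdf import PdfReader

meta = pd.read_parquet('data/arxiv_metadata_raw.parquet')  # hypothetical raw metadata dump
# Keep papers whose abstract mentions the target phrase (case-insensitive here).
llm = meta[meta['abstract'].str.contains('large language model', case=False, na=False)].copy()

def first_page_text(pdf_path: str) -> str:
    """Extract the text of page 1, where title, authors, and affiliations appear."""
    return PdfReader(pdf_path).pages[0].extract_text()

llm['text'] = llm['src_pdf'].apply(lambda f: first_page_text(f'data/pdfs/{f}'))
llm.to_parquet('data/arxiv_metadata.parquet')  # the file loaded below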
import pandas as pd
df = pd.read_parquet('data/arxiv_metadata.parquet')
df.head()
| | id | submitter | authors | title | comments | journal-ref | doi | report-no | categories | license | abstract | update_date | src_pdf | text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2310.00014 | Yong Ren | Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhu... | Fewer-token Neural Speech Codec with Time-inva... | Submitted to ICASSP 2024 | None | None | None | cs.SD eess.AS | http://arxiv.org/licenses/nonexclusive-distrib... | Language model based text-to-speech (TTS) mo... | 2023-10-03 | 2310.00014v1.pdf | FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVA... |
| 1 | 2310.00031 | Markus Marks | Neehar Kondapaneni, Markus Marks, Manuel Knott... | Text-image Alignment for Diffusion-based Perce... | Project page: https://www.vision.caltech.edu/t... | None | None | None | cs.CV | http://arxiv.org/licenses/nonexclusive-distrib... | Diffusion models are generative models with ... | 2023-10-06 | 2310.00031v1.pdf | Text-image Alignment for Diffusion-based Perce... |
| 2 | 2310.00032 | Qinghua Xu | Qinghua Xu, Tao Yue, Shaukat Ali and Maite Arr... | Pretrain, Prompt, and Transfer: Evolving Digit... | None | None | None | None | cs.SE | http://creativecommons.org/licenses/by/4.0/ | Cyber-Physical Systems (CPSs), e.g., elevato... | 2023-10-06 | 2310.00032v2.pdf | PRETRAIN, PROMPT, AND TRANSFER: EVOLVING DIGIT... |
| 3 | 2310.00034 | Yuzhang Shang | Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong | PB-LLM: Partially Binarized Large Language Models | Frist work using network binarization for larg... | None | None | None | cs.LG cs.AI cs.CL | http://arxiv.org/licenses/nonexclusive-distrib... | This paper explores network binarization, a ... | 2023-10-03 | 2310.00034v1.pdf | PB-LLM: PARTIALLY BINARIZED LARGE LANGUAGE\nMO... |
| 4 | 2310.00035 | Xi Wang | Xi Wang, Laurence Aitchison, Maja Rudolph | LoRA ensembles for large language model fine-t... | Update the title in the PDF file | None | None | None | cs.LG cs.AI | http://creativecommons.org/licenses/by/4.0/ | Finetuned LLMs often exhibit poor uncertaint... | 2023-10-06 | 2310.00035v1.pdf | Preprint. Under review\nENSEMBLE OF LOW-RANK A... |
Specify what information to extract
title
list of author names
each author’s email address
each author’s affiliation
each affiliation’s location in terms of latitude and longitude
{ "title": "The paper's title",
"authors": [
{
"name": "author's name",
"email": "name@domain.edu",
"affiliations": [ "list of indices" ]
}
],
"affiliations": [
{"index": "the index",
"name": "The affiliation name",
"longitude": "the longitude",
"latitude": "the latitude"
}
]
}
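One way to enforce this shape programmatically is the jsonschema package; here is a sketch, assuming the model's reply has been parsed into a dict named parsed_response (both the package choice and that variable name are mine, not from the original; the assert-based validator later in this section is another option):

from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "required": ["title", "authors", "affiliations"],
    "properties": {
        "title": {"type": "string"},
        "authors": {
            "type": "array",
            "items": {"type": "object", "required": ["name", "email", "affiliations"]},
        },
        "affiliations": {
            "type": "array",
            "items": {"type": "object", "required": ["index", "name", "longitude", "latitude"]},
        },
    },
}
# validate(instance=parsed_response, schema=schema)  # raises ValidationError on mismatch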
Develop a Prompt
There are various prompting strategies you can use to improve performance. OpenAI has a very good guide to help you out. They also provide lots of examples to look at.
Here are a couple of my tries:
import os
import openai
from dotenv import load_dotenv
# load the .env file containing your API key
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4-1106-preview",
messages=[
{
"role": "system",
"content": "You are an expert research librarian. You are precise and can analyze the structure of papers very well. You return information in json format."
},
{
"role": "user",
"content": "Extract the title and authors and affiliations from the first page of a scientific paper. \n\nUse the following step-by-step instructions to respond to user inputs.\n\nExtract the title and authors from the first page of a scientific paper. The paper text will snipped will be delimited by triple quotes. Geolocate each author affiliation with latitude and longitude.\n\nThe output should have the following format:\n\n{ \"title\": \"The paper's title\",\n \"authors\": [\n {\n \"name\": \"Yong Ren\",\n \"email\": null,\n \"affiliations\": [ \"list of indices\" ]\n }\n ],\n \"affiliations\": [ {\"index\": \"the index\", \"name\": \"The affiliation name\", \"longitude\": \"the longitude\", \"latitude\": \"the latitude\" } ]\n ]\n}\n\n\"\"\"\nFEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES\nYong Ren1,2, Tao Wang1, Jiangyan Yi1, Le Xu1,2, Jianhua Tao3, Chuyuan Zhang1,2, Junzuo Zhou1,2\n1Institute of Automation, Chinese Academy of Sciences, China\n2University of Chinese Academy of Sciences, China\n3Department of Automation, Tsinghua University, China\nABSTRACT\nLanguage model based text-to-speech (TTS) models, like VALL-E,\nhave gained attention for their outstanding in-context learning capa-\nbility in zero-shot scenarios. Neural speech codec is a critical com-\nponent of these models, which can convert speech into discrete token\nrepresentations. However, excessive token sequences from the codec\nmay negatively affect prediction accuracy and restrict the progres-\nsion of Language model based TTS models. To address this issue,\nthis paper proposes a novel neural speech codec with time-invariant\ncodes named TiCodec. By encoding and quantizing time-invariant\ninformation into a separate code, TiCodec can reduce the amount of\nframe-level information that needs encoding, effectively decreasing\nthe number of tokens as codes of speech. Furthermore, this paper\nintroduces a time-invariant encoding consistency loss to enhance the\nconsistency of time-invariant code within an utterance and force it\nto capture more global information, which can benefit the zero-shot\nTTS task. Experimental results demonstrate that TiCodec can not\nonly enhance the quality of reconstruction speech with fewer tokens\nbut also increase the similarity and naturalness, as well as reduce the\nword error rate of the synthesized speech by the TTS model.\nIndex Terms— speech codec, fewer tokens, time-invariant, lan-\nguage model, text-to-speech\n\"\"\"\n "
}
],
response_format={"type": "json_object"},
temperature=0,
max_tokens=2048,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
seed=42,
)
import json
data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=4))
{
"title": "FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES",
"authors": [
{
"name": "Yong Ren",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Tao Wang",
"email": null,
"affiliations": [
1
]
},
{
"name": "Jiangyan Yi",
"email": null,
"affiliations": [
1
]
},
{
"name": "Le Xu",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Jianhua Tao",
"email": null,
"affiliations": [
3
]
},
{
"name": "Chuyuan Zhang",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Junzuo Zhou",
"email": null,
"affiliations": [
1,
2
]
}
],
"affiliations": [
{
"index": 1,
"name": "Institute of Automation, Chinese Academy of Sciences, China",
"longitude": "116.3975",
"latitude": "39.9085"
},
{
"index": 2,
"name": "University of Chinese Academy of Sciences, China",
"longitude": "116.6514",
"latitude": "40.1211"
},
{
"index": 3,
"name": "Department of Automation, Tsinghua University, China",
"longitude": "116.326",
"latitude": "40.0036"
}
]
}
Here is the output from another try, saved as a string for later use. Note that the coordinates differ slightly from the run above: the geo-coding is not perfectly reproducible, even with temperature=0 and a fixed seed.
output = """
{
"title": "FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES",
"authors": [
{
"name": "Yong Ren",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Tao Wang",
"email": null,
"affiliations": [
1
]
},
{
"name": "Jiangyan Yi",
"email": null,
"affiliations": [
1
]
},
{
"name": "Le Xu",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Jianhua Tao",
"email": null,
"affiliations": [
3
]
},
{
"name": "Chuyuan Zhang",
"email": null,
"affiliations": [
1,
2
]
},
{
"name": "Junzuo Zhou",
"email": null,
"affiliations": [
1,
2
]
}
],
"affiliations": [
{
"index": 1,
"name": "Institute of Automation, Chinese Academy of Sciences, China",
"longitude": "116.331398",
"latitude": "39.897445"
},
{
"index": 2,
"name": "University of Chinese Academy of Sciences, China",
"longitude": "116.651381",
"latitude": "40.12114"
},
{
"index": 3,
"name": "Department of Automation, Tsinghua University, China",
"longitude": "116.326443",
"latitude": "40.00368"
}
]
}
"""
Evaluate on a sample
To evaluate the performance of GPT-4 on your dataset, you need some way of externally validating it. At least a portion of your data must be labelled with the correct (or at least likely correct) output. This is called a gold standard.
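Here the authors field from the arXiv metadata serves as the gold standard for author names. A quick look at what the naive comma split used in the evaluation loop below yields for the first paper:

# Gold author names for the first paper, split on commas as in the evaluation loop.
gold_example = {a.strip() for a in df.iloc[0]['authors'].split(',')}
print(gold_example)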
user_prompt_instructions = """
Extract the title and authors and affiliations from the first page of a scientific paper.
Use the following step-by-step instructions to respond to user inputs.
Extract the title and authors from the first page of a scientific paper. The paper text snippet will be delimited by triple quotes. Geolocate each author affiliation with latitude and longitude.
The output should have the following format:
{ "title": "The paper's title",
"authors": [
{
"name": "Yong Ren",
"email": null,
"affiliations": [ "list of indices" ]
}
],
"affiliations": [ {"index": "the index", "name": "The affiliation name", "longitude": "the longitude", "latitude": "the latitude" } ]
}
"""
import json
from typing import Dict

import openai
def validate_response_data(data: Dict):
    """Assert that the model's JSON contains every field the schema requires."""
assert "title" in data, "title not found"
assert "authors" in data, "authors not found"
for auth in data['authors']:
assert "name" in auth, "name not found"
assert "email" in auth, "email not found"
assert "affiliations" in auth, "affiliations not found"
assert "affiliations" in data, "affiliations not found"
for aff in data['affiliations']:
assert "index" in aff, "index not found"
assert "name" in aff, "name not found"
assert "longitude" in aff, "longitude not found"
assert "latitude" in aff, "latitude not found"
def analyze_text(client: openai.Client, text: str) -> str:
    """Send one paper's first-page text to the model and return its JSON reply."""
response = client.chat.completions.create(
model="gpt-4-1106-preview",
messages=[
{
"role": "system",
"content": "You are an expert research librarian. You are precise and can analyze the structure of papers very well. You return information in json format."
},
{
"role": "user",
"content": user_prompt_instructions + '\n\n"""' + text + '\n\n"""'
}
],
response_format={"type": "json_object"},
temperature=0,
max_tokens=2048,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
seed=42,
)
try:
data = json.loads(response.choices[0].message.content)
print(data)
validate_response_data(data)
return json.dumps(data)
except Exception as e:
print(e)
return str(e)
client = openai.Client()
df_sample = df.sample(100, random_state=42)
df_sample['extracted_info'] = df_sample['text'].apply(lambda x: analyze_text(client, x))
df_sample.to_parquet("sample_output.parquet")
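The evaluation below reads ./data/extracted_data.parquet, a per-author table with author, affiliation, latitude, and longitude columns, whereas the cell above saved one JSON string per paper. That flattening step is not shown in the original; here is a sketch of one way it could work (the helper name and output path are my assumptions):

import json
import pandas as pd

def flatten_extraction(paper) -> list:
    """Explode one paper's extracted JSON into one record per author/affiliation pair."""
    try:
        data = json.loads(paper['extracted_info'])
    except json.JSONDecodeError:  # analyze_text returned an error string instead of JSON
        return []
    affs = {str(a['index']): a for a in data.get('affiliations', [])}
    return [
        {
            'id': paper['id'],
            'author': author.get('name'),
            'affiliation': affs.get(str(idx), {}).get('name'),
            'latitude': affs.get(str(idx), {}).get('latitude'),
            'longitude': affs.get(str(idx), {}).get('longitude'),
        }
        for author in data.get('authors', [])
        for idx in author.get('affiliations', [])
    ]

records = [r for _, paper in df_sample.iterrows() for r in flatten_extraction(paper)]
pd.DataFrame(records).to_parquet('./data/extracted_data.parquet')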
import pandas as pd
df_gold = pd.read_parquet('./data/arxiv_metadata.parquet')       # arXiv metadata = gold labels
df_extracted = pd.read_parquet('./data/extracted_data.parquet')  # per-author extraction results
true_positives = []
false_positives = []
false_negatives = []
for id in df_extracted['id'].unique():  # one pass per paper, not per author row
gold_authors = list(df_gold[df_gold['id'] == id]['authors'])[0]
gold_authors = {a.strip() for a in gold_authors.split(',')}
predicted = df_extracted[df_extracted['id'] == id]
predicted_authors = set(predicted['author'])
for author in predicted_authors:
if author in gold_authors:
true_positives.append((id, author))
else:
false_positives.append((id, author))
for author in gold_authors:
if author not in predicted_authors:
false_negatives.append((id, author))
# round precision to 2 decimal places
precision = round(len(true_positives) / (len(true_positives) + len(false_positives)), 2)
# round recall to 2 decimal places
recall = round(len(true_positives) / (len(true_positives) + len(false_negatives)), 2)
print(f"true_positives count: {len(true_positives)}")
print(f"false_positives count: {len(false_positives)}")
print(f"false_negatives count: {len(false_negatives)}")
print("precision:", precision)
print("recall:", recall)
true_positives count: 3884
false_positives count: 364
false_negatives count: 316
precision: 0.91
recall: 0.92
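If you want a single number to track across prompt revisions, precision and recall combine into the F1 score (their harmonic mean), computed here directly from the raw counts:

tp, fp, fn = len(true_positives), len(false_positives), len(false_negatives)
f1 = round(2 * tp / (2 * tp + fp + fn), 2)
print("f1:", f1)  # 0.92 for the counts above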
fp_sample = {fp[1] for fp in false_positives if fp[0] == '2310.08102'}
fn_sample = {fn[1] for fn in false_negatives if fn[0] == '2310.08102'}
print(f"False positives for id 2310.08102\n")
for fp in fp_sample:
print(f" {fp}")
print(f"\nFalse negatives for id 2310.08102\n")
for fn in fn_sample:
print(f" {fn}")
False positives for id 2310.08102
Alham Fikri Aji
Ayu Purwarianti
Muhammad Razif Rizqullah
False negatives for id 2310.08102
Ayu Purwarianti (1) and Alham Fikri Aji
(2) ((1) Bandung Institute of Technology
(2) Mohamed bin Zayed University of
Artificial Intelligence)
Muhammad Razif Rizqullah (1)
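Inspecting these errors shows the gold labels themselves are partly at fault: for this paper, arXiv's authors field embeds affiliation markers like "(1)" and a trailing parenthetical legend, so the naive comma split yields mangled names. A sketch of one possible cleanup before comparison (the regexes are mine, not part of the original pipeline, and author strings vary enough that this will not catch every case):

import re

def normalize_gold_authors(raw: str) -> set:
    """Split an arXiv authors string into a set of clean names."""
    raw = re.sub(r'\(\(.*\)\s*$', ' ', raw, flags=re.DOTALL)  # drop trailing "((1) Institution ...)" legend
    raw = re.sub(r'\(\d+\)', ' ', raw)                        # drop per-author markers like "(1)"
    return {p.strip() for p in re.split(r',|\band\b', raw) if p.strip()}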
Other Forms of Evaluation
Now let’s map these geo-coordinates as a sanity check on how well GPT-4 did the geo-coding. This is not a substitute for a quantitative analysis, but it gives us more confidence if the result looks reasonable.
from ipywidgets import HTML
from ipyleaflet import Map, Marker, Popup, MarkerCluster
# Start from a zoomed-out world view; markers will cluster by region.
center = (42.0451, -87.6877)
map2 = Map(center=center, zoom=2, close_popup_on_click=True)
markers = []
# Plot the first 100 extracted author/affiliation rows, one marker each.
for row in list(df_extracted.iterrows())[:100]:
marker = Marker(location=(row[1]['latitude'], row[1]['longitude']))
message = HTML()
message.value = f"{row[1]['author']}: <b>{row[1]['affiliation']}</b>"
marker.popup = message
markers.append(marker)
map2.add_layer(MarkerCluster(markers=markers))
map2
Deploy
Deploying this code to the full dataset requires precautions:
Structure your code into source code (functions) along with unit tests
Functions are composed in scripts, which should have runtime tests and generate logs
All code (source, scripts, tests) needs to be in source control, typically git
Save logs and output files in a secure location. Do not modify them.
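As a sketch of what that structure might look like, here is an illustrative driver script; the module name extraction, the file paths, and the log location are assumptions, not the original code:

# extract_all.py -- illustrative driver script, not the original pipeline
import json
import logging

import openai
import pandas as pd

from extraction import analyze_text, validate_response_data  # your source module with unit tests

# Assumes a logs/ directory exists; each run appends timestamped entries.
logging.basicConfig(filename='logs/extract_all.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

def main() -> None:
    client = openai.Client()
    df = pd.read_parquet('data/arxiv_metadata.parquet')
    results = []
    for _, row in df.iterrows():
        try:
            result = analyze_text(client, row['text'])
            validate_response_data(json.loads(result))  # runtime test on each response
            results.append({'id': row['id'], 'extracted_info': result})
            logging.info("ok id=%s", row['id'])
        except Exception:
            logging.exception("failed id=%s", row['id'])
    pd.DataFrame(results).to_parquet('output/extracted_full.parquet')

if __name__ == '__main__':
    main()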