Text Analysis

Research Idea

Identify LLM researchers, which sub-topics they are working on, and whether there is geographical specialization.

  1. Who are the researchers?

  2. Where are they located?

  3. What topics do they work on? (future work)

Development Workflow
  • Identify a dataset and specify precisely what you want to do with it

  • Engineer a prompt (use OpenAI Playground)

  • Evaluate performance on a sample

    • If performance is unacceptable, try further prompt engineering, functions, etc.

    • If performance is still not good enough, try fine-tuning

  • Deploy at scale

(Figure: the GPT development cycle)

Gather Data

Promising dataset: arXiv

  • Almost all LLM papers of note are posted here

  • PDF files are freely available (~2 million covering STEM fields)

  • Limited amount of metadata is also available

  • Downloaded ~20K recent papers and filtered for those whose abstract contains the phrase "Large Language Model". The resulting dataset: ~1.7K PDFs plus metadata

  • Extracted the first page of text from each PDF (a sketch of this filtering and extraction step follows below)
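The notebook does not show the filtering and extraction step as code; here is a minimal sketch of one way to do it, assuming pypdf for PDF parsing. The raw-metadata file name and the pdfs/ directory are illustrative, not the actual pipeline.

from pathlib import Path

import pandas as pd
from pypdf import PdfReader

# Hypothetical dump of the ~20K downloaded records
meta = pd.read_parquet('data/arxiv_metadata_raw.parquet')

# Keep papers whose abstract mentions the target phrase (case-insensitive)
llm = meta[meta['abstract'].str.contains('large language model', case=False, na=False)].copy()

def first_page_text(pdf_path: Path) -> str:
    """Return the text of the first page of a PDF, or an empty string on failure."""
    try:
        return PdfReader(pdf_path).pages[0].extract_text() or ''
    except Exception:
        return ''

llm['text'] = llm['src_pdf'].map(lambda name: first_page_text(Path('pdfs') / name))
llm.to_parquet('data/arxiv_metadata.parquet')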

import pandas as pd

# Load the prepared arXiv metadata (including the extracted first-page text) into a DataFrame
df = pd.read_parquet('data/arxiv_metadata.parquet')
df.head()
(Output: the first five rows. Columns: id, submitter, authors, title, comments, journal-ref, doi, report-no, categories, license, abstract, update_date, src_pdf, text. Only the most informative columns are reproduced below; long values are truncated as in the original output.)

  2310.00014  Yong Ren, Tao Wang, Jiangyan Yi, Le Xu, Jianhu...  Fewer-token Neural Speech Codec with Time-inva...  cs.SD eess.AS      2023-10-03
  2310.00031  Neehar Kondapaneni, Markus Marks, Manuel Knott...  Text-image Alignment for Diffusion-based Perce...  cs.CV              2023-10-06
  2310.00032  Qinghua Xu, Tao Yue, Shaukat Ali and Maite Arr...  Pretrain, Prompt, and Transfer: Evolving Digit...   cs.SE              2023-10-06
  2310.00034  Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong   PB-LLM: Partially Binarized Large Language Models  cs.LG cs.AI cs.CL  2023-10-03
  2310.00035  Xi Wang, Laurence Aitchison, Maja Rudolph          LoRA ensembles for large language model fine-t...   cs.LG cs.AI        2023-10-06

Specify what information to extract

  • title

  • list of author names

  • each author’s email address

  • each author’s affiliation

  • each affiliation’s location in terms of latitude and longitude

{ "title": "The paper's title",
    "authors": [
        {
            "name": "author's name",
            "email": "name@domain.edu",
            "affiliations": [ "list of indices" ]
        }
    ],
    "affiliations": [ 
        {"index": "the index", 
        "name": "The affiliation name", 
        "longitude": "the longitude", 
        "latitude": "the latitude" 
        } 
    ]
 ]
}
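One optional way to keep this schema explicit in code (not something done in the notebook) is to mirror it with TypedDicts; the class names below are illustrative.

from typing import List, Optional, TypedDict

class Affiliation(TypedDict):
    index: int
    name: str
    longitude: str
    latitude: str

class Author(TypedDict):
    name: str
    email: Optional[str]
    affiliations: List[int]   # indices into the affiliations list

class PaperInfo(TypedDict):
    title: str
    authors: List[Author]
    affiliations: List[Affiliation]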
(Figure: the header of an arXiv paper, showing title, authors, and affiliations)

Develop a Prompt

OpenAI playground

Prompt engineering

There are various prompting strategies you can use to improve performance. OpenAI has a very good guide to help you out. They also provide lots of examples to look at.

Here are a couple of my tries:

import os
from dotenv import load_dotenv
from openai import OpenAI

# Load the .env file containing the API key
load_dotenv()

# Create a client using the key from the environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
  model="gpt-4-1106-preview",
  messages=[
    {
      "role": "system",
      "content": "You are an expert research librarian. You are precise and can analyze the structure of papers very well. You return information in json format."
    },
    {
      "role": "user",
      "content": "Extract the title and authors and affiliations from the first page of a scientific paper. \n\nUse the following step-by-step instructions to respond to user inputs.\n\nExtract the title and authors from the first page of a scientific paper. The paper text will snipped will be delimited by triple quotes. Geolocate each author affiliation with latitude and longitude.\n\nThe output should have the following format:\n\n{ \"title\": \"The paper's title\",\n  \"authors\": [\n    {\n      \"name\": \"Yong Ren\",\n      \"email\": null,\n      \"affiliations\": [ \"list of indices\" ]\n    }\n  ],\n \"affiliations\": [ {\"index\": \"the index\", \"name\": \"The affiliation name\", \"longitude\": \"the longitude\", \"latitude\": \"the latitude\" } ]\n ]\n}\n\n\"\"\"\nFEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES\nYong Ren1,2, Tao Wang1, Jiangyan Yi1, Le Xu1,2, Jianhua Tao3, Chuyuan Zhang1,2, Junzuo Zhou1,2\n1Institute of Automation, Chinese Academy of Sciences, China\n2University of Chinese Academy of Sciences, China\n3Department of Automation, Tsinghua University, China\nABSTRACT\nLanguage model based text-to-speech (TTS) models, like VALL-E,\nhave gained attention for their outstanding in-context learning capa-\nbility in zero-shot scenarios. Neural speech codec is a critical com-\nponent of these models, which can convert speech into discrete token\nrepresentations. However, excessive token sequences from the codec\nmay negatively affect prediction accuracy and restrict the progres-\nsion of Language model based TTS models. To address this issue,\nthis paper proposes a novel neural speech codec with time-invariant\ncodes named TiCodec. By encoding and quantizing time-invariant\ninformation into a separate code, TiCodec can reduce the amount of\nframe-level information that needs encoding, effectively decreasing\nthe number of tokens as codes of speech. Furthermore, this paper\nintroduces a time-invariant encoding consistency loss to enhance the\nconsistency of time-invariant code within an utterance and force it\nto capture more global information, which can benefit the zero-shot\nTTS task. Experimental results demonstrate that TiCodec can not\nonly enhance the quality of reconstruction speech with fewer tokens\nbut also increase the similarity and naturalness, as well as reduce the\nword error rate of the synthesized speech by the TTS model.\nIndex Terms— speech codec, fewer tokens, time-invariant, lan-\nguage model, text-to-speech\n\"\"\"\n "
    }
  ],
  response_format={"type": "json_object"},
  temperature=0,
  max_tokens=2048,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0,
  seed=42,
)
import json

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=4))
{
    "title": "FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES",
    "authors": [
        {
            "name": "Yong Ren",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Tao Wang",
            "email": null,
            "affiliations": [
                1
            ]
        },
        {
            "name": "Jiangyan Yi",
            "email": null,
            "affiliations": [
                1
            ]
        },
        {
            "name": "Le Xu",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Jianhua Tao",
            "email": null,
            "affiliations": [
                3
            ]
        },
        {
            "name": "Chuyuan Zhang",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Junzuo Zhou",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        }
    ],
    "affiliations": [
        {
            "index": 1,
            "name": "Institute of Automation, Chinese Academy of Sciences, China",
            "longitude": "116.3975",
            "latitude": "39.9085"
        },
        {
            "index": 2,
            "name": "University of Chinese Academy of Sciences, China",
            "longitude": "116.6514",
            "latitude": "40.1211"
        },
        {
            "index": 3,
            "name": "Department of Automation, Tsinghua University, China",
            "longitude": "116.326",
            "latitude": "40.0036"
        }
    ]
}
output = """
{
    "title": "FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES",
    "authors": [
        {
            "name": "Yong Ren",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Tao Wang",
            "email": null,
            "affiliations": [
                1
            ]
        },
        {
            "name": "Jiangyan Yi",
            "email": null,
            "affiliations": [
                1
            ]
        },
        {
            "name": "Le Xu",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Jianhua Tao",
            "email": null,
            "affiliations": [
                3
            ]
        },
        {
            "name": "Chuyuan Zhang",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        },
        {
            "name": "Junzuo Zhou",
            "email": null,
            "affiliations": [
                1,
                2
            ]
        }
    ],
    "affiliations": [
        {
            "index": 1,
            "name": "Institute of Automation, Chinese Academy of Sciences, China",
            "longitude": "116.331398",
            "latitude": "39.897445"
        },
        {
            "index": 2,
            "name": "University of Chinese Academy of Sciences, China",
            "longitude": "116.651381",
            "latitude": "40.12114"
        },
        {
            "index": 3,
            "name": "Department of Automation, Tsinghua University, China",
            "longitude": "116.326443",
            "latitude": "40.00368"
        }
    ]
}
"""

Evaluate on a sample

To evaluate the performance of GPT-4 on your dataset, you need some way of validating it externally. At least a portion of your data must be labelled with the correct (or at least likely correct) output; this is called a gold standard. Here, the authors field in the arXiv metadata serves as the gold standard for author names.

user_prompt_instructions = """
Extract the title and authors and affiliations from the first page of a scientific paper. 

Use the following step-by-step instructions to respond to user inputs.

Extract the title and authors from the first page of a scientific paper. The paper text snippet will be delimited by triple quotes. Geolocate each author affiliation with latitude and longitude.

The output should have the following format:

{ "title": "The paper's title",
  "authors": [
    {
      "name": "Yong Ren",
      "email": null,
      "affiliations": [ "list of indices" ]
    }
  ],
 "affiliations": [ {"index": "the index", "name": "The affiliation name", "longitude": "the longitude", "latitude": "the latitude" } ]
 ]
}

"""
import json
from typing import Dict

import openai

def validate_response_data(data: Dict):
    assert "title" in data, "title not found"
    assert "authors" in data, "authors not found"
    for auth in data['authors']:
        assert "name" in auth, "name not found"
        assert "email" in auth, "email not found"
        assert "affiliations" in auth, "affiliations not found"
    assert "affiliations" in data, "affiliations not found"
    for aff in data['affiliations']:
        assert "index" in aff, "index not found"
        assert "name" in aff, "name not found"
        assert "longitude" in aff, "longitude not found"
        assert "latitude" in aff, "latitude not found"

def analyze_text(client: openai.Client, text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {
                "role": "system",
                "content": "You are an expert research librarian. You are precise and can analyze the structure of papers very well. You return information in json format."
            },
            {
                "role": "user",
                "content": user_prompt_instructions + '\n\n"""' + text + '\n\n"""'
            }
        ],
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        seed=42,
    )
    try:
        data = json.loads(response.choices[0].message.content)
        print(data)
        validate_response_data(data)
        return json.dumps(data)
    except Exception as e:
        print(e)
        return str(e)
client = openai.Client()
df_sample = df.sample(100, random_state=42)
df_sample['extracted_info'] = df_sample['text'].apply(lambda x: analyze_text(client, x))
df_sample.to_parquet("sample_output.parquet")
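The evaluation below reads a flat, per-author table (data/extracted_data.parquet) whose construction is not shown in the notebook. Here is a sketch of one way to flatten the saved JSON responses into that shape; the column names are assumptions based on how the table is used afterwards.

import json
import pandas as pd

df_sample = pd.read_parquet("sample_output.parquet")

rows = []
for _, paper in df_sample.iterrows():
    try:
        data = json.loads(paper["extracted_info"])
    except json.JSONDecodeError:
        continue  # skip papers where extraction or validation failed
    # Map affiliation index -> affiliation record for quick lookup
    affs = {aff["index"]: aff for aff in data.get("affiliations", [])}
    for author in data.get("authors", []):
        for idx in author.get("affiliations", []):
            aff = affs.get(idx, {})
            rows.append({
                "id": paper["id"],
                "author": author.get("name"),
                "affiliation": aff.get("name"),
                "latitude": aff.get("latitude"),
                "longitude": aff.get("longitude"),
            })

df_extracted = pd.DataFrame(rows)
df_extracted.to_parquet("data/extracted_data.parquet")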
import pandas as pd

df_gold = pd.read_parquet('./data/arxiv_metadata.parquet')
df_extracted = pd.read_parquet('./data/extracted_data.parquet')

true_positives = []
false_positives = []
false_negatives = []

for id in df_extracted['id']:
    gold_authors = list(df_gold[df_gold['id'] == id]['authors'])[0]
    gold_authors = {a.strip() for a in gold_authors.split(',')}

    predicted = df_extracted[df_extracted['id'] == id]
    predicted_authors = set(predicted['author'])
    
    for author in predicted_authors:
        if author in gold_authors:
            true_positives.append((id, author))
        else:
            false_positives.append((id, author))
    
    for author in gold_authors:
        if author not in predicted_authors:
            false_negatives.append((id, author))


# round precision to 2 decimal places
precision = round(len(true_positives) / (len(true_positives) + len(false_positives)), 2)

# round recall to 2 decimal places
recall = round(len(true_positives) / (len(true_positives) + len(false_negatives)), 2)

print(f"true_positives count: {len(true_positives)}")
print(f"false_positives count: {len(false_positives)}")
print(f"false_negatives count: {len(false_negatives)}")
print("precision:", precision)
print("recall:", recall)
true_positives count: 3884
false_positives count: 364
false_negatives count: 316
precision: 0.91
recall: 0.92
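As a single summary number, the F1 score (the harmonic mean of precision and recall) works out to 2 × 0.91 × 0.92 / (0.91 + 0.92) ≈ 0.91.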
fp_sample = {fp[1] for fp in false_positives if fp[0] == '2310.08102'}
fn_sample = {fn[1] for fn in false_negatives if fn[0] == '2310.08102'}
print(f"False positives for id 2310.08102\n")
for fp in fp_sample:
    print(f"  {fp}")

print(f"\nFalse negatives for id 2310.08102\n")
for fn in fn_sample:
    print(f"  {fn}")
False positives for id 2310.08102

  Alham Fikri Aji
  Ayu Purwarianti
  Muhammad Razif Rizqullah

False negatives for id 2310.08102

  Ayu Purwarianti (1) and Alham Fikri Aji
  (2) ((1) Bandung Institute of Technology
  (2) Mohamed bin Zayed University of
  Artificial Intelligence)
  Muhammad Razif Rizqullah (1)
(Fig. 8: the arXiv paper behind the false-positive/false-negative example above)
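Most of these mismatches are artifacts of the naive comma-split rather than extraction errors: the metadata authors field embeds parenthesized affiliation markers and an "and" connector. Here is a sketch of one way to normalize the gold names before comparison (this cleaning step is a suggestion, not something the notebook does).

import re

def gold_author_set(authors_field: str) -> set:
    """Normalize an arXiv metadata authors string into a set of bare author names."""
    cleaned = authors_field
    # Repeatedly strip the innermost parenthesized chunks so nested "(...)" groups are removed too
    while re.search(r"\([^()]*\)", cleaned):
        cleaned = re.sub(r"\([^()]*\)", " ", cleaned)
    cleaned = cleaned.replace(" and ", ", ")  # treat "and" as another separator
    return {name.strip() for name in cleaned.split(",") if name.strip()}

Applied to the authors string for 2310.08102, this yields the three bare author names, which would turn the three false positives above into true positives.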

Other Forms of Evaluation

Now let’s map these geo-coordinates to get a sanity check on how well GPT-4 did the geocoding. This is not a substitute for a quantitative analysis, but it does give us more confidence if the results look reasonable.

from ipywidgets import HTML
from ipyleaflet import Map, Marker, MarkerCluster

# World-level view, centered near (42.05, -87.69)
center = (42.0451, -87.6877)
map2 = Map(center=center, zoom=2, close_popup_on_click=True)

# One marker per extracted (author, affiliation) row, with a popup label
markers = []
for _, row in df_extracted.head(100).iterrows():
    marker = Marker(location=(row['latitude'], row['longitude']))
    message = HTML()
    message.value = f"{row['author']}: <b>{row['affiliation']}</b>"
    marker.popup = message
    markers.append(marker)

map2.add_layer(MarkerCluster(markers=markers))

map2
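If you did want a quantitative check on the geocoding, one option (not done here) is to measure the great-circle distance between GPT-4's coordinates and a reference source using the haversine formula. As a small worked example, the two runs above placed Tsinghua's Department of Automation about 40 m apart.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Tsinghua coordinates from the two runs shown earlier
print(haversine_km(40.0036, 116.326, 40.00368, 116.326443))  # ~0.04 km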

Deploy

Deploying this code to the full dataset calls for some precautions:

  • Structure your code into source code (functions) along with unit tests

  • Compose the functions into scripts, which should run runtime validation checks and write logs

  • Keep all code (source, scripts, tests) in source control, typically git

  • Save logs and output files in a secure location, and do not modify them. (A minimal sketch of the unit-testing and runtime-validation pieces follows the figure below.)

(Figure: unit testing and runtime validation)
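As a concrete, purely illustrative sketch of the unit-testing and runtime-validation pieces, reusing validate_response_data and analyze_text from the evaluation section; the extract module and the log file name are assumptions.

import json
import logging

import pytest

# Hypothetical module holding the functions shown in the evaluation section
from extract import analyze_text, validate_response_data

# --- Unit test (run with pytest) -------------------------------------------
def test_validate_response_data_rejects_missing_authors():
    with pytest.raises(AssertionError):
        validate_response_data({"title": "A paper", "affiliations": []})

# --- Runtime validation and logging inside the batch script ----------------
logging.basicConfig(filename="extraction.log", level=logging.INFO)

def process_paper(client, paper_id: str, text: str) -> None:
    """Analyze one paper, validate the response, and log the outcome."""
    raw = analyze_text(client, text)
    try:
        validate_response_data(json.loads(raw))
        logging.info("extracted %s", paper_id)
    except Exception as exc:
        logging.error("failed %s: %s", paper_id, exc)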