Speech-to-text#

Whisper is a speech-to-text model from OpenAI, available both as an open-source Python package and through the OpenAI API. It uses ASR (Automatic Speech Recognition) technology to convert spoken language into written text. The model was trained on 680,000 hours of audio and corresponding transcripts collected from the internet: English audio matched to English transcripts (65% of the data), non-English audio matched to English transcripts (18%), and non-English audio matched to transcripts in the same language (17%). Altogether, Whisper supports 99 different languages and can translate audio from any of them into English. The API accepts the following file types: m4a, mp3, webm, mp4, mpga, wav, and mpeg.
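
Before uploading a clip, you can check its extension against this list; here is a minimal sketch (the helper and constant names are ours, not part of the API):

from pathlib import Path

# file formats accepted by the transcription API, per the list above
SUPPORTED_FORMATS = {".m4a", ".mp3", ".webm", ".mp4", ".mpga", ".wav", ".mpeg"}

def is_supported_audio(path):
    # True if the file's extension is one the API accepts
    return Path(path).suffix.lower() in SUPPORTED_FORMATS

print(is_supported_audio("./person_gross.wav"))  # True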

In this tutorial, we’ll show you how to use Whisper with your OpenAI API credentials.

Example Use Cases#

Some example use cases for your work could include transcribing:

  • a Google Meet or Zoom meeting or interview;

  • an in-person interview recorded on your computer (Example A below);

  • a presentation you give at a conference;

  • any other type of recorded speech, like an earnings call audio clip (Example B below).

Getting Started#

You can install Whisper with the following:

# pip install openai-whisper
# pip install openai

We also use the following libraries in our examples (wave is part of the Python standard library, so it does not need to be installed separately):

# pip install pyaudio
# pip install transformers

To import these libraries:

# import libraries
import pyaudio
import wave
import openai
import whisper
from transformers import pipeline

Don’t forget to reference your API key. The OpenAI client reads it from the OPENAI_API_KEY environment variable automatically.

import os
from dotenv import load_dotenv # pip install python-dotenv

# load the .env file containing your API key
load_dotenv()

# display (obfuscated) API key
print(f"OPENAI_API_KEY: {os.getenv('OPENAI_API_KEY')[:4]}...")
OPENAI_API_KEY: sk-2...

Example A: Toy Marketing Study for New Candy Reviews#

Before we got started today, we offered everyone the option to try some candy they had never had before. If this were a real marketing study, we could record the audio using our computer’s microphone with the code below.

AUDIO RECORDING#

# Function to record audio and save it as a WAV file
def record_audio(output_dir, participant_name, candy_name, duration=15):
    # Use PyAudio to capture audio from the microphone
    audio = pyaudio.PyAudio()

    audio_format = pyaudio.paInt16  # 16-bit samples
    channels = 1                    # mono
    rate = 44100                    # samples per second
    frames_per_buffer = 1024        # samples per read

    stream = audio.open(format=audio_format,
                        channels=channels,
                        rate=rate,
                        input=True,
                        frames_per_buffer=frames_per_buffer)

    print("Recording audio...")

    frames = []

    for _ in range(int(rate / frames_per_buffer * duration)):
        data = stream.read(frames_per_buffer)
        frames.append(data)

    print("Finished recording.")

    stream.stop_stream()
    stream.close()
    audio.terminate()

    # Save the captured audio as a WAV file
    audio_file_name = f"{output_dir}/{participant_name}_{candy_name}.wav"

    # Use the wave module to write the frames to a WAV file
    with wave.open(audio_file_name, 'wb') as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(audio.get_sample_size(audio_format))
        wf.setframerate(rate)
        wf.writeframes(b''.join(frames))

    print(f"Audio saved as {audio_file_name}")

    return audio_file_name

# set output directory
output_dir = "."

Using this function, we can record three reviews: (1) an honest review of something disgusting; (2) a polite review; and (3) a sarcastic review.

# honest
# audio = record_audio(output_dir, "person", "gross", duration=10)
# polite
# audio = record_audio(output_dir, "person", "polite", duration=10)
# sarcastic
# audio = record_audio(output_dir, "person", "sarcasm", duration=10)

TRANSCRIPTION#

Whisper makes it easy to transcribe the audio. Let’s transcribe some of the candy reviews we recorded.

Here’s the honest review:

client = openai.OpenAI()


# transcribe the honest review
audio = './person_gross.wav'
audio_file = open(audio, "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file
    )

# legacy (pre-1.0 SDK) equivalent: transcript = openai.Audio.transcribe("whisper-1", audio_file)
print(transcript.text)
honest = transcript.text
Ugh, that was gross.

Here’s the polite review:

# transcribe the polite review
audio = './person_polite.wav'
audio_file = open(audio, "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file
    )
print(transcript.text)
polite = transcript.text
Um, that was, um, the wrapper, the wrapper was pretty, um, the taste was, um, interesting.

Finally, here’s the sarcastic one:

# transcribe the sarcastic review
audio = './person_sarcasm.wav'
audio_file = open(audio, "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file
    )
print(transcript.text)
sarcastic = transcript.text
Oh, that was so good!

SENTIMENT ANALYSIS#

We can also conduct a basic sentiment analysis on the transcribed text of the candy reviews. To analyze sentiment, we’ll apply a pre-trained model called roberta-base-go_emotions from the Hugging Face Model Hub (https://huggingface.co/docs/hub/models-the-hub).

Note that the sentiment analysis code was summarized from this blog post: https://www.smashingmagazine.com/2023/09/generating-real-time-audio-sentiment-analysis-ai/. Please see the original post for additional details.

First, we load the Whisper model for speech recognition (optional here, since we already transcribed with the API). Then, we initialize the sentiment-analysis pipeline using a pre-trained model from Hugging Face Transformers.

# load Whisper and initialize the sentiment analysis using Roberta
model = whisper.load_model("base")

sentiment_analysis = pipeline(
  "sentiment-analysis",
  framework="pt",
  model="SamLowe/roberta-base-go_emotions",
  top_k=3  # change this number to retrieve more or fewer sentiments
)
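
To sanity-check the pipeline, we can run it on a sample sentence (the sentence below is ours). With top_k=3, a single string input returns a list containing one inner list of label/score dictionaries:

# quick check: the pipeline returns [[{label, score}, ...]] for a single string
print(sentiment_analysis("The chocolate was delicious!"))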

Then we create a function that extracts the top 3 emotions, their scores, and corresponding emojis from the transcribed text.

# find the sentiment and score from the Hugging Face model
def analyze_sentiment(text):
    results = sentiment_analysis(text)
    results = results[0]
    return results

# return the corresponding emoji for the sentiment
def get_sentiment_emoji(sentiment):
  # Define the mapping of sentiments to emojis
  emoji_mapping = {
    "disappointment": "😞",
    "sadness": "😢",
    "annoyance": "😠",
    "neutral": "😐",
    "disapproval": "👎",
    "realization": "😮",
    "nervousness": "😬",
    "approval": "👍",
    "joy": "😄",
    "anger": "😡",
    "embarrassment": "😳",
    "caring": "🤗",
    "remorse": "😔",
    "disgust": "🤢",
    "grief": "😥",
    "confusion": "😕",
    "relief": "😌",
    "desire": "😍",
    "admiration": "😌",
    "optimism": "😊",
    "fear": "😨",
    "love": "❤️",
    "excitement": "🎉",
    "curiosity": "🤔",
    "amusement": "😄",
    "surprise": "😲",
    "gratitude": "🙏",
    "pride": "🦁"
  }
  return emoji_mapping.get(sentiment, "")

# put it all together
def display_sentiment_results(text):
    sentiment_results = analyze_sentiment(text)
    for sentiment in sentiment_results:
        label = sentiment['label']
        emoji = get_sentiment_emoji(label)
        result = f"{label} {emoji}: {sentiment['score']}"
        print(result)

These are the sentiment results for the polite, honest, and sarcastic reviews. Notice that the sarcastic review scores as strong admiration: a text-based sentiment model can’t hear the speaker’s tone.

print(polite)
display_sentiment_results(polite)
Um, that was, um, the wrapper, the wrapper was pretty, um, the taste was, um, interesting.
admiration 😌: 0.6466179490089417
approval 👍: 0.15136606991291046
excitement 🎉: 0.09306973963975906
print(honest)
display_sentiment_results(honest)
Ugh, that was gross.
disgust 🤢: 0.8205026984214783
annoyance 😠: 0.10063128918409348
neutral 😐: 0.06488493829965591
print(sarcastic)
display_sentiment_results(sarcastic)
Oh, that was so good!
admiration 😌: 0.9531673192977905
approval 👍: 0.03131399303674698
excitement 🎉: 0.01284793484956026

Example B: Earnings Call Transcript for the ‘Chocolate Covered Stuff’ company#

Note that this example and explanation builds off the one provided here: https://cookbook.openai.com/examples/whisper_prompting_guide. Please see the original post for additional details.

Let’s take another toy example. I asked GPT to create a 20-second earnings call transcript for a Pakistani company called Chocolate Covered Stuff. I then recorded that transcript using the audio code provided earlier. You can do something similar and record it below:

# audio = record_audio(output_dir, "choco_stuff", "earnings_call", duration=45)
# audio = record_audio(output_dir, "urdu_hindi", "earnings_call", duration=5)  # saves urdu_hindi_earnings_call.wav

We can see a transcription of this earnings call here:

earnings_audio = 'choco_stuff_earnings_call.wav'
audio_file = open(earnings_audio, "rb")
transcript = client.audio.transcriptions.create(
    model="whisper-1", 
    file=audio_file
    )
print(transcript)
Transcription(text="As-salamu alaykum chocolate-covered-stuff family! Today's call is sweeter than the rumor of free chocolate bars. Our profits have surged faster than admiration for an Allama Iqbal poem. As we celebrate our success, it's with a heavy heart that we have to bid farewell to our esteemed CFO Mr. Choco Khan, who's been as sweet as our treat. His dedication has been integral to our growth, much like the chocolate covering our unique confections. And now, the moment that marks a new chapter for us, introducing our delightful line of chocolate-covered rickshaws with Kashmiri chai sprinkles. Hum duniya ko chocolate se muskarate hue banate hain. Let's keep shaping the world with a chocolate-covered smile.")

If we want to improve this transcription, we can use prompts.

PROMPTS#

Whisper has an optional parameter called a prompt. Prompts can be used for the following:

  • Provide Context for Transcription - stitch together multiple audio segments by passing the text of the previous segment as context for the next one (see the sketch below).

  • Make Spelling Corrections - spell out specific words and names mentioned in the audio clip, like Famke Janssen or Lupita Nyong’o.

  • Specify which Language or Script is Used.

Unlike GPT prompting, these prompts cannot instruct the model to perform specific tasks. For instance, prompts like “Format the listed items as Markdown” or “Translate this French phrase into English” will not work within Whisper prompts.
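
For the first use case, you pass the text of the previous segment as the prompt when transcribing the next one. Here is a minimal sketch, assuming a longer recording split across two hypothetical files, segment_1.wav and segment_2.wav:

# stitch two consecutive (hypothetical) segments together, feeding the
# first transcript in as context for the second
with open("segment_1.wav", "rb") as f:
    first = client.audio.transcriptions.create(model="whisper-1", file=f)

with open("segment_2.wav", "rb") as f:
    second = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        prompt=first.text,  # context from the previous segment
    )

print(first.text + " " + second.text)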

The function below shows how to use prompts with Whisper.

# define a wrapper function for seeing how prompts affect transcriptions
def transcribe(audio_filepath, prompt: str) -> str:
    """Given a prompt, transcribe the audio file."""
    with open(audio_filepath, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            file=audio_file,
            model="whisper-1",
            prompt=prompt,
        )
    return transcript.text
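
For the second use case, you can list tricky spellings directly in the prompt. A short sketch, commented out because the clip here is hypothetical:

# nudge Whisper toward the correct spelling of names in a (hypothetical) clip
# interview = transcribe("./interview.wav", prompt="Famke Janssen, Lupita Nyong'o")
# print(interview)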

Now, let’s transcribe the earnings call with a prompt. Let’s also conduct a sentiment analysis on the call text.

# equivalently: earnings_call = transcribe(earnings_audio, prompt=...), which returns just the text
earnings_call = client.audio.transcriptions.create(
    prompt="Allama Iqbal was a great poet. Translate one line into Urdu text.",
    model="whisper-1",
    # reopen the file: the earlier handle has already been read to the end
    file=open(earnings_audio, "rb")
    )
print(earnings_call)
display_sentiment_results(earnings_call.text)
Transcription(text="Assalamu alaikum chocolate covered stuff family! Today's call is sweeter than the rumor of free chocolate bars. Our profits have surged faster than admiration for an Allama Iqbal poem. As we celebrate our success, it's with a heavy heart that we have to bid farewell to our esteemed CFO, Mr. Chacko Khan, who's been as sweet as our treat. His dedication has been integral to our growth, much like the chocolate covering our unique confections. And now, the moment that marks a new chapter for us, introducing our delightful line of chocolate-covered rickshaws with Kashmiri chai sprinkles. Hum duniya ko chocolate se muskarate hue banate hain. Let's keep shaping the world with a chocolate-covered smile.")
joy 😄: 0.4196963310241699
admiration 😌: 0.3140278458595276
excitement 🎉: 0.09640142321586609

One line in the earnings call was in Urdu, but Whisper neither translated it nor wrote it in Nastaliq (Perso-Arabic) script. Let’s take that foreign-language audio clip by itself and see what Whisper does.

# English: "Let's keep shaping the world with a chocolate-covered smile!"; Romanized Urdu: "Hum duniya ko chocolate se muskuratay hue banatay hain"
urdu_hindi_audio = './urdu_hindi_earnings_call.wav'
transcribe(urdu_hindi_audio, prompt="")
'ہم دنیا کو چاکلٹ سے مسکراتے ہوئے بناتے ہیں۔'

Even without a prompt, Whisper recognized the line as Urdu. However, we can use a prompt to tell Whisper it is Hindi instead.

transcribe(urdu_hindi_audio, prompt="This is in Hindi.")
'हम दुन्या को चौकलेट से मुस्कराते हुए बनाते हैं।'

This time it wrote the words in Devanagari script. Now, what happens if we prompt Whisper to translate the line into English?

transcribe(urdu_hindi_audio, prompt="Translate this Urdu line to English")
'ہم دنیا کو چاکولیٹ سے مسکراتے ہوئے بناتے ہیں'

You can see that this doesn’t work: the prompt is not treated as an instruction, and the clip is simply transcribed in Urdu script again.
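
If you want an English translation of the audio itself, the API provides a dedicated translations endpoint; here is a short sketch using the same clip:

# translate the Urdu audio clip into English with the dedicated endpoint
with open(urdu_hindi_audio, "rb") as f:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=f,
    )
print(translation.text)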