Whisper to your keyboard: Setting up a speech-to-text button

Long story short: I broke my arm riding my bike and I can’t type. That’s quite a problem, as typing is a key part of my daily life as a software engineer. So I decided to add a speech-to-text button to my keyboard. Here’s how I did it.

How to transcribe speech to text on Linux

First, I looked into how to transcribe speech to text on Linux. I found a few options:

Whisper.cpp: a way to run OpenAI’s Whisper model locally
OpenAI’s API: the original Whisper model as a pay-as-you-go API
Deepgram: a pay-as-you-go service that offers a speech-to-text API

I tested whisper.cpp on my machine, but out of the box it was too slow and inaccurate, so I abandoned the idea of running the model locally (for now) and decided to use an API.

So I registered for an OpenAI account and tried their API:

(you can download an example recording here)

FILEPATH="./recording" # path to the recording without extension

curl --request POST \
  --url https://api.openai.com/v1/audio/transcriptions \
  --header "Authorization: Bearer $OPEN_AI_TOKEN" \
  --header 'Content-Type: multipart/form-data' \
  --form file="@$FILEPATH.wav" \
  --form model=whisper-1 \
  --form response_format=text \
  -o "${FILEPATH}.txt"

It’s quite magical: I could already transcribe my voice to text.

How to record your microphone on Linux

This part was tricky because of compatibility and device selection, but the goal is simple: record your microphone to a .wav file. (I could not make MP3 encoding work reliably, and it was not required.)

First use arecord to list your available devices:

arecord -l
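
The card and device numbers in that listing are what go into an identifier such as hw:0,0. As a minimal sketch, assuming the usual English "card N: ..., device M: ..." line layout, a small filter can turn the listing into ready-to-use identifiers. The function name list_capture_ids is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `arecord -l` output on stdin and print one
# "hw:CARD,DEVICE card-name" line per capture device.
# Assumes the English "card N: NAME, device M: NAME" layout (GNU sed -E).
list_capture_ids() {
  sed -n -E 's/^card ([0-9]+): (.*), device ([0-9]+): .*$/hw:\1,\3 \2/p'
}

# Usage (LC_ALL=C keeps arecord's output in English):
#   LC_ALL=C arecord -l | list_capture_ids
```

Each printed identifier can then be tried as the AUDIO_INPUT value until the recording contains your voice.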

Then you should be able to test which input device is your microphone by recording a few samples:

FILEPATH="./recording"
AUDIO_INPUT="hw:0,0" # your microphone device, test with a few devices to find the right one
arecord --device="$AUDIO_INPUT" --format cd "$FILEPATH.wav" --duration=10
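
Uncompressed WAV files grow quickly, so it helps to estimate their size before uploading them. The sketch below does the arithmetic for --format cd (44100 Hz, 16-bit samples, stereo) and ignores the small WAV header. The 25 MiB figure is an assumption based on the upload limit OpenAI has documented for its transcription endpoint; check the current documentation before relying on it. Both function names are illustrative.

```shell
#!/usr/bin/env bash
# Bytes per second for `--format cd`: 44100 samples * 2 bytes * 2 channels.
readonly CD_BYTES_PER_SECOND=$(( 44100 * 2 * 2 ))

# Approximate size in bytes of a recording of the given length in seconds.
wav_bytes() {
  echo $(( $1 * CD_BYTES_PER_SECOND ))
}

# Longest whole-second recording that fits under a size limit given in bytes.
max_seconds_for_limit() {
  echo $(( $1 / CD_BYTES_PER_SECOND ))
}

wav_bytes 15                                   # prints 2646000 (about 2.6 MB)
max_seconds_for_limit $(( 25 * 1024 * 1024 ))  # prints 148 (assumed 25 MiB limit)
```

A short dictation therefore stays far below that limit, while anything longer than about two and a half minutes would need compression or chunking.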

How to type the text file by emulating a keyboard

This one is quite easy. I used xdotool to emulate a keyboard and type out the contents of the text file:

FILEPATH="./recording"
perl -pi -e 'chomp if eof' "$FILEPATH.txt" # remove trailing newline if any to avoid sending an extra newline keypress
xdotool type --clearmodifiers --file "$FILEPATH.txt"
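
If you would rather not edit the transcript file in place, a possible alternative is to rely on the shell itself: command substitution drops trailing newlines. This is only a sketch, and it differs slightly from the perl one-liner because it removes every trailing newline rather than a single one. The function name is made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a file's contents without any trailing newlines.
# $(<file) reads the file, and command substitution strips trailing newlines.
strip_trailing_newlines() {
  local text
  text=$(<"$1")
  printf '%s' "$text"
}

# Usage: type the cleaned text without touching the file on disk.
#   xdotool type --clearmodifiers -- "$(strip_trailing_newlines "$FILEPATH.txt")"
```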

Putting it all together: a button to transcribe speech to text

Now that we have all the pieces, we can put it all together in a script:

#!/usr/bin/env bash
# Usage: run ./voice-typing.sh once to start recording, then again to stop and transcribe
# Dependencies: curl, perl, arecord, xdotool

set -euo pipefail
IFS=$'\n\t'

# Configuration
readonly PID_FILE="${HOME}/.recordpid"
readonly FILE="${HOME}/.voice-type/recording"
readonly MAX_DURATION=15
readonly AUDIO_INPUT='hw:0,0' # Use `arecord -l` to list available devices

start_recording() {
  mkdir -p "$(dirname "$FILE")"
  echo "Starting new recording..."
  nohup arecord --device="$AUDIO_INPUT" --format cd "$FILE.wav" --duration="$MAX_DURATION" &>/dev/null &
  echo $! >"$PID_FILE"
}

stop_recording() {
  echo "Stopping recording..."
  if [ -s "$PID_FILE" ]; then
    local pid
    pid=$(<"$PID_FILE")
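    # arecord was started by the previous invocation, so it is not a child of
    # this shell and `wait` cannot be used. Signal it (ignoring the failure if
    # it already exited), then poll until it is gone so the .wav is complete.
    kill "$pid" 2>/dev/null || true
    while kill -0 "$pid" 2>/dev/null; do sleep 0.1; done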
    rm -f "$PID_FILE"
    return 0
  fi
  echo "No recording process found."

}

write_transcript() {
  perl -pi -e 'chomp if eof' "$FILE.txt"
  xdotool type --clearmodifiers --file "$FILE.txt"
}

transcribe_with_openai() {
  curl --silent --fail --request POST \
    --url https://api.openai.com/v1/audio/transcriptions \
    --header "Authorization: Bearer $OPEN_AI_TOKEN" \
    --header 'Content-Type: multipart/form-data' \
    --form file="@$FILE.wav" \
    --form model=whisper-1 \
    --form response_format=text \
    -o "${FILE}.txt"
}

main() {
  if [[ -f "$PID_FILE" ]]; then
    stop_recording
    transcribe_with_openai
    write_transcript
  else
    start_recording
  fi
}

main

The script needs to be run twice: once to start recording and once to stop. On the second run it transcribes the recording and types the text into the current window.
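
The start/stop toggle hinges on the PID file, and there is a third case worth keeping in mind: the PID file exists but the recorder has already exited, for example because it reached the maximum duration. The sketch below is illustrative only (the function name and state names are made up); it uses kill -0, which checks whether a process exists without sending it a signal.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report the toggle state for a given PID file.
#   idle       no PID recorded, so the next press starts a recording
#   recording  the recorded PID is alive, so the next press stops it
#   finished   the recorder already exited, but its recording can still be used
recording_state() {
  local pid_file=$1 pid
  if [[ ! -s "$pid_file" ]]; then
    echo idle
    return
  fi
  pid=$(<"$pid_file")
  if kill -0 "$pid" 2>/dev/null; then
    echo recording
  else
    echo finished
  fi
}

# Usage:
#   recording_state "${HOME}/.recordpid"
```

Treating "finished" like "recording" (transcribe what was captured, then remove the PID file) avoids getting stuck with a stale PID file.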

To trigger it with a button, add a keyboard shortcut that runs the script in the keyboard settings of your Linux distribution.

[Image: example keyboard shortcut configuration]

Conclusion and next steps

It was a fun little project to do, and it’s quite useful to be able to type with your voice. I’m not sure I’ll keep using it after my arm heals, but it’s a nice option to have.

Since Whisper was a little slow, I tried Deepgram and switched to it for faster (and sometimes more accurate) transcriptions. I published the complete script on GitHub; you can find it here. It checks for requirements and handles errors a little more gracefully.
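
For reference, here is a rough sketch of what a Deepgram call could look like. It is not the published script. The endpoint, the "Token" authorization scheme, and the JSON path of the transcript are assumptions based on Deepgram's pre-recorded audio API, and DEEPGRAM_TOKEN is a made-up variable name for your API key. Unlike the OpenAI call with response_format=text, the response is JSON, so this version needs jq.

```shell
#!/usr/bin/env bash
# Pull the first transcript out of a Deepgram-style JSON response on stdin.
# The JSON path is an assumption about the response layout.
extract_transcript() {
  jq -r '.results.channels[0].alternatives[0].transcript'
}

# Send a WAV file to Deepgram and print the transcript on stdout.
# DEEPGRAM_TOKEN is a hypothetical environment variable holding the API key.
transcribe_with_deepgram() {
  local wav_file=$1
  curl --silent --fail --request POST \
    --url 'https://api.deepgram.com/v1/listen' \
    --header "Authorization: Token $DEEPGRAM_TOKEN" \
    --header 'Content-Type: audio/wav' \
    --data-binary "@$wav_file" |
    extract_transcript
}

# Usage:
#   transcribe_with_deepgram "$FILE.wav" >"$FILE.txt"
```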