Whisper AI self hosted complete installation guide
Complete guide to download, set up, and run Whisper AI on a Linux Mint (Ubuntu) laptop. Have tried the tiny, base and smalle models, and they are close in speed to Whisper AI. The small model is a little slower in my testing but very accurate. Note: this setup also gives you the ability to explicitly set the number of processors to work on the translation part. These instructions explicitly set it to 8 threads, which you can change to whatever you want.
NOTE: IF YOU JUST WANT TO INSTALL WHISPER LOCALLY AND HAVE IT WORK SIMILARLY TO WHISPER AI ONLINE, SKIP TO Section 5. SECTIONS 1-4 DOCUMENT THE PROCESS THAT I WENT THROUGH FROM KNOWING NOTHING ABOUT SELF HOSTING A WHISPER VERSION TO GETTING A FUNCTIONING VERSION RUNNING ON MY LAPTOP.
The benefits to using a locally hosted Whisper AI dictation model are that your data is locally hosted, and that it can also be used in every application that you would normally type in - something Whisper AI online cannot do. Things that do not work in Whisper AI, like Libra Office Writer or this program, VS Code, a simple text editor or a terminal, Whisper Local will allow you to dictate and transcribe to all those applications as well.
Local Whisper Dictation Setup (Linux Mint 22.2)
Project: Install Whisper AI locally, so it can be used in any application running on management PC. Note: After installing and running Whisper Local, I realized that all the many servers available on the home network through my laptop (SSH terminals and web access pages) would also have Whisper Local available for dictation and transcription - through the laptop since that is how I access them all. This was a bonus.
Project Goals:
- Install needed tools
- Set up initial working dictation to text
- Real-time continuous dictation
- Type directly into the active window
- Global hotkey activation
- Test Better Whisper models - like the base and small models in addition to the tiny model
0. (Optional but recommended) Update system
Not strictly required for Whisper, but recommended as a first step.
1
sudo apt update && sudo apt upgrade
1. Verify Python and pip
Check if python3 and pip exist:
1
python3 --version && pip --version
Expected results:
python3should be present (Python 3.12.3 or similar)pipmay be missing
If pip is missing, install python3-pip:
1
sudo apt install -y python3-pip
2. Install pipx (because of PEP 668 / externally-managed-env)
Direct pip install into system Python will give this error:
1
error: externally-managed-environment … See PEP 668 …
Correct response: Use pipxso we don’t break Mint’s system Python.
Install pipx:
1
sudo apt install -y pipx
3. Install Whisper (ctranslate2) via pipx
This gives you a fast, local Whisper CLI in an isolated environment:
1
pipx install whisper-ctranslate2
After this, the whisper-ctranslate2 command is globally available.
4. Install audio + clipboard tools
We need:
arecord(fromalsa-utils) to record from your micxclipto push the transcript into the X11 clipboard
Install both:
1
sudo apt install -y alsa-utils xclip
These may already be installed on your system, but this ensures they’re present.
5. Ensure you have a ~/Scripts directory
Create the directory if it doesn’t exist:
1
mkdir -p ~/Scripts
All Whisper scripts will be stored here.
6. Create the dictation script
Open the script in nano:
1
nano ~/Scripts/whisper_dictate.sh
Paste this exact content:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#!/usr/bin/env bash
set -e
DURATION="${1:-30}" # seconds to record; override with: ./whisper_dictate.sh 45
OUT_WAV="/tmp/whisper_dictate.wav"
OUT_TXT="/tmp/whisper_dictate.txt"
echo "Recording for $DURATION seconds... (speak now)"
arecord -q -f cd -t wav -d "$DURATION" -r 16000 -c 1 "$OUT_WAV"
echo "Transcribing with Whisper (tiny, English)..."
whisper-ctranslate2 \
--model tiny \
--language en \
--output_format txt \
--output_dir /tmp \
"$OUT_WAV"
# If the expected output file doesn't exist, try to locate it
if [ ! -f "$OUT_TXT" ]; then
CANDIDATE="$(ls /tmp | grep -E '^whisper_dictate.*\.txt$' | head -n 1 2>/dev/null || true)"
if [ -n "$CANDIDATE" ]; then
OUT_TXT="/tmp/$CANDIDATE"
fi
fi
if [ -f "$OUT_TXT" ]; then
xclip -selection clipboard < "$OUT_TXT"
echo
echo "Transcript (also copied to clipboard):"
echo "--------------------------------------"
cat "$OUT_TXT"
echo
echo "You can now paste into ChatGPT, Claude, Word, LibreOffice, Google Docs, etc."
else
echo "Error: transcript file not found." >&2
fi
Then in nano:
- Ctrl+O, Enter to save
- Ctrl+X to exit
7. Make the script executable
1
chmod +x ~/Scripts/whisper_dictate.sh
8. Usage: test 5-second dictation
From any terminal:
1
~/Scripts/whisper_dictate.sh 5
What this does:
- Records 5 seconds from your default microphone:
1
Recording for 5 seconds... (speak now)
Runs local Whisper (
tinymodel, English) viawhisper-ctranslate2.Finds the output
.txtfile in/tmp.Copies transcript into the clipboard with
xclip.Prints transcript to the terminal, e.g.:
1
2
3
Transcript (also copied to clipboard):
--------------------------------------
Testing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
You can then paste into:
- ChatGPT / Claude / other web chat boxes
- LibreOffice Writer
- Word (via Wine or on another machine if clipboard shared)
- Google Docs in the browser
Section 2 – Continuous, Chunked Dictation with Whisper
This adds a second script that:
- Records in repeating chunks (default: 10 seconds each)
- Transcribes each chunk with Whisper locally
- Appends each chunk to a session transcript
- Copies the full running transcript to your clipboard after every chunk
- Stops cleanly with Ctrl+C
All scripts live in ~/Scripts.
2.1 Create whisper_continuous.sh
Open a new script in nano:
1
nano ~/Scripts/whisper_continuous.sh
Paste this content into the file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
#!/usr/bin/env bash
set -e
# Usage:
# ./whisper_continuous.sh # 10s chunks, tiny model
# ./whisper_continuous.sh 15 # 15s chunks, tiny model
# ./whisper_continuous.sh 10 base # 10s chunks, base model
#
# Press Ctrl+C to stop recording/transcribing.
CHUNK_SECONDS="${1:-10}" # length of each recording chunk
MODEL="${2:-tiny}" # whisper model: tiny, base, small, medium, large-v3, etc.
OUT_DIR="/tmp/whisper_continuous"
mkdir -p "$OUT_DIR"
SESSION_TS="$(date +%Y%m%d_%H%M%S)"
SESSION_TXT="${OUT_DIR}/session_${SESSION_TS}.txt"
echo "Whisper continuous dictation"
echo " Chunk length : ${CHUNK_SECONDS} seconds"
echo " Model : ${MODEL}"
echo " Session file : ${SESSION_TXT}"
echo
echo "Press Ctrl+C at any time to stop."
echo
trap 'echo; echo "Stopping. Final transcript saved at: ${SESSION_TXT}"; exit 0' INT
i=1
while true; do
WAV="${OUT_DIR}/chunk_${SESSION_TS}_${i}.wav"
TXT="${OUT_DIR}/chunk_${SESSION_TS}_${i}.txt"
echo
echo "Chunk #${i} - recording ${CHUNK_SECONDS} seconds... (speak now)"
arecord -q -f cd -t wav -d "${CHUNK_SECONDS}" -r 16000 -c 1 "${WAV}"
echo "Transcribing chunk #${i} with Whisper..."
whisper-ctranslate2 \
--model "${MODEL}" \
--language en \
--output_format txt \
--output_dir "${OUT_DIR}" \
"${WAV}"
if [ -f "${TXT}" ]; then
CHUNK_TEXT="$(cat "${TXT}")"
echo "${CHUNK_TEXT}" >> "${SESSION_TXT}"
# Copy entire running transcript to clipboard
xclip -selection clipboard < "${SESSION_TXT}"
echo
echo "Last chunk:"
echo "-----------"
echo "${CHUNK_TEXT}"
echo
echo "Running transcript (also copied to clipboard):"
echo "----------------------------------------------"
cat "${SESSION_TXT}"
echo
echo "(You can paste this into ChatGPT, Claude, Word, LibreOffice, Google Docs, etc.)"
else
echo "Warning: transcript file not found for chunk #${i} (${TXT})." >&2
fi
i=$((i + 1))
done
Save and exit:
- Ctrl+O, Enter
- Ctrl+X
2.2 Make the script executable
1
chmod +x ~/Scripts/whisper_continuous.sh
2.3 Usage examples
Default continuous dictation (10s chunks, tiny model):
1
~/Scripts/whisper_continuous.sh
- Records chunk #1 for 10 seconds, transcribes, prints:
- “Last chunk” (the latest 10 seconds)
- “Running transcript” (all chunks so far)
- Copies the full running transcript to the clipboard.
- Immediately starts chunk #2, then #3, etc.
- Press Ctrl+C to stop.
- Final transcript is saved as:
1
/tmp/whisper_continuous/session_YYYYMMDD_HHMMSS.txt
Change chunk length (e.g., 15 seconds):
1
~/Scripts/whisper_continuous.sh 15
Use a different Whisper model (e.g., base instead of tiny):
1
~/Scripts/whisper_continuous.sh 10 base
Section 3 — Whisper Dictation That Types Directly Into the Active Window
This script allows Whisper to “type” the transcribed text into whatever window you have focused. Works with: ChatGPT, Claude, LibreOffice, Word, Google Docs, any browser field, any terminal, anything with a text cursor.
3.1 Install typing tool: xdotool
1
sudo apt install -y xdotool
(This will show “already the newest version” if previously installed.)
3.2 Create the typing dictation script
Open the script:
1
nano ~/Scripts/whisper_type.sh
Paste the following content:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#!/usr/bin/env bash
set -e
# Usage:
# ./whisper_type.sh # 10s, tiny model
# ./whisper_type.sh 5 # 5s, tiny model
# ./whisper_type.sh 8 base # 8s, base model
#
# IMPORTANT:
# - Click in the target window (ChatGPT, Claude, Word, LibreOffice, Google Docs, etc.)
# - Make sure the text cursor is where you want the text
# - Then run this script from a terminal
# - When recording finishes, the script will "type" the text into the active window
DURATION="${1:-10}" # seconds to record
MODEL="${2:-tiny}" # whisper model (tiny, base, small, etc.)
OUT_WAV="/tmp/whisper_type.wav"
OUT_TXT="/tmp/whisper_type.txt"
echo "Recording for ${DURATION} seconds... (speak now)"
arecord -q -f cd -t wav -d "${DURATION}" -r 16000 -c 1 "${OUT_WAV}"
echo "Transcribing with Whisper (${MODEL}, English)..."
whisper-ctranslate2 \
--model "${MODEL}" \
--language en \
--output_format txt \
--output_dir /tmp \
"${OUT_WAV}"
if [ ! -f "${OUT_TXT}" ]; then
# Try to locate a matching txt file if the expected one isn't there
CANDIDATE="$(ls /tmp | grep -E '^whisper_type.*\.txt$' | head -n 1 2>/dev/null || true)"
if [ -n "${CANDIDATE}" ]; then
OUT_TXT="/tmp/${CANDIDATE}"
fi
fi
if [ -f "${OUT_TXT}" ]; then
TEXT="$(cat "${OUT_TXT}")"
echo
echo "Transcript:"
echo "-----------"
echo "${TEXT}"
echo
echo "Typing transcript into the active window using xdotool..."
echo "(Make sure your cursor is in the target text box.)"
echo
# Type the text into the currently active window
xdotool type --delay 0 "${TEXT}"
echo
echo "Done."
else
echo "Error: transcript file not found (${OUT_TXT})." >&2
fi
Save and exit:
- Ctrl+O, Enter
- Ctrl+X
3.3 Make it executable
1
chmod +x ~/Scripts/whisper_type.sh
3.4 Usage
- Open any text box anywhere (ChatGPT, Claude, LibreOffice, Google Docs, etc.).
- Run the script from your terminal:
1
~/Scripts/whisper_type.sh 5
- Immediately click into your target window and leave the text cursor there.
- Speak while it records.
- When recording completes, Whisper transcribes and
xdotooltypes your words directly into the active window.
Test results:
- One 5-second recording
- Whisper correctly transcribed
- The script printed the text in your terminal
- Then it successfully typed the same text into your target application
This confirms Section 3 is complete.
Section 4 – Global Hotkey for Whisper Typing (Ctrl+Alt+W)
This section wires your whisper_type.sh script to a global keyboard shortcut, so you can start a short dictation and have it typed into the active window with a single key combo.
4.1 Ensure the script exists and is executable
(Already done in Section 3, but for completeness.)
1
2
nano ~/Scripts/whisper_type.sh
chmod +x ~/Scripts/whisper_type.sh
The whisper_type.sh script content is documented in Section 3.2 above.
4.2 Open Keyboard Shortcuts
From a terminal:
1
cinnamon-settings keyboard
Then:
- Go to the Shortcuts tab.
- In the left “Categories” pane, click Custom Shortcuts.
4.3 Add the custom shortcut
- Click Add custom shortcut.
- In the dialog:
- Name:
1
Whisper Type
- Command:
1
bash -lc "~/Scripts/whisper_type.sh 10"
- Click Add.
You will now see Whisper Type in the list, with unassigned as its keybinding.
4.4 Assign Ctrl+Alt+W
- Click on the word unassigned next to Whisper Type.
- When it says Pick an accelerator…, press:
1
Ctrl + Alt + W
- It should now display: Ctrl+Alt+W.
This finishes the hotkey setup for the one-shot typing script.
4.5 Using the hotkey
- Click into any text box (ChatGPT, Claude, LibreOffice, Google Docs, etc.) so the text cursor is where you want the dictation to appear.
- Press Ctrl+Alt+W.
- Speak for ~10 seconds (or whatever duration you configured).
- When the script finishes:
- Whisper transcribes your audio locally.
xdotooltypes the text directly into the active window.
- The script exits automatically — no separate “stop” key is required for this short, one-shot mode.
Section 5 – Whisper-AI-Style Dictation (Start/Stop Hotkeys)
##This section creates a true Whisper-AI-style dictation system with two separate hotkeys:##
- Ctrl+Alt+Q = Start recording
- Ctrl+Alt+Z = Stop recording and immediately transcribe
Unlike the previous sections which used fixed-duration recording, this system lets you:
- Record for as long as you need (up to 60 seconds max)
- Stop whenever you’re done speaking
- Get instant transcription typed into your active window
- Hear audio feedback (bell sounds) when starting and stopping
5.1 Create the START script
Open a new script in nano:
1
nano ~/Scripts/start_whisper_simple.sh
Paste this content:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
#!/usr/bin/env bash
set -e
# Start recording (Option A - Whisper-AI style)
# Bound to: Ctrl+Alt+Q
PIDFILE="/tmp/whisper_recording.pid"
AUDIO_FILE="/tmp/whisper_recording.wav"
# Don't start if already recording
if [ -f "$PIDFILE" ] && ps -p "$(cat "$PIDFILE")" >/dev/null 2>&1; then
echo "Already recording."
exit 0
fi
# Play start sound (optional - comment out if you don't want it)
paplay --volume=32768 /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || true
# Start recording in background (max 60 seconds)
# -f cd = CD quality
# -t wav = WAV format
# -d 60 = max 60 seconds (will be killed early by stop script)
# -r 16000 = 16kHz sample rate (what Whisper expects)
# -c 1 = mono
arecord -q -f cd -t wav -d 60 -r 16000 -c 1 "$AUDIO_FILE" &
PID=$!
echo "$PID" > "$PIDFILE"
echo "Recording started (PID: $PID)"
Save and exit:
- Ctrl+O, Enter
- Ctrl+X
Make it executable:
1
chmod +x ~/Scripts/start_whisper_simple.sh
What this script does:
- Checks if a recording is already in progress (prevents double-recording)
- Plays a bell sound at 50% volume as audio feedback
- Starts
arecordin the background to record audio (max 60 seconds) - Saves the process ID (PID) so the stop script can kill it
- Exits immediately, leaving recording running in background
5.2 Create the STOP script
Open a new script in nano:
1
nano ~/Scripts/stop_whisper_simple.sh
Paste this content:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
#!/usr/bin/env bash
set -e
export HF_HUB_OFFLINE=1
# Stop recording and transcribe (Option A - Whisper-AI style)
# Bound to: Ctrl+Alt+Z
PIDFILE="/tmp/whisper_recording.pid"
AUDIO_FILE="/tmp/whisper_recording.wav"
TEXT_FILE="/tmp/whisper_recording.txt"
# Check if recording is running
if [ ! -f "$PIDFILE" ]; then
echo "No recording in progress."
exit 0
fi
PID="$(cat "$PIDFILE")"
# Kill the recording process if it's still running
if ps -p "$PID" >/dev/null 2>&1; then
kill "$PID" 2>/dev/null || true
sleep 0.2 # Give arecord time to finish writing the file
fi
rm -f "$PIDFILE"
# Play stop sound (optional - comment out if you don't want it)
paplay --volume=32768 /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || true
# Check if we have an audio file to transcribe
if [ ! -f "$AUDIO_FILE" ] || [ ! -s "$AUDIO_FILE" ]; then
echo "No audio file found or file is empty."
exit 1
fi
echo "Transcribing..."
# Transcribe with Whisper
whisper-ctranslate2 \
--model base \
--language en \
--device cpu \
--threads 8 \
--output_format txt \
--output_dir /tmp \
"$AUDIO_FILE" >/dev/null 2>&1
# Check if transcription succeeded
if [ ! -f "$TEXT_FILE" ]; then
echo "Error: Transcription failed (no text file generated)."
exit 1
fi
# Get the transcript
TRANSCRIPT="$(cat "$TEXT_FILE")"
if [ -z "$TRANSCRIPT" ]; then
echo "Warning: Transcript is empty."
exit 0
fi
echo "Transcript: $TRANSCRIPT"
echo "Typing into active window..."
# Type into the active window
xdotool type --delay 0 "$TRANSCRIPT"
# Clean up temp files
rm -f "$AUDIO_FILE" "$TEXT_FILE"
echo "Done."
Save and exit:
- Ctrl+O, Enter
- Ctrl+X
Make it executable:
1
chmod +x ~/Scripts/stop_whisper_simple.sh
What this script does:
- Checks if a recording is in progress
- Kills the
arecordprocess to stop recording - Plays a bell sound as audio feedback
- Verifies the audio file exists and has content
- Transcribes using Whisper with the cached model (offline mode)
- Types the transcript into whatever window has focus
- Cleans up all temporary files
Critical performance fix: The line export HF_HUB_OFFLINE=1 forces Whisper to use the cached model without checking Hugging Face servers online. Without this line, transcription takes 25+ seconds. With it, transcription takes 2-3 seconds.
5.3 Test the scripts manually (before hotkeys)
Before binding to hotkeys, test manually to verify everything works:
- Open any text editor (gedit, xed, LibreOffice Writer, etc.)
- Click in the text field so your cursor is positioned
- In a terminal, run:
1
~/Scripts/start_whisper_simple.sh
You should hear a bell sound and see: Recording started (PID: xxxx)
- Speak for 5-10 seconds
- Run:
1
~/Scripts/stop_whisper_simple.sh
You should:
- Hear another bell sound
- See:
Transcribing... - See:
Transcript: [your words] - See:
Typing into active window... - See the text appear in your text editor
If this works, proceed to hotkey setup.
5.4 Set up keyboard shortcuts
Open Keyboard Settings from the terminal:
1
cinnamon-settings keyboard
Then:
- Click the Shortcuts tab
- In the left “Categories” pane, click Custom Shortcuts
- Click Add custom shortcut
Create the START shortcut
In the dialog:
- Name:
Whisper Start Recording - Command:
bash -lc "~/Scripts/start_whisper_simple.sh" - Click Add
Now bind it to a key:
- Click on the word unassigned next to “Whisper Start Recording”
- When prompted, press: Ctrl + Alt + Q
- You should see it change to:
Ctrl+Alt+Q
Create the STOP shortcut
Click Add custom shortcut again.
In the dialog:
- Name:
Whisper Stop and Transcribe - Command:
bash -lc "~/Scripts/stop_whisper_simple.sh" - Click Add
Now bind it to a key:
- Click on the word unassigned next to “Whisper Stop and Transcribe”
- When prompted, press: Ctrl + Alt + Z
- You should see it change to:
Ctrl+Alt+Z
Close the Keyboard Settings window. Your hotkeys are now active.
5.5 Using the hotkey dictation system
Basic workflow:
- Click in any text field where you want the dictation to appear:
- ChatGPT or Claude web interface
- LibreOffice Writer
- Google Docs
- Any text editor
- Any browser text field
- Terminal applications
- Press Ctrl+Alt+Q to start recording
- You’ll hear a bell sound
- Recording begins immediately
- You can speak for up to 60 seconds
- When done speaking, press Ctrl+Alt+Z to stop
- You’ll hear another bell sound
- Whisper transcribes your audio (1-2 seconds)
- Text is automatically typed into your active window
- All temp files are cleaned up
Important notes:
- The focus must remain in your target text field for the typing to work correctly
- If you switch windows after pressing Ctrl+Alt+Z, the text will type into whatever window has focus when transcription completes
- Maximum recording time is 60 seconds (script will auto-stop if you reach this limit)
- You can record for as little as 1 second - there’s no minimum
5.6 Performance notes
Transcription speed:
- 10 seconds of speech = ~2.0 seconds to transcribe
- 30 seconds of speech = ~3.0-4.0 seconds to transcribe
- 60 seconds of speech = ~6.0-8.0 seconds to transcribe
These times are for the base model on CPU with 8 threads.
Why offline mode is critical:
The line export HF_HUB_OFFLINE=1 in the stop script prevents Whisper from checking Hugging Face servers online before transcribing. Without this line:
- Transcription takes 25-35 seconds (unacceptable)
- Only ~6 seconds of actual CPU work, rest is network overhead
- Multiple CPU cores sit idle
With offline mode enabled:
- Transcription takes 2-3 seconds (excellent)
- CPU cores properly utilized during transcription
- No network calls = faster and more reliable
The Whisper model is already cached at:
1
~/.cache/huggingface/hub/models--Systran--faster-whisper-base/
Offline mode simply tells Whisper to use this cache without verifying online.
5.7 Adjusting bell sound volume
The bell sounds use --volume=32768 which is 50% system volume. If this is:
Too loud: Change both scripts to use a lower volume:
1
paplay --volume=16384 /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || true
Too quiet: Change both scripts to use a higher volume:
1
paplay --volume=49152 /usr/share/sounds/freedesktop/stereo/bell.oga 2>/dev/null || true
Remove sounds entirely: Comment out or delete the paplay lines in both scripts.
The volume scale is:
- 0 = silent
- 32768 = 50% volume
- 65536 = 100% volume
5.8 Cleanup: Remove old test scripts
If you followed earlier sections, you may have old test scripts that are no longer needed. Keep only these three Whisper scripts:
Scripts to KEEP:
start_whisper_simple.sh- START recording (Ctrl+Alt+Q)stop_whisper_simple.sh- STOP and transcribe (Ctrl+Alt+Z)whisper_type.sh- Original one-shot script (useful for manual testing)
Scripts to DELETE (if they exist):
1
2
cd ~/Scripts
rm -f start_whisper.sh stop_whisper.sh whisper_continuous.sh whisper_toggle.sh whisper_forever_test.sh whisper_60s.sh whisper_dictate.sh
Clean up old temp directories:
1
rm -rf /tmp/whisper_continuous
The new scripts automatically clean up their own temp files in /tmp after each use.
5.9 Troubleshooting
Problem: Bell sound doesn’t play
Check if the sound file exists:
1
ls -l /usr/share/sounds/freedesktop/stereo/bell.oga
If missing, you can either:
- Install the freedesktop sound theme:
sudo apt install sound-theme-freedesktop - Or comment out the
paplaylines in both scripts
Problem: “Already recording” message when starting
A stale PID file exists. Clean it up:
1
rm -f /tmp/whisper_recording.pid
Problem: Text types into wrong window
Make sure your target text field has focus when you press Ctrl+Alt+Z. The text types into whatever window is active when transcription completes.
Problem: Transcription takes 20+ seconds
The HF_HUB_OFFLINE=1 line is missing from the stop script. Edit the script:
1
nano ~/Scripts/stop_whisper_simple.sh
Make sure this line appears near the top (around line 3):
1
export HF_HUB_OFFLINE=1
Problem: No sound recorded
Check your default audio device:
1
arecord -l
If your microphone is not the default device, you may need to specify the device in the start script. See arecord --help for device selection options.
5.10 Future enhancements
Possible improvements to consider:
Use an even better model: Change --model base to --model small in the stop script for even better transcription accuracy (adds ~2-3 seconds to transcription time). Note: You’re currently using the base model which offers a good balance of speed and accuracy.
Add visual notifications: Replace or supplement the bell sounds with desktop notifications showing “Recording started” and “Transcribing…” messages.
Increase max recording time: Change -d 60 to -d 120 in the start script to allow 2-minute recordings instead of 60 seconds.
Add a cancel key: Create a third script that kills recording without transcribing, bound to something like Ctrl+Alt+X.
Summary
You now have a complete local Whisper dictation system with multiple usage modes:
- Clipboard mode -
whisper_dictate.shrecords and copies to clipboard - Chunked mode -
whisper_continuous.shrecords continuous chunks - One-shot typing -
whisper_type.shor Ctrl+Alt+W for 10-second quick dictation - Whisper-AI-style - Ctrl+Alt+Q (start) and Ctrl+Alt+Z (stop) for flexible dictation
All scripts use local processing - no internet required for transcription. The system works in any application on your Linux Mint system.
Additional Notes: Switching Models and Performance Tips
Switching to a Different Whisper Model
If you want to experiment with different Whisper models (tiny, base, small, medium, or large-v3), you can do so by editing your transcription script (stop_whisper_simple.sh).
Look for the --model line in the transcription command:
1
2
3
4
whisper-ctranslate2 \
--model base \
--language en \
...
To try a different model, simply change the --model argument. For example:
1
--model small
First-Time Use of a New Model
Each model must be downloaded the first time you use it. Because these models are large (e.g. small is ~244 MB, medium is ~769 MB), the first run will take longer while the model is downloaded from Hugging Face.
To allow this download, make sure offline mode is disabled during that first run:
1
# export HF_HUB_OFFLINE=1
Once the model is cached locally, you can re-enable offline mode:
1
export HF_HUB_OFFLINE=1
Transcription with the new model will then be just as fast as your previous one, with better accuracy depending on the model size.
Comparing Model Performance (Optional)
If you’re interested in comparing how different models perform on your system, you can add timing logic to your script.
Here’s an example using date:
1
2
3
4
5
6
7
8
9
10
11
12
START=$(date +%s)
whisper-ctranslate2 \
--model small \
--language en \
--threads 8 \
--output_format txt \
--output_dir /tmp \
"$AUDIO_FILE"
END=$(date +%s)
echo "Transcription took $((END - START)) seconds"
This will show how long each transcription takes, making it easier to compare different models under the same conditions.
Observing CPU Usage
You may notice that not all CPU cores are fully used, even if you set --threads 8. This is normal:
- Whisper processing is done in stages; not all stages can use all threads at once.
- Shorter audio clips and simpler speech may not utilize all available cores.
- Longer recordings (30+ seconds) or larger models (
medium,large-v3) tend to scale CPU usage more effectively.
To monitor live CPU usage:
1
htop
Or:
1
top
Look for how many cores spike during transcription, and how long they stay active.
Model Sizes and Tradeoffs
| Model | Size (approx) | Accuracy | Speed (CPU) |
|---|---|---|---|
tiny | 39 MB | Lowest | Fastest |
base | 74 MB | Better | Very fast |
small | 244 MB | Good | Moderate |
medium | 769 MB | Very good | Slower |
large-v3 | 1.5 GB | Best | Slowest |
Choose a model based on your balance of speed vs. transcription quality. For real-time use, base or small offer the best trade off.