UWebASR HTTP API
UWebASR HTTP API is a simple interface to a speech recognition engine. The input data can be passed directly within the HTTP request (POST method) or as a link to a file in the form of a URL (GET method). Recognition of a live audio stream from a given URL is supported as well. The output formats include plain text, machine-readable XML and JSON, and the WebVTT format for web captions. Recognition results (except the TRS format) are streamed continuously.
For UWebASR, use the following values of the variables:
app_id - use one of the following values to select the models provided by the UWebASR service:
- malach/en - English Wav2Vec 2.0 model (oral histories)
- malach/de - German Wav2Vec 2.0 model (oral histories)
- malach/cs - Czech Wav2Vec 2.0 model (oral histories)
- malach/sk - Slovak Wav2Vec 2.0 model (oral histories)
- generic/en - English Wav2Vec 2.0 model (generic speech)
- generic/de - German Wav2Vec 2.0 model (generic speech)
- generic/cs - Czech Wav2Vec 2.0 model (generic speech)
- generic/sk - Slovak Wav2Vec 2.0 model (generic speech)
In other words, the full HTTP API endpoint for Czech is https://uwebasr.zcu.cz/api/v2/lindat/malach/cs.
The UWebASR HTTP API uses the underlying SpeechCloud platform. The platform has an architecture employing a set of real-time workers; it is therefore possible that all workers are allocated and no further requests can be processed at the moment. This situation is indicated by the 503 HTTP status.
Each session employing a worker is limited to at most 3600 seconds; longer sessions are automatically terminated. Since the speech recognizer runs faster than real time in most situations, the maximum length of audio that can be processed in one API session is actually longer than 3600 seconds.
Citing the models
The English, German and Czech models should be cited as: Lehečka, J., Švec, J., Psutka, J.V., Ircing, P. (2023) Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech. Proc. INTERSPEECH 2023, 201-205, doi: 10.21437/Interspeech.2023-872
@inproceedings{lehecka23_interspeech,
author={Jan Lehečka and Jan Švec and Josef V. Psutka and Pavel Ircing},
title={{Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={201--205},
doi={10.21437/Interspeech.2023-872}
}
The Slovak model should be cited as: Lehečka, J., Psutka, J.V., Psutka, J. (2023). Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_29
@InProceedings{lehecka23_tsd,
author="Lehe{\v{c}}ka, Jan and Psutka, Josef V. and Psutka, Josef",
editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
title="Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak",
booktitle="Text, Speech, and Dialogue",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="328--338",
isbn="978-3-031-40498-6"
}
Request Format
HTTP API endpoint
The HTTP API is implemented using an endpoint in the form:
https://uwebasr.zcu.cz/api/v2/lindat/app_id
where app_id is the model name exported from the SpeechCloud platform. You must obtain the full URL including the app_id to start using the HTTP API.
HTTP GET
HTTP GET requires the parameter url, which specifies the URL of the audio file to be recognized. The file is downloaded to the server by the UWebASR HTTP API (the User-Agent header of the request is "SpeechCloud HTTP API") and processed. You can also specify the output format using the parameter format. The url parameter may also point to media streamed over HTTP; in this case, the HTTP API continuously processes the stream. If an error occurs during speech processing, it is signaled in the streamed output.
The GET parameter stream=1 instructs the HTTP API to continue recognition as long as the HTTP connection is open or the source URL is supplying audio data; otherwise, the recognition stops after the first recognition result.
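For example, a minimal GET request against the Czech model might look as follows (the audio URL here is a hypothetical placeholder):
curl -N 'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext&url=http://example.com/audio.mp3'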
HTTP POST
Sending data using HTTP POST to the URL:
https://uwebasr.zcu.cz/api/v2/lindat/app_id
where app_id is the model name exported from the SpeechCloud platform.
The body of the HTTP POST contains any audio data with a format supported by FFmpeg. The audio data are decoded on-the-fly during the upload and fed into the speech recognition engine.
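A minimal sketch of such a request, assuming a local file audio.wav (a hypothetical file name) and the Czech model:
curl -X POST -N --data-binary @audio.wav \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'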
HTTP Headers
The HTTP API response has the following additional headers:
- SpeechCloud-SessionID - a unique session identifier, useful for reporting API bugs.
- SpeechCloud-TimeRef - the date and time to which the remaining timestamps associated with the stream processing are related. The date and time format is HTTP Date/Time (https://tools.ietf.org/html/rfc7231#section-7.1.1.1).
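To inspect these headers, you can let curl print the response headers along with the body (a sketch; audio.wav is a hypothetical input file):
curl -i -X POST --data-binary @audio.wav \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'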
HTTP GET/POST with output format specification
The output format is specified by the parameter format in the request:
https://uwebasr.zcu.cz/api/v2/lindat/app_id?format=webvtt
Supported formats
Plaintext (format=plaintext)
Clear plaintext in UTF-8 encoding:
Content-Type: text/plain; charset=UTF-8
Transcriber XML (format=trs) [does not support streaming]
XML accepted by Transcriber (http://trans.sourceforge.net/en/presentation.php). The output TRS file contains the file name, which can be passed to the API using the Content-Disposition header:
Content-Type: text/xml
Content-Disposition: filename="foo.wav"
Extended Transcriber XML (format=extended_trs) [does not support streaming]
Transcriber XML extended with confidence scores; the rest is the same as the TRS format.
Content-Type: text/xml
WebVTT (format=webvtt)
Web captions with timestamps and transcripts, without confidence scores (https://www.w3.org/TR/webvtt1/).
Content-Type: text/webvtt
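An illustrative WebVTT output for a short recording might look like the following (the cue timing and text are invented for illustration, matching the JSON example below):
WEBVTT

00:00:00.060 --> 00:00:01.110
good day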
JSON (format=json)
A JSON array containing objects with timestamps, confidence, and word fields:
- start, end (number) - word beginning and end, times relative to the beginning of the audio file, or to the SpeechCloud-TimeRef header in the case of streamed media recognition (HTTP GET)
- word (string) - the recognized word
- confidence (number) - a number between 0 and 1 representing the confidence score of the recognized word, higher numbers are better
- speech_end (boolean) - binary value indicating the last word of the recognized input (input end) or the end of a recognized speech segment (based on voice activity).
The individual JSON objects corresponding to the recognized words are written as one JSON object per line forming a valid JSON array.
Example output (pretty-printed, does not have one JSON object per line):
Content-Type: application/json
[
{"start": 0.05999999865889549,
"end": 0.6299999859184027,
"word": "good",
"confidence": 0.9723399117549626,
"speech_end": false
},
{"start": 0.6299999859184027,
"end": 1.1099999751895666,
"word": "day",
"confidence": 0.9999999999999977,
"speech_end": true
}
]
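Since the output is a valid JSON array, it can be post-processed with standard tools. A minimal sketch using jq (assuming jq is installed; output.json is a hypothetical file holding the response above) joins the recognized words into a transcript:
jq -r '[.[].word] | join(" ")' output.json
# prints: good day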
SpeechCloud JSON (format=speechcloud_json)
Internal format containing all messages from the SpeechCloud platform. Suitable for further processing and integration into other platforms (it may also contain NLU results, etc.).
Content-Type: application/json; charset=UTF-8
The JSON file consists of a list of messages. Each message has a type property indicating the type of the message. The correct sequence of message types is the following:
- {"type": "asr_offline_started"} - indicates the start of the input transcoding process
- {"type": "asr_result", "partial_result": true, ...} - partial ASR result, useful for streaming purposes
- {"type": "asr_offline_stream_stats", ...} - sent periodically with statistics of the transcoding process
- {"type": "asr_offline_finished"} - indicates the end of the input transcoding process; now we just wait for the recognition of the remaining audio chunks
- {"type": "asr_result", "partial_result": false, ...} - final ASR result
- {"type": "asr_input_processed"} - all audio chunks were processed
If there is an error during the processing, the message {"type": "sc_error"} will appear in the message stream. If the input is not processed to its end, the asr_input_processed message will not appear in the message stream.
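For instance, a sketch using jq (assuming jq is available; session.json is a hypothetical file holding the speechcloud_json response) keeps only the final ASR results from the message list:
jq '[.[] | select(.type == "asr_result" and .partial_result == false)]' session.json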
Error states
Errors which are encountered are indicated in the response body according to the given output format. Errors occurring before writing the HTTP response are also indicated using the HTTP status code and status line (together with the indication in the response body).
Error codes
The error codes can be divided into several groups (a status-checking sketch follows the list):
- Errors caused by the source of (streamed) audio data, i.e. during recognition using HTTP GET. Such errors are reported using the HTTP status code of the upstream HTTP server providing the audio data.
- Errors occurring during processing of the audio data in the SpeechCloud platform. The errors are reported using HTTP 5xx codes, for example:
- 503 Cannot connect to SpeechCloud WebSocket URL - the backend server is not reachable from UWebASR HTTP API.
- 503 SpeechCloud WebSocket did not send any message - the backend server didn’t send any message in the specified timeout.
- 503 SpeechCloud WebSocket closed prior to initializing the ASR - the worker closed the communication channel before initializing the ASR.
- 500 Invalid SpeechCloud response - the response from the backend is corrupted.
- Other 500 statuses - the worker process reported an error described in the status.
- Errors in handling the UWebASR HTTP API request:
- 400 Unknown format - the request specifies an unknown format.
- 400 Empty audio data - there were no audio data from the given URL (empty response/non-existent upstream server).
- 410 No data received from HTTP stream for X seconds (threshold Y), closing... - the input audio stream was not closed, but no data was received for a long time.
- 500 No data received from SpeechCloud WebSocket for X seconds (Y), closing... - the worker process stopped sending messages.
- 503 Recognition delay X is greater than threshold Y, closing... - the speech recognizer was delayed by more than a given threshold (the worker hardware is overloaded).
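To detect such errors in a script, you can let curl report just the HTTP status code (a sketch; the url value is a hypothetical placeholder):
curl -s -o /dev/null -w '%{http_code}\n' \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext&url=http://example.com/audio.mp3'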
Example
An HTTP GET requesting recognition of a URL returning 404 (http://google.com/foo):
https://uwebasr.zcu.cz/api/v2/lindat/app_id?format=plaintext&stream=1&url=http://google.com/foo
returns HTTP status code 404 Not Found as provided by the upstream server handling the URL (google.com). The error code is also included in the HTTP response body according to the required output format:
plaintext
The HTTP status code and the reason are written on a new line in the output after the # symbol:
# 404 Not Found
trs, extended_trs
The HTTP status code and the reason are stored in the ErrorCode and ErrorReason elements in the output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans audio_filename="input.wav">
<Episode>
<ErrorCode>404</ErrorCode>
<ErrorReason>Not Found</ErrorReason>
</Episode>
</Trans>
webvtt
The errors in the WebVTT format are reported as comments after the NOTE label at the beginning of a line:
WEBVTT
NOTE Error 404 Not Found
json, speechcloud_json
The error code and status are reported in a JSON object embedded into the output JSON array of the HTTP response:
[
{"status_code": 404, "status_reason": "Not Found"}
]
Examples
The following examples use the curl command-line utility as a common HTTP client available almost everywhere.
To recognize the audio file test_wav/i_i2.wav in a local directory:
curl -X POST -N \
--data-binary @test_wav/i_i2.wav \
'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'
To recognize a live audio stream from http://icecast8.play.cz/cro1-128.mp3 into the JSON format, use (note the stream=1 parameter):
curl -X GET -N "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs"\
"?format=json&stream=1&url=http://icecast8.play.cz/cro1-128.mp3"
Shell script uwebasr.sh
You can use the following convenience shell script for automatic processing of input files. It depends on the ffmpeg and curl utilities installed in your system. It processes the input into the SpeechCloud JSON format and then converts it into the TXT and VTT formats. The files with the .s.txt and .s.vtt suffixes also contain information about short and long pauses.
#!/bin/bash
set -o nounset
set -o errexit
set -o pipefail

# Language code (en, de, cs, sk) selects the malach model on the endpoint.
LANG=${1:?Please, pass LANG as \$1}
URL="https://uwebasr.zcu.cz/api/v2/lindat/malach/${LANG}"
CONV_URL="https://uwebasr.zcu.cz/utils/v2/convert-speechcloud-json"
shift

# At least one input file must be given after the language code.
x=${1:?Please, specify one or more input files}

for INPUT_FILE in "$@"; do
    # Derive the output file names from the input file name.
    JSON_FILE=${INPUT_FILE%.*}.json
    TXT_FILE=${INPUT_FILE%.*}.txt
    STXT_FILE=${INPUT_FILE%.*}.s.txt
    VTT_FILE=${INPUT_FILE%.*}.vtt
    SVTT_FILE=${INPUT_FILE%.*}.s.vtt

    echo "=== Recognizing to raw JSON: $JSON_FILE"
    # Re-encode to 16 kHz mono MP3 on the fly and stream it to the recognizer.
    ffmpeg -hide_banner -loglevel error -i "$INPUT_FILE" -ar 16000 -ac 1 -q:a 1 -f mp3 - |\
        curl --http1.1 --data-binary @- "${URL}?format=speechcloud_json" > "$JSON_FILE"

    # Convert the SpeechCloud JSON into the remaining output formats.
    echo "=== Converting to plaintext: $TXT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext" > "$TXT_FILE"
    echo "=== Converting to sentext: $STXT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext&sp=0.3&pau=2.0" > "$STXT_FILE"
    echo "=== Converting to WebVTT: $VTT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=webvtt" > "$VTT_FILE"
    echo "=== Converting to SentVTT: $SVTT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=sentvtt&sp=0.3&pau=2.0" > "$SVTT_FILE"
done
Shell script usage
To recognize the file test_wav/i_i2.wav using the Czech model, simply use:
uwebasr.sh cs test_wav/i_i2.wav
The output should look like:
=== Recognizing to raw JSON: test_wav/i_i2.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56365 0 1204 100 55161 56 2604 0:00:21 0:00:21 --:--:-- 312
=== Converting to plaintext: test_wav/i_i2.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1302 0 98 100 1204 971 11938 --:--:-- --:--:-- --:--:-- 14152
=== Converting to sentext: test_wav/i_i2.s.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1307 0 103 100 1204 1029 12028 --:--:-- --:--:-- --:--:-- 14206
=== Converting to WebVTT: test_wav/i_i2.vtt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1402 0 198 100 1204 1879 11430 --:--:-- --:--:-- --:--:-- 14604
=== Converting to SentVTT: test_wav/i_i2.s.vtt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1413 0 209 100 1204 2024 11663 --:--:-- --:--:-- --:--:-- 15031
You can also pass more than one file:
uwebasr.sh cs test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav
For other languages, change the first parameter (en, de, cs, sk):
uwebasr.sh en test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav