    The UWebASR HTTP API is a simple interface to a speech recognition engine. The input data can be passed directly within the HTTP request (POST method) or as a link to a file in the form of a URL (GET method). Live audio stream recognition from a given URL is also supported. The output formats include plain text, machine-readable XML and JSON, and WebVTT for web captions. Recognition results (except in the TRS format) are streamed continuously.

    For UWebASR, use the following variable values:

    In other words, the full HTTP API endpoint for Czech is https://uwebasr.zcu.cz/api/v2/lindat/malach/cs.

    The UWebASR HTTP API is built on the SpeechCloud platform, whose architecture employs a set of real-time workers. It is therefore possible that all workers are allocated and no further requests can be processed at the moment. This situation is indicated by the HTTP status 503.
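
    A client can treat the 503 status as a signal to wait and try again. The following is a minimal sketch of such a retry loop, not part of the UWebASR API itself: the function name retry_on_503, the number of attempts, and the delay are illustrative assumptions. It expects a command that prints only the HTTP status code, which curl can do with the -w '%{http_code}' option.

```shell
# Hypothetical helper (not part of UWebASR): rerun a command while it
# reports HTTP status 503, i.e. while all workers are allocated.
# Usage: retry_on_503 MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# COMMAND must print only the HTTP status code on stdout.
retry_on_503() {
    local max_attempts delay attempt status
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        status=$("$@") || true
        # Any status other than 503 is passed through to the caller.
        if [ "$status" != "503" ]; then
            printf '%s\n' "$status"
            return 0
        fi
        # All workers busy: pause before the next attempt, except after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    printf '503\n'
    return 1
}
```

    For example, retry_on_503 5 30 curl -s -o result.txt -w '%{http_code}' --data-binary @test_wav/i_i2.wav "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext" would try up to five times, pausing 30 seconds between attempts.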

    Each session employing a worker is limited to at most 3600 seconds; longer sessions are automatically terminated. Since the speech recognizer runs faster than real time in most situations, the maximum length of audio that can be processed through the API is actually longer than 3600 seconds.

    Citing the models

    The English, German and Czech models should be cited as: Lehečka, J., Švec, J., Psutka, J.V., Ircing, P. (2023) Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech. Proc. INTERSPEECH 2023, 201-205, doi: 10.21437/Interspeech.2023-872

      author={Jan Lehečka and Jan Švec and Josef V. Psutka and Pavel Ircing},
      title={{Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech}},
      booktitle={Proc. INTERSPEECH 2023},

    The Slovak model should be cited as: Lehečka, J., Psutka, J.V., Psutka, J. (2023). Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_29

      author="Lehe{\v{c}}ka, Jan and Psutka, Josef V.  and Psutka, Josef",
      editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
      title="Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak",
      booktitle="Text, Speech, and Dialogue",
      publisher="Springer Nature Switzerland",

    Request Format

    HTTP API endpoint

    The HTTP API is implemented using an endpoint in the form:


    Where app_id is the model name exported from the SpeechCloud platform. You must obtain the full URL including the app_id to start using the HTTP API.


    HTTP GET requires the parameter url, which specifies the URL of the audio file to be recognized. The UWebASR HTTP API downloads this file to the server (the User-Agent header of the download request is "SpeechCloud HTTP API") and processes it. You can also specify the output format using the parameter format. The URL may also point to media streamed over HTTP; in this case, the HTTP API processes the stream continuously. If an error occurs during speech processing, it is signaled in the streamed output.

    The GET parameter stream=1 instructs the HTTP API to continue recognition as long as the HTTP connection is open or the source URL is supplying audio data; otherwise, the recognition stops after the first recognition result.
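
    The GET parameters described above can be assembled into a request URL as in the following sketch. The helper names urlencode and build_get_url are illustrative and not part of the API; percent-encoding the audio URL is an assumption that matters mainly when that URL itself contains characters such as & or ?.

```shell
# Percent-encode a string for use as a query parameter value.
# A sketch for ASCII input, not a complete RFC 3986 encoder.
urlencode() {
    local rest char out
    rest=$1
    out=""
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}   # first character of the remaining input
        rest=${rest#?}             # drop that character
        case $char in
            [A-Za-z0-9._~-]) out="$out$char" ;;
            *) out="$out$(printf '%%%02X' "'$char")" ;;
        esac
    done
    printf '%s\n' "$out"
}

# Build the GET request URL from the endpoint, the audio URL, the output
# format, and an optional stream flag (pass 1 to append stream=1).
build_get_url() {
    local endpoint audio_url format stream request
    endpoint=$1
    audio_url=$2
    format=$3
    stream=${4:-0}
    request="${endpoint}?url=$(urlencode "$audio_url")&format=${format}"
    if [ "$stream" = "1" ]; then
        request="${request}&stream=1"
    fi
    printf '%s\n' "$request"
}
```

    The result can be passed to curl, for example curl -N "$(build_get_url "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs" "http://icecast8.play.cz/cro1-128.mp3" json 1)". curl can also do the encoding itself: curl -N -G --data-urlencode "url=http://icecast8.play.cz/cro1-128.mp3" --data-urlencode "format=json" --data-urlencode "stream=1" "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs".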


    Sending data using HTTP POST to URL:


    app_id is the model name exported from the SpeechCloud platform.

    The body of the HTTP POST contains audio data in any format supported by FFmpeg. The audio data are decoded on the fly during the upload and fed into the speech recognition engine.

    HTTP Headers

    The HTTP API response has the following additional headers:

    HTTP GET/POST with output format specification

    The output format is specified by the parameter format in the request:


    Supported formats

    Plaintext (format=plaintext)

    Clear plaintext in UTF-8 encoding:

    Content-Type: text/plain; charset=UTF-8

    Transcriber XML (format=trs) [does not support streaming]

    XML accepted by Transcriber (http://trans.sourceforge.net/en/presentation.php). The output TRS file contains the file name, which can be passed to the API using the Content-Disposition header:

    Content-Type: text/xml
    Content-Disposition: filename="foo.wav"
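
    When uploading a local file with curl, the header can be added with the -H option. The helper below is a hypothetical convenience, not part of the API, that builds the header from a file path; dropping the directory part of the path is an assumption about what is useful in the TRS output.

```shell
# Build a Content-Disposition header carrying the base name of a local file.
content_disposition_header() {
    local path
    path=$1
    # ${path##*/} strips everything up to the last slash.
    printf 'Content-Disposition: filename="%s"\n' "${path##*/}"
}
```

    A request could then look like: curl --data-binary @test_wav/i_i2.wav -H "$(content_disposition_header test_wav/i_i2.wav)" "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=trs".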

    Extended Transcriber XML (format=extended_trs) [does not support streaming]

    Transcriber XML extended with confidence values; otherwise it is the same as the TRS format.

    Content-Type: text/xml

    WebVTT (format=webvtt)

    Web captions with timestamps and transcript support, without confidence (https://www.w3.org/TR/webvtt1/).

    Content-Type: text/webvtt

    JSON (format=json)

    A JSON array containing objects with timestamps, confidence, and word fields:

    The individual JSON objects corresponding to the recognized words are written as one JSON object per line forming a valid JSON array.
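
    Because each recognized word arrives on its own line, a client can pick out the words with line-oriented tools while the response is still streaming. The following rough sketch assumes at most one "word" field per line and no escaped quotes inside words; the function name extract_words is illustrative, and a proper JSON parser is the safer choice for anything beyond quick inspection.

```shell
# Print the recognized words from JSON output read on stdin, one per line.
# Lines without a "word" field (such as the array brackets) are skipped.
extract_words() {
    sed -n 's/.*"word": *"\([^"]*\)".*/\1/p'
}
```

    For example: curl -N --data-binary @test_wav/i_i2.wav "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=json" | extract_words.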

    Example output (pretty-printed, does not have one JSON object per line):

    Content-Type: application/json
    [
    {"start": 0.05999999865889549,
    "end": 0.6299999859184027,
    "word": "good",
    "confidence": 0.9723399117549626,
    "speech_end": false
    },
    {"start": 0.6299999859184027,
    "end": 1.1099999751895666,
    "word": "day",
    "confidence": 0.9999999999999977,
    "speech_end": true
    }
    ]

    SpeechCloud JSON (format=speechcloud_json)

    Internal format containing all messages from the SpeechCloud platform. Suitable for further processing and integration into other platforms (it may also contain NLU results, etc.).

    Content-Type: application/json; charset=UTF-8

    The JSON file consists of a list of messages. Each message has a type property indicating the type of the message. The correct sequence of message types is the following:

    If there is an error during the processing, the message {"type": "sc_error"} will appear in the message stream.

    If the input is not processed until its end, the asr_input_processed message will not appear in the message stream.
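
    These two rules suggest a simple check on a saved response. The sketch below is an illustration only: the function name classify_speechcloud_json is hypothetical, and it assumes that asr_input_processed appears as the value of a message's type property in the same way as sc_error does.

```shell
# Classify a saved speechcloud_json response by the message types it contains.
# Prints "error", "complete", or "incomplete".
classify_speechcloud_json() {
    local file
    file=$1
    if grep -Eq '"type"[[:space:]]*:[[:space:]]*"sc_error"' "$file"; then
        echo "error"
    elif grep -Eq '"type"[[:space:]]*:[[:space:]]*"asr_input_processed"' "$file"; then
        echo "complete"
    else
        echo "incomplete"
    fi
}
```

    A script could call classify_speechcloud_json on the saved JSON file before converting it and skip or retry files that are not reported as complete.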

    Error states

    Errors are indicated in the response body according to the given output format. Errors occurring before the HTTP response is written are also indicated by the HTTP status code and HTTP status line (in addition to the indication in the response body).

    Error codes

    The error codes can be divided into several groups:


    An HTTP GET requesting recognition of a URL returning 404 (http://google.com/foo):


    returns HTTP status code 404 Not Found as provided by the upstream server handling the URL (google.com). The error code is also included in the output HTTP response according to the required output format:


    plaintext

    The HTTP status code and the reason are specified on a new line in the output after the # symbol:

    # 404 Not Found
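
    For the plaintext format, a client can scan the output for such a line. The following sketch is an assumption about how one might do that, with a hypothetical function name; it treats the first line that starts with "# ", a three-digit code, and a space as the error report, which would misfire if a transcript line ever had the same shape.

```shell
# Read plaintext output on stdin. If it contains an error line of the form
# "# <status code> <reason>", print "<status code> <reason>" and return 1;
# otherwise print nothing and return 0.
check_plaintext_error() {
    local error
    # Print the first matching line without its "# " prefix, then stop reading.
    error=$(sed -n '/^# [0-9][0-9][0-9] /{s/^# //p;q;}')
    if [ -n "$error" ]; then
        printf '%s\n' "$error"
        return 1
    fi
    return 0
}
```

    For example, check_plaintext_error < result.txt prints 404 Not Found and returns a non-zero exit status for the output shown above.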

    trs, extended_trs

    The HTTP status code and the reason are stored in the ErrorCode and ErrorReason elements in the output:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE Trans SYSTEM "trans-14.dtd">
    <Trans audio_filename="input.wav">
        <ErrorCode>404</ErrorCode>
        <ErrorReason>Not Found</ErrorReason>
    </Trans>


    webvtt

    Errors in the WebVTT format are reported as comments after the NOTE label at the beginning of the line:

    NOTE Error 404 Not Found

    json, speechcloud_json

    The error code and status are reported in a JSON object embedded into the output JSON array of the HTTP response:

    {"status_code": 404, "status_reason": "Not Found"}


    The following examples use the curl command-line utility, a common HTTP client available almost everywhere.

    To recognize the audio file test_wav/i_i2.wav in the local directory (shown here with the Czech endpoint):

    curl -X POST -N \
    --data-binary @test_wav/i_i2.wav \
    "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs"

    To recognize a live audio stream from http://icecast8.play.cz/cro1-128.mp3 into JSON format, use (note the stream=1 parameter):

    curl -X GET -N "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?url=http://icecast8.play.cz/cro1-128.mp3&format=json&stream=1"

    Shell script uwebasr.sh

    You can use the following convenience shell script for automatic processing of input files. It depends on the ffmpeg and curl utilities being installed on your system. It processes the input into the SpeechCloud JSON format and then converts it into TXT and VTT formats. The files with the .s.txt and .s.vtt suffixes also contain information about short and long pauses. The script reads the URL of the conversion service from the CONV_URL environment variable.

    #!/bin/bash
    set -o nounset
    set -o errexit
    set -o pipefail
    LANG=${1:?Please, pass LANG as \$1}
    shift
    x=${1:?Please, specify one or more input files}
    # Recognition endpoint for the selected language (en, de, cs, sk).
    URL="https://uwebasr.zcu.cz/api/v2/lindat/malach/${LANG}"
    # Assumption: the conversion service URL is supplied by the caller, since it is not stated here.
    CONV_URL=${CONV_URL:?Please, set CONV_URL to the conversion service URL}

    for INPUT_FILE in "$@"; do
        JSON_FILE="${INPUT_FILE%.*}.json"
        TXT_FILE="${INPUT_FILE%.*}.txt"
        STXT_FILE="${INPUT_FILE%.*}.s.txt"
        VTT_FILE="${INPUT_FILE%.*}.vtt"
        SVTT_FILE="${INPUT_FILE%.*}.s.vtt"
        echo "=== Recognizing to raw JSON: $JSON_FILE"
        ffmpeg -hide_banner -loglevel error -i "$INPUT_FILE" -ar 16000 -ac 1 -q:a 1 -f mp3 - |\
            curl --http1.1 --data-binary @- "${URL}?format=speechcloud_json" > "$JSON_FILE"
        echo "=== Converting to plaintext: $TXT_FILE"
        curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext" > "$TXT_FILE"
        echo "=== Converting to sentext: $STXT_FILE"
        curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext&sp=0.3&pau=2.0" > "$STXT_FILE"
        echo "=== Converting to WebVTT: $VTT_FILE"
        curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=webvtt" > "$VTT_FILE"
        echo "=== Converting to SentVTT: $SVTT_FILE"
        curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=sentvtt&sp=0.3&pau=2.0" > "$SVTT_FILE"
    done

    Shell script usage

    To recognize the file test_wav/i_i2.wav using the Czech model, simply use:

    uwebasr.sh cs test_wav/i_i2.wav

    The output should look like:

    === Recognizing to raw JSON: test_wav/i_i2.json
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 56365    0  1204  100 55161     56   2604  0:00:21  0:00:21 --:--:--   312
    === Converting to plaintext: test_wav/i_i2.txt
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1302    0    98  100  1204    971  11938 --:--:-- --:--:-- --:--:-- 14152
    === Converting to sentext: test_wav/i_i2.s.txt
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1307    0   103  100  1204   1029  12028 --:--:-- --:--:-- --:--:-- 14206
    === Converting to WebVTT: test_wav/i_i2.vtt
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1402    0   198  100  1204   1879  11430 --:--:-- --:--:-- --:--:-- 14604
    === Converting to SentVTT: test_wav/i_i2.s.vtt
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  1413    0   209  100  1204   2024  11663 --:--:-- --:--:-- --:--:-- 15031

    You can also pass more than one file:

    uwebasr.sh cs test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav

    For other languages, change the first parameter (en, de, cs, sk):

    uwebasr.sh en test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav