UWebASR HTTP API
UWebASR HTTP API is a simple interface to a speech recognition engine. The input data can be passed directly within the HTTP request (POST method) or as a link to a file in the form of a URL (GET method). Recognition of a live audio stream from a given URL is supported as well. The output formats include plain text, machine-readable XML and JSON, and the WebVTT format for web captions. Recognition results (except the TRS format) are streamed continuously.
For UWebASR, use the following values of the variables:
app_id - use one of the following values to select the models provided by the UWebASR service:
- malach/en - English Wav2Vec 2.0 model (oral histories)
- malach/de - German Wav2Vec 2.0 model (oral histories)
- malach/cs - Czech Wav2Vec 2.0 model (oral histories)
- malach/sk - Slovak Wav2Vec 2.0 model (oral histories)
- generic/en - English Wav2Vec 2.0 model (generic speech)
- generic/de - German Wav2Vec 2.0 model (generic speech)
- generic/cs - Czech Wav2Vec 2.0 model (generic speech)
- generic/sk - Slovak Wav2Vec 2.0 model (generic speech)
In other words, the full HTTP API endpoint for Czech is https://uwebasr.zcu.cz/api/v2/lindat/malach/cs.
The UWebASR HTTP API uses the underlying SpeechCloud platform. The platform has an architecture employing a set of real-time workers; it is therefore possible that all workers are allocated and no further requests can be processed at the moment. This situation is indicated by the 503 HTTP status.
Each session employing a worker is limited to at most 3600 seconds; longer sessions are automatically terminated. Since the speech recognizer runs faster than real time in most situations, the maximum length of audio that can be processed in one API session is actually longer than 3600 seconds.
Citing the models
The English, German and Czech models should be cited as: Lehečka, J., Švec, J., Psutka, J.V., Ircing, P. (2023) Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech. Proc. INTERSPEECH 2023, 201-205, doi: 10.21437/Interspeech.2023-872
@inproceedings{lehecka23_interspeech,
author={Jan Lehečka and Jan Švec and Josef V. Psutka and Pavel Ircing},
title={{Transformer-based Speech Recognition Models for Oral History Archives in English, German, and Czech}},
year=2023,
booktitle={Proc. INTERSPEECH 2023},
pages={201--205},
doi={10.21437/Interspeech.2023-872}
}
The Slovak model should be cited as: Lehečka, J., Psutka, J.V., Psutka, J. (2023). Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2023. Lecture Notes in Computer Science, vol 14102. Springer, Cham. https://doi.org/10.1007/978-3-031-40498-6_29
@InProceedings{lehecka23_tsd,
author="Lehe{\v{c}}ka, Jan and Psutka, Josef V. and Psutka, Josef",
editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
title="Transfer Learning of Transformer-Based Speech Recognition Models from Czech to Slovak",
booktitle="Text, Speech, and Dialogue",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="328--338",
isbn="978-3-031-40498-6"
}
Request Format
HTTP API endpoint
The HTTP API is implemented using an endpoint in the form:
https://uwebasr.zcu.cz/api/v2/lindat/app_id
where app_id is the model name exported from the SpeechCloud platform. You must obtain the full URL including the app_id to start using the HTTP API.
HTTP GET
HTTP GET requires the parameter url, which specifies the URL of the audio file to be recognized. The file is downloaded to the server by the UWebASR HTTP API (the User-Agent header of the request is "SpeechCloud HTTP API") and processed. You can also specify the output format using the parameter format. The url parameter may also point to media streamed over HTTP; in this case, the HTTP API continuously processes the stream. If an error occurs during speech processing, it is signaled in the streamed output.
The GET parameter stream=1 instructs the HTTP API to continue recognition as long as the HTTP connection is open or the source URL is supplying audio data; otherwise, the recognition stops after the first recognition result.
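For example, a minimal GET request against the Czech model might look as follows (the audio URL here is a hypothetical placeholder):
curl -N 'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext&url=http://example.com/audio.mp3'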
HTTP POST
Sending data using HTTP POST to the URL:
https://uwebasr.zcu.cz/api/v2/lindat/app_id
where app_id is the model name exported from the SpeechCloud platform.
The body of the HTTP POST contains any audio data with a format supported by FFmpeg. The audio data are decoded on-the-fly during the upload and fed into the speech recognition engine.
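A minimal sketch of such a request, assuming a local file audio.wav (a hypothetical file name) and the Czech model:
curl -X POST -N --data-binary @audio.wav \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'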
HTTP Headers
The HTTP API response has the following additional headers:
- SpeechCloud-SessionID - a unique session identifier, useful for reporting API bugs.
- SpeechCloud-TimeRef - the date and time to which the remaining timestamps associated with the stream processing are related. The date and time format is HTTP Date/Time (https://tools.ietf.org/html/rfc7231#section-7.1.1.1).
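To inspect these headers, you can let curl print the response headers along with the body (a sketch; audio.wav is a hypothetical input file):
curl -i -X POST --data-binary @audio.wav \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'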
HTTP GET/POST with output format specification
The output format is specified by the parameter format in the request:
https://uwebasr.zcu.cz/api/v2/lindat/app_id?format=webvtt
Supported formats
Plaintext (format=plaintext)
Clear plaintext in UTF-8 encoding:
Content-Type: text/plain; charset=UTF-8
Transcriber XML (format=trs) [does not support streaming]
XML accepted by Transcriber (http://trans.sourceforge.net/en/presentation.php). The output TRS file contains the file name, which can be passed to the API using the Content-Disposition header:
Content-Type: text/xml
Content-Disposition: filename="foo.wav"
Extended Transcriber XML (format=extended_trs) [does not support streaming]
Transcriber XML extended with confidence scores; the rest is the same as the TRS format.
Content-Type: text/xml
WebVTT (format=webvtt)
Web captions with timestamps and transcripts, without confidence scores (https://www.w3.org/TR/webvtt1/).
Content-Type: text/webvtt
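An illustrative WebVTT output for a short recording might look like the following (the cue timing and text are invented for illustration, matching the JSON example below):
WEBVTT

00:00:00.060 --> 00:00:01.110
good day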
JSON (format=json)
A JSON array containing objects with timestamps, confidence, and word fields:
- start, end (number) - word beginning and end, times relative to the beginning of the audio file, or to the SpeechCloud-TimeRef header in the case of streamed media recognition (HTTP GET)
- word (string) - the recognized word
- confidence (number) - a number between 0 and 1 representing the confidence score of the recognized word, higher numbers are better
- speech_end (boolean) - binary value indicating the last word of the recognized input (input end) or the end of a recognized speech segment (based on voice activity).
The individual JSON objects corresponding to the recognized words are written as one JSON object per line forming a valid JSON array.
Example output (pretty-printed, does not have one JSON object per line):
Content-Type: application/json
[
{"start": 0.05999999865889549,
"end": 0.6299999859184027,
"word": "good",
"confidence": 0.9723399117549626,
"speech_end": false
},
{"start": 0.6299999859184027,
"end": 1.1099999751895666,
"word": "day",
"confidence": 0.9999999999999977,
"speech_end": true
}
]
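Since the output is a valid JSON array, it can be post-processed with standard tools. A minimal sketch using jq (assuming jq is installed; output.json is a hypothetical file holding the response above) joins the recognized words into a transcript:
jq -r '[.[].word] | join(" ")' output.json
# prints: good day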
SpeechCloud JSON (format=speechcloud_json)
Internal format containing all messages from the SpeechCloud platform. Suitable for further processing and integration into other platforms (it may also contain NLU results, etc.).
Content-Type: application/json; charset=UTF-8
The JSON file consists of a list of messages. Each message has a type property indicating the type of the message. The correct sequence of message types is the following:
- {"type": "asr_offline_started"} - indicates the start of the input transcoding process
- {"type": "asr_result", "partial_result": true, ...} - partial ASR result, useful for streaming purposes
- {"type": "asr_offline_stream_stats", ...} - sent periodically with statistics of the transcoding process
- {"type": "asr_offline_finished"} - indicates the end of the input transcoding process; now we just wait for the recognition of the remaining audio chunks
- {"type": "asr_result", "partial_result": false, ...} - final ASR result
- {"type": "asr_input_processed"} - all audio chunks were processed
If there is an error during the processing, the message {"type": "sc_error"} will appear in the message stream. If the input is not processed to its end, the asr_input_processed message will not appear in the message stream.
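For instance, a sketch using jq (assuming jq is available; session.json is a hypothetical file holding the speechcloud_json response) keeps only the final ASR results from the message list:
jq '[.[] | select(.type == "asr_result" and .partial_result == false)]' session.json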
Error states
Errors which are encountered are indicated in the response body according to the given output format. Errors occurring before writing the HTTP response are also indicated using the HTTP status code and status line (together with the indication in the response body).
Error codes
The error codes can be divided into several groups (a status-checking sketch follows the list):
- Errors caused by the source of (streamed) audio data, i.e. during recognition using HTTP GET. Such errors are reported using the HTTP status code of the upstream HTTP server providing the audio data.
- Errors occurring during processing of the audio data in the SpeechCloud platform. The errors are reported using HTTP 5xx codes, for example:
- 503 Cannot connect to SpeechCloud WebSocket URL - the backend server is not reachable from UWebASR HTTP API.
- 503 SpeechCloud WebSocket did not send any message - the backend server didn’t send any message in the specified timeout.
- 503 SpeechCloud WebSocket closed prior to initializing the ASR - the worker closed the communication channel before initializing the ASR.
- 500 Invalid SpeechCloud response - the response from the backend is corrupted.
- Other 500 statuses - the worker process reported an error described in the status.
- Errors in handling the UWebASR HTTP API request:
- 400 Unknown format - the request specifies an unknown format.
- 400 Empty audio data - there were no audio data from the given URL (empty response/non-existent upstream server).
- 410 No data received from HTTP stream for X seconds (threshold Y), closing... - the input audio stream was not closed, but no data was received for a long time.
- 500 No data received from SpeechCloud WebSocket for X seconds (Y), closing... - the worker process stopped sending messages.
- 503 Recognition delay X is greater than threshold Y, closing... - the speech recognizer was delayed by more than a given threshold (the worker hardware is overloaded).
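To detect such errors in a script, you can let curl report just the HTTP status code (a sketch; the url value is a hypothetical placeholder):
curl -s -o /dev/null -w '%{http_code}\n' \
  'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext&url=http://example.com/audio.mp3'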
Example
An HTTP GET requesting recognition of a URL returning 404 (http://google.com/foo):
https://uwebasr.zcu.cz/api/v2/lindat/app_id?format=plaintext&stream=1&url=http://google.com/foo
returns HTTP status code 404 Not Found as provided by the upstream server handling the URL (google.com). The error code is also included in the HTTP response body according to the required output format:
plaintext
The HTTP status code and the reason are written on a new line in the output after the # symbol:
# 404 Not Found
trs, extended_trs
The HTTP status code and the reason are stored in the ErrorCode and ErrorReason elements in the output:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans audio_filename="input.wav">
<Episode>
<ErrorCode>404</ErrorCode>
<ErrorReason>Not Found</ErrorReason>
</Episode>
</Trans>
webvtt
The errors in the WebVTT format are reported as comments after the NOTE label at the beginning of a line:
WEBVTT
NOTE Error 404 Not Found
json, speechcloud_json
The error code and status are reported in a JSON object embedded into the output JSON array of the HTTP response:
[
{"status_code": 404, "status_reason": "Not Found"}
]
Examples
The following examples use the curl command-line utility as a common HTTP client available almost everywhere.
To recognize the audio file test_wav/i_i2.wav in a local directory:
curl -X POST -N \
--data-binary @test_wav/i_i2.wav \
'https://uwebasr.zcu.cz/api/v2/lindat/malach/cs?format=plaintext'
To recognize a live audio stream from http://icecast8.play.cz/cro1-128.mp3 into the JSON format, use (note the stream=1 parameter):
curl -X GET -N "https://uwebasr.zcu.cz/api/v2/lindat/malach/cs"\
"?format=json&stream=1&url=http://icecast8.play.cz/cro1-128.mp3"
Shell script uwebasr.sh
You can use the following convenience shell script for automatic processing of input files. It depends on the ffmpeg and curl utilities installed in your system. It processes the input into the SpeechCloud JSON format and then converts it into the TXT and VTT formats. The files with the .s.txt and .s.vtt suffixes also contain information about short and long pauses.
#!/bin/bash
set -o nounset
set -o errexit
set -o pipefail

# Language code (en, de, cs, sk) selects the malach model on the endpoint.
LANG=${1:?Please, pass LANG as \$1}
URL="https://uwebasr.zcu.cz/api/v2/lindat/malach/${LANG}"
CONV_URL="https://uwebasr.zcu.cz/utils/v2/convert-speechcloud-json"
shift

# At least one input file must be given after the language code.
x=${1:?Please, specify one or more input files}

for INPUT_FILE in "$@"; do
    # Derive the output file names from the input file name.
    JSON_FILE=${INPUT_FILE%.*}.json
    TXT_FILE=${INPUT_FILE%.*}.txt
    STXT_FILE=${INPUT_FILE%.*}.s.txt
    VTT_FILE=${INPUT_FILE%.*}.vtt
    SVTT_FILE=${INPUT_FILE%.*}.s.vtt

    echo "=== Recognizing to raw JSON: $JSON_FILE"
    # Re-encode to 16 kHz mono MP3 on the fly and stream it to the recognizer.
    ffmpeg -hide_banner -loglevel error -i "$INPUT_FILE" -ar 16000 -ac 1 -q:a 1 -f mp3 - |\
        curl --http1.1 --data-binary @- "${URL}?format=speechcloud_json" > "$JSON_FILE"

    # Convert the SpeechCloud JSON into the remaining output formats.
    echo "=== Converting to plaintext: $TXT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext" > "$TXT_FILE"
    echo "=== Converting to sentext: $STXT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=plaintext&sp=0.3&pau=2.0" > "$STXT_FILE"
    echo "=== Converting to WebVTT: $VTT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=webvtt" > "$VTT_FILE"
    echo "=== Converting to SentVTT: $SVTT_FILE"
    curl --data-binary "@${JSON_FILE}" "${CONV_URL}?format=sentvtt&sp=0.3&pau=2.0" > "$SVTT_FILE"
done
Shell script usage
To recognize the file test_wav/i_i2.wav using the Czech model, simply use:
uwebasr.sh cs test_wav/i_i2.wav
The output should look like:
=== Recognizing to raw JSON: test_wav/i_i2.json
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 56365 0 1204 100 55161 56 2604 0:00:21 0:00:21 --:--:-- 312
=== Converting to plaintext: test_wav/i_i2.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1302 0 98 100 1204 971 11938 --:--:-- --:--:-- --:--:-- 14152
=== Converting to sentext: test_wav/i_i2.s.txt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1307 0 103 100 1204 1029 12028 --:--:-- --:--:-- --:--:-- 14206
=== Converting to WebVTT: test_wav/i_i2.vtt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1402 0 198 100 1204 1879 11430 --:--:-- --:--:-- --:--:-- 14604
=== Converting to SentVTT: test_wav/i_i2.s.vtt
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1413 0 209 100 1204 2024 11663 --:--:-- --:--:-- --:--:-- 15031
You can also pass more than one file:
uwebasr.sh cs test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav
For other languages, change the first parameter (en, de, cs, sk):
uwebasr.sh en test_wav/input1.wav test_wav/input2.wav test_wav/input3.wav