Documentation

Quickstart

Soniqo Cloud is wire-compatible with OpenAI’s Whisper API, so you can use the official OpenAI SDKs by overriding the base URL. Speaker diarization arrives as an additive x_soniqo field on the response, which vanilla clients ignore.

1. Get an API key

Sign in with Google or GitHub, then go to Account → API keys and click Create new key. The plaintext is shown once; copy it immediately. Save it as a shell env var:

export SONIQO_API_KEY=sk_...

All examples below use this key as the bearer credential. New accounts get $20 in free credits, refreshed monthly.

2. Transcribe a file

Python (OpenAI SDK)

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["SONIQO_API_KEY"],
    base_url="https://cloud.soniqo.audio/v1",
)

with open("meeting.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="parakeet-tdt",
        file=f,
        response_format="verbose_json",
    )

print(result.text)
# Speaker turns are in the response's `x_soniqo` field; the loop below
# reads each segment's speaker label when one is present.
for seg in result.segments:
    speaker = getattr(seg, "x_speaker_id", None)
    print(f"[{speaker or '?'}] {seg.text}")
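 
The loop above prints one line per segment. If you would rather see one line per speaker turn, consecutive segments with the same label can be merged. This is a minimal sketch that assumes each segment exposes text and, optionally, the x_speaker_id attribute read in the example above; speaker_turns is a hypothetical helper, not part of any SDK.

```python
from itertools import groupby


def speaker_turns(segments):
    """Merge consecutive segments that share a speaker label into turns."""

    def label(seg):
        # x_speaker_id may be missing on some segments; fall back to "?".
        return getattr(seg, "x_speaker_id", None) or "?"

    turns = []
    for speaker, group in groupby(segments, key=label):
        # Join the text of every segment in this uninterrupted run.
        text = " ".join(seg.text.strip() for seg in group)
        turns.append((speaker, text))
    return turns
```

Calling speaker_turns(result.segments) returns a list of (speaker, text) pairs in spoken order.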

Node.js (OpenAI SDK)

import OpenAI from "openai";
import fs from "node:fs";

const client = new OpenAI({
  apiKey:  process.env.SONIQO_API_KEY,
  baseURL: "https://cloud.soniqo.audio/v1",
});

const result = await client.audio.transcriptions.create({
  model: "parakeet-tdt",
  file: fs.createReadStream("meeting.wav"),
  response_format: "verbose_json",
});

console.log(result.text);
for (const seg of result.segments) {
  console.log(`[${seg.x_speaker_id ?? "?"}] ${seg.text}`);
}

curl

curl -X POST https://cloud.soniqo.audio/v1/audio/transcriptions \
  -H "Authorization: Bearer $SONIQO_API_KEY" \
  -F "file=@meeting.wav" \
  -F "model=parakeet-tdt" \
  -F "response_format=verbose_json"

Audio format note. The synchronous endpoint currently accepts raw 16 kHz PCM, float32 little-endian. Decoding of mp3, m4a, wav, and webm (via ffmpeg) ships in the next release; until then, convert with ffmpeg -i in.mp3 -f f32le -ar 16000 -ac 1 out.pcm.
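
If ffmpeg is not available and the input is already a 16-bit, 16 kHz, mono WAV file, the conversion to the raw format described in the note above can be sketched with the Python standard library. wav_to_f32le is a hypothetical helper; it does no resampling or channel mixing and rejects any other input.

```python
import array
import sys
import wave


def wav_to_f32le(wav_path, pcm_path):
    """Convert a 16-bit, 16 kHz, mono WAV file to raw float32 little-endian PCM.

    Returns the number of samples written.
    """
    with wave.open(wav_path, "rb") as w:
        if (w.getnchannels(), w.getsampwidth(), w.getframerate()) != (1, 2, 16000):
            raise ValueError("expected 16-bit mono audio at 16 kHz; resample first")
        ints = array.array("h")
        ints.frombytes(w.readframes(w.getnframes()))
    if sys.byteorder == "big":
        ints.byteswap()  # WAV samples are stored little-endian
    # Scale int16 samples into the [-1.0, 1.0) float range.
    floats = array.array("f", (s / 32768.0 for s in ints))
    if sys.byteorder == "big":
        floats.byteswap()  # always emit little-endian bytes
    with open(pcm_path, "wb") as out:
        out.write(floats.tobytes())
    return len(floats)
```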

3. Response shapes

Pass response_format to control the output. All five formats from OpenAI’s spec are supported.

Format           Body                                        Use case
json (default)   {"text": "..."}                             just the transcript
text             plain text                                  shell pipelines
verbose_json     segments + words + x_soniqo speaker turns   app integration; the rich one
srt              SubRip subtitle file                        video subtitles
vtt              WebVTT subtitle file                        browser-native subtitles

4. Async for long audio

The Whisper-compat synchronous endpoint blocks for up to 5 minutes and returns 504 if the job runs past that. For longer audio, use Soniqo’s async API: it returns a job_id immediately, and you poll for the result.

# 1. submit
JOB=$(curl -sX POST https://cloud.soniqo.audio/v1/transcribe \
        -H "X-Api-Key: $SONIQO_API_KEY" \
        -F audio=@meeting.pcm | jq -r .job_id)

# 2. poll
while true; do
  RES=$(curl -s "https://cloud.soniqo.audio/v1/transcribe/$JOB" \
            -H "X-Api-Key: $SONIQO_API_KEY")
  STATUS=$(echo "$RES" | jq -r .status)
  case "$STATUS" in
    completed) echo "$RES" | jq .; break ;;
    failed)    echo "$RES" >&2; exit 1 ;;
    *)         sleep 2 ;;
  esac
done
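
The same poll loop can be sketched in Python with only the standard library. The URL, the X-Api-Key header, and the completed / failed statuses come from the shell example above; the overall timeout and the function names (fetch_job, wait_for_job) are assumptions added for illustration.

```python
import json
import os
import time
import urllib.request

BASE_URL = "https://cloud.soniqo.audio/v1"


def fetch_job(job_id):
    """GET /v1/transcribe/{job_id} with the X-Api-Key header, as in the curl loop."""
    req = urllib.request.Request(
        f"{BASE_URL}/transcribe/{job_id}",
        headers={"X-Api-Key": os.environ["SONIQO_API_KEY"]},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, fetch=fetch_job, interval=2.0, timeout=3600.0, sleep=time.sleep):
    """Poll until the job completes; raise on failure or when the deadline passes."""
    deadline = time.monotonic() + timeout  # the deadline is an assumption, not in the shell loop
    while True:
        res = fetch(job_id)
        status = res.get("status")
        if status == "completed":
            return res
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed: {res}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)
```

Passing fetch and sleep as parameters keeps the loop testable without network access; in normal use, wait_for_job(job_id) is enough.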

5. Realtime streaming

A WebSocket endpoint at wss://cloud.soniqo.audio/v1/realtime?intent=transcription streams partial transcripts as you send audio. It speaks the same event vocabulary as OpenAI Realtime (session.update, input_audio_buffer.append, conversation.item.input_audio_transcription.delta), so existing OpenAI Realtime clients work by changing the URL. Streaming recognition runs on Nemotron Speech Streaming.

Realtime protocol reference →
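
To make the event flow concrete, here is a rough sketch of the two message shapes a client handles: outgoing input_audio_buffer.append events and incoming transcription deltas. It assumes the payloads mirror OpenAI Realtime (a base64 audio field on append, a delta field on transcription events); the chunk size is arbitrary, and the expected audio encoding is not stated here, so check the protocol reference before relying on it. The WebSocket transport itself is left out.

```python
import base64
import json


def append_events(pcm, chunk_bytes=3200):
    """Yield input_audio_buffer.append events, one per chunk of raw audio bytes."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # Assumed field name, following the OpenAI Realtime event shape.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


def collect_deltas(messages):
    """Concatenate the text carried by transcription delta events."""
    parts = []
    for raw in messages:
        event = json.loads(raw)
        if event.get("type") == "conversation.item.input_audio_transcription.delta":
            parts.append(event.get("delta", ""))  # assumed field name
    return "".join(parts)
```

Each string from append_events would be sent as one WebSocket text frame, and every received frame would be passed through collect_deltas or an equivalent handler.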

6. Pricing & limits

  • Free tier: $20 in starter credits, plus $20 each month. Usage is billed at 1¢ per audio-second.
  • Synchronous endpoint: 5 minute wall-clock cap; longer audio uses the async path.
  • Soft rate limit: 1000 req/min per IP. Higher quotas via support.
  • Concurrent jobs scale with your billing class; free tier is best-effort.
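
As a quick sanity check on the per-second rate, the arithmetic can be written out. This sketch assumes billing is strictly linear in audio duration with no rounding or minimum charge, which the list above does not specify; the function names are illustrative.

```python
CENTS_PER_AUDIO_SECOND = 1  # rate quoted in the pricing list


def cost_cents(audio_seconds):
    """Cost of a recording in cents, assuming strictly linear billing."""
    return audio_seconds * CENTS_PER_AUDIO_SECOND


def seconds_covered(credit_dollars):
    """Whole audio-seconds that a credit balance pays for."""
    return credit_dollars * 100 // CENTS_PER_AUDIO_SECOND


print(cost_cents(60 * 60))   # one hour of audio -> 3600 cents ($36)
print(seconds_covered(20))   # $20 of credit -> 2000 seconds (about 33 minutes)
```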

7. What’s different from OpenAI

  • Speaker diarization is built in. Set response_format=verbose_json and read x_soniqo for speaker turns.
  • Speaker identification is opt-in. Register profiles via POST /v1/speakers, then pass identify_speakers=true to label them in the transcript.
  • The pipeline is multilingual. Parakeet TDT covers 25 languages, and the Omnilingual fallback handles the long tail. Pass a hint such as language=es, or leave it empty to auto-detect.
  • Self-hosted SDKs are available. The same models run on-device under Apache 2.0 if you would rather not go through a cloud at all: speech-swift for Apple platforms, and speech-core, a cross-platform C++ engine for Linux, Windows, macOS, and Android with the same speech-to-text, diarization, speaker ID, and realtime streaming.