Use case

Text to Speech and Music API

For builders who combine synthesized narration with music (audiobooks, explainers, IVR) and need both halves available through APIs.

The problem

Narration alone sounds flat. Audiobooks, explainer videos, and voice apps need music under or around the voice, but stitching a separate music source into a TTS pipeline is fiddly and the moods rarely match. You want the music half to be as programmable as the speech half.

Why MusicAPI fits this

Generate instrumental beds and themes to score synthesized narration, controlled entirely through a REST API.

Setting make_instrumental to true keeps vocals out of the score, so sung lyrics do not compete with the spoken track.

The create-then-poll flow slots in next to a TTS step in a pipeline, so both speech and music are generated programmatically.

Code sample

A real request against the live API: start a job, then poll the task endpoint until the audio is ready.

curl

# Generate an instrumental score to sit under TTS narration
# 1. Start a generation job
curl -X POST https://api.musicapi.ai/api/v1/sonic/create \
  -H "Authorization: Bearer $MUSICAPI_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "custom_mode": true,
    "mv": "sonic-v5",
    "prompt": "A subtle, warm instrumental underscore for a narrated explainer",
    "tags": "warm, neutral, subtle, narration underscore",
    "title": "Narrated Bed",
    "make_instrumental": true
  }'

# Response: { "message": "success", "task_id": "a1b2c3d4-..." }

# 2. Poll until the audio is ready (state: "succeeded")
curl https://api.musicapi.ai/api/v1/sonic/task/a1b2c3d4-... \
  -H "Authorization: Bearer $MUSICAPI_KEY"

# Response data[].audio_url is a ready-to-stream MP3 once state is "succeeded".
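
The two steps above can be chained in a small script. What follows is a minimal sketch, not an official client. It assumes the create response carries a task_id string and the task response carries state and audio_url string fields somewhere in its JSON, as the comments in the sample suggest. The helper names, the 5-second interval, the attempt cap, and the "failed" state name are illustrative assumptions. The grep/sed extraction is deliberately crude and only returns the first match; a real JSON parser such as jq would be sturdier.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, interval, retry cap, and the "failed" state
# are assumptions, not part of the documented API.

API_BASE="https://api.musicapi.ai/api/v1/sonic"

# Print the first string value stored under key $1 in the JSON on stdin.
# Crude text match: works at any nesting depth, but not with escaped quotes.
json_first_string() {
  grep -o "\"$1\"[[:space:]]*:[[:space:]]*\"[^\"]*\"" \
    | head -n 1 \
    | sed 's/^"[^"]*"[[:space:]]*:[[:space:]]*"\(.*\)"$/\1/'
}

# Start a job from a JSON payload ($1) and print the returned task_id.
start_job() {
  curl -sS -X POST "$API_BASE/create" \
    -H "Authorization: Bearer $MUSICAPI_KEY" \
    -H "Content-Type: application/json" \
    -d "$1" | json_first_string task_id
}

# Poll task $1 every $2 seconds (default 5), up to $3 attempts (default 60).
# Prints the first audio_url once state is "succeeded"; returns 1 otherwise.
poll_task() {
  local task_id="$1" interval="${2:-5}" max_attempts="${3:-60}"
  local attempt=1 body state
  while [ "$attempt" -le "$max_attempts" ]; do
    body="$(curl -sS "$API_BASE/task/$task_id" \
      -H "Authorization: Bearer $MUSICAPI_KEY")" || return 1
    state="$(printf '%s' "$body" | json_first_string state)"
    case "$state" in
      succeeded)
        printf '%s' "$body" | json_first_string audio_url
        return 0 ;;
      failed)
        echo "task $task_id failed" >&2
        return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "timed out waiting for task $task_id" >&2
  return 1
}

# Usage (not run here):
#   task_id="$(start_job "$payload")"
#   url="$(poll_task "$task_id")" && curl -sS -o bed.mp3 "$url"
```

Capping the attempts keeps a stalled job from blocking the TTS side of the pipeline indefinitely; tune the interval and cap to your own latency budget.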

Pricing

MusicAPI is pay-as-you-go with credit packs, plus predictable monthly subscriptions. The per-credit rate is the same across packs and subscriptions. See the pricing page for current rates, free credits, and volume options.

Related: Suno API · Producer AI API

FAQ

Does this API also do text to speech?

This endpoint generates music, not speech. It covers the music half of a narration-plus-music workflow: you generate the spoken track with your TTS provider and use MusicAPI to generate the instrumental score that sits under or around it.

How do I keep the music from masking the voice?

Set make_instrumental to true and keep tags subtle and low-key. An instrumental, restrained bed stays clear of the narration so speech remains intelligible.
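
Level at mix time is a second lever. The sketch below is an assumption about your own pipeline, not a MusicAPI feature: it uses ffmpeg (a separate tool) to lower the bed with the volume filter and combine it with the narration using amix. The function names and the 0.2 default gain are illustrative. amix rescales its inputs by default in many ffmpeg builds, so check the result by ear.

```shell
# Sketch: mix narration over a quieter music bed with ffmpeg (assumed tool).

# Build the filter graph: scale the bed (input 1) by linear gain $1,
# then mix it with the narration (input 0). duration=first ends the
# output when the narration ends.
mix_filter() {
  printf '[1:a]volume=%s[bed];[0:a][bed]amix=inputs=2:duration=first[out]' "$1"
}

# mix_under NARRATION BED OUTPUT [GAIN]
mix_under() {
  ffmpeg -y -i "$1" -i "$2" \
    -filter_complex "$(mix_filter "${4:-0.2}")" \
    -map '[out]' "$3"
}

# Usage (not run here):
#   mix_under narration.mp3 bed.mp3 mixed.mp3 0.15
```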

Can I match the music length to the narration?

Generate a base score, then use the extend endpoint to match the length of the narration segment. Poll the task endpoint for the finished audio at each step.
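
To decide whether another extend step is needed, you can compare the two durations locally. This is a rough sketch under stated assumptions: it measures files with ffprobe (a separate tool, assumed installed), and the helper names are hypothetical. The extend request itself is not shown because its path and parameters are not covered on this page; take them from the API reference.

```shell
# Sketch: measure how much music is still missing under a narration segment.

# Print a media file's duration in seconds using ffprobe.
duration_secs() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Print the missing seconds of music, or 0.0 if the bed is long enough.
# $1 = narration seconds, $2 = music seconds
shortfall_secs() {
  awk -v n="$1" -v m="$2" \
    'BEGIN { d = n - m; if (d < 0) d = 0; printf "%.1f\n", d }'
}

# Usage (not run here):
#   gap="$(shortfall_secs "$(duration_secs narration.mp3)" "$(duration_secs bed.mp3)")"
#   If gap is above 0.0: call the extend endpoint, poll the task, measure again.
```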

Build it in 5 minutes

Get free credits on signup and run real generations before any payment. No credit card required to start.


API details verified 2026-05-16. The API surface evolves; the pricing page always has current rates.