Audio#

Learn how to turn audio into text or text into audio with Xinference.

Introduction#

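The Audio API provides three methods for interacting with audio:

The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.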

API endpoint	OpenAI-compatible endpoint
Transcription API	/v1/audio/transcriptions
Translation API	/v1/audio/translations
Speech API	/v1/audio/speech
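As a small illustration of the endpoint table, the sketch below builds the full URL of each audio endpoint from a server address. The helper name `audio_url` and the example address are assumptions made for illustration; they are not part of Xinference.

```python
# Paths copied from the endpoint table; the helper itself is illustrative only.
AUDIO_ENDPOINTS = {
    "transcriptions": "/v1/audio/transcriptions",
    "translations": "/v1/audio/translations",
    "speech": "/v1/audio/speech",
}


def audio_url(base_url: str, kind: str) -> str:
    """Return the full URL of one audio endpoint on an Xinference server."""
    if kind not in AUDIO_ENDPOINTS:
        raise ValueError(f"unknown audio endpoint: {kind!r}")
    # Strip a trailing slash so the result never contains "//v1".
    return base_url.rstrip("/") + AUDIO_ENDPOINTS[kind]


print(audio_url("http://127.0.0.1:9997", "speech"))
# http://127.0.0.1:9997/v1/audio/speech
```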

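Supported models#

The Audio API is supported by the following models in Xinference:

Speech to text#

For Mac M-series chips only:

Text to speech (TTS)#

Models that support zero-shot synthesis (no reference audio needed)

Models that support voice cloning (reference audio required)

Models that support emotion control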

IndexTTS2

For Mac M-series chips only:

Quickstart#

Transcription#

The Transcription API mimics OpenAI's create transcriptions API. You can try it via cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Translation#

The Translation API mimics OpenAI's create translations API. You can try it via cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model="<MODEL_UID>",
        file=audio_file,
    )

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

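Speech#

The Speech API mimics OpenAI's create speech API. You can try the Speech API via cURL, the OpenAI client, or Xinference's Python client:

The Speech API is non-streaming by default, for two reasons:

The streaming output of ChatTTS is not as good as its non-streaming output; see 2noise/ChatTTS#564
Streaming requires ffmpeg<7: https://pytorch.org/audio/stable/installation.html#optional-dependencies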

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
)

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
model.speech(
    input="<The text to generate audio for>",
    voice="echo",
    stream=True,
)

The output will be an audio binary.
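Because the result is raw audio bytes, or a generator of byte chunks when streaming is requested, a small helper can write either form to disk. This is a minimal sketch; `save_speech` is a hypothetical name and not part of the Xinference client.

```python
from pathlib import Path
from typing import Iterable, Union


def save_speech(result: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write speech output to `path` and return the number of bytes written.

    `result` may be a single bytes object (non-streaming) or an iterable
    of byte chunks (streaming).
    """
    chunks = [result] if isinstance(result, (bytes, bytearray)) else result
    written = 0
    with Path(path).open("wb") as f:
        for chunk in chunks:
            written += f.write(chunk)
    return written
```

For example, `save_speech(model.speech("<text>", voice="echo"), "out.mp3")` would store a non-streaming result, and the same call works unchanged with `stream=True`.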

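ChatTTS usage#

For basic usage, refer to the Speech section above.

Fixed tone color. You can use a fixed tone color provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv and, taking the seed_2155 tone as an example, use the value in its emb_data column.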

import pandas as pd

df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]

Use the fixed seed_2155 tone color to generate speech.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>",
)
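If pandas is not available, the same lookup can be sketched with the standard library. The column names `seed_id` and `emb_data` come from the example above; the function name and the in-memory sample rows are made up for illustration.

```python
import csv
import io


def load_speaker_embedding(csv_file, seed_id: str) -> str:
    """Return the emb_data string of the first row whose seed_id matches."""
    for row in csv.DictReader(csv_file):
        if row["seed_id"] == seed_id:
            return row["emb_data"]
    raise KeyError(f"seed_id {seed_id!r} not found")


# Tiny in-memory stand-in for the downloaded CSV (values are made up).
sample = io.StringIO("seed_id,emb_data\nseed_1,AAAA\nseed_2155,BBBB\n")
print(load_speaker_embedding(sample, "seed_2155"))  # prints BBBB
```

With the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.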

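CosyVoice usage#

CosyVoice has two versions: CosyVoice 1.0 and CosyVoice 2.0. CosyVoice 1.0 has 3 different models:

CosyVoice-300M-SFT: Choose this model if you only want to convert text to speech. It provides several pretrained voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
CosyVoice-300M: Choose this model if you want to clone a voice or convert text to speech in another language. With this model you must provide prompt_speech as a WAV audio file; a 16,000 Hz sample rate is recommended for better performance.
CosyVoice-300M-Instruct: Choose this model if you need precise control over tone and timbre.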
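The choice between the three CosyVoice 1.0 models can be sketched as a lookup. The task labels (`tts`, `clone`, and so on) and the helper name are hypothetical; only the model names and the voice list come from the descriptions above.

```python
# Voice names listed for CosyVoice-300M-SFT; pass them unchanged as `voice`.
COSYVOICE_SFT_VOICES = ("中文女", "中文男", "日语男", "粤语女", "英文女", "英文男", "韩语女")

# Hypothetical task labels mapped to the model described for each task.
COSYVOICE_MODEL_FOR_TASK = {
    "tts": "CosyVoice-300M-SFT",
    "clone": "CosyVoice-300M",
    "cross_lingual": "CosyVoice-300M",
    "instruct": "CosyVoice-300M-Instruct",
}


def check_sft_voice(voice: str) -> str:
    """Return `voice` if it is one of the pretrained SFT voices, else raise."""
    if voice not in COSYVOICE_SFT_VOICES:
        raise ValueError(f"{voice!r} is not one of {COSYVOICE_SFT_VOICES}")
    return voice


print(COSYVOICE_MODEL_FOR_TASK["clone"])  # CosyVoice-300M
```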

Basic usage: launch the model CosyVoice-300M-SFT.

# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)

Clone a voice: launch the model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

zero_shot_prompt_text = ("<the words in the text exactly match "
                         "the audio file of the zero-shot prompt>")
# The words said in the audio file should be identical
# to zero_shot_prompt_text.
#
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)
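Since the prompt audio must be WAV and 16,000 Hz is recommended, it can help to check a file before sending it. The sketch below reads the sample rate with the standard `wave` module; `wav_sample_rate` and `make_silence` are illustrative helpers, not part of Xinference.

```python
import io
import wave

RECOMMENDED_RATE = 16000  # Hz, the rate recommended for prompt_speech


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate of WAV data; raises if the data is not valid WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_silence(rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit silent WAV, used here only as sample input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


print(wav_sample_rate(make_silence(16000)) == RECOMMENDED_RATE)  # prints True
```

A file with a different rate is still accepted according to the comments above, but it will be resampled, so the check is only a performance hint.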

Cross-lingual usage: launch the model CosyVoice-300M.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)

Instruction-based speech synthesis: launch the model CosyVoice-300M-Instruct.

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时，他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
    "Fights with fervor for justice, but struggles with impulsiveness.",
)

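CosyVoice 2.0 has only one model, but it includes all the capabilities of the three CosyVoice 1.0 models. It is used in the same way as CosyVoice 1.0.

CosyVoice 2.0 streaming usage: launch the model CosyVoice2-0.5B.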

# Launch model
import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
client = Client(endpoint)

model_uid = client.launch_model(
    model_name="CosyVoice2-0.5B",
    model_type="audio",
    download_hub="modelscope",
)
model = client.get_model(model_uid)

input_string = "你好，我是通义生成式语音大模型，请问有什么可以帮您的吗？"

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)

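For more instructions and examples, see https://fun-audio-llm.github.io/ .

FishSpeech usage#

For basic usage, refer to the Speech section above.

Clone a voice: launch the model FishSpeech-1.5. To give FishSpeech the reference audio, use prompt_speech instead of reference_audio and prompt_text instead of reference_text. These parameter names match those used for CosyVoice voice cloning.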

from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")

model = client.get_model("<MODEL_UID>")

# The reference audio file is the voice file
# the words said in the file should be identical to reference_text
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = ""  # text in the audio

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)

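Paraformer usage#

model	VAD (voice activity detection)	punc (punctuation restoration)	timestamp	speaker	hotword
paraformer-zh	yes	yes	no	no	no
paraformer-zh-hotword	yes	yes	no	no	yes
paraformer-zh-spk	yes	yes	yes	yes	no
paraformer-zh-long	yes	yes	yes	yes	no
seaco-paraformer-zh (recommended)	yes	yes	yes	yes	yes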
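The table above can also be expressed as data, which makes it easy to pick a model by required features. This is an illustrative sketch; the feature keys and the helper name are this example's own and not an Xinference API.

```python
from typing import List

# Transcribed from the table above: model -> supported features.
PARAFORMER_FEATURES = {
    "paraformer-zh": {"vad", "punc"},
    "paraformer-zh-hotword": {"vad", "punc", "hotword"},
    "paraformer-zh-spk": {"vad", "punc", "timestamp", "speaker"},
    "paraformer-zh-long": {"vad", "punc", "timestamp", "speaker"},
    "seaco-paraformer-zh": {"vad", "punc", "timestamp", "speaker", "hotword"},
}


def models_supporting(*features: str) -> List[str]:
    """Return the models that support every requested feature, in table order."""
    wanted = set(features)
    return [name for name, have in PARAFORMER_FEATURES.items() if wanted <= have]


print(models_supporting("timestamp", "hotword"))  # ['seaco-paraformer-zh']
```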

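VAD and punctuation usage

All Paraformer models support VAD and punctuation restoration.

Timestamp and speaker identification usage

Only the following models support timestamps and speaker identification:
- paraformer-zh-spk
- paraformer-zh-long
- seaco-paraformer-zh
Of these, only paraformer-zh-spk enables speaker identification by default.

If you are using paraformer-zh-long or seaco-paraformer-zh and need speaker identification:
- In the Web UI: add an extra parameter named spk_model with the value cam++
- On the command line: add the option --spk_model cam++
Example: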
```
from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, response_format="verbose_json")
```

Hotword usage

Only the following models support hotwords:

paraformer-zh-hotword
seaco-paraformer-zh

Example:

from xinference.client import Client
client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
    model.transcriptions(audio, hotword="小艾 魔搭")

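SenseVoiceSmall offline usage#

SenseVoiceSmall now uses a small VAD model, fsmn-vad, so it needs network access to download it.

For offline environments, you can download this VAD model in advance.

Download it from huggingface or modelscope. Assume it is downloaded to /path/to/fsmn-vad.

Then, when launching SenseVoiceSmall from the Web UI, add an extra option whose key is vad_model and whose value is the download path /path/to/fsmn-vad. When launching from the command line, add the option --vad_model /path/to/fsmn-vad.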
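To avoid quoting mistakes when the local path contains spaces, the command line can be assembled programmatically. This sketch assumes the `--model-name` and `--model-type` flags shown in the IndexTTS2 launch command elsewhere on this page apply here as well; the function name is hypothetical.

```python
import shlex


def sensevoice_launch_command(vad_model_dir: str) -> str:
    """Build an `xinference launch` command line that points at a local fsmn-vad."""
    args = [
        "xinference", "launch",
        "--model-name", "SenseVoiceSmall",
        "--model-type", "audio",
        "--vad_model", vad_model_dir,
    ]
    return shlex.join(args)  # quotes the path if it contains spaces


print(sensevoice_launch_command("/path/to/fsmn-vad"))
# xinference launch --model-name SenseVoiceSmall --model-type audio --vad_model /path/to/fsmn-vad
```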

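Kokoro usage#

The Kokoro model supports multiple languages and defaults to English. To use a non-default language (for example, Chinese), install the extra dependencies and pass the corresponding parameter when launching the model.

pip install misaki[zh]

Initialize the model with the parameter lang_code='z'; see the kokoro source code for all supported lang_code values. If you launch the model from the Web UI, add an extra parameter whose key is lang_code and whose value is z. If you launch the model with the xinference client, pass the parameter as in the following code: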
```
model_uid = client.launch_model(
    model_name="Kokoro-82M",
    model_type="audio",
    compile=False,
    download_hub="huggingface",
    lang_code="z",
)
```
At inference time, use a voice whose name starts with 'z', for example zf_xiaoyi. The currently supported voices are listed at https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices . Usage:
```
input_string = "重新启动即可更新"
response = model.speech(input_string, voice="zf_xiaoyi")
```
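A quick guard can catch a voice that does not match the language the model was launched with. The sketch assumes the 'z' rule stated above generalizes, so that a voice name always starts with the model's lang_code; that generalization and the helper name are assumptions.

```python
def voice_matches_lang(voice: str, lang_code: str) -> bool:
    """True if the voice name starts with the model's lang_code, e.g. 'z'."""
    return bool(lang_code) and voice.startswith(lang_code)


print(voice_matches_lang("zf_xiaoyi", "z"))  # True
print(voice_matches_lang("zf_xiaoyi", "a"))  # False
```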

IndexTTS2 usage#

IndexTTS2 supports emotion control, which you enable through a few extra parameters. IndexTTS2 can be used as follows:

Single reference audio (voice cloning):

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Translate for me, what is a surprise!",
    prompt_speech=test_prompt_speech,
)

With an emotion reference audio:

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()

response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech
)

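When an emotion reference audio is given, you can optionally set emo_alpha to adjust how strongly it affects the output. The valid range is 0.0 - 1.0, and the default is 1.0 (100%).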

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()

response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech,
    emo_alpha=0.9
)

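You can omit the emotion reference audio and instead provide a list of 8 floats giving the intensity of each emotion, in this order: [happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]. You can also use the use_random parameter to introduce randomness into the emotion during inference; it defaults to False, and setting it to True enables the randomness.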

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Wow, I'm so lucky!",
    prompt_speech=test_prompt_speech,
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],
    use_random=False
)
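Writing the 8-element list by hand is error-prone, so a small builder keyed by emotion name can help. The order follows the list described above; the English key names and the function itself are this sketch's own choices, not part of IndexTTS2 or Xinference.

```python
# Order taken from the description above; the English keys are illustrative.
EMOTIONS = ("happy", "angry", "sad", "afraid", "disgusted",
            "melancholic", "surprised", "calm")


def emotion_vector(**intensity: float) -> list:
    """Return the 8-float emo_vector with the named emotions set, others 0.0."""
    unknown = set(intensity) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    return [float(intensity.get(name, 0.0)) for name in EMOTIONS]


print(emotion_vector(surprised=0.45))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.45, 0.0]
```

The result can be passed as `emo_vector=emotion_vector(surprised=0.45)` in the call shown above.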

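Alternatively, you can enable use_emo_text to guide the emotion from the text script you provide. The script is automatically converted into an emotion vector. In text emotion mode, an emo_alpha of about 0.6 (or lower) is recommended for more natural-sounding speech. You can introduce randomness with use_random (default: False; True enables it):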

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    use_random=False
)

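You can also provide a specific emotion description directly through the emo_text parameter. The emotion text is automatically converted into an emotion vector. This lets you control the text script and the emotion description separately: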

from xinference.client import Client
client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")

with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()

response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    emo_text="You scared the hell out of me! Are you a ghost?",
    use_random=False
)

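IndexTTS2 offline usage#

IndexTTS2 requires several small models, which are downloaded automatically during initialization. In an offline environment, you can download them into a single directory and specify that directory's path.

Simple setup

The simplest way to set up offline use is to download all the small models in advance with the hf download command.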

# Create your local models directory
mkdir -p /path/to/small_models

# Download models from Hugging Face
hf download facebook/w2v-bert-2.0 --local-dir /path/to/small_models/w2v-bert-2.0
hf download funasr/campplus --local-dir /path/to/small_models/campplus
hf download nvidia/bigvgan_v2_22khz_80band_256x --local-dir /path/to/small_models/bigvgan
hf download amphion/MaskGCT --local-dir /path/to/small_models/MaskGCT

The final directory structure should look like this:

/path/to/small_models/
├── w2v-bert-2.0/                 # Feature extraction model
├── campplus/                     # Speaker recognition model
├── bigvgan/                      # Vocoder model
└── MaskGCT/                      # Semantic codec model
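Before going offline, it may be worth confirming that the directory really has this layout. The sketch below only checks that each expected sub-directory exists and is non-empty; it does not validate the model files themselves, and the function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Sub-directory names from the layout above.
EXPECTED_SUBDIRS = ("w2v-bert-2.0", "campplus", "bigvgan", "MaskGCT")


def missing_small_models(small_models_dir: str) -> List[str]:
    """Return the expected sub-directories that are absent or empty."""
    root = Path(small_models_dir)
    missing = []
    for name in EXPECTED_SUBDIRS:
        sub = root / name
        if not sub.is_dir() or not any(sub.iterdir()):
            missing.append(name)
    return missing
```

An empty list from `missing_small_models("/path/to/small_models")` means all four directories are present and contain at least one entry.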

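Supported models

The small models are mapped automatically as follows:

w2v-bert-2.0 (models--facebook--w2v-bert-2.0) - feature extraction model
campplus (models--funasr--campplus) - speaker recognition model
bigvgan (models--nvidia--bigvgan_v2_22khz_80band_256x) - vocoder model
MaskGCT (models--amphion--MaskGCT) - semantic codec model

Launching IndexTTS2 in offline mode

When launching IndexTTS2 from the Web UI, you can add an extra parameter: small_models_dir, the path to the directory containing all the small models.

When launching from the command line, you can add the following option: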

xinference launch --model-name IndexTTS2 --model-type audio \
    --small_models_dir /path/to/small_models

When launching with the Python client:

model_uid = client.launch_model(
    model_name="IndexTTS2",
    model_type="audio",
    small_models_dir="/path/to/small_models"
)