Audio#
Learn how to use Xinference to turn audio into text or text into audio.
Introduction#
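The Audio API provides three ways to interact with audio:
The transcriptions endpoint transcribes audio into the input language.
The translations endpoint translates audio into English.
The speech endpoint generates audio from the input text.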
| API endpoint | OpenAI-compatible endpoint |
|---|---|
| Transcription API | /v1/audio/transcriptions |
| Translation API | /v1/audio/translations |
| Speech API | /v1/audio/speech |
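Supported models#
The following models in Xinference support the Audio API:
Speech to text#
Available only on Mac M-series chips:
Text to speech (TTS)#
Models that support zero-shot synthesis (no reference audio required)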
MeloTTS series
Models that support voice cloning (reference audio required)
Models that support emotion control
Available only on Mac M-series chips:
Quickstart#
Transcription#
The Transcription API mimics OpenAI's create transcriptions API. You can try it with cURL, the OpenAI client, or Xinference's Python client:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/transcriptions' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
# OpenAI Python client
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.transcriptions.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
# Xinference Python client
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.transcriptions(audio=audio_file.read())
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
Translation#
The Translation API mimics OpenAI's create translations API. You can try it with cURL, the OpenAI client, or Xinference's Python client:
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/translations' \
  -H 'accept: application/json' \
  -F 'model=<MODEL_UID>' \
  -F 'file=@speech.mp3'
# OpenAI Python client
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
with open("speech.mp3", "rb") as audio_file:
    client.audio.translations.create(
        model="<MODEL_UID>",
        file=audio_file,
    )
# Xinference Python client
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
with open("speech.mp3", "rb") as audio_file:
    model.translations(audio=audio_file.read())
{
"text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
Speech#
The Speech API mimics OpenAI's create speech API. You can try it with cURL, the OpenAI client, or Xinference's Python client:
The Speech API is non-streaming by default.
ChatTTS streaming output is not as good as its non-streaming output; see 2noise/ChatTTS#564.
Streaming requires ffmpeg<7: https://pytorch.org/audio/stable/installation.html#optional-dependencies
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "echo",
    "stream": true
  }'
# OpenAI Python client
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    voice="echo",
)
# Xinference Python client
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.speech(
    input="<The text to generate audio for>",
    voice="echo",
    stream=True,
)
The output will be an audio binary.
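Because the result is raw audio data, a common next step is to write it to a file. The helper below is a minimal sketch, assuming the non-streaming call returns a single bytes object and the streaming call returns an iterable of bytes chunks, as the examples in this guide suggest. The name `save_speech` is hypothetical and not part of Xinference.

```python
from typing import Iterable, Union


def save_speech(result: Union[bytes, Iterable[bytes]], path: str) -> int:
    """Write speech output to `path` and return the number of bytes written.

    Assumption: `result` is either one bytes object (non-streaming)
    or an iterable of bytes chunks (stream=True).
    """
    written = 0
    with open(path, "wb") as f:
        if isinstance(result, (bytes, bytearray)):
            # Non-streaming: the whole audio arrives at once.
            written += f.write(result)
        else:
            # Streaming: append each chunk as it arrives.
            for chunk in result:
                written += f.write(chunk)
    return written
```

With a model handle this could be called as `save_speech(model.speech(input="...", voice="echo"), "out.mp3")`, and the same call would work when `stream=True` is passed.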
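ChatTTS usage#
Basic usage: see the Speech section.
Fixed voice. We can use the fixed voices provided by 6drf21e/ChatTTS_Speaker. Download evaluation_results.csv and, taking the seed_2155 voice as an example, use the data in its emb_data column.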
import pandas as pd
df = pd.read_csv("evaluation_results.csv")
emb_data_2155 = df[df['seed_id'] == 'seed_2155'].iloc[0]["emb_data"]
Use the fixed seed_2155 voice to generate speech.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
# emb_data_2155 comes from the pandas snippet above.
resp_bytes = model.speech(
    voice=emb_data_2155,
    input="<The text to generate audio for>"
)
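CosyVoice model usage#
CosyVoice comes in two versions: CosyVoice 1.0 and CosyVoice 2.0. CosyVoice 1.0 has three models:
CosyVoice-300M-SFT: choose this model if you only need to convert text to speech. It provides these pretrained voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
CosyVoice-300M: choose this model if you want to clone a voice or convert text into speech in another language. With this model you must provide prompt_speech as a WAV file; a 16,000 Hz sample rate is recommended for better performance.
CosyVoice-300M-Instruct: choose this model if you need precise control over tone and timbre.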
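Since a 16,000 Hz WAV prompt is recommended, it can help to check the sample rate before sending the file. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV data. The function names and the constant are hypothetical helpers, not part of Xinference or CosyVoice.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 16000  # sample rate recommended for prompt_speech


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate in Hz of an in-memory PCM WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def is_recommended_rate(wav_bytes: bytes) -> bool:
    """True if the prompt audio already uses the recommended sample rate."""
    return wav_sample_rate(wav_bytes) == RECOMMENDED_RATE_HZ
```

A file that fails this check can still be a valid prompt; the check only flags audio that is not at the recommended rate.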
Basic usage: launch model CosyVoice-300M-SFT.
# Available voices: ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/audio/speech' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "input": "<The text to generate audio for>",
    "voice": "中文女"
  }'
# OpenAI Python client
import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
response = client.audio.speech.create(
    model="<MODEL_UID>",
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女",
)
response.stream_to_file('1.mp3')
# Xinference Python client
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
speech_bytes = model.speech(
    input="<The text to generate audio for>",
    # ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
    voice="中文女"
)
with open('1.mp3', 'wb') as f:
    f.write(speech_bytes)
Voice cloning: launch model CosyVoice-300M.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Placeholder: path to the WAV file used as the zero-shot prompt.
zero_shot_prompt_file = "<path to the zero-shot prompt WAV file>"
zero_shot_prompt_text = ("<the words in the text exactly match "
                         "the audio file of the zero-shot prompt>")
# The words said in the audio file should be identical
# to zero_shot_prompt_text.
#
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(zero_shot_prompt_file, "rb") as f:
    zero_shot_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_text=zero_shot_prompt_text,
    prompt_speech=zero_shot_prompt,
)
Cross-lingual usage: launch model CosyVoice-300M.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Placeholder: path to the WAV file used as the cross-lingual prompt.
cross_lingual_prompt_file = "<path to the cross-lingual prompt WAV file>"
# The audio input file must be in WAV format.
# For optimal performance, use a 16,000 Hz sample rate.
#
# Files with different sample rates will be resampled to 16,000 Hz,
# which may increase processing time.
with open(cross_lingual_prompt_file, "rb") as f:
    cross_lingual_prompt = f.read()

speech_bytes = model.speech(
    "<The text to generate audio for>",  # text could be another language
    prompt_speech=cross_lingual_prompt,
)
Instruction-based speech synthesis: launch model CosyVoice-300M-Instruct.
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

response = model.speech(
    "在面对挑战时,他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>。",
    voice="中文男",
    instruct_text="Theo 'Crimson', is a fiery, passionate rebel leader. "
                  "Fights with fervor for justice, but struggles with impulsiveness.",
)
CosyVoice 2.0 has a single model that covers the capabilities of all three CosyVoice 1.0 models. It is used the same way as CosyVoice 1.0.
CosyVoice 2.0 streaming usage: launch model CosyVoice2-0.5B.
import inspect
import os
import tempfile

import openai
from xinference.client import Client

endpoint = "http://127.0.0.1:9997"
client = Client(endpoint)

# Launch model
model_uid = client.launch_model(
    model_name="CosyVoice2-0.5B",
    model_type="audio",
    download_hub="modelscope",
)
model = client.get_model(model_uid)

input_string = "你好,我是通义生成式语音大模型,请问有什么可以帮您的吗?"

# Stream request by openai client
openai_client = openai.Client(api_key="not empty", base_url=f"{endpoint}/v1")
# ['中文女', '中文男', '日语男', '粤语女', '英文女', '英文男', '韩语女']
with openai_client.audio.speech.with_streaming_response.create(
    model=model_uid, input=input_string, voice="英文女"
) as response:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
        response.stream_to_file(f.name)
        assert os.stat(f.name).st_size > 0

# Stream request by xinference client
response = model.speech(input_string, stream=True)
assert inspect.isgenerator(response)
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=True) as f:
    for chunk in response:
        f.write(chunk)
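For more instructions and examples, see https://fun-audio-llm.github.io/ .
FishSpeech model usage#
Basic usage: see the Speech section.
Voice cloning: launch model FishSpeech-1.5. To give a FishSpeech model its reference audio, use prompt_speech instead of reference_audio and prompt_text instead of reference_text. These parameter names match the ones used for CosyVoice voice cloning.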
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")

# Placeholder: path to the reference voice recording.
reference_audio_file = "<path to the reference audio file>"
# The reference audio file is the voice file;
# the words said in the file should be identical to reference_text.
with open(reference_audio_file, "rb") as f:
    reference_audio = f.read()
reference_text = ""  # text in the audio

speech_bytes = model.speech(
    "<The text to generate audio for>",
    prompt_speech=reference_audio,
    prompt_text=reference_text
)
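Paraformer usage#
| Model | Voice activity detection (vad) | Punctuation restoration (punc) | Timestamps | Speaker | Hotword |
|---|---|---|---|---|---|
|  | Yes | Yes | No | No | No |
| paraformer-zh-hotword | Yes | Yes | No | No | Yes |
| paraformer-zh-spk | Yes | Yes | Yes | Yes | No |
| paraformer-zh-long | Yes | Yes | Yes | Yes | No |
| seaco-paraformer-zh (recommended) | Yes | Yes | Yes | Yes | Yes |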
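Using VAD and punctuation
All Paraformer models support VAD and punctuation restoration.
Using timestamps and speaker identification
Only the following models support timestamps and speaker identification: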
paraformer-zh-spk
paraformer-zh-long
seaco-paraformer-zh
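Of these, only paraformer-zh-spk has speaker identification enabled by default.
If you use paraformer-zh-long or seaco-paraformer-zh and need speaker identification:
In the Web UI: add a parameter named spk_model with the value cam++
On the command line: add the option --spk_model cam++
Example: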
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
model.transcriptions(audio, response_format="verbose_json")
Using hotwords
Only the following models support the hotword feature:
paraformer-zh-hotword
seaco-paraformer-zh
Example:
from xinference.client import Client

client = Client("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("seaco-paraformer-zh")
with open("asr_example.wav", "rb") as audio_file:
    audio = audio_file.read()
model.transcriptions(audio, hotword="小艾 魔搭")
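When several features are needed at once, the model lists above can be intersected to find a suitable Paraformer model. The sketch below encodes only the support stated in this section; the names `FEATURE_MODELS` and `models_supporting` are hypothetical helpers, not part of Xinference.

```python
from typing import List

# Feature support as listed in this section (VAD and punctuation are
# supported by all Paraformer models, so they are not tracked here).
FEATURE_MODELS = {
    "timestamp": {"paraformer-zh-spk", "paraformer-zh-long", "seaco-paraformer-zh"},
    "speaker": {"paraformer-zh-spk", "paraformer-zh-long", "seaco-paraformer-zh"},
    "hotword": {"paraformer-zh-hotword", "seaco-paraformer-zh"},
}


def models_supporting(*features: str) -> List[str]:
    """Return the models that support every requested feature, sorted by name."""
    if not features:
        raise ValueError("request at least one feature")
    candidate_sets = [FEATURE_MODELS[name] for name in features]
    return sorted(set.intersection(*candidate_sets))
```

For example, `models_supporting("hotword", "timestamp")` narrows the choice to seaco-paraformer-zh, which is consistent with it being the recommended model.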
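SenseVoiceSmall offline usage#
SenseVoiceSmall now uses a small VAD model, fsmn-vad, so it needs network access to download it.
For offline environments, you can download this VAD model in advance.
Download it from huggingface or modelscope. Assume it is downloaded to /path/to/fsmn-vad.
Then, when launching SenseVoiceSmall from the Web UI, add an extra option whose key is vad_model and whose value is the download path /path/to/fsmn-vad. When launching from the command line, add the option --vad_model /path/to/fsmn-vad.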
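Kokoro model usage#
The Kokoro model supports multiple languages and defaults to English. To use a non-default language such as Chinese, install the extra dependency and pass the corresponding parameter when launching the model.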
pip install misaki[zh]
Initialize the model with lang_code='z'; see the kokoro source code for all supported lang_code values. If you launch the model from the Web UI, add an extra parameter whose key is lang_code and whose value is z. If you launch the model with the xinference client, pass the parameter as in the following code:
model_uid = client.launch_model(
    model_name="Kokoro-82M",
    model_type="audio",
    compile=False,
    download_hub="huggingface",
    lang_code="z",
)
At inference time, use a voice whose name starts with 'z', for example zf_xiaoyi. See https://huggingface.co/hexgrad/Kokoro-82M/tree/main/voices for the currently supported voices. Usage:
input_string = "重新启动即可更新"
response = model.speech(input_string, voice="zf_xiaoyi")
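IndexTTS2 usage#
The IndexTTS2 model supports emotion control, which you enable by passing a few extra parameters. IndexTTS2 can be used in the following ways:
Single reference audio (voice cloning):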
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Translate for me, what is a surprise!",
    prompt_speech=test_prompt_speech,
)
With an emotion reference audio:
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()
response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech
)
When an emotion reference audio is given, you can optionally set the emo_alpha parameter to adjust how strongly it affects the output. The valid range is 0.0 - 1.0, and the default is 1.0 (100%).
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
with open("example/emo_sad.wav", "rb") as f:
    emo_prompt_speech = f.read()
response = model.speech(
    input="It's such a shame the singer didn't make it to the finals.",
    prompt_speech=test_prompt_speech,
    emo_audio_prompt=emo_prompt_speech,
    emo_alpha=0.9
)
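You can omit the emotion reference audio and instead provide a list of 8 floats that set the intensity of each emotion, in this order: [happy, angry, sad, afraid, disgusted, melancholic, surprised, calm]. You can also use the use_random parameter to introduce randomness into the emotion during inference; it defaults to False, and setting it to True enables the randomness.
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Wow, I'm so lucky!",
    prompt_speech=test_prompt_speech,
    emo_vector=[0, 0, 0, 0, 0, 0, 0.45, 0],
    use_random=False
)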
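Writing the 8-float list by position is easy to get wrong, so it can help to build it from named emotions. This is a minimal sketch: the English labels are chosen here for illustration and follow the order given above, and `build_emo_vector` is a hypothetical helper, not part of Xinference or IndexTTS2.

```python
from typing import List

# Position of each emotion in emo_vector, in the order listed above.
# The English labels are illustrative names for those eight slots.
EMOTIONS = [
    "happy", "angry", "sad", "afraid",
    "disgusted", "melancholic", "surprised", "calm",
]


def build_emo_vector(**weights: float) -> List[float]:
    """Return an 8-float list with the given emotion intensities, 0.0 elsewhere."""
    unknown = sorted(set(weights) - set(EMOTIONS))
    if unknown:
        raise ValueError(f"unknown emotion(s): {unknown}")
    return [float(weights.get(name, 0.0)) for name in EMOTIONS]
```

For instance, `build_emo_vector(surprised=0.45)` produces the same vector as the example above and could be passed as `emo_vector`.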
Alternatively, you can enable use_emo_text to guide the emotion from the text script you provide. Your text script is automatically converted into an emotion vector. When using text emotion mode, setting emo_alpha to about 0.6 (or lower) is recommended for more natural-sounding speech. You can introduce randomness with use_random (default: False; True enables randomness):
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    use_random=False
)
You can also supply a specific text emotion description directly through the emo_text parameter. Your emotion text is automatically converted into an emotion vector. This lets you control the text script and the text emotion description separately:
from xinference.client import Client

client = Client("http://0.0.0.0:6735")
model = client.get_model("IndexTTS2")
with open("../mp3_test_voice.mp3", "rb") as f:
    test_prompt_speech = f.read()
response = model.speech(
    input="Quick, hide! He's coming! He's coming to get us!",
    prompt_speech=test_prompt_speech,
    emo_alpha=0.6,
    use_emo_text=True,
    emo_text="You scared the hell out of me! Are you a ghost?",
    use_random=False
)
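IndexTTS2 offline usage#
IndexTTS2 requires several small models, which are downloaded automatically during initialization. In an offline environment, you can download these models into a single directory and specify that directory's path.
Simple setup
The simplest way to set up offline use is to download all the small models in advance with the hf download command.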
# Create your local models directory
mkdir -p /path/to/small_models
# Download models from Hugging Face
hf download facebook/w2v-bert-2.0 --local-dir /path/to/small_models/w2v-bert-2.0
hf download funasr/campplus --local-dir /path/to/small_models/campplus
hf download nvidia/bigvgan_v2_22khz_80band_256x --local-dir /path/to/small_models/bigvgan
hf download amphion/MaskGCT --local-dir /path/to/small_models/MaskGCT
The final directory structure should look like this:
/path/to/small_models/
├── w2v-bert-2.0/ # Feature extraction model
├── campplus/ # Speaker recognition model
├── bigvgan/ # Vocoder model
└── MaskGCT/ # Semantic codec model
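Before launching in an offline environment, it can be worth confirming that the directory matches this layout. The sketch below only checks that the four subdirectories exist, not that their contents are complete; `REQUIRED_SUBDIRS` and `missing_small_models` are hypothetical names, not part of Xinference.

```python
from pathlib import Path
from typing import List

# Subdirectory names taken from the layout shown above.
REQUIRED_SUBDIRS = ["w2v-bert-2.0", "campplus", "bigvgan", "MaskGCT"]


def missing_small_models(small_models_dir: str) -> List[str]:
    """Return the required subdirectories that are absent from small_models_dir."""
    root = Path(small_models_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]
```

An empty result from `missing_small_models("/path/to/small_models")` means all four expected folders are present.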
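Supported models
The small models are mapped automatically as follows:
w2v-bert-2.0 (models--facebook--w2v-bert-2.0) - feature extraction model
campplus (models--funasr--campplus) - speaker recognition model
bigvgan (models--nvidia--bigvgan_v2_22khz_80band_256x) - vocoder model
MaskGCT (models--amphion--MaskGCT) - semantic codec model
Launching IndexTTS2 in offline mode
When launching IndexTTS2 from the Web UI, add an extra parameter:
small_models_dir - path to the directory containing all the small models
When launching from the command line, add the following option: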
xinference launch --model-name IndexTTS2 --model-type audio \
--small_models_dir /path/to/small_models
When launching with the Python client:
model_uid = client.launch_model(
model_name="IndexTTS2",
model_type="audio",
small_models_dir="/path/to/small_models"
)