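Custom models#

Xinference provides a flexible and comprehensive way to integrate, manage, and use custom models.

Launch a custom model directly, without registering it#

Since v0.14.0, if the model you would otherwise register belongs to a model family that Xinference supports out of the box, you can launch it directly through the model_path parameter of the launch interface and skip the registration step. This is now the recommended approach.

For example: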

xinference launch --model-path <model_file_path> --model-engine <engine> -n qwen1.5-chat

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "<engine>",
  "model_name": "qwen1.5-chat",
  "model_path": "<model_file_path>"
}'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
  model_engine="<engine>",
  model_name="qwen1.5-chat",
  model_path="<model_file_path>"
)
print('Model uid: ' + model_uid)

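The examples above show how to launch qwen1.5-chat directly when its model files are already available locally.

In a distributed deployment, place the model files on one of the workers, then pass both worker_ip and model_path to the launch interface to launch the model directly on that worker.

Note

On the command line (CLI), prefer --model-path (the hyphenated form). --model_path is still accepted for compatibility with the older convention, but it is not recommended.

Define a custom model#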
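The distributed case can be sketched as a request body for the same `POST /v1/models` call used in the curl example. This is a minimal sketch, not the library's own code: the helper name `build_launch_payload`, the engine value, the path, and the IP address are hypothetical placeholders, and it assumes the endpoint reads the worker address from a field named `worker_ip`, matching the parameter name given in this section.

```python
import json
from typing import Optional


def build_launch_payload(
    model_name: str,
    model_engine: str,
    model_path: str,
    worker_ip: Optional[str] = None,
) -> dict:
    """Build the JSON body for POST /v1/models (hypothetical helper)."""
    payload = {
        "model_engine": model_engine,
        "model_name": model_name,
        "model_path": model_path,
    }
    # Assumption: the target worker is selected with a "worker_ip" field.
    if worker_ip is not None:
        payload["worker_ip"] = worker_ip
    return payload


if __name__ == "__main__":
    body = build_launch_payload(
        model_name="qwen1.5-chat",
        model_engine="transformers",  # placeholder engine name
        model_path="/data/models/qwen1.5-chat",  # placeholder path on the worker
        worker_ip="192.168.0.12",  # placeholder worker address
    )
    print(json.dumps(body, indent=2))
```

The printed body can be sent with the curl command shown earlier. Note that model_path refers to a path on the selected worker, not on the machine that issues the request.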

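Web UI: automatic parsing of LLM configuration#

Added in v2.0.0.

When you register a custom LLM through the Web UI, Xinference can parse the model configuration automatically and prefill the key fields for you.

You only need to provide:

Model path / model ID (where the model lives: a local path or a hub ID)
Model family

After parsing, the UI can fill in the following fields automatically:

Context length
Model languages
Model abilities
Model specs

You can review and edit these fields before saving the custom model.

Define a custom model from one of the following templates. They cover, in order, an LLM, an embedding model, a rerank model, an image model, an audio model, and a flexible model: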

{
    "version": 2,
    "context_length": 32768,
    "model_name": "custom-qwen-2.5",
    "model_lang": [
        "en",
        "zh"
    ],
    "model_ability": [
        "generate"
    ],
    "model_description": "This is a custom model description.",
    "model_family": "my-custom-qwen-2.5",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": "0_5",
            "quantization": "none",
            "model_id": null,
            "model_hub": "huggingface",
            "model_uri": "file:///path/to/models--Qwen--Qwen2.5-0.5B",
            "model_revision": null,
            "activated_size_in_billions": null
        }
    ],
    "chat_template": null,
    "stop_token_ids": null,
    "stop": null,
    "reasoning_start_tag": null,
    "reasoning_end_tag": null,
    "cache_config": null,
    "virtualenv": {
        "packages": [],
        "inherit_pip_config": true,
        "index_url": null,
        "extra_index_url": null,
        "find_links": null,
        "trusted_host": null,
        "no_build_isolation": null
    },
    "is_builtin": false
}

{
   "version": 2,
   "model_name": "my-bge-large-zh-v1.5",
   "dimensions": 1024,
   "max_tokens": 512,
   "language": [
       "zh"
   ],
   "model_specs": [
      {
          "model_format": "pytorch",
          "model_hub": "huggingface",
          "model_id": null,
          "model_uri": "file:///path/to/my-bge-large-zh-v1.5",
          "model_revision": null,
          "quantization": "none"
      }
   ],
   "cache_config": null,
   "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
   },
   "is_builtin": false
}

{
  "version": 2,
  "model_name": "my-bge-reranker-base",
  "model_specs": [
      {
          "model_format": "pytorch",
          "model_hub": "huggingface",
          "model_id": null,
          "model_revision": null,
          "model_uri": "file:///path/to/my-bge-reranker-base",
          "quantization": "none"
      }
  ],
  "language": [
      "en",
      "zh"
  ],
  "type": "unknown",
  "max_tokens": 512,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "is_builtin": false
}

{
  "model_name": "my-qwen-image",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_family": "stable_diffusion",
  "model_ability": null,
  "controlnet": [],
  "default_model_config": {},
  "default_generate_config": {},
  "gguf_model_id": null,
  "gguf_quantizations": null,
  "gguf_model_file_name_template": null,
  "lightning_model_id": null,
  "lightning_versions": null,
  "lightning_model_file_name_template": null,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "model_uri": "file:///path/to/my-qwen-image",
  "is_builtin": false
}

{
  "model_name": "my-ChatTTS",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_family": "ChatTTS",
  "multilingual": false,
  "language": null,
  "model_ability": [
      "text2audio"
  ],
  "default_model_config": null,
  "default_transcription_config": null,
  "engine": null,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "model_uri": "file:///path/to/my-ChatTTS",
  "is_builtin": false
}

{
  "model_name": "my-flexible-model",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_description": "This is a model description.",
  "model_uri": "file:///path/to/my-flexible-model",
  "launcher": "xinference.model.flexible.launchers.transformers",
  "launcher_args": "{}",
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "is_builtin": false
}
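A definition file can also be produced from code instead of being edited by hand. The sketch below copies a template, overrides the name, family, and location, and builds the `file://` URI with `pathlib`. The helper name `fill_llm_template`, the sample names, and the sample directory are hypothetical, and the in-code template is a shortened stand-in for the full LLM template above.

```python
import copy
import json
from pathlib import Path


def fill_llm_template(
    template: dict, model_name: str, model_family: str, model_dir: Path
) -> dict:
    """Return a copy of `template` pointing at a local model directory."""
    definition = copy.deepcopy(template)  # leave the caller's template untouched
    definition["model_name"] = model_name
    definition["model_family"] = model_family
    # as_uri() needs an absolute path and yields a file:///... URI.
    uri = model_dir.resolve().as_uri()
    for spec in definition["model_specs"]:
        spec["model_uri"] = uri
    return definition


if __name__ == "__main__":
    # Shortened stand-in; start from the full LLM template in practice.
    template = {
        "version": 2,
        "model_name": "",
        "model_family": "",
        "model_specs": [{"model_format": "pytorch", "model_uri": None}],
        "is_builtin": False,
    }
    definition = fill_llm_template(
        template,
        model_name="custom-qwen-2.5",
        model_family="my-custom-qwen-2.5",
        model_dir=Path("/path/to/models--Qwen--Qwen2.5-0.5B"),
    )
    print(json.dumps(definition, indent=4))
```

Saving the printed JSON as model.json gives a file that can be passed to the registration commands later on this page.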

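model_name: The model name. It must start with a letter or digit and may contain only letters, digits, underscores, and hyphens.
context_length: An optional integer giving the maximum context length the model supports, counting both input and output tokens. If it is not set, the default is 2048 tokens (roughly 1,500 words).
dimensions: An integer giving the size of the vectors an embedding model outputs.
max_tokens: An integer giving the maximum number of input tokens an embedding model can process in a single request.
model_lang: A list of strings naming the languages the model supports. For example, ['en'] means the model supports English.
model_ability: A list of strings defining what the model can do. It can contain options such as 'embed', 'generate', and 'chat'. The LLM template above declares the 'generate' ability.
model_family: A required string naming the model family to register. The name must not clash with any built-in model name.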
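model_specs: An array of objects defining the model's specifications. Each spec includes:
- model_format: A string defining the model format, either 'pytorch' or 'ggufv2'.
- model_size_in_billions: The model's parameter count, in billions. Whole sizes are integers; the LLM template above writes a fractional size as the string "0_5".
- quantization: A string defining the model's quantization. For PyTorch models it can be "4-bit", "8-bit", or "none". For ggufv2 models it should match a value used in model_file_name_template. Some engines also support the fp4 / fp8 / bnb formats (see the installation guide for backend support details).
- model_id: A string identifying the model, typically its Hugging Face repository id. If model_uri is missing, Xinference tries to download the model from the Hugging Face repository this id points to.
- model_hub: An optional string naming where to download the model from, such as huggingface or modelscope.
- model_uri: A string giving the location of the model files, for example the local directory "file:///path/to/llama-2-7b". When model_format is ggufv2, this must be the path to a specific model file. When model_format is pytorch, it must be a directory containing all of the model files.
- model_revision: A string giving the specific version or commit hash of the model files to use from the repository.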
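chat_template: Required when model_ability includes chat, so that a suitable full prompt can be generated. It is a Jinja template string, which you can usually find in the tokenizer_config.json file in the model directory.
stop_token_ids: Recommended when model_ability includes chat, so that conversations stop cleanly. It is a list of integers; the values can be taken from the generation_config.json and tokenizer_config.json files in the model directory.
stop: Recommended when model_ability includes chat, so that conversations stop cleanly. It is a list of strings; the strings that correspond to the stop tokens can be found in the tokenizer_config.json file in the model directory.
reasoning_start_tag: A special token or prompt that explicitly marks where the chain of thought or reasoning begins in the LLM's output.
reasoning_end_tag: A special token or prompt that explicitly marks where the chain of thought or reasoning ends in the LLM's output.
cache_config: A string giving the parameters that control how the system stores and manages cached data.
virtualenv: A settings object for isolating model dependencies. See the virtual environment documentation for details.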
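The chat-related fields can be collected from the model directory rather than copied by hand. The sketch below is one possible way to do it; the helper name `read_chat_settings` is hypothetical. It assumes the Hugging Face layout in which tokenizer_config.json stores the template under a `chat_template` key and generation_config.json stores the end-of-sequence ids under `eos_token_id`. Neither key name is stated in this document, so inspect the files of your own model before relying on the result.

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_chat_settings(model_dir: Path) -> Dict[str, Any]:
    """Collect candidate chat_template and stop_token_ids values."""
    settings: Dict[str, Any] = {"chat_template": None, "stop_token_ids": None}

    tokenizer_cfg = model_dir / "tokenizer_config.json"
    if tokenizer_cfg.is_file():
        data = json.loads(tokenizer_cfg.read_text(encoding="utf-8"))
        # Assumption: the Jinja template is stored under "chat_template".
        settings["chat_template"] = data.get("chat_template")

    generation_cfg = model_dir / "generation_config.json"
    if generation_cfg.is_file():
        data = json.loads(generation_cfg.read_text(encoding="utf-8"))
        # Assumption: stop ids are under "eos_token_id" (an int or a list).
        eos = data.get("eos_token_id")
        if isinstance(eos, int):
            settings["stop_token_ids"] = [eos]
        elif isinstance(eos, list):
            settings["stop_token_ids"] = [int(i) for i in eos]

    return settings


if __name__ == "__main__":
    # Placeholder directory; a missing file simply leaves its field as None.
    print(read_chat_settings(Path("/path/to/models--Qwen--Qwen2.5-0.5B")))
```

The returned values can be pasted into the chat_template and stop_token_ids fields of the definition. Treat them as a starting point and review them before registering the model.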

Register a custom model#

Register a custom model from code:

import json
from xinference.client import Client

with open('model.json') as fd:
    model = fd.read()

# replace with real xinference endpoint
endpoint = 'http://localhost:9997'
client = Client(endpoint)
client.register_model(model_type="<model_type>", model=model, persist=False)

From the command line:

xinference register --model-type <model_type> --file model.json --persist

Replace <model_type> in these commands with LLM, embedding, or rerank.
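A malformed definition file is easier to diagnose before it reaches the server. The sketch below wraps the `register_model` call from the snippet above with a local JSON check. The helper name `register_from_file` is hypothetical; it accepts any client object that exposes `register_model` with the keyword arguments used above.

```python
import json
from pathlib import Path


def register_from_file(
    client, path: str, model_type: str, persist: bool = False
) -> str:
    """Check a definition file locally, register it, and return its name."""
    text = Path(path).read_text(encoding="utf-8")
    definition = json.loads(text)  # raises json.JSONDecodeError if malformed
    name = definition.get("model_name")
    if not name:
        raise ValueError(f"{path} does not define model_name")
    # The raw JSON string is passed on, as in the snippet above.
    client.register_model(model_type=model_type, model=text, persist=persist)
    return name
```

With the client created earlier, a call such as `register_from_file(client, "model.json", "LLM")` registers the definition and returns the name to use when launching it.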

List built-in and custom models#

List the built-in and custom models from code:

registrations = client.list_model_registrations(model_type="<model_type>")

From the command line:

xinference registrations --model-type <model_type>
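To see only the custom entries, the returned list can be filtered. This sketch assumes each registration is a dictionary with a `model_name` key and an `is_builtin` flag, mirroring the `is_builtin` field of the definition templates; the exact shape of the return value is not documented here, so verify it against your own output. The helper name `custom_model_names` is hypothetical.

```python
from typing import Any, Dict, List


def custom_model_names(registrations: List[Dict[str, Any]]) -> List[str]:
    """Return the sorted names of registrations not flagged as built-in."""
    return sorted(
        entry["model_name"]
        for entry in registrations
        # Assumption: entries without the flag are treated as custom.
        if not entry.get("is_builtin", False)
    )


if __name__ == "__main__":
    sample = [
        {"model_name": "qwen1.5-chat", "is_builtin": True},
        {"model_name": "custom-llama-2", "is_builtin": False},
    ]
    print(custom_model_names(sample))
```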

Launch a custom model#

Launch a custom model from code:

uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

From the command line:

xinference launch --model-name custom-llama-2 --model-format pytorch

Use a custom model#

Call the model from code:

model = client.get_model(model_uid=uid)
model.generate('What is the largest animal in the world?')

The result is:

{
   "id":"cmpl-a4a9d9fc-7703-4a44-82af-fce9e3c0e52a",
   "object":"text_completion",
   "created":1692024624,
   "model":"43e1f69a-3ab0-11ee-8f69-fa163e74fa2d",
   "choices":[
      {
         "text":"\nWhat does an octopus look like?\nHow many human hours has an octopus been watching you for?",
         "index":0,
         "logprobs":"None",
         "finish_reason":"stop"
      }
   ],
   "usage":{
      "prompt_tokens":10,
      "completion_tokens":23,
      "total_tokens":33
   }
}
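The generated text sits inside the `choices` array of the response shown above. The sketch below pulls out the first completion and checks the token accounting. Both helper names are hypothetical, and the field layout is taken from the sample result rather than from a formal schema.

```python
from typing import Any, Dict


def first_completion_text(response: Dict[str, Any]) -> str:
    """Return the text of the first choice in a completion response."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["text"]


def usage_is_consistent(response: Dict[str, Any]) -> bool:
    """Check that prompt and completion tokens add up to the total."""
    usage = response["usage"]
    return (
        usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"]
    )


if __name__ == "__main__":
    # Shortened copy of the sample result above.
    sample = {
        "choices": [{"text": "\nWhat does an octopus look like?", "index": 0}],
        "usage": {"prompt_tokens": 10, "completion_tokens": 23, "total_tokens": 33},
    }
    print(first_completion_text(sample).strip())
    print(usage_is_consistent(sample))
```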

Alternatively, from the command line, replacing ${UID} with the actual model UID:

xinference generate --model-uid ${UID}

Unregister a custom model#

Unregister a custom model from code:

model = client.unregister_model(model_type="<model_type>", model_name='custom-llama-2')

From the command line:

xinference unregister --model-type <model_type> --model-name custom-llama-2
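For short-lived experiments, registration and cleanup can be paired so the custom model does not linger after a failure. The sketch below is one way to do that with a context manager. The name `temporary_registration` is hypothetical, and it relies only on the `register_model` and `unregister_model` calls shown on this page.

```python
import json
from contextlib import contextmanager


@contextmanager
def temporary_registration(client, model_type: str, model_json: str):
    """Register a model for the duration of a with-block, then unregister it."""
    name = json.loads(model_json)["model_name"]
    client.register_model(model_type=model_type, model=model_json, persist=False)
    try:
        yield name
    finally:
        # Runs even if the body of the with-block raises.
        client.unregister_model(model_type=model_type, model_name=name)
```

A typical use is `with temporary_registration(client, "LLM", model) as name:` followed by the launch and generate calls from the earlier sections. This sketch does not stop a model that was launched inside the block; handle that separately before the block ends.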