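Usage#

Running Xinference locally#

Let's use qwen2.5-instruct, a classic large language model, to show how to run an LLM locally with Xinference.

After this quickstart, you can go on to learn how to deploy Xinference in a distributed cluster environment.

Starting the local service#

First, make sure Xinference is installed locally by following the installation instructions. Then start the local Xinference service with the following command: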

xinference-local --host 0.0.0.0 --port 9997

INFO     Xinference supervisor 0.0.0.0:64570 started
INFO     Xinference worker 0.0.0.0:64570 started
INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
INFO     Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)

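Note

By default, Xinference uses <HOME>/.xinference as its home directory for storing necessary information such as log files and model files, where <HOME> is the current user's home directory.

You can change the home directory by setting the environment variable XINFERENCE_HOME. For example: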

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997
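The lookup rule described in this note can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Xinference's own code; the function name resolve_xinference_home is hypothetical.

```python
import os
from pathlib import Path


def resolve_xinference_home(environ=None):
    """Mirror the documented rule: XINFERENCE_HOME if set, else <HOME>/.xinference.

    Hypothetical helper for illustration; Xinference's internal logic may differ.
    """
    env = os.environ if environ is None else environ
    override = env.get("XINFERENCE_HOME")
    if override:
        return Path(override)
    # Fall back to the default location under the current user's home directory.
    return Path.home() / ".xinference"


if __name__ == "__main__":
    print(resolve_xinference_home())
```

Passing a plain dict instead of os.environ makes it easy to check which directory a given environment would select before starting the service.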

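Congratulations! You now have a Xinference service running locally. Once it is running, there are several ways to use it: the web UI, cURL, the command line, or Xinference's Python SDK.

Visit http://127.0.0.1:9997/ui to use the web UI, and http://127.0.0.1:9997/docs to browse the API documentation.

To use the Xinference command-line tool or Python code, first install Xinference with the following command: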

pip install xinference

The command-line tool is xinference. Run the following command to list the available subcommands:

xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --version       Show the version and exit.
  --log-level TEXT
  -H, --host TEXT
  -p, --port INTEGER
  --help              Show this message and exit.

Commands:
  cached
  cal-model-mem
  chat
  engine
  generate
  launch
  list
  login
  register
  registrations
  remove-cache
  stop-cluster
  terminate
  unregister
  vllm-models

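If you only need Xinference's Python SDK, you can install it with minimal dependencies using the following command. Note that the client version must match the version of the Xinference server.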

pip install xinference-client==${SERVER_VERSION}

About model inference engines#

Since v0.11.0, you must specify an inference engine before launching an LLM. Xinference currently supports the following inference engines:

vllm
sglang
llama.cpp
transformers
MLX

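For details about these inference engines, refer to the documentation on inference engines.

Note that when launching an LLM, the engines it can run on depend closely on the model_format and quantization parameters.

Xinference provides the xinference engine command to help you query the valid parameter combinations.

For example:

I want to query the parameter combinations for the qwen-chat model to see how it can run on the various inference engines.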

xinference engine -e <xinference_endpoint> --model-name qwen-chat

I want to run qwen-chat on the vLLM inference engine, but I don't know which other parameters satisfy that requirement.

xinference engine -e <xinference_endpoint> --model-name qwen-chat --model-engine vllm

I want to launch qwen-chat in GGUF format and need to know the remaining parameter combinations.

xinference engine -e <xinference_endpoint> --model-name qwen-chat -f ggufv2

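In short, compared with earlier versions, launching an LLM now requires an additional model_engine parameter. You can use the xinference engine command to query how the inference engine you want relates to the other parameters.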
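If you issue these queries from a script, the three commands shown earlier differ only in their optional filters. The sketch below assembles the argument list from the flags used in those examples (-e, --model-name, --model-engine, -f); the helper name engine_query_argv is hypothetical, and nothing is executed.

```python
import shlex


def engine_query_argv(endpoint, model_name, model_engine=None, model_format=None):
    """Build the argument list for an `xinference engine` query.

    Only flags that appear in the examples in this section are used.
    """
    argv = ["xinference", "engine", "-e", endpoint, "--model-name", model_name]
    if model_engine is not None:
        # Narrow the query to a single inference engine.
        argv += ["--model-engine", model_engine]
    if model_format is not None:
        # Narrow the query to a single model format, e.g. ggufv2.
        argv += ["-f", model_format]
    return argv


if __name__ == "__main__":
    argv = engine_query_argv("http://127.0.0.1:9997", "qwen-chat", model_engine="vllm")
    print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run on a machine where the xinference command-line tool is installed.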

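Note

Here are some suggestions on which engine to use and when:

Linux
- Where available, prefer vLLM or SGLang, since they offer better performance.
- If resources are limited, consider llama.cpp, which offers more quantization options.
- Otherwise, consider Transformers, which supports almost all models.
Windows
- WSL is recommended; in that case, follow the same choices as on Linux.
- Otherwise, llama.cpp is recommended; for models it does not support, use Transformers.
Mac
- If the model supports it, the MLX engine is recommended, as it delivers the best performance.
- Otherwise, llama.cpp is recommended; for models it does not support, use Transformers.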
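These suggestions amount to a preference order per platform, which can be written down as a small lookup. This is a sketch of the advice above, not an Xinference API: the function name recommended_engines is hypothetical, the engine strings copy the spellings in the list of supported engines, and whether a given model actually runs on an engine still has to be checked with xinference engine.

```python
def recommended_engines(system, use_wsl=False):
    """Return engine names in the order suggested by the note above.

    `system` uses the values returned by platform.system():
    "Linux", "Windows" or "Darwin" (macOS).
    """
    linux_order = ["vllm", "sglang", "llama.cpp", "transformers"]
    if system == "Linux":
        return linux_order
    if system == "Windows":
        # With WSL the Linux advice applies; otherwise start from llama.cpp.
        return linux_order if use_wsl else ["llama.cpp", "transformers"]
    if system == "Darwin":
        return ["MLX", "llama.cpp", "transformers"]
    raise ValueError(f"no suggestion for platform: {system!r}")


if __name__ == "__main__":
    import platform

    print(recommended_engines(platform.system()))
```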

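Running qwen2.5-instruct#

Let's run a built-in model, qwen2.5-instruct. The first time you run a model, its weights must be downloaded from HuggingFace, which typically takes 10 to 30 minutes depending on the model size. Once the download completes, Xinference caches the files locally, so later runs of the same model do not need to download them again.

Note

Xinference can also download models from other model hosting platforms. You select the source by setting an environment variable when starting Xinference. For example, to download models from ModelScope, use the following command: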

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

Use the --model-uid or -u option to specify the model's UID. If it is not specified, Xinference generates one; by default, the UID is the same as the model name.

xinference launch --model-engine <inference_engine> -n qwen2.5-instruct -s 0_5 -f pytorch

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "<inference_engine>",
  "model_name": "qwen2.5-instruct",
  "model_format": "pytorch",
  "size_in_billions": "0_5"
}'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
  model_engine="<inference_engine>",
  model_name="qwen2.5-instruct",
  model_format="pytorch",
  size_in_billions="0_5"
)
print('Model uid: ' + model_uid)

Model uid: qwen2.5-instruct

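Note

Some inference engines, such as vLLM, require engine-specific parameters when running a model. In that case, pass the parameter names and values directly on the command line. For example: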

xinference launch --model-engine vllm -n qwen2.5-instruct -s 0_5 -f pytorch --gpu_memory_utilization 0.9

When the model runs, gpu_memory_utilization=0.9 is passed through to the vLLM backend.
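For the REST and Python routes, a custom UID and engine-specific parameters would presumably travel in the same request body as the other launch fields. The sketch below builds such a body with the standard library only. Two things here are assumptions rather than facts taken from this page: that the body accepts a "model_uid" key, and that engine parameters such as gpu_memory_utilization are sent as extra top-level keys. The helper name build_launch_payload is hypothetical.

```python
import json


def build_launch_payload(model_engine, model_name, model_format,
                         size_in_billions, model_uid=None, **engine_kwargs):
    """Assemble a JSON body for POST /v1/models, following the cURL example above."""
    payload = {
        "model_engine": model_engine,
        "model_name": model_name,
        "model_format": model_format,
        "size_in_billions": size_in_billions,
    }
    if model_uid is not None:
        # Assumption: the REST body takes the UID under the key "model_uid".
        payload["model_uid"] = model_uid
    # Assumption: engine-specific parameters are passed as extra top-level keys.
    payload.update(engine_kwargs)
    return payload


if __name__ == "__main__":
    body = build_launch_payload(
        "vllm", "qwen2.5-instruct", "pytorch", "0_5",
        model_uid="my-qwen", gpu_memory_utilization=0.9,
    )
    print(json.dumps(body, indent=2))
```

Check the API documentation at /docs on your own server before relying on these two keys.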

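Note

For more tips on launching models, see the model launching guide.

Congratulations, you have now successfully run qwen2.5-instruct with Xinference. Once the model is running, you can interact with it through the command line, cURL, or Python code: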

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
  }'

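from xinference.client import RESTfulClient

# Connect to the local Xinference endpoint and look up the running model by its UID.
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("qwen2.5-instruct")

# Send a single-turn chat request and print the returned completion.
response = model.chat(
    messages=[
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
)
print(response)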

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "qwen2.5-instruct",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
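Most callers only need the assistant's text from a response like the one above. The following standard-library sketch builds the same request as the cURL example and extracts the reply; the helper names build_chat_request, extract_reply and chat are hypothetical, and chat needs a running server with the model launched.

```python
import json
import urllib.request


def build_chat_request(base_url, model_uid, messages):
    """Build the POST request shown in the cURL example above."""
    body = json.dumps({"model": model_uid, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(completion):
    """Return the assistant's text from a chat.completion response dict."""
    return completion["choices"][0]["message"]["content"]


def chat(base_url, model_uid, messages):
    """Send the request and return only the reply text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(base_url, model_uid, messages)) as resp:
        return extract_reply(json.load(resp))
```

For example, chat("http://127.0.0.1:9997", "qwen2.5-instruct", [{"role": "user", "content": "What is the largest animal?"}]) would return the content string rather than the full JSON object.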

Xinference provides an OpenAI-compatible API, so a model served by Xinference can be used as a local drop-in replacement for OpenAI. For example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
    model="qwen2.5-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)

The following OpenAI APIs are supported:

Chat completions: https://platform.openai.com/docs/api-reference/chat
Completions: https://platform.openai.com/docs/api-reference/completions
Embeddings: https://platform.openai.com/docs/api-reference/embeddings

Xinference also supports calling the Anthropic API through the base URL http://127.0.0.1:9997/anthropic, so you can use Xinference in environments such as Claude Code. For more details, see the anthropic client documentation.

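Managing models#

Beyond launching models, Xinference lets you manage a model's entire lifecycle. As before, you can do so from the command line, cURL, or Python code.

To list all models of a given type that Xinference supports: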

xinference registrations -t LLM

curl http://127.0.0.1:9997/v1/model_registrations/LLM

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_model_registrations(model_type='LLM'))

To list all running models:

xinference list

curl http://127.0.0.1:9997/v1/models

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_models())

When you no longer need a running model, stop it and release its resources as follows:

xinference terminate --model-uid "qwen2.5-instruct"

curl -X DELETE http://127.0.0.1:9997/v1/models/qwen2.5-instruct

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
client.terminate_model(model_uid="qwen2.5-instruct")
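The three management operations in this section map onto three REST routes, all taken from the cURL commands above. As a sketch, they can be collected in one small table so that a script builds the right request for each action; the helper name management_request and the action labels are hypothetical, and no request is sent.

```python
import urllib.request
from urllib.parse import quote

# HTTP method and path for each action, as used in the cURL examples above.
ROUTES = {
    "registrations": ("GET", "/v1/model_registrations/{}"),  # target: model type, e.g. LLM
    "list": ("GET", "/v1/models"),                            # no target needed
    "terminate": ("DELETE", "/v1/models/{}"),                 # target: model UID
}


def management_request(base_url, action, target=""):
    """Build (but do not send) the request for one management action."""
    method, template = ROUTES[action]
    # Percent-encode the target so unusual UIDs cannot alter the path.
    path = template.format(quote(target, safe=""))
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


if __name__ == "__main__":
    req = management_request("http://127.0.0.1:9997", "terminate", "qwen2.5-instruct")
    print(req.get_method(), req.full_url)
```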

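Deploying Xinference in a cluster#

To deploy Xinference in a cluster, start a supervisor node on one machine and start worker nodes on the same machine or on other machines.

First, make sure Xinference is installed on every server by following the installation instructions. Then follow these steps:

Starting the supervisor#

Run the following command on the server to start the supervisor node: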

xinference-supervisor -H "${supervisor_host}"

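Replace ${supervisor_host} with the IP address of the current node.

The web UI is available at http://${supervisor_host}:9997/ui and the API documentation at http://${supervisor_host}:9997/docs.

Starting workers#

Run the following command on each machine where you want to start a Xinference worker: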

xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

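Note

You must replace ${worker_host} with the IP address of the current worker node.

Note

If you need to interact with the cluster from the command line, specify the supervisor's address with the -e or --endpoint option. For example: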

xinference launch -n qwen2.5-instruct -s 0_5 -f pytorch -e "http://${supervisor_host}:9997"
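When several workers are involved, it can help to generate the per-node commands from a list of hosts so that every worker points at the same supervisor endpoint. The sketch below only builds argument lists from the commands shown in this section; the helper names and the example IP addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
def supervisor_argv(supervisor_host):
    """Command to run on the supervisor machine."""
    return ["xinference-supervisor", "-H", supervisor_host]


def worker_argv(supervisor_host, worker_host, port=9997):
    """Command to run on one worker machine; -H must be that worker's own IP."""
    endpoint = f"http://{supervisor_host}:{port}"
    return ["xinference-worker", "-e", endpoint, "-H", worker_host]


def cluster_plan(supervisor_host, worker_hosts):
    """Map each host to the command it should run, supervisor first."""
    plan = [(supervisor_host, supervisor_argv(supervisor_host))]
    plan += [(host, worker_argv(supervisor_host, host)) for host in worker_hosts]
    return plan


if __name__ == "__main__":
    for host, argv in cluster_plan("192.0.2.10", ["192.0.2.11", "192.0.2.12"]):
        print(host, "->", " ".join(argv))
```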

Deploying Xinference with Docker#

Use the following commands to run Xinference in a container:

On a machine with an NVIDIA GPU#

For CUDA 12.4:

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug

For CUDA 12.8:

Added in v1.8.1: the CUDA 12.8 image is experimental; feedback is welcome to help us improve it.

Changed in v1.16.0: the CUDA 12.8 image was removed in v1.16.0.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version>-cu128 xinference-local -H 0.0.0.0 --log-level debug

For CUDA 12.9:

Added in v1.16.0: CUDA 12.9 will become the default once Xinference v2.0.0 is released.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version>-cu129 xinference-local -H 0.0.0.0 --log-level debug

On a CPU-only machine#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Replace <your_version> with the Xinference version, for example v0.10.3; use latest for the most recent version.

For more on using Docker, see the guide on using the Docker image.
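The four docker run commands above differ only in the image tag suffix and in whether --gpus all is passed. A minimal sketch that composes them, assuming the tag pattern shown in this section (no suffix, -cu128, -cu129, -cpu); the helper names image_tag and docker_run_argv are hypothetical, and the function does not check whether a given tag exists for your version.

```python
def image_tag(version, variant=""):
    """Compose the image name; variant is "", "cu128", "cu129" or "cpu"."""
    suffix = f"-{variant}" if variant else ""
    return f"xprobe/xinference:{version}{suffix}"


def docker_run_argv(version, variant="", host_port=9998, model_src="modelscope"):
    """Build the docker run command from this section as an argument list."""
    argv = [
        "docker", "run",
        "-e", f"XINFERENCE_MODEL_SRC={model_src}",
        # Publish container port 9997 on the chosen host port.
        "-p", f"{host_port}:9997",
    ]
    if variant != "cpu":
        # GPU images are started with access to all GPUs, as in the examples.
        argv += ["--gpus", "all"]
    argv += [image_tag(version, variant),
             "xinference-local", "-H", "0.0.0.0", "--log-level", "debug"]
    return argv


if __name__ == "__main__":
    print(" ".join(docker_run_argv("v0.10.3", "cpu")))
```

With the default -p 9998:9997 mapping, the service inside the container listens on port 9997 and is reached from the host on port 9998.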
