Getting Started
If you haven’t already, pull a model before making requests. For example, pull Qwen3:
nexa pull NexaAI/Qwen3-4B-GGUF
Then start the server:
nexa serve
The server is hosted at http://127.0.0.1:18181 by default. Keep the terminal that runs the server open, and make your requests from another terminal tab.
To see a full list of configurable options for the server, run
nexa serve -h
While you can try out the nexa server with any HTTP tool, the easiest way to get started is the run command:
nexa run. The run command also starts a REPL conversation UI just like nexa infer, but fulfills your chat by sending requests to the server hosted by your nexa serve command.
Model Choice
Certain models can only be run on specific platforms. For example, MLX models can only be run on macOS 13+ devices, and OmniNeural can only be run on a Qualcomm laptop with an NPU. Below is a table with example models for each OS for you to try:

| OS | Modality | Recommended Model |
|---|---|---|
| macOS | LLM | NexaAI/gpt-oss-20b-MLX-4bit |
| macOS | VLM | NexaAI/gemma-3n-E4B-it-4bit-MLX |
| macOS | Image Generation | NexaAI/sdxl-turbo |
| macOS | ASR | NexaAI/whisper-large-v3-turbo-MLX |
| macOS | TTS | NexaAI/Kokoro-82M-bf16-MLX |
| Windows x86 | LLM | NexaAI/Qwen3-4B-GGUF |
| Windows x86 | VLM | NexaAI/gemma-3n |
| Windows x86 | Image Generation | NexaAI/Prefect-illustrious-XL-v2.0p |
| Windows Qualcomm ARM64 | LLM | NexaAI/Qwen3-4B-npu |
| Windows Qualcomm ARM64 | VLM | NexaAI/OmniNeural-4B |
| Windows Qualcomm ARM64 | ASR | NexaAI/parakeet-tdt-0.6b-v3-npu |
| Windows AMD NPU | Image Generation | NexaAI/sdxl-turbo-amd-npu |
| Windows Intel NPU | LLM | NexaAI/llama-3.1-8B-intel-npu |
/v1/chat/completions
Creates a model response for a given conversation. Supports LLM (text-only) and VLM (image + text).

Use LLM
Request body
Example Value
Usage Example
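Since the concrete request body is not shown here, below is a minimal Python sketch of a text-only chat call. The OpenAI-style payload shape, the default server address, and the `choices[0].message.content` response path are assumptions to verify against your server; the model id is taken from the table above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default nexa serve address (assumption)

# Text-only chat request; the field names assume an
# OpenAI-compatible /v1/chat/completions schema.
body = {
    "model": "NexaAI/Qwen3-4B-GGUF",
    "messages": [
        {"role": "user", "content": "Hello! Briefly introduce yourself."}
    ],
    "stream": False,
}

req = urllib.request.Request(
    BASE_URL + "/v1/chat/completions",
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Requires a running `nexa serve`; uncomment to send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```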
Use VLM
Request body
Example Value
Usage Example
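For the VLM case, a sketch of a mixed image + text message follows. The `image_url` content-part shape (OpenAI-style) and the base64 data-URL encoding are assumptions; the model id is one of the VLM entries from the table above, and the inline bytes are a stand-in for a real image file.

```python
import base64

# VLM chat request: one text part plus one image part. The
# OpenAI-style "image_url" content shape is an assumption.
def image_part(png_bytes: bytes) -> dict:
    """Inline a local image as a base64 data URL content part."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": "data:image/png;base64," + b64},
    }

body = {
    "model": "NexaAI/gemma-3n-E4B-it-4bit-MLX",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # Placeholder bytes; use open("img.png", "rb").read() for a real file.
                image_part(b"\x89PNG..."),
            ],
        }
    ],
}
# Send to /v1/chat/completions exactly as in the text-only case.
```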
/v1/images/generations
Creates an image based on a given prompt. The example below uses
NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda as the model, which is recommended for most CUDA (NVIDIA GPU) environments. If you are running on Apple Silicon, use an MLX-compatible model (e.g., nexaml/sdxl-turbo-ryzen-ai). Always make sure the model you select matches your hardware capabilities.

Request body
Example Value
Usage Example
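A hedged sketch of an image-generation request follows. The field names (`prompt`, `n`, `size`) follow the common OpenAI images API shape, which is an assumption here; check your server's actual schema.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default nexa serve address (assumption)

# Image generation request; field names assume an OpenAI-style images API.
body = {
    "model": "NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda",
    "prompt": "A watercolor painting of a lighthouse at dawn",
    "n": 1,
    "size": "512x512",
}

# Requires a running `nexa serve`; uncomment to send the request:
# req = urllib.request.Request(
#     BASE_URL + "/v1/images/generations",
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))  # image data, typically base64-encoded
```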
/v1/embeddings
Creates an embedding for the given input. Use this endpoint to convert text or document chunks into vectors for indexing in a retrieval system. Make sure you select a model that supports embeddings (e.g.,
djuna/jina-embeddings-*); calling this API with a non-embedding model will result in an error.

Minimal request body
Example Value
Usage Example
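A minimal embeddings request might look like the sketch below. The `input`-as-list shape follows the common OpenAI embeddings convention (an assumption); the model id is left as a placeholder for a concrete member of the djuna/jina-embeddings-* family.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default nexa serve address (assumption)

body = {
    # Placeholder: substitute a concrete embedding-capable model id
    # (e.g. one from the djuna/jina-embeddings-* family).
    "model": "djuna/jina-embeddings-*",
    "input": ["first chunk of text", "second chunk of text"],
}

# Requires a running `nexa serve`; uncomment to send the request:
# req = urllib.request.Request(
#     BASE_URL + "/v1/embeddings",
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     vectors = [item["embedding"] for item in json.load(resp)["data"]]
```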
/v1/reranking
Rerank documents based on their relevance to a query. Returns a list of relevance scores aligned with the input order (higher = more relevant). Use this endpoint after a coarse retrieval step (e.g., embeddings Top-K) to improve final ranking quality. Ensure the selected model supports reranking; calling this API with a non-reranking model will result in an error.
Minimal request body
Example Value
Usage Example
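A hedged sketch of a reranking request follows. The `query`/`documents` field names follow common reranker API conventions and are assumptions here, as is the placeholder model id; the scores in the response should align with the order of `documents`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:18181"  # default nexa serve address (assumption)

body = {
    "model": "reranker-model-id",  # placeholder: use a rerank-capable model
    "query": "how do I start the nexa server?",
    "documents": [
        "Run nexa serve to start the HTTP server.",
        "Pull models with nexa pull before serving.",
        "The weather is nice today.",
    ],
}

# Requires a running `nexa serve`; uncomment to send the request:
# req = urllib.request.Request(
#     BASE_URL + "/v1/reranking",
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))  # one relevance score per input document
```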