
Getting Started

If you haven’t already, pull a model before making requests. For example, pull Qwen3:

```bash
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

To use the API, first open a terminal from the project root. Then start the Nexa server:

```bash
nexa serve
```

The server runs on http://127.0.0.1:18181 by default.
Keep the terminal that runs the server open, and make your requests from another terminal tab.
To see the full list of configurable options for the server, run nexa serve -h.
While you can try the Nexa server with any HTTP client, the easiest way to get started quickly is to run:

```bash
nexa run NexaAI/Qwen3-0.6B-GGUF
```

You may replace the model name with any model that has been pulled with nexa pull. The run command also starts a REPL conversation UI just like nexa infer, but it fulfills your chat by sending requests to the server hosted by your nexa serve command.
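To recap the end-to-end workflow with the Qwen3 example from above, using two terminals:

```bash
# Terminal 1: start the server and leave it running
nexa serve

# Terminal 2: open a REPL chat that sends requests to the server above
nexa run NexaAI/Qwen3-0.6B-GGUF
```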

Model Choice

Certain models can only run on specific platforms. For example, MLX models require macOS 13 or later, and OmniNeural requires a Qualcomm laptop with an NPU. The table below lists example models for each platform for you to try:
| OS | Modality | Recommended Model |
| --- | --- | --- |
| macOS | LLM | NexaAI/gpt-oss-20b-MLX-4bit |
| macOS | VLM | NexaAI/gemma-3n-E4B-it-4bit-MLX |
| macOS | Image Generation | NexaAI/sdxl-turbo |
| macOS | ASR | NexaAI/whisper-large-v3-turbo-MLX |
| macOS | TTS | NexaAI/Kokoro-82M-bf16-MLX |
| Windows x86 | LLM | NexaAI/Qwen3-4B-GGUF |
| Windows x86 | VLM | NexaAI/gemma-3n |
| Windows x86 | Image Generation | NexaAI/Prefect-illustrious-XL-v2.0p |
| Windows Qualcomm ARM64 | LLM | NexaAI/Qwen3-4B-npu |
| Windows Qualcomm ARM64 | VLM | NexaAI/OmniNeural-4B |
| Windows Qualcomm ARM64 | ASR | NexaAI/parakeet-tdt-0.6b-v3-npu |
| Windows AMD NPU | Image Generation | NexaAI/sdxl-turbo-amd-npu |
| Windows Intel NPU | LLM | NexaAI/llama-3.1-8B-intel-npu |
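To try one of these, pull it first just as with the Qwen3 example earlier; for instance, the Windows x86 LLM entry:

```bash
nexa pull NexaAI/Qwen3-4B-GGUF
```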

/v1/chat/completions

Creates a model response for a given conversation. Supports LLM (text-only) and VLM (image + text) models.

Use LLM

Request body

Example Value
```json
{
  "model": "NexaAI/Qwen3-0.6B-GGUF",
  "messages": [
    {"role": "user", "content": "Hello! Briefly introduce yourself."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
```
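The request body above sets stream to false; to receive tokens incrementally instead, set it to true. As with other OpenAI-style servers, the streamed response typically arrives as a sequence of chunks, though the exact framing may vary, so treat this as a sketch:

```bash
# -N disables curl's output buffering so chunks print as they arrive.
curl -N -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64, \"stream\": true}"
```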

Use VLM

Request body

Example Value
```json
{
  "model": "NexaAI/qwen3vl-GGUF",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ]
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/qwen3vl-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What is the main color of the picture?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"</path/to/image>\"}}]}], \"stream\": false}"
```
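Escaping a multimodal payload inline gets unwieldy. One option is to write the body to a file and pass it to curl with -d @; this sketch simply reuses the request body shown above (replace </path/to/image> with the path to a local image):

```bash
# Write the request body to a file to avoid shell-escaping issues.
cat > vlm_request.json <<'EOF'
{
  "model": "NexaAI/qwen3vl-GGUF",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ],
  "stream": false
}
EOF

curl -X POST http://127.0.0.1:18181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @vlm_request.json
```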

/v1/images/generations

Creates an image from a given prompt.
The example below uses NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda as the model, which is recommended for most CUDA (NVIDIA GPU) environments. If you are running on Apple Silicon, use an MLX-compatible model instead (e.g., NexaAI/sdxl-turbo from the table above). Always make sure the model you select matches your hardware capabilities.

Request body

Example Value
```json
{
  "model": "NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda",
  "prompt": "A white cat with blue eyes",
  "n": 1,
  "size": "512x512",
  "response_format": "url"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/images/generations -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda\",\"prompt\":\"A white cat with blue eyes\",\"n\":1,\"size\":\"512x512\",\"response_format\":\"url\"}"
```
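With response_format set to "url", the generated image is referenced by a URL in the response. Assuming the server follows the OpenAI-style images schema (a data array whose entries carry a url field; verify against your server's actual response), you could extract it with jq:

```bash
# Assumption: response looks like {"data": [{"url": "..."}]} (OpenAI-style).
curl -s -X POST http://127.0.0.1:18181/v1/images/generations -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda\",\"prompt\":\"A white cat with blue eyes\",\"n\":1,\"size\":\"512x512\",\"response_format\":\"url\"}" | jq -r '.data[0].url'
```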

/v1/embeddings

Creates an embedding for the given input. Use this endpoint to convert text or document chunks into vectors for indexing in a retrieval system.
Make sure you select a model that supports embeddings (e.g., djuna/jina-embeddings-*); calling this API with a non-embedding model will result in an error.

Minimal request body

Example Value
```json
{
  "model": "djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF",
  "input": "Hello, world!"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\":\"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\",\"input\":\"Hello, world!\"}"
```
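If the server follows the usual OpenAI embeddings convention, input may also accept an array of strings, which is convenient for embedding several document chunks in one call. The example above only shows a single string, so treat the batched form as an assumption:

```bash
# Assumption: array-valued "input" is accepted, as in the OpenAI embeddings API.
curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\":\"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\",\"input\":[\"First chunk of a document.\",\"Second chunk of a document.\"]}"
```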

/v1/reranking

Reranks documents by their relevance to a query. Returns a list of relevance scores aligned with the input order (higher = more relevant).
Use this endpoint after a coarse retrieval step (e.g., an embeddings Top-K) to improve final ranking quality. Ensure the selected model supports reranking; calling this API with a non-reranking model will result in an error.

Minimal request body

Example Value
```json
{
  "model": "NexaAI/jina-v2-rerank-npu",
  "query": "What is machine learning?",
  "documents": [
    "Machine learning is a subset of artificial intelligence.",
    "Machine learning algorithms learn patterns from data.",
    "The weather is sunny today.",
    "Deep learning is a type of machine learning."
  ],
  "batch_size": 4,
  "normalize": true,
  "normalize_method": "softmax"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/jina-v2-rerank-npu\",\"query\":\"What is machine learning?\",\"documents\":[\"Machine learning is a subset of artificial intelligence.\",\"Machine learning algorithms learn patterns from data.\",\"The weather is sunny today.\",\"Deep learning is a type of machine learning.\"],\"batch_size\":4,\"normalize\":true,\"normalize_method\":\"softmax\"}"
```
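Since the scores come back aligned with the input document order, you can pair them with the documents client-side. The exact response field is not shown above; assuming the scores are exposed under a top-level scores array (a guess, verify against your server's response), a jq one-liner would be:

```bash
# Assumption: response looks like {"scores": [0.93, 0.88, 0.02, 0.75]}.
curl -s -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/jina-v2-rerank-npu\",\"query\":\"What is machine learning?\",\"documents\":[\"Machine learning is a subset of artificial intelligence.\",\"Machine learning algorithms learn patterns from data.\",\"The weather is sunny today.\",\"Deep learning is a type of machine learning.\"]}" | jq '.scores'
```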
