
Getting Started

If you haven’t already, pull a model before making requests. For example, pull Qwen3:

```bash
nexa pull NexaAI/Qwen3-0.6B-GGUF
```

To use the API, first open a terminal from the project root. Then start the Nexa server:

```bash
nexa serve
```

The server runs on http://127.0.0.1:18181 by default.
Keep the terminal that runs the server open, and make your requests from another terminal tab.
To see the full list of configurable options for the server, run nexa serve -h.
While you can try the Nexa server with any HTTP client, the easiest way to get started quickly is to run:

```bash
nexa run NexaAI/Qwen3-0.6B-GGUF
```

You may replace the model name with any model that has been pulled with nexa pull. The run command also starts a REPL conversation UI just like nexa infer, but it fulfills your chat by sending requests to the server hosted by your nexa serve command.
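To recap the end-to-end workflow with the Qwen3 example from above, using two terminals:

```bash
# Terminal 1: start the server and leave it running
nexa serve

# Terminal 2: open a REPL chat that sends requests to the server above
nexa run NexaAI/Qwen3-0.6B-GGUF
```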

Model Choice

Certain models can only run on specific platforms. For example, MLX models require macOS 13 or later, and OmniNeural requires a Qualcomm laptop with an NPU. The table below lists example models for each platform for you to try:
| OS | Modality | Recommended Model |
| --- | --- | --- |
| macOS | LLM | NexaAI/gpt-oss-20b-MLX-4bit |
| macOS | VLM | NexaAI/gemma-3n-E4B-it-4bit-MLX |
| macOS | Image Generation | NexaAI/sdxl-turbo |
| macOS | ASR | NexaAI/whisper-large-v3-turbo-MLX |
| macOS | TTS | NexaAI/Kokoro-82M-bf16-MLX |
| Windows x86 | LLM | NexaAI/Qwen3-4B-GGUF |
| Windows x86 | VLM | NexaAI/gemma-3n |
| Windows x86 | Image Generation | NexaAI/Prefect-illustrious-XL-v2.0p |
| Windows Qualcomm ARM64 | LLM | NexaAI/Qwen3-4B-npu |
| Windows Qualcomm ARM64 | VLM | NexaAI/OmniNeural-4B |
| Windows Qualcomm ARM64 | ASR | NexaAI/parakeet-tdt-0.6b-v3-npu |
| Windows AMD NPU | Image Generation | NexaAI/sdxl-turbo-amd-npu |
| Windows Intel NPU | LLM | NexaAI/llama-3.1-8B-intel-npu |
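To try one of these, pull it first just as with the Qwen3 example earlier; for instance, the Windows x86 LLM entry:

```bash
nexa pull NexaAI/Qwen3-4B-GGUF
```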

/v1/chat/completions

Creates a model response for a given conversation. Supports LLM (text-only) and VLM (image + text) models.

Use LLM

Request body

Example Value
```json
{
  "model": "NexaAI/Qwen3-0.6B-GGUF",
  "messages": [
    {"role": "user", "content": "Hello! Briefly introduce yourself."}
  ],
  "max_tokens": 256,
  "temperature": 0.7,
  "stream": false
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64}"
```
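The request body above sets stream to false; to receive tokens incrementally instead, set it to true. As with other OpenAI-style servers, the streamed response typically arrives as a sequence of chunks, though the exact framing may vary, so treat this as a sketch:

```bash
# -N disables curl's output buffering so chunks print as they arrive.
curl -N -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/Qwen3-0.6B-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}], \"max_tokens\": 64, \"stream\": true}"
```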

Use VLM

Request body

Example Value
```json
{
  "model": "NexaAI/qwen3vl-GGUF",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ]
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"NexaAI/qwen3vl-GGUF\", \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"What is the main color of the picture?\"}, {\"type\": \"image_url\", \"image_url\": {\"url\": \"</path/to/image>\"}}]}], \"stream\": false}"
```
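Escaping a multimodal payload inline gets unwieldy. One option is to write the body to a file and pass it to curl with -d @; this sketch simply reuses the request body shown above (replace </path/to/image> with the path to a local image):

```bash
# Write the request body to a file to avoid shell-escaping issues.
cat > vlm_request.json <<'EOF'
{
  "model": "NexaAI/qwen3vl-GGUF",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image succinctly."},
        {"type": "image_url", "image_url": {"url": "</path/to/image>"}}
      ]
    }
  ],
  "stream": false
}
EOF

curl -X POST http://127.0.0.1:18181/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @vlm_request.json
```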

/v1/images/generations

Creates an image from a given prompt.
The example below uses NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda as the model, which is recommended for most CUDA (NVIDIA GPU) environments. If you are running on Apple Silicon, use an MLX-compatible model instead (e.g., NexaAI/sdxl-turbo from the table above). Always make sure the model you select matches your hardware capabilities.

Request body

Example Value
```json
{
  "model": "NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda",
  "prompt": "A white cat with blue eyes",
  "n": 1,
  "size": "512x512",
  "response_format": "url"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/images/generations -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda\",\"prompt\":\"A white cat with blue eyes\",\"n\":1,\"size\":\"512x512\",\"response_format\":\"url\"}"
```
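With response_format set to "url", the generated image is referenced by a URL in the response. Assuming the server follows the OpenAI-style images schema (a data array whose entries carry a url field; verify against your server's actual response), you could extract it with jq:

```bash
# Assumption: response looks like {"data": [{"url": "..."}]} (OpenAI-style).
curl -s -X POST http://127.0.0.1:18181/v1/images/generations -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/Prefect-illustrious-XL-v2.0p-fp16-cuda\",\"prompt\":\"A white cat with blue eyes\",\"n\":1,\"size\":\"512x512\",\"response_format\":\"url\"}" | jq -r '.data[0].url'
```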

/v1/embeddings

Creates an embedding for the given input. Use this endpoint to convert text or document chunks into vectors for indexing in a retrieval system.
Make sure you select a model that supports embeddings (e.g., djuna/jina-embeddings-*); calling this API with a non-embedding model will result in an error.

Minimal request body

Example Value
```json
{
  "model": "djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF",
  "input": "Hello, world!"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\":\"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\",\"input\":\"Hello, world!\"}"
```
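If the server follows the usual OpenAI embeddings convention, input may also accept an array of strings, which is convenient for embedding several document chunks in one call. The example above only shows a single string, so treat the batched form as an assumption:

```bash
# Assumption: array-valued "input" is accepted, as in the OpenAI embeddings API.
curl -X POST http://127.0.0.1:18181/v1/embeddings -H "Content-Type: application/json" -d "{\"model\":\"djuna/jina-embeddings-v2-small-en-Q5_K_M-GGUF\",\"input\":[\"First chunk of a document.\",\"Second chunk of a document.\"]}"
```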

/v1/reranking

Reranks documents by their relevance to a query. Returns a list of relevance scores aligned with the input order (higher = more relevant).
Use this endpoint after a coarse retrieval step (e.g., an embeddings Top-K) to improve final ranking quality. Ensure the selected model supports reranking; calling this API with a non-reranking model will result in an error.

Minimal request body

Example Value
```json
{
  "model": "NexaAI/jina-v2-rerank-npu",
  "query": "What is machine learning?",
  "documents": [
    "Machine learning is a subset of artificial intelligence.",
    "Machine learning algorithms learn patterns from data.",
    "The weather is sunny today.",
    "Deep learning is a type of machine learning."
  ],
  "batch_size": 4,
  "normalize": true,
  "normalize_method": "softmax"
}
```

Usage Example

```bash
curl -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/jina-v2-rerank-npu\",\"query\":\"What is machine learning?\",\"documents\":[\"Machine learning is a subset of artificial intelligence.\",\"Machine learning algorithms learn patterns from data.\",\"The weather is sunny today.\",\"Deep learning is a type of machine learning.\"],\"batch_size\":4,\"normalize\":true,\"normalize_method\":\"softmax\"}"
```
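Since the scores come back aligned with the input document order, you can pair them with the documents client-side. The exact response field is not shown above; assuming the scores are exposed under a top-level scores array (a guess, verify against your server's response), a jq one-liner would be:

```bash
# Assumption: response looks like {"scores": [0.93, 0.88, 0.02, 0.75]}.
curl -s -X POST http://127.0.0.1:18181/v1/reranking -H "Content-Type: application/json" -d "{\"model\":\"NexaAI/jina-v2-rerank-npu\",\"query\":\"What is machine learning?\",\"documents\":[\"Machine learning is a subset of artificial intelligence.\",\"Machine learning algorithms learn patterns from data.\",\"The weather is sunny today.\",\"Deep learning is a type of machine learning.\"]}" | jq '.scores'
```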
