Run nexa commands from the nexa executable directory.
nexa pull
Download a model and store it locally. After entering the pull command, you will be guided through a setup process to choose the model type, main model file, tokenizer (optional), and extra files (optional).
General Behavior
After running nexa pull <model-name>, the CLI will prompt:
- Quant version selection: If the model supports multiple quantized versions, you will see a menu asking you to select the quant version you prefer.
- Download begins: After selection, the model files start downloading automatically.
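As an illustration of the pull flow described above, the sketch below pulls the model that this page uses in its other examples. The try_nexa wrapper is a hypothetical helper, not part of Nexa: it runs nexa when the executable is on the PATH and otherwise only prints the command, so the snippet is safe to paste on a machine without Nexa installed.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

# Model name taken from the examples elsewhere on this page.
MODEL="NexaAI/Qwen3-0.6B-GGUF"

# Starts the interactive pull; answer the quant version prompt if one appears.
try_nexa pull "$MODEL"
```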
Supported model types: LLM, VLM, Function Call, Omni Model, ASR, TTS, Embedder, and Reranker.
nexa list
Display all downloaded models in a table with their names and sizes.
nexa remove
Remove a specific local model by name. For example, remove the locally downloaded model NexaAI/Qwen3-0.6B-GGUF from the cache directory. This frees disk space and makes the model unavailable for inference until it is downloaded again.
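The removal described above can be sketched as follows, with nexa list before and after to confirm the effect. The try_nexa wrapper is a hypothetical helper, not part of Nexa, and it is assumed here that the model name is passed as a positional argument, as with nexa pull.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="NexaAI/Qwen3-0.6B-GGUF"

try_nexa list              # show what is cached before removal
try_nexa remove "$MODEL"   # delete this one model from the cache
try_nexa list              # the removed model should no longer be listed
```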
nexa clean
Delete all locally cached models.
nexa infer
Run inference with a specified model. The model must be downloaded and cached locally.
Helper menu
The helper menu lists the options available for nexa infer.
LLM
Launch an interactive chat session with the language model.
Use the --think option to control whether the model outputs its internal reasoning process.
- --think=false: The model responds directly without showing reasoning.
- --think=true: The model displays its reasoning steps before the final response.
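A minimal sketch of the --think option, assuming the model name is passed as a positional argument after nexa infer (mirroring nexa pull <model-name>) and that the chosen model supports reasoning output. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

# Model name taken from the examples elsewhere on this page.
MODEL="NexaAI/Qwen3-0.6B-GGUF"

# Chat without the reasoning trace; use --think=true to display it instead.
try_nexa infer "$MODEL" --think=false
```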
VLM
Chat in text-only mode, or have the model respond to an image file (interactive image input).
If you’d like the model to respond to an image, provide the absolute path to the image at the end of your message.
Example prompt:
Describe this picture </path/to/image.png>
Omni Models
Chat in text-only mode, or have the model respond to an audio file (interactive audio output).
If you’d like the model to respond to an audio file, provide the absolute path to the audio file at the end of your message.
Example prompt:
Convert this audio into text </path/to/audio.mp3>
ASR
Currently, ASR is only supported on macOS using the mlx runtime.
- -m asr: Sets the model type to ASR.
- --input: Specifies the input audio file.
- --language: Sets the language code (e.g., en for English, zh for Chinese).
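The three options above might be combined as in the sketch below. The model name and audio path are placeholders, and passing the model name as a positional argument is an assumption carried over from nexa pull <model-name>. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="your-asr-model"       # placeholder: an ASR model you have pulled
AUDIO="/path/to/audio.wav"   # placeholder: the audio file to transcribe

# Transcribe English speech from the input file.
try_nexa infer "$MODEL" -m asr --input "$AUDIO" --language en
```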
TTS
Currently, TTS is only supported on macOS using the mlx runtime.
- -m TTS: Sets the model type to TTS.
- --voice-identifier: Specifies the speaker’s voice. When no --voice-identifier is provided, NexaCLI returns a full list of supported voices in the error message, which is useful for discovering the available voice options.
- -p: The text prompt to synthesize.
- -o: Output file for the generated .wav audio.
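A sketch of how the TTS options above might fit together. The model name and voice identifier are placeholders (run once without --voice-identifier to see the supported voices in the error message), and the positional model argument is an assumption. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="your-tts-model"   # placeholder: a TTS model you have pulled
VOICE="your-voice-id"    # placeholder: one of the voices NexaCLI reports

# Synthesize the prompt text into output.wav.
try_nexa infer "$MODEL" -m TTS --voice-identifier "$VOICE" -p "Hello" -o output.wav
```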
Embedder
Generate embeddings for multiple pieces of text using an embedding model.
- -m embedder: Sets the model type to Embedder.
- --prompt: Provide one or more pieces of text to embed.
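A sketch of an embedding call using the options above. The model name is a placeholder, and repeating --prompt once per text is an assumption about how multiple inputs are supplied; check the helper menu for the exact form. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="your-embedding-model"   # placeholder: an embedding model you have pulled

# Assumption: one --prompt flag per text to embed.
try_nexa infer "$MODEL" -m embedder --prompt "hello" --prompt "world"
```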
Reranker
Use a reranker model to score and sort documents based on relevance to a query.
- -m reranker: Sets the model type to Reranker.
- --query: The main query string used to evaluate document relevance.
- --document: One or more documents to score against the query.
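A sketch of a reranking call using the options above. The model name is a placeholder, and repeating --document once per candidate is an assumption about how multiple documents are supplied. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="your-reranker-model"   # placeholder: a reranker model you have pulled

# Assumption: one --document flag per candidate document.
try_nexa infer "$MODEL" -m reranker --query "capital" --document "Paris" --document "bread"
```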
nexa serve
Launch the Nexa inference server, which exposes a REST API.
Helper menu
The helper menu lists the options available for nexa serve.
Start the server
For example, start a local inference server bound to 127.0.0.1:8080. The server supports OpenAI-compatible APIs, and --keepalive 600 keeps models in memory for 600 seconds (10 minutes) between requests.
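The server start described above might look like the sketch below. The --keepalive flag and the address come from this page; the --host flag name is an assumption, so confirm it against the helper menu. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

ADDR="127.0.0.1:8080"   # bind address from the example above
KEEPALIVE=600           # seconds to keep models in memory between requests

# Assumption: the bind address is passed with --host.
# With nexa installed, this call blocks while the server is running.
try_nexa serve --host "$ADDR" --keepalive "$KEEPALIVE"
```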
nexa run
Connect to a running Nexa server (via the OpenAI-compatible API) and start a chat interface. Start the server first.
Helper menu
The helper menu lists the options available for nexa run.
Run model
For example, launch an interactive streaming chat session with the NexaAI/Qwen3-0.6B-GGUF model. The model generates and displays output incrementally as tokens are produced.
--disable-stream | -s: Disable streaming and return the entire response as JSON once generation finishes.
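A sketch of both modes, assuming a server is already running and that the model name is passed as a positional argument after nexa run. The try_nexa wrapper is a hypothetical helper, not part of Nexa.

```shell
# Hypothetical helper (not part of Nexa): run nexa if it is installed,
# otherwise print the command that would have been run.
try_nexa() {
  if command -v nexa >/dev/null 2>&1; then
    nexa "$@"
  else
    echo "would run: nexa $*"
  fi
}

MODEL="NexaAI/Qwen3-0.6B-GGUF"

# Streaming chat: output appears incrementally as tokens are produced.
try_nexa run "$MODEL"

# Non-streaming chat: the whole response is returned at once.
try_nexa run "$MODEL" --disable-stream
```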