LLM Usage

Large Language Models for text generation and chat applications.

Streaming Conversation

We support CPU/GPU inference for GGUF format models. You can pick any GGUF model from the community and run it with the cpu_gpu plugin.
LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "", // For GGUF CPU/GPU models, leave model_name empty.
            model_path = "<your-model-path>",
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 0  // 0 for CPU, > 0 for GPU
            ),
            plugin_id = "cpu_gpu",
            device_id = null  // null for CPU, "GPUOpenCL" for GPU
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

val chatList = arrayListOf(ChatMessage("user", "What is AI?"))

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    val genConfig = GenerationConfig(maxTokens = 2048)
    
    llmWrapper.generateStreamFlow(template.formattedText, genConfig).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
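
The stream emits individual tokens, so for a multi-turn chat you usually accumulate them into the full reply and append it back to the history before the next user turn. A minimal sketch, assuming the llmWrapper and chatList from the snippet above; the "assistant" role string is an assumption:
// Accumulate streamed tokens and keep the conversation history (sketch).
val reply = StringBuilder()

llmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    llmWrapper.generateStreamFlow(template.formattedText, GenerationConfig(maxTokens = 2048))
        .collect { result ->
            when (result) {
                is LlmStreamResult.Token -> reply.append(result.text)  // partial output
                is LlmStreamResult.Completed -> {
                    // Keep full context for the next turn ("assistant" role name is an assumption)
                    chatList.add(ChatMessage("assistant", reply.toString()))
                }
                is LlmStreamResult.Error -> println("Error: ${result.throwable}")
            }
        }
}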

CPU/GPU Configuration

Control whether your model runs on CPU or GPU using a combination of device_id and nGpuLayers; a helper that derives both settings from a single flag is sketched after this list, followed by full GPU and CPU examples.

GPU execution requirements:
  • device_id must be set to "GPUOpenCL"
  • nGpuLayers must be greater than 0 (typically 999 to offload all layers)

CPU execution:
  • device_id is null (the default)
  • OR nGpuLayers is 0
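
Both values can be derived from a single flag. A minimal sketch, assuming the LlmCreateInput and ModelConfig shapes shown on this page; buildLlmInput is a hypothetical helper name:
// Derive device_id and nGpuLayers from one flag (buildLlmInput is a hypothetical helper).
fun buildLlmInput(modelPath: String, useGpu: Boolean): LlmCreateInput {
    return LlmCreateInput(
        model_name = "",                               // empty for GGUF CPU/GPU models
        model_path = modelPath,
        config = ModelConfig(
            nCtx = 4096,
            nGpuLayers = if (useGpu) 999 else 0        // 999 offloads all layers, 0 stays on CPU
        ),
        plugin_id = "cpu_gpu",
        device_id = if (useGpu) "GPUOpenCL" else null  // "GPUOpenCL" selects the GPU device
    )
}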

Example: Running on GPU

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "",
            model_path = "<your-model-path>",
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 999  // Offload all layers to GPU
            ),
            plugin_id = "cpu_gpu",
            device_id = "GPUOpenCL"  // Use GPU device
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

Example: Running on CPU (Default)

LlmWrapper.builder()
    .llmCreateInput(
        LlmCreateInput(
            model_name = "",
            model_path = "<your-model-path>",
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 0  // All on CPU
            ),
            plugin_id = "cpu_gpu",
            device_id = null  // Default to CPU
        )
    )
    .build()
    .onSuccess { llmWrapper = it }
    .onFailure { error -> println("Error: ${error.message}") }

Multimodal Usage

Vision-Language Models for image understanding and multimodal applications.

Streaming Conversation

We support CPU/GPU inference for GGUF format models.
VlmWrapper.builder()
    .vlmCreateInput(
        VlmCreateInput(
            model_name = "",  // For GGUF on CPU/GPU, leave empty (no model name needed)
            model_path = "<your-model-path>",
            mmproj_path = "<your-mmproj-path>",  // vision projection weights
            config = ModelConfig(
                nCtx = 4096,
                nGpuLayers = 0  // 0 for CPU, > 0 for GPU
            ),
            plugin_id = "cpu_gpu",
            device_id = null  // null for CPU, "GPUOpenCL" for GPU
        )
    )
    .build()
    .onSuccess { vlmWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Use the loaded VLM with image and text
val contents = listOf(
    VlmContent("image", <your-image-path>),
    VlmContent("text", <your-text>)
)

val chatList = arrayListOf(VlmChatMessage("user", contents))

vlmWrapper.applyChatTemplate(chatList.toTypedArray(), null, false).onSuccess { template ->
    // Create base GenerationConfig with maxTokens
    val baseConfig = GenerationConfig(maxTokens = 2048)
    
    // Inject media paths from chatList into config
    val configWithMedia = vlmWrapper.injectMediaPathsToConfig(
        chatList.toTypedArray(),
        baseConfig
    )
    
    vlmWrapper.generateStreamFlow(template.formattedText, configWithMedia).collect { result ->
        when (result) {
            is LlmStreamResult.Token -> println(result.text)
            is LlmStreamResult.Completed -> println("Done!")
            is LlmStreamResult.Error -> println("Error: ${result.throwable}")
        }
    }
}
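
For convenience, the whole flow can be wrapped in a single suspending call that returns the complete answer for one image and prompt. A minimal sketch, assuming the vlmWrapper and types from the snippet above; describeImage is a hypothetical helper name:
// One-shot image description built from the calls above (describeImage is hypothetical).
suspend fun describeImage(imagePath: String, prompt: String): String {
    val answer = StringBuilder()
    val messages = arrayOf(
        VlmChatMessage("user", listOf(VlmContent("image", imagePath), VlmContent("text", prompt)))
    )
    vlmWrapper.applyChatTemplate(messages, null, false).onSuccess { template ->
        val config = vlmWrapper.injectMediaPathsToConfig(messages, GenerationConfig(maxTokens = 2048))
        vlmWrapper.generateStreamFlow(template.formattedText, config).collect { result ->
            if (result is LlmStreamResult.Token) answer.append(result.text)  // keep only tokens
        }
    }
    return answer.toString()
}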

ASR Usage

Automatic Speech Recognition for audio transcription.

Basic Usage

We support CPU inference for whisper.cpp models.
// Load ASR model for whisper.cpp inference
AsrWrapper.builder()
    .asrCreateInput(
        AsrCreateInput(
            model_name = "",  // Empty for whisper.cpp
            model_path = "<your-model-path>",  // e.g., "ggml-base-q8_0.bin"
            config = ModelConfig(
                nCtx = 4096  // Context size (use nCtx instead of max_tokens)
            ),
            plugin_id = "whisper_cpp"  // Use whisper.cpp backend
        )
    )
    .build()
    .onSuccess { asrWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Transcribe audio file
asrWrapper.transcribe(
    AsrTranscribeInput(
        audioPath = "<your-audio-path>",  // Path to .wav file (16kHz recommended)
        language = "en",                // Language code: "en", "zh", "es", etc.
        timestamps = null               // Optional timestamp format
    )
).onSuccess { result ->
    println("Transcription: ${result.result.transcript}")
}
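
Transcription can fail (for example, a missing file or an unsupported audio format), so it is worth handling both outcomes. A minimal sketch, assuming the asrWrapper and input type from the snippet above; transcribeOrNull is a hypothetical helper name:
// Return the transcript, or null on failure (transcribeOrNull is hypothetical).
fun transcribeOrNull(audioPath: String, language: String = "en"): String? {
    var transcript: String? = null
    asrWrapper.transcribe(
        AsrTranscribeInput(
            audioPath = audioPath,  // 16 kHz mono .wav recommended
            language = language,
            timestamps = null
        )
    ).onSuccess { result ->
        transcript = result.result.transcript
    }.onFailure { error ->
        println("Transcription failed: ${error.message}")
    }
    return transcript
}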

TTS Usage

Text-to-Speech synthesis for converting text into natural-sounding speech.

Basic Usage

We support CPU inference for TTS models in GGUF format.
// Load TTS model for CPU inference
TtsWrapper.builder()
    .ttsCreateInput(
        TtsCreateInput(
            model_name = "",  // Empty for CPU/GPU models
            model_path = "<your-model-path>",  // Path to TTS model (e.g., a Kokoro GGUF model)
            config = ModelConfig(
                nCtx = 4096  // Context size
            ),
            plugin_id = "tts_cpp"  // Use TTS backend
        )
    )
    .build()
    .onSuccess { ttsWrapper = it }
    .onFailure { error -> 
        println("Error: ${error.message}")
    }

// Synthesize speech from text
ttsWrapper.synthesize(
    TtsSynthesizeInput(
        textUtf8 = "Hello, this is a text to speech demo using Nexa SDK.",
        outputPath = "<your-output-audio-path>"  // Path where audio will be saved (e.g., "/path/to/output.wav")
    )
).onSuccess { result ->
    println("Speech synthesized successfully!")
    println("Audio saved to: ${result.outputPath}")
}.onFailure { error ->
    println("Error during synthesis: ${error.message}")
}
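
The synthesized output is a regular audio file on disk, so on Android it can be played back with the platform MediaPlayer. A minimal sketch, assuming an Android app and a local .wav output path:
// Play the synthesized file with Android's MediaPlayer (assumes an Android runtime).
import android.media.MediaPlayer

fun playSynthesizedAudio(outputPath: String) {
    val player = MediaPlayer()
    player.setDataSource(outputPath)                 // the path returned in result.outputPath
    player.setOnCompletionListener { it.release() }  // free the player when playback ends
    player.prepare()                                 // synchronous prepare is fine for local files
    player.start()
}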

Need Help?

Join our community to get support, share your projects, and connect with other developers.