Make your own Ollama Client

Rather than use Rig.rs or Ollama-rs, let’s consider the bare bones of the frameworks we take for granted.

We make a POST request (via reqwest) to the Ollama API and send a body specifying the model and the messages (role & content).

Next – iterate over the stream and display the chunks until there are no more!

We use a struct (OllamaResponse) to deserialize each response chunk into the shape we expect.

At the end we print the elapsed time.

Dependencies

[dependencies]
reqwest = { version = "0.11", features = ["json", "stream"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1", features = ["full"] }
futures-util = "0.3"

Code

use reqwest::Client;
use serde::Deserialize;
use futures_util::StreamExt;
use std::time::Instant;

#[derive(Debug, Deserialize)]
struct OllamaResponse {
    message: Message,
    done: bool,
}

#[derive(Debug, Deserialize)]
struct Message {
    content: String,
}


#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start timer
    let start = Instant::now();
    println!("Program started at: {start:?}");

    let client = Client::new();

    let body = serde_json::json!({
        "model": "llama3.2",
        "messages": [
            {
                "role": "user",
                "content": "Hello, Ollama! Why are bananas yellow?"
            }
        ]
    });

    let res = client
        .post("http://localhost:11434/api/chat")
        .json(&body)
        .send()
        .await?;

    if !res.status().is_success() {
        eprintln!("Request failed: {}", res.status());
        return Ok(());
    }

    let mut stream = res.bytes_stream();

    let mut full_response = String::new();

    while let Some(item) = stream.next().await {
        let chunk = item?;
        for line in std::str::from_utf8(&chunk)?.lines() {
            if line.trim().is_empty() {
                continue;
            }

            let parsed: OllamaResponse = serde_json::from_str(line)?;
            print!("{}", parsed.message.content);
            full_response.push_str(&parsed.message.content);

            if parsed.done {
                // End timer
                let duration = start.elapsed();
                println!("\n--- Done ---");
                println!("Full response:\n{full_response}");
                println!("Elapsed time: {duration:.2?}");
                return Ok(());
            }
        }
    }

    Ok(())
}

Building a Streaming Ollama Client in Rust: Complete Code Breakdown

This tutorial breaks down a complete Rust application that communicates with Ollama’s local AI server using streaming responses. The code demonstrates async programming, HTTP requests, JSON parsing, and real-time streaming.

Section 1: Dependencies and Imports

use reqwest::Client;
use serde::Deserialize;
use futures_util::StreamExt;
use std::time::Instant;

What’s happening here:

  • reqwest::Client – HTTP client library for making web requests
  • serde::Deserialize – Automatic JSON deserialization trait
  • futures_util::StreamExt – Provides stream processing capabilities for async data
  • std::time::Instant – High-precision timer for measuring execution time

Purpose: These imports provide the foundation for HTTP communication, JSON handling, streaming data processing, and performance measurement.

Section 2: Data Structure Definitions

#[derive(Debug, Deserialize)]
struct OllamaResponse {
    message: Message,
    done: bool,
}

#[derive(Debug, Deserialize)]
struct Message {
    content: String,
}

What’s happening here:

  • #[derive(Debug, Deserialize)] – Automatically implements debugging output and JSON deserialization
  • OllamaResponse – Represents each chunk of data received from Ollama’s streaming API
  • Message – Contains the actual text content from the AI model
  • done: bool – Flag indicating when the AI has finished responding

Purpose: These structs define the expected format of JSON responses from Ollama, allowing automatic parsing of streaming data chunks.
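
For reference, each streamed line is a small JSON object along the lines of the sample below (based on Ollama’s documented /api/chat stream format); fields we don’t declare, such as model and created_at, are simply ignored by the derived Deserialize. A minimal, self-contained sanity check:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OllamaResponse {
    message: Message,
    done: bool,
}

#[derive(Debug, Deserialize)]
struct Message {
    content: String,
}

fn main() {
    // A sample chunk in the documented shape; real content and timestamps will differ.
    let line = r#"{"model":"llama3.2","created_at":"2025-01-01T00:00:00Z","message":{"role":"assistant","content":"Bananas"},"done":false}"#;
    let parsed: OllamaResponse = serde_json::from_str(line).expect("chunk should parse");
    println!("{} (done: {})", parsed.message.content, parsed.done);
}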

Section 3: Application Setup and Timing

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Start timer
    let start = Instant::now();
    println!("Program started at: {start:?}");

    let client = Client::new();

What’s happening here:

  • #[tokio::main] – Marks this as an async main function using the Tokio runtime
  • Result<(), Box<dyn std::error::Error>> – Error handling that can return any error type
  • Instant::now() – Captures the current time for performance measurement
  • Client::new() – Creates a new HTTP client instance

Purpose: Sets up the async runtime, initializes performance tracking, and creates the HTTP client for API communication.
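
If you need more control than Client::new() offers, reqwest’s builder can configure the client up front. A small sketch (the five second connect timeout is an arbitrary choice of mine, not from the article):

use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same as Client::new(), but fails fast if the Ollama server isn't reachable.
    let client = reqwest::Client::builder()
        .connect_timeout(Duration::from_secs(5))
        .build()?;
    println!("client ready: {client:?}");
    Ok(())
}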

Section 4: Request Payload Construction

let body = serde_json::json!({
    "model": "llama3.2",
    "messages": [
        {
            "role": "user",
            "content": "Hello, Ollama! Why are bananas yellow?"
        }
    ]
});

What’s happening here:

  • serde_json::json! – Macro for creating JSON objects with Rust syntax
  • "model": "llama3.2" – Specifies which AI model to use
  • messages array – Contains the conversation history in OpenAI-compatible format
  • role: "user" – Identifies the message sender (user vs assistant)

Purpose: Creates the JSON payload that tells Ollama which model to use and what question to ask.
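
If you prefer compile-time checking over the json! macro, the same payload can be built from structs that derive Serialize and passed to .json() unchanged. A sketch of that alternative (the struct names here are mine, not Ollama’s):

use serde::Serialize;

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<ChatMessage>,
}

#[derive(Serialize)]
struct ChatMessage {
    role: String,
    content: String,
}

fn main() {
    let body = ChatRequest {
        model: "llama3.2".into(),
        messages: vec![ChatMessage {
            role: "user".into(),
            content: "Hello, Ollama! Why are bananas yellow?".into(),
        }],
    };
    // Serializes to the same JSON as the json! version above.
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}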

Section 5: HTTP Request Execution

let res = client
    .post("http://localhost:11434/api/chat")
    .json(&body)
    .send()
    .await?;

if !res.status().is_success() {
    eprintln!("Request failed: {}", res.status());
    return Ok(());
}

What’s happening here:

  • .post() – Makes an HTTP POST request to Ollama’s chat endpoint
  • .json(&body) – Serializes the request body as JSON and sets appropriate headers
  • .send().await? – Executes the request asynchronously and handles potential errors
  • res.status().is_success() – Checks if the response has a 2xx status code
  • eprintln! – Prints error messages to stderr

Purpose: Sends the chat request to Ollama and validates that the server responded successfully.
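
As a variation, reqwest can turn a non-2xx status into an error directly with error_for_status(), so the failure propagates through ? instead of the program returning Ok(()) on a failed request. A sketch using the same endpoint and body:

use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "llama3.2",
        "messages": [{ "role": "user", "content": "Hello, Ollama!" }]
    });

    // error_for_status() converts any non-success response into an Err,
    // so ? reports the failure instead of silently exiting with Ok(()).
    let res = Client::new()
        .post("http://localhost:11434/api/chat")
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    println!("status: {}", res.status());
    Ok(())
}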

Section 6: Stream Processing Setup

let mut stream = res.bytes_stream();
let mut full_response = String::new();

while let Some(item) = stream.next().await {
    let chunk = item?;

What’s happening here:

  • res.bytes_stream() – Converts the HTTP response into a stream of byte chunks
  • String::new() – Creates an empty string to accumulate the complete response
  • while let Some(item) = stream.next().await – Loops through each chunk as it arrives
  • let chunk = item? – Extracts bytes from each chunk, handling potential errors

Purpose: Sets up streaming processing to handle data as it arrives rather than waiting for the complete response.
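
One caveat worth flagging: the loop assumes every chunk from bytes_stream() ends on a newline boundary. That usually holds for Ollama’s line-delimited output, but a JSON object could in principle arrive split across two chunks. A small, self-contained sketch of a buffer that only hands back complete lines:

fn drain_complete_lines(buf: &mut String) -> Vec<String> {
    let mut lines = Vec::new();
    // Hand back only lines terminated by '\n'; anything after the last
    // newline stays in the buffer until the next chunk arrives.
    while let Some(pos) = buf.find('\n') {
        let line: String = buf.drain(..=pos).collect();
        let trimmed = line.trim();
        if !trimmed.is_empty() {
            lines.push(trimmed.to_string());
        }
    }
    lines
}

fn main() {
    // Simulate one JSON line arriving split across two chunks.
    let mut buf = String::new();
    buf.push_str(r#"{"message":{"content":"Hel"#);
    assert!(drain_complete_lines(&mut buf).is_empty()); // incomplete: keep buffering
    buf.push_str("lo\"},\"done\":false}\n");
    let lines = drain_complete_lines(&mut buf);
    assert_eq!(lines.len(), 1); // one complete line, ready to parse
    println!("{}", lines[0]);
}

In main() you would push each UTF-8 chunk into the buffer inside the while let loop and parse whatever this returns.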

Section 7: Chunk Processing and JSON Parsing

for line in std::str::from_utf8(&chunk)?.lines() {
    if line.trim().is_empty() {
        continue;
    }

    let parsed: OllamaResponse = serde_json::from_str(line)?;
    print!("{}", parsed.message.content);
    full_response.push_str(&parsed.message.content);

What’s happening here:

  • std::str::from_utf8(&chunk)? – Converts bytes to UTF-8 string
  • .lines() – Splits the chunk into individual lines (each line is a JSON object)
  • line.trim().is_empty() – Skips empty lines
  • serde_json::from_str(line)? – Parses each line as JSON into our struct
  • print!("{}", parsed.message.content) – Immediately displays new content
  • full_response.push_str() – Accumulates content for final display

Purpose: Processes each streaming chunk by parsing JSON and displaying content in real-time.
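
The ? on serde_json::from_str aborts the whole stream on the first malformed line. If you would rather skip bad input and keep streaming, a match on the parse result works; a minimal sketch (not the article’s approach, and using a generic Value to keep it short):

use serde_json::Value;

fn main() {
    // One good line and one deliberately truncated line.
    let lines = [
        r#"{"message":{"content":"Hi"},"done":false}"#,
        r#"{"message":{"content":"Hi"#,
    ];

    for line in lines {
        match serde_json::from_str::<Value>(line) {
            Ok(v) => println!("parsed: {}", v["message"]["content"]),
            Err(e) => eprintln!("skipping malformed line: {e}"),
        }
    }
}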

Section 8: Completion Handling and Results

    if parsed.done {
        // End timer
        let duration = start.elapsed();
        println!("\n--- Done ---");
        println!("Full response:\n{full_response}");
        println!("Elapsed time: {duration:.2?}");
        return Ok(());
    }
}

What’s happening here:

  • if parsed.done – Checks if Ollama has finished generating the response
  • start.elapsed() – Calculates total execution time
  • println! statements – Display completion status, full response, and timing
  • return Ok(()) – Successfully exits the program
  • {duration:.2?} – Formats the Duration’s Debug output to two decimal places (e.g. 1.23s)

Purpose: Detects when the AI has finished responding, measures performance, and displays final results.
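
The done: true chunk also carries performance counters; per Ollama’s API docs these include eval_count and eval_duration (in nanoseconds). A hedged sketch of reading them with optional fields, as a complement to the wall-clock timer (the struct name and sample numbers are illustrative):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct FinalChunk {
    done: bool,
    eval_count: Option<u64>,
    eval_duration: Option<u64>, // nanoseconds, only present on the final chunk
}

fn main() {
    // Illustrative values; the real ones come from the done:true chunk.
    let line = r#"{"done":true,"eval_count":120,"eval_duration":2400000000}"#;
    let parsed: FinalChunk = serde_json::from_str(line).unwrap();
    if parsed.done {
        if let (Some(count), Some(nanos)) = (parsed.eval_count, parsed.eval_duration) {
            println!("{:.1} tokens/sec", count as f64 / (nanos as f64 / 1e9));
        }
    }
}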

Key Concepts Demonstrated

  • Async Programming: The code uses async/await throughout to handle I/O operations without blocking
  • Streaming: Processes data as it arrives rather than waiting for complete responses
  • Error Handling: Uses the ? operator for clean error propagation
  • JSON Processing: Automatic serialization/deserialization with serde
  • Performance Measurement: Built-in timing to measure response speed

Use Cases

This pattern is perfect for:

  • Building chatbots with real-time responses
  • Creating CLI tools for AI interaction
  • Developing applications that need immediate feedback
  • Monitoring AI model performance and response times

Deterministic output?
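
Ollama accepts an options object alongside the messages; fixing the seed (with the same model and sampling settings) should make repeated runs of the same prompt return the same text.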

let body = serde_json::json!({
    "model": "llama3.2",
    "messages": [
        {
            "role": "user",
            "content": "Hello, Ollama! Why are bananas yellow?"
        }
    ],
    "options": {
        "seed": 123, // any number will do
        "temperature": 0.7,  // You can add other options too
        "top_p": 0.9
    }
});

Thanks for reading! Even if you use a framework, it’s good to know how to “roll your own” client.

In a future article we’ll look at creating embeddings:

curl http://localhost:11434/api/embed -d '{
  "model": "bge-m3:latest",
  "input": "Why is the sky blue?"
}'
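
The same hand-rolled client style covers that endpoint too. A rough sketch for now, reading the response into a generic Value rather than a typed struct, since we’ll look at the response shape properly in that article:

use reqwest::Client;
use serde_json::Value;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let body = serde_json::json!({
        "model": "bge-m3:latest",
        "input": "Why is the sky blue?"
    });

    // Same pattern as the chat client, just a different endpoint and no streaming.
    let res: Value = Client::new()
        .post("http://localhost:11434/api/embed")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    println!("{res}");
    Ok(())
}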
