Make your own Ollama Client
Rather than use Rig.rs or Ollama-rs, let’s consider the bare bones of the frameworks we take for granted.
We make a POST request with reqwest to the Ollama API, sending a body that specifies the model and the messages (role & content).
Next, we iterate over the stream and display the chunks until there are no more!
We use a struct (OllamaResponse) to make sure the response is parsed into the shape we expect.
At the end we print the elapsed time.
Dependencies
[dependencies]
reqwest = { version = "0.11", features = ["json", "stream"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1", features = ["full"] }
futures-util = "0.3"
Code
use reqwest::Client;
use serde::Deserialize;
use futures_util::StreamExt;
use std::time::Instant;
#[derive(Debug, Deserialize)]
struct OllamaResponse {
message: Message,
done: bool,
}
#[derive(Debug, Deserialize)]
struct Message {
content: String,
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Start timer
let start = Instant::now();
println!("Program started at: {start:?}");
let client = Client::new();
let body = serde_json::json!({
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Hello, Ollama! Why are bananas yellow?"
}
]
});
let res = client
.post("http://localhost:11434/api/chat")
.json(&body)
.send()
.await?;
if !res.status().is_success() {
eprintln!("Request failed: {}", res.status());
return Ok(());
}
let mut stream = res.bytes_stream();
let mut full_response = String::new();
while let Some(item) = stream.next().await {
let chunk = item?;
for line in std::str::from_utf8(&chunk)?.lines() {
if line.trim().is_empty() {
continue;
}
let parsed: OllamaResponse = serde_json::from_str(line)?;
print!("{}", parsed.message.content);
full_response.push_str(&parsed.message.content);
if parsed.done {
// End timer
let duration = start.elapsed();
println!("\n--- Done ---");
println!("Full response:\n{full_response}");
println!("Elapsed time: {duration:.2?}");
return Ok(());
}
}
}
Ok(())
}
Building a Streaming Ollama Client in Rust: Complete Code Breakdown
This tutorial breaks down a complete Rust application that communicates with Ollama’s local AI server using streaming responses. The code demonstrates async programming, HTTP requests, JSON parsing, and real-time streaming.
Section 1: Dependencies and Imports
use reqwest::Client;
use serde::Deserialize;
use futures_util::StreamExt;
use std::time::Instant;
What’s happening here:
- reqwest::Client – HTTP client library for making web requests
- serde::Deserialize – Automatic JSON deserialization trait
- futures_util::StreamExt – Provides stream processing capabilities for async data
- std::time::Instant – High-precision timer for measuring execution time
Purpose: These imports provide the foundation for HTTP communication, JSON handling, streaming data processing, and performance measurement.
Section 2: Data Structure Definitions
#[derive(Debug, Deserialize)]
struct OllamaResponse {
message: Message,
done: bool,
}
#[derive(Debug, Deserialize)]
struct Message {
content: String,
}
What’s happening here:
- #[derive(Debug, Deserialize)] – Automatically implements debugging output and JSON deserialization
- OllamaResponse – Represents each chunk of data received from Ollama’s streaming API
- Message – Contains the actual text content from the AI model
- done: bool – Flag indicating when the AI has finished responding
Purpose: These structs define the expected format of JSON responses from Ollama, allowing automatic parsing of streaming data chunks.
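To see the deserialization on its own, here is a minimal standalone sketch that parses one hand-written line in the shape these structs expect. The sample JSON is made up for illustration; a real Ollama chunk carries extra fields, which serde simply ignores by default.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct OllamaResponse {
    message: Message,
    done: bool,
}

#[derive(Debug, Deserialize)]
struct Message {
    content: String,
}

fn main() -> Result<(), serde_json::Error> {
    // Illustrative sample only: one line in the shape the streaming API sends.
    // Fields we don't model (like "model" or "role" in this sample) are ignored.
    let sample = r#"{"model":"llama3.2","message":{"role":"assistant","content":"Bananas"},"done":false}"#;
    let parsed: OllamaResponse = serde_json::from_str(sample)?;
    println!("{} (done: {})", parsed.message.content, parsed.done);
    Ok(())
}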
Section 3: Application Setup and Timing
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Start timer
let start = Instant::now();
println!("Program started at: {start:?}");
let client = Client::new();
What’s happening here:
- #[tokio::main] – Marks this as an async main function using the Tokio runtime
- Result<(), Box<dyn std::error::Error>> – Error handling that can return any error type
- Instant::now() – Captures the current time for performance measurement
- Client::new() – Creates a new HTTP client instance
Purpose: Sets up the async runtime, initializes performance tracking, and creates the HTTP client for API communication.
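Client::new() is all we need here, but if you want timeouts the builder API is the place for them. A sketch, with values that are purely arbitrary choices rather than anything Ollama requires:
use std::time::Duration;
use reqwest::Client;

fn main() -> Result<(), reqwest::Error> {
    // Sketch: like Client::new(), but the builder lets us set timeouts.
    // 5s / 120s are arbitrary values, not Ollama requirements.
    let client = Client::builder()
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(120))
        .build()?;
    println!("client ready: {client:?}");
    Ok(())
}
Note that .timeout() covers the whole request, including reading the streamed body, so for long generations you may prefer to set only connect_timeout or keep the overall timeout generous.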
Section 4: Request Payload Construction
let body = serde_json::json!({
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Hello, Ollama! Why are bananas yellow?"
}
]
});
What’s happening here:
- serde_json::json! – Macro for creating JSON objects with Rust syntax
- "model": "llama3.2" – Specifies which AI model to use
- messages array – Contains the conversation history in OpenAI-compatible format
- role: "user" – Identifies the message sender (user vs assistant)
Purpose: Creates the JSON payload that tells Ollama which model to use and what question to ask.
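The json! macro is the quickest route, but the same payload can also be built from typed structs deriving Serialize, which catches typos in field names at compile time. A sketch; the struct names here are our own invention, and Ollama only ever sees the JSON keys:
use serde::Serialize;

#[derive(Serialize)]
struct ChatRequest {
    model: String,
    messages: Vec<ChatMessage>,
}

#[derive(Serialize)]
struct ChatMessage {
    role: String,
    content: String,
}

fn main() {
    // Builds the same JSON as the json! macro above, but from typed structs.
    let body = ChatRequest {
        model: "llama3.2".to_string(),
        messages: vec![ChatMessage {
            role: "user".to_string(),
            content: "Hello, Ollama! Why are bananas yellow?".to_string(),
        }],
    };
    println!("{}", serde_json::to_string_pretty(&body).unwrap());
}
Because ChatRequest implements Serialize, it can be passed straight to .json(&body) exactly like the json! value.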
Section 5: HTTP Request Execution
let res = client
.post("http://localhost:11434/api/chat")
.json(&body)
.send()
.await?;
if !res.status().is_success() {
eprintln!("Request failed: {}", res.status());
return Ok(());
}
What’s happening here:
- .post() – Makes an HTTP POST request to Ollama’s chat endpoint
- .json(&body) – Serializes the request body as JSON and sets appropriate headers
- .send().await? – Executes the request asynchronously and handles potential errors
- res.status().is_success() – Checks if the response has a 2xx status code
- eprintln! – Prints error messages to stderr
Purpose: Sends the chat request to Ollama and validates that the server responded successfully.
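As an aside, reqwest can fold the status check into the call chain: error_for_status() converts any non-2xx response into an Err. A drop-in sketch for the request above, after which the manual if block goes away:
let res = client
    .post("http://localhost:11434/api/chat")
    .json(&body)
    .send()
    .await?
    .error_for_status()?;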
Section 6: Stream Processing Setup
let mut stream = res.bytes_stream();
let mut full_response = String::new();
while let Some(item) = stream.next().await {
let chunk = item?;
What’s happening here:
- res.bytes_stream() – Converts the HTTP response into a stream of byte chunks
- String::new() – Creates an empty string to accumulate the complete response
- while let Some(item) = stream.next().await – Loops through each chunk as it arrives
- let chunk = item? – Extracts bytes from each chunk, handling potential errors
Purpose: Sets up streaming processing to handle data as it arrives rather than waiting for the complete response.
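One caveat: bytes_stream() yields raw byte chunks, and HTTP gives no guarantee that a chunk ends exactly on a newline. Ollama usually writes one JSON object per chunk, so the simple loop works in practice, but a more defensive sketch (a drop-in for the while loop in main) buffers the bytes and only hands complete lines to the parser:
let mut stream = res.bytes_stream();
let mut buffer = String::new();
let mut full_response = String::new();

while let Some(item) = stream.next().await {
    // Accumulate the new bytes, then peel off only the completed lines.
    buffer.push_str(std::str::from_utf8(&item?)?);
    while let Some(pos) = buffer.find('\n') {
        let line: String = buffer.drain(..=pos).collect();
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let parsed: OllamaResponse = serde_json::from_str(line)?;
        print!("{}", parsed.message.content);
        full_response.push_str(&parsed.message.content);
    }
}
A fully defensive version would also buffer raw bytes before UTF-8 decoding, since a chunk could in principle split a multi-byte character; the done handling from Section 8 slots into the inner loop unchanged.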
Section 7: Chunk Processing and JSON Parsing
for line in std::str::from_utf8(&chunk)?.lines() {
if line.trim().is_empty() {
continue;
}
let parsed: OllamaResponse = serde_json::from_str(line)?;
print!("{}", parsed.message.content);
full_response.push_str(&parsed.message.content);
What’s happening here:
- std::str::from_utf8(&chunk)? – Converts bytes to UTF-8 string
- .lines() – Splits the chunk into individual lines (each line is a JSON object)
- line.trim().is_empty() – Skips empty lines
- serde_json::from_str(line)? – Parses each line as JSON into our struct
- print!("{}", parsed.message.content) – Immediately displays new content
- full_response.push_str() – Accumulates content for final display
Purpose: Processes each streaming chunk by parsing JSON and displaying content in real-time.
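The ? on serde_json::from_str aborts the whole program the moment a line doesn’t match our struct, for instance if the server streams back an error object. A more forgiving sketch of the inner loop logs the offending line and keeps going:
for line in std::str::from_utf8(&chunk)?.lines() {
    if line.trim().is_empty() {
        continue;
    }
    // Tolerant variant: skip anything that doesn't fit OllamaResponse
    // instead of bailing out of main with `?`.
    match serde_json::from_str::<OllamaResponse>(line) {
        Ok(parsed) => {
            print!("{}", parsed.message.content);
            full_response.push_str(&parsed.message.content);
        }
        Err(e) => eprintln!("\nskipping unparsed line ({e}): {line}"),
    }
}
The done check and timing logic from Section 8 still belong inside the Ok arm.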
Section 8: Completion Handling and Results
if parsed.done {
// End timer
let duration = start.elapsed();
println!("\n--- Done ---");
println!("Full response:\n{full_response}");
println!("Elapsed time: {duration:.2?}");
return Ok(());
}
}
What’s happening here:
- if parsed.done – Checks if Ollama has finished generating the response
- start.elapsed() – Calculates total execution time
- println! statements – Display completion status, full response, and timing
- return Ok(()) – Successfully exits the program
- {duration:.2?} – Formats duration with 2 decimal places
Purpose: Detects when the AI has finished responding, measures performance, and displays final results.
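The final done: true object from Ollama also reports generation statistics such as eval_count and eval_duration (the latter in nanoseconds). A hedged sketch that extends the structs to surface a rough tokens-per-second figure; the sample line in main is made up purely for illustration:
use serde::Deserialize;

// Sketch: the final done:true chunk also reports stats, so model them as Options.
#[derive(Debug, Deserialize)]
struct OllamaResponse {
    message: Message,
    done: bool,
    eval_count: Option<u64>,
    eval_duration: Option<u64>, // nanoseconds
}

#[derive(Debug, Deserialize)]
struct Message {
    content: String,
}

fn tokens_per_second(resp: &OllamaResponse) -> Option<f64> {
    // Only the final chunk carries both fields; earlier chunks return None.
    let count = resp.eval_count? as f64;
    let nanos = resp.eval_duration? as f64;
    Some(count / (nanos / 1_000_000_000.0))
}

fn main() {
    // Made-up final line, for illustration only.
    let final_line = r#"{"message":{"role":"assistant","content":""},"done":true,"eval_count":200,"eval_duration":2000000000}"#;
    let resp: OllamaResponse = serde_json::from_str(final_line).unwrap();
    if let Some(tps) = tokens_per_second(&resp) {
        println!("{tps:.1} tokens/s");
    }
}
Because the extra fields are Options, nothing breaks on intermediate chunks where they are absent.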
Key Concepts Demonstrated
- Async Programming: The code uses async/await throughout to handle I/O operations without blocking
- Streaming: Processes data as it arrives rather than waiting for complete responses
- Error Handling: Uses the ? operator for clean error propagation
- JSON Processing: Automatic serialization/deserialization with serde
- Performance Measurement: Built-in timing to measure response speed
Use Cases
This pattern is perfect for:
- Building chatbots with real-time responses
- Creating CLI tools for AI interaction
- Developing applications that need immediate feedback
- Monitoring AI model performance and response times
Deterministic output?
Ollama accepts an options object alongside the model and messages; fixing the seed (with the same sampling settings such as temperature and top_p) makes the model’s output reproducible across runs.
let body = serde_json::json!({
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Hello, Ollama! Why are bananas yellow?"
}
],
"options": {
"seed": 123, // any number will do
"temperature": 0.7, // You can add other options too
"top_p": 0.9
}
});
Thanks for reading. Even if you use a framework, it’s good to “roll your own” client.
In a future article we’ll look at creating embeddings:
curl http://localhost:11434/api/embed -d '{
"model": "bge-m3:latest",
"input": "Why is the sky blue?"
}'
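As a teaser, the same call could be made from Rust with the client we already have. A sketch only, assuming the /api/embed endpoint and bge-m3 model from the curl command above, and printing the raw JSON rather than modelling the response:
use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    // Same request as the curl example above, sent from Rust.
    let body = serde_json::json!({
        "model": "bge-m3:latest",
        "input": "Why is the sky blue?"
    });
    let res = client
        .post("http://localhost:11434/api/embed")
        .json(&body)
        .send()
        .await?;
    // Just print the raw JSON for now; parsing embeddings is for the next article.
    println!("{}", res.text().await?);
    Ok(())
}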
