Processing images and getting structured outputs with Pixtral vision model
In today's data-driven world, the ability to extract structured information from visual content is becoming increasingly valuable across various industries. From analyzing medical images to interpreting financial charts, from processing historical documents to cataloging diverse product lines, the applications are vast and varied.
Pixtral, developed by Mistral AI, is a powerful vision model capable of understanding and describing images with remarkable accuracy. By leveraging Pixtral, we can automate the process of extracting structured information from a wide range of visual inputs, including photographs, scanned documents, graphs, and more.
This tutorial will guide you through the process of using the Pixtral vision model to analyze images and automatically generate structured outputs. We'll use Python to interact with the model and structure our data, making it easy to integrate this solution into your existing workflows. While we'll use a product catalog as an example, the techniques demonstrated here can be adapted to various use cases across different industries.
Before you start
To complete the actions presented below, you must have:
- A Scaleway account logged into the console
- A Python environment (version 3.7 or higher)
- An API key from Scaleway Identity and Access Management
- Access to a Scaleway Managed Inference endpoint with Pixtral deployed or to Scaleway Generative APIs service
- The
openai
andpydantic
Python libraries installed
Setting up the environment
Before we dive into using Pixtral, let's set up our Python environment and install the necessary libraries.
-
Create a new directory for your project:
mkdir pixtral-image-processor cd pixtral-image-processor
-
Create a virtual environment and activate it:
python3 -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate`
-
Install the required libraries:
pip install openai pydantic
Defining the data model
We'll start by defining our data model using pydantic
. This will ensure that our structured output has a consistent format and that all required fields are present.
Create a new file called models.py
and add the following code:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import date
class Dimension(BaseModel):
length: float
width: float
height: float
unit: str = Field(..., pattern="(cm|in|mm)")
class Price(BaseModel):
amount: float
currency: str = Field(..., pattern="[A-Z]{3}")
class Review(BaseModel):
user_id: str
rating: int = Field(..., ge=1, le=5)
comment: Optional[str] = None
date: date
class Product(BaseModel):
id: str
name: str
description: str
category: str
subcategory: Optional[str] = None
brand: str
sku: str
price: Price
dimensions: Dimension
weight: float
weight_unit: str = Field(..., pattern="(kg|g|lb|oz)")
in_stock: bool
available_colors: List[str] = Field(..., min_items=1)
features: List[str] = Field(..., min_items=1)
image_urls: List[str] = Field(..., min_items=1)
reviews: List[Review] = Field(default_factory=list)
average_rating: Optional[float] = None
class ProductCatalog(BaseModel):
products: List[Product]
total_products: int
last_updated: date
This model defines the structure for our product catalog, which we'll use as an example of structured output from image processing.
Setting up the Pixtral client
Next, we'll set up the client to interact with the Pixtral model. Create a new file called pixtral_client.py
and add the following code:
from openai import OpenAI
import os
MODEL = "pixtral-12b-2409"
API_KEY = os.environ.get("SCALEWAY_API_KEY")
BASE_URL = os.environ.get("SCALEWAY_INFERENCE_ENDPOINT_URL")
# use https://api.scaleway.ai/v1 for Scaleway Generative APIs
client = OpenAI(
base_url=BASE_URL,
api_key=API_KEY
)
def get_pixtral_client():
return client
Make sure to set the SCALEWAY_API_KEY
and SCALEWAY_INFERENCE_ENDPOINT_URL
environment variables with your actual API key from Scaleway IAM, and the appropriate endpoint URL for Scaleway Managed Inference or Generative APIs service.
Creating the image processor
Now, let's create the main script that will use Pixtral to analyze images and generate our structured output. Create a file called process_images.py
and add the following code:
import json
from datetime import date
from pixtral_client import get_pixtral_client
from models import ProductCatalog
def process_images(image_urls):
client = get_pixtral_client()
prompt = """
Extract detailed information from the provided images. Create an entry for each image.
Include the following details for each item:
- A descriptive name and detailed description
- Appropriate category and subcategory
- Realistic dimensions, weight, and pricing (if applicable)
- At least 3 key features or characteristics
- Any visible attributes (e.g., colors, materials)
- Generate 2 hypothetical reviews or interpretations
Ensure all information is consistent with what can be seen or reasonably inferred from the images.
"""
messages = [
{
"role": "system",
"content": "You are a helpful assistant. Only reply in JSON.",
},
{
"role": "user",
"content": [
{
"type": "text",
"text": prompt
},
*[{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
],
},
]
try:
response = client.chat.completions.create(
messages=messages,
model=MODEL,
response_format={
"type": "json_schema",
"json_schema": {
"strict": True,
"name": "Processed Image Data",
"schema": ProductCatalog.model_json_schema()
}
},
)
processed_data = response.choices[0].message.parsed
structured_output = ProductCatalog(**processed_data)
structured_output.last_updated = date.today()
structured_output.total_products = len(structured_output.products)
return structured_output
except Exception as e:
print(f"Error processing images: {e}")
return None
def save_output_to_json(output, filename):
with open(filename, 'w') as f:
json.dump(output.model_dump(), f, indent=2, default=str)
if __name__ == "__main__":
image_urls = [
"https://picsum.photos/id/26/800/600", # Sample image 1
"https://picsum.photos/id/3/800/600" # Sample image 2
]
processed_output = process_images(image_urls)
if processed_output:
save_output_to_json(processed_output, "processed_image_data.json")
print(f"Image processing complete. Structured data for {processed_output.total_products} items generated.")
else:
print("Failed to process images.")
This script does the following:
- Imports the necessary modules and models.
- Defines a function to process images using the Pixtral model.
- Creates a prompt that instructs the model on how to analyze the images and what information to extract.
- Sends the images and prompt to the Pixtral model and receives the generated structured data.
- Validates the received data against our
pydantic
models. - Saves the generated structured output to a JSON file.
Running the image processor
To use the image processor:
-
Set the environment variables for your Scaleway API key and inference endpoint URL:
export SCALEWAY_API_KEY="your_api_key_here" export SCALEWAY_INFERENCE_ENDPOINT_URL="your_endpoint_url_here"
-
Run the script:
python process_images.py
The script will process the sample images and generate a processed_image_data.json
file containing the extracted structured information.
Customizing the image processor
You can easily customize the image processor for your specific needs:
- Modify the
prompt
inprocess_images.py
to extract different or additional information from the images. - Update the
models.py
file to change the structure of your output data to fit your specific use case. - Add error handling and logging to make the script more robust for production use.
Conclusion
In this tutorial, we've explored how to leverage Mistral's Pixtral vision model to process images and generate structured outputs following a strict and complex JSON schema. This approach can be applied to a wide range of industries and use cases, from cataloging products to analyzing medical images, from interpreting financial charts to processing historical documents.
By combining the power of AI vision models with structured data validation, we've created a flexible and extensible solution that can be adapted to various image processing needs.
Visit our Help Center and find the answers to your most frequent questions.
Visit Help Center