Skip to navigationSkip to main contentSkip to footerScaleway DocsSparklesIconAsk our AI
SparklesIconAsk our AI

Building a web browser agent with Generative APIs, Holo2 and Selenium

inferenceAPIseleniumbrowserholo2agentvisionai

This tutorial will guide you through creating a web agent that can interact with websites using LLM vision and browser automation. You will build on Scaleway's Generative APIs to create a system that can "see" web pages and take actions based on visual understanding. This approach is particularly useful for building more flexible pipelines compared to HTML DOM parsing, and requires less maintenance over time when website HTML code changes.

Why Holo2 model ?

The Holo2 model is a vision-language model optimized to understand Graphical User Interfaces (GUIs) such as web pages or mobile applications, and perform actions on them. Compared to traditional HTML DOM parsing, using a vision model allows you to build more flexible pipelines that require less maintenance over time when website code and structure changes.

What you will learn

  • How to take screenshot and perform actions using Selenium
  • How to analyze images and process actions output using Holo2 vision model

Before you start

To complete the actions presented below, you must have:

Install required packages

Run the following command to install the required packages:

pip install openai selenium
  • openai enables to query Holo2 model through Generative APIs
  • selenium enables interaction with browsers

Import dependencies

Create a new holo2-agent.py file with the following content:

#holo2-agent.py
from openai import OpenAI
from selenium import webdriver
from selenium.webdriver.common.actions.action_builder import ActionBuilder
import base64
import os
import json
import time

Define tasks to perform and website to browse

Define what tasks you want your agent to perform and on which website:

#holo2-agent.py
TASKS = ["Accept cookies", "Click on changelog", "Select newly added feature"]
WEBSITE_URL = "https://www.scaleway.com/en/docs/"

For this example we will make the agent go to the Scaleway Documentation website, and look for recently added features.

Create the model output structure

By default, Holo2 outputs text data. However, it can be guided to output coordinates in a structured way using JSON:

#holo2-agent.py
output_structure = {
    'x': {
        'description': 'The x coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    },
    'y': {
        'description': 'The y coordinate, normalized between 0 and 1000.',
        'type': 'integer'
    }
}

This structure ensures coordinates are valid within the expected range (0-1000), normalized and independant of the exact browser window size).

Connect to Scaleway's Generative APIs

Set up the OpenAI client to connect to Scaleway's API:

#holo2-agent.py
client = OpenAI(
    base_url="https://api.scaleway.ai/v1",
    api_key=os.getenv("SCW_SECRET_KEY")  # Store your IAM API KEY in this environment variable or replace directly with your IAM API key
)

Identify where to click on an image

Create a function that sends a task and an interface screenshot to the AI model and output (x,y) coordinates on the screen. The coordinates correspond to the location the agent needs to click to perform the task:

#holo2-agent.py
def get_next_action(task):
    # Read and encode the current screenshot
    with open("current_screen.png", "rb") as file:
        image_content = file.read()
        base64_img = base64.b64encode(image_content).decode("utf-8")

    # Send the image and task to the vision model
    response = client.chat.completions.create(
        model="holo2-30b-a3b",
        messages=[
            { 
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_img}",
                        },
                    },
                    {
                        "type": "text",
                        "text": f"""Localize an element on the GUI image according to the provided target and output a click position.
                        * You must output a valid JSON following the format: {json.dumps(output_structure)}
                        Your target is: {task}"""
                    }
                ]
            }
        ],
        max_tokens=10000,
        temperature=0.8,
        top_p=0.95
    )

    # Parse the response
    if response.choices[0].message.reasoning != None:
        next_action = json.loads(response.choices[0].message.reasoning)
    
    if response.choices[0].message.content != None:
        next_action = json.loads(response.choices[0].message.content)
    
    return next_action

Automate browser actions

Use Selenium to:

  • Take a screenshot of the current page
  • Click on the right location to perform the task.

Click coordinates are retrieved from Holo2 using the get_next_action function. These coordinates are adjusted to exact browser coordinates using window.innerWidth and window.innerHeight browser properties.

#holo2-agent.py
driver = webdriver.Chrome() # Use webdriver.Firefox() for Firefox

try:
    # Navigate to the target website
    driver.get(WEBSITE_URL)
    time.sleep(3)  # Wait for the page to load
    
    # Process each task in sequence
    for task in TASKS:
        # Take a screenshot of the current page
        driver.save_screenshot('current_screen.png')
        
        # Get the current page dimensions
        page_width = driver.execute_script("return window.innerWidth;")
        page_height = driver.execute_script("return window.innerHeight;")

        # Get click position from the AI model
        next_action = get_next_action(task)
        
        # Convert normalized coordinates to actual screen coordinates
        click_x = (next_action['x'] / 1000) * page_width
        click_y = (next_action['y'] / 1000) * page_height

        # Create and perform the click action
        action = ActionBuilder(driver)
        action.pointer_action.move_to_location(click_x, click_y)
        action.pointer_action.click()
        action.perform()
        
        print(f"Performing task: {task}. Clicked at coordinates: X={click_x}, Y={click_y}")
        time.sleep(3)  # Wait to see the result
        
finally:
    # Always close the browser
    driver.quit()

Run the agent

Execute the agent with:

SCW_SECRET_KEY="your_scaleway_secret_key" \
python holo2-agent.py

You should see your browser open and perform the given actions until the most recently added features are displayed.

Congratulations! You have built a web browser agent navigating through website only based on text written tasks.

Going further

  • Add different tasks and see the agent adapt to them
  • Add other actions types such as scrolling through the page or typing text.
  • Use Pydantic to define output structure and check its validity
  • Extract information from the browsed page
  • Use the agent with "thick" client applications, such as desktop or mobile applications (without Selenium)
Questions?

Visit our Help Center and find the answers to your most frequent questions.

Visit Help CenterArrowRightIcon
SearchIcon
No Results