
Using OpenAI GPT-4V model for image reasoning

In this notebook, we show how to use the OpenAI LLM abstraction with GPT-4V for image understanding and reasoning.

We also show several functions that are currently supported in the OpenAI LLM class when working with GPT-4V (a minimal sketch of the calling pattern follows the list):

  • complete (both sync and async): for a single prompt and list of images
  • chat (both sync and async): for multiple chat messages
  • stream complete (both sync and async): for streaming output of complete
  • stream chat (both sync and async): for streaming output of chat
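As a quick orientation, the basic calling pattern looks like the sketch below. This is a minimal example, not part of the walkthrough: it assumes the installation and API key setup from the next cells and reuses the first image URL used later in this notebook.

from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage, ImageBlock, TextBlock, MessageRole

# Build one user message that mixes a text block with an image block
llm = OpenAI(model="gpt-4o", max_tokens=300)
msg = ChatMessage(
    role=MessageRole.USER,
    blocks=[
        TextBlock(text="Describe this image"),
        ImageBlock(url="https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg"),
    ],
)
print(llm.chat(messages=[msg]))  # synchronous chat
# response = await llm.achat(messages=[msg])  # asynchronous variant, run inside a notebook cell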
%pip install llama-index-llms-openai matplotlib
import os
OPENAI_API_KEY = "sk-..." # Your OpenAI API token here
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Initialize the OpenAI LLM and Load Images from URLs


from llama_index.llms.openai import OpenAI
image_urls = [
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://i2-prod.mirror.co.uk/incoming/article7160664.ece/ALTERNATES/s1200d/FIFA-Ballon-dOr-Gala-2015.jpg",
]
openai_llm = OpenAI(model="gpt-4o", max_tokens=300)
from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt
img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)
https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg
<matplotlib.image.AxesImage at 0x11a2dc920>

(Output: the image from image_urls[0] rendered with matplotlib.)

from llama_index.core.llms import (
    ChatMessage,
    ImageBlock,
    TextBlock,
    MessageRole,
)

msg = ChatMessage(
    role=MessageRole.USER,
    blocks=[
        TextBlock(text="Describe the images as an alternative text"),
        ImageBlock(url=image_urls[0]),
        ImageBlock(url=image_urls[1]),
    ],
)
response = openai_llm.chat(messages=[msg])
print(response)
assistant: **Image 1:** The Colosseum in Rome is illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater stands prominently against a deep blue sky, with some clouds visible. The foreground shows a construction area with barriers and a few people walking nearby.
**Image 2:** A line graph titled "The U.S. Mortgage Rate Surge" compares the U.S. 30-year fixed-rate mortgage (in red) with existing home sales (in blue) from 2014 to 2023. The mortgage rate line shows a significant increase, reaching its highest level in over 20 years. Existing home sales fluctuate, with a notable decline in recent years. A text box highlights that in 2023, high mortgage rates and rising home prices have led to the lowest housing affordability since 1989.
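The synchronous streaming variant follows the same pattern; here is a minimal sketch that reuses openai_llm and msg from the cells above:

# Iterate over the generator returned by stream_chat and print each incremental delta
stream_resp = openai_llm.stream_chat(messages=[msg])
for delta in stream_resp:
    print(delta.delta, end="")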

We can also stream the model response asynchronously:

async_resp = await openai_llm.astream_chat(messages=[msg])
async for delta in async_resp:
    print(delta.delta, end="")
**Image 1:** The Colosseum in Rome is illuminated at night with the colors of the Italian flag: green, white, and red. The ancient structure stands prominently against a deep blue sky, with some clouds visible. The lower part of the image shows a construction area with barriers and a few people walking nearby.
**Image 2:** A line graph titled "The U.S. Mortgage Rate Surge" compares the U.S. 30-year fixed-rate mortgage (in red) with existing home sales (in blue) from 2014 to 2023. The graph shows mortgage rates rising sharply in recent years, while home sales fluctuate. A note highlights that in 2023, high mortgage rates and rising home prices have led to the lowest housing affordability since 1989.

Use GPT-4V to understand images from local files

%pip install llama-index-readers-file
from pathlib import Path
import shutil
import requests
img_path = Path().resolve() / "image.jpg"
response = requests.get(image_urls[-1])
with open(img_path, "wb") as file:
    file.write(response.content)
msg = ChatMessage(
    role=MessageRole.USER,
    blocks=[
        TextBlock(text="Describe the image as an alternative text"),
        ImageBlock(path=img_path, image_mimetype="image/jpeg"),
    ],
)
response = openai_llm.chat(messages=[msg])
print(response)
assistant: A person in a black tuxedo and bow tie is holding a golden soccer ball trophy on a stage. The background is a warm yellow color with spotlights shining upwards.
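The asynchronous chat call works the same way with a local-file message; a minimal sketch reusing openai_llm and msg from above (top-level await works in a notebook cell):

# Await the async chat call on the same local-image message
async_response = await openai_llm.achat(messages=[msg])
print(async_response)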