Integrating Vision using the latest OpenAI API

By Athos Georgiou

Welcome to the latest instalment of the series on building an AI chat assistant from scratch. In this article, I'll be integrating vision into the AI chat assistant I've been building in the previous articles. This involves creating new UI components for the Vision API and new API routes to support the functionality, and I'll also be refactoring the existing code to improve readability and maintainability.

In prior articles, I've covered the following topics:

  • Part 1 - Integrating Markdown in Streaming Chat for AI Assistants
  • Part 2 - Creating a Customized Input Component in a Streaming AI Chat Assistant Using Material-UI
  • Part 3 - Integrating Next-Auth in a Streaming AI Chat Assistant Using Material-UI
  • Part 5 - Integrating the OpenAI Assistants API in a Streaming AI Chat Assistant Using Material-UI

Although these topics don't cover the entire project, they provide a good starting point for anyone interested in building their own AI chat assistant and serve as the foundation for the rest of the series.

If you'd prefer to skip ahead and grab the code yourself, you can find it on GitHub.

Overview

OpenAI Vision is a recent feature that allows you to get insights from images. It can be used to detect objects, faces, landmarks, text, and more. The API is currently in beta and is subject to change. For more information, check out the OpenAI Vision API documentation.

Generally, integrating Vision into your application involves the following steps:

  1. Create a new API route to handle the API calls.
  2. Create a new UI component that lets the user upload images or add image URLs.
  3. Show the results of the API calls in the chat UI component, using streaming chat.

Our goal is to create a customized configuration component for the Vision API. This component will let the user upload images or add image URLs and will display the results of the API calls in the chat UI component, using streaming chat. The user will also be able to delete all Vision-related data/files and enable or disable the Vision API with a switch.
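
To ground this, here is a minimal sketch of the kind of request the integration ultimately makes against the Vision API. It assumes the official openai npm package and the gpt-4-vision-preview model; the function name and prompt are illustrative rather than part of the project:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Ask the Vision model a question about a single image URL (illustrative sketch).
export async function describeImage(imageUrl: string, question: string) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4-vision-preview',
    max_tokens: 300,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
  });
  return response.choices[0]?.message?.content ?? '';
}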

Prerequisites

Before we dive in, make sure you have the following:

  • A basic understanding of React and Material-UI
  • Node.js and npm installed in your environment.
  • A React project set up, ideally using Next.js. Keep in mind that I'll be using Titanium, a template that already has a lot of the basic functionality needed for building an AI Assistant. You can use it as a starting point for your own project, or just follow along and copy/paste the code snippets you need.

Step 1: Creating new UI Components

In this step, we'll be creating the new UI components needed to support the new functionality. This includes a new VisionDialog component, which will be used to display the Vision configuration dialog, along with its supporting components. We'll also be updating the CustomizedInputBase component to support the new functionality.

Update the CustomizedInputBase component

I've updated the CustomizedInputBase component to support the new functionality. This includes rendering the new VisionDialog component, which displays the Vision configuration dialog, and adding a new Vision entry to the menu that opens it.

Although the initial implementation used several state variables to track the state of the UI, I've refactored the code to use react-hook-form instead. This allowed me to simplify the code and make it more readable/maintainable.
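
As a rough illustration of that pattern (not the exact Titanium code), here is what a small form handled with react-hook-form looks like; the component, field names, and props are assumptions made for the sketch:

import { useForm } from 'react-hook-form';

interface AddUrlFormValues {
  name: string;
  url: string;
}

// Illustrative only: react-hook-form owns the field state and validation,
// so no per-field useState/handleChange boilerplate is needed.
function AddUrlForm({ onAdd }: { onAdd: (values: AddUrlFormValues) => void }) {
  const {
    register,
    handleSubmit,
    reset,
    formState: { errors },
  } = useForm<AddUrlFormValues>({ defaultValues: { name: '', url: '' } });

  return (
    <form
      onSubmit={handleSubmit((values) => {
        onAdd(values);
        reset();
      })}
    >
      <input placeholder="Name" {...register('name', { required: true })} />
      {errors.name && <span>Name is required</span>}
      <input placeholder="URL" {...register('url', { required: true })} />
      {errors.url && <span>URL is required</span>}
      <button type="submit">Add</button>
    </form>
  );
}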

You can view the full CustomizedInputBase file on GitHub: CustomizedInputBase, but the UI component changes are shown below:

...
    <>
      <Paper
        component="form"
        sx={{
          p: '2px 4px',
          display: 'flex',
          alignItems: 'center',
          width: isSmallScreen ? '100%' : 650,
        }}
        onKeyDown={(event) => {
          if (event.key === 'Enter') {
            event.preventDefault();
            handleSendClick();
          }
        }}
      >
        <IconButton
          sx={{ p: '10px' }}
          aria-label="menu"
          onClick={handleMenuOpen}
        >
          <MenuIcon />
        </IconButton>
        <Menu
          anchorEl={anchorEl}
          open={Boolean(anchorEl)}
          onClose={handleMenuClose}
          anchorOrigin={{
            vertical: 'top',
            horizontal: 'right',
          }}
          transformOrigin={{
            vertical: 'top',
            horizontal: 'right',
          }}
        >
          <MenuItem onClick={handleAssistantsClick}>
            <ListItemIcon>
              <AssistantIcon />
            </ListItemIcon>
            Assistant
          </MenuItem>
          <MenuItem onClick={handleVisionClick}>
            <ListItemIcon>
              <VisionIcon fontSize="small" />
            </ListItemIcon>
            Vision
          </MenuItem>
        </Menu>
        <InputBase
          sx={{ ml: 1, flex: 1 }}
          placeholder="Enter your message"
          value={inputValue}
          onChange={handleInputChange}
        />
        <IconButton
          type="button"
          sx={{ p: '10px' }}
          aria-label="send"
          onClick={handleSendClick}
        >
          <SendIcon />
        </IconButton>
      </Paper>

      <AssistantDialog
        open={isAssistantDialogOpen}
        onClose={() => setIsAssistantDialogOpen(false)}
      />

      <VisionDialog
        open={isVisionDialogOpen}
        onClose={() => setIsVisionDialogOpen(false)}
      />
    </>

...
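
The snippet above leans on a handful of state variables and handlers that aren't shown. Here's a simplified sketch of what they amount to, pulled out into a hook purely for the sake of the example (the onSendMessage callback is an assumption; the real file keeps this logic inline and differs in the details):

import { useState, ChangeEvent, MouseEvent } from 'react';

// Sketch of the state and handlers CustomizedInputBase relies on.
function useInputBaseState(onSendMessage: (message: string) => void) {
  const [anchorEl, setAnchorEl] = useState<null | HTMLElement>(null);
  const [isAssistantDialogOpen, setIsAssistantDialogOpen] = useState(false);
  const [isVisionDialogOpen, setIsVisionDialogOpen] = useState(false);
  const [inputValue, setInputValue] = useState('');

  // Anchor the menu to the clicked icon button; null closes it.
  const handleMenuOpen = (event: MouseEvent<HTMLElement>) =>
    setAnchorEl(event.currentTarget);
  const handleMenuClose = () => setAnchorEl(null);

  const handleAssistantsClick = () => {
    setIsAssistantDialogOpen(true);
    handleMenuClose();
  };

  const handleVisionClick = () => {
    setIsVisionDialogOpen(true);
    handleMenuClose();
  };

  const handleInputChange = (event: ChangeEvent<HTMLInputElement>) =>
    setInputValue(event.target.value);

  // Send the current message and clear the input.
  const handleSendClick = () => {
    if (!inputValue.trim()) return;
    onSendMessage(inputValue);
    setInputValue('');
  };

  return {
    anchorEl,
    inputValue,
    isAssistantDialogOpen,
    setIsAssistantDialogOpen,
    isVisionDialogOpen,
    setIsVisionDialogOpen,
    handleMenuOpen,
    handleMenuClose,
    handleAssistantsClick,
    handleVisionClick,
    handleInputChange,
    handleSendClick,
  };
}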

Create the VisionDialog component

I've created a new VisionDialog component, which displays the Vision configuration dialog. It lets the user add images by URL and displays the results of the API calls in the chat UI component, using streaming chat. The user can also delete all Vision-related data/files and enable or disable the Vision API with a switch.

Thankfully, a lot of the grunt work had already been done for the Assistant I implemented earlier, so it was mostly a matter of repurposing the existing code to support the new functionality. I've added a new VisionFileList component, which displays the files added for Vision, and a new AddUrlDialog component, which displays the dialog for adding URLs for Vision.

You can view the full VisionDialog file on GitHub: VisionDialog, but the UI component changes are shown below:


...

      <Dialog open={open} onClose={onClose}>
        <DialogTitle style={{ textAlign: 'center' }}>
          Add Vision Images
        </DialogTitle>
        <DialogContent style={{ paddingBottom: 8 }}>
          <VisionFileList files={visionFiles} onDelete={handleRemoveUrl} />
        </DialogContent>
        <DialogActions style={{ paddingTop: 0 }}>
          <Box
            display="flex"
            flexDirection="column"
            alignItems="stretch"
            width="100%"
          >
            <Button
              onClick={handleUpdate}
              style={{ marginBottom: '8px' }}
              variant="outlined"
              color="success"
            >
              Update
            </Button>
            <Box display="flex" justifyContent="center" alignItems="center">
              <Button onClick={handleCloseClick}>Close Window</Button>
              <Button onClick={handleAddUrlClick}>Add URL</Button>
              <Typography variant="caption" sx={{ mx: 1 }}>
                Disable
              </Typography>
              <Switch
                checked={isVisionEnabled}
                onChange={handleToggle}
                name="activeVision"
              />
              <Typography variant="caption" sx={{ mx: 1 }}>
                Enable
              </Typography>
              <input
                type="file"
                ref={visionFileInputRef}
                style={{ display: 'none' }}
              />
            </Box>
          </Box>
        </DialogActions>
      </Dialog>

      <AddUrlDialog
        open={isAddUrlDialogOpen}
        onClose={() => setIsAddUrlDialogOpen(false)}
        onAddUrl={handleAddUrl}
      />
    </>

...
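
As with the input component, this snippet relies on handlers that aren't shown. The component's handleToggle, handleUpdate, and handleRemoveUrl handlers essentially wrap calls like the ones sketched below; the fetch paths and object shapes are assumptions, and the real handlers live in the VisionDialog file on GitHub:

// Sketch of what the VisionDialog handlers do (fetch paths and shapes assumed).

interface VisionFile {
  name: string;
  url: string;
  visionId?: string;
}

// Called by the Enable/Disable switch: persist the flag via the update route.
async function toggleVision(userEmail: string, isVisionEnabled: boolean) {
  await fetch('/api/vision/update', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ isVisionEnabled, userEmail }),
  });
}

// Called by the Update button: persist each pending URL via the add-URL route.
async function saveVisionFiles(userEmail: string, visionFiles: VisionFile[]) {
  for (const file of visionFiles) {
    await fetch('/api/vision/add-url', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ file, userEmail }),
    });
  }
}

// Called by the delete icon in VisionFileList: remove the entry server-side.
async function deleteVisionFile(userEmail: string, file: VisionFile) {
  await fetch('/api/vision/delete-url', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ file, userEmail }),
  });
}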

Create the VisionFileList component

I've created a new VisionFileList component, which displays the files added for Vision and lets the user delete them. This component is almost identical to the one I used for the Assistant, so again it was relatively straightforward.

One challenge I faced was ensuring that the user can delete all Vision-related data/files. The Vision API doesn't provide file management, so I had to come up with a workaround. More on this when I go over the API routes.

Also, to avoid clutter, I did some refactoring to make the AssistantFileList and VisionFileList components reusable. This involved moving the shared list rendering into a reusable component (FilePaper in the snippet below), which is used by both the AssistantDialog and VisionDialog components.

The component looks like the code below, but if you want to see the full file, you can view VisionFileList on GitHub: VisionFileList


...


  <FilePaper
    files={files}
    renderFileItem={(file) => (
      <ListItem
        key={file.id}
        secondaryAction={
          <IconButton
            edge="end"
            aria-label="delete"
            onClick={() => onDelete(file)}
          >
            <DeleteIcon />
          </IconButton>
        }
      >
        <ListItemAvatar>
          <Avatar>
            <FolderIcon />
          </Avatar>
        </ListItemAvatar>
        <ListItemText primary={file.name} />
      </ListItem>
    )}
  />

...
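
For reference, the shared FilePaper wrapper that both file lists render through might look roughly like this (a sketch of the idea; the actual shared component in Titanium may differ):

import { ReactNode } from 'react';
import { List, Paper } from '@mui/material';

interface FilePaperProps<T> {
  files: T[];
  renderFileItem: (file: T) => ReactNode;
}

// Shared wrapper: owns the Paper/List chrome, leaves item rendering to the caller.
function FilePaper<T>({ files, renderFileItem }: FilePaperProps<T>) {
  return (
    <Paper variant="outlined" sx={{ maxHeight: 300, overflow: 'auto' }}>
      <List dense>{files.map((file) => renderFileItem(file))}</List>
    </Paper>
  );
}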


Create the AddUrlDialog component

This is a simple popup dialog, which allows the user to add URLs. The user can add multiple URLs, which will be stored in the visionFiles state variable. The user can then chat with the completions API, asking questions about the images.

As of now, only URLs are supported, but I'm planning to add support for image uploads in the future.

You can view the AddUrlDialog file on GitHub: AddUrlDialog, or see some of the code below:


...

    <Dialog open={open} onClose={handleClose}>
      <DialogTitle sx={{ textAlign: 'center' }}>Add URL</DialogTitle>
      <DialogContent style={{ paddingBottom: 8, width: '600px' }}>
        <FormControl
          fullWidth
          margin="dense"
          error={error.name}
          variant="outlined"
        >
          <TextField
            fullWidth
            label="Name"
            variant="outlined"
            value={nameInput}
            onChange={(e) => setNameInput(e.target.value)}
            error={error.name}
            helperText={error.name ? 'Name is required' : ' '}
          />
        </FormControl>
        <FormControl
          fullWidth
          margin="dense"
          error={error.url}
          variant="outlined"
        >
          <TextField
            fullWidth
            label="URL"
            variant="outlined"
            value={urlInput}
            onChange={(e) => setUrlInput(e.target.value)}
            error={error.url}
            helperText={error.url ? 'URL is required' : ' '}
          />
        </FormControl>
      </DialogContent>
      <DialogActions>
        <Box display="flex" justifyContent="center" width="100%">
          <Button onClick={handleAddUrl} color="primary">
            Add
          </Button>
          <Button onClick={handleClose}>Cancel</Button>
        </Box>
      </DialogActions>
    </Dialog>

...
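
The handleAddUrl and handleClose handlers referenced above do little more than validate the inputs and hand the values back to VisionDialog through the onAddUrl prop. A sketch of that logic (the validation details are assumptions):

interface AddUrlValues {
  name: string;
  url: string;
}

// Flag each field that is empty; true means "in error" (matches error.name / error.url).
function validateAddUrl(values: AddUrlValues): { name: boolean; url: boolean } {
  return {
    name: values.name.trim().length === 0,
    url: values.url.trim().length === 0,
  };
}

// handleAddUrl (roughly): only call onAddUrl and close the dialog when both fields pass.
function submitAddUrl(
  values: AddUrlValues,
  onAddUrl: (file: AddUrlValues) => void,
  onClose: () => void
): boolean {
  const error = validateAddUrl(values);
  if (error.name || error.url) return false;
  onAddUrl({ name: values.name.trim(), url: values.url.trim() });
  onClose();
  return true;
}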

Step 2: Creating new API routes

In this step, we'll be creating the new API routes needed to support the new functionality. These routes handle retrieving the user's Vision settings, enabling or disabling Vision, and adding or deleting image URLs.

Retrieve the Vision settings (On load)

import { NextRequest, NextResponse } from 'next/server';
import {
  getDatabaseAndUser,
  getDb,
  handleErrorResponse,
  sendErrorResponse,
} from '@/app/lib/utils/db';

export async function GET(req: NextRequest): Promise<NextResponse> {
  try {
    const db = await getDb();
    const userEmail = req.headers.get('userEmail') as string;
    const serviceName = req.headers.get('serviceName');
    const { user } = await getDatabaseAndUser(db, userEmail);

    if (serviceName === 'vision' && user.visionId) {
      const fileCollection = db.collection<IFile>('files');
      const visionFileList = await fileCollection
        .find({ visionId: user.visionId })
        .toArray();

      return NextResponse.json(
        {
          message: 'Vision retrieved',
          visionId: user.visionId,
          visionFileList,
          isVisionEnabled: user.isVisionEnabled,
        },
        { status: 200 }
      );
    }

    return sendErrorResponse('Vision not configured for the user', 200);
  } catch (error: any) {
    return handleErrorResponse(error);
  }
}
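
On the client, the dialog can load this configuration when it opens with a simple fetch. The route path below is an assumption for the sketch, so adjust it to wherever this handler is mounted:

// Sketch: load the user's Vision settings when the dialog opens (path assumed).
async function fetchVisionSettings(userEmail: string) {
  const response = await fetch('/api/vision/retrieve', {
    method: 'GET',
    headers: { userEmail, serviceName: 'vision' },
  });
  if (!response.ok) {
    throw new Error('Failed to retrieve Vision settings');
  }
  // Resolves to { message, visionId, visionFileList, isVisionEnabled }.
  return response.json();
}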

Update the Vision settings (On load and on clicking the Update button)

import { NextRequest, NextResponse } from 'next/server';
import { getDb, getUserByEmail, sendErrorResponse } from '@/app/lib/utils/db';
import { Collection } from 'mongodb';

async function updateVision(
  user: IUser,
  usersCollection: Collection<IUser>,
  isVisionEnabled: boolean
): Promise<string> {
  // Enabling Vision disables the Assistant so that only one mode is active at a time.
  const isAssistantEnabled = isVisionEnabled ? false : user.isAssistantEnabled;
  let visionId = user.visionId;
  if (!visionId) {
    console.log('No visionId found. Creating a new one');
    visionId = crypto.randomUUID();
  }
  await usersCollection.updateOne(
    { email: user.email },
    {
      $set: {
        isAssistantEnabled: isAssistantEnabled,
        isVisionEnabled: isVisionEnabled,
        visionId: visionId,
      },
    }
  );
  return visionId;
}

export async function POST(req: NextRequest): Promise<NextResponse> {
  try {
    const db = await getDb();
    const { isVisionEnabled, userEmail } = (await req.json()) as {
      isVisionEnabled: boolean;
      userEmail: string;
    };

    const usersCollection = db.collection<IUser>('users');
    const user = await getUserByEmail(usersCollection, userEmail);

    if (!user) {
      return sendErrorResponse('User not found', 404);
    }

    const visionId = await updateVision(user, usersCollection, isVisionEnabled);

    return NextResponse.json(
      {
        message: 'Vision updated',
        // Return the (possibly newly created) visionId rather than the stale value on the user document.
        visionId: visionId,
        isVisionEnabled: isVisionEnabled,
      },
      { status: 200 }
    );
  } catch (error: any) {
    console.error('Error in vision update:', error);
    return sendErrorResponse('Error in vision update', 500);
  }
}

Add a URL (User Action)

import { NextRequest, NextResponse } from 'next/server';
import {
  getDatabaseAndUser,
  getDb,
  sendErrorResponse,
} from '@/app/lib/utils/db';

export async function POST(req: NextRequest): Promise<NextResponse> {
  try {
    const db = await getDb();

    const { file, userEmail } = await req.json();
    const { user } = await getDatabaseAndUser(db, userEmail);

    // Ensure the user has a visionId to group their Vision files under.
    let visionId;
    const usersCollection = db.collection<IUser>('users');
    if (!user.visionId) {
      console.log('No visionId found. Creating a new one');
      visionId = crypto.randomUUID();
      await usersCollection.updateOne(
        { email: user.email },
        { $set: { visionId: visionId } }
      );
    } else {
      visionId = user.visionId;
    }

    // Store the URL entry in MongoDB, tagged with the user's visionId.
    file.visionId = visionId;
    const fileCollection = db.collection<IFile>('files');
    const insertFileResponse = await fileCollection.insertOne(file);

    return NextResponse.json({
      message: 'File processed successfully',
      response: insertFileResponse,
      file: file,
      status: 200,
    });
  } catch (error) {
    console.error(error);
    return sendErrorResponse('Error processing file', 500);
  }
}

Delete a URL (User Action)

import { NextRequest, NextResponse } from 'next/server';
import {
  getDatabaseAndUser,
  getDb,
  sendErrorResponse,
} from '@/app/lib/utils/db';
export async function POST(req: NextRequest): Promise<NextResponse> {
  try {
    const db = await getDb();
    const { file, userEmail } = await req.json();
    const { user } = await getDatabaseAndUser(db, userEmail);
    if (user.visionId !== file.visionId) {
      return sendErrorResponse('User VisionId not found', 404);
    }

    const fileCollection = db.collection<IFile>('files');
    const deleteFileResponse = await fileCollection.deleteOne({
      visionId: file.visionId,
    });

    return NextResponse.json({
      status: 200,
      message: 'Url deleted successfully',
      response: deleteFileResponse,
    });
  } catch (error) {
    return sendErrorResponse('Error deleting file', 500);
  }
}
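
With the routes in place, the chat side of the app can fold the stored image URLs into a streaming completion whenever Vision is enabled. The sketch below shows the general shape of that request; the function, model name, and callback are assumptions rather than the actual Titanium chat route:

import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Stream an answer about the stored image URLs back to the chat UI (sketch).
async function streamVisionAnswer(
  question: string,
  imageUrls: string[],
  onToken: (token: string) => void
) {
  const stream = await openai.chat.completions.create({
    model: 'gpt-4-vision-preview',
    max_tokens: 500,
    stream: true,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: question },
          // Attach every stored URL as an image part of the same message.
          ...imageUrls.map((url) => ({
            type: 'image_url' as const,
            image_url: { url },
          })),
        ],
      },
    ],
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content;
    if (token) onToken(token);
  }
}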

Step 3: Honorable mentions

If you've followed the series until now, you will likely have noticed that some refactoring has gone into the application to improve the code and make it more readable and maintainable. More to come on this in a future article, but for now, I'll just throw this out there: react-hook-form is awesome!

Step 4: Testing the new functionality

Now that we have the new UI components and API routes in place, we can test the new functionality. To do this, we'll need to start the application and open the browser. Once the application is running, we can open the Vision configuration dialog by clicking on the Vision button in the chat UI component.

We can add URLs by clicking on the Add URL button, which will open the Add URL dialog. We can then enter the name and URL of the image we want to add and click on the Add button. The URL will be added to the list of URLs and the Add URL dialog will close.

We can then click on the Update button, which will save the new URLs for Vision. After that, we can chat with the Vision API, asking questions about the images. As I mentioned earlier, the results are pretty impressive, with Vision able to detect objects, faces, landmarks, and text with remarkable accuracy.

We can also delete the URLs by clicking on the delete button next to the URL. This will delete the URL from the list of URLs and the Vision API.

Finally, we can enable/disable the Vision API by clicking on the switch. This will enable/disable the Vision API and update the UI accordingly.

The easiest way to test the new functionality is to use the Titanium template, where you'll also find instructions on how to get it up and running on your local machine, or how to deploy it to Vercel.

Conclusion and Next Steps

In this article, I've covered how to integrate Vision using the latest OpenAI API, including the new UI components and API routes that support the functionality, as well as some notes on refactoring the existing code to improve readability and maintainability.

To be honest with you, I was pleasantly surprised by how easy it was to integrate Vision into the existing application. I was expecting a lot more work, but it was relatively straightforward. And the results are pretty cool! Vision is capable of detecting objects, faces, landmarks, and text with remarkable accuracy. I'm really looking forward to seeing what people will build with this new API.

Feel free to check out Titanium, which already has a lot of the basic functionality set up for building an AI Assistant. You can use it as a starting point for your own project, or just follow along and copy/paste the code snippets you need.

If you have any questions or comments, feel free to reach out to me on GitHub, LinkedIn, or via email.

See ya around and happy coding!