Optical Character Recognition, or OCR, is a class of technology that can recognize text in digital images. For example, reading the name and address off of a drivers license, or extracting data from a scanned W-2 fall under the umbrella of OCR. By combining information about the text in an image, along with the placement of that text, we can start to gain a basic level of understanding of the image.
OCR implementation is one of those problems that surfaces from time to time with our clients and something we have solved numerous times in the past decade. For this reason we've had lots of opportunity to watch the technology grow and accelerate. Today's Large Language Models have started to show remarkable abilities in visual understanding of images, and this post is intended to help C# developers leverage those abilities in their code.
There are many OCR products available in the developer market. These tools include python tools like PyTesseract and hosted services like Azure Document Intelligence. While libraries work well for simple use cases like simply extracting text from an image, you need more powerful models to extract true understanding from an image.
To work with a service like Document Intelligence, you need to provide a number of sample images and tag out the kinds of data you want to extract. You quickly run into limitations if you need to extract the same kind of information from documents in many different formats. Resume's are a great example of this. While the data that we put into a resume is fairly standard, many people take creative license with the formatting of their resume so actually finding the "Education" section can be tricky.
The OpenAI Vision Service provides a powerful API for developers to extract information from images. It can function essentially as an AI-enabled OCR platform, without the need for any time spent training a model. Further, the image understanding model is able to recognize data themes in objects rather than having a reliance on format and layout. This creates a one-stop-shop for many OCR needs.
The main limitation of the current implementation of the Vision Service is that you cannot use tooling (i.e. function calling) nor specify that you want the API to return a JSON object. This can make it difficult to extract structured data from an image that can be used in other parts of an application.
The OpenAI GPT function calling model allows you to send the API function definitions which consist of a function name and a description of the parameters to those functions. This description essentially follows the standards of [JSON Schema](https://json-schema.org/). As web-focused C# developers, we are well versed in serializing/deserializing C# classes using JSON.
We started by creating a method that could take a C# class and output a basic JSON Schema definition for it. We quickly realized that the OpenAI API needed a bit more information to make the schema truly useful. Things like constrained values needed to be communicated and there were several cases where the property name alone was not sufficient to tell OpenAI how to populate our models. To solve this we introduced a few custom attributes that allow us to tell the model exactly how to treat our data in the result.
Once we were happy with the way this method worked for function calling, we tried applying it to the Vision Service as well by simply including the result object definition in our prompt to the model . It worked! While there were some formatting challenges, on the whole we have been able to get OpenAI to respond to our image queries with reliable JSON that we can then pass on to other parts of our application.
Here is a simple example of the img-to-json library in use. Imagine you have a pile of resume images that you need to sort through for an "Administrative Assistant" job.
We create a C# class representing a resume and the data we want to extract from it. Note our instructions to the OpenAI API specifically around handling the relevance and highest level of education.
Running that image through the img-to-json library, we extract the following information:
If you're a C# developer interested in learning more about this method, the `img-to-json` library is available freely. You can check it out in our GitHub repository.
If you're a medium to large business, especially in the hospitality, healthcare or government sectors and you're looking to add visual understanding capabilities to your tools and platforms be sure to get in touch with us and book a free consultation today!