Evaluation
Evaluation is important because it lets us compare our approach against a well-known baseline and examine how feasible our product is for airport navigation. For our evaluation, we compared four different models: BLIP, Llama3, GPT-4o, and GPT-4o mini.
We first evaluated these models using automated metrics: ROUGE, BERTScore, and runtime. For our baseline, we chose BLIP, a well-known image captioning model. As expected, our baseline scored the lowest of our chosen models, given that it does not accept an input prompt or support prompt tuning. Next, we tried Llama3 because it is open and free. Its output quality improved over the baseline, but its runtime was actually slower. Finally, for the GPT-4o models (mini and regular), we found that their performance, speed, and cost were very similar, and that they both outperformed our baseline and Llama3.
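Below is a minimal sketch of how this automated comparison could be scripted with the rouge_score and bert_score packages. The generate_description wrapper and the image/reference lists are hypothetical placeholders rather than our actual pipeline; only the metrics themselves (ROUGE, BERTScore, runtime) correspond to what we report above.

```python
# Minimal sketch of the automated evaluation loop. The model wrapper
# (generate_description) and the image/reference data are placeholders.
import time
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate_model(generate_description, images, references):
    """Run one model over the image set and report ROUGE-L, BERTScore F1, and runtime."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    outputs, rouge_l = [], []

    start = time.perf_counter()
    for image, reference in zip(images, references):
        description = generate_description(image)  # model-specific call (BLIP, Llama3, GPT-4o, ...)
        outputs.append(description)
        rouge_l.append(scorer.score(reference, description)["rougeL"].fmeasure)
    runtime = time.perf_counter() - start

    # BERTScore is computed in one batch over all generated descriptions.
    _, _, f1 = bert_score(outputs, references, lang="en")

    return {
        "rougeL": sum(rouge_l) / len(rouge_l),
        "bertscore_f1": f1.mean().item(),
        "runtime_sec": runtime,
    }
```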
While quantitative evaluations are helpful, we believe that qualitative evaluations are just as important, if not more so, for ensuring that the outputs are actually useful for navigating an airport. To evaluate this quality, we manually reviewed the outputs of each of our models on a random subset of images for conciseness, informativeness, and accuracy. Given the similarities observed between the two GPT-4o models, we annotated only GPT-4o, not GPT-4o mini, to save time.
For this human evaluation, we created a scoring rubric for each metric and had multiple annotators evaluate the outputs of each model to ensure sufficient inter-annotator agreement. Interestingly, we found that GPT-4o scored highly on informativeness and accuracy, but on conciseness it failed to follow our prompting instructions to keep responses under three sentences. Even so, it was much more accurate than Llama3 and BLIP.
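We do not specify the agreement statistic here, but the sketch below shows one common way to quantify it, using Cohen's kappa from scikit-learn. The rubric scale and the annotator ratings are made-up values for illustration only.

```python
# Illustrative check of inter-annotator agreement on the rubric scores.
# The 1-5 scale and both annotators' ratings are hypothetical examples.
from sklearn.metrics import cohen_kappa_score

# Conciseness scores from two annotators for the same ten GPT-4o outputs (hypothetical).
annotator_a = [3, 4, 4, 5, 2, 3, 4, 4, 5, 3]
annotator_b = [3, 4, 5, 5, 2, 3, 4, 3, 5, 3]

# Weighted kappa credits near-misses on an ordinal scale (e.g., a 4 vs. a 5).
kappa = cohen_kappa_score(annotator_a, annotator_b, weights="quadratic")
print(f"Cohen's kappa (quadratic weights): {kappa:.2f}")
```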
Next, we checked the random subset of model outputs to see whether each model successfully identified key vision accessibility features when they were present in a given image, as we instructed in our prompts. For example, if a tactile yellow guide strip was present in the image, our models should have mentioned it in their descriptions.
In one example, we found that all four models mentioned the hand railing. However, BLIP failed to mention the tactile guide strip and other important information, including the moving walkway and signage.
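In practice we performed this check manually, but a lightweight scripted version could look like the sketch below; the keyword lists for each accessibility feature are hypothetical and the matching is deliberately naive.

```python
# Sketch of a keyword-based feature-presence check. The feature keyword lists
# are hypothetical; the manual review described above is what we actually did.
ACCESSIBILITY_FEATURES = {
    "tactile guide strip": ["tactile", "guide strip", "yellow strip"],
    "handrail": ["handrail", "hand railing", "railing"],
    "moving walkway": ["moving walkway", "travelator"],
    "signage": ["sign", "signage"],
}

def features_mentioned(description: str, expected_features: list[str]) -> dict[str, bool]:
    """For each feature known to be in the image, check whether the description mentions it."""
    text = description.lower()
    return {
        feature: any(keyword in text for keyword in ACCESSIBILITY_FEATURES[feature])
        for feature in expected_features
    }

# Example: an image annotated as containing a handrail and a tactile guide strip.
print(features_mentioned(
    "A hand railing runs along the right side of the corridor.",
    ["handrail", "tactile guide strip"],
))  # -> {'handrail': True, 'tactile guide strip': False}
```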
Next, we reviewed another important qualitative aspect of our models: whether they can detect when a photo contains insufficient information for navigation. For example, given an image of a ceiling, we expect the models to ask for more information, as we instructed them to do in our prompts. In this case, we found that all three models other than BLIP mentioned the lack of relevant navigation information in the given image. However, Llama3 did not request another image. In the same description, Llama3 suggested asking airport staff for assistance. While this is not bad advice, it is interesting that the model recommended it, given that it goes against our mission of increasing the independence of visually impaired travelers.
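For reference, the sketch below shows the kind of prompt instruction we mean, passed to GPT-4o through the OpenAI Python client. The exact wording and the image-handling code are illustrative rather than our production prompt; only the instructions to stay under three sentences, highlight accessibility features, and ask for another photo reflect what our prompts ask for.

```python
# Sketch of the kind of prompt instruction described above, sent to GPT-4o via
# the OpenAI Python client. The wording and helper are hypothetical examples.
import base64
from openai import OpenAI

client = OpenAI()

def describe_for_navigation(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Describe this airport scene for a visually impaired traveler in at most "
                    "three sentences, highlighting accessibility features such as tactile guide "
                    "strips, handrails, and signage. If the image does not contain enough "
                    "information for navigation, say so and ask the user to take another photo."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```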