This webpage visualizes a subset of the Programmatic VLM Evaluation (PROVE) dataset. For each image-caption pair from the DOCCI test set, we construct and display a programmatically queryable scene graph representation. We also display the LLM-generated visual question-answer pairs, together with the verification programs that we run against the scene graph to validate the correctness of each generated answer.
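To make the idea of a "programmatically queryable" scene graph and a verification program concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SceneGraph` class, its method names, the example entities, and the `verify_answer` function are hypothetical and are not the PROVE implementation or API.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical): entities with attributes, plus relations."""
    attributes: dict = field(default_factory=dict)  # entity -> set of attributes
    relations: set = field(default_factory=set)     # (subject, predicate, object)

    def add_entity(self, name, *attrs):
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subject, predicate, obj):
        self.relations.add((subject, predicate, obj))

    def get_attributes(self, name):
        # Sorted list of attributes recorded for an entity (empty if unknown).
        return sorted(self.attributes.get(name, set()))

    def get_relations(self, subject, obj):
        # Sorted list of predicates linking subject to obj.
        return sorted(p for s, p, o in self.relations if s == subject and o == obj)


def build_example_graph():
    # Invented example content, standing in for a graph built from a caption.
    g = SceneGraph()
    g.add_entity("dog", "brown", "small")
    g.add_entity("couch", "gray")
    g.add_relation("dog", "lying on", "couch")
    return g


def verify_answer(graph, answer):
    """Hypothetical verification program for the question 'What color is the dog?'.

    It queries the graph for the dog's color and checks that the answer
    text mentions that color.
    """
    colors = {"brown", "gray", "black", "white"}
    found = [a for a in graph.get_attributes("dog") if a in colors]
    return len(found) == 1 and found[0] in answer.lower()
```

In this sketch, a question-answer pair is kept only if its verification program returns `True` when executed against the scene graph, for example `verify_answer(build_example_graph(), "The dog is brown.")`.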
In addition, we display evaluation results for free-form responses from various VLMs, which can be selected from the drop-down. For each response, we show its helpfulness (hscore) and truthfulness (tscore), along with the corresponding sets of tuple matches and their match scores.
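As a rough illustration of how tuple matches and match scores could be aggregated into two per-response scores, consider the following Python sketch. It rests on stated assumptions rather than the actual PROVE definitions: tuples are plain strings, `match_score` is a placeholder exact-match similarity, helpfulness is treated as how well the reference answer tuples are covered by the response, and truthfulness as how well the response tuples are supported by the scene graph. All function names are hypothetical.

```python
def match_score(a, b):
    """Placeholder similarity: 1.0 for case-insensitive equality, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def best_matches(queries, candidates):
    """For each query tuple, return (query, best candidate or None, score)."""
    results = []
    for q in queries:
        best, score = None, 0.0
        for c in candidates:
            s = match_score(q, c)
            if s > score:
                best, score = c, s
        results.append((q, best, score))
    return results


def mean_score(matches):
    # Average match score; 0.0 when there is nothing to match.
    return sum(s for _, _, s in matches) / len(matches) if matches else 0.0


def score_response(response_tuples, answer_tuples, scene_tuples):
    """Return (hscore, tscore, helpfulness matches, truthfulness matches).

    Assumed aggregation, for illustration only:
      hscore: average best match of each answer tuple against the response.
      tscore: average best match of each response tuple against the scene graph.
    """
    h_matches = best_matches(answer_tuples, response_tuples)
    t_matches = best_matches(response_tuples, scene_tuples)
    return mean_score(h_matches), mean_score(t_matches), h_matches, t_matches
```

Under these assumptions, a response that covers the reference answer but adds an unsupported claim would keep a high hscore while its tscore drops, and the returned match lists show which tuples were responsible.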