Programmatic VLM Evaluation (PROVE)

Caption

A chess board is seen from a close angle with all the pieces in a starting position. The chess board is all metal, dark metallic silver and black. The board itself has a triple-layer design, with each lower layer slightly larger, creating a step design around the board. The pieces themselves are made of small vehicle pieces. Most of the pieces have some sort of gear on them, and two pieces of each color have a spark plug. The side that is usually white is seen from the lower left to the bottom at an angle and is dark silver besides the white of the spark plugs. The black side is on the opposite side, towards the upper right corner, and is completely black. The board sits on a beige and cream tile floor with consistent indoor lighting.

Scene Graph

QA + verification programs

Q: Where is the dark silver side of the chessboard viewed from?
A: The dark silver side is viewed from the lower left to the bottom, at an angle.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'side' in entity and 'dark silver' in scene_graph.get_attributes(entity).get('color', ""):
                relations = scene_graph.get_outgoing_relations(entity)
                return ', '.join([f"{rel['spatial']} {loc}" for loc, rel in relations.items()])
        return None

gpt-4o: The dark silver side of the chessboard is viewed from the bottom left corner of the image. The pieces on this side are positioned closest to the viewer, while the opposing side's pieces are farther away.
hscore: 63.65%

• <spatial (view, side, of)> == <spatial (opposing side's pieces, viewer, farther away)> (70.0%)
• <state (view, from lower left to bottom)> == <state (chessboard, viewed from bottom left corner)> (60.0%)
• <state (view, at an angle)> == <state (chessboard, viewed from bottom left corner)> (50.0%)

tscore: 80.14%

• <whole (chessboard)> == <whole (chess board)> (90.0%)
• <state (chessboard, viewed from bottom left corner)> == <state (chess pieces, starting position)> (70.0%)
• <state (chessboard, dark silver side)> == <color (chess board, dark metallic silver and black)> (70.0%)
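The verification programs in this listing call methods on a `scene_graph` object whose implementation is not shown here. The sketch below is a hypothetical, minimal stand-in that supports the calls used by the first program (`get_entities`, `get_attributes`, `get_outgoing_relations`), plus `get_incoming_relations`. The class layout, the entity names, and the toy graph contents are all assumptions made for illustration; they are not the benchmark's actual scene graph.

```python
class SceneGraph:
    """Hypothetical stand-in for the scene graph object passed to each program."""

    def __init__(self, attributes, relations):
        # attributes: {entity: {attribute_name: value}}
        # relations:  {source: {target: {relation_type: value}}}
        self._attributes = attributes
        self._relations = relations

    def get_entities(self):
        return list(self._attributes)

    def get_attributes(self, entity):
        return self._attributes.get(entity, {})

    def get_outgoing_relations(self, entity):
        return self._relations.get(entity, {})

    def get_incoming_relations(self, entity):
        # Invert the outgoing map: keep every source that points at `entity`.
        return {
            source: targets[entity]
            for source, targets in self._relations.items()
            if entity in targets
        }


def execute_command(scene_graph):
    # Same logic as the first verification program in this listing.
    for entity in scene_graph.get_entities():
        if 'side' in entity and 'dark silver' in scene_graph.get_attributes(entity).get('color', ""):
            relations = scene_graph.get_outgoing_relations(entity)
            return ', '.join([f"{rel['spatial']} {loc}" for loc, rel in relations.items()])
    return None


# Toy graph with made-up entity names, loosely following the chess caption.
graph = SceneGraph(
    attributes={
        "white side": {"color": "dark silver"},
        "black side": {"color": "black"},
        "lower left": {},
    },
    relations={"white side": {"lower left": {"spatial": "viewed from"}}},
)
result = execute_command(graph)
incoming = graph.get_incoming_relations("lower left")
print(result)
```

The programs that call `describe` and `generate_subgraph` would need those two methods as well; they are omitted here because the listing gives no indication of their output format.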

Q: What texture is the floor below the chessboard?
A: The floor below the chessboard has a tile texture.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'floor' in entity:
                attributes = scene_graph.get_attributes(entity)
                return attributes.get('texture', "")
        return None

gpt-4o: The floor below the chessboard appears to be made of ceramic or stone tiles. The tiles have a smooth texture with a slightly mottled or speckled pattern, and they are arranged in a grid with visible grout lines.
hscore: 90.62%

• <whole (floor)> == <whole (floor)> (100.0%)
• <spatial (floor, chessboard, below)> == <spatial (chessboard, floor, below)> (90.0%)
• <texture (floor, tile)> == <texture (tiles, smooth)> (70.0%)

tscore: 78.78%

• <whole (floor)> == <whole (floor)> (100.0%)
• <spatial (chessboard, floor, below)> == <spatial (board, floor, on)> (80.0%)
• <state (tiles, grid)> == <image> (70.0%)
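Each answer is reported with an hscore and a tscore followed by a few matched tuple pairs, but the listing does not show how the scores are aggregated. The bulleted percentages do not always average to the reported score, so the bullets appear to be only a subset of the matches. The sketch below shows one plausible reading, offered as an assumption rather than the benchmark's definition: a recall-style score that averages, over reference tuples, the best similarity to any response tuple, and a precision-style score that averages, over response tuples, the best support found in either the caption tuples or the image. The function names and all similarity values are hypothetical.

```python
def recall_style_score(reference_to_response):
    """Mean over reference tuples of the best similarity to any response tuple.

    reference_to_response[i][j] is the similarity between reference tuple i
    and response tuple j (values in [0, 1]).
    """
    best = [max(row) for row in reference_to_response]
    return sum(best) / len(best)


def precision_style_score(response_to_caption, response_to_image):
    """Mean over response tuples of their best support.

    response_to_caption[i][j] is the similarity between response tuple i and
    caption tuple j; response_to_image[i] is a score for how well the image
    itself supports response tuple i.
    """
    best = [
        max(max(row), image_score)
        for row, image_score in zip(response_to_caption, response_to_image)
    ]
    return sum(best) / len(best)


# Made-up similarity values, for illustration only.
h = recall_style_score([[1.0, 0.4], [0.9, 0.3], [0.2, 0.7]])
t = precision_style_score([[1.0, 0.1], [0.5, 0.8]], [0.3, 0.9])
print(f"recall-style: {h:.4f}, precision-style: {t:.4f}")
```

Under this reading, a response can score high on the precision-style measure while missing the expected answer, which is consistent with examples in the listing where the two scores diverge.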

Caption

A close up shot of an uncolored pencil sketch of the Gundam RX-76, brandishing its beam rifle in its right hand and holding its shield in its left hand. The sketch is placed in a square section of a wooden shelf with a red background. A reglinleif spider mech model kit is on top of a small black box to the left of the gundam drawing. A small figure of Saber from Fate/Grand Order is positioned sitting with its feet hanging off the edge of the shelf in front of the gundam drawing. "LIGHT NOVELS" is written in white letters on a blue box. The wall behind the box is green.

Scene Graph

QA + verification programs

Q: What is sitting on the shelf with its legs dangling?
A: The small figure of Saber from Fate/Grand Order is sitting on the shelf with its legs dangling.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            attributes = scene_graph.get_attributes(entity)
            if 'Saber' in attributes.get('type', '') and 'sit' in attributes.get('state', ''):
                relations_out = scene_graph.get_outgoing_relations(entity)
                if 'shelf' in relations_out and 'on' in relations_out['shelf']['spatial']:
                    return scene_graph.describe(scene_graph.generate_subgraph([entity]))
        return None

gpt-4o: On the shelf, there is a small figurine of a character with its legs dangling over the edge. The character appears to be wearing armor or a costume, and is seated on the right side of the shelf.
hscore: 58.99%

• <spatial (Saber, shelf, on)> == <spatial (figurine, shelf, on)> (70.0%)
• <state (Saber, on shelf)> == <state (character, right side of the shelf)> (60.0%)
• <state (Saber, sitting)> == <state (character, seated)> (60.0%)

tscore: 78.22%

• <whole (shelf)> == <whole (shelf)> (100.0%)
• <spatial (figurine, shelf, on)> == <spatial (figure, shelf, on)> (90.0%)
• <whole (character)> == <image> (70.0%)

Q: Where is the reglinleif spider mech model kit located?
A: The reglinleif spider mech model kit is placed on top of a small black box to the left of the Gundam drawing.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'reglinleif spider mech model kit' in entity:
                return scene_graph.describe(scene_graph.generate_subgraph([entity]))
        return None

gpt-4o: The Reginleif spider mech model kit is located on the middle shelf, to the left of the large robot drawing and below the "Light Novels" sign.
hscore: 67.33%

• <whole (spider mech model kit)> == <whole (Reginleif spider mech model kit)> (90.0%)
• <spatial (spider mech model kit, Gundam drawing, to the left of)> == <spatial (Reginleif spider mech model kit, large robot drawing, to the left of)> (80.0%)
• <spatial (spider mech model kit, black box, on top of)> == <spatial (Reginleif spider mech model kit, middle shelf, on)> (80.0%)

tscore: 84.39%

• <whole (Reginleif spider mech model kit)> == <whole (reglinleif spider mech model kit)> (100.0%)
• <spatial (Reginleif spider mech model kit, sign, below)> == <spatial (reglinleif spider mech model kit, box, on top of)> (80.0%)
• <spatial (Reginleif spider mech model kit, large robot drawing, to the left of)> == <spatial (reglinleif spider mech model kit, gundam drawing, to the left of)> (80.0%)

Q: Describe the pencil sketch on the shelf.
A: The sketch is an uncolored pencil drawing of the Gundam RX-76, brandishing its beam rifle in its right hand and holding a shield in its left hand.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'sketch' in entity:
                return scene_graph.describe(scene_graph.generate_subgraph([entity]))
        return None

gpt-4o: The pencil sketch on the shelf depicts a detailed drawing of a humanoid robot or mecha. The robot is shown in a dynamic pose, holding a large weapon in its right hand and a shield in its left hand. The design includes intricate details such as armor plating, joints, and mechanical components, giving it a futuristic and robust appearance. The sketch is placed on a shelf with a red background, and there are two figures on either side of it: a small model of a spider-like robot on the left and a small figurine of a person in armor sitting on the right. The shelf is labeled "Light Novels" at the top.
hscore: 60.55%

• <state (drawing, sketch)> == <state (pencil sketch, detailed)> (90.0%)
• <state (drawing, uncolored pencil)> == <state (pencil sketch, intricate details)> (70.0%)
• <part (Gundam RX> == <whole (small figurine of a person in armor)> (50.0%)

tscore: 71.56%

• <whole (shelf)> == <whole (shelf)> (100.0%)
• <whole (pencil sketch)> == <whole (sketch)> (80.0%)
• <type (pencil sketch, humanoid robot or mecha)> == <image> (70.0%)

Q: What is written on the blue box?
A: "LIGHT NOVELS" is written on the blue box.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'word' in entity:
                attributes = scene_graph.get_attributes(entity)
                return attributes.get('text rendering', '')
        return None

gpt-4o: The blue box has "Light Novels" written on it.
hscore: 86.51%

• <color (box, blue)> == <color (box, blue)> (100.0%)
• <text rendering (box, "LIGHT NOVELS")> == <text rendering (box, "Light Novels")> (90.0%)
• <color (text, blue)> == <color (box, blue)> (60.0%)

tscore: 92.7%

• <whole (box)> == <whole (box)> (100.0%)
• <color (box, blue)> == <color (box, blue)> (100.0%)
• <text rendering (box, "Light Novels")> == <text rendering (word, "LIGHT NOVELS")> (70.0%)

Caption

A medium view of a "WELCOME" sign that is written vertically. The "O" is replaced with the state of Texas and its flag, while the "E" is covered by a gray cylindrical container that has a black label that reads "SPOONS" in white. On the right side of the spoon's container is a knife's container, while on the left is a fork's container. Each container is the same as the spoon one, but they have the black sign marked respectively towards what they are. The containers are sitting on top of a small wooden table. Behind the sign, there is a wall that is made up of dark brown wooden planks that run vertically but at an angle to the left. A bright white light shines brightly onto the containers and sign.

Scene Graph

QA + verification programs

Q: What is replacing the letter 'O' in the sign?
A: The letter 'O' is replaced by the state of Texas and its flag.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'O' in entity:
                incoming_relations = scene_graph.get_incoming_relations(entity)
                for source_entity in incoming_relations:
                    if 'replace' in incoming_relations[source_entity]['spatial']:
                        return scene_graph.describe(scene_graph.generate_subgraph([source_entity]))
        return None

gpt-4o: The letter 'O' in the sign is replaced by the shape of the state of Texas, which is colored with the Texas flag design.
hscore: 73.36%

• <whole (state of Texas)> == <whole (state of Texas)> (100.0%)
• <part (state of Texas's flag)> == <color (state of Texas, Texas flag design)> (70.0%)
• <substitution (letter, state of Texas)> == <replacement (letter 'O', shape of the state of Texas)> (70.0%)

tscore: 80.19%

• <whole (state of Texas)> == <whole (state of Texas)> (100.0%)
• <replacement (letter 'O', shape of the state of Texas)> == <spatial (state of Texas, O, replace)> (70.0%)
• <part (sign)> == <whole (sign)> (70.0%)

Q: What is the color of the container covering the 'E'?
A: The color of the container covering the 'E' is gray.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'E' in entity:
                incoming_relations = scene_graph.get_incoming_relations(entity)
                for source_entity in incoming_relations:
                    if 'cover' in incoming_relations[source_entity]['spatial']:
                        return scene_graph.get_attributes(source_entity).get('color', '')
        return None

gpt-4o: The container covering the 'E' is gray.
hscore: 89.25%

• <part (container covering the 'E')> == <part (container covering the 'E')> (100.0%)
• <color (container covering the 'E', gray)> == <color (container, gray)> (70.0%)

tscore: 85.42%

• <color (container, gray)> == <color (container, gray)> (90.0%)
• <part (container covering the 'E')> == <image> (70.0%)

Q: What material is the wall made of?
A: The wall is made of dark brown wood.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'planks' in entity:
                return scene_graph.get_attributes(entity).get('material', '')
        return None

gpt-4o: The wall in the image appears to be made of wood. The texture and grain patterns visible suggest that it is wooden paneling.
hscore: 80.92%

• <material (wall, wood)> == <material (wall, wood)> (100.0%)
• <color (wall, dark brown)> == <material (wall, wood)> (60.0%)

tscore: 71.77%

• <whole (wall)> == <whole (wall)> (100.0%)
• <material (wall, wood)> == <image> (70.0%)
• <state (wall, wooden paneling)> == <image> (70.0%)

Caption

A cream colored Labradoodle is sitting on top of a cushion with a multicolored kente pattern. The dog has its tongue out and is looking at the camera. The dog has light brown floppy ears. A cream colored dog toy is laying on the cushion beside the dog. The cushion is laying on a brown hard wood floor. A Bicycle wheel is behind the dog and next to a metal object. A gray and white colored wall is beside the dog and the cushion and has a shadow of the dog on it. Light is shining onto the dog, cushion, and floor.

Scene Graph

QA + verification programs

Q: What is the color and pattern of the cushion the Labradoodle is sitting on?
A: The cushion is multicolored with a kente pattern.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'cushion' in entity:
                attributes = scene_graph.get_attributes(entity)
                if 'multicolored' in attributes.get('pattern', ''):
                    return attributes.get('pattern', '')
        return None

gpt-4o: The cushion the Labradoodle is sitting on has a colorful, woven pattern. The pattern consists of interlocking rectangles in various colors, including shades of yellow, pink, green, and white.
hscore: 81.33%

• <state (cushion, multicolored)> == <state (cushion's pattern, colorful)> (90.0%)
• <whole (cushion)> == <part (cushion)> (80.0%)
• <type (cushion, kente pattern)> == <part (cushion's pattern)> (70.0%)

tscore: 76.16%

• <part (cushion)> == <whole (cushion)> (80.0%)
• <color (cushion's pattern, white)> == <image> (70.0%)
• <state (cushion's pattern, colorful)> == <pattern (cushion, multicolored kente)> (70.0%)

Q: What is the state of the dog's ears?
A: The dog's ears are floppy.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if "dog's ears" in entity:
                return scene_graph.get_attributes(entity).get('state', '')
        return None

gpt-4o: The dog's ears are floppy and hanging down on the sides of its head.
hscore: 100%

• <state (dog's ears, floppy)> == <state (dog's ears, floppy)> (100.0%)

tscore: 92.83%

• <state (dog's ears, floppy)> == <state (dog's ears, floppy)> (100.0%)
• <part (dog's ears)> == <part (dog's ears)> (100.0%)
• <spatial (dog's ears, dog's head, on the sides of)> == <spatial (dog toy, dog, beside)> (70.0%)

Caption

An outdoors top-down view of an orange toy basketball, a yellow toy volleyball, and a water football floating in that order from left to right on the surface of a pool within inches of each other in a U-shaped arch. The volleyball is the smallest, with thin indention lines creating the panels of the ball. The orange basketball medium is textured with small dots. The football has an oval-shaped logo with black text upside down that reads, "COOP" on its left side facing the volleyball. The still light blue pool water reflects the overhanging treetops and the light blue sky above. Daytime.

Scene Graph

QA + verification programs

Q: Which ball is the smallest and how can you recognize it?
A: The yellow toy volleyball is the smallest, and it has thin indentation lines creating the panels of the ball.

    def smallest_ball_description(scene_graph):
        entities = scene_graph.get_entities()
        for entity in entities:
            if 'volleyball' in entity:
                attributes = scene_graph.get_attributes(entity)
                if attributes.get('size') == 'smallest':
                    indentations = scene_graph.get_attributes("volleyball's indention lines")
                    return f"The {entity} is {attributes['size']} with features of {indentations.get('shape', 'Unknown features')}."
        return None

gpt-4o: The smallest ball in the image is the yellow ball in the middle. You can recognize it as the smallest because it has a smaller diameter compared to the orange ball on the left and the elongated shape of the ball on the right.
hscore: 68.66%

• <color (volleyball, yellow)> == <color (yellow ball, yellow)> (80.0%)
• <state (volleyball, toy)> == <state (yellow ball, middle)> (70.0%)
• <whole (volleyball)> == <whole (yellow ball)> (60.0%)

tscore: 73.29%

• <color (orange ball, orange)> == <color (basketball, orange)> (80.0%)
• <color (yellow ball, yellow)> == <color (volleyball, yellow)> (80.0%)
• <shape (elongated ball, elongated)> == <image> (70.0%)

Q: In which direction are the objects floating on the pool surface arranged?
A: The objects are floating from left to right in a U-shaped arch on the pool surface.

    def direction_of_floating_objects(scene_graph):
        volleyball_relations = scene_graph.get_outgoing_relations('volleyball')
        basketball_relations = scene_graph.get_outgoing_relations('basketball')
        direction = None
        if 'football' in volleyball_relations:
            direction = volleyball_relations['football'].get('spatial')
        if 'volleyball' in basketball_relations:
            direction = basketball_relations['volleyball'].get('spatial')
        return direction

gpt-4o: The objects floating on the pool surface are arranged in a diagonal line from the bottom left to the top right.
hscore: 66.4%

• <spatial (objects, pool surface, on)> == <spatial (objects, pool surface, on)> (100.0%)
• <direction (objects, left to right)> == <direction (diagonal line, from bottom left to top right)> (60.0%)
• <spatial (objects, U> == <spatial (objects, diagonal line, in)> (60.0%)

tscore: 65.3%

• <spatial (objects, pool surface, on)> == <image> (70.0%)
• <spatial (objects, diagonal line, in)> == <image> (70.0%)
• <direction (diagonal line, from bottom left to top right)> == <spatial (basketball, volleyball, left to right)> (40.0%)

Caption

An angled view down of a pine tree that is lying on the grass. The top of it is pointed to the right. The top of the tree is missing branches and needles. It is not full. Dried leaves are on the ground on the grass in front of it. A square metal plate is in the ground near the bottom of the tree. Behind the tree are shrubs in the dirt going across. The leaves on it are green, and they are bright green behind the tree.

Scene Graph

QA + verification programs

Q: What shape is the plate near the bottom of the tree?
A: The plate near the bottom of the tree is square.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'plate' in entity:
                return scene_graph.get_attributes(entity).get('shape', "")
        return None

gpt-4o: The plate near the bottom of the tree is rectangular in shape.
hscore: 94.79%

• <spatial (plate, tree, near the bottom of)> == <spatial (plate, tree, near the bottom)> (90.0%)
• <shape (plate, square)> == <shape (plate, rectangular)> (90.0%)

tscore: 96.6%

• <whole (plate)> == <whole (plate)> (100.0%)
• <shape (plate, rectangular)> == <shape (plate, square)> (90.0%)
• <spatial (plate, tree, near the bottom)> == <spatial (plate, tree, at the bottom of)> (90.0%)
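The matched-tuple bullets under each score follow a regular textual pattern: a left tuple, `==`, a right tuple (or the literal `<image>`), and a percentage. For readers who want to tabulate these lines, here is a small parsing sketch. The regular expressions and the function names `parse_tuple` and `parse_match_line` are hypothetical helpers written for this listing's layout, not part of any benchmark tooling, and splitting arguments on commas is a simplifying assumption.

```python
import re

# "<left> == <right> (NN.N%)" anywhere in the line, e.g. after a bullet.
MATCH_LINE = re.compile(
    r"<(?P<left>.+?)>\s*==\s*<(?P<right>.+?)>\s*\((?P<score>\d+(?:\.\d+)?)%\)\s*$"
)
# "kind (arg1, arg2, ...)" inside the angle brackets.
TUPLE = re.compile(r"(?P<kind>[^()]+?)\s*\((?P<args>.*)\)$")


def parse_tuple(text):
    """Split 'shape (plate, square)' into ('shape', ['plate', 'square'])."""
    # Some lines carry a stray double quote after the closing parenthesis.
    cleaned = text.strip().rstrip('"')
    m = TUPLE.match(cleaned)
    if m is None:
        # Bare markers such as 'image' have no argument list.
        return cleaned, []
    args = [arg.strip() for arg in m.group("args").split(",")]
    return m.group("kind"), args


def parse_match_line(line):
    """Return the two tuples and the similarity as a fraction, or None."""
    m = MATCH_LINE.search(line)
    if m is None:
        return None
    return {
        "left": parse_tuple(m.group("left")),
        "right": parse_tuple(m.group("right")),
        "score": float(m.group("score")) / 100,
    }


parsed = parse_match_line(
    "• <shape (plate, square)> == <shape (plate, rectangular)> (90.0%)"
)
image_match = parse_match_line("• <whole (flagstaff)> == <image> (70.0%)")
print(parsed)
```

Lines whose tuples are truncated in the listing (for example, a tuple with no closing parenthesis) fall through to the bare-marker branch and come back with an empty argument list, so they are easy to spot when tabulating.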

Q: What elements are found near the bottom of the pine tree?
A: A square metal plate is in the ground near the bottom of the pine tree.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'plate' in entity:
                outgoing_relations = scene_graph.get_outgoing_relations(entity)
                description = scene_graph.describe(scene_graph.generate_subgraph(['plate', 'ground']))
                return description
        return None

gpt-4o: Near the bottom of the pine tree in the image, you can see: 1. Grass and patches of bare soil. 2. Fallen leaves scattered around. 3. A rectangular stone or concrete slab.
hscore: 52.27%

• <spatial (metal plate, bottom of the pine tree, near)> == <spatial (pine tree, bottom, near)> (70.0%)
• <spatial (metal plate, pine tree, near)> == <spatial (pine tree, bottom, near)> (70.0%)
• <spatial (metal plate, ground, in)> == <spatial (rectangular stone or concrete slab, pine tree, near)> (40.0%)

tscore: 73.35%

• <whole (pine tree)> == <whole (pine tree)> (100.0%)
• <spatial (pine tree, bottom, near)> == <spatial (pine tree, grass, on)> (80.0%)
• <spatial (fallen leaves, pine tree, around)> == <spatial (pine tree, grass, on)> (80.0%)

Caption

An overhead view looking down at brick tiles in the ground. Pieces of dirt and leaves are in between the cracks of the tile. Grass stains are on some of the tiles as well. The color of the tiles are maroon and gray and are running vertically and horizontally on the ground. Black letters are in the middle of the image on a horizontal maroon tile that reads "BASS".

Scene Graph

QA + verification programs

Q: What are the predominant colors of the tiles, and how are these colors arranged?
A: The predominant colors of the tiles are maroon and gray, arranged vertically and horizontally.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'tiles' in entity:
                attributes = scene_graph.get_attributes(entity)
                return f"{attributes.get('color', '')}, {attributes.get('state', '')}"
        return None

gpt-4o: The predominant colors of the tiles are red and light gray. The tiles are arranged in an alternating pattern, with red and light gray tiles placed next to each other in a grid-like formation. The red tile in the center has the word "BASS" inscribed on it.
hscore: 80.27%

• <color (tiles, gray)> == <color (tiles, light gray)> (90.0%)
• <color (tiles, maroon)> == <color (tiles, red)> (80.0%)
• <arrangement (tiles, horizontal)> == <state (tiles, arranged in pattern)> (70.0%)

tscore: 84.27%

• <whole (tiles)> == <whole (tiles)> (100.0%)
• <color (tiles, light gray)> == <color (tiles, gray)> (90.0%)
• <text rendering (center tile, "BASS")> == <text rendering (tile, "BASS")> (80.0%)

Caption

A side night view of a Japan, Germany, Finland, Slovenia, France, Ireland flag lined up beside each other. A metal flag staff is holding the flags up by a pulley system. All flags are raised at full staff. A parking lot full of cars that is partially cut off from the left side of the image is to the left of the flags. One of the cars has its headlights on and is face toward the flags. Streetlight posts with lights on are behind the parking lot. Three small trees are around the parking lot. A metal structure is in the background. A tent is in the background as well, next to a building with purple lighting coming from the inside of it. More trees are to the right of the building. Stands of a stadium are to the right of the image and partially cut off by the camera. A red fence is at the bottom right of the image. A semicircle made up of two blue neon glowing strips is in the dark sky.

Scene Graph

QA + verification programs

Q: Which country flag is raised at the end on the flag staff among the six flags?
A: The Ireland flag, denoted as flag_6, is raised at the end among the six flags.

    def execute_command(scene_graph):
        # Iterate through entities
        for entity in scene_graph.get_entities():
            # If entity is a flag and matches the type 'Ireland'
            if 'flag_6' in entity:
                return scene_graph.get_attributes(entity).get('type', '')
        return None

gpt-4o: The flag raised at the end on the flagstaff among the six flags is the flag of Italy.
hscore: 74.86%

• <count (flags, ==6)> == <count (flags, ==6)> (100.0%)
• <part (flag)> == <part (flags)> (90.0%)
• <whole (flags)> == <part (flags)> (80.0%)

tscore: 77.92%

• <count (flags, ==6)> == <count (flags, ==6)> (100.0%)
• <part (flags)> == <whole (flags)> (80.0%)
• <whole (flagstaff)> == <image> (70.0%)

Q: What color are the strips in the semicircle observed in the sky?
A: The strips in the semicircle observed in the sky are blue.

    def execute_command(scene_graph):
        semicircle_desc = None
        for entity in scene_graph.get_entities():
            if 'semicircle' in entity:
                for attr in scene_graph.get_attributes(entity).get('shape', '').split(', '):
                    if 'strips' in attr:
                        semicircle_desc = attr
                return scene_graph.get_attributes('strips').get('color', '')
        return semicircle_desc

gpt-4o: The strips in the semicircle observed in the sky are green.
hscore: 75.84%

• <color (strips, blue)> == <color (strips, green)> (80.0%)
• <shape (semicircle in the sky, semicircle)> == <spatial (semicircle, sky, in)> (70.0%)
• <whole (semicircle in the sky)> == <whole (semicircle)> (70.0%)

tscore: 87.09%

• <whole (semicircle)> == <whole (semicircle)> (100.0%)
• <spatial (semicircle, sky, in)> == <spatial (semicircle, sky, in)> (90.0%)
• <color (strips, green)> == <color (strips, blue)> (80.0%)

Q: Which object is responsible for holding the flags up?
A: The metal flag staff is responsible for holding the flags up.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'flag staff' in entity:
                relations_out = scene_graph.get_outgoing_relations(entity)
                if 'flags' in relations_out and 'hold up' in relations_out['flags'].get('spatial', ''):
                    return '{} flag staff'.format(scene_graph.get_attributes(entity).get('material', ''))
        return None

gpt-4o: The objects responsible for holding the flags up in the image are flagpoles.
hscore: 72.07%

• <whole (flag staff)> == <whole (flagpoles)> (80.0%)
• <functional (flag staff, flag, hold up)> == <functional (flagpoles, flags, hold up)> (70.0%)
• <material (flag staff, metal)> == <functional (flagpoles, flags, hold up)> (50.0%)

tscore: 77.47%

• <whole (flagpoles)> == <whole (flag staff)> (80.0%)
• <functional (flagpoles, flags, hold up)> == <image> (70.0%)

Caption

An outdoor nighttime slightly angled up close-up front view of a plastic orange pumpkin that has a cutout rigid smile, and triangle shaped eyes and nose. There is a red light within the pumpkin that is shining on its inner surface giving the pumpkin a dramatic look. The pumpkin is placed on top of dark gray metal post, and the area surrounding the pumpkin is pitch black.

Scene Graph

QA + verification programs

Q: How many triangle-shaped features does the pumpkin have?
A: The pumpkin has three triangle-shaped features: two eyes and one nose.

    def execute_command(scene_graph):
        triangle_count = 0
        for entity in scene_graph.get_entities():
            attributes = scene_graph.get_attributes(entity)
            if 'triangle' in attributes.get('shape', ''):
                triangle_count += 1
        return f'The pumpkin has {triangle_count} triangle-shaped features: two eyes and one nose.' if triangle_count > 0 else None

gpt-4o: The pumpkin has five triangle-shaped features: two for the eyes, one for the nose, and two as part of the mouth.
hscore: 100%

• <part (pumpkin's eyes)> == <part (pumpkin's eyes)> (100.0%)
• <part (pumpkin's nose)> == <part (pumpkin's nose)> (100.0%)
• <shape (pumpkin's nose, triangle)> == <shape (pumpkin's nose, triangle)> (100.0%)

tscore: 90.73%

• <whole (pumpkin)> == <whole (pumpkin)> (100.0%)
• <part (pumpkin's nose)> == <part (pumpkin's nose)> (100.0%)
• <shape (pumpkin's eyes, triangle)> == <shape (pumpkin's eyes, triangle)> (100.0%)

Q: What material is the pumpkin made of and what is its position relative to the metal post?
A: The pumpkin is made of plastic and is placed on top of the dark gray metal post.

    def execute_command(scene_graph):
        for entity in scene_graph.get_entities():
            if 'pumpkin' in entity:
                pumpkin_material = scene_graph.get_attributes(entity).get('material', '')
                relations_to = scene_graph.get_outgoing_relations(entity)
                if 'post' in relations_to and 'on top of' in relations_to['post'].get('spatial', ''):
                    return (f'The pumpkin is made of {pumpkin_material} and is placed on top of the dark gray metal post.')
        return None

gpt-4o: The pumpkin appears to be made of a synthetic material, likely plastic or fiberglass, given its smooth surface and the way it is illuminated from within. The pumpkin is positioned on top of a metal post, which serves as its support.
hscore: 85.63%

• <spatial (pumpkin, metal post, on top of)> == <spatial (pumpkin, metal post, on top of)> (100.0%)
• <whole (metal post)> == <whole (metal post)> (100.0%)
• <material (pumpkin, plastic)> == <material (pumpkin, synthetic)> (90.0%)

tscore: 80.56%

• <whole (pumpkin)> == <whole (pumpkin)> (100.0%)
• <material (pumpkin, synthetic)> == <material (pumpkin, plastic)> (90.0%)
• <spatial (pumpkin, metal post, on top of)> == <spatial (pumpkin, post, on top of)> (80.0%)
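The verification programs on this page call only three methods on the scene graph: get_entities, get_attributes, and get_outgoing_relations. The sketch below shows how such a program could be executed end to end. The SceneGraph class here is a hypothetical stand-in, not the project's implementation: its constructor and the data shapes (entities as strings, attributes as a dict per entity, outgoing relations as a dict keyed by target entity) are assumptions inferred from how the programs above use the return values. The graph contents are hand-written from the pumpkin caption for illustration.

```python
class SceneGraph:
    """Hypothetical stand-in for the scene graph passed to execute_command."""

    def __init__(self, attributes, relations):
        # Assumed shapes, inferred from the programs on this page:
        #   attributes: {entity: {attribute_name: value}}
        #   relations:  {source: {target: {relation_type: value}}}
        self._attributes = attributes
        self._relations = relations

    def get_entities(self):
        return list(self._attributes)

    def get_attributes(self, entity):
        return self._attributes.get(entity, {})

    def get_outgoing_relations(self, entity):
        return self._relations.get(entity, {})


# The verification program for the pumpkin material/position question.
def execute_command(scene_graph):
    for entity in scene_graph.get_entities():
        if 'pumpkin' in entity:
            pumpkin_material = scene_graph.get_attributes(entity).get('material', '')
            relations_to = scene_graph.get_outgoing_relations(entity)
            if 'post' in relations_to and 'on top of' in relations_to['post'].get('spatial', ''):
                return (f'The pumpkin is made of {pumpkin_material} and is '
                        'placed on top of the dark gray metal post.')
    return None


# Illustrative graph written by hand from the caption (an assumption,
# not the actual scene graph used for this example).
graph = SceneGraph(
    attributes={
        'pumpkin': {'material': 'plastic', 'color': 'orange'},
        'post': {'material': 'metal', 'color': 'dark gray'},
    },
    relations={'pumpkin': {'post': {'spatial': 'on top of'}}},
)

answer = execute_command(graph)
print(answer)
```

With this hand-written graph the program reproduces the reference answer, and it returns None when the graph has no pumpkin entity or no matching relation, which is how a program can signal that an answer is not supported by the graph.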

Caption

A close up view of a blue lit indoor fountain. The fountain is a semicircle and is a pedestal bird bath shape. The fountain is up against an arched up small square and slanted tile wall. Over it is a plaque with a sculpted lion's head with a scrolling design beneath and to each side. Water cascades out of the bowl evenly and forms a semicircle that is highlighted in light blue, in the pool below. A raised walkway is behind the fountain. The walkway has an iron fence railing that has exes for each section between the posts. The top of the fountain wall obscures most of the railing sections. Behind the walkway in the top left corner are three dressed and spot-lighted mannequins in a storefront window. To their right on the wall is a small backlit swordfish model facing to the left. The overall color of the frame is a rich blue hue. The wall in the background goes from a lavender glow over the storefront window and wall, to a golden lit tan, to violet on the right.

Scene Graph

QA + verification programs

Q: What features distinguish the fountain in the scene?
A: The fountain is blue lit, semicircular, and has a pedestal bird bath shape. It is also noted for being indoor.

    def execute_command(scene_graph):
        fountain_attr = {}
        for entity in scene_graph.get_entities():
            if 'fountain' in entity:
                fountain_attr = scene_graph.get_attributes(entity)
                break
        return "The fountain is {}. It is also noted for being {}.".format(
            ', '.join(f"{key} {value}" for key, value in fountain_attr.items()),
            ', '.join(f"{key}" for key, value in fountain_attr.items() if key == 'state')
        )

gpt-4o: The fountain in the scene has several distinguishing features:
1. **Lion Head Spout**: The water appears to be coming out of a lion's head, which is a common decorative element in classical fountain designs.
2. **Ornate Backdrop**: The backdrop of the fountain is adorned with intricate carvings and designs, adding to its decorative appeal.
3. **Cascading Water**: Water cascades over the edge of a basin, creating a curtain-like effect as it falls into the pool below.
4. **Tile Work**: The wall behind the fountain is covered in a diamond-patterned tile, adding texture and visual interest.
5. **Lighting**: The fountain is illuminated with blue lighting, which enhances its visual impact, especially in low-light conditions.
6. **Pedestal Basin**: The water collects in a large, pedestal-style basin before overflowing, which is a classic design element in many traditional fountains.
7. **Surrounding Decor**: The fountain is set against a backdrop that includes a window display with mannequins and a dolphin wall decoration, indicating it might be located in a commercial or public space.
These features collectively give the fountain a classical and elegant appearance.
hscore: 71.95%

• <state (fountain, lit)> == <state (fountain, illuminated)> (90.0%)
• <state (fountain, indoor)> == <state (fountain, illuminated)> (80.0%)
• <color (fountain, blue)> == <color (lighting, blue)> (70.0%)

tscore: 75.12%

• <whole (fountain)> == <whole (fountain)> (100.0%)
• <spatial (water, pool, into)> == <spatial (water, pool, in)> (90.0%)
• <spatial (fountain, backdrop, against)> == <spatial (fountain, wall, up against)> (80.0%)

Q: What is situated above the fountain?
A: A plaque with a sculpted lion's head is situated above the fountain.

    def execute_command(scene_graph):
        above_desc = "A plaque "
        for entity in scene_graph.get_entities():
            if 'plaque' in entity:
                relations_out = scene_graph.get_outgoing_relations(entity)
                if "lion's head" in relations_out:
                    above_desc += "with a sculpted lion's head"
                    if 'over' in relations_out["lion's head"].get('spatial', '').split(','):
                        above_desc += " is situated above the fountain."
                        break
        return above_desc

gpt-4o: Above the fountain, there is a decorative element featuring a sculpted face with water flowing from its mouth. Additionally, there is a display window with mannequins dressed in clothing, and a wall-mounted light fixture. To the right, there is a wall decoration in the shape of a dolphin.
hscore: 67.98%

• <spatial (plaque, fountain, above)> == <spatial (wall decoration, fountain, to the right of)> (70.0%)
• <part (plaque's lion's head)> == <part (decorative element)> (60.0%)
• <whole (plaque)> == <whole (wall decoration)> (60.0%)

tscore: 71.8%

• <whole (fountain)> == <whole (fountain)> (100.0%)
• <whole (wall)> == <whole (wall)> (90.0%)
• <spatial (wall decoration, fountain, to the right of)> == <spatial (fountain, wall, up against)> (80.0%)

About
