Qualitative results of our SNOC model on the test set of the held-out MSCOCO dataset. Words in orange are the novel objects, i.e., words not present during training. Words in purple are generated in the Retrieving mode, while all other words in the sentences are generated in the Generating mode. “Detected” shows the object detection results from the external detection model. “Proxy visual word” is the word used as a proxy for the unseen word when building the object memory (Eqn. (5)). “Ours” contains two sentences: the first is the sentence generated by our Switchable LSTM with proxy visual words, and the second is the revised sentence obtained by replacing the proxy visual words with the detected labels. “GT” denotes the human-annotated ground-truth sentences.