Overview of the proposed method. In the example, the object “tennis racket” is unseen during training. We first leverage the object detection model to build the key-value object memory. For the unseen object “tennis racket”, we find its most similar candidate among the seen objects by computing the visual feature distance. The most similar object in this case is “baseball bat”, which is used to build the memory. The Switchable LSTM takes both the global image feature and the object memory as input. When predicting the second word (“person”), the indicator inside the cell switches on the Retrieving mode, so the model uses the hidden state as a query to address the object memory. Once sentence generation is finished, we replace each proxy visual word with the accurate label name provided by the external detection model.
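The two retrieval steps in the figure, matching an unseen object to its nearest seen object by visual feature distance, and querying the key-value memory with the LSTM hidden state, can be sketched roughly as below. All feature dimensions, object names, and function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical visual features for seen objects (dimensions are illustrative).
seen_features = {
    "baseball bat": rng.standard_normal(8),
    "dog": rng.standard_normal(8),
    "pizza": rng.standard_normal(8),
}

def most_similar_seen(unseen_feat, seen):
    """Find the seen object whose visual feature is closest (cosine similarity)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(seen, key=lambda name: cos(unseen_feat, seen[name]))

# Key-value object memory: keys are visual features, values are label names.
keys = np.stack(list(seen_features.values()))   # shape (num_objects, 8)
labels = list(seen_features.keys())

def retrieve(hidden_state, keys, labels):
    """Soft dot-product addressing of the memory with the hidden state as query."""
    scores = keys @ hidden_state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # attention weights over entries
    return labels[int(np.argmax(weights))], weights

# An unseen object ("tennis racket") whose feature happens to lie near "baseball bat".
unseen_feat = seen_features["baseball bat"] + 0.1 * rng.standard_normal(8)
print(most_similar_seen(unseen_feat, seen_features))
```

The final caption step, swapping each generated proxy word for the detector's label, would then amount to a simple string substitution on the output sentence.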