Method overview. The method consists of two main components. First, a discriminative variational autoencoder (VAE) is proposed to distinguish plausible shapes from implausible ones by learning a separate distribution for each in the latent feature space. Second, given an RGB-D image containing an object, an RGB-D transformer aggregates the multi-modal features and estimates the symmetry together with the symmetry-induced object proposal. Training optimizes the predicted symmetry so that the object proposal matches the plausible shapes as closely as possible. This is achieved with a plausibility loss and a visibility loss, while the parameters of the pre-trained encoder of the discriminative VAE are kept fixed.
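To make the training setup concrete, the following is a minimal sketch of one training step under the assumptions stated above: only the RGB-D transformer is updated, the pre-trained discriminative VAE encoder is frozen, and the total objective combines a plausibility term and a visibility term. All module and loss names (e.g., rgbd_transformer, vae_encoder, plausibility_loss, visibility_loss) are hypothetical placeholders, not the paper's actual API.

```python
# Hypothetical PyTorch-style sketch of one training step; names are placeholders.
import torch

def train_step(rgbd_transformer, vae_encoder, optimizer,
               rgb, depth, visible_mask,
               plausibility_loss, visibility_loss, w_p=1.0, w_v=1.0):
    # Keep the pre-trained discriminative VAE encoder fixed.
    for p in vae_encoder.parameters():
        p.requires_grad_(False)

    # The transformer fuses RGB and depth features and predicts the symmetry
    # together with the symmetry-induced object proposal.
    symmetry, proposal = rgbd_transformer(rgb, depth)

    # Plausibility loss: the frozen encoder embeds the proposal, and the loss
    # pushes the embedding toward the learned "plausible" latent distribution.
    z = vae_encoder(proposal)
    loss_p = plausibility_loss(z)

    # Visibility loss: the proposal should be consistent with the observed
    # (visible) geometry from the depth channel.
    loss_v = visibility_loss(proposal, depth, visible_mask)

    loss = w_p * loss_p + w_v * loss_v
    optimizer.zero_grad()
    loss.backward()   # gradients flow into the transformer only
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer is assumed to be constructed over the transformer's parameters only, so freezing the encoder both saves computation and ensures the plausibility criterion learned by the discriminative VAE is not altered during symmetry training.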