This paper reports our explorations on learning Sensory Media Association through Reciprocating Training (SMART). The proposed learning system contains two deep autoencoders, one for learning speech representations and another for learning image representations. Two deep networks are trained to bridge the latent spaces of two autoencoders, yielding representation mappings for both speech-to-image and image-to-speech. To improve feature clustering in both latent spaces, the system alternately uses one modality to guide the learning of another modality. Different from traditional technology that uses a fixed modality for supervision (e.g. using text labels for image classification), the proposed approach facilitates a machine to learn from sensory data of two or more modalities through alternating guidance among these modalities. We evaluate the proposed model with MNIST digit images and corresponding digit speeches in the Google Command Digit Dataset (GCDD). We also evaluate the model with a dataset based on COIL-100 and corresponding Watson synthesized speech. The results demonstrate the model's promising viability for sensory media association.