Projector-camera systems can turn any surface such as tabletops and walls into an interactive display. A basic problem is to recognize the gesture actions on the projected UI widgets. Previous approaches using finger template matching or occlusion patterns have issues with environmental lighting conditions, artifacts and noise in the video images of a projection, and inaccuracies of depth cameras. In this work, we propose a new recognizer that employs a deep neural net with an RGB-D camera; specifically, we use a CNN (Convolutional Neural Network) with optical flow computed from the color and depth channels. We evaluated our method on a new dataset of RGB-D videos of 12 users interacting with buttons projected on a tabletop surface.