Video summarization and video captioning are considered two separate tasks in existing studies. For longer videos, automatically identifying the important parts of video content and annotating them with captions will enable a richer and more concise condensation of the video. We propose a general neural network architecture that jointly considers two supervisory signals (i.e., an image-based video summary and text-based video captions) in the training phase and generates both a video summary and corresponding captions for a given video in the test phase. Our main idea is that the summary signals can help a video captioning model learn to focus on important frames. On the other hand, caption signals can help a video summarization model to learn better semantic representations. Jointly modeling both the video summarization and the video captioning tasks offers a novel end-to-end solution that generates a captioned video summary enabling users to index and navigate through the highlights in a video. Moreover, our experiments show the joint model can achieve better performance than state-of- the-art approaches in both individual tasks.