Photo editing can be a challenging task, and it becomes even more difficult on the small, portable screens of mobile devices that are now frequently used to capture and edit images. To address this problem we present PixelTone, a multimodal photo editing interface that combines speech and direct manipulation. We observe existing image editing practices and derive a set of principles that guide our design. In particular, we use natural language for expressing desired changes to an image, and sketching to localize these changes to specific regions. To support the language commonly used in photoediting we develop a customized natural language interpreter that maps user phrases to specific image processing operations. Finally, we perform a user study that evaluates and demonstrates the effectiveness of our interface.