Computer Science Department
School of Computer Science, Carnegie Mellon University


Large Scale Scene Matching
for Graphics and Vision

James Hays

July 2009

Ph.D. Thesis


Keywords: Scene matching, image completion, image geolocation, scene classification, object detection, context, lazy learning, SVM, graphics, vision, internet data, computationl photography

Our visual experience is extraordinarily varied and complex. The diversity of the visual world makes it difficult for computer vision to understand images and for computer graphics to synthesize visual content. But for all its richness, it turns out that the space of "scences" might not be astronomically large. With access to imagery on an Internet scale, regularities start to emerge – for most images, there exist numerous examples of semantically and structurally similar scenes. Is it possible to sample the space of scenes so densely that one can use similar scenes to "brute force" otherwise difficult image understanding and manipulation tasks? This thesis is focused on exploiting and refining large scale scene matching to short circuit the typical computer vision and graphics pipelines for image understanding and manipulation.

First, in "Scene Completion" we patch up holes in images by copying content from matching scenes. We find scenes so similar that the manipulations are undetectable to naive viewers and we quantify our success rate with a perceptual study. Second, in "im2gps" we estimate geographic properties and global geolocation for photos using scene matching with a database of 6 million geo-tagged Internet images. We introduce a range of features for scene matching and use them, together with lazy SVM learning, to dramatically improve scene matching – doubling the performance of single image geolocation over our baseline method. Third, we study human photo geolocation to gain insights into the geolocation problem, our algorithms, and human scene understanding. This study shows that our algorithms significantly exceed human geolocation performance. Finally, we use our geography estimates, as well as Internet text annotations, to provide context for deeper image understanding, such as object detection.

166 pages

Return to: SCS Technical Report Collection
School of Computer Science

This page maintained by