All Your Questions Answered — CVPR Day 1

Caesar’s Palace — Location of CVPR 2016

This is the first post in what will (hopefully) be a multi-part series from this year’s Computer Vision and Pattern Recognition (CVPR 2016) conference. Over the course of the series, I plan to talk about trends that I’m seeing, discuss how some of this academic work might be commercially interesting, and point out what I consider to be the best papers/presentations/posters from each day.

  • Edit: Recaps of other days of CVPR can be found at Day 2, Day 3, and Day 4.

Before I get started, I want to say a few words about the conference itself. When I signed up for CVPR this year, I was disappointed to see that it was in Las Vegas; June in Vegas is not much fun (it was 99 degrees outside at 11PM last night). Why couldn’t they have picked a more awesome location, like ICLR (San Juan) or NIPS (Barcelona!)? But when I got to the conference today, I finally understood: there are a ridiculous number of people here.

People at this morning’s opening presentations

Here are some stats:

  • CVPR 1983 → 100 attendees
  • CVPR 2016 → 3,600 attendees and counting!
  • 2,145 papers were submitted, 643 were accepted
  • Conference materials (papers and supplemental data) — 2.5GB

There just aren’t many conference venues that can host this many people, so I guess I can forgive the location choice.

I’d also like to give the organizing committee a shout-out for making it a priority to get women to participate in CVPR. Over 50% of the organizing committee was female. And they’ve done a bunch of smart things, such as arranging on-site childcare for participants. Childcare is a great idea, and I hope other conference organizers pick it up.

Things I noticed today:

  • Multi-modal Deep Learning is the new frontier. There was a tutorial on this topic yesterday, and it was impressive how many of the best papers used multi-modal techniques to show better generalization performance.
  • Image Captioning and Question Answering have come a long way in a short time. The first paper that caught my attention in this space was Show and Tell by Vinyals et al., published less than two years ago (a toy sketch of its encoder-decoder recipe follows this list). The work presented today was significantly more robust (see the Best Talks section below).
  • Annotating the Microsoft COCO data set is the new norm. Rather than creating newer and larger image data sets, researchers are instead building on top of COCO. Examples include Visual Question Answering and Google Referring Expressions.
  • Not enough progress is being made in Video Analysis. Given the forward progress that has been made in other computer vision tasks, I would expect Video Analysis performance to be mind-blowing. It’s not. I get the sense that researchers avoid the topic because it is computationally expensive and there aren’t as many awesome data sets.
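Since Show and Tell keeps coming up, here is a minimal sketch of the encoder-decoder recipe behind it, as I understand the paper: a CNN summarizes the image as a feature vector, and a recurrent decoder emits the caption one word at a time. Everything below is a stand-in (random “CNN” features, a tiny vocabulary, an untrained recurrence) rather than the authors’ actual model.

```python
# Toy encoder-decoder captioner in the spirit of Show and Tell.
# Stand-ins only: random "CNN" features, a 6-word vocabulary, and an
# untrained tanh recurrence instead of a trained LSTM.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "a", "dog", "on", "grass", "<end>"]
d = 8                                           # feature/hidden size

cnn_features = rng.standard_normal(d)           # pretend CNN(image)
W_h = rng.standard_normal((d, d)) / np.sqrt(d)  # "recurrent" weights
W_out = rng.standard_normal((len(vocab), d))    # hidden -> vocab logits

def caption(features, max_len=10):
    """Greedy decode: start from the image features, emit words until <end>."""
    h, words = features, []
    for _ in range(max_len):
        h = np.tanh(W_h @ h)                    # one recurrent step
        word = vocab[int((W_out @ h).argmax())]
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(caption(cnn_features))  # gibberish, but it shows the data flow
```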

Best Talks:

  • Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. Lisa Anne Hendricks did a fantastic job explaining the problem (the fact that image captioning is limited by the availability of images with paired captions) with pictures of bears and an anteater :). Given that the conference is focused on Computer Vision, you would think that more speakers would use images to help explain what they actually did. Anyway, great paper and great talk. It was a strong start to the conference.
  • Neural Module Networks. Just when I thought that I’d pull my hair out if I had to hear about yet another approach to image question answering, Jacob Andreas pulled me back from the abyss (it probably helped that he made a joke about being the third person in a row to talk about the topic). I really liked the approach (decomposing questions into substructures and training network modules for these subtasks; see the toy sketch after this list) and the way Jacob chose to visualize it (using Lego blocks). I haven’t read the paper yet, but it was a great presentation.
  • You Only Look Once: Unified, Real-Time Object Detection. With a name like YOLO, Joe Redmon started off on the right foot. Anyone brave enough to do a live demo during a 12-minute talk deserves our attention. YOLO mistakenly detecting a piece of shiny white molding behind Joe as a “Toilet” let everyone know it was a real demo, and it actually added to the impressiveness (a rough sketch of how YOLO’s single-pass detections get decoded follows this list).
Gargantua and Black Hole — Image courtesy of interstellarfilm.wikia.com
  • Computational Imaging for VLBI Image Reconstruction. Any presentation that begins with imagery from a Christopher Nolan movie is off to a great start. And while Katherine Bouman didn’t use any pictures from The Dark Knight or Inception, I suppose that it made more sense for her to have a picture of the black hole in Interstellar. I don’t pretend to fully understand her work, but the gist is this: in order to do research on a variety of celestial phenomena, better telescopes are necessary. And one way to get an Earth-sized telescope (which would be required to image the black hole that was discussed in the talk) is to computationally combine telescopes all around the globe (a back-of-the-envelope sense of why follows this list). This presents a number of challenges that Computer Vision researchers may be able to help with. Ms. Bouman is a reminder to other speakers that being passionate about your topic is vital to being engaging. I imagine that most of the audience was about as knowledgeable as I am about black holes. But if the poster session after the main presentations is any indication, that didn’t stop them from seeking her out to learn more about the problem.
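Here is the toy sketch of the Neural Module Networks idea promised above. This is my reading of the talk, not the authors’ code: the question gets parsed into a layout of subtasks, and small modules are assembled to execute it. In the paper the modules are trained networks and the layout comes from a learned parser; below, both are hard-coded stand-ins.

```python
# Toy sketch of Neural Module Networks: assemble small "modules"
# according to a question's structure. For "How many dogs are there?"
# the layout would be count(find[dog]). Both the modules and the
# layout are hard-coded stand-ins here, not the paper's trained parts.
import numpy as np

H, W = 4, 4  # spatial grid of (fake) image features

def find(concept, image):
    """find[concept]: return a soft attention map over image regions."""
    rng = np.random.default_rng(abs(hash(concept)) % 2**32)
    scores = rng.random((H, W)) * image         # pretend concept detector
    return scores / scores.sum()

def count(attention):
    """count[]: reduce an attention map to a scalar answer."""
    return int((attention > attention.mean()).sum())

image = np.ones((H, W))                         # dummy image features
layout = lambda img: count(find("dog", img))    # count(find[dog])
print("toy answer:", layout(image))
```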
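And the promised sketch of the YOLO decoding step. As I understood the talk, a single forward pass produces an S × S grid where each cell predicts a box, an objectness confidence, and class scores; detection is just reading that tensor back out. The shapes and threshold below are illustrative, not the paper’s exact values.

```python
# Rough sketch of YOLO-style decoding: one network pass yields an
# S x S grid where every cell predicts a box (x, y, w, h), an
# objectness confidence, and class scores. Random tensor stands in
# for the real network output.
import numpy as np

S, C = 7, 20                          # grid size and number of classes
preds = np.random.rand(S, S, 5 + C)   # stand-in for the network output

def decode(preds, conf_thresh=0.8):
    detections = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, conf = preds[row, col, :5]
            class_scores = preds[row, col, 5:]
            cls = int(class_scores.argmax())
            if conf * class_scores[cls] > conf_thresh:
                # (x, y) are offsets within the cell; map to image coords
                cx, cy = (col + x) / S, (row + y) / S
                detections.append((cx, cy, w, h, cls, conf))
    return detections

print(f"{len(decode(preds))} boxes above threshold")
```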
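Finally, the back-of-the-envelope reason an Earth-sized telescope comes up at all. A telescope’s angular resolution is diffraction-limited to roughly λ/D. Plugging in numbers typically quoted for this problem (a target roughly 50 microarcseconds across, observed at λ ≈ 1.3 mm; my numbers, not figures from the talk):

$$
\theta \approx \frac{\lambda}{D}
\;\Longrightarrow\;
D \approx \frac{\lambda}{\theta}
= \frac{1.3 \times 10^{-3}\,\mathrm{m}}{2.4 \times 10^{-10}\,\mathrm{rad}}
\approx 5{,}400\,\mathrm{km},
$$

which is a sizable fraction of the Earth’s ~12,700 km diameter. Hence the appeal of computationally synthesizing one giant aperture from telescopes scattered across the globe.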

That’s all I’ve got for today. Check in again tomorrow for my Day 2 updates from CVPR.