3D Scene Generation

CVPR 2019 Workshop, Long Beach, CA

Sunday June 16 2019, 8:45am -- 5:40pm, location TBA

Image credit: [1, 2, 7, 12, 6, 4, 5]


People spend a large percentage of their lives indoors (in bedrooms, living rooms, offices, kitchens, and other such spaces), and the demand for virtual versions of these real-world spaces has never been higher. Game developers, VR/AR designers, architects, and interior design firms are all increasingly making use of virtual 3D scenes for prototyping and final products. Furthermore, AI/vision/robotics researchers are also turning to virtual environments to train data-hungry models for tasks such as visual navigation, 3D reconstruction, activity recognition, and more.

As the vision community turns from passive internet-images-based vision tasks to applications such as the ones listed above, the need for virtual 3D environments becomes critical. The community has recently benefited from large-scale datasets of both synthetic 3D environments [13] and reconstructions of real spaces [8, 9, 14, 16], and from the development of 3D simulation frameworks for studying embodied agents [3, 10, 11, 15]. While these existing datasets are a valuable resource, they are also finite in size and don't adapt to the needs of different vision tasks. To enable large-scale embodied visual learning in 3D environments, we must go beyond such static datasets and instead pursue the automatic synthesis of novel, task-relevant virtual environments.

In this workshop, we aim to bring together researchers working on automatic generation of 3D environments for computer vision research with researchers who are making use of 3D environment data for a variety of computer vision tasks. We define "generation of 3D environments" to include methods that generate 3D scenes from sensory inputs (e.g. images) or from high-level specifications (e.g. "a chic apartment for two people"). Vision tasks that consume such data include automatic scene classification and segmentation, 3D reconstruction, human activity recognition, robotic visual navigation, and more.

Call for Papers

Call for papers: We invite extended abstracts for work on tasks related to 3D scene generation or tasks leveraging generated 3D scenes. Paper topics may include but are not limited to:

  • Generative models for 3D scene synthesis
  • Synthesis of 3D scenes from sensor inputs (e.g., images, videos, or scans)
  • Representations for 3D scenes
  • 3D scene understanding based on synthetic 3D scene data
  • Completion of 3D scenes or objects in 3D scenes
  • Learning from real world data for improved models of virtual worlds
  • Use of 3D scenes for simulation targeted to learning in computer vision, robotics, and cognitive science

Submission: We encourage submissions of up to 6 pages, excluding references and acknowledgements. Submissions should be in the CVPR format. Reviewing will be single blind. Accepted extended abstracts will be made publicly available as non-archival reports, allowing future submissions to archival conferences or journals. We also welcome already published papers that are within the scope of the workshop (without re-formatting), including papers from the main CVPR conference. Please submit your paper to the following address by the deadline: 3dscenegeneration@gmail.com. Please mention in your email whether your submission has already been accepted for publication (and the name of the conference).

Important Dates

Paper Submission Deadline May 17 2019
Notification to Authors May 31 2019
Camera-Ready Deadline June 7 2019
Workshop Date June 16 2019

Schedule


Welcome and Introduction 8:45am - 9:00am
Invited Speaker Talk 1 9:00am - 9:25am
Invited Speaker Talk 2 9:25am - 9:50am
Spotlight Talks (x3) 9:50am - 10:10am
Coffee Break and Poster Session 10:10am - 11:10am
Invited Speaker Talk 3 11:10am - 11:35am
Invited Speaker Talk 4 11:35am - 12:00pm
Lunch Break 12:00pm - 1:30pm
Invited Speaker Talk 5 (Industry Talks) 1:30pm - 2:00pm
Invited Speaker Talk 6 2:00pm - 2:25pm
Oral 1 2:25pm - 2:45pm
Oral 2 2:45pm - 3:05pm
Coffee Break and Poster Session 3:05pm - 4:00pm
Invited Speaker Talk 7 4:00pm - 4:25pm
Invited Speaker Talk 8 4:25pm - 4:50pm
Panel Discussion and Conclusion 4:50pm - 5:40pm

Accepted Papers


Invited Speakers

Vladlen Koltun is a Senior Principal Researcher and the director of the Intelligent Systems Lab at Intel. The lab is devoted to high-impact basic research on intelligent systems. Previously, he was a Senior Research Scientist at Adobe Research and an Assistant Professor at Stanford, where his theoretical research was recognized with the National Science Foundation (NSF) CAREER Award (2006) and the Sloan Research Fellowship (2007).

Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Scientist in Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on visual recognition and search. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an Alfred P. Sloan Research Fellow and Microsoft Research New Faculty Fellow, a recipient of NSF CAREER and ONR Young Investigator awards, the PAMI Young Researcher Award in 2013, the 2013 Computers and Thought Award from the International Joint Conference on Artificial Intelligence (IJCAI), the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2013, and the Helmholtz Prize in 2017.

Marc Pollefeys is a full professor and head of the Institute for Visual Computing of the Dept. of Computer Science of ETH Zurich, which he joined in 2007. He leads the Computer Vision and Geometry lab. Previously, he was with the Dept. of Computer Science of the University of North Carolina at Chapel Hill, where he started as an assistant professor in 2002 and became an associate professor in 2005. Before that, he was a postdoctoral researcher at the Katholieke Universiteit Leuven in Belgium, where he also received his M.S. and Ph.D. degrees in 1994 and 1999, respectively. His main area of research is computer vision. One of his main research goals is to develop flexible approaches to capture visual representations of real-world objects, scenes, and events. Dr. Pollefeys has received several prizes for his research, including a Marr prize, an NSF CAREER award, a Packard Fellowship, and an ERC Starting Grant. He is the author or co-author of more than 280 peer-reviewed papers.

Ellie Pavlick is an Assistant Professor of Computer Science at Brown University, and an academic partner with Google AI. She received her PhD in Computer Science from the University of Pennsylvania. She is interested in building better computational models of natural language semantics and pragmatics: how does language work, and how can we get computers to understand it the way humans do?

Daniel Aliaga's research is primarily in the area of 3D computer graphics but overlaps with computer vision and visualization, and he has strong multi-disciplinary collaborations outside of computer science. His research activities are divided into three groups: a) his pioneering work in the multi-disciplinary area of inverse modeling and design; b) his first-of-its-kind work in codifying information into images and surfaces; and c) his compelling work in a visual computing framework including high-quality 3D acquisition methods. Dr. Aliaga’s inverse modeling and design work focuses particularly on digital city planning applications that provide innovative “what-if” design tools enabling urban stakeholders from cities worldwide to automatically integrate, process, analyze, and visualize the complex interdependencies between the urban form, function, and the natural environment.

Angela Dai is a postdoctoral researcher at the Technical University of Munich. She received her Ph.D. in Computer Science at Stanford University, advised by Pat Hanrahan. Her research focuses on 3D reconstruction and understanding with commodity sensors. She received her Master's degree from Stanford University and her Bachelor's degree from Princeton University. She is a recipient of a Stanford Graduate Fellowship.

Jiajun Wu is a fifth-year PhD student at MIT, advised by Bill Freeman and Josh Tenenbaum. He received his undergraduate degree from Tsinghua University, working with Zhuowen Tu. He has also spent time at research labs of Microsoft, Facebook, and Baidu. His research has been supported by fellowships from Facebook, Nvidia, Samsung, Baidu, and Adobe. He studies machine perception, reasoning, and their interaction with the physical world, drawing inspiration from human cognition.

Industry Participants

The workshop also features presentations by representatives of the following companies:

Organizers


Angel X. Chang
Eloquent Labs, Simon Fraser University
Daniel Ritchie
Brown University
Qixing Huang
UT Austin
Manolis Savva
Facebook AI Research, Simon Fraser University


Thanks to visualdialog.org for the webpage format.

References


[1] Fast and Flexible Indoor Scene Synthesis via Deep Convolutional Generative Models
D. Ritchie, K. Wang, and Y.A. Lin
arXiv preprint arXiv:1811.12463, 2018

[2] GRAINS: Generative Recursive Autoencoders for INdoor Scenes
M. Li, A.G. Patil, K. Xu, S. Chaudhuri, O. Khan, A. Shamir, C. Tu, B. Chen, D. Cohen-Or, and H. Zhang
arXiv preprint arXiv:1807.09193, 2018

[3] Gibson env: real-world perception for embodied agents
F. Xia, A.R. Zamir, Z.Y. He, A. Sax, J. Malik, and S. Savarese
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

[4] VirtualHome: Simulating Household Activities via Programs
X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba
CVPR, 2018

[5] Embodied Question Answering
A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra
CVPR, 2018

[6] ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans
A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner
Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2018

[7] SeeThrough: Finding Objects in Heavily Occluded Indoor Scene Images
N. Mitra, V. Kim, E. Yumer, M. Hueting, N. Carr, and P. Reddy
2018 International Conference on 3D Vision (3DV), 2018

[8] Matterport3D: Learning from RGB-D Data in Indoor Environments
A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang
International Conference on 3D Vision (3DV), 2017

[9] Joint 2D-3D-semantic data for indoor scene understanding
I. Armeni, S. Sax, A.R. Zamir, and S. Savarese
arXiv preprint arXiv:1702.01105, 2017

[10] MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments
M. Savva, A.X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun
arXiv:1712.03931, 2017

[11] AI2-THOR: An interactive 3D environment for visual AI
E. Kolve, R. Mottaghi, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi
arXiv preprint arXiv:1712.05474, 2017

[12] Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks
Y. Zhang, S. Song, E. Yumer, M. Savva, J.Y. Lee, H. Jin, and T. Funkhouser
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

[13] Semantic scene completion from a single depth image
S. Song, F. Yu, A. Zeng, A.X. Chang, M. Savva, and T. Funkhouser
Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2017

[14] ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes
A. Dai, A.X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner
Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

[15] CARLA: An Open Urban Driving Simulator
A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun
Proceedings of the 1st Annual Conference on Robot Learning, pp. 1–16, 2017

[16] SceneNN: A Scene Meshes Dataset with aNNotations
B.S. Hua, Q.H. Pham, D.T. Nguyen, M.K. Tran, L.F. Yu, and S.K. Yeung
International Conference on 3D Vision (3DV), 2016