
Here at Lab41, we work with a variety of groups from industry, academia, and even the government. Altair, one of Lab41’s projects this year, involves a close collaboration between Lab41 and the team developing nbgallery, an open-source project from the U.S. Department of Defense. At a high level, Altair is about investigating deep learning techniques for measuring source code similarity. We’re working with the nbgallery team to incorporate these techniques into more informed content-based Jupyter notebook recommendations for nbgallery users. We’ve invited their development team for a quick Q&A to provide more context about Altair and nbgallery itself:
Lab41: For those who are unfamiliar, what’s the elevator pitch for nbgallery?
nbgallery team: nbgallery is an enterprise Jupyter Notebook sharing and collaboration platform with the goal of making it easier for data scientists and analysts to share and run code-based analytics.

Why did your team create nbgallery? What use-cases does it address?
As huge fans of IPython/Jupyter, we wanted to be able to easily share and collaborate on Jupyter notebooks across our large, distributed organization. Specifically, we wanted to empower “citizen data scientists” — those who have the aptitude, curiosity, and creativity to explore data but lack the formal education and technical background of data scientists or computer programmers.
In our experience, there has been a relatively high barrier to entry to running Jupyter notebooks for non-technical users. Users or groups would first need to stand up their own Jupyter or JupyterHub instance, which required a background in Linux system administration, or at least familiarity with a command line. After that, they would need a version control system like GitHub to share and collaborate on code-based Jupyter notebooks. We felt that requiring the use of command-line git to save and check out notebooks put Jupyter beyond the reach of those who didn’t have at least some software engineering background.
We wanted to simplify the process of sharing and executing Jupyter notebooks so that users from a wide variety of skill sets and backgrounds could benefit from and collaborate on cutting-edge analytics written in Jupyter notebooks. By creating the web-based nbgallery, which acts as a visual middleman between the user and a remote git repo, we’ve provided simplified access for our non-technical users to these code-based analytics.

What differentiates nbgallery from other similar products or projects?
We’re certainly not the first to go down this path; however, our internal compute environment presented some unique challenges that required a slightly different approach.
While there are some exciting projects that achieve many of our same overarching goals, the challenge came in attempting to integrate those products into our enterprise data security and compliance frameworks. As you can imagine, our organization deals with a lot of sensitive data and has a strict “need-to-know” framework, meaning that our users should not be allowed to execute notebooks or access data for which they don’t have sufficient clearance or authorization. Additionally, we have strict restrictions on the co-habitation of analytic input and output data, so nbgallery must prevent the sharing of a notebook’s inputs and outputs. These requirements ruled out any public SaaS platforms and made it difficult to integrate any off-the-shelf product.

nbgallery also allows for a Jupyter execution environment that is independent from the notebook sharing platform. For example, the way we’ve instrumented our users’ Jupyter instances is to use an ephemeral (i.e., short-lived) personalized compute environment. This helps to ensure that while notebooks can be shared widely on the nbgallery server, they are executed in enclaves that maintain data security policy and protections. To prevent a notebook’s output from leaving that enclave, all notebook output is stripped before being saved back to the nbgallery server.
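The output-stripping step is conceptually simple because .ipynb files are plain JSON. The sketch below is our own illustration of the idea, not nbgallery’s actual implementation; the function names are hypothetical:

```python
import json

def strip_outputs(nb):
    """Clear outputs and execution counts from code cells in a parsed
    notebook dict (nbformat v4 notebooks are plain JSON)."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

def strip_notebook_file(path):
    """Strip a .ipynb file in place before it is saved back to a gallery server."""
    with open(path) as f:
        nb = json.load(f)
    with open(path, "w") as f:
        json.dump(strip_outputs(nb), f, indent=1)
```

Because only the `outputs` and `execution_count` fields are touched, the notebook’s code and markdown survive intact while any computed results stay inside the enclave.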
As an open source project from the government, what challenges did your project deal with during development?
We built nbgallery with the intent to release it to the open source community. During its development we made sure it was possible for organizations to adapt it to their own needs, since we ourselves needed customized capabilities to address many of our unique requirements. Our solution was an extension framework for nbgallery that allows for customized plug-ins to support an organization’s unique requirements.
As one simple example, we developed and use an nbgallery extension that requires users to include control markings for each notebook, which can later be used for more fine-grained security access. We also support multiple authentication methods; our internal deployment of nbgallery integrates with our enterprise user authentication service, whereas the open source release available to the public includes support for more standard username/password and OAuth-based authentication.
What features should the larger Jupyter community know about?
Because of the ephemeral execution environment that I previously discussed, the speed at which a Jupyter instance can be installed and started was a very important factor for us. To address this, we developed a minimal Jupyter Docker image (under 250 MB). The minimal image is based on Alpine Linux and offers a dozen language kernels, most of which are installed dynamically when the user tries to open a new notebook in that language.

To maintain such a small image, we couldn’t include every possible library that an analytic author might need. However, since many notebooks do require external libraries, we’ve created capabilities for both Python and Ruby to allow language and OS dependencies to be installed on the fly when a user runs the notebook. This means users do not have to know how to install packages from the command line in order to successfully execute code-based analytics found within nbgallery.
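nbgallery’s actual dependency hooks aren’t shown here, but the general “install on first import” pattern the team describes can be sketched in a few lines of Python. The `ensure` helper below is illustrative only, not nbgallery’s API:

```python
import importlib
import subprocess
import sys

def ensure(package, module=None):
    """Import a module, installing its package on the fly if it is missing.

    `module` covers cases where the import name differs from the package
    name (e.g. package 'beautifulsoup4' vs. module 'bs4').
    """
    module = module or package
    try:
        return importlib.import_module(module)
    except ImportError:
        # Install into the same interpreter that is running the kernel.
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        return importlib.import_module(module)
```

A notebook cell could then call `ensure("requests")` instead of assuming the library is preinstalled, which keeps the base image small while sparing users any command-line package management.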
While our Alpine Linux Jupyter Docker image contains all of the integration capability to interact with the nbgallery server, we have also developed a Python package (called jupyter-nbgallery) that adds nbgallery integration to any existing Jupyter or JupyterHub server. This enables groups or organizations who deploy and maintain their own Jupyter instances to still save notebooks to and run notebooks from an nbgallery server.


Any features on the roadmap ahead?
nbgallery has been under active development since November 2015 and has been in regular use internally since April 2016. A big push for our efforts in 2017 is to improve the “discoverability” of notebooks in nbgallery. We see two areas of focus here. The first is a recommendation system that can operate on code-based notebooks written in multiple languages. The second is a system to measure a notebook’s “health,” which can identify when notebooks have gone “stale” and no longer run as expected. Combined, these efforts will make it easier for users to discover the most relevant notebooks for their needs.
You mentioned that you wanted to improve the “discoverability” of notebooks. How does the work on Altair dealing with code similarity fit into nbgallery?
The internal recommendation capability within nbgallery is a big area of interest to us, since ultimately we want to improve collaboration. We’ve initially set up systems to recommend notebooks based on content similarity using TF-IDF, as well as on cosine similarity of user interactions. We contacted Lab41 a few months ago based on their research into recommender systems as part of the 2016 Hermes challenge. In particular, we wanted to know whether approaches like doc2vec, py2vec, or other machine learning and deep learning algorithms could provide a better way to analyze content similarity across a large corpus of code-based notebooks. We plan on integrating their solutions into the recommendation services used within nbgallery. As one example, the content-based approach will help with recommendations for newly created notebooks.
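As a rough illustration of the TF-IDF content-similarity baseline mentioned above (a minimal sketch of the standard technique, not the team’s actual implementation), here is a pure-Python version that scores pre-tokenized notebook texts:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0
```

Two notebooks that share distinctive tokens (library names, function calls) score higher than unrelated ones, which is exactly the signal a content-based recommender needs for brand-new notebooks with no usage history yet.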
What was the most significant lesson you learned during the development of nbgallery?
The most important lesson we learned was how valuable it is to empower users of all backgrounds to create their own analytic solutions. While users without computer programming experience might initially balk at the thought of writing or executing code-based analytics, Jupyter’s web-based Notebooks provide a powerful, yet approachable, analytic platform. And nbgallery makes it easier for non-technical users to mix and match pieces of code from other, well-vetted notebooks for their own purposes.
Ultimately, by reducing the barriers to entry we allow users to employ popular and approachable languages like Python, Ruby and R to focus their creativity and innovation on solving pressing challenges of data-driven organizations.
What was the one surprising feature request you learned from listening to your users?
Notwithstanding what I just mentioned, we still found that a significant number of users could successfully discover notebooks of interest in nbgallery but got stuck when trying to actually execute the code within the notebook itself. We’d like to see how we can improve the user experience of running notebooks for beginners who may be intimidated by the full suite of options within Jupyter, not to mention the sight of the raw code itself. Things like Jupyter Dashboards offer a lot of potential, and we look forward to further investigating that as well as other options.
Where can I find nbgallery?
Our code is available on GitHub and our Docker images are available on Docker Hub. We invite everyone to take a look and let us know what you think!