Research

Our proposed approach brings together a unique blend of skills spanning machine learning, human-computer interaction, and experience with scientific domains and users at facilities. Our goal is to make data a first-class discoverable resource at supercomputing centers through the powerful concept of search.

We envision that the ScienceSearch framework will be available at supercomputing centers and that users will make their data available to the system. The framework will use the datasets and the ecosystem artifacts associated with the data (e.g., proposals, workflow and system logs, publications) to learn and generate metadata labels. It will use active learning to surface the metadata labels to users for feedback, and users will be able to validate, add, delete, or edit labels. Similarly, we anticipate that the models will learn from the relevance of the search results. The framework will need to ascertain the effect of user input on the models and track the evolution of metadata. Specifically, in this proposal we focus on three research themes in the context of ScienceSearch.

Research Thrust 1: Methods for generating automated metadata.

We will explore machine learning techniques to generate metadata for scientific data using various artifacts of the ecosystem, including the datasets, job scripts, workflows, and usage logs. The generated metadata can yield valuable knowledge about the data itself. It can also capture the relationships between code, people, and datasets, widening the search capabilities. In addition to the context of use, it is critical to extract scientific features and tag the datasets appropriately. We will leverage existing work and explore techniques to generate domain-specific metadata for selected scientific domains, as a basis for generalizing the techniques to other datasets. We will start with existing unsupervised, supervised, and semi-supervised learning techniques and extend or customize them as needed.
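As a purely illustrative starting point, the unsupervised end of this spectrum can be sketched with TF-IDF term scoring over the text artifacts associated with each dataset. This is a minimal sketch, not the method the project will adopt: the function names, the input layout (one concatenated text string per dataset), and the choice of TF-IDF itself are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_labels(artifacts, top_k=3):
    """Suggest candidate metadata labels for each dataset.

    `artifacts` is a hypothetical mapping from a dataset name to the
    concatenated text of its artifacts (proposal abstract, job script
    comments, and so on). Returns the top_k terms per dataset ranked
    by TF-IDF score, with ties broken alphabetically.
    """
    tokens = {name: tokenize(text) for name, text in artifacts.items()}
    n_docs = len(tokens)

    # Document frequency: in how many datasets does each term appear?
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    labels = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        labels[name] = ranked[:top_k]
    return labels
```

Terms that occur in every dataset's artifacts receive a score of zero, so generic words fall to the bottom of the ranking without a hand-built stop list. Labels produced this way are only candidates and would still need the user validation described in this proposal.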

Research Thrust 2: Scalable Metadata Infrastructure.

We will address scalability challenges in building the metadata infrastructure for search. Specifically, we will (a) build and expand on earlier work \cite{sparks2013mli,neon:website} to address the challenges of running the machine learning algorithms at scale on current and future systems, and (b) build on Ground \cite{ground:website}, the open-source metadata service under development at UC Berkeley, to capture and find the metadata extracted by the machine learning algorithms.

Research Thrust 3: Usable Metadata Infrastructure.

The complexity of scientific data and the interactive nature of search raise a multitude of human factors that need to be considered. We will need to understand the types of search queries users would like to perform, and we will need to study how users interact with the algorithms. Specifically, we will address usability and accuracy issues with generated metadata by allowing users to mark errors, edit labels, and add annotations. We will also study the use of active learning to surface ``important'' (i.e., uncertain and influential) guesses to users for validation, and determine how to weigh human input and incorporate it into the metadata generation models. Similarly, we will study how user judgments of the relevance of search results may be fed back into the models.
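To make the idea of surfacing uncertain guesses concrete, the following minimal sketch applies plain uncertainty sampling: candidate labels whose predicted confidence is closest to 0.5 are shown to users first. The tuple layout, the function names, and the review budget are assumptions made for illustration, and the sketch deliberately ignores the ``influential'' criterion, which would require a model of how each label affects downstream predictions.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a yes/no prediction with confidence p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_review(predictions, budget=2):
    """Pick the most uncertain label guesses to show to a user.

    `predictions` is a hypothetical list of (dataset, label, confidence)
    tuples, where confidence is the model's estimated probability that
    the label is correct. Returns the `budget` entries with the highest
    entropy; ties are broken by dataset and label name.
    """
    ranked = sorted(
        predictions,
        key=lambda item: (-binary_entropy(item[2]), item[0], item[1]),
    )
    return ranked[:budget]
```

Under this scheme a label predicted with confidence 0.55 is reviewed before one predicted at 0.98 or 0.05, since a user's answer to the former is expected to be more informative to the model.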