Research

Our proposed techniques draw on a unique blend of skills that includes machine learning, human-computer interaction, and experience with scientific domains and with users at facilities. Our goal is to make data a first-class, discoverable resource at supercomputing centers through the powerful concept of search.

We envision that the ScienceSearch framework will be available at supercomputing centers, where users can make their data available to the system. The framework will use the datasets and the ecosystem artifacts associated with them (e.g., proposals, workflow and system logs, publications) to learn and generate metadata labels. It will use active learning and generative AI to surface the metadata labels to users for feedback; users can validate, add, delete, or edit labels. Similarly, we anticipate that the models will learn from user feedback on the relevance of the search results. The framework will need to ascertain the effect of users' inputs on the models and track the evolution of metadata. The trustworthiness and quality of metadata generation across the scientific lifecycle also need to be monitored. Specifically, in this proposal we focus on four research themes in the context of ScienceSearch:


Research Thrust 1: Methods for generating automated metadata.

We explore the use of machine learning techniques to generate metadata for scientific data from various artifacts of the ecosystem, including the datasets, job scripts, workflows, and usage logs. The generated metadata can yield valuable knowledge about the data itself. It can also capture the relationships between code, people, and datasets, widening the search capabilities. In addition to the context of use, it is critical to extract scientific features and tag the datasets appropriately. We leverage existing work and explore techniques to generate domain-specific metadata within individual scientific domains, as a basis for generalizing the techniques to other datasets. We use existing unsupervised, supervised, and semi-supervised learning techniques, as well as state-of-the-art generative AI models, to enrich scientific metadata for search.
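As a rough illustration of how text artifacts such as job scripts could yield candidate labels, the sketch below ranks terms by TF-IDF, a simple unsupervised baseline chosen here only for illustration and not the method proposed above. The function names, the artifact names, and the sample text are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_labels(documents, top_n=2):
    """Suggest up to top_n candidate labels per artifact using TF-IDF scores.

    documents: dict mapping an artifact name to its text content.
    """
    tokens = {name: tokenize(text) for name, text in documents.items()}

    # Document frequency: in how many artifacts each term appears.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    n_docs = len(documents)
    labels = {}
    for name, words in tokens.items():
        counts = Counter(words)
        scores = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels[name] = ranked[:top_n]
    return labels


# Hypothetical artifact text, for illustration only.
artifacts = {
    "job_a.sh": "tomography reconstruction of beamline scan tomography",
    "job_b.sh": "genome assembly of microbial sample genome",
}
labels = tfidf_labels(artifacts)
```

Terms that occur in every artifact (here, "of") receive a score of zero and fall to the bottom of the ranking, so the suggested labels favor terms that distinguish one artifact from the others.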


Research Thrust 2: Scalable Metadata Infrastructure.

We address scalability challenges in building the metadata infrastructure for search. Specifically, we will (a) build and expand on earlier work \cite{sparks2013mli,neon:website} to run the machine learning algorithms at scale on current and future systems, and (b) build on Ground \cite{ground:website}, the open-source metadata service under development at UC Berkeley, to capture and find the metadata extracted by the machine learning algorithms.


Research Thrust 3: Usable Metadata Infrastructure.

The complexity of scientific data and the interactive nature of search raise a multitude of human factors that need to be considered. We need to understand the types of search queries users would like to perform and to study how users interact with the algorithms. Specifically, we address usability and accuracy issues with generated metadata by allowing users to mark errors, edit labels, and add annotations. We also study the use of active learning to surface ``important'' (i.e., uncertain and influential) guesses to users for validation, and determine how to weigh human input and incorporate it into the metadata generation models. Similarly, we investigate how user feedback on the relevance of search results may be fed back into the models.
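One common way to decide which guesses to surface is uncertainty sampling, sketched below under the assumption that the model emits a probability distribution over candidate labels for each dataset. It covers only the ``uncertain'' half of the criterion above; estimating influence would need a separate measure. The function names, dataset ids, and probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_review(predictions, k=2):
    """Return the k dataset ids whose predicted labels are most uncertain.

    predictions: dict mapping a dataset id to a list of label probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda dataset_id: entropy(predictions[dataset_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical model outputs: probabilities over three candidate labels.
predictions = {
    "scan_001": [0.98, 0.01, 0.01],  # confident guess, low entropy
    "scan_002": [0.40, 0.35, 0.25],  # near-uniform guess, high entropy
    "scan_003": [0.70, 0.20, 0.10],
}
to_review = select_for_review(predictions, k=2)
```

In this example the near-uniform prediction for `scan_002` is ranked first for user validation, while the confident prediction for `scan_001` is not surfaced at all.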


Research Thrust 4: Trustworthiness and Verification of AI and Generative Models in Scientific Applications.

Researchers increasingly rely on AI/ML tools throughout the scientific process, from literature review (e.g., Elicit) to technical writing (e.g., language models, Grammarly, Scite). However, as these tools advance, ensuring trustworthiness, accuracy, reproducibility, and provenance tracking at each stage of the scientific lifecycle has become increasingly challenging yet essential. To address this challenge, we are developing a framework for ensuring the trustworthiness, accuracy, and reproducibility of generative and predictive AI tools used within any specific workflow. As part of this effort, we will also explore automated metadata generation and checkpointing for AI models to facilitate provenance tracking within the scientific pipeline.
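A minimal sketch of what a checkpoint record for provenance tracking might contain is shown below, assuming that a model state can be serialized to bytes and that hashing those bytes is an acceptable way to identify it. The record fields, function names, and example values are hypothetical and do not describe the framework itself.

```python
import hashlib
import json


def checkpoint_record(model_bytes, model_name, inputs):
    """Build a provenance record tying generated metadata to an exact model state."""
    return {
        "model_name": model_name,
        # Content hash of the serialized model identifies the exact checkpoint.
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        # Sorting makes the record independent of the order inputs were listed.
        "inputs": sorted(inputs),
    }


def record_id(record):
    """Deterministic identifier: SHA-256 of the record's canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical serialized model states and input artifacts.
rec_v1 = checkpoint_record(b"weights-v1", "label-generator", ["job_a.sh", "run.log"])
rec_v1_again = checkpoint_record(b"weights-v1", "label-generator", ["run.log", "job_a.sh"])
rec_v2 = checkpoint_record(b"weights-v2", "label-generator", ["job_a.sh", "run.log"])
```

Because the identifier is derived from the record's content, the same model state and inputs always map to the same id, and any change to the model yields a different one.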