Next-generation scientific discoveries rely on the insights we can derive from the large amounts of data produced by simulations and by experimental and observational facilities. Today, however, data is accessed and analyzed primarily by those who generate it, because relevant data sets are difficult to search for and find. The goal of this proposal is to investigate machine learning techniques for automatically generating metadata that will enable search on data. Enabling search on data will accelerate scientific discoveries through virtual experiments and through multidisciplinary and multimodal data assimilation.
ASCR facilities are increasingly handling large amounts of data from experiments and simulations. Facilities like NERSC are already ingesting data at a rate of 1-2 PB per month, and new instruments and detectors are expected to produce data at a rate of approximately a Tb/sec within the next 5 years. Without robust search capabilities, fusing multimodal data from different instruments, comparing experimental results across domains, and reproducing scientific results will remain one-off, labor-intensive efforts. There is an urgent need to understand and capture the characteristics of the data to enable efficient search on data. However, many datasets have limited associated metadata, and metadata systems are often incomplete and require manual effort by a researcher.
The growth of data volumes at DOE Office of Science user facilities creates an urgent need for new methods of learning about the data and making it widely accessible. While some communities (e.g., astronomy) have community-accessible databases, open access to data is still fairly limited. In recent years, the concept of a superfacility has been proposed. The vision for the superfacility is a network of facilities, software, and expertise that enables new modes of discovery around data. Scientific data search needs to become routine for the superfacility concept to be most compelling and to deliver new scientific discoveries that keep pace with the growth of data volumes.
In biology, data standards already support efficient search for relevant data and its reuse in new virtual experiments. For example, in medical drug design, data search and reuse have made it possible to repurpose existing drugs. Making data generated across DOE Office of Science user and computing facilities searchable and reusable in new virtual experiments has the potential to be similarly transformative.
What is needed is a generalized search infrastructure that learns and captures knowledge about the data so that other users can search it. Infrastructure that uses machine learning techniques to build knowledge about the data, i.e., to generate metadata, is the first step towards building a search engine for science. In the proposed work, we investigate ScienceSearch, a framework that will use machine learning to auto-generate metadata to make data at supercomputing facilities searchable. Scientific data available at supercomputing centers provides a unique opportunity to enable wider access to data and build search capabilities.