NSF CHE/DMS Innovation Lab: Learning the Power of Data in Chemistry

Created
14 Sep 2018

There are many reasons why collaborations between data scientists and chemists could prove highly fruitful for both parties. On this page, we have added some ideas for inspiration.

These are meant as "food for thought," not as vision statements of the Innovation Lab. Applicants should bring their own concepts, ideas, and inspirations, going well beyond what the organizers offer here.

 

The Data Revolution and the National Science Foundation

Harnessing the Data Revolution (HDR) is one of the NSF Big Ideas that “provides a profound opportunity to transform research across all fields of science and engineering through new insights gained from data.” Both the Division of Chemistry (CHE) and the Division of Mathematical Sciences (DMS) at NSF have promoted, and will continue to promote, data-related research activities in their respective disciplines. The concept of this Innovation Lab is built on the host of new opportunities envisioned when the two communities join forces and bring data scientists and chemists together to exchange ideas, develop new methods, and address long-standing problems.
See the following links to NSF materials on the Data Revolution:
www.nsf.gov/news/special_reports/big_ideas/harnessing.jsp
www.nsf.gov/cise/harnessingdata/

www.nsf.gov/pubs/2018/nsf18075/nsf18075.jsp

 

Data Science for Chemical Measurement

    Chemical measurement provides precise, detailed descriptions of atomic and molecular species in a wide variety of environments. Some of these measurements arrive in real time, as analytical instruments are taken out of the lab into relatively uncontrolled conditions. Because opportunities to collect data in the real world are limited, the right measurements need to be made at the right time for the chemical science to advance at a meaningful rate.
    Imagine the following: you have two hours on an Arctic flight to collect data on the particles that are signatures of the Northern Hemisphere atmosphere, with strong implications for how these particles relate to global climate change. The incoming data are simultaneous measurements of thousands of signals, and only a subset of these signals has previously been mapped to known species. How do you sample effectively to optimally inform scientific models? How do you avoid rediscovering what is known, or missing out on profound new insight because a pollution source couldn’t be tracked down? With limited resources, how should we measure to build the best climate models?
    Traditionally, sampling protocols are determined outside the environment, without statistically sound means to resample as data come in. We envision that on-the-fly adaptation of measuring strategies could be revolutionary, permitting novel discoveries by refocusing measurements on unexpected, interesting data and avoiding repeated collection of data where the outcome is already statistically certain. Data science concepts such as uncertainty quantification, active and reinforcement learning, and statistically rigorous design of experiments would prove exceedingly valuable in this area. Much foundational work needs to be done to combine cutting-edge data science with chemical measurement, and we look forward to seeing the profound discoveries that are made at this interface.
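As a rough illustration of what "avoiding repeated collection of data where the outcome is already statistically certain" could look like, the sketch below spends a fixed measurement budget on whichever signal channel currently has the largest standard error of its mean. Everything here is an assumption made for illustration: the function name `adaptive_sampling`, the `measure` callback, and the toy noise levels are hypothetical, and the standard-error rule is a deliberately simple stand-in for the active-learning and experimental-design methods mentioned above.

```python
import numpy as np

def adaptive_sampling(measure, n_channels, budget, n_init=3):
    """Spend `budget` measurements, always re-measuring the least certain channel.

    `measure(c)` is a hypothetical callback returning one noisy reading of channel c.
    """
    # Seed every channel with a few readings so a standard error can be computed.
    samples = [[measure(c) for _ in range(n_init)] for c in range(n_channels)]
    for _ in range(budget - n_init * n_channels):
        # Standard error of the mean per channel: sample std / sqrt(n).
        sem = [np.std(s, ddof=1) / np.sqrt(len(s)) for s in samples]
        c = int(np.argmax(sem))  # channel whose mean is currently least certain
        samples[c].append(measure(c))
    counts = np.array([len(s) for s in samples])
    means = np.array([np.mean(s) for s in samples])
    return counts, means

# Toy demonstration (assumed noise levels): two quiet channels, one noisy channel.
rng = np.random.default_rng(0)
noise = [0.01, 0.01, 1.0]
counts, means = adaptive_sampling(lambda c: rng.normal(0.0, noise[c]), 3, 60)
print("measurements per channel:", counts)
```

In this toy setting the budget drifts toward the noisy channel, because the quiet channels are already well characterized after a handful of readings. A realistic strategy would also have to weigh scientific value, instrument constraints, and model-based uncertainty, none of which this sketch attempts.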

 

Heterogeneous Multi-task Experiments for Design

In electrolyte design for batteries, a chemist would like to conduct experiments that measure battery conductivity in order to identify an electrolyte that maximizes the conductivity. On a different day, she would like to conduct experiments with different electrolyte designs to learn how the viscosity of the electrolyte changes with design. The chemist can typically measure both conductivity and viscosity in a single experiment. Since such experiments are expensive, it is wasteful to first perform one set of experiments to optimize conductivity and then a fresh set to learn viscosity. It is preferable to design a single set of experiments that achieves both goals simultaneously. While multi-task problems have been considered in data analytics, they typically address tasks of the same type, e.g., multiple regressions or multi-objective optimization. In this example, the two tasks are quite different: maximizing one function while learning another. This calls for the design of new algorithms that can handle heterogeneous tasks simultaneously.
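One naive way to picture such a heterogeneous design rule is sketched below: two Gaussian-process surrogates share the same experiments, and the next design is chosen by a score that adds an upper-confidence-bound term for conductivity (the optimization task) to the predictive uncertainty of viscosity (the learning task). This is only an assumed baseline, not an established algorithm for the problem. The function names, the one-dimensional design variable, the RBF kernel, the length scales, and the weights `beta` and `weight` are all hypothetical, and the measurements are assumed to be standardized so that a zero-mean, unit-variance prior is reasonable.

```python
import numpy as np

def rbf(a, b, length):
    """Squared-exponential kernel between two 1-D arrays of design values."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_query, length, noise=1e-4):
    """Posterior mean and std of a zero-mean, unit-variance GP at x_query."""
    K = rbf(x_train, x_train, length) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_query, length)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def next_experiment(x_train, cond, visc, candidates, beta=2.0, weight=1.0):
    """Pick the candidate design that is promising for conductivity
    and informative about viscosity (assumed additive score)."""
    mu_c, sd_c = gp_posterior(x_train, cond, candidates, length=0.2)
    _, sd_v = gp_posterior(x_train, visc, candidates, length=0.1)
    score = mu_c + beta * sd_c + weight * sd_v
    return candidates[int(np.argmax(score))]

# Toy usage: two experiments already run at the ends of the design range.
x_done = np.array([0.0, 1.0])
conductivity = np.array([0.0, 0.0])  # standardized readings (made up)
viscosity = np.array([0.0, 0.0])     # standardized readings (made up)
grid = np.linspace(0.0, 1.0, 11)
print("next design to test:", next_experiment(x_done, conductivity, viscosity, grid))
```

Setting `weight=0` recovers a plain optimization rule, and setting `beta=0` with a flat conductivity estimate recovers pure uncertainty sampling for viscosity. How to balance the two in a principled way is exactly the open question raised above.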

 

Data Science in Support of Multimodal Measurements in Complex Chemical Systems

Complexity plays an important role in the design, control, and engineering of emergent properties of chemical systems. Most of the time, the complex interactions exhibited by such systems are governed by a series of non-linear dynamical processes, and the resulting behavior cannot be described by simply adding the properties of the individual components (i.e., the behavior is more than the sum of its parts). From the scientist’s perspective, the main challenge is to use highly accurate and reliable measurements (involving instrumentation that operates at different length and time scales) to assess the essential characteristics of complex systems. Developing appropriate data and informatics tools to distill this experimental information could aid in the development of novel chemical manufacturing processes, or even allow these processes to be modified in operando, improving the versatility of manufacturing while reducing cost. Given that connecting chemistry to functionality requires the ability to perform chemical analysis across length and time scales, this task has been recognized as a grand challenge that will require cross-disciplinary efforts to overcome.

 

New Ideas for Data Science in Chemistry

  1. Topological Data Analysis (TDA): Existing applications of TDA to chemistry and protein structures construct persistence diagrams from van der Waals radii for atoms. Could one obtain more insight about geometric structures, their shapes and surfaces, by basing persistence diagrams instead on electron density, or other choices that are physically motivated?
  2. Network Science: Graph Theory is currently used mostly for the representation of organic molecule structures. Could generalizations, such as graphons or infinite graphs, be employed for the representation of extended periodic systems? Furthermore, are there novel developments that might assist chemical understanding in areas where graphs are already widely used?
  3. Partial differential equations: A key application is the modeling of reaction kinetics. Experimental identification of reaction networks is often incomplete and afflicted with large error bars: too many competing reactions can make it impossible to perform experimental measurements for individual species. The following tools might help to discern important pathways and identify key constituents:
    1. Latent variable detection
    2. Uncertainty quantification
    3. Sensitivity analysis (e.g. variance-based Sobol indices)
    4. Parameter estimation
    5. Adaptation of mathematical approaches for biological/metabolic networks
  4. Regression problems: These are ubiquitous in chemical modeling, and one recurring issue is bias. On the one hand, randomized algorithms based on sampling and sketching can make the solution of large-scale systems tractable through dimension reduction, but at the possible cost of introducing bias. On the other hand, can judicious sampling help to identify (and thereby avoid exacerbating) biases inherent in the chemical datasets themselves, such as the absence of “failed” experiments from publications and biased reaction choices? And can statistical techniques detect and compensate for these biases?
  5. Dimensionality reduction of immense and correlated feature spaces, based on small amounts of data: An example is SISSO (sure independence screening and sparsifying operator).
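To make one of the tools listed under item 3 concrete, the sketch below estimates variance-based first-order Sobol indices with a standard pick-freeze Monte Carlo estimator. The kinetic model is an assumed toy example, not taken from the text: a consecutive reaction A → B → C with initial concentration [A] = 1, where the output is the concentration of the intermediate B at a fixed time and the two rate constants are drawn from arbitrarily chosen, non-overlapping ranges. The function names are hypothetical.

```python
import numpy as np

def first_order_sobol(model, n_inputs, n_samples=20000, seed=0):
    """Estimate first-order Sobol indices for inputs uniform on [0, 1].

    `model` maps an (n_samples, n_inputs) array to an (n_samples,) array.
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n_samples, n_inputs))
    B = rng.random((n_samples, n_inputs))
    fA, fB = model(A), model(B)
    total_var = np.var(np.concatenate([fA, fB]))
    # Centering f(B) leaves the expectation unchanged but lowers estimator noise.
    fB_centered = fB - fB.mean()
    indices = np.empty(n_inputs)
    for i in range(n_inputs):
        AB = A.copy()
        AB[:, i] = B[:, i]  # take only input i from the second sample
        indices[i] = np.mean(fB_centered * (model(AB) - fA)) / total_var
    return indices

def intermediate_concentration(u, t=1.0):
    """[B](t) for A -> B -> C with [A]0 = 1; rate ranges are assumptions."""
    k1 = 0.5 + 1.0 * u[:, 0]  # k1 in [0.5, 1.5]
    k2 = 2.0 + 2.0 * u[:, 1]  # k2 in [2.0, 4.0], so k2 != k1 always
    return k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))

print("first-order indices (k1, k2):",
      first_order_sobol(intermediate_concentration, n_inputs=2))
```

Each index approximates the fraction of output variance explained by one rate constant alone. For a realistic reaction network, the inputs would number in the dozens or more, the model would be a numerical ODE or PDE solve, and total-order indices would also be needed to capture interactions.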

 

We look forward to the unique ideas you will bring to the Innovation Lab!


The opinions, findings, and conclusions or recommendations expressed on this site are those of the author(s) and do not necessarily reflect the views of Knowinnovation Inc.