In Japan, the Japan Science and Technology Agency (JST) launched the
Materials Research by Information Integration Initiative (MI2I) at the
National Institute for Materials Science (NIMS) in July 2015. The Institute
of Statistical Mathematics has been designated as a recommitment site of
MI2I, serving as the central institute of data science in Japan.
Python Library on Representation & Learning for Materials Data
Overview
XenonPy is a Python library that implements a comprehensive set of machine
learning tools for materials informatics. Its functionalities partially
depend on Python (PyTorch) and R (MXNet). This package is still under
development. The current release provides a limited set of features:
• Interface to the public materials database
• Library of materials descriptors (compositional/structural descriptors)
• Pre-trained model library XenonPy.MDL (v0.1.0.beta, 2019/8/9: more than
  140,000 models, including private models, for 35 properties of small
  molecules, polymers, and inorganic compounds) [currently under major
  maintenance; expected to be restored in v0.7]
• Machine learning tools
• Transfer learning using the pre-trained models in XenonPy.MDL
Inverse molecular design for R: iqspr v2.4
Introduction
The structure of a chemical species can be encoded in a single string of
standard text characters called SMILES (Simplified Molecular Input Line
Entry System). A very nice presentation of the SMILES notation can be
found here. If one knows the SMILES of a chemical compound, its 2D structure
can be unambiguously reconstructed. One useful aspect of the SMILES format
is that it lends itself to the prediction of the properties of compounds.
The link that exists between the structure of a compound and its properties
is generally called a QSPR (Quantitative Structure-Property Relationship),
and it has been widely used in cheminformatics for the design of new
compounds. Generally, compound structures are primarily investigated by
chemists following a trial-and-error construction controlled by their
existing knowledge of the chemistry and their intuition. The properties of
the investigated compounds are then checked by direct experiments and/or
estimated by a QSPR analysis. In this kind of analysis, numerous descriptors
can be built from the SMILES format. These descriptors can be represented as
a set of binary and/or continuous properties based, for example, on the
existence of certain fragments in a molecule or on the ability of its bonds
to rotate. An introduction and overview concerning molecular descriptors
can be found here. Then, the descriptors are passed as input features to a
given regression model to predict output properties for a list of novel
compounds. This kind of reconstruction of the properties of compounds from
descriptors is called a forward prediction.
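
As a rough illustration of the forward-prediction workflow described above, the following minimal Python sketch turns SMILES strings into a small descriptor vector and fits a linear regression that predicts a property for unseen compounds. Everything in it is an assumption made for illustration: the function names (`toy_descriptors`, `fit_linear`, `predict`, `synthetic_property`), the four count-based descriptors, and the synthetic target values are hypothetical and are not part of iqspr or XenonPy. The tokenizer only handles a small SMILES subset; a real analysis would rely on a cheminformatics toolkit and on measured or computed property data.

```python
import re
from collections import Counter

# Toy tokenizer for a small SMILES subset (no bracket atoms, charges, or
# stereochemistry). Two-letter halogens are listed first so that "Cl" is not
# read as a carbon atom.
TOKEN = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnops]|[=#]|\d")


def toy_descriptors(smiles):
    """Return hypothetical count-based descriptors for a SMILES string:
    [carbon count, oxygen count, nitrogen count, double-bond count]."""
    counts = Counter(TOKEN.findall(smiles))
    return [
        counts["C"] + counts["c"],
        counts["O"] + counts["o"],
        counts["N"] + counts["n"],
        counts["="],
    ]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + [float(v) for v in x] for x in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, x):
    """Forward prediction: intercept plus weighted sum of descriptors."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def synthetic_property(d):
    """Hypothetical rule used only to generate placeholder training targets.
    These are NOT experimental values."""
    n_c, n_o, n_n, n_double = d
    return 1.0 + 2.0 * n_c + 3.0 * n_o - 1.5 * n_n + 0.5 * n_double


# Training set: SMILES -> descriptors -> (synthetic) property values.
train_smiles = ["CC", "CCO", "CCN", "C=C", "CCC", "CC(=O)O", "CC(C)=O", "c1ccccc1"]
X_train = [toy_descriptors(s) for s in train_smiles]
y_train = [synthetic_property(d) for d in X_train]

weights = fit_linear(X_train, y_train)

# Forward prediction for compounds that were not in the training set.
for smiles in ["CCCO", "CCC=O", "NCCO"]:
    print(smiles, round(predict(weights, toy_descriptors(smiles)), 3))
```

Because the toy targets are exactly linear in the descriptors, the fit simply recovers the generating rule. With real data the model would only approximate the property, so the choice of descriptors, the regression model, and its validation all become important.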