In Japan, the Japan Science and Technology Agency (JST) launched the Materials Research by Information Integration Initiative (MI2I) at the National Institute for Materials Science (NIMS) in July 2015. The Institute of Statistical Mathematics has been designated as a recommitment site of MI2I, serving as the central institute of data science in Japan.
Python Library on Representation & Learning for Materials Data
XenonPy is a Python library that implements a comprehensive set of machine learning tools for materials informatics. Its functionality partially depends on Python (PyTorch) and R (MXNet). The package is still under development, and the currently released version provides a limited set of features:
· Interface to the public materials database
· Library of materials descriptors (compositional/structural descriptors)
· Pre-trained model library XenonPy.MDL (v0.1.0.beta, 2019/8/9: more than 140,000 models, including private models, for 35 properties of small molecules, polymers, and inorganic compounds) [currently under major maintenance, expected to be restored in v0.7]
· Machine learning tools
· Transfer learning using the pre-trained models in XenonPy.MDL
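To make the idea of a compositional descriptor concrete, the following is a minimal sketch in plain Python. It is not XenonPy's API: the function names, the descriptor names, and the small element table are all assumptions made for illustration. It turns a flat chemical formula into composition-weighted statistics of elemental properties.

```python
import re

# Illustrative element table (atomic mass in u, Pauling electronegativity).
# Only a few elements are listed; this table is an assumption of the sketch.
ELEMENT_PROPS = {
    "H": {"atomic_mass": 1.008, "electronegativity": 2.20},
    "C": {"atomic_mass": 12.011, "electronegativity": 2.55},
    "O": {"atomic_mass": 15.999, "electronegativity": 3.44},
    "Na": {"atomic_mass": 22.990, "electronegativity": 0.93},
    "Cl": {"atomic_mass": 35.45, "electronegativity": 3.16},
}


def parse_formula(formula):
    """Parse a flat formula such as 'H2O' into element counts (no parentheses)."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def compositional_descriptor(formula):
    """Return mean (composition-weighted), max, and min of each elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    descriptor = {}
    for prop in ("atomic_mass", "electronegativity"):
        values = [ELEMENT_PROPS[element][prop] for element in counts]
        fractions = [n / total for n in counts.values()]
        descriptor["mean_" + prop] = sum(f * v for f, v in zip(fractions, values))
        descriptor["max_" + prop] = max(values)
        descriptor["min_" + prop] = min(values)
    return descriptor


water = compositional_descriptor("H2O")
salt = compositional_descriptor("NaCl")
```

Each formula maps to a fixed-length vector regardless of how many elements it contains, which is what lets such descriptors feed directly into a regression model.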
The structure of a chemical species can be encoded in a single string of standard text characters called SMILES (Simplified Molecular Input Line Entry System). A very nice presentation of the SMILES notation can be found here. If one knows the SMILES of a chemical compound, its 2D structure can be unambiguously reconstructed. One advantage of the SMILES format is that it is particularly useful for predicting the properties of compounds.
The link between the structure of a compound and its properties is generally called a QSPR (Quantitative Structure-Property Relationship), and it has been widely used in cheminformatics for the design of new compounds. Typically, chemists explore compound structures by trial and error, guided by their existing knowledge of chemistry and their intuition. The properties of the investigated compounds are then verified by direct experiment and/or estimated through a QSPR analysis.
In this kind of analysis, numerous descriptors can be built from the SMILES format. These descriptors can be represented as a set of binary and/or continuous properties based, for example, on the existence of certain fragments in a molecule or on the ability of its bonds to rotate. An introduction to and overview of molecular descriptors can be found here. The descriptors are then passed as input features to a given regression model to predict output properties for a list of novel compounds. This kind of reconstruction of the properties of compounds from descriptors is called a forward prediction.
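The forward-prediction workflow described above (SMILES, then descriptors, then a regression model) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the method used by XenonPy: the descriptor function, the least-squares solver, and the choice of boiling point as the target property are all hypothetical, and the training values are approximate literature boiling points in °C. A real analysis would use a cheminformatics toolkit and far richer descriptors.

```python
import re


def smiles_descriptors(smiles):
    """Crude descriptors read from a simple SMILES string (no bracket atoms)."""
    # Match two-letter halogens first so that 'Cl' is not counted as carbon.
    atoms = re.findall(r"Cl|Br|[BCNOSPFI]|[cnos]", smiles)
    n_carbon = sum(a in ("C", "c") for a in atoms)    # continuous descriptor
    has_oxygen = any(a in ("O", "o") for a in atoms)  # binary fragment flag
    return [1.0, float(n_carbon), float(has_oxygen)]  # leading 1.0 = intercept


def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         + [sum(row[i] * t for row, t in zip(X, y))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Training set: (SMILES, approximate boiling point in deg C) for n-alkanes
# and primary alcohols. Values are rounded and for illustration only.
TRAIN = [
    ("C", -161.5), ("CC", -88.6), ("CCC", -42.1), ("CCCC", -0.5),
    ("CCCCC", 36.1), ("CCCCCC", 68.7),
    ("CO", 64.7), ("CCO", 78.4), ("CCCO", 97.2), ("CCCCO", 117.7),
]

X = [smiles_descriptors(s) for s, _ in TRAIN]
y = [bp for _, bp in TRAIN]
weights = fit_least_squares(X, y)


def predict(smiles):
    """Forward prediction: descriptors in, estimated property out."""
    return sum(w * x for w, x in zip(weights, smiles_descriptors(smiles)))


# Forward prediction for compounds that were not in the training set.
for candidate in ("CCCCCCC", "CCCCCO"):
    print(candidate, round(predict(candidate), 1))
```

A linear model on two descriptors is far too simple to be accurate, but the structure is the same as in a realistic QSPR study: only the descriptor function and the regression model would change.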