Monday, May 9, 2022

NMRlipids databank: Current status and structure

After the latest NMRlipids publication, the main focus has been on the development of the NMRlipids databank, which is intended to enable automatic analyses over all the data contributed to the project. After the original idea and preliminary version, the development has been facilitated by several meetings (Berlin, online I, online II, Prague, online III). The upcoming meeting in Helsinki on 1.-3.6.2022 will also include educational sessions on using the NMRlipids databank.

The databank is now essentially functional, and preparation of the first manuscript has started. The overlay structure and current content of the databank are illustrated in figure 1. The structure of the databank is discussed in more detail below. Quality evaluation and preliminary results will be discussed in upcoming posts.

Figure 1: a) Structure of an overlay databank. A more detailed structure of layer 2 in the NMRlipids databank is illustrated in figure 2. b) Distribution of the lengths of the trajectories, total number of trajectories and total length of the simulations in the NMRlipids databank. c) Distribution of lipids present in the trajectories in the NMRlipids databank. Lipids occurring in five or fewer simulations (’others’) are listed on the right. d) Currently available binary mixtures in the NMRlipids databank. e) Distribution of force fields in the simulations in the NMRlipids databank. The figures and numbers were created on 9 May 2022 with stats.ipynb.

Structure of the NMRlipids databank. As illustrated in Fig. 2, the script creates a README.yaml file that contains all the essential information about an added simulation, based on the information given according to the instructions. The created README.yaml files are stored in folders in Data/Simulations. The folders are named after the hash identities of the trajectory and topology files. While the raw simulation data are not directly stored in the NMRlipids databank, the README.yaml files contain permanent links from which the raw data can be accessed when needed.
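To make the idea of hash-named folders concrete, here is a minimal sketch. It assumes SHA-1 as the hash function and a two-level folder layout (topology hash, then trajectory hash). Both are assumptions for illustration, and the function names file_hash and entry_folder are hypothetical rather than taken from the databank code.

```python
import hashlib
from pathlib import Path


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large trajectories fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def entry_folder(root, topology_file, trajectory_file):
    """Build a folder path from the two file hashes (hypothetical layout)."""
    return Path(root) / file_hash(topology_file) / file_hash(trajectory_file)
```

Because the folder name depends only on the file contents, the same pair of files always maps to the same folder, regardless of who adds the entry or what the files are called.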

For the quality evaluation, simulations are connected to the available experimental C-H bond order parameters from NMR and x-ray scattering form factors, which are also included in the NMRlipids databank. The connection between a simulation and an experimental data set is made by the script when the molar concentrations of all molecules are within ±5 percentage units, charged lipids have the same counterions, and the temperature is within ±2 degrees. In such cases, the paths to the experimental data are added to the simulation README.yaml file.
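The matching criteria above can be sketched as a small function. This is only an illustration of the three conditions, not the databank's actual implementation: the dictionary keys (molar_percent, counterions, temperature) and the function name are hypothetical, and concentrations are assumed to be given in mole percent.

```python
def matches(simulation, experiment, conc_tol=5.0, temp_tol=2.0):
    """Return True if a simulation and an experiment describe comparable systems."""
    sim_conc = simulation["molar_percent"]
    exp_conc = experiment["molar_percent"]
    # Every molecule present in either system must agree within conc_tol;
    # a molecule missing from one system counts as 0 percent there.
    for molecule in set(sim_conc) | set(exp_conc):
        difference = abs(sim_conc.get(molecule, 0.0) - exp_conc.get(molecule, 0.0))
        if difference > conc_tol:
            return False
    # Charged lipids must be neutralised by the same counterions.
    if set(simulation["counterions"]) != set(experiment["counterions"]):
        return False
    # Temperatures must agree within temp_tol degrees.
    return abs(simulation["temperature"] - experiment["temperature"]) <= temp_tol
```

For example, a simulation with 80% POPC and 20% POPS at 298 K would be connected to an experiment with 83% POPC and 17% POPS at 300 K, provided both use the same counterions.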

Figure 2: Structure of the NMRlipids databank. Manually added input data (blue boxes) includes basic information on the simulation (more details from here), permanent links to the raw data, and experimental data if available. The databank entries (red box) and analysis results (green boxes) are automatically generated by the computer programs included in the NMRlipids databank (yellow boxes) and stored in here. Because raw data are not permanently stored in the NMRlipids databank but can be accessed based on the information in the databank, this connection is marked with a dashed line.

Analysing simulations in the NMRlipids databank. Because README.yaml files contain all the essential information about each simulation, including the permanent location of the raw data and a unique naming convention for all atoms and molecules (see below), arbitrary analyses can be performed automatically for all simulations in the NMRlipids databank. For example, the code that calculates all C-H bond order parameters of all systems first loops over all README.yaml files (i.e., simulations) in the NMRlipids databank, then downloads the raw simulation data to a local computer if needed, and then uses the information about the atom and molecule naming conventions in README.yaml and the mapping files to perform the desired analyses. A minimal example of an analysis code is available here. Results for order parameters, form factors, area per lipid and thickness are stored in the same locations as the README.yaml files. Further analyses can be conveniently stored in separate repositories with the same folder structure based on hash identities of trajectory and topology files, as done, for example, for the preparation of the NMRlipids databank manuscript.
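The first step of such an analysis, looping over all README.yaml files, can be sketched with the Python standard library alone. The function names find_readmes and analyse_all are hypothetical, and the analysis itself is passed in as a callable; parsing the YAML content and downloading raw data are left out and would need a YAML library and the databank's own tools.

```python
import os


def find_readmes(root):
    """Yield the path of every README.yaml file found below root."""
    for folder, _subfolders, files in os.walk(root):
        if "README.yaml" in files:
            yield os.path.join(folder, "README.yaml")


def analyse_all(root, analyse):
    """Apply `analyse` to each README.yaml path and collect results by folder."""
    return {os.path.dirname(path): analyse(path) for path in find_readmes(root)}
```

Keying the results by folder mirrors the convention described above, where analysis results live next to the README.yaml file of the simulation they belong to.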

Molecule and atom naming convention. Unique naming conventions for molecules and atoms are needed for automatic analyses over large sets of simulation data in the NMRlipids databank. Because such a convention was not available for lipids, we have generated mapping files (available here) that connect lipid atom names in each simulation to the universal atom names and universal abbreviations of lipid names for the NMRlipids databank (see the second table in here). For a new entry into the NMRlipids databank, the universal abbreviation for each lipid and the corresponding mapping file are given as input in the COMPOSITION dictionary. The numbers of each molecule in the simulation are then automatically calculated by the script (see figure 2) and stored in the COMPOSITION dictionary in README.yaml files. This information enables selection of any molecule or atom when analysing simulations in the NMRlipids databank.
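As a rough illustration of how a mapping file and the COMPOSITION dictionary work together, consider the sketch below. Everything specific in it is an assumption made for the example: the keys NAME, MAPPING and COUNT, the mapping file names, the atom names, and the two helper functions are hypothetical and may differ from the actual databank format.

```python
from collections import Counter

# Hypothetical mapping tables, keyed by mapping file name:
# universal atom name -> atom name used in this particular simulation.
MAPPINGS = {
    "mapping_POPC_example.txt": {"M_G1_M": "C3", "M_G2_M": "C2", "M_G3_M": "C1"},
}

# Hypothetical COMPOSITION input: universal abbreviation -> simulation-specific
# molecule name and the mapping file to use for that molecule.
COMPOSITION = {
    "POPC": {"NAME": "POPC", "MAPPING": "mapping_POPC_example.txt"},
    "SOL": {"NAME": "TIP3", "MAPPING": "mapping_water_example.txt"},
}


def simulation_atom_name(composition, mappings, molecule, universal_name):
    """Translate a universal atom name into the name used in the simulation."""
    mapping_file = composition[molecule]["MAPPING"]
    return mappings[mapping_file][universal_name]


def add_counts(composition, residue_names):
    """Return a copy of composition with the number of each molecule added."""
    counts = Counter(residue_names)
    return {
        universal: {**entry, "COUNT": counts[entry["NAME"]]}
        for universal, entry in composition.items()
    }
```

With this structure, an analysis can ask for an atom by its universal name and never needs to know which force field a given simulation used.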
