GNN Implicit Solvent Data: Acquisition & Formatting Guide

Nov 25, 2025 by Alex Johnson

This guide covers acquiring, formatting, and validating the training data published by the Riniker Lab for their GNNImplicitSolvent model, so that it can be used with our mlff-trainer pipeline while preserving data integrity. It walks through four phases, from data acquisition to validation, with detailed steps for each.

Phase 1: Data Acquisition - Downloading and Preparing the Data

Our first step is to acquire the data from the ETH Zurich Research Collection, where the Riniker Lab has published the raw dataset we need for training our models. Before using it, document the data license and any usage restrictions so that the project stays compliant. The dataset covers approximately 369,486 small molecules, each with up to 9 conformations, for around 3.2 million data points with mean solvent forces. It is available at DOI: 10.3929/ethz-b-000599309. Once you have the raw data, assess its total size so you can plan storage before the next phase. Also consider cloning the GNNImplicitSolvent repository, which contains reference scripts that help in understanding the structure and context of the data.

Detailed Steps for Data Acquisition

Download the Data: Start by downloading the dataset from the ETH Zurich Research Collection using the provided DOI (10.3929/ethz-b-000599309). This will give you the raw data in its original format.
Clone the GNNImplicitSolvent Repository: This repository contains scripts and documentation that can help you understand the data format and structure. It is a valuable resource for data analysis and conversion.
Document Data License and Usage Restrictions: Carefully review the data license to understand how you can use the data. Note any restrictions or requirements for proper use.
Assess Total Data Size and Storage Requirements: Estimate the total storage space required for the downloaded data and the processed data. This helps you plan your storage infrastructure.
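
A rough storage estimate can be sketched from the figures above. The molecule and conformation counts come from the dataset description; the average atom count and the float width are placeholder assumptions, not values from the dataset documentation, and should be replaced with numbers measured from the download.

```python
# Back-of-the-envelope storage estimate for the converted dataset.
# ASSUMPTIONS (not from the dataset documentation): the mean atom count
# per molecule and 4-byte floats. Replace both with measured values.

N_MOLECULES = 369_486        # from the dataset description
MAX_CONFORMATIONS = 9        # "up to 9" conformations per molecule
ASSUMED_MEAN_ATOMS = 30      # hypothetical placeholder
BYTES_PER_FLOAT = 4          # assumes float32 storage


def max_data_points(n_molecules: int, max_confs: int) -> int:
    """Upper bound on the number of conformations in the dataset."""
    return n_molecules * max_confs


def estimate_bytes(n_points: int, mean_atoms: int, bytes_per_float: int) -> int:
    """Bytes needed for positions (3 floats) and forces (3 floats) per atom."""
    floats_per_atom = 3 + 3
    return n_points * mean_atoms * floats_per_atom * bytes_per_float


upper_bound = max_data_points(N_MOLECULES, MAX_CONFORMATIONS)
size_gib = estimate_bytes(upper_bound, ASSUMED_MEAN_ATOMS, BYTES_PER_FLOAT) / 2**30
print(f"at most {upper_bound:,} data points, roughly {size_gib:.1f} GiB")
```

The upper bound of 369,486 x 9 = 3,325,374 conformations is consistent with the quoted figure of around 3.2 million, since not every molecule has all 9 conformations. Charges, Born radii, and the raw download itself add to the total.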

Phase 2: Data Format Analysis - Understanding the Structure

Next, we examine the original data format, which is likely NumPy/JSON generated from GROMACS output. The core of this phase is mapping the fields in the Riniker Lab's data to our HDF5 schema, the format the mlff-trainer pipeline requires. The fields to map are atomic positions, atomic numbers/species, partial charges, mean solvent forces, and Born radii (if available). Any field that is missing from the source must be identified and computed as needed, so that the pipeline interprets the data correctly. Consult Issue #45 (Research) for specific data format details.

Detailed Steps for Data Format Analysis

Analyze the Original Data Format: Examine the data to understand its structure. It is likely in NumPy/JSON format, potentially derived from GROMACS output.
Map Fields to Our HDF5 Schema:
- Atomic Positions: Direct mapping.
- Atomic Numbers/Species: Direct mapping.
- Partial Charges: A new field required for our model.
- Mean Solvent Forces: These will be your target labels.
- Born Radii: Add if available, as a new field.
Identify Missing Fields: Assess whether any fields must be computed to ensure all required fields are present and correctly formatted.
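
The missing-field check can be sketched as below. This is a minimal illustration that assumes each source record can be read into a dict keyed by the Riniker-side field names used in this guide; the actual file layout should be confirmed against Issue #45 and the repository scripts. The function name and the example record are hypothetical.

```python
# Report which expected fields a source record lacks.
# ASSUMPTION: records are dicts keyed by the Riniker-side field names
# listed in this guide; the real files may be organized differently.

REQUIRED_FIELDS = (
    "atom_positions",
    "atom_types",
    "partial_charges",
    "mean_solvent_forces",
)
OPTIONAL_FIELDS = ("born_radii",)  # included only if available


def find_missing_fields(record: dict) -> dict:
    """Return the missing required and optional field names for one record."""
    return {
        "required": [name for name in REQUIRED_FIELDS if name not in record],
        "optional": [name for name in OPTIONAL_FIELDS if name not in record],
    }


# Hypothetical single-atom record with no charges and no Born radii.
example = {
    "atom_positions": [[0.0, 0.0, 0.0]],
    "atom_types": ["O"],
    "mean_solvent_forces": [[0.1, -0.2, 0.0]],
}
report = find_missing_fields(example)
print(report)
```

Any name reported under "required" is a field that would have to be computed before conversion; names under "optional" can simply be omitted from the output.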

Phase 3: Data Conversion - Transforming the Data

In this phase we build a data loader for the Riniker format in src/mlff_distiller/data/riniker_loader.py. The loader is the bridge between the original data and our format: it implements the conversion to the DistillationDataset HDF5 format and runs validation checks during conversion to guarantee data integrity. Finally, generate train/val/test splits so the converted data is ready for the mlff-trainer pipeline.
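
As an illustration of the kind of validation check that could run during conversion, here is a minimal sketch. It assumes a converted record is a dict with the target field names from this guide and uses plain Python lists; the function name validate_record and the specific checks are hypothetical, not the actual contents of riniker_loader.py.

```python
import math


def validate_record(record: dict) -> list:
    """Return a list of problems found in one converted record (empty if OK).

    ASSUMPTION: the record uses the target names positions, atomic_numbers,
    charges, and forces, with one row or entry per atom.
    """
    problems = []
    n_atoms = len(record["atomic_numbers"])
    for name in ("positions", "forces"):
        rows = record[name]
        if len(rows) != n_atoms:
            problems.append(f"{name}: expected {n_atoms} rows, got {len(rows)}")
        elif any(len(row) != 3 for row in rows):
            problems.append(f"{name}: every row must have 3 components")
        elif not all(math.isfinite(x) for row in rows for x in row):
            problems.append(f"{name}: contains NaN or infinite values")
    if len(record["charges"]) != n_atoms:
        problems.append("charges: length does not match atom count")
    return problems


# Hypothetical three-atom record, plus a copy with a corrupted force entry.
good = {
    "atomic_numbers": [8, 1, 1],
    "positions": [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]],
    "forces": [[0.5, 0.0, 0.0], [-0.2, 0.0, 0.0], [-0.3, 0.0, 0.0]],
    "charges": [-0.8, 0.4, 0.4],
}
bad = dict(good, forces=[[float("nan"), 0.0, 0.0], [0.0] * 3, [0.0] * 3])

print(validate_record(good))
print(validate_record(bad))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a large conversion run and decide afterwards which records to drop.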

Detailed Steps for Data Conversion

Create a Data Loader: Develop a data loader for the Riniker format using src/mlff_distiller/data/riniker_loader.py.
Implement Conversion to HDF5: Convert the data to our DistillationDataset HDF5 format.
Add Validation Checks: Include checks to ensure data integrity during conversion.
Generate Train/Val/Test Splits: Create standard data splits to evaluate and train the model effectively.
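
One way to generate the splits is sketched below. Splitting by molecule rather than by conformation is a design choice assumed here, so that conformations of the same molecule never end up in different splits; the 80/10/10 fractions, the seed, and the function name are likewise illustrative assumptions rather than project settings.

```python
import random


def split_by_molecule(molecule_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each unique molecule ID to the train, val, or test split.

    ASSUMPTION: molecule_ids holds one ID per conformation, so duplicates
    mark conformations of the same molecule.
    """
    unique_ids = sorted(set(molecule_ids))      # sort for reproducibility
    random.Random(seed).shuffle(unique_ids)     # seeded, deterministic shuffle
    n_train = int(fractions[0] * len(unique_ids))
    n_val = int(fractions[1] * len(unique_ids))
    return {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],   # remainder goes to test
    }


# Hypothetical input: 10 molecules with 3 conformations each.
conformation_mol_ids = [f"mol{i}" for i in range(10) for _ in range(3)]
splits = split_by_molecule(conformation_mol_ids)
print({name: len(ids) for name, ids in splits.items()})
```

Each conformation is then routed to the split that contains its molecule ID. Fixing the seed keeps the splits stable across reruns of the conversion.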

Phase 4: Data Validation - Ensuring Accuracy and Integrity

In the final phase, we validate the processed data. Begin by computing key statistics, such as force distributions and molecular properties, to identify anomalies or inconsistencies. Then compare the data with our existing vacuum training data to check that the properties are consistent, and verify that force magnitudes are reasonable for solvation. Lastly, create visualizations of solvation force patterns to understand how the data behaves. Once these checks pass, the data is ready for training and further analysis.

Detailed Steps for Data Validation

Compute Statistics: Calculate force distributions and molecular properties to gain insights.
Compare with Vacuum Training Data: Contrast the data with existing vacuum training data to ensure consistency.
Verify Force Magnitudes: Ensure force magnitudes are realistic and align with solvation effects.
Create Visualizations: Generate visualizations of solvation force patterns to aid analysis.
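
The statistics and force-magnitude steps above can be sketched with the standard library alone. The example forces and the outlier threshold are made-up values for illustration; a sensible threshold depends on the units of the source data, which should be confirmed before setting one.

```python
import math
import statistics


def force_magnitudes(forces):
    """Per-atom Euclidean norms for a list of [fx, fy, fz] rows."""
    return [math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces]


def summarize(magnitudes):
    """Basic distribution statistics for a list of force magnitudes."""
    return {
        "mean": statistics.fmean(magnitudes),
        "std": statistics.pstdev(magnitudes),
        "max": max(magnitudes),
    }


def flag_outliers(magnitudes, threshold):
    """Indices of atoms whose force magnitude exceeds the threshold."""
    return [i for i, m in enumerate(magnitudes) if m > threshold]


# Hypothetical forces for three atoms (units as in the source data).
forces = [[3.0, 4.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
mags = force_magnitudes(forces)
stats = summarize(mags)
outliers = flag_outliers(mags, threshold=4.0)  # threshold is an assumption
print(stats, outliers)
```

Running the same summary on the vacuum training data gives a direct side-by-side comparison, and the list of magnitudes can be passed to any plotting tool to produce a histogram.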

Data Schema Mapping - Field by Field

Here’s a detailed mapping of fields from the Riniker format to our format, including notes on the conversion process:

Riniker Format	Our Format	Notes
atom_positions	positions	Direct mapping
atom_types	atomic_numbers	May require conversion to match our species representation.
partial_charges	charges	New field added for solvation effects.
mean_solvent_forces	forces	Target labels for training the model.
born_radii	born_radii	New field, included if available in the source data.
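
The mapping table can be applied to a single record roughly as follows. This sketch assumes that atom_types holds element symbols, which is not confirmed by the source, and the symbol table covers only a few common elements; if the source stores force-field atom types or integers instead, the conversion step needs a different lookup. The function name convert_record is hypothetical.

```python
# Riniker-side field name -> our field name, as in the table above.
FIELD_MAP = {
    "atom_positions": "positions",
    "atom_types": "atomic_numbers",
    "partial_charges": "charges",
    "mean_solvent_forces": "forces",
    "born_radii": "born_radii",
}

# ASSUMPTION: atom_types are element symbols. Partial table for illustration.
SYMBOL_TO_Z = {
    "H": 1, "C": 6, "N": 7, "O": 8, "F": 9,
    "P": 15, "S": 16, "Cl": 17, "Br": 35, "I": 53,
}


def convert_record(source: dict) -> dict:
    """Rename fields per FIELD_MAP and convert element symbols to numbers."""
    converted = {}
    for src_name, dst_name in FIELD_MAP.items():
        if src_name not in source:
            continue  # absent fields (e.g. optional born_radii) are skipped
        value = source[src_name]
        if src_name == "atom_types":
            # Raises KeyError for symbols outside the partial table.
            value = [SYMBOL_TO_Z[symbol] for symbol in value]
        converted[dst_name] = value
    return converted


# Hypothetical water-like record without Born radii.
source = {
    "atom_positions": [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]],
    "atom_types": ["O", "H", "H"],
    "partial_charges": [-0.8, 0.4, 0.4],
    "mean_solvent_forces": [[0.5, 0.0, 0.0], [-0.2, 0.0, 0.0], [-0.3, 0.0, 0.0]],
}
converted = convert_record(source)
print(sorted(converted))
```

Letting an unknown symbol raise a KeyError is deliberate in this sketch: a silent fallback would hide species the table does not cover.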

Dependencies and Deliverables - The Checklist

Dependencies

Issue #45 (Research): Essential for understanding detailed data formats.
Blocks Issue #49 (Phase A Training): Training depends on the successful completion of this data preparation.

Deliverables

data/implicit_solvent/raw/ - Contains the original, downloaded data.
data/implicit_solvent/processed/ - Includes the HDF5 formatted data.
src/mlff_distiller/data/riniker_loader.py - The data loader script.
Documentation in docs/data/IMPLICIT_SOLVENT_DATA.md - Comprehensive documentation that outlines the data acquisition and usage.

This project is assigned to the Data Pipeline Agent and is marked as High Priority because it directly affects the training phase. Following these steps produces high-quality data, which gives a solid foundation for training and for effective solvation modeling with the GNNImplicitSolvent model.

For additional information, see the official documentation from ETH Zurich.