Datasets from articles

For code repositories, see Codes and Computing.

Schrödinger equation dataset

Numerical solutions of the time-independent Schrödinger equation for an electron in 2D confining potentials, used to train deep convolutional neural networks. The dataset contains millions of solved instances across four classes of electrostatic potentials (simple harmonic oscillator, infinite well, double inverted Gaussians, and random potentials). Each entry includes the potential, ground-state wave function, and associated energies.

Size: ~700 GB total (411 HDF5 files across 5 ZIP archives); sample dataset of 8.5 GB also available
Format: HDF5 (.h5), compressed as ZIP
Paper: K. Mills, M. Spanner, I. Tamblyn, Deep learning and the Schrödinger equation, Physical Review A 96, 042113 (2017). DOI
Download: NRC Digital Repository
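
For the simple harmonic oscillator class the ground state is known in closed form, which gives a quick way to see how a potential, a wave function, and an energy relate on a grid. The sketch below is only an illustration under assumptions that do not come from the dataset: atomic units (hbar = m = 1), a hypothetical 61 x 61 grid, and arbitrary spring constants kx and ky. It evaluates the energy expectation value with a five-point finite-difference Laplacian and compares it with the analytic value 0.5 * (sqrt(kx) + sqrt(ky)).

```python
import math

def sho_ground_state_energy(kx, ky, half_width=6.0, n=61):
    """Estimate <psi|H|psi> for the analytic 2D SHO ground state on a grid.

    Assumes atomic units (hbar = m = 1) and V = 0.5 * (kx*x^2 + ky*y^2).
    Grid size and extent are illustrative choices, not dataset values.
    """
    h = 2.0 * half_width / (n - 1)            # grid spacing
    xs = [-half_width + i * h for i in range(n)]
    wx, wy = math.sqrt(kx), math.sqrt(ky)     # oscillator frequencies

    # Potential and unnormalized Gaussian ground state sampled on the grid.
    V = [[0.5 * (kx * x * x + ky * y * y) for y in xs] for x in xs]
    psi = [[math.exp(-0.5 * (wx * x * x + wy * y * y)) for y in xs] for x in xs]

    num = 0.0  # accumulates psi * (H psi)
    den = 0.0  # accumulates psi^2 (the cell area cancels in the ratio)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (psi[i + 1][j] + psi[i - 1][j] + psi[i][j + 1]
                   + psi[i][j - 1] - 4.0 * psi[i][j]) / (h * h)
            h_psi = -0.5 * lap + V[i][j] * psi[i][j]
            num += psi[i][j] * h_psi
            den += psi[i][j] ** 2
    return num / den

kx, ky = 1.0, 4.0
estimate = sho_ground_state_energy(kx, ky)
exact = 0.5 * (math.sqrt(kx) + math.sqrt(ky))
print(f"grid estimate = {estimate:.4f}, analytic = {exact:.4f}")
```

The small gap between the two numbers comes from the finite-difference approximation of the kinetic term and shrinks as the grid is refined.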

Deep learning and density functional theory

Self-consistent charge densities and energy components (correlation, exchange, external, kinetic, and total energies) for multielectron systems computed with three exchange-correlation functionals: LDA, PBE, and MGGA-Pittalis. Systems include 1, 2, 3, and 10 electrons in both simple harmonic oscillator (SHO) and random (RND) external potentials.

Size: SHO_10e.h5 is ~100 GB; smaller variants for 1, 2, and 3 electrons also available
Format: HDF5 (.h5)
Files: SHO_10e, SHO_3e, SHO_2e, SHO_1e, RND_10e, RND_3e, RND_2e, RND_1e
Paper: K. Ryczko, D.A. Strubbe, I. Tamblyn, Deep learning and density-functional theory, Physical Review A 100, 022512 (2019). DOI
Download: Google Drive | Command-line script: cli_download_from_gdrive.sh

Deep learning and high harmonic generation

Training data for deep neural networks applied to high harmonic generation (HHG). Contains time-dependent dipoles and spectra from reduced-dimensionality models of di- and triatomic systems, parameterized by laser pulse intensity, internuclear distance, and molecular orientation. Includes data for both the forward problem (molecular parameters to spectra) and the inverse problem (spectra to molecular parameters).

Format: Numerical data files with accompanying Jupyter notebooks (TensorFlow/Keras)
Paper: M. Lytova, M. Spanner, I. Tamblyn, Deep learning and high harmonic generation, Canadian Journal of Physics 101(3) (2022). DOI
Download: Google Drive | Code: GitHub
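
The link between a time-dependent dipole and a harmonic spectrum can be sketched with a plain discrete Fourier transform. The example below is an assumption-laden illustration, not the procedure used to produce this dataset: the dipole is a synthetic two-frequency signal, and the spectrum is taken as the squared magnitude of the transform of the raw dipole, whereas HHG calculations often use the dipole acceleration and a window function.

```python
import cmath
import math

def power_spectrum(signal):
    """Return |DFT|^2 for the non-negative frequency bins (naive O(N^2) DFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        total = 0j
        for t, value in enumerate(signal):
            total += value * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(total) ** 2)
    return spectrum

# Synthetic dipole: a fundamental with 8 cycles in the window plus a weak
# third harmonic. These values are hypothetical, chosen for illustration.
n_samples = 256
cycles = 8
dipole = [
    math.cos(2 * math.pi * cycles * t / n_samples)
    + 0.1 * math.cos(2 * math.pi * 3 * cycles * t / n_samples)
    for t in range(n_samples)
]
spectrum = power_spectrum(dipole)

# Bin k corresponds to harmonic order k / cycles of the driving frequency.
fundamental_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
ratio = spectrum[3 * cycles] / spectrum[cycles]
print(f"strongest bin: {fundamental_bin}, third/first harmonic power: {ratio:.4f}")
```

Because the third harmonic has one tenth of the fundamental's amplitude, its power is one hundredth of the fundamental's.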

Hyperspectral stimulated Raman microscopy

Hyperspectral stimulated Raman scattering (SRS) microscopy image data of a lithium ore sample. Used to demonstrate UHRED (Unsupervised Hyperspectral Resolution Enhancement and Denoising), an unsupervised deep learning method for automatic denoising that requires only a single hyperspectral image (one-shot, no labelled training data needed). Combined with k-means clustering, the method produces automatic chemical species maps.

Format: Hyperspectral image data (3D arrays: x, y, spectral channels)
Paper: P. Abdolghader, G. Resch, A. Ridsdale, T. Grammatikopoulos, F. Légaré, A. Stolow, A.F. Pegoraro, I. Tamblyn, Unsupervised Hyperspectral Stimulated Raman Microscopy Image Enhancement: De-Noising and Segmentation via One-Shot Deep Learning, Optics Express 29(21), 34205-34219 (2021). DOI
Download: Google Drive | Code: GitHub
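
The clustering step that turns a hyperspectral cube into a species map can be sketched as follows. This is a minimal illustration of k-means on per-pixel spectra only; it does not reproduce UHRED or its denoising, and the 4 x 4 x 3 cube, the two synthetic spectra, and the farthest-point seeding are all assumptions made for the example.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        farthest = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(farthest))

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each spectrum goes to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: each center becomes the mean spectrum of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy cube indexed as cube[x][y][channel]: two synthetic "species" with
# different spectra occupy the two halves of the image.
width, height = 4, 4
cube = [
    [
        [1.0 + 0.01 * y, 0.1, 0.1] if x < width // 2 else [0.1, 0.1, 1.0 + 0.01 * y]
        for y in range(height)
    ]
    for x in range(width)
]

# Flatten (x, y, channels) into a list of per-pixel spectra, cluster, reshape.
pixels = [cube[x][y] for x in range(width) for y in range(height)]
labels, _ = kmeans(pixels, k=2)
species_map = [labels[x * height:(x + 1) * height] for x in range(width)]
for row in species_map:
    print(row)
```

Each pixel is treated as one point in spectral space, so the spatial layout only re-enters when the labels are reshaped back into a map.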

Big graphene dataset

Over 500,000 density functional theory calculations of graphene structures (3.5 nm × 3.5 nm unit cell, 60-atom systems) with random structural defects, computed using the PBE functional in VASP. Each entry contains carbon atom coordinates and total energy values. Used to demonstrate that extensive deep neural networks (EDNNs) trained on small systems can predict total energies of larger systems in ~57 ms with DFT-level accuracy.

Size: ~3.7 GB compressed; 501,473 training files and 60,744 testing files
Format: HDF5 (.h5), distributed as .tar.gz
Paper: K. Mills, M. Spanner, I. Tamblyn, Extensive deep neural networks for transferring small scale learning to large scale systems, Chemical Science 10(15), 4129-4140 (2019). DOI
License: Open Government Licence - Canada / CC BY 2.0
Download: Google Drive | NRC Digital Repository
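
With several hundred thousand files in one .tar.gz, it can be useful to inspect the archive without extracting it. The sketch below streams through a gzip-compressed tar archive with the Python standard library and counts .h5 members per top-level directory. The internal layout of the real archive is not described here, so the directory names "train" and "test" and the demo archive built in the example are hypothetical.

```python
import io
import os
import tarfile
import tempfile
from collections import Counter

def count_h5_members(archive_path):
    """Count .h5 files per top-level directory without extracting anything."""
    counts = Counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:  # streams members one at a time
            if member.isfile() and member.name.endswith(".h5"):
                top_level = member.name.split("/", 1)[0]
                counts[top_level] += 1
    return counts

def make_demo_archive(path):
    """Write a tiny stand-in archive so the example runs without the dataset."""
    with tarfile.open(path, "w:gz") as archive:
        for name in ("train/a.h5", "train/b.h5", "test/c.h5", "README.txt"):
            data = b"placeholder"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tar.gz")
    make_demo_archive(demo_path)
    demo_counts = count_h5_members(demo_path)

print(dict(demo_counts))
```

On the real download, the same function would be called with the path to the downloaded archive in place of the demo file.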

Reinforcement learning environments

SubWorld

A procedurally generated nautical navigation environment for reinforcement learning in partially observable Markov decision processes (POMDPs). Agents navigate through water currents around islands with incomplete state information, choosing between state-modifying actions and information-revealing measurement actions. Demonstrates that dynamic programming with partial information can construct safe navigation policies while reducing measurement costs.

Paper: C. Beeler, X. Li, C. Bellinger, M. Crowley, M. Fraser, I. Tamblyn, Dynamic programming with incomplete information to overcome navigational uncertainty in POMDPs, Proceedings of the Canadian Conference on Artificial Intelligence (2024). arXiv
Code: GitHub
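
As background for the dynamic-programming idea, the sketch below runs ordinary value iteration on a tiny, fully observed grid with islands and a drifting current. It is a generic textbook illustration, not the SubWorld code or the method of the paper: it has no measurement actions or partial observability, and the grid, the current direction, and the drift probability are hypothetical.

```python
GRID = [
    "....",
    ".#..",
    ".#.G",
]  # "." water, "#" island, "G" goal
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
CURRENT = (1, 0)     # hypothetical current pushing the boat one cell south
DRIFT_PROB = 0.2     # chance that the current adds an extra push
STEP_COST = -1.0     # every move costs one unit until the goal is reached

def shift(cell, delta):
    """Move by delta; stay put if the target is off-grid or an island."""
    r, c = cell[0] + delta[0], cell[1] + delta[1]
    if 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != "#":
        return (r, c)
    return cell

def outcomes(cell, move):
    """Return [(probability, next_cell)] for one action."""
    moved = shift(cell, MOVES[move])
    if GRID[moved[0]][moved[1]] == "G":
        return [(1.0, moved)]  # the episode ends on arrival
    return [(1 - DRIFT_PROB, moved), (DRIFT_PROB, shift(moved, CURRENT))]

def action_value(values, cell, move):
    return STEP_COST + sum(p * values[nxt] for p, nxt in outcomes(cell, move))

def value_iteration(tol=1e-9):
    cells = [(r, c) for r, row in enumerate(GRID)
             for c, ch in enumerate(row) if ch != "#"]
    values = {cell: 0.0 for cell in cells}
    while True:
        delta = 0.0
        for cell in cells:
            if GRID[cell[0]][cell[1]] == "G":
                continue  # terminal state keeps value 0
            best = max(action_value(values, cell, m) for m in MOVES)
            delta = max(delta, abs(best - values[cell]))
            values[cell] = best
        if delta < tol:
            return values

values = value_iteration()
policy = {
    cell: max(MOVES, key=lambda m: action_value(values, cell, m))
    for cell in values if GRID[cell[0]][cell[1]] != "G"
}
print(policy[(0, 3)], round(values[(0, 3)], 3))
```

Each value is the negated expected number of steps to the goal under the greedy policy, so cells where the current helps score better than their grid distance alone would suggest.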

ChemGymRL

An open-source framework providing customizable virtual chemistry laboratory benches where RL agents learn to perform chemical synthesis and material discovery tasks. Built on the Gymnasium API, it includes reaction, extraction, distillation, and characterization benches, plus a Lab Manager for orchestrating multi-agent workflows.

Paper: C. Beeler, S.G. Subramanian, K. Sprague, et al., ChemGymRL: An Interactive Framework for Reinforcement Learning for Digital Chemistry, Digital Discovery (2024). DOI
Website: ChemGymRL.com | Code: GitHub
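
Because the benches follow the Gymnasium API, an agent interacts with them through the usual reset/step loop. The sketch below shows that loop against a toy stand-in class, so it runs without installing anything. ToyReactionBench, its temperature levels, and its reward are hypothetical and are not part of ChemGymRL; only the reset and step signatures mirror the Gymnasium convention.

```python
import random

class ToyReactionBench:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""
    TARGET = 5       # temperature level that maximizes the made-up yield
    MAX_STEPS = 10   # episode is truncated after this many steps

    def reset(self, seed=None, options=None):
        self._rng = random.Random(seed)
        self._temperature = self._rng.randint(0, 9)
        self._steps = 0
        return self._temperature, {}  # (observation, info)

    def step(self, action):
        # Action 1 raises the temperature by one level, anything else lowers it.
        change = 1 if action == 1 else -1
        self._temperature = min(9, max(0, self._temperature + change))
        self._steps += 1
        reward = -abs(self._temperature - self.TARGET)
        terminated = self._temperature == self.TARGET
        truncated = self._steps >= self.MAX_STEPS
        return self._temperature, reward, terminated, truncated, {}

def run_episode(env, policy, seed=0):
    """Run one episode and return the final observation and total reward."""
    obs, info = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total_reward += reward
        done = terminated or truncated
    return obs, total_reward

def heat_toward_target(temperature):
    return 1 if temperature < ToyReactionBench.TARGET else 0

final_obs, total = run_episode(ToyReactionBench(), heat_toward_target)
print(final_obs, total)
```

With the real framework, the stand-in would be replaced by an environment created from the ChemGymRL package; the bench identifiers and the observation and action spaces are documented on the project website.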

Heat engine gym

A collection of 11 OpenAI Gym environments modeling thermodynamic heat engines. Agents learn to discover optimal thermodynamic cycles (Carnot, Stirling, Otto) by performing isothermal, adiabatic, and irreversible processes. Agents trained with evolutionary RL discovered a previously unknown thermodynamic cycle.

Environments: Carnot-v0 through v4, Stirling-v0/v1, Otto-v0/v1, Beeler-v0/v1
Paper: C. Beeler, U. Yahorau, R. Coles, K. Mills, S. Whitelam, I. Tamblyn, Optimizing thermodynamic trajectories using evolutionary and gradient-based reinforcement learning, Physical Review E 104, 064128 (2021). DOI
Code: GitHub
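
As a reference point for what an agent is trying to reach, the sketch below computes the work and efficiency of an ideal-gas Carnot cycle from its two isothermal and two adiabatic steps. It is standard textbook thermodynamics and does not use the gym package; the temperatures, volumes, and monatomic-gas heat capacity ratio are assumptions chosen for the example, and the environments in the repository may model the engine differently.

```python
import math

R = 8.314462618  # molar gas constant in J/(mol K)

def carnot_cycle(t_hot, t_cold, v1, v2, n_moles=1.0, gamma=5.0 / 3.0):
    """Net work, heat absorbed, and efficiency of an ideal-gas Carnot cycle.

    v1 -> v2 is the isothermal expansion at t_hot. gamma = 5/3 assumes a
    monatomic ideal gas.
    """
    # Adiabats obey T * V**(gamma - 1) = const, which fixes the other volumes.
    ratio = (t_hot / t_cold) ** (1.0 / (gamma - 1.0))
    v3 = v2 * ratio  # volume at the end of the adiabatic expansion
    v4 = v1 * ratio  # volume at the start of the adiabatic compression

    q_hot = n_moles * R * t_hot * math.log(v2 / v1)    # heat absorbed at t_hot
    q_cold = n_moles * R * t_cold * math.log(v3 / v4)  # heat rejected at t_cold
    work = q_hot - q_cold  # the work of the two adiabatic steps cancels
    return work, q_hot, work / q_hot

work, q_hot, efficiency = carnot_cycle(t_hot=500.0, t_cold=300.0, v1=1.0e-3, v2=2.0e-3)
print(f"work = {work:.1f} J, efficiency = {efficiency:.3f}")
```

The efficiency reduces to 1 - t_cold / t_hot regardless of the chosen volumes, which is the upper bound that any cycle operating between the same two reservoirs can approach.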

Computational Laboratory for Energy And Nanoscience