Setting up a high-throughput infrastructure with AiiDA#
One of the main use cases of AiiDA is to set up powerful infrastructures for running calculations in high-throughput on e.g. a set of structures. A straightforward example from the condensed matter community here would be to calculate the band structure of said structures in two steps:
Optimize the geometry of the input structure.
Calculate the band structure of the optimized geometry.
There are many ways to use AiiDA to solve this problem!
In this blog post, we will use a collection of submission controllers to set up a command line interface for automatically calculating the band structure for all the structures in an AiiDA group using the aiida-quantumespresso
plugin.
All the code examples are gathered in a single repository:
Feel free to clone the repository and try to execute the controllers with your local setup.
Warning
This tutorial is aimed at more advanced users! While reading it can still be instructive for beginners, not that more basic concepts such as work chains, groups etc will not be explained.
aiida-submission-controller
#
As discussed in our previous blogpost, the submission controller is a tool for:
Taking care of keeping a certain number of workflows active at any time, without needing to monitor the runs.
Making sure you don’t run the same structure twice.
Unfortunately, the tool is currently largely undocumented, and users can only rely on two examples to help them understand how to use it.
Here, we’ll show how to set up two submission controllers for optimizing the geometry and calculating the band structure, respectively.
Let’s start by having a look at a submission controller for running the PwRelaxWorkChain
:
class PwRelaxSubmissionController(FromGroupSubmissionController):
"""SubmissionController to run PwRelaxWorkChains from a group of `StructureData` nodes."""
pw_code: str
"""The label of the `Code` to use for the `PwRelaxWorkChain`."""
overrides: dict = {}
"""A dictionary of overrides to pass to `PwRelaxWorkChain.get_builder_from_protocol()`."""
def get_inputs_and_processclass_from_extras(self, extras_values):
parent_node = self.get_parent_node_from_extras(extras_values)
builder = PwRelaxWorkChain.get_builder_from_protocol(
code=orm.load_code(self.pw_code),
structure=parent_node,
overrides=self.overrides,
protocol='fast'
)
return builder
In order to write the submission controller, you have to do three things:
Inherit from the
FromGroupSubmissionController
. This means the submission controller will define a “parent” group whose nodes it will use to submit work chains. TheFromGroupSubmissionController
also already defines some basic inputs, which we’ll get to later.Define additional inputs, in this case the
pw.x
code from Quantum ESPRESSO (pw_code
) and theoverrides
dictionary, which will allow us to set different settings than the protocol defaults.Tell the submission controller how to construct the builder based on the extras in the
get_inputs_and_processclass_from_extras
method. In this case, we simply extract theparent_node
based on the extras and use theget_builder_from_protocol
method of thePwRelaxWorkChain
to get a fully populated builder.
Running the controller#
Now that we have defined the controller class, let’s use it to submit our first batch of work chains.
If you’re running the high-throughput-blog
example, you can use ./cli.py init
to set up the basic groups:
structures
, which is automatically populated with some basicStructureData
nodes for us to run.the work chain groups,
workchain/relax
andworkchain/bands
, in which we’ll be storing the work chain nodes.
First we initialise the PwRelaxSubmissionController
:
relax_controller = PwRelaxSubmissionController(
group_label='workchain/relax',
parent_group_label='structures',
pw_code='pw@localhost',
max_concurrent=2,
)
Let’s go over the various inputs:
group_label
defines the group in which the work chains will be stored.parent_group_label
is the label of the parent group, from whichStructureData
nodes will be taken to run thePwRelaxWorkChain
for.pw_code
is the label of the installedpw.x
code of Quantum ESPRESSO.max_concurrent
is the maximum number of work chains that will run at any time.
In order to submit the work chains and store them in the workchain/relax
group, we can use the submit_new_batch
command:
relax_controller.submit_new_batch(verbose=True)
By using the verbose=True
setting, the submission controller will print the current status of the submissions:
Status
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┓
┃ Total ┃ Submitted ┃ Left to run ┃ Max active ┃ Active ┃ Available ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━┩
│ 5 │ 0 │ 5 │ 2 │ 0 │ 2 │
└───────┴───────────┴─────────────┴────────────┴────────┴───────────┘
We can see that there are 5 structures in total to be run, and none have been submitted.
Since the max_concurrent
setting is set to none, but no work chains are currently active, the submission controller will proceed with submitting two work chains:
Info: 🚀 Submitting 2 new workchains!
Report: Submitted work chain <uuid: 9e461a90-2f38-4a6b-b44a-8a1447d9908a (pk: 73007) (aiida.workflows:quantumespresso.pw.relax)> for extras <('519d8e113651f745c6906a11e749f3e32fbcca820f47faca6f455dc815d0dc14',)>.
Report: Submitted work chain <uuid: 395849ff-8522-46d5-b265-a6566d95d197 (pk: 73023) (aiida.workflows:quantumespresso.pw.relax)> for extras <('c9f1c1ea824b2be42ad61ce4d6505b9b2bd03b77c4f863c0c9d54fe29260b318',)>.
Great, our first PwRelaxChain
s are in progress!
Since now two work chains are already running, rerunning the command should lead to no new submissions:
Status
┏━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┓
┃ Total ┃ Submitted ┃ Left to run ┃ Max active ┃ Active ┃ Available ┃
┡━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━┩
│ 5 │ 2 │ 3 │ 2 │ 2 │ 0 │
└───────┴───────────┴─────────────┴────────────┴────────┴───────────┘
Info: 😴 Nothing to submit.
If we wait a while, some of the work chains running will finish.
At this point, executing the submit_new_batch
method will submit more PwRelaxChain
work chains, but it will skip structures that have already been run.
Note
Looking at the report on the submissions above, we can see that the extras used by the submission controller are rather bulky (e.g. 519d8e113651f745c6906a11e749f3e32fbcca820f47faca6f455dc815d0dc14
).
This is because by default, the FromGroupSubmissionController
will use the AiiDA hash of the parent node to keep track of which work chains have already been run.
Using a more descriptive hash, like a unique identifier from a source database, can be quite useful when you are managing large project and want to quickly find the work chain corresponding to a certain structure.
Combining submission controllers into a “pipeline”#
Now that we understand how to run a single submission controller, let’s also set one up for the PwBandsWorkChain
and combine them into a single “pipeline”.
The PwBandsSubmissionController
is very similar to the one for the geometry optimizations, we simply add the following line in the get_inputs_and_processclass_from_extras
method:
structure = parent_node.outputs.output_structure
To extract the output structure of the PwRelaxWorkChain
, which is then passed to the get_builder_from_protocol
method of the PwBandsWorkChain
.
Combining the two submission controllers is pretty straightforward: simply initialise one after the other and submit a new batch of work chains for each.
The cli.py
file in the high-throughput-blog
example defines the following run
command:
def run():
while True:
relax_controller = PwRelaxSubmissionController(
group_label='workchain/relax',
parent_group_label='structures',
pw_code='pw@localhost',
max_concurrent=2,
)
relax_controller.submit_new_batch(verbose=True)
bands_controller = PwBandsSubmissionController(
group_label='workchain/bands',
parent_group_label='workchain/relax',
pw_code='pw@localhost',
max_concurrent=1,
filters={"attributes.exit_status": 0},
)
bands_controller.submit_new_batch(verbose=True)
time.sleep(30)
In order to keep the submissions running in the background, the submission code is simply wrapped in a while
loop with a sleep
at the end.
One more thing to note here is the following line in the initialisation of the PwBandsSubmissionController
:
filters={"attributes.exit_status": 0}
By default, the submission controller will try to submit the work chain for all the nodes in the parent group that haven’t been run yet.
However, in case the parent node is a work chain that is still running, it won’t have the necessary output_structure
to submit the band structure work chain with.
Hence, the submission will fail in that case.
In the end the submission controller will just skip that one and keep submitting, but it’s cleaner to immediately filter for parent work chains that have completed successfully.
Final notes#
Hopefully you now understand a bit better how submission controllers work, and how you can combine them to set up a pipeline for your high-throughput research project.
Note that the aiida-submission-controller
package is currently still heavily under development, and there might still be breaking changes in the (near) future!
The documentation is also still severely lacking, which this blog post tries to alleviate to some extent.
Hopefully we’ll soon have more time to polish the package, I’ll make sure to keep this blog post updated in case we do!