Work Package 2:

High-throughput data handling and vetoing


Q&A with Dr David Pennicard, HIR3X WP co-lead on “High-throughput data handling and vetoing”

Q: The description of this work package says openly that x-ray FEL facilities are often unable to cope with the extremely high data rates coming from advanced x-ray imagers and extremely bright x-ray sources. Just how big is this problem?

A: Even just a decade ago, the data produced by scans using synchrotron light could still be taken home on a portable hard drive. But more recently the data volumes have become so large that there's a much greater need to process the data in big computing centres on site. That also goes for the data generated by x-ray FELs.

Light sources like FELs and synchrotrons are improving incredibly fast. The ability of each generation of these machines to focus a beam’s photons on a sample has increased exponentially, even outpacing Moore’s Law.

The sheer cost of storing and processing this data becomes a significant challenge. In HIR3X, I’m working in the group developing better x-ray detectors, so in some way we’re the source of the problem! Of course, larger amounts of data ultimately allow people to solve new, important scientific questions.

I got involved in data processing because it's important for detector developers to be aware of these problems and to contribute to the early stages of data handling, to reduce the workload on computing centres.

The proposal behind HIR3X is that handling the large amounts of data produced by this generation of experiments is a big task. One step in that task is to do data reduction as early as possible, to reduce the workload for later stages of processing.

Q: DESY and SLAC developed SpeckleNN and MP-FAST to achieve this. Can you explain in general what these programs do?

A: Our group worked on MP-FAST. Each program is designed for a different type of biology-based experiment, though both aim to throw away clearly bad data.

There are two sorts of situations where machine learning can be useful. One situation is where humans can do a task – like distinguishing photos of dogs and cats, for example – but we haven’t been able to develop an algorithm for it. Here, machine learning makes it possible for a computer to learn from examples provided by humans. Another situation is when we have an algorithm that can solve a task, but it requires huge amounts of number crunching. In that case, AI can take examples where the task has been solved by the computationally expensive algorithm, and find a cheaper shortcut to the same result.
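That second situation can be sketched in a few lines. In this toy example (not MP-FAST itself, and all names and numbers are illustrative assumptions), an "expensive" labelling algorithm is run once on a grid of training points, and a cheap nearest-neighbour lookup over those solved examples then stands in for it:

```python
import math

# The "expensive" algorithm here is just a stand-in (an exact geometric
# test); in practice it would be a costly analysis step. The cheap
# shortcut is a 1-nearest-neighbour lookup over already-solved examples.

def expensive_label(x, y):
    """Stand-in for a costly analysis: is (x, y) inside the unit disc?"""
    return math.hypot(x, y) <= 1.0

# "Training": run the expensive algorithm once on a coarse grid of points.
grid = [(-1.5 + 0.1 * i, -1.5 + 0.1 * j) for i in range(31) for j in range(31)]
examples = [(p, expensive_label(*p)) for p in grid]

def cheap_label(x, y):
    """Shortcut: copy the label of the nearest already-solved example."""
    (px, py), label = min(examples,
                          key=lambda e: (e[0][0] - x) ** 2 + (e[0][1] - y) ** 2)
    return label

print(cheap_label(0.1, 0.2))   # True, agrees with the expensive test
print(cheap_label(1.4, 1.4))   # False, agrees with the expensive test
```

A real system would use a trained neural network rather than a lookup, but the pattern is the same: pay for the exact algorithm once on training examples, then answer new cases cheaply.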

MP-FAST works in an area where there are already more traditional, non-machine-learning ways of solving the same problem. We wanted to see if we could do it in a computationally cheaper way, so that as these light sources improve, we can still handle the higher rates of data.

Q: How do you apply these to experiments?

A: In this particular case, the application of MP-FAST is for what’s called “serial femtosecond crystallography.” This is an experiment looking at proteins, which are the building blocks of cells; for example, this technique has been used to understand how some bacteria resist antibiotics by producing proteins which break the antibiotics down. We're interested in the protein itself, and we grow crystals from it to boost our signal: lining up lots of protein molecules together gives us a larger signal. These are very small crystals, so they're not very demanding to grow; free electron lasers make it much easier to examine proteins that don't crystallise so easily, such as the proteins that form the walls of our cells and play an important role in whether things like viruses or medications are able to enter them.

In this sort of experiment, the samples you look at are, in effect, identical to one another. You also need to scan many samples, because each individual sample only produces a relatively weak signal. This means that after scanning all the samples, you need to combine information from them to get a clear enough picture.
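The benefit of combining many weak, identical snapshots can be seen in a minimal sketch (the signal and noise levels here are arbitrary assumptions, not real experimental values): averaging N noisy shots shrinks the noise by roughly 1/sqrt(N).

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_SIGNAL = 1.0   # arbitrary units; an assumption for this sketch
NOISE_SIGMA = 5.0   # per-shot noise much larger than the signal

def one_shot():
    """One weak measurement: the true signal buried in noise."""
    return TRUE_SIGNAL + random.gauss(0.0, NOISE_SIGMA)

def averaged(n_shots):
    """Combine many shots; noise averages down by ~1/sqrt(n_shots)."""
    return sum(one_shot() for _ in range(n_shots)) / n_shots

single = one_shot()
combined = averaged(10_000)
print(abs(single - TRUE_SIGNAL))    # typically of order NOISE_SIGMA
print(abs(combined - TRUE_SIGNAL))  # typically ~100x smaller
```

This is why the experiment needs so many samples: any one snapshot is dominated by noise, but the combination of thousands is not.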

Another factor is that when you take one of these crystals and do an experiment on it with a single x-ray pulse, you're effectively looking at your sample from only one direction. Now, when we do the experiment, different crystals have different orientations, so by combining different snapshots we get all the information we need to find the full structure. But putting this all together, you can see how complicated it can be to get good information.

Another complicating factor is that a high proportion of the time, the x-ray pulse doesn’t manage to hit a protein crystal, or only partially hits one, and we don’t get a good image. So, we want to be able to reject these bad images, ideally without wasting a lot of computing power trying to analyse them. There are already a few rules of thumb used to sort good data from bad. For example, the unique diffraction patterns you get from hitting the protein crystal usually have a distinctive pattern of spots.

The approach we used combined techniques from computer vision to do a relatively computationally cheap check to look for features in the image. These features are the spots that indicate whether it's a good pattern or not.

But this computationally cheap method might be less robust than some existing methods for deciding what's a real diffraction spot and what isn't. So then we used a neural network to make the decision about whether it was a good enough image or not.

I think this is a good demonstration that tasks previously done by a non-machine-learning algorithm can also be done by machine learning methods, achieving a computational shortcut.

Q: Why did you choose to develop for this application?

A: For us it was an example of how to greatly reduce data, because a large fraction of the scanned images aren't useful. In these sorts of experiments, you only hit protein crystals a small fraction of the time, so many images are simply bad and can be thrown away. It was an area where we felt there was a lot of potential for MP-FAST to help.

It was also practically useful, since XFELs have a long history of being used in molecular biology research and we have strong expertise as a group at DESY, so it plays to our strengths.
