Deal With Large Data Size for K2 Movie Stack

Chen Xu

$BrandeisEM: 2015-06-14 08:11:10 emdoc-xml/en_US.ISO8859-1/articles/deal-with-k2-movie-data-size/article.xml xuchen Exp $

When we collect many dose-fractionation series, also called movie stacks, we have to consider the amount of data involved. In super-resolution mode, a single exposure can produce a stack of a few GB. Keeping that much data in long-term storage is a challenge for any of us, and even transferring it from the K2 computer to other devices can take a significant amount of time.

Therefore, our main goal is to reduce the data size as much as possible without losing information. SerialEM has implemented some features to help with this. In this document, I would like to give you an example of how to use them.

Most of the information in this document can be found in the SerialEM help file sections on Direct Detectors.

You can also get a PDF version of this document here.


Table of Contents
1 Background Information
2 Packing and Compression
3 Keep Gatan Software Gain Reference File
4 Post-Processing: Decompress, Unpack and Apply Gain Reference

1 Background Information

For a Super-resolution exposure, the subframe output AFTER the hardware processors is in 4-bit unsigned integer format. If it is passed to a software layer such as DigitalMicrograph, this 4-bit integer data is first converted into 32-bit floating point, and a software gain reference, which is also stored as 32-bit floats, is then applied. For a Counted mode image, the subframe output from the hardware processors is 16-bit integer. It also has to be converted into floats before the software gain reference is applied, as in the Super-resolution case.

As you can see, in both cases this process not only consumes a significant amount of memory but also generates a larger dataset, because the frames end up in 32-bit floating point format: eight times larger than the 4-bit data and twice as large as the 16-bit data.
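To put rough numbers on this, the following is a minimal sketch of the arithmetic. The frame dimensions (7676 x 7420 pixels for a super-resolution frame) and the count of 40 frames per stack are assumptions for illustration; neither is stated in this document.

```python
# Assumed values for illustration only (not taken from this document).
WIDTH, HEIGHT = 7676, 7420   # assumed super-resolution frame size in pixels
N_FRAMES = 40                # hypothetical number of frames in one stack


def frame_bytes(bits_per_pixel, width=WIDTH, height=HEIGHT):
    """Return the storage needed for one frame, in bytes."""
    return width * height * bits_per_pixel // 8


def stack_gigabytes(bits_per_pixel, n_frames=N_FRAMES):
    """Return the storage needed for a whole stack, in GB (10**9 bytes)."""
    return frame_bytes(bits_per_pixel) * n_frames / 1e9


for label, bits in [("4-bit integer", 4), ("16-bit integer", 16), ("32-bit float", 32)]:
    print(f"{label:>14}: {frame_bytes(bits) / 1e6:8.1f} MB per frame, "
          f"{stack_gigabytes(bits):6.2f} GB per stack")
```

Under these assumptions, the same stack stored as 32-bit floats takes eight times the space of the 4-bit data coming out of the hardware.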

It is worth mentioning that 4-bit unsigned integer means every pixel value in a frame lies within the range 0-15. We therefore have to set our imaging conditions accordingly. For example, if the beam has a dose rate of 10 electrons per physical pixel per second and we use a frame time of 1.5 seconds or more, we could reach the limit of 15. In that case the pixel values would overflow and we would lose information. Although we are very unlikely to ever use such conditions, we should be aware of this limitation.
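The check described above can be sketched as a small calculation. The function names here are hypothetical, and the estimate uses only the average count per pixel; because electron counts fluctuate from pixel to pixel, individual pixels can exceed the average, so a real safety margin should be larger.

```python
MAX_4BIT = 15  # largest value a 4-bit unsigned integer can hold


def expected_counts(dose_rate, frame_time):
    """Average electrons per pixel in one frame.

    dose_rate  -- electrons per pixel per second
    frame_time -- seconds per frame
    """
    return dose_rate * frame_time


def may_overflow(dose_rate, frame_time, limit=MAX_4BIT):
    """True if the average count per frame already reaches the 4-bit limit."""
    return expected_counts(dose_rate, frame_time) >= limit


# The example from the text: 10 electrons/pixel/s with a 1.5 s frame time.
print(expected_counts(10, 1.5), may_overflow(10, 1.5))
# A shorter, more typical frame time (hypothetical value) stays far below the limit.
print(expected_counts(10, 0.2), may_overflow(10, 0.2))
```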

The real information is already in these 4-bit or 16-bit integer frames; the software gain reference still has to be applied, but only for normalization. For K2 Counted and Super-resolution images, the dose rate is low, usually around 10 electrons per pixel per second or even lower. An image taken at such a low dose rate contains a lot of zeros (~50%), and its mean value is usually just above 1. This kind of image is well suited to lossless compression algorithms such as LZW and ZIP.
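To illustrate why such sparse integer frames compress well, here is a minimal sketch using Python's standard zlib module (DEFLATE, the method used by ZIP). It is not what SerialEM does internally. The frame is synthetic: counts are drawn from a Poisson distribution whose mean of 0.7 is an assumption chosen to give roughly 50% zeros, and each pixel is stored as one byte for simplicity.

```python
import math
import random
import zlib


def poisson(mean, rng):
    """Draw one Poisson-distributed count (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


rng = random.Random(0)          # fixed seed so the run is repeatable
n_pixels = 256 * 256            # small synthetic frame, one byte per pixel
frame = bytes(min(poisson(0.7, rng), 15) for _ in range(n_pixels))

compressed = zlib.compress(frame, 9)
restored = zlib.decompress(compressed)
zero_fraction = frame.count(0) / n_pixels

print(f"zeros: {zero_fraction:.1%}")
print(f"raw: {len(frame)} bytes, compressed: {len(compressed)} bytes")
print("lossless round trip:", restored == frame)
```

The compression is lossless: decompressing returns exactly the original bytes, so nothing is lost by storing the frames this way.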

So the idea is NOT to apply the software gain reference during data collection. Instead, the gain reference file is saved and applied later, as a post-processing step.
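Conceptually, the deferred normalization is a simple per-pixel operation. The sketch below assumes the gain reference is applied by multiplying each raw count by the corresponding gain value; the function name and all the numbers are hypothetical, and real frames would of course be read from the saved files.

```python
def apply_gain(raw_frame, gain_ref):
    """Normalize an integer frame with a gain reference of the same size.

    Assumption: normalization is a pixel-by-pixel multiplication.
    """
    if len(raw_frame) != len(gain_ref):
        raise ValueError("frame and gain reference must have the same size")
    return [float(count) * gain for count, gain in zip(raw_frame, gain_ref)]


# Hypothetical six-pixel frame saved unnormalized during data collection.
raw_frame = [0, 1, 0, 2, 1, 0]
# Hypothetical gain reference saved alongside it.
gain_ref = [1.0, 0.5, 1.25, 1.5, 2.0, 0.75]

# Applied later, during post-processing.
normalized = apply_gain(raw_frame, gain_ref)
print(normalized)
```

Only the normalized result needs floating point storage; the frames kept on disk stay as small integers.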

For these unnormalized, integer images, SerialEM tries to reduce the file size by Packing and Compression.