
Data generator

DataGenerator

Bases: Sequence

Infinite Data Generator which automatically creates batches from a list of samples.

The created batches are model ready. This generator can be supplied directly to the NeuralNetwork train() & predict() functions (it is also compatible with the tensorflow.keras Model fit() & predict() functions).

The DataGenerator is the second of the three pillars of AUCMEDI.

Pillars of AUCMEDI

  - input_interface
  - DataGenerator
  - NeuralNetwork

The DataGenerator can be used for training and validation as well as for prediction.

Example

```python
# Import
from aucmedi import *

# Initialize model
model = NeuralNetwork(
    n_labels=8,
    channels=3,
    architecture="2D.ResNet50"
)

# Do some training
datagen_train = DataGenerator(
    samples=samples[:100],
    path_imagedir="images_dir/",
    image_format=image_format,
    labels=class_ohe[:100],
    resize=model.meta_input,
    standardize_mode=model.meta_standardize
)

model.train(datagen_train, epochs=50)

# Do some predictions
datagen_test = DataGenerator(
    samples=samples[100:150],
    path_imagedir="images_dir/",
    image_format=image_format,
    labels=None,
    resize=model.meta_input,
    standardize_mode=model.meta_standardize
)

preds = model.predict(datagen_test)
```

It supports real-time batch generation as well as preprocessing all images beforehand, which are then temporarily stored on disk (this requires sufficient disk space!).
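For instance, on-disk preparation is enabled via the documented `prepare_images` and `workers` parameters. A minimal sketch, reusing the variables from the example above:

```python
# Preprocess all images once and cache them in a temporary directory on disk;
# only augmentation and standardization are still applied at runtime
datagen_train = DataGenerator(
    samples=samples[:100],
    path_imagedir="images_dir/",
    image_format=image_format,
    labels=class_ohe[:100],
    resize=model.meta_input,
    standardize_mode=model.meta_standardize,
    prepare_images=True,   # cache preprocessed images to disk
    workers=8              # multi-threaded preprocessing
)
```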

The resulting batches are created based on the following pipeline (a configuration sketch follows the list):

  1. Image Loading
  2. Application of Subfunctions
  3. Resize image
  4. Application of Data Augmentation
  5. Standardize image
  6. Stacking processed images to a batch
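As a sketch of how steps 2 and 4 are configured, assuming the `Padding` Subfunction and the `ImageAugmentation` class from AUCMEDI (treat these exact names and arguments as assumptions rather than a fixed recipe):

```python
from aucmedi import DataGenerator, ImageAugmentation
from aucmedi.data_processing.subfunctions import Padding

# Step 2: Subfunctions are executed sequentially on each loaded image
# Step 4: the data augmentation instance is applied after resizing
datagen = DataGenerator(
    samples=samples,
    path_imagedir="images_dir/",
    image_format=image_format,
    labels=class_ohe,
    subfunctions=[Padding(mode="square")],               # illustrative Subfunction
    data_aug=ImageAugmentation(flip=True, rotate=True),  # illustrative augmentation
    resize=(224, 224),
    standardize_mode="z-score"
)
```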
Warning

When instantiating a DataGenerator, it is highly recommended to pass the image_format parameter provided by the input_interface() and the resize & standardize_mode parameters provided by the NeuralNetwork class attributes meta_input & meta_standardize.

This ensures that the samples have the expected file extension, input shape, and standardization.
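A minimal sketch of this wiring, assuming `input_interface()` returns the tuple `(samples, class_ohe, n_classes, class_names, image_format)` as described in the AUCMEDI IO documentation:

```python
from aucmedi import input_interface, NeuralNetwork, DataGenerator

# Pillar 1: obtain samples, labels and the detected image format
ds = input_interface(interface="directory", path_imagedir="images_dir/")
(samples, class_ohe, n_classes, class_names, image_format) = ds

# Pillar 3: the model defines the expected input shape and standardization
model = NeuralNetwork(n_labels=n_classes, channels=3,
                      architecture="2D.ResNet50")

# Pillar 2: pass all three pieces of information to the DataGenerator
datagen = DataGenerator(samples, "images_dir/",
                        image_format=image_format,
                        labels=class_ohe,
                        resize=model.meta_input,
                        standardize_mode=model.meta_standardize)
```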

Built on top of the library

Tensorflow.Keras Iterator: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/Iterator

Example: How to integrate metadata in AUCMEDI?

```python
from aucmedi import *
import numpy as np

my_metadata = np.random.rand(len(samples), 10)

my_model = NeuralNetwork(n_labels=8, channels=3, architecture="2D.DenseNet121",
                         meta_variables=10)

my_dg = DataGenerator(samples, "images_dir/",
                      labels=None, metadata=my_metadata,
                      resize=my_model.meta_input,                  # (224,224)
                      standardize_mode=my_model.meta_standardize)  # "torch"
```
Source code in aucmedi/data_processing/data_generator.py
class DataGenerator(Sequence):
    """ Infinite Data Generator which automatically creates batches from a list of samples.

    The created batches are model ready. This generator can be supplied directly
    to the [NeuralNetwork][aucmedi.neural_network.model.NeuralNetwork] train() & predict()
    functions (it is also compatible with the tensorflow.keras Model fit() & predict() functions).

    The DataGenerator is the second of the three pillars of AUCMEDI.

    ??? info "Pillars of AUCMEDI"
        - [aucmedi.data_processing.io_data.input_interface][]
        - [aucmedi.data_processing.data_generator.DataGenerator][]
        - [aucmedi.neural_network.model.NeuralNetwork][]

    The DataGenerator can be used for training and validation as well as for prediction.

    ???+ example
        ```python
        # Import
        from aucmedi import *

        # Initialize model
        model = NeuralNetwork(
            n_labels=8,
            channels=3,
            architecture="2D.ResNet50"
        )

        # Do some training
        datagen_train = DataGenerator(
            samples=samples[:100],
            path_imagedir="images_dir/",
            image_format=image_format,
            labels=class_ohe[:100],
            resize=model.meta_input,
            standardize_mode=model.meta_standardize
        )

        model.train(datagen_train, epochs=50)

        # Do some predictions
        datagen_test = DataGenerator(
            samples=samples[100:150],
            path_imagedir="images_dir/",
            image_format=image_format,
            labels=None,
            resize=model.meta_input,
            standardize_mode=model.meta_standardize
        )

        preds = model.predict(datagen_test)
        ```

    It supports real-time batch generation as well as preprocessing all images
    beforehand, which are then temporarily stored on disk (this requires sufficient disk space!).

    The resulting batches are created based on the following pipeline:

    1. Image Loading
    2. Application of Subfunctions
    3. Resize image
    4. Application of Data Augmentation
    5. Standardize image
    6. Stacking processed images to a batch

    ???+ warning
        When instantiating a `DataGenerator`, it is highly recommended to pass the `image_format` parameter provided
        by the `input_interface()` and the `resize` & `standardize_mode` parameters provided by the
        `NeuralNetwork` class attributes `meta_input` & `meta_standardize`.

        This ensures that the samples have the expected file extension, input shape, and standardization.

    ???+ abstract "Built on top of the library"
        Tensorflow.Keras Iterator: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/Iterator

    ??? example "Example: How to integrate metadata in AUCMEDI?"
        ```python
        from aucmedi import *
        import numpy as np

        my_metadata = np.random.rand(len(samples), 10)

        my_model = NeuralNetwork(n_labels=8, channels=3, architecture="2D.DenseNet121",
                                  meta_variables=10)

        my_dg = DataGenerator(samples, "images_dir/",
                              labels=None, metadata=my_metadata,
                              resize=my_model.meta_input,                  # (224,224)
                              standardize_mode=my_model.meta_standardize)  # "torch"
        ```
    """
    #-----------------------------------------------------#
    #                    Initialization                   #
    #-----------------------------------------------------#
    def __init__(self, samples, path_imagedir, labels=None, metadata=None,
                 image_format=None, subfunctions=[], batch_size=32,
                 resize=(224, 224), standardize_mode="z-score", data_aug=None,
                 shuffle=False, grayscale=False, sample_weights=None, workers=1,
                 prepare_images=False, loader=image_loader, seed=None,
                 **kwargs):
        """ Initialization function of the DataGenerator which acts as a configuration hub.

        If using for prediction, the 'labels' parameter has to be `None`.

        For more information on Subfunctions, read here: [aucmedi.data_processing.subfunctions][].

        Data augmentation is applied even for prediction if a DataAugmentation object is provided!

        ???+ warning
            Augmentation should only be applied to a **training** DataGenerator!

            For test-time augmentation, [aucmedi.ensemble.augmenting][] should be used.

        Passing `None` as `resize` results in no image resizing. Default: (224, 224).

        ???+ info "IO_loader Functions"
            | Interface                                                        | Description                                  |
            | ---------------------------------------------------------------- | -------------------------------------------- |
            | [image_loader()][aucmedi.data_processing.io_loader.image_loader] | Image Loader for image loading via Pillow. |
            | [sitk_loader()][aucmedi.data_processing.io_loader.sitk_loader]   | SimpleITK Loader for loading NIfTI (nii) or Metafile (mha) formats.    |
            | [numpy_loader()][aucmedi.data_processing.io_loader.numpy_loader] | NumPy Loader for image loading of .npy files.    |
            | [cache_loader()][aucmedi.data_processing.io_loader.cache_loader] | Cache Loader for passing already loaded images. |

            More information on IO_loader functions can be found here: [aucmedi.data_processing.io_loader][]. <br>
            Parameters defined in `**kwargs` are passed down to IO_loader functions.

        Args:
            samples (list of str):              List of samples (indices) encoded as strings. Provided by
                                                [input_interface][aucmedi.data_processing.io_data.input_interface].
            path_imagedir (str):                Path to the directory containing the images.
            labels (numpy.ndarray):             Classification list with One-Hot Encoding. Provided by
                                                [input_interface][aucmedi.data_processing.io_data.input_interface].
            metadata (numpy.ndarray):           NumPy array with additional metadata. Has to have shape (n_samples, meta_variables).
            image_format (str):                 Image format to append to the sample index for image loading.
                                                Provided by [input_interface][aucmedi.data_processing.io_data.input_interface].
            subfunctions (list of Subfunctions): List of Subfunction class instances which will be executed SEQUENTIALLY on the dataset.
            batch_size (int):                   Number of samples inside a single batch.
            resize (tuple of int):              Resizing shape consisting of an X and Y size (optional Z size for volumes).
            standardize_mode (str):             Standardization mode in which image intensity values are scaled.
                                                Calls the [Standardize][aucmedi.data_processing.subfunctions.standardize] Subfunction.
            data_aug (Augmentation Interface):  Data Augmentation class instance which performs diverse augmentation techniques.
                                                If `None` is provided, no augmentation will be performed.
            shuffle (bool):                     Boolean, whether the dataset should be shuffled.
            grayscale (bool):                   Boolean, whether images are grayscale or RGB.
            sample_weights (list of float):     List of weights for samples. Can be computed via
                                                [compute_sample_weights()][aucmedi.utils.class_weights.compute_sample_weights].
            workers (int):                      Number of workers. If workers > 1, multi-threading is used for image preprocessing.
            prepare_images (bool):              Boolean, whether all images should be preprocessed and backed up to disk before training.
                                                Recommended for large images or volumes to reduce CPU computing time.
            loader (io_loader function):        Function for loading samples/images from disk.
            seed (int):                         Seed to ensure reproducibility of random functions.
            **kwargs (dict):                    Additional parameters for the sample loader.
        """
        # Cache class variables
        self.samples = samples
        self.labels = labels
        self.metadata = metadata
        self.sample_weights = sample_weights
        self.prepare_images = prepare_images
        self.workers = workers
        self.sample_loader = loader
        self.kwargs = kwargs
        self.path_imagedir = path_imagedir
        self.image_format = image_format
        self.grayscale = grayscale
        self.subfunctions = subfunctions
        self.batch_size = batch_size
        self.data_aug = data_aug
        self.standardize_mode = standardize_mode
        self.resize = resize
        self.shuffle = shuffle
        self.seed = seed
        # Cache keras.Sequence class variables
        self.n = len(samples)
        self.max_iterations = (self.n + self.batch_size - 1) // self.batch_size
        self.iterations = self.max_iterations
        self.seed_walk = 0
        self.index_array = None

        # Initialize Standardization Subfunction
        if standardize_mode is not None:
            self.sf_standardize = Standardize(mode=standardize_mode)
        else : self.sf_standardize = None
        # Initialize Resizing Subfunction
        if resize is not None : self.sf_resize = Resize(shape=resize)
        else : self.sf_resize = None
        # Sanity check for full sample list
        if samples is not None and len(samples) == 0:
            raise ValueError("Provided sample list is empty!", len(samples))
        # Sanity check for label correctness
        if labels is not None and len(samples) != len(labels):
            raise ValueError("Samples and labels do not have same size!",
                             len(samples), len(labels))
        # Sanity check for metadata correctness
        if metadata is not None and len(samples) != len(metadata):
            raise ValueError("Samples and metadata do not have same size!",
                             len(samples), len(metadata))
        # Sanity check for sample weights correctness
        if sample_weights is not None and len(samples) != len(sample_weights):
            raise ValueError("Samples and sample weights do not have same size!",
                             len(samples), len(sample_weights))
        # Verify that labels, metadata and sample weights are NumPy arrays
        if labels is not None and not isinstance(labels, np.ndarray):
            self.labels = np.asarray(self.labels)
        if metadata is not None and not isinstance(metadata, np.ndarray):
            self.metadata = np.asarray(self.metadata)
        if sample_weights is not None and not isinstance(sample_weights,
                                                         np.ndarray):
            self.sample_weights = np.asarray(self.sample_weights)

        # If prepare_image modus activated
        # -> Preprocess images beforehand and store them to disk for fast usage later
        if self.prepare_images:
            self.prepare_dir_object = tempfile.TemporaryDirectory(
                                               prefix="aucmedi.tmp.",
                                               suffix=".data")
            self.prepare_dir = self.prepare_dir_object.name

            # Preprocess image for each index - Sequential
            if self.workers == 0 or self.workers == 1:
                for i in range(0, len(samples)):
                    self.preprocess_image(index=i, prepared_image=False,
                                          run_aug=False, run_standardize=False,
                                          dump_pickle=True)
            # Preprocess image for each index - Multi-threading
            else:
                with ThreadPool(self.workers) as pool:
                    index_array = list(range(0, len(samples)))
                    mp_params = zip(index_array, repeat(False), repeat(False),
                                    repeat(False), repeat(True))
                    pool.starmap(self.preprocess_image, mp_params)
            print("A directory for image preparation was created:",
                  self.prepare_dir)

    #-----------------------------------------------------#
    #              Batch Generation Function              #
    #-----------------------------------------------------#
    """ Internal function for batch generation given a list of random selected samples. """
    def _get_batches_of_transformed_samples(self, index_array):
        # Initialize Batch stack
        batch_stack = ([],)
        if self.labels is not None : batch_stack += ([],)
        if self.sample_weights is not None : batch_stack += ([],)

        # Process image for each index - Sequential
        if self.workers == 0 or self.workers == 1:
            for i in index_array:
                batch_img = self.preprocess_image(index=i,
                                                  prepared_image=self.prepare_images)
                batch_stack[0].append(batch_img)
        # Process image for each index - Multi-threading
        else:
            with ThreadPool(self.workers) as pool:
                mp_params = zip(index_array, repeat(self.prepare_images))
                batches_img = pool.starmap(self.preprocess_image, mp_params)
            batch_stack[0].extend(batches_img)

        # Add classification to batch if available
        if self.labels is not None:
            batch_stack[1].extend(self.labels[index_array])
        # Add sample weight to batch if available
        if self.sample_weights is not None:
            batch_stack[2].extend(self.sample_weights[index_array])

        # Stack images and optional metadata together into a batch
        input_stack = np.stack(batch_stack[0], axis=0)
        if self.metadata is not None:
            input_stack = (input_stack, self.metadata[index_array])
        batch = (input_stack, )
        # Stack classifications together into a batch if available
        if self.labels is not None:
            batch += (np.stack(batch_stack[1], axis=0), )
        # Stack sample weights together into a batch if available
        if self.sample_weights is not None:
            batch += (np.stack(batch_stack[2], axis=0), )
        # Return generated Batch
        return batch

    #-----------------------------------------------------#
    #                 Image Preprocessing                 #
    #-----------------------------------------------------#
    def preprocess_image(self, index, prepared_image=False, run_aug=True,
                         run_standardize=True, dump_pickle=False):
        """ Internal preprocessing function for applying Subfunctions, augmentation, resizing and standardization
            on an image given its index.

        Can be utilized for debugging purposes.

        Activating the prepared_image option allows loading an already preprocessed image from disk.

        Deactivate the run_aug & run_standardize options to output the image without augmentation and standardization.

        Activating dump_pickle will store the preprocessed image as pickle on disk instead of returning.
        """
        # Load prepared image from disk
        if prepared_image:
            # Load from disk
            path_img = os.path.join(self.prepare_dir, "img_" + str(index))
            with open(path_img + ".pickle", "rb") as pickle_loader:
                img = pickle.load(pickle_loader)
            # Apply image augmentation on image if activated
            if self.data_aug is not None and run_aug:
                img = self.data_aug.apply(img)
            # Apply standardization on image if activated
            if self.sf_standardize is not None and run_standardize:
                img = self.sf_standardize.transform(img)
        # Preprocess image during runtime
        else:
            # Load image from disk
            img = self.sample_loader(self.samples[index], self.path_imagedir,
                                     image_format=self.image_format,
                                     grayscale=self.grayscale,
                                     **self.kwargs)
            # Apply subfunctions on image
            for sf in self.subfunctions:
                img = sf.transform(img)
            # Apply resizing on image if activated
            if self.sf_resize is not None:
                img = self.sf_resize.transform(img)
            # Apply image augmentation on image if activated
            if self.data_aug is not None and run_aug:
                img = self.data_aug.apply(img)
            # Apply standardization on image if activated
            if self.sf_standardize is not None and run_standardize:
                img = self.sf_standardize.transform(img)
        # Dump preprocessed image to disk (for later usage via prepared_image)
        if dump_pickle:
            path_img = os.path.join(self.prepare_dir, "img_" + str(index))
            with open(path_img + ".pickle", "wb") as pickle_writer:
                pickle.dump(img, pickle_writer)
        # Return preprocessed image
        else : return img

    #-----------------------------------------------------#
    #              Sample Generation Function             #
    #-----------------------------------------------------#
    """ Internal function for calling the batch generation process. """
    def __getitem__(self, raw_idx):
        # Obtain the index based on the passed index offset to allow repetition
        idx = raw_idx % self.max_iterations
        # Build index array for the start
        if self.index_array is None:
            self.__set_index_array__()
        # Select samples for next batch
        index_array = self.index_array[
            self.batch_size * idx : self.batch_size * (idx + 1)
        ]
        # Generate batch
        return self._get_batches_of_transformed_samples(index_array)

    #-----------------------------------------------------#
    #                 Generator Functions                 #
    #-----------------------------------------------------#
    """ Internal function for identifying the generator length. """
    def __len__(self):
        return self.iterations

    """ Configuration function for fixing the number of iterations. """
    def set_length(self, iterations):
        self.iterations = iterations

    """ Configuration function for reseting the number of iterations. """
    def reset_length(self):
        self.iterations = self.max_iterations

    """ Internal function for initializing and shuffling the index array. """
    def __set_index_array__(self):
        # Generate index array
        self.index_array = np.arange(self.n)
        # Shuffle if needed
        if self.shuffle:
            # Update seed for repeated permutation of the index_array
            if self.seed is not None:
                np.random.seed(self.seed + self.seed_walk)
                self.seed_walk += 1
            # Permute index array
            self.index_array = np.random.permutation(self.n)

    """ Internal function at the end of an epoch. """
    def on_epoch_end(self):
        self.__set_index_array__()

__init__(samples, path_imagedir, labels=None, metadata=None, image_format=None, subfunctions=[], batch_size=32, resize=(224, 224), standardize_mode='z-score', data_aug=None, shuffle=False, grayscale=False, sample_weights=None, workers=1, prepare_images=False, loader=image_loader, seed=None, **kwargs)

Initialization function of the DataGenerator which acts as a configuration hub.

If using for prediction, the 'labels' parameter has to be None.

For more information on Subfunctions, read here: aucmedi.data_processing.subfunctions.

Data augmentation is applied even for prediction if a DataAugmentation object is provided!

Warning

Augmentation should only be applied to a training DataGenerator!

For test-time augmentation, aucmedi.ensemble.augmenting should be used.
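In practice, this means a prediction DataGenerator should simply keep data_aug at its default value; a minimal sketch:

```python
# No augmentation at prediction time: leave data_aug at its default (None)
datagen_test = DataGenerator(samples[100:150], "images_dir/",
                             labels=None, data_aug=None,
                             image_format=image_format,
                             resize=model.meta_input,
                             standardize_mode=model.meta_standardize)
```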

Passing None as resize results in no image resizing. Default: (224, 224).

IO_loader Functions

| Interface | Description |
| --- | --- |
| image_loader() | Image Loader for image loading via Pillow. |
| sitk_loader() | SimpleITK Loader for loading NIfTI (nii) or Metafile (mha) formats. |
| numpy_loader() | NumPy Loader for image loading of .npy files. |
| cache_loader() | Cache Loader for passing already loaded images. |

More information on IO_loader functions can be found here: aucmedi.data_processing.io_loader.
Parameters defined in **kwargs are passed down to IO_loader functions.
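As a sketch of swapping the loader, e.g. for samples stored as NumPy arrays (the directory name and the `image_format` value are illustrative assumptions):

```python
from aucmedi.data_processing.io_loader import numpy_loader

# Samples stored as .npy arrays instead of standard image files
datagen = DataGenerator(samples, "npy_dir/",
                        image_format="npy",
                        labels=class_ohe,
                        resize=None,             # arrays already have the target shape
                        standardize_mode="z-score",
                        loader=numpy_loader)
```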

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `samples` | list of str | List of samples (indices) encoded as strings. Provided by input_interface. | *required* |
| `path_imagedir` | str | Path to the directory containing the images. | *required* |
| `labels` | numpy.ndarray | Classification list with One-Hot Encoding. Provided by input_interface. | `None` |
| `metadata` | numpy.ndarray | NumPy array with additional metadata. Has to have shape (n_samples, meta_variables). | `None` |
| `image_format` | str | Image format to append to the sample index for image loading. Provided by input_interface. | `None` |
| `subfunctions` | list of Subfunctions | List of Subfunction class instances which will be executed SEQUENTIALLY on the dataset. | `[]` |
| `batch_size` | int | Number of samples inside a single batch. | `32` |
| `resize` | tuple of int | Resizing shape consisting of an X and Y size (optional Z size for volumes). | `(224, 224)` |
| `standardize_mode` | str | Standardization mode in which image intensity values are scaled. Calls the Standardize Subfunction. | `'z-score'` |
| `data_aug` | Augmentation Interface | Data Augmentation class instance which performs diverse augmentation techniques. If `None` is provided, no augmentation will be performed. | `None` |
| `shuffle` | bool | Boolean, whether the dataset should be shuffled. | `False` |
| `grayscale` | bool | Boolean, whether images are grayscale or RGB. | `False` |
| `sample_weights` | list of float | List of weights for samples. Can be computed via compute_sample_weights(). | `None` |
| `workers` | int | Number of workers. If workers > 1, multi-threading is used for image preprocessing. | `1` |
| `prepare_images` | bool | Boolean, whether all images should be preprocessed and backed up to disk before training. Recommended for large images or volumes to reduce CPU computing time. | `False` |
| `loader` | io_loader function | Function for loading samples/images from disk. | `image_loader` |
| `seed` | int | Seed to ensure reproducibility of random functions. | `None` |
| `**kwargs` | dict | Additional parameters for the sample loader. | `{}` |
Source code in aucmedi/data_processing/data_generator.py
def __init__(self, samples, path_imagedir, labels=None, metadata=None,
             image_format=None, subfunctions=[], batch_size=32,
             resize=(224, 224), standardize_mode="z-score", data_aug=None,
             shuffle=False, grayscale=False, sample_weights=None, workers=1,
             prepare_images=False, loader=image_loader, seed=None,
             **kwargs):
    """ Initialization function of the DataGenerator which acts as a configuration hub.

    If using for prediction, the 'labels' parameter has to be `None`.

    For more information on Subfunctions, read here: [aucmedi.data_processing.subfunctions][].

    Data augmentation is applied even for prediction if a DataAugmentation object is provided!

    ???+ warning
        Augmentation should only be applied to a **training** DataGenerator!

        For test-time augmentation, [aucmedi.ensemble.augmenting][] should be used.

    Passing `None` as `resize` results in no image resizing. Default: (224, 224).

    ???+ info "IO_loader Functions"
        | Interface                                                        | Description                                  |
        | ---------------------------------------------------------------- | -------------------------------------------- |
        | [image_loader()][aucmedi.data_processing.io_loader.image_loader] | Image Loader for image loading via Pillow. |
        | [sitk_loader()][aucmedi.data_processing.io_loader.sitk_loader]   | SimpleITK Loader for loading NIfTI (nii) or Metafile (mha) formats.    |
        | [numpy_loader()][aucmedi.data_processing.io_loader.numpy_loader] | NumPy Loader for image loading of .npy files.    |
        | [cache_loader()][aucmedi.data_processing.io_loader.cache_loader] | Cache Loader for passing already loaded images. |

        More information on IO_loader functions can be found here: [aucmedi.data_processing.io_loader][]. <br>
        Parameters defined in `**kwargs` are passed down to IO_loader functions.

    Args:
        samples (list of str):              List of samples (indices) encoded as strings. Provided by
                                            [input_interface][aucmedi.data_processing.io_data.input_interface].
        path_imagedir (str):                Path to the directory containing the images.
        labels (numpy.ndarray):             Classification list with One-Hot Encoding. Provided by
                                            [input_interface][aucmedi.data_processing.io_data.input_interface].
        metadata (numpy.ndarray):           NumPy array with additional metadata. Has to have shape (n_samples, meta_variables).
        image_format (str):                 Image format to append to the sample index for image loading.
                                            Provided by [input_interface][aucmedi.data_processing.io_data.input_interface].
        subfunctions (list of Subfunctions): List of Subfunction class instances which will be executed SEQUENTIALLY on the dataset.
        batch_size (int):                   Number of samples inside a single batch.
        resize (tuple of int):              Resizing shape consisting of an X and Y size (optional Z size for volumes).
        standardize_mode (str):             Standardization mode in which image intensity values are scaled.
                                            Calls the [Standardize][aucmedi.data_processing.subfunctions.standardize] Subfunction.
        data_aug (Augmentation Interface):  Data Augmentation class instance which performs diverse augmentation techniques.
                                            If `None` is provided, no augmentation will be performed.
        shuffle (bool):                     Boolean, whether the dataset should be shuffled.
        grayscale (bool):                   Boolean, whether images are grayscale or RGB.
        sample_weights (list of float):     List of weights for samples. Can be computed via
                                            [compute_sample_weights()][aucmedi.utils.class_weights.compute_sample_weights].
        workers (int):                      Number of workers. If workers > 1, multi-threading is used for image preprocessing.
        prepare_images (bool):              Boolean, whether all images should be preprocessed and backed up to disk before training.
                                            Recommended for large images or volumes to reduce CPU computing time.
        loader (io_loader function):        Function for loading samples/images from disk.
        seed (int):                         Seed to ensure reproducibility of random functions.
        **kwargs (dict):                    Additional parameters for the sample loader.
    """
    # Cache class variables
    self.samples = samples
    self.labels = labels
    self.metadata = metadata
    self.sample_weights = sample_weights
    self.prepare_images = prepare_images
    self.workers = workers
    self.sample_loader = loader
    self.kwargs = kwargs
    self.path_imagedir = path_imagedir
    self.image_format = image_format
    self.grayscale = grayscale
    self.subfunctions = subfunctions
    self.batch_size = batch_size
    self.data_aug = data_aug
    self.standardize_mode = standardize_mode
    self.resize = resize
    self.shuffle = shuffle
    self.seed = seed
    # Cache keras.Sequence class variables
    self.n = len(samples)
    self.max_iterations = (self.n + self.batch_size - 1) // self.batch_size
    self.iterations = self.max_iterations
    self.seed_walk = 0
    self.index_array = None

    # Initialize Standardization Subfunction
    if standardize_mode is not None:
        self.sf_standardize = Standardize(mode=standardize_mode)
    else : self.sf_standardize = None
    # Initialize Resizing Subfunction
    if resize is not None : self.sf_resize = Resize(shape=resize)
    else : self.sf_resize = None
    # Sanity check for full sample list
    if samples is not None and len(samples) == 0:
        raise ValueError("Provided sample list is empty!", len(samples))
    # Sanity check for label correctness
    if labels is not None and len(samples) != len(labels):
        raise ValueError("Samples and labels do not have same size!",
                         len(samples), len(labels))
    # Sanity check for metadata correctness
    if metadata is not None and len(samples) != len(metadata):
        raise ValueError("Samples and metadata do not have same size!",
                         len(samples), len(metadata))
    # Sanity check for sample weights correctness
    if sample_weights is not None and len(samples) != len(sample_weights):
        raise ValueError("Samples and sample weights do not have same size!",
                         len(samples), len(sample_weights))
    # Verify that labels, metadata and sample weights are NumPy arrays
    if labels is not None and not isinstance(labels, np.ndarray):
        self.labels = np.asarray(self.labels)
    if metadata is not None and not isinstance(metadata, np.ndarray):
        self.metadata = np.asarray(self.metadata)
    if sample_weights is not None and not isinstance(sample_weights,
                                                     np.ndarray):
        self.sample_weights = np.asarray(self.sample_weights)

    # If prepare_image modus activated
    # -> Preprocess images beforehand and store them to disk for fast usage later
    if self.prepare_images:
        self.prepare_dir_object = tempfile.TemporaryDirectory(
                                           prefix="aucmedi.tmp.",
                                           suffix=".data")
        self.prepare_dir = self.prepare_dir_object.name

        # Preprocess image for each index - Sequential
        if self.workers == 0 or self.workers == 1:
            for i in range(0, len(samples)):
                self.preprocess_image(index=i, prepared_image=False,
                                      run_aug=False, run_standardize=False,
                                      dump_pickle=True)
        # Preprocess image for each index - Multi-threading
        else:
            with ThreadPool(self.workers) as pool:
                index_array = list(range(0, len(samples)))
                mp_params = zip(index_array, repeat(False), repeat(False),
                                repeat(False), repeat(True))
                pool.starmap(self.preprocess_image, mp_params)
        print("A directory for image preparation was created:",
              self.prepare_dir)

preprocess_image(index, prepared_image=False, run_aug=True, run_standardize=True, dump_pickle=False)

Internal preprocessing function for applying Subfunctions, augmentation, resizing and standardization on an image given its index.

Can be utilized for debugging purposes.

Activating the prepared_image option allows loading an already preprocessed image from disk.

Deactivate the run_aug & run_standardize options to output the image without augmentation and standardization.

Activating dump_pickle will store the preprocessed image as pickle on disk instead of returning.
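A small debugging sketch based on these options (`datagen` is assumed to be an already configured DataGenerator):

```python
# Image after Subfunctions + resizing only (no augmentation, no standardization)
img_raw = datagen.preprocess_image(index=0, run_aug=False, run_standardize=False)

# Fully preprocessed, model-ready image
img_full = datagen.preprocess_image(index=0)

print(img_raw.shape, img_raw.dtype)
print(img_full.shape, img_full.dtype)
```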

Source code in aucmedi/data_processing/data_generator.py
def preprocess_image(self, index, prepared_image=False, run_aug=True,
                     run_standardize=True, dump_pickle=False):
    """ Internal preprocessing function for applying Subfunctions, augmentation, resizing and standardization
        on an image given its index.

    Can be utilized for debugging purposes.

    Activating the prepared_image option allows loading an already preprocessed image from disk.

    Deactivate the run_aug & run_standardize options to output the image without augmentation and standardization.

    Activating dump_pickle will store the preprocessed image as pickle on disk instead of returning.
    """
    # Load prepared image from disk
    if prepared_image:
        # Load from disk
        path_img = os.path.join(self.prepare_dir, "img_" + str(index))
        with open(path_img + ".pickle", "rb") as pickle_loader:
            img = pickle.load(pickle_loader)
        # Apply image augmentation on image if activated
        if self.data_aug is not None and run_aug:
            img = self.data_aug.apply(img)
        # Apply standardization on image if activated
        if self.sf_standardize is not None and run_standardize:
            img = self.sf_standardize.transform(img)
    # Preprocess image during runtime
    else:
        # Load image from disk
        img = self.sample_loader(self.samples[index], self.path_imagedir,
                                 image_format=self.image_format,
                                 grayscale=self.grayscale,
                                 **self.kwargs)
        # Apply subfunctions on image
        for sf in self.subfunctions:
            img = sf.transform(img)
        # Apply resizing on image if activated
        if self.sf_resize is not None:
            img = self.sf_resize.transform(img)
        # Apply image augmentation on image if activated
        if self.data_aug is not None and run_aug:
            img = self.data_aug.apply(img)
        # Apply standardization on image if activated
        if self.sf_standardize is not None and run_standardize:
            img = self.sf_standardize.transform(img)
    # Dump preprocessed image to disk (for later usage via prepared_image)
    if dump_pickle:
        path_img = os.path.join(self.prepare_dir, "img_" + str(index))
        with open(path_img + ".pickle", "wb") as pickle_writer:
            pickle.dump(img, pickle_writer)
    # Return preprocessed image
    else : return img