Iterative

Internal classes to allow iterative stratification in percentage-split and k-fold cross-validation for multi-label sampling.

Use the corresponding core functions from aucmedi.sampling.split and aucmedi.sampling.kfold with the parameter iterative=True.
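In practice these splitters are reached through the AUCMEDI sampling interface rather than instantiated directly. The sketch below illustrates that entry point; the function names `sampling_split` and `sampling_kfold` and their exact parameters are assumptions inferred from the module paths above, so verify them against your AUCMEDI version.

```python
# Hedged sketch -- function names/signatures are assumptions inferred from
# aucmedi.sampling.split and aucmedi.sampling.kfold, not verified API.
import numpy as np
from aucmedi.sampling import sampling_split, sampling_kfold

samples = ["img_%03d.png" % i for i in range(100)]   # image file names
labels = np.random.randint(0, 2, size=(100, 4))      # multi-hot label matrix

# Percentage split (80/20) with iterative multi-label stratification
subsets = sampling_split(samples, labels, sampling=[0.8, 0.2],
                         stratified=True, iterative=True, seed=0)

# 5-fold cross-validation with iterative multi-label stratification
folds = sampling_kfold(samples, labels, n_splits=5,
                       stratified=True, iterative=True, seed=0)
```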

Personal Note

This code originates from https://github.com/trent-b.

If you are reading this, leave trent-b a star on his GitHub! :)
His code is open source, well written, and cleanly structured.

Reference - Implementation

Author: trent-b
GitHub Profile: https://github.com/trent-b
https://github.com/trent-b/iterative-stratification

Reference - Publication

Sechidis K., Tsoumakas G., Vlahavas I. 2011. On the Stratification of Multi-Label Data. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg. Aristotle University of Thessaloniki.
https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10

MultilabelStratifiedKFold

Bases: _BaseKFold

Multilabel stratified K-Folds cross-validator.

Provides train/test indices to split multilabel data into train/test sets. This cross-validation object is a variation of KFold that returns stratified folds for multilabel data. The folds are made by preserving the percentage of samples for each label.

Example
>>> from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
>>> import numpy as np
>>> X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
>>> y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
>>> mskf = MultilabelStratifiedKFold(n_splits=2, random_state=0)
>>> mskf.get_n_splits(X, y)
2
>>> print(mskf)  # doctest: +NORMALIZE_WHITESPACE
MultilabelStratifiedKFold(n_splits=2, random_state=0, shuffle=False)
>>> for train_index, test_index in mskf.split(X, y):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
Note

Train and test sizes may be slightly different in each fold.

See also

RepeatedMultilabelStratifiedKFold: Repeats Multilabel Stratified K-Fold n times.

Source code in aucmedi/sampling/iterative.py
class MultilabelStratifiedKFold(_BaseKFold):
    """Multilabel stratified K-Folds cross-validator.

    Provides train/test indices to split multilabel data into train/test sets.
    This cross-validation object is a variation of KFold that returns
    stratified folds for multilabel data. The folds are made by preserving
    the percentage of samples for each label.

    ??? example
        ```python
        >>> from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
        >>> import numpy as np
        >>> X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
        >>> y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
        >>> mskf = MultilabelStratifiedKFold(n_splits=2, random_state=0)
        >>> mskf.get_n_splits(X, y)
        2
        >>> print(mskf)  # doctest: +NORMALIZE_WHITESPACE
        MultilabelStratifiedKFold(n_splits=2, random_state=0, shuffle=False)
        >>> for train_index, test_index in mskf.split(X, y):
        ...    print("TRAIN:", train_index, "TEST:", test_index)
        ...    X_train, X_test = X[train_index], X[test_index]
        ...    y_train, y_test = y[train_index], y[test_index]
        TRAIN: [0 3 4 6] TEST: [1 2 5 7]
        TRAIN: [1 2 5 7] TEST: [0 3 4 6]
        ```

    ???+ note
        Train and test sizes may be slightly different in each fold.

    ???+ note "See also"
        RepeatedMultilabelStratifiedKFold: Repeats Multilabel Stratified K-Fold
        n times.

    """

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        """
        Args:
            n_splits (int, default=3):      Number of folds. Must be at least 2.
            shuffle (boolean, optional):    Whether to shuffle each stratification of the data before splitting
                                            into batches.
            random_state (int, RandomState instance or None, optional, default=None): If int, random_state is the
                                            seed used by the random number generator;
                                            If RandomState instance, random_state is the random number generator;
                                            If None, the random number generator is the RandomState instance used
                                            by `np.random`. Unlike StratifiedKFold that only uses random_state
                                            when ``shuffle`` == True, this multilabel implementation
                                            always uses the random_state since the iterative stratification
                                            algorithm breaks ties randomly.
        """
        super(MultilabelStratifiedKFold, self).__init__(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

    def _make_test_folds(self, X, y):
        y = np.asarray(y, dtype=bool)
        type_of_target_y = type_of_target(y)

        if type_of_target_y != 'multilabel-indicator':
            raise ValueError(
                'Supported target type is: multilabel-indicator. Got {!r} instead.'.format(type_of_target_y))

        num_samples = y.shape[0]

        rng = check_random_state(self.random_state)
        indices = np.arange(num_samples)

        if self.shuffle:
            rng.shuffle(indices)
            y = y[indices]

        r = np.asarray([1 / self.n_splits] * self.n_splits)

        test_folds = IterativeStratification(labels=y, r=r, random_state=rng)

        return test_folds[np.argsort(indices)]

    def _iter_test_masks(self, X=None, y=None, groups=None):
        test_folds = self._make_test_folds(X, y)
        for i in range(self.n_splits):
            yield test_folds == i

    def split(self, X, y, groups=None):
        """ Generate indices to split data into training and test set.

        ???+ note
            Randomized CV splitters may return different results for each call of
            split. You can make the results identical by setting ``random_state``
            to an integer.

        Args:
            X (array-like, shape (n_samples, n_features) ): Training data, where n_samples is the number of samples
                                                            and n_features is the number of features.
                                                            Note that providing ``y`` is sufficient to generate the splits and
                                                            hence ``np.zeros(n_samples)`` may be used as a placeholder for
                                                            ``X`` instead of actual training data.
            y (array-like, shape (n_samples, n_labels) ):   The target variable for supervised learning problems.
                                                            Multilabel stratification is done based on the y labels.
            groups (object, optional):                      Always ignored, exists for compatibility.

        Returns:
          train (numpy.ndarray):        The training set indices for that split.
          test (numpy.ndarray):         The testing set indices for that split.
        """
        y = check_array(y, ensure_2d=False, dtype=None)
        return super(MultilabelStratifiedKFold, self).split(X, y, groups)

__init__(n_splits=3, shuffle=False, random_state=None)

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `n_splits` | int | Number of folds. Must be at least 2. | `3` |
| `shuffle` | boolean | Whether to shuffle each stratification of the data before splitting into batches. | `False` |
| `random_state` | int, RandomState instance or None | If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`. Unlike StratifiedKFold, which only uses random_state when `shuffle == True`, this multilabel implementation always uses random_state, since the iterative stratification algorithm breaks ties randomly. | `None` |
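
A minimal instantiation following the signature above; unlike scikit-learn's StratifiedKFold, `random_state` matters even without shuffling, because tie-breaking in the stratification is randomized:

```python
from aucmedi.sampling.iterative import MultilabelStratifiedKFold

# random_state is always used here, since ties are broken randomly
mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```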
Source code in aucmedi/sampling/iterative.py
def __init__(self, n_splits=3, shuffle=False, random_state=None):
    """
    Args:
        n_splits (int, default=3):      Number of folds. Must be at least 2.
        shuffle (boolean, optional):    Whether to shuffle each stratification of the data before splitting
                                        into batches.
        random_state (int, RandomState instance or None, optional, default=None): If int, random_state is the
                                        seed used by the random number generator;
                                        If RandomState instance, random_state is the random number generator;
                                        If None, the random number generator is the RandomState instance used
                                        by `np.random`. Unlike StratifiedKFold that only uses random_state
                                        when ``shuffle`` == True, this multilabel implementation
                                        always uses the random_state since the iterative stratification
                                        algorithm breaks ties randomly.
    """
    super(MultilabelStratifiedKFold, self).__init__(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

split(X, y, groups=None)

Generate indices to split data into training and test set.

Note

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `X` | array-like, shape (n_samples, n_features) | Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing `y` is sufficient to generate the splits and hence `np.zeros(n_samples)` may be used as a placeholder for `X` instead of actual training data. | required |
| `y` | array-like, shape (n_samples, n_labels) | The target variable for supervised learning problems. Multilabel stratification is done based on the y labels. | required |
| `groups` | object | Always ignored, exists for compatibility. | `None` |

Returns:

| Name | Type | Description |
| ---- | ---- | ----------- |
| train | numpy.ndarray | The training set indices for that split. |
| test | numpy.ndarray | The testing set indices for that split. |
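
As the description of `X` above notes, only `y` determines the splits, so a zero placeholder works for `X`. A minimal sketch, reusing `mskf` and the multilabel `y` from the class example:

```python
import numpy as np

# X is never inspected for stratification; a zero placeholder suffices
X_placeholder = np.zeros(y.shape[0])
for train_index, test_index in mskf.split(X_placeholder, y):
    print("TRAIN:", train_index, "TEST:", test_index)
```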

Source code in aucmedi/sampling/iterative.py
def split(self, X, y, groups=None):
    """ Generate indices to split data into training and test set.

    ???+ note
        Randomized CV splitters may return different results for each call of
        split. You can make the results identical by setting ``random_state``
        to an integer.

    Args:
        X (array-like, shape (n_samples, n_features) ): Training data, where n_samples is the number of samples
                                                        and n_features is the number of features.
                                                        Note that providing ``y`` is sufficient to generate the splits and
                                                        hence ``np.zeros(n_samples)`` may be used as a placeholder for
                                                        ``X`` instead of actual training data.
        y (array-like, shape (n_samples, n_labels) ):   The target variable for supervised learning problems.
                                                        Multilabel stratification is done based on the y labels.
        groups (object, optional):                      Always ignored, exists for compatibility.

    Returns:
      train (numpy.ndarray):        The training set indices for that split.
      test (numpy.ndarray):         The testing set indices for that split.
    """
    y = check_array(y, ensure_2d=False, dtype=None)
    return super(MultilabelStratifiedKFold, self).split(X, y, groups)

MultilabelStratifiedShuffleSplit

Bases: BaseShuffleSplit

Multilabel Stratified ShuffleSplit cross-validator.

Provides train/test indices to split data into train/test sets. This cross-validation object is a merge of MultilabelStratifiedKFold and ShuffleSplit, which returns stratified randomized folds for multilabel data. The folds are made by preserving the percentage of each label. Note: like the ShuffleSplit strategy, multilabel stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Example
>>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
>>> import numpy as np
>>> X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
>>> y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
>>> msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5,
...    random_state=0)
>>> msss.get_n_splits(X, y)
3
>>> print(msss)      # doctest: +ELLIPSIS
MultilabelStratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5,
                                 train_size=None)
>>> for train_index, test_index in msss.split(X, y):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]
Note

Train and test sizes may be slightly different from desired due to the preference of stratification over perfectly sized folds.

Source code in aucmedi/sampling/iterative.py
class MultilabelStratifiedShuffleSplit(BaseShuffleSplit):
    """Multilabel Stratified ShuffleSplit cross-validator.

    Provides train/test indices to split data into train/test sets.
    This cross-validation object is a merge of MultilabelStratifiedKFold and
    ShuffleSplit, which returns stratified randomized folds for multilabel
    data. The folds are made by preserving the percentage of each label.
    Note: like the ShuffleSplit strategy, multilabel stratified random splits
    do not guarantee that all folds will be different, although this is
    still very likely for sizeable datasets.

    ??? example
        ```python
        >>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
        >>> import numpy as np
        >>> X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
        >>> y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
        >>> msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5,
        ...    random_state=0)
        >>> msss.get_n_splits(X, y)
        3
        >>> print(msss)      # doctest: +ELLIPSIS
        MultilabelStratifiedShuffleSplit(n_splits=3, random_state=0, test_size=0.5,
                                         train_size=None)
        >>> for train_index, test_index in msss.split(X, y):
        ...    print("TRAIN:", train_index, "TEST:", test_index)
        ...    X_train, X_test = X[train_index], X[test_index]
        ...    y_train, y_test = y[train_index], y[test_index]
        TRAIN: [1 2 5 7] TEST: [0 3 4 6]
        TRAIN: [2 3 6 7] TEST: [0 1 4 5]
        TRAIN: [1 2 5 6] TEST: [0 3 4 7]
        ```

    ???+ note
        Train and test sizes may be slightly different from desired due to the
        preference of stratification over perfectly sized folds.
    """

    def __init__(self, n_splits=10, test_size="default", train_size=None,
                 random_state=None):
        """
        Args:
            n_splits (int):                         Number of re-shuffling & splitting iterations.
            test_size (float, int, None, optional): If float, should be between 0.0 and 1.0 and represent the proportion
                                                    of the dataset to include in the test split. If int, represents the
                                                    absolute number of test samples. If None, the value is set to the
                                                    complement of the train size. By default, the value is set to 0.1.
                                                    The default will change in version 0.21. It will remain 0.1 only
                                                    if ``train_size`` is unspecified, otherwise it will complement
                                                    the specified ``train_size``.
            train_size (float, int, or None, default is None):  If float, should be between 0.0 and 1.0 and represent the
                                                    proportion of the dataset to include in the train split. If
                                                    int, represents the absolute number of train samples. If None,
                                                    the value is automatically set to the complement of the test size.
            random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator;
                                                    If RandomState instance, random_state is the random number generator;
                                                    If None, the random number generator is the RandomState instance used
                                                    by `np.random`. Unlike StratifiedShuffleSplit that only uses
                                                    random_state when ``shuffle`` == True, this multilabel implementation
                                                    always uses the random_state since the iterative stratification
                                                    algorithm breaks ties randomly.
        """
        super(MultilabelStratifiedShuffleSplit, self).__init__(
            n_splits=n_splits, test_size=test_size, train_size=train_size, random_state=random_state)

    def _iter_indices(self, X, y, groups=None):
        n_samples = _num_samples(X)
        y = check_array(y, ensure_2d=False, dtype=None)
        y = np.asarray(y, dtype=bool)
        type_of_target_y = type_of_target(y)

        if type_of_target_y != 'multilabel-indicator':
            raise ValueError(
                'Supported target type is: multilabel-indicator. Got {!r} instead.'.format(
                    type_of_target_y))

        n_train, n_test = _validate_shuffle_split(n_samples, self.test_size,
                                                  self.train_size)

        n_samples = y.shape[0]
        rng = check_random_state(self.random_state)
        y_orig = y.copy()

        r = np.array([n_train, n_test]) / (n_train + n_test)

        for _ in range(self.n_splits):
            indices = np.arange(n_samples)
            rng.shuffle(indices)
            y = y_orig[indices]

            test_folds = IterativeStratification(labels=y, r=r, random_state=rng)

            test_idx = test_folds[np.argsort(indices)] == 1
            test = np.where(test_idx)[0]
            train = np.where(~test_idx)[0]

            yield train, test

    def split(self, X, y, groups=None):
        """Generate indices to split data into training and test set.

        ???+ note
            Randomized CV splitters may return different results for each call of
            split. You can make the results identical by setting ``random_state``
            to an integer.

        Args:
            X (array-like, shape (n_samples, n_features) ): Training data, where n_samples is the number of samples
                                                            and n_features is the number of features.
                                                            Note that providing ``y`` is sufficient to generate the splits and
                                                            hence ``np.zeros(n_samples)`` may be used as a placeholder for
                                                            ``X`` instead of actual training data.
            y (array-like, shape (n_samples, n_labels) ):   The target variable for supervised learning problems.
                                                            Multilabel stratification is done based on the y labels.
            groups (object, optional):                      Always ignored, exists for compatibility.


        Returns:
            train (numpy.ndarray):        The training set indices for that split.
            test (numpy.ndarray):         The testing set indices for that split.
        """
        y = check_array(y, ensure_2d=False, dtype=None)
        return super(MultilabelStratifiedShuffleSplit, self).split(X, y, groups)

__init__(n_splits=10, test_size='default', train_size=None, random_state=None)

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `n_splits` | int | Number of re-shuffling & splitting iterations. | `10` |
| `test_size` | float, int, or None | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. By default, the value is set to 0.1. The default will change in version 0.21. It will remain 0.1 only if `train_size` is unspecified, otherwise it will complement the specified `train_size`. | `'default'` |
| `train_size` | float, int, or None | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. | `None` |
| `random_state` | int, RandomState instance or None | If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by `np.random`. Unlike StratifiedShuffleSplit, which only uses random_state when `shuffle == True`, this multilabel implementation always uses random_state, since the iterative stratification algorithm breaks ties randomly. | `None` |
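
A minimal instantiation following the signature above; passing an explicit `test_size` avoids relying on the deprecated `"default"` behavior:

```python
from aucmedi.sampling.iterative import MultilabelStratifiedShuffleSplit

# three randomized, multilabel-stratified 80/20 splits
msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.2,
                                        random_state=42)
```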
Source code in aucmedi/sampling/iterative.py
def __init__(self, n_splits=10, test_size="default", train_size=None,
             random_state=None):
    """
    Args:
        n_splits (int):                         Number of re-shuffling & splitting iterations.
        test_size (float, int, None, optional): If float, should be between 0.0 and 1.0 and represent the proportion
                                                of the dataset to include in the test split. If int, represents the
                                                absolute number of test samples. If None, the value is set to the
                                                complement of the train size. By default, the value is set to 0.1.
                                                The default will change in version 0.21. It will remain 0.1 only
                                                if ``train_size`` is unspecified, otherwise it will complement
                                                the specified ``train_size``.
        train_size (float, int, or None, default is None):  If float, should be between 0.0 and 1.0 and represent the
                                                proportion of the dataset to include in the train split. If
                                                int, represents the absolute number of train samples. If None,
                                                the value is automatically set to the complement of the test size.
        random_state (int, RandomState instance or None, optional): If int, random_state is the seed used by the random number generator;
                                                If RandomState instance, random_state is the random number generator;
                                                If None, the random number generator is the RandomState instance used
                                                by `np.random`. Unlike StratifiedShuffleSplit that only uses
                                                random_state when ``shuffle`` == True, this multilabel implementation
                                                always uses the random_state since the iterative stratification
                                                algorithm breaks ties randomly.
    """
    super(MultilabelStratifiedShuffleSplit, self).__init__(
        n_splits=n_splits, test_size=test_size, train_size=train_size, random_state=random_state)

split(X, y, groups=None)

Generate indices to split data into training and test set.

Note

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| `X` | array-like, shape (n_samples, n_features) | Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing `y` is sufficient to generate the splits and hence `np.zeros(n_samples)` may be used as a placeholder for `X` instead of actual training data. | required |
| `y` | array-like, shape (n_samples, n_labels) | The target variable for supervised learning problems. Multilabel stratification is done based on the y labels. | required |
| `groups` | object | Always ignored, exists for compatibility. | `None` |

Returns:

| Name | Type | Description |
| ---- | ---- | ----------- |
| train | numpy.ndarray | The training set indices for that split. |
| test | numpy.ndarray | The testing set indices for that split. |

Source code in aucmedi/sampling/iterative.py
def split(self, X, y, groups=None):
    """Generate indices to split data into training and test set.

    ???+ note
        Randomized CV splitters may return different results for each call of
        split. You can make the results identical by setting ``random_state``
        to an integer.

    Args:
        X (array-like, shape (n_samples, n_features) ): Training data, where n_samples is the number of samples
                                                        and n_features is the number of features.
                                                        Note that providing ``y`` is sufficient to generate the splits and
                                                        hence ``np.zeros(n_samples)`` may be used as a placeholder for
                                                        ``X`` instead of actual training data.
        y (array-like, shape (n_samples, n_labels) ):   The target variable for supervised learning problems.
                                                        Multilabel stratification is done based on the y labels.
        groups (object, optional):                      Always ignored, exists for compatibility.


    Returns:
        train (numpy.ndarray):        The training set indices for that split.
        test (numpy.ndarray):         The testing set indices for that split.
    """
    y = check_array(y, ensure_2d=False, dtype=None)
    return super(MultilabelStratifiedShuffleSplit, self).split(X, y, groups)

IterativeStratification(labels, r, random_state)

This function implements the Iterative Stratification algorithm described in the following paper:

Reference - Publication

Sechidis K., Tsoumakas G., Vlahavas I. 2011. On the Stratification of Multi-Label Data. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg. Aristotle University of Thessaloniki.
https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10
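
The splitters above drive this function internally, but it can also be called directly. A minimal sketch based on the signature in the source below: `labels` is a boolean indicator matrix, `r` holds the desired fold proportions, and `random_state` is a NumPy `RandomState` instance used for random tie-breaking:

```python
import numpy as np
from aucmedi.sampling.iterative import IterativeStratification

y = np.array([[0, 0], [0, 0], [0, 1], [0, 1],
              [1, 1], [1, 1], [1, 0], [1, 0]], dtype=bool)
r = np.asarray([0.5, 0.5])          # two folds with equal desired proportions
rng = np.random.RandomState(0)

fold_of_sample = IterativeStratification(labels=y, r=r, random_state=rng)
# fold_of_sample is an integer array (values 0 or 1), one fold index per sample
```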

Source code in aucmedi/sampling/iterative.py
def IterativeStratification(labels, r, random_state):
    """This function implements the Iterative Stratification algorithm described
    in the following paper:

    ??? abstract "Reference - Publication"
        Sechidis K., Tsoumakas G., Vlahavas I. 2011.
        On the Stratification of Multi-Label Data.
        Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011.
        Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.
        Aristotle University of Thessaloniki.
        <br>
        [https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10](https://link.springer.com/chapter/10.1007/978-3-642-23808-6_10)
    """

    n_samples = labels.shape[0]
    test_folds = np.zeros(n_samples, dtype=int)

    # Calculate the desired number of examples at each subset
    c_folds = r * n_samples

    # Calculate the desired number of examples of each label at each subset
    c_folds_labels = np.outer(r, labels.sum(axis=0))

    labels_not_processed_mask = np.ones(n_samples, dtype=bool)

    while np.any(labels_not_processed_mask):
        # Find the label with the fewest (but at least one) remaining examples,
        # breaking ties randomly
        num_labels = labels[labels_not_processed_mask].sum(axis=0)

        # Handle case where only all-zero labels are left by distributing
        # across all folds as evenly as possible (not in original algorithm but
        # mentioned in the text). (By handling this case separately, some
        # code redundancy is introduced; however, this approach allows for
        # decreased execution time when there are a relatively large number
        # of all-zero labels.)
        if num_labels.sum() == 0:
            sample_idxs = np.where(labels_not_processed_mask)[0]

            for sample_idx in sample_idxs:
                fold_idx = np.where(c_folds == c_folds.max())[0]

                if fold_idx.shape[0] > 1:
                    fold_idx = fold_idx[random_state.choice(fold_idx.shape[0])]

                test_folds[sample_idx] = fold_idx
                c_folds[fold_idx] -= 1

            break

        label_idx = np.where(num_labels == num_labels[np.nonzero(num_labels)].min())[0]
        if label_idx.shape[0] > 1:
            label_idx = label_idx[random_state.choice(label_idx.shape[0])]

        sample_idxs = np.where(np.logical_and(labels[:, label_idx].flatten(), labels_not_processed_mask))[0]

        for sample_idx in sample_idxs:
            # Find the subset(s) with the largest number of desired examples
            # for this label, breaking ties by considering the largest number
            # of desired examples, breaking further ties randomly
            label_folds = c_folds_labels[:, label_idx]
            fold_idx = np.where(label_folds == label_folds.max())[0]

            if fold_idx.shape[0] > 1:
                temp_fold_idx = np.where(c_folds[fold_idx] ==
                                         c_folds[fold_idx].max())[0]
                fold_idx = fold_idx[temp_fold_idx]

                if temp_fold_idx.shape[0] > 1:
                    fold_idx = fold_idx[random_state.choice(temp_fold_idx.shape[0])]

            test_folds[sample_idx] = fold_idx
            labels_not_processed_mask[sample_idx] = False

            # Update desired number of examples
            c_folds_labels[fold_idx, labels[sample_idx]] -= 1
            c_folds[fold_idx] -= 1

    return test_folds