Kfold

sampling_kfold(samples, labels, metadata=None, n_splits=3, stratified=True, iterative=False, seed=None)

Simple wrapper function for calling k-fold cross-validation sampling functions.

Allows the use of stratified and iterative sampling algorithms.

Warning

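Be aware that random stratified sampling (stratified=True, iterative=False) does not support multi-label data. For multi-label data, use the iterative sampling algorithm (iterative=True) instead.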
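The reason can be seen in a small sketch. In the source listing, the random stratified path reduces each label row to a single class index with `np.argmax` before splitting. That is lossless for one-hot rows, but it discards information for multi-label rows. The arrays below are made-up illustrations, not data from the library.

```python
import numpy as np

# One-hot (single-label) rows: argmax recovers the class index exactly.
single_label = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [0, 0, 1]])

# Multi-label rows: several classes can be active for one sample.
multi_label = np.array([[1, 1, 0],
                        [0, 1, 1],
                        [1, 0, 1]])

# Same reduction as the random stratified path in the source listing
single_classes = np.argmax(single_label, axis=-1)
multi_classes = np.argmax(multi_label, axis=-1)

print(single_classes)  # three distinct classes, one per row
print(multi_classes)   # only the first active class of each row survives
```

Stratifying on the collapsed indices would balance only the first active class of each sample, which is why the iterative sampler is the safer choice for multi-label data.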

Example

The sampling is returned as a list of length n_splits containing tuples with the sampled data.

Example for n_splits=3
cv = sampling_kfold(samples, labels, n_splits=3)

# sampling in which x = samples and y = labels
# cv <-> [(train_x, train_y, test_x, test_y),   # fold 1
#         (train_x, train_y, test_x, test_y),   # fold 2
#         (train_x, train_y, test_x, test_y)]   # fold 3

# Recommended access on the folds
for fold in cv:
    (train_x, train_y, test_x, test_y) = fold

Example with metadata
cv = sampling_kfold(samples, labels, metadata, n_splits=2)

# sampling in which x = samples, y = labels and m = metadata
# cv <-> [(train_x, train_y, train_m, test_x, test_y, test_m),      # fold 1
#         (train_x, train_y, train_m, test_x, test_y, test_m)]      # fold 2
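Because the tuple length depends on whether metadata was passed (4 elements without, 6 with), code that consumes the folds may want to handle both layouts. The following is a minimal sketch of one way to do that; `unpack_fold` is a hypothetical helper and not part of the library, and the fold values are placeholders rather than real arrays.

```python
def unpack_fold(fold):
    """Hypothetical helper: turn a fold tuple into a dict, with or without metadata."""
    if len(fold) == 4:
        train_x, train_y, test_x, test_y = fold
        train_m, test_m = None, None
    elif len(fold) == 6:
        train_x, train_y, train_m, test_x, test_y, test_m = fold
    else:
        raise ValueError(f"Expected a fold of length 4 or 6, got {len(fold)}")
    return {
        "train": (train_x, train_y, train_m),
        "test": (test_x, test_y, test_m),
    }


# Stand-in folds with placeholder values instead of real arrays
fold_plain = (["a", "b"], [0, 1], ["c"], [1])
fold_meta = (["a", "b"], [0, 1], [[30], [41]], ["c"], [1], [[25]])

plain = unpack_fold(fold_plain)
with_meta = unpack_fold(fold_meta)
```

With this layout, downstream code can always read `result["train"]` and `result["test"]` and check whether the metadata entry is `None`.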

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| samples | list of str | List of samples/indices encoded as strings. | required |
| labels | numpy.ndarray | NumPy matrix containing the one-hot encoded classification. | required |
| metadata | numpy.ndarray | NumPy matrix with additional metadata. Must have shape (n_samples, meta_variables). | None |
| n_splits | int | Number of folds (k). Must be at least 2. | 3 |
| stratified | bool | Whether to use stratified sampling based on the provided labels. | True |
| iterative | bool | Whether to use the iterative sampling algorithm. | False |
| seed | int | Seed to ensure reproducibility of random functions. | None |
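The shape requirements above can be checked before sampling. The sketch below is one possible pre-flight check; `check_kfold_inputs` is a hypothetical helper that is not part of the library, and the assumption that `labels` has shape (n_samples, n_classes) is inferred from the "one-hot encoded matrix" description rather than stated explicitly.

```python
import numpy as np


def check_kfold_inputs(samples, labels, metadata=None, n_splits=3):
    """Hypothetical pre-flight check mirroring the parameter descriptions."""
    n_samples = len(samples)
    labels = np.asarray(labels)
    # Assumption: one row per sample, one column per class
    if labels.ndim != 2 or labels.shape[0] != n_samples:
        raise ValueError("labels must have shape (n_samples, n_classes)")
    if metadata is not None:
        metadata = np.asarray(metadata)
        if metadata.ndim != 2 or metadata.shape[0] != n_samples:
            raise ValueError("metadata must have shape (n_samples, meta_variables)")
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")
    return True


# Made-up example inputs
samples = ["img_0", "img_1", "img_2", "img_3"]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])
metadata = np.array([[63.0], [47.0], [58.0], [71.0]])

ok = check_kfold_inputs(samples, labels, metadata, n_splits=2)
```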

Returns:

| Name | Type | Description |
|---|---|---|
| sampling | list of tuple | List of length n_splits containing tuples with the sampled data. |

Source code in aucmedi/sampling/kfold.py
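# Imports needed by this excerpt (they sit above the excerpted lines in the module)
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
# Assumption: the iterative sampler comes from AUCMEDI's own iterative module
from aucmedi.sampling.iterative import MultilabelStratifiedKFold

def sampling_kfold(samples, labels, metadata=None, n_splits=3,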
                   stratified=True, iterative=False, seed=None):
    """ Simple wrapper function for calling k-fold cross-validation sampling functions.

    Allows the use of stratified and iterative sampling algorithms.

    ???+ warning
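        Be aware that random stratified sampling (stratified=True, iterative=False) does not support multi-label data.
        For multi-label data, use the iterative sampling algorithm (iterative=True) instead.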

    ???+ example
        The sampling is returned as a list of length n_splits containing tuples with the sampled data.

        ```python title="Example for n_splits=3"
        cv = sampling_kfold(samples, labels, n_splits=3)

        # sampling in which x = samples and y = labels
        # cv <-> [(train_x, train_y, test_x, test_y),   # fold 1
        #         (train_x, train_y, test_x, test_y),   # fold 2
        #         (train_x, train_y, test_x, test_y)]   # fold 3

        # Recommended access on the folds
        for fold in cv:
            (train_x, train_y, test_x, test_y) = fold
        ```

        ```python title="Example with metadata"
        cv = sampling_kfold(samples, labels, metadata, n_splits=2)

        # sampling in which x = samples, y = labels and m = metadata
        # cv <-> [(train_x, train_y, train_m, test_x, test_y, test_m),      # fold 1
        #         (train_x, train_y, train_m, test_x, test_y, test_m)]      # fold 2
        ```

    Args:
        samples (list of str):      List of samples/indices encoded as strings.
        labels (numpy.ndarray):     NumPy matrix containing the one-hot encoded classification.
        metadata (numpy.ndarray):   NumPy matrix with additional metadata. Must have shape (n_samples, meta_variables).
        n_splits (int):             Number of folds (k). Must be at least 2.
        stratified (bool):          Whether to use stratified sampling based on the provided labels.
        iterative (bool):           Whether to use the iterative sampling algorithm.
        seed (int):                 Seed to ensure reproducibility of random functions.

    Returns:
        sampling (list of tuple):   List of length `n_splits` containing tuples with the sampled data.
    """
    # Initialize variables
    results = []
    wk_labels = labels

    # Initialize random sampler
    if not stratified and not iterative:
        sampler = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Initialize random stratified sampler
    elif stratified and not iterative:
        sampler = StratifiedKFold(n_splits=n_splits, shuffle=True,
                                  random_state=seed)
        wk_labels = np.argmax(wk_labels, axis=-1)
    # Initialize iterative stratified sampler
    else:
        sampler = MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True,
                                            random_state=seed)

    # Preprocess data
    x = np.asarray(samples)
    y = np.asarray(labels)
    if metadata is not None:
        m = np.asarray(metadata)

    # Apply sampling and generate folds
    for train, test in sampler.split(X=samples, y=wk_labels):
        # Simple sampling
        if metadata is None:
            fold = (x[train], y[train], x[test], y[test])
        # Sampling with metadata
        else:
            fold = (x[train], y[train], m[train], x[test], y[test], m[test])
        results.append(fold)

    # Return result sampling
    return results
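One consequence of the branching in the listing above is that `iterative=True` always selects the iterative stratified sampler, regardless of the value of `stratified`. The sketch below reproduces only that branch logic and returns the sampler class name as a string; `select_sampler_name` is a hypothetical helper for illustration, not part of the library.

```python
def select_sampler_name(stratified=True, iterative=False):
    """Hypothetical helper: name the sampler class the branch logic would pick."""
    if not stratified and not iterative:
        return "KFold"
    elif stratified and not iterative:
        return "StratifiedKFold"
    else:
        # Reached whenever iterative=True, whatever stratified is
        return "MultilabelStratifiedKFold"


# Enumerate all four flag combinations
choices = {
    (stratified, iterative): select_sampler_name(stratified, iterative)
    for stratified in (True, False)
    for iterative in (True, False)
}

for flags, name in choices.items():
    print(flags, "->", name)
```

In particular, `stratified=False, iterative=True` does not give a plain random split; it still uses the iterative stratified sampler.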