
Split

sampling_split(samples, labels, metadata=None, sampling=[0.8, 0.2], stratified=True, iterative=False, seed=None)

Simple wrapper function for calling percentage split sampling functions.

Allows usage of stratified and iterative sampling algorithms.

Warning

Be aware that multi-label data does not support random stratified sampling.

Percentage split ratios have to be provided via the sampling list. Each percentage value in the list defines the approximate split size. The sum of all percentage split ratios has to equal 1!

Example
Example for common train/val/test split
split_ratio = [0.7, 0.25, 0.05]
ds = sampling_split(samples, labels, sampling=split_ratio)

# Returns a list with the following elements as tuples:
print(ds[0])  # -> (samples_a, labels_a)                with 70% of complete dataset
print(ds[1])  # -> (samples_b, labels_b)                with 25% of complete dataset
print(ds[2])  # -> (samples_c, labels_c)                with  5% of complete dataset
Example with metadata
ds = sampling_split(samples, labels, metadata, sampling=[0.8, 0.2])

# Returns a list with the following elements as tuples:
print(ds[0])  # -> (samples_a, labels_a, metadata_a)    with 80% of complete dataset
print(ds[1])  # -> (samples_b, labels_b, metadata_b)    with 20% of complete dataset
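As noted in the warning above, random stratified sampling cannot handle multi-label data. A minimal sketch of how the iterative sampling algorithm could be used instead (assuming labels is a multi-label one-hot matrix in which rows may contain several 1s):

Example for multi-label data with iterative sampling
ds = sampling_split(samples, labels, sampling=[0.8, 0.2], iterative=True)

# Returns a list with the following elements as tuples:
print(ds[0])  # -> (samples_a, labels_a)                with 80% of complete dataset
print(ds[1])  # -> (samples_b, labels_b)                with 20% of complete dataset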

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| samples | list of str | List of samples/indices encoded as strings. | required |
| labels | numpy.ndarray | NumPy matrix containing the one-hot encoded classification. | required |
| metadata | numpy.ndarray | NumPy matrix with additional metadata. Has to be of shape (n_samples, meta_variables). | None |
| sampling | list of float | List of percentage values with split sizes. | [0.8, 0.2] |
| stratified | bool | Option whether to use stratified sampling based on provided labels. | True |
| iterative | bool | Option whether to use iterative sampling algorithm. | False |
| seed | int | Seed to ensure reproducibility for random functions. | None |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| results | list of tuple | List of length len(sampling) containing tuples with the sampled data: (samples_a, labels_a), or (samples_a, labels_a, metadata_a) if metadata was provided. |
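The returned tuples can also be unpacked directly into named subsets. A minimal illustrative sketch (the variable names are not part of the API):

ds_train, ds_test = sampling_split(samples, labels, sampling=[0.8, 0.2])
x_train, y_train = ds_train
x_test, y_test = ds_test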

Source code in aucmedi/sampling/split.py
def sampling_split(samples, labels, metadata=None, sampling=[0.8, 0.2],
                   stratified=True, iterative=False, seed=None):
    """ Simple wrapper function for calling percentage split sampling functions.

    Allows usage of stratified and iterative sampling algorithms.

    ???+ warning
        Be aware that multi-label data does not support random stratified sampling.

    Percentage split ratios have to be provided via the sampling list.
    Each percentage value in the list defines the approximate split size.
    The sum of all percentage split ratios has to equal 1!

    ???+ example
        ```python title="Example for common train/val/test split"
        split_ratio = [0.7, 0.25, 0.05]
        ds = sampling_split(samples, labels, sampling=split_ratio)

        # Returns a list with the following elements as tuples:
        print(ds[0])  # -> (samples_a, labels_a)                with 70% of complete dataset
        print(ds[1])  # -> (samples_b, labels_b)                with 25% of complete dataset
        print(ds[2])  # -> (samples_c, labels_c)                with  5% of complete dataset
        ```

        ```python title="Example with metadata"
        ds = sampling_split(samples, labels, metadata, sampling=[0.8, 0.2])

        # Returns a list with the following elements as tuples:
        print(ds[0])  # -> (samples_a, labels_a, metadata_a)    with 80% of complete dataset
        print(ds[1])  # -> (samples_b, labels_b, metadata_b)    with 20% of complete dataset
        ```

    Args:
        samples (list of str):          List of samples/indices encoded as strings.
        labels (numpy.ndarray):         NumPy matrix containing the one-hot encoded classification.
        metadata (numpy.ndarray):       NumPy matrix with additional metadata. Has to be of shape (n_samples, meta_variables).
        sampling (list of float):       List of percentage values with split sizes.
        stratified (bool):              Option whether to use stratified sampling based on provided labels.
        iterative (bool):               Option whether to use iterative sampling algorithm.
        seed (int):                     Seed to ensure reproducibility for random functions.

    Returns:
        results (list of tuple):        List of length `len(sampling)` containing tuples with the sampled data:
                                        (samples_a, labels_a) and with metadata (samples_a, labels_a, metadata_a).
    """
    # Verify sampling percentages
    if not np.isclose(sum(sampling), 1.0):
        raise ValueError("Sum of percentage split ratios as sampling does not" + \
                         " equal 1", sampling, np.sum(sampling))
    # Initialize leftover with the complete dataset
    leftover_samples = np.asarray(samples)
    leftover_labels = np.asarray(labels)
    if metadata is not None : leftover_meta = np.asarray(metadata)
    leftover_p = 0.0
    # Initialize result list
    results = []

    # Perform sampling for each percentage split
    for i in range(0, len(sampling)):
        # For last split, just take leftover data as subset
        if i == len(sampling)-1:
            # Generate split
            if metadata is None : split = (leftover_samples, leftover_labels)
            else : split = (leftover_samples, leftover_labels, leftover_meta)
            # Append split data and stop
            results.append(split)
            break

        # Identify split percentage for remaining data
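        # (the requested fraction is re-scaled to the size of the leftover pool,
        #  so the drawn subset still corresponds to sampling[i] of the full dataset)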
        p = sampling[i] / (1.0 - leftover_p)
        # Initialize random sampler
        if not stratified and not iterative:
            sampler = ShuffleSplit(n_splits=1, random_state=seed,
                                   train_size=(1.0-p), test_size=p)
        # Initialize random stratified sampler
        elif stratified and not iterative:
            sampler = StratifiedShuffleSplit(n_splits=1, random_state=seed,
                                             train_size=(1.0-p), test_size=p)
        # Initialize iterative stratified sampler
        else:
            sampler = MultilabelStratifiedShuffleSplit(n_splits=1,
                            random_state=seed, train_size=(1.0-p), test_size=p)

        # Apply sampling
        subset_generator = sampler.split(X=leftover_samples, y=leftover_labels)
        subsets = next(subset_generator)
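        # subsets is the (train_indices, test_indices) tuple yielded by the sampler:
        # subsets[1] forms the current split, subsets[0] remains as leftover data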
        # Generate split
        if metadata is None:
            split = (leftover_samples[subsets[1]], leftover_labels[subsets[1]])
        else : split = (leftover_samples[subsets[1]],
                        leftover_labels[subsets[1]],
                        leftover_meta[subsets[1]])
        # Append split data
        results.append(split)
        # Update remaining data
        leftover_p += sampling[i]
        leftover_samples = leftover_samples[subsets[0]]
        leftover_labels = leftover_labels[subsets[0]]
        if metadata is not None : leftover_meta = leftover_meta[subsets[0]]

    # Return result sampling
    return results
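To illustrate the cascading re-normalization in the loop above with the train/val/test example from the top of this page: for sampling = [0.7, 0.25, 0.05], the first iteration draws p = 0.7 of the full dataset. The second iteration computes p = 0.25 / (1 - 0.7) ≈ 0.833 and draws that fraction from the remaining 30%, which again corresponds to 0.833 * 0.3 = 25% of the full dataset. The final 5% is simply the leftover data, which is why the last split is taken without another sampler call.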