Skip to content

Io csv

csv_loader(path_data, path_imagedir, allowed_image_formats, training=True, ohe=True, ohe_range=None, col_sample='SAMPLE', col_class='CLASS') ยค

Data Input Interface for loading a dataset via a CSV and an image directory.

This internal function allows simple parsing of class annotations encoded in a CSV, and can be called via the input_interface() function by passing "csv" as parameter interface.

Input Formats
CSV Format 1:
   - Name Column: "SAMPLE" -> String Value
   - Class Column: "CLASS" -> Sparse Categorical Classes (String/Integers)
   - Optional Meta Columns possible

CSV Format 2:
   - Name Column: "SAMPLE"
   - One-Hot Encoded Class Columns:
       -> If OHE parameter provides list of column names -> use these
       -> Else try to use all other columns as OHE columns
   - Optional Meta Columns only possible if OHE parameter provided

Expected structure:

dataset/
    images_dir/                 # path_imagedir = "dataset/images_dir"
        sample001.png
        sample002.png
        ...
        sample350.png
    annotations.csv             # path_data = "dataset/annotations.csv"

Parameters:

Name Type Description Default
path_data str

Path to the csv file.

required
path_imagedir str

Path to the directory containing the images.

required
allowed_image_formats list of str

List of allowed imaging formats. (provided by IO_Interface)

required
training bool

Boolean option whether annotation data is available.

True
ohe bool

Boolean option whether annotation data is sparse categorical or one-hot encoded.

True
ohe_range list of str

List of column name values if annotation encoded in OHE. Example: ["classA", "classB", "classC"]

None
col_sample str

Index column name for the sample name column. Default: 'SAMPLE'

'SAMPLE'
col_class str

Index column name for the sparse categorical classes column. Default: 'CLASS'

'CLASS'

Returns:

Name Type Description
index_list list of str

List of sample/index encoded as Strings. Required in DataGenerator as samples.

class_ohe numpy.ndarray

Classification list as One-Hot encoding. Required in DataGenerator as labels.

class_n int

Number of classes. Required in NeuralNetwork for Architecture design as n_labels.

class_names list of str

List of names for corresponding classes. Used for later prediction storage or evaluation.

image_format str

Image format to add at the end of the sample index for image loading. Required in DataGenerator.

Source code in aucmedi/data_processing/io_interfaces/io_csv.py
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
def csv_loader(path_data, path_imagedir, allowed_image_formats,
               training=True, ohe=True, ohe_range=None,
               col_sample="SAMPLE", col_class="CLASS"):
    """ Data Input Interface for loading a dataset via a CSV and an image directory.

    This **internal** function allows simple parsing of class annotations encoded in a CSV,
    and can be called via the [input_interface()][aucmedi.data_processing.io_data.input_interface]
    function by passing `"csv"` as parameter `interface`.

    ???+ info "Input Formats"
        ```
        CSV Format 1:
           - Name Column: "SAMPLE" -> String Value
           - Class Column: "CLASS" -> Sparse Categorical Classes (String/Integers)
           - Optional Meta Columns possible

        CSV Format 2:
           - Name Column: "SAMPLE"
           - One-Hot Encoded Class Columns:
               -> If OHE parameter provides list of column names -> use these
               -> Else try to use all other columns as OHE columns
           - Optional Meta Columns only possible if OHE parameter provided
        ```

    **Expected structure:**
    ```
    dataset/
        images_dir/                 # path_imagedir = "dataset/images_dir"
            sample001.png
            sample002.png
            ...
            sample350.png
        annotations.csv             # path_data = "dataset/annotations.csv"
    ```

    Args:
        path_data (str):                        Path to the csv file.
        path_imagedir (str):                    Path to the directory containing the images.
        allowed_image_formats (list of str):    List of allowed imaging formats. (provided by IO_Interface)
        training (bool):                        Boolean option whether annotation data is available.
        ohe (bool):                             Boolean option whether annotation data is sparse categorical or one-hot encoded.
        ohe_range (list of str):                List of column name values if annotation encoded in OHE. Example: ["classA", "classB", "classC"]
        col_sample (str):                       Index column name for the sample name column. Default: 'SAMPLE'
        col_class (str):                        Index column name for the sparse categorical classes column. Default: 'CLASS'

    Returns:
        index_list (list of str):               List of sample/index encoded as Strings. Required in DataGenerator as `samples`.
        class_ohe (numpy.ndarray):              Classification list as One-Hot encoding. Required in DataGenerator as `labels`.
        class_n (int):                          Number of classes. Required in NeuralNetwork for Architecture design as `n_labels`.
        class_names (list of str):              List of names for corresponding classes. Used for later prediction storage or evaluation.
        image_format (str):                     Image format to add at the end of the sample index for image loading. Required in DataGenerator.
    """
    # Load CSV file
    dt = pd.read_csv(path_data, sep=",", header=0)
    # Check if image index column exist and parse it
    if col_sample in dt.columns : index_list = dt[col_sample].tolist()
    else : raise Exception("Sample column (" + str(col_sample) + \
                           ") not available in CSV file!", path_data)
    # Ensure index list to contain strings
    index_list = [str(index) for index in index_list]
    # Identify image format by peaking first image
    image_format = None
    for file in os.listdir(path_imagedir):
        format = file.split(".")[-1]
        if format.lower() in allowed_image_formats or \
           format.upper() in allowed_image_formats:
           image_format = format
           break
    # Raise Exception if image format is unknown
    if image_format is None:
        raise Exception("Unknown image format.", path_imagedir)
    # Check if image ending is already in sample name by peaking first one
    if index_list[0].endswith("." + image_format) : image_format = None
    # Verify if all images are existing
    for sample in index_list:
        # Obtain image file path
        if image_format : img_file = sample + "." + image_format
        else : img_file = sample
        path_img = os.path.join(path_imagedir, img_file)
        # Check existance
        if not os.path.exists(path_img):
            raise Exception("Image does not exist / not accessible!",
                            'Sample: "' + sample + '"', path_img)

    # If CSV is for inference (no annotation data) -> return parsing
    if not training : return index_list, None, None, None, image_format

    # Try parsing with a sparse categorical class format (CSV Format 1)
    if not ohe:
        # Verify if provided classification column in in dataframe
        if col_class not in dt.columns:
            raise Exception("Provided classification column not in dataset!")
        # Obtain class information
        classes_sparse = dt[col_class].tolist()
        class_names = np.unique(classes_sparse).tolist()
        class_n = len(class_names)
        # Parse sparse categorical annotations to One-Hot Encoding
        class_ohe = pd.get_dummies(classes_sparse).to_numpy()
    # Try parsing one-hot encoded format (CSV Format 2)
    else:
        # Identify OHE columns
        if ohe_range is None : ohe_columns = dt.loc[:, dt.columns != col_sample]
        else : ohe_columns = dt.loc[:, ohe_range]
        # Parse information
        class_names = list(ohe_columns.columns)
        class_n = len(class_names)
        class_ohe = ohe_columns.to_numpy()

    # Validate if number of samples and number of annotations match
    if len(index_list) != len(class_ohe):
        raise Exception("Numbers of samples and annotations do not match!",
                        len(index_list), len(class_ohe))
    # Return parsed CSV data
    return index_list, class_ohe, class_n, class_names, image_format