Quickstart

This guide walks you through a minimal, end-to-end AutoMIL workflow: from preparing your data to running a first model evaluation. It is intended to provide a high-level overview of the typical AutoMIL pipeline.

Dataset Information

For the purposes of this demonstration, we use an example set of whole-slide images from The Cancer Genome Atlas Program (TCGA) consisting of lung tissue samples. This dataset can be downloaded via slideflow's project module, which provides the slides as a preconfigured project named LungAdenoSquam. Since the individual image files are quite large and the full project contains 941 slides, we restrict this example to a randomly sampled subset of 100 slides. The subset can be replicated using this annotation file. Provide this file to slideflow's API to make sure only the annotated slides are downloaded:

    #!/usr/bin/env python3
    import slideflow as sf
    from slideflow.project_utils import LungAdenoSquam

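    if __name__ == "__main__":
        # Create the preconfigured LungAdenoSquam project under ./LungAdenoSquam.
        # Passing the annotation file restricts the download to the slides it lists.
        project = sf.create_project(
            root='LungAdenoSquam',
            cfg=LungAdenoSquam().__dict__,
            annotations="path/to/lung_labels.csv",
            download=True
        )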
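The 100-slide subset used here is distributed as a ready-made annotation file, so nothing below is needed to follow this guide. As a minimal sketch of how a comparable random subset could be drawn from a full annotation file, the following uses only the standard library. The helper name `sample_annotations`, the file paths and the fixed seed are assumptions for illustration and are not part of slideflow or AutoMIL.

```python
import csv
import random


def sample_annotations(src_csv, dst_csv, n=100, seed=0):
    """Write a random sample of n annotation rows from src_csv to dst_csv.

    Hypothetical helper, not part of slideflow or AutoMIL.
    Returns the number of rows written.
    """
    with open(src_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # A fixed seed makes the draw repeatable; never ask for more rows than exist.
    rng = random.Random(seed)
    subset = rng.sample(rows, min(n, len(rows)))

    with open(dst_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(subset)
    return len(subset)
```

Sampling whole rows keeps the patient, label and slide columns of each selected slide together, so the output can be used in place of the full annotation file.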

(Optional) 1. Activate Your Environment

If you installed AutoMIL in a virtual environment, ensure your venv is activated before running AutoMIL.

source .venv/bin/activate
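If you are unsure whether the environment is active, a quick check from Python can help. This is a small sketch based on the standard `sys.prefix` and `sys.base_prefix` attributes; the function name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```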

2. Prepare the Dataset

AutoMIL expects your WSI dataset to consist of slide images in one of many supported formats (.tiff, .svs, .tif, etc.) and a file containing slide-level label information.

A minimal dataset consists of:

A directory containing slide images
A .csv metadata file with slide-level annotations

Example directory structure:

./LungAdenoSquam/
├── slides/
│   ├── TCGA-05-4430-01Z-00-DX1.95659bbb-3091-4370-bc1d-6c6c1baa7b3d.svs
│   ├── TCGA-55-A48Z-01Z-00-DX1.0867DC6A-2A51-4CF1-AE3F-0526CE2DD740.svs
│   ├── TCGA-55-A4DG-01Z-00-DX1.9CE9B7BE-48EF-44F1-9C25-F15700A3E5DE.svs
│   └── ...
└── lung_labels.csv

With lung_labels.csv:

patient,subtype,site,slide
TCGA-05-4430,adenocarcinoma,Site-61,TCGA-05-4430-01Z-00-DX1.95659bbb-3091-4370-bc1d-6c6c1baa7b3d
TCGA-55-A48Z,adenocarcinoma,Site-67,TCGA-55-A48Z-01Z-00-DX1.0867DC6A-2A51-4CF1-AE3F-0526CE2DD740
TCGA-55-A4DG,adenocarcinoma,Site-67,TCGA-55-A4DG-01Z-00-DX1.9CE9B7BE-48EF-44F1-9C25-F15700A3E5DE
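Before starting a long run, it can be worth confirming that every slide named in the annotation file actually exists in the slide directory. The sketch below assumes, as in the example above, that the slide column holds the file name without its final extension. `find_missing_slides` is a hypothetical helper, not an AutoMIL function.

```python
import csv
from pathlib import Path


def find_missing_slides(slides_dir, annotations_csv, slide_column="slide"):
    """Return slide identifiers from the CSV that have no file in slides_dir.

    Hypothetical helper; files are matched by name without the final extension.
    """
    # "TCGA-...-DX1.<uuid>.svs" has the stem "TCGA-...-DX1.<uuid>".
    available = {p.stem for p in Path(slides_dir).iterdir() if p.is_file()}

    with open(annotations_csv, newline="") as f:
        wanted = [row[slide_column] for row in csv.DictReader(f)]

    return [name for name in wanted if name not in available]
```

An empty list means every annotated slide was found; for the layout above the call would be `find_missing_slides("./LungAdenoSquam/slides", "./LungAdenoSquam/lung_labels.csv")`.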

3. Run the basic training pipeline

To train a basic Attention_MIL model on the dataset, run the automil train command with default parameters:

automil train ./LungAdenoSquam/slides ./LungAdenoSquam/lung_labels.csv results -v -lc "subtype" -sc "slide"

By default, AutoMIL expects the column containing labels to be named label and the slide names to be identical to the patient identifiers in the patient column. The -lc | --label-column option overrides the label column name, and the -sc | --slide-column option specifies a column containing slide identifiers.
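When the training command is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The following sketch uses only the positional arguments and the -v, -lc and -sc options shown in this guide; the function `build_train_command` is hypothetical and not part of AutoMIL.

```python
import shlex


def build_train_command(slides_dir, annotations, project_dir,
                        label_column=None, slide_column=None, verbose=False):
    """Assemble the argument list for `automil train` as used in this guide."""
    cmd = ["automil", "train", str(slides_dir), str(annotations), str(project_dir)]
    if verbose:
        cmd.append("-v")
    # Only pass the column options when the defaults need to be overridden.
    if label_column is not None:
        cmd += ["-lc", label_column]
    if slide_column is not None:
        cmd += ["-sc", slide_column]
    return cmd


cmd = build_train_command(
    "./LungAdenoSquam/slides",
    "./LungAdenoSquam/lung_labels.csv",
    "results",
    label_column="subtype",
    slide_column="slide",
    verbose=True,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with column names that contain spaces.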

With the verbose flag -v, automil displays additional informational messages that give an overview of the pipeline's progress:

INFO     Executing command:                                          
        /data/jonas/Master/AutoMIL/.venv/bin/automil train          
        ./Datasets/LungAdenoSquam/slides                            
        ./Datasets/LungAdenoSquam/annotations_balanced.csv results  
        -v -lc subtype -sc slide                                    
INFO     Using resolution presets: ['Low']                           
INFO     Using model type: Attention_MIL                             
INFO     Using default image backend cucim                           
INFO     Project directory results already exists                    
INFO     Annotations saved to results/annotations.csv                
INFO     Project scaffold setup complete                             
INFO     Loading existing project at results                         
INFO     Project Summary                                             
INFO     ┌──────────────────────┬─────────────────────────────────────────────────┐                                              
        │Project Directory:    │ results                                         │                                                           
        │Slide Directory:      │ Datasets/LungAdenoSquam/slides                  │                                                           
        │Annotations File:     │ Datasets/LungAdenoSquam/annotations_balanced.csv│           
        │Patient Column:       │ patient                                         │                                                           
        │Label Column:         │ subtype                                         │                                                           
        │Slide Column:         │ slide                                           │                                                           
        │Transform Labels:     │ False                                           │                                                           
        │Modified Annotations: │ results/annotations.csv                         │                                                           
        │Slideflow Project:    │ Loaded                                          │                                                           
        └──────────────────────┴─────────────────────────────────────────────────┘                                              

INFO     Setting up dataset for resolution preset: Low               
INFO     Computed average MPP across slides: 0.260                   
INFO     Dataset Summary                                             
INFO     ┌──────────────────┬─────────┐                              
        │Resolution Preset │ Low     │                              
        │Tile Size (px)    │ 1000px  │                              
        │Magnification     │ 10x     │                              
        │Microns-Per-Pixel │ 0.260   │                              
        │Tile Size (µm)    │ 260.00µm│                              
        │Pretiled Input    │ False   │                              
        │TIFF Conversion   │ False   │                              
        └──────────────────┴─────────┘                              

INFO     Preparing dataset source at resolution Low (1000px,         
        260.00um)
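The figures in the dataset summary are consistent with the physical tile size being the tile size in pixels multiplied by the microns-per-pixel (MPP) value. The short sketch below reproduces that arithmetic with the numbers from the log; it illustrates the relationship and is not AutoMIL's own code.

```python
def tile_size_um(tile_px, mpp):
    """Physical tile edge length in microns: pixels times microns-per-pixel."""
    return tile_px * mpp


# Values from the dataset summary above: 1000 px tiles at 0.260 MPP.
print(f"{tile_size_um(1000, 0.260):.2f}µm")  # prints 260.00µm
```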

If not already done, tile extraction and feature bag generation will commence, and the resulting data will be stored in results/tfrecords and results/bags respectively.

[16:24:01] INFO     Finished tile extraction for TCGA-85-8664 (1605 tiles of    
                    1605 possible)                                              
           INFO     No ROI for                                                  
                    TCGA-98-A53J-01Z-00-DX1.EEC6256E-D331-4731-B00C-08622C725F61
                    , using whole slide.                                        
[16:27:14] INFO     Finished tile extraction for TCGA-98-A53J (1595 tiles of    
                    1595 possible)                                              
           INFO     No ROI for                                                  
                    TCGA-44-8119-01Z-00-DX1.1EBEBFA7-22DB-4365-9DF8-C4E679C11312
                    , using whole slide.                                        
[16:27:41] INFO     Finished tile extraction for TCGA-44-8119 (211 tiles of 211 
                    possible)                                                   
[16:27:42] INFO     No ROI for                                                  
                    TCGA-21-1071-01Z-00-DX1.a9bba825-1c92-4101-9086-c4d1c91117af
                    , using whole slide.                                        
[16:28:55] INFO     Finished tile extraction for TCGA-21-1071 (674 tiles of 674 
                    possible)                                                   
           INFO     No ROI for                                                  
                    TCGA-21-1075-01Z-00-DX1.937872ae-4d6f-4d7a-b54f-b7e797cb84b0
                    , using whole slide.                                        
[16:29:53] INFO     Finished tile extraction for TCGA-21-1075 (457 tiles of 457 
                    possible)                                                   
           INFO     No ROI for                                                  
                    TCGA-O2-A52V-01Z-00-DX1.561ADDAE-EC55-461A-84B5-535C93E39C56
                    , using whole slide.                                        
[16:36:20] INFO     Finished tile extraction for TCGA-O2-A52V (3460 tiles of    
                    3460 possible)

Tile Extraction Takes Time

Depending on the slide size and the amount of tissue that can be tiled, the tile extraction may take some time (see the timestamps in the log above). If possible, consider using a pretiled dataset.
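To keep track of how many tiles a long extraction run has produced, the verbose output can be summarized after the fact. The sketch below assumes the log has been captured as text and that the "Finished tile extraction" messages look like the ones shown above; the pattern and the helper `summarize_tile_log` are illustrative and may need adjusting for other AutoMIL or slideflow versions.

```python
import re

# Matches messages such as:
#   Finished tile extraction for TCGA-85-8664 (1605 tiles of 1605 possible)
FINISHED = re.compile(
    r"Finished tile extraction for (\S+) \((\d+) tiles of (\d+) possible\)"
)


def summarize_tile_log(log_text):
    """Return {slide_id: (extracted, possible)} parsed from verbose log text."""
    # The console wraps long messages, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return {
        slide: (int(done), int(possible))
        for slide, done, possible in FINISHED.findall(flat)
    }
```

Summing the first element of each value, for example `sum(done for done, _ in summary.values())`, gives the total number of extracted tiles.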

Finally, the trained model is saved under results/models/.

4. Evaluate the trained model

To evaluate the trained model on the same dataset, run the automil evaluate command:

automil evaluate ./LungAdenoSquam/slides ./LungAdenoSquam/lung_labels.csv ./results/bags ./results/models -v -lc "subtype" -sc "slide" -o "./results/evaluation"

Once again, the verbose flag provides detailed information on the pipeline's progress and current state:

INFO     Evaluation complete.                                                                                               
INFO     Model Comparison:                                           
INFO                         model  Accuracy  AUC   F1               
        00002-attention_mil-label      0.85 0.94 0.86               
        00001-attention_mil-label      0.90 0.92 0.90               
        00000-attention_mil-label      0.85 0.90 0.84               
INFO     Saved plot 'box_plots' to                                   
        results/evaluation/figures/box_plots.png                    
INFO     Saved plot 'model_comparison' to                            
        results/evaluation/figures/model_comparison.png             
INFO     Saved plot 'per_class_accuracy' to                          
        results/evaluation/figures/per_class_accuracy.png           
INFO     Saved plot 'roc_curves' to                                  
        results/evaluation/figures/roc_curves.png

This will create an evaluation report inside the ./results/evaluation directory, containing metrics and plots of the model performance:

Model Comparison — Metric Comparison across all models

Per Class Accuracy scores across all models

ROC Curves — The ROC curves for all models
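The Model Comparison table printed in the evaluation log can also be ranked programmatically. As a minimal sketch, the scores below are copied by hand from the log output in this guide; reading them from the evaluation report is left out because the report's file format is not described here. The helper `best_model` is hypothetical.

```python
# Scores copied from the "Model Comparison" log output in this guide.
results = [
    {"model": "00002-attention_mil-label", "Accuracy": 0.85, "AUC": 0.94, "F1": 0.86},
    {"model": "00001-attention_mil-label", "Accuracy": 0.90, "AUC": 0.92, "F1": 0.90},
    {"model": "00000-attention_mil-label", "Accuracy": 0.85, "AUC": 0.90, "F1": 0.84},
]


def best_model(rows, metric="AUC"):
    """Return the row with the highest value for the given metric."""
    return max(rows, key=lambda row: row[metric])


print(best_model(results)["model"])              # 00002-attention_mil-label
print(best_model(results, "Accuracy")["model"])  # 00001-attention_mil-label
```

Note that the ranking depends on the metric: the model with the highest AUC in this run is not the one with the highest accuracy.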