Beginner Tutorial - Setting Up & Running One Algorithm

This tutorial provides a hands-on introduction to SPRAS. It is designed to show participants how to install the software, run example workflows, and use tools to interpret the results.

You will learn how to:

  • Set up the SPRAS software environment

  • Explore the folder structure and understand how inputs, configurations, and outputs are organized

  • Configure and run a pathway reconstruction algorithm on a provided dataset

  • Enable post-analysis steps to generate summary statistics and Cytoscape visualizations

Step 0: Clone the SPRAS repository, set up the environment, and run Docker

0.1 Start Docker

Launch Docker Desktop and wait until it says “Docker is running”.

0.2 Clone the SPRAS repository

Visit the SPRAS GitHub repository and clone it locally (e.g., with git clone and the repository URL).

0.3 Set up the SPRAS environment

From the root directory of the SPRAS repository, create and activate the Conda environment and install the SPRAS python package:

conda env create -f environment.yml
conda activate spras
python -m pip install .

0.4 Test the installation

Run the following command to confirm that SPRAS has been set up successfully from the command line:

python -c "import spras; print('SPRAS import successful')"

Step 1: Explanation of configuration file

A configuration file specifies how a SPRAS workflow should run; think of it as the control center for the workflow. It defines which algorithms to run, the parameters to use, the datasets and gold standards to include, the analyses to perform after reconstruction, and the container settings for execution.

SPRAS uses Snakemake (a workflow manager) and containerized software (such as Docker and Apptainer) to read the configuration file and execute a SPRAS workflow.

Snakemake considers a task from the configuration file complete once the expected output files are present in the output directory. As a result, rerunning the same configuration file may do nothing if those files already exist. To continue or rerun SPRAS with the same configuration file, delete the output directory (or its contents) or modify the configuration file so that Snakemake generates new results.

For this part of the tutorial, we’ll use a pre-defined configuration file. Download it here: Beginner Config File

Save the file into the config/ folder of your SPRAS installation. After adding this file, SPRAS will use the configuration to set up and reference your directory structure, which will look like this:

spras/
├── config/
│   └── beginner.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt # pre-defined in SPRAS already
│   └── tps-egfr-prizes.txt # pre-defined in SPRAS already

Here’s an overview of the major sections when looking at a configuration file:

Algorithms

algorithms:
- name: omicsintegrator1
  params:
     include: true
     run1:
        b: 0.1
        d: 10
        g: 1e-3
     run2:
        b: [0.55, 2, 10]
        d: [10, 20]
        g: 1e-3

Each algorithm defined in the configuration file must have a name that matches one of the supported SPRAS algorithms (these are introduced in the intermediate tutorial, and more information can be found under the Supported Algorithms section). Each algorithm also has an include flag: set it to true to have Snakemake run the algorithm, or false to disable it.

Algorithm parameters are organized into one or more run blocks (e.g., run1, run2, …), each containing key-value pairs. A parameter can be given either as a single value or as a list of values. If multiple parameters within a run block are given as lists, SPRAS generates all possible combinations (the Cartesian product) of those list values together with any fixed single-value parameters in the same run block, and each unique combination is run once per algorithm. In the run2 block above, for example, the three values of b and two values of d expand to six parameter combinations; a rough sketch of this expansion is shown below. Invalid or missing parameter keys will cause SPRAS to fail.
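
As a rough illustration (this is not SPRAS's internal code), the short Python snippet below shows how the list-valued parameters in the run2 block above expand into six parameter combinations:

from itertools import product

# Parameters from the run2 block above; a fixed value behaves like a one-element list
run2 = {"b": [0.55, 2, 10], "d": [10, 20], "g": [1e-3]}

keys = list(run2)
for values in product(*(run2[k] for k in keys)):
    print(dict(zip(keys, values)))
# Prints 3 x 2 x 1 = 6 combinations, e.g. {'b': 0.55, 'd': 10, 'g': 0.001}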

Datasets

datasets:
-
    label: egfr
    node_files: ["prizes.txt", "sources-targets.txt"]
    edge_files: ["interactome.txt"]
    other_files: []
    data_dir: "input"

In the configuration file, datasets are defined under the datasets section. Each dataset you define will be run against all of the algorithms enabled in the configuration file.

The dataset must include the following types of keys and files:

  • label: a name that uniquely identifies a dataset throughout the SPRAS workflow and outputs.

  • node_files: Input files listing the “prizes” or important starting nodes (“sources” or “targets”) for the algorithm

  • edge_files: Input interactome or network file that defines the relationships between nodes

  • other_files: This placeholder is not currently used and can be left as an empty list

  • data_dir: The file path of the directory where the input dataset files are located

Reconstruction Settings

reconstruction_settings:
locations:
    reconstruction_dir: "output"

The reconstruction_settings section controls where outputs are stored. Set reconstruction_dir to the directory path where you want results saved. SPRAS will automatically create this folder if it doesn’t exist. If you are running multiple configuration files, you can set unique paths to keep outputs organized and separate.

Analysis

analysis:
summary:
    include: true
cytoscape:
    include: true
ml:
    include: true

SPRAS includes multiple downstream analyses that can be toggled on or off directly in the configuration file. When enabled, these analyses are performed per dataset and produce summaries or visualizations of the results from all enabled algorithms for that dataset.

Step 2: Running SPRAS on a provided example dataset

2.1 Running SPRAS with the Beginner Configuration

The beginner.yaml configuration file is set up to have SPRAS run a single algorithm with one parameter setting on one dataset.

From the root directory spras/, run the command below from the command line:

snakemake --cores 1 --configfile config/beginner.yaml

What Happens When You Run This Command

From your perspective, SPRAS executes quickly; however, several automated steps (handled by Snakemake and Docker) occur behind the scenes.

  1. Snakemake starts the workflow

Snakemake reads the options set in the beginner.yaml configuration file and determines which datasets, algorithms, and parameter combinations need to run and whether any post-analysis steps were requested.

  2. Preparing the dataset

SPRAS takes the interactome and node prize files specified in the configuration and bundles them into a Dataset object that is used to build algorithm-specific inputs. This object is stored as a .pickle file (e.g., dataset-egfr-merged.pickle) so it can be reused by other algorithms without re-processing the inputs; a sketch for inspecting this cached file appears after this list.

  3. Creating algorithm-specific inputs

For each algorithm marked include: true in the configuration, SPRAS generates input files tailored to that algorithm from the standardized egfr dataset. In this case, only PathLinker is enabled. SPRAS creates the network.txt and nodetypes.txt files required by PathLinker in prepared/egfr-pathlinker-inputs/.

  4. Organizing results with parameter hashes

Each dataset-algorithm-parameter combination is placed in its own folder named like egfr-pathlinker-params-D4TUKMX/. D4TUKMX is a hash that uniquely identifies the specific parameter combination (k = 10 here). A matching log file in logs/parameters-pathlinker-params-D4TUKMX.yaml records the exact parameter values.

  5. Running the algorithm

SPRAS launches the PathLinker Docker image, downloading it from Docker Hub if it is not already available locally, and passes it the prepared files and parameter settings. PathLinker runs and produces a raw pathway output file (raw-pathway.txt) that holds the subnetwork it found in PathLinker's own native format.

  6. Standardizing the results

SPRAS parses the raw PathLinker output into the standardized SPRAS format (pathway.txt). Because each algorithm's native output format differs, this step ensures that the results from all algorithms end up in a common format.

  7. Logging the Snakemake run

Snakemake creates a dated log in .snakemake/log/. This log shows what rules ran and any errors that occurred during the SPRAS run.
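
If you are curious about the cached Dataset object mentioned in step 2, you can inspect it from Python. This is optional and only a sketch: it assumes the spras Conda environment is active (so the pickled class can be loaded) and that the path matches your configured output directory.

import pickle

# Adjust the path to match your configured output directory
with open("output/basic/dataset-egfr-merged.pickle", "rb") as f:
    dataset = pickle.load(f)

print(type(dataset))  # the merged Dataset object built from the interactome and prize files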

What Your Directory Structure Should Look Like After This Run:

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── beginner.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt
│   └── tps-egfr-prizes.txt
├── outputs/
│   └── basic/
│       ├── egfr-pathlinker-params-D4TUKMX/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── logs/
│       │   ├── dataset-egfr.yaml
│       │   └── parameters-pathlinker-params-D4TUKMX.yaml
│       ├── prepared/
│       │   └── egfr-pathlinker-inputs/
│       │       ├── network.txt
│       │       └── nodetypes.txt
│       └── dataset-egfr-merged.pickle

2.2 Overview of the SPRAS Folder Structure

After running the SPRAS command, you’ll see that the folder structure includes four main directories that organize everything needed to run workflows and store their results.

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── ... other configs ...
├── inputs/
│   └── ... input files ...
├── outputs/
│   └── ... output files ...

.snakemake/log/

The .snakemake/log/ directory contains records of all Snakemake jobs that were executed for the SPRAS run, including any errors encountered during those runs.

config/

Holds configuration files (YAML) that define which algorithms to run, what datasets to use, and which analyses to perform.

input/

Contains the input data files, such as interactome edge files and input nodes. This is where you can place your own datasets when running custom experiments.

output/

Stores all results generated by SPRAS. Subfolders are created automatically for each run, and their structure can be controlled through the configuration file.

By default, these directories are named config/, input/, and output/, but they can be placed anywhere within the SPRAS repository and given any name. The input and output locations are set in the configuration file, and the configuration file itself is chosen by passing its path when running the SPRAS command. SPRAS uses additional files and directories during its runs, but for most users, and for the purposes of this tutorial, it isn't necessary to understand them in detail.

2.4 Running SPRAS with More Parameter Combinations

In the beginner.yaml configuration file, uncomment the run2 section under pathlinker so it looks like:

run2:
    k: [10, 100]

With this update, the beginner.yaml configuration file is set up to have SPRAS run a single algorithm with multiple parameter settings on one dataset.

After saving the changes, rerun with:

snakemake --cores 1 --configfile config/beginner.yaml

What Happens When You Run This Command

  1. Snakemake loads the configuration file

Snakemake reads beginner.yaml to determine which datasets, algorithms, parameters, and post-analyses to run. It reuses cached results to skip completed steps, rerunning only those that are new or outdated. Here, the dataset pickle, PathLinker inputs, and D4TUKMX parameter set are reused instead of rerun.

  2. Organizing outputs per parameter combination

Each new dataset-algorithm-parameter combination gets its own folder (e.g., egfr-pathlinker-params-7S4SLU6/ and egfr-pathlinker-params-VQL7BDZ/). The hashes 7S4SLU6 and VQL7BDZ uniquely identify the specific sets of parameters used.

  3. Reusing prepared inputs with additional parameter combinations

Since PathLinker has already been run once, SPRAS uses the cached prepared inputs (network.txt, nodetypes.txt) rather than regenerating them. For each new parameter combination, SPRAS launches the PathLinker Docker image again, once per combination, and PathLinker produces a raw-pathway.txt file specific to each parameter hash.

  4. Parsing into standardized results

SPRAS parses each new raw-pathway.txt file into a standardized SPRAS format (pathway.txt).

What Your Directory Structure Should Look Like After This Run:

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── beginner.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt
│   └── tps-egfr-prizes.txt
├── outputs/
│   └── basic/
│       ├── egfr-pathlinker-params-7S4SLU6/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── egfr-pathlinker-params-D4TUKMX/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── egfr-pathlinker-params-VQL7BDZ/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── logs/
│       │   ├── dataset-egfr.yaml
│       │   ├── parameters-pathlinker-params-7S4SLU6.yaml
│       │   ├── parameters-pathlinker-params-D4TUKMX.yaml
│       │   └── parameters-pathlinker-params-VQL7BDZ.yaml
│       ├── prepared/
│       │   └── egfr-pathlinker-inputs/
│       │       ├── network.txt
│       │       └── nodetypes.txt
│       └── dataset-egfr-merged.pickle

2.5 Reviewing the pathway.txt Files

Each algorithm and parameter combination produces a corresponding pathway.txt file. These files contain the reconstructed subnetworks and can be used directly or as input to further post-analysis.

  1. Locate the files

Navigate to the output directory spras/output/basic/. Inside, you will find subfolders corresponding to each dataset-algorithm-parameter combination.

  2. Open a pathway.txt file

Each file lists the network edges that were reconstructed for that specific run. The format includes columns for the two interacting nodes, the rank, and the edge direction.

For example, the file egfr-pathlinker-params-7S4SLU6/pathway.txt contains the following reconstructed subnetwork:

Node1          Node2          Rank  Direction
EGF_HUMAN      EGFR_HUMAN     1     D
EGF_HUMAN      S10A4_HUMAN    2     D
S10A4_HUMAN    MYH9_HUMAN     2     D
K7PPA8_HUMAN   MDM2_HUMAN     3     D
MDM2_HUMAN     P53_HUMAN      3     D
S10A4_HUMAN    K7PPA8_HUMAN   3     D
K7PPA8_HUMAN   SIR1_HUMAN     4     D
MDM2_HUMAN     MDM4_HUMAN     5     D
MDM4_HUMAN     P53_HUMAN      5     D
CD2A2_HUMAN    CDK4_HUMAN     6     D
CDK4_HUMAN     RB_HUMAN       6     D
MDM2_HUMAN     CD2A2_HUMAN    6     D
EP300_HUMAN    P53_HUMAN      7     D
K7PPA8_HUMAN   EP300_HUMAN    7     D
K7PPA8_HUMAN   UBP7_HUMAN     8     D
UBP7_HUMAN     P53_HUMAN      8     D
K7PPA8_HUMAN   MDM4_HUMAN     9     D
MDM4_HUMAN     MDM2_HUMAN     9     D

The pathway.txt files serve as the foundation for further analysis, allowing you to explore and interpret the reconstructed networks in greater detail. In this case, you can visualize them in Cytoscape or compare their statistics to better understand these outputs.
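
If you want to work with a reconstructed subnetwork programmatically, you can also load a pathway.txt file into a script. The sketch below assumes pandas is installed in your environment, that the file is tab-separated with the header shown above, and that the path matches your output directory:

import pandas as pd

# Load one reconstructed pathway; adjust the path to the parameter combination you want
pathway = pd.read_csv("output/basic/egfr-pathlinker-params-7S4SLU6/pathway.txt", sep="\t")

nodes = pd.unique(pathway[["Node1", "Node2"]].values.ravel())
print(pathway.head())
print(f"{len(pathway)} edges connecting {len(nodes)} nodes")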

Step 3: Running Post-Analyses within SPRAS

To enable downstream analyses, update the analysis section in your configuration file by setting both summary and cytoscape to true. Your analysis section in the configuration file should look like this:

analysis:
    summary:
        include: true
    cytoscape:
        include: true

summary generates graph topology statistics for each algorithm's parameter combination output, producing a single summary file that covers all reconstructed subnetworks for each dataset. This post-analysis reports the following statistics for each pathway (a rough sketch of how a few of them could be computed appears after the list):

  • Number of nodes

  • Number of edges

  • Number of connected components

  • Network density

  • Maximum degree

  • Median degree

  • Maximum diameter

  • Average path length
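
As a rough sketch (not SPRAS's actual implementation, which may treat edge direction differently), a few of these statistics could be computed from a single pathway.txt file with pandas and networkx, assuming both packages are installed and the path matches your output directory:

import networkx as nx
import pandas as pd

# Load one reconstructed pathway and build an undirected graph from its edges
edges = pd.read_csv("output/basic/egfr-pathlinker-params-D4TUKMX/pathway.txt", sep="\t")
graph = nx.from_pandas_edgelist(edges, source="Node1", target="Node2")

print("nodes:", graph.number_of_nodes())
print("edges:", graph.number_of_edges())
print("connected components:", nx.number_connected_components(graph))
print("density:", nx.density(graph))
print("maximum degree:", max(dict(graph.degree()).values()))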

cytoscape creates a Cytoscape session file (.cys) containing all reconstructed subnetworks for each dataset, making it easy to upload and visualize them directly in Cytoscape.

With this update, the beginner.yaml configuration file is set up for SPRAS to run two post-analyses on the outputs generated by a single algorithm that was executed with multiple parameter settings on one dataset.

After saving the changes, rerun with:

snakemake --cores 1 --configfile config/beginner.yaml

What Happens When You Run This Command

  1. Reusing cached results

Snakemake reads the options set in beginner.yaml and checks for any requested post-analysis steps. It reuses cached results; in this case, the pathway.txt files generated from the previously executed PathLinker parameter combinations for the egfr dataset.

  2. Running the summary analysis

SPRAS aggregates the pathway.txt files from all selected parameter combinations into a single summary table. The results are saved in egfr-pathway-summary.txt.

  3. Running the Cytoscape analysis

All pathway.txt files from the chosen parameter combinations are collected and passed into the Cytoscape Docker image. A Cytoscape session file is then generated, containing visualizations for each pathway and saved as egfr-cytoscape.cys.

What Your Directory Structure Should Look Like After This Run:

spras/
├── .snakemake/
│   └── log/
│       └── ... snakemake log files ...
├── config/
│   └── beginner.yaml
├── inputs/
│   ├── phosphosite-irefindex13.0-uniprot.txt
│   └── tps-egfr-prizes.txt
├── outputs/
│   └── basic/
│       ├── egfr-pathlinker-params-7S4SLU6/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── egfr-pathlinker-params-D4TUKMX/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── egfr-pathlinker-params-VQL7BDZ/
│       │   ├── pathway.txt
│       │   └── raw-pathway.txt
│       ├── logs/
│       │   ├── dataset-egfr.yaml
│       │   ├── parameters-pathlinker-params-7S4SLU6.yaml
│       │   ├── parameters-pathlinker-params-D4TUKMX.yaml
│       │   └── parameters-pathlinker-params-VQL7BDZ.yaml
│       ├── prepared/
│       │   └── egfr-pathlinker-inputs/
│       │       ├── network.txt
│       │       └── nodetypes.txt
│       ├── dataset-egfr-merged.pickle
│       ├── egfr-cytoscape.cys
│       └── egfr-pathway-summary.txt

3.1 Reviewing the Outputs

After completing the workflow, you will have several post-analysis outputs that help you explore and interpret the results:

  1. egfr-cytoscape.cys: a Cytoscape session file containing visualizations of the reconstructed subnetworks.

  2. egfr-pathway-summary.txt: a summary file with statistics describing each network.

Reviewing Summary Files

  1. Open the summary statistics file

In your file explorer, go to spras/output/basic/egfr-pathway-summary.txt and open it locally.

[Screenshot: the egfr-pathway-summary.txt file]

This file summarizes the graph topological statistics for each output pathway.txt file for a given dataset, along with the parameter combinations that produced them, allowing you to interpret and compare algorithm outputs side by side in a compact format.

Reviewing Outputs in Cytoscape

  1. Open Cytoscape

Launch the Cytoscape application on your computer.

  2. Load the Cytoscape session file

Navigate to spras/output/basic/egfr-cytoscape.cys and open it in Cytoscape.

[Screenshots: opening the egfr-cytoscape.cys session file in Cytoscape]

Once loaded, the session will display all reconstructed subnetworks for a given dataset, organized by algorithm and parameter combination.

[Screenshot: Cytoscape session with the reconstructed subnetworks organized by algorithm and parameter combination]

You can view and interact with each reconstructed subnetwork. Compare how the different parameter settings influence the pathways generated.

The small parameter value (k=1) produced a compact subnetwork:

[Screenshot: reconstructed subnetwork for k=1]

The moderate parameter value (k=10) expanded the subnetwork, introducing additional nodes and edges that may uncover new connections:

[Screenshot: reconstructed subnetwork for k=10]

The large parameter value (k=100) produced a much denser subnetwork, capturing a broader range of edges but potentially introducing connections that are less meaningful:

[Screenshot: reconstructed subnetwork for k=100]

The parameters used here help determine which edges and nodes are included; each setting produces a different subnetwork. By examining the statistics (egfr-pathway-summary.txt) alongside the visualizations (Cytoscape), you can assess how parameter choices influence both the structure and interpretability of the outputs.