Git / Snakemake

Peter Alping

February 19, 2020

Basics

  1. Benefits of these tools and how they work (presentation)
  2. How it looks when we put it all together (overview demo)
  3. Demonstration of a project from start to finish (step-by-step demo)

Start using these tools in your own projects today!

Advanced (if there is time)

  • Remote repositories (push/pull)
  • Hosted repositories (GitHub/GitLab)
  • Git tagging and branches (merge/rebase)
  • Workflow parallelization (Snakemake)

Version Control (Git)

Any system that keeps track of changes in files and folders,
making it possible to review these changes and roll back to previous versions

  • Fully manual system (example)
    • Cumbersome
  • Fully automatic system
    • No inherent meaning to the versions
  • Semi-automatic
    • Automate most things, but make sure every version has a meaning

The problems it helps solve

  • Documents entire work process as changes in code and text
  • Facilitates open and reproducible research
  • Makes collaboration possible
  • Makes it easy to back up to a server while keeping a local copy

How it works

  • All history is stored in a repository (database)
    • Local repository (your project folder)
    • Remote repository (other folder/server/GitHub/GitLab)
  • Changes are tracked
    • Add changes -> Commit changes (message) -> Push changes (server)
  • Language agnostic (works with Python, R, SAS, Stata, text, etc.)
  • (Git vs GitHub/GitLab)
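
The add -> commit -> push cycle above can be sketched with a toy in-memory model. This is an illustration only, not how Git stores data: `ToyRepo`, `Commit`, and the file name `analysis.py` are hypothetical, and `push` assumes the remote history is a prefix of the local one.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    message: str
    snapshot: dict  # file name -> content at commit time


@dataclass
class ToyRepo:
    """Illustrative stand-in for a repository; not Git's real data model."""
    staged: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def add(self, name, content):
        # Stage a change; nothing is recorded in the history yet.
        self.staged[name] = content

    def commit(self, message):
        # Record a new version: the previous snapshot plus the staged changes.
        previous = self.history[-1].snapshot if self.history else {}
        self.history.append(Commit(message, {**previous, **self.staged}))
        self.staged = {}

    def push(self, remote):
        # Copy the commits the remote does not have yet
        # (assumes the remote history is a prefix of the local one).
        remote.history.extend(self.history[len(remote.history):])


local, remote = ToyRepo(), ToyRepo()
local.add("analysis.py", "print('v1')")
local.commit("Add analysis script")
local.add("analysis.py", "print('v2')")
local.commit("Update analysis script")
local.push(remote)

for commit in remote.history:
    print(commit.message, "->", commit.snapshot)
```

Every version in the history carries a message, which is the "semi-automatic" idea: the tool does the bookkeeping, you supply the meaning.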

Find out more about Git at: https://git-scm.com

https://xkcd.com/1597

Workflow Manager (Snakemake)

Any system that lets you specify and run a workflow,
e.g. running a set of instructions in a specific order

  • A note describing in which order to run things
  • A script calling other scripts in the correct order
  • A dependency graph that only runs the necessary steps

The problems it helps solve

  • Reproducible and scalable data analyses
  • Self-documenting workflow (helps in understanding and communication)
  • Allows for combining different tools (SAS, Python, R, Stata)
  • Saves time through easy parallelization and running only the necessary code
  • Helps when returning to six-month-old code or responding to reviewers

How it works

  • Specify rules
    • Input files
    • Output files
    • Code to run
  • A DAG (directed acyclic graph) is constructed from the rules
  • The DAG is used to determine what parts of the workflow to run
  • You only have to specify rules and the workflow manager handles the rest
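
A minimal sketch of the idea, assuming a simplified staleness check based only on modification times (Snakemake itself considers more than this). The file names, the `rules` and `mtimes` dictionaries, and the `plan` function are hypothetical, and the standard-library `graphlib` module stands in for Snakemake's own DAG handling.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical rules: each output file maps to the input files it needs.
rules = {
    "data.csv": ["raw.txt"],
    "table.csv": ["data.csv"],
    "figure.png": ["data.csv"],
    "report.html": ["table.csv", "figure.png"],
}

# Hypothetical modification times; None means the file does not exist yet.
mtimes = {
    "raw.txt": 1,
    "data.csv": 2,
    "table.csv": 3,
    "figure.png": None,
    "report.html": 4,
}


def plan(rules, mtimes):
    """Return the outputs that need (re)building, in a valid run order."""
    to_run = []
    # static_order() yields every file after all of its inputs.
    for target in TopologicalSorter(rules).static_order():
        inputs = rules.get(target)
        if inputs is None:
            continue  # source file: no rule produces it
        missing = mtimes.get(target) is None
        # Stale if an input will be rebuilt or is newer than the target.
        stale = any(
            i in to_run or (mtimes.get(i) or 0) > (mtimes.get(target) or 0)
            for i in inputs
        )
        if missing or stale:
            to_run.append(target)
    return to_run


print(plan(rules, mtimes))  # -> ['figure.png', 'report.html']
```

Only the missing figure and the report that depends on it are scheduled; the up-to-date data and table steps are skipped.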

Find out more about Snakemake at: https://snakemake.readthedocs.io

(SAS Enterprise Guide has something similar)

Putting it together

Scenario

A colleague has asked you to review their work
and gives you access to their Git repository

  1. Clone repository (copying the files)
  2. Run Snakemake (run all the steps in the workflow)
  3. Inspect output and workflow DAG (and the code)

# Clone repository
git clone "path/to/repository.git"
# Enter project directory
cd repository
# Run Snakemake
snakemake

Demo

A project using Git and Snakemake

  1. Initialize Git repository
  2. Read data with SAS
  3. Make a table using Python
  4. Make figures using R
  5. Create a report using the table and figures

Commit changes and create Snakemake rules as we go

Reference

Basic Git Commands

# Initialize a new repository
git init
# Check if any files have been modified
git status
# Show changes that have not been staged yet
git diff
# Stage changes before commit
git add "path/to/file"
# Commit with a short message
git commit -m "Commit message"
# If you have a remote repo, start by getting any changes
git pull
# Push to remote repository
git push

Pro Git book (free)

Basic Snakemake File

# A basic rule inside Snakefile
rule my_rule:
    input:
        "path/to/input_1.txt",
        "path/to/input_2.txt",
    output:
        "path/to/output.txt"
    script:
        "path/to/script.py"
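
What the script behind such a rule might look like, as a hedged sketch: when Snakemake runs a Python file through `script:`, it makes a `snakemake` object available with the rule's `input` and `output`. The `combine` function and its behavior (concatenating the inputs) are assumptions made for illustration.

```python
from pathlib import Path


def combine(input_paths, output_path):
    """Concatenate the input text files into one output file."""
    text = "".join(Path(p).read_text() for p in input_paths)
    Path(output_path).write_text(text)


# Snakemake injects a `snakemake` object when it runs a file via `script:`.
# The guard lets the file be imported or tested without Snakemake.
if "snakemake" in globals():
    combine(snakemake.input, snakemake.output[0])
```

Keeping the work in a plain function means the same code can be tried on small test files before it is wired into the workflow.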

Run rule from the command line:

# Check rule with a dry run
snakemake my_rule -n
# Run rule
snakemake my_rule
# Force the rule to run
snakemake my_rule -f
# Run jobs in parallel (-j), keep going after a failed job (-k), print the reason for each job (-r)
snakemake my_rule -rkj

Snakemake Documentation

Summary

Git and Snakemake

Version control with Git can help keep track of changes in code
and text without having to manually keep an archive of old files

Workflow management with Snakemake can help organize the steps
from data to final report, and share these with collaborators or reviewers

You can easily start using both today in your existing projects