A DevOps approach to research

Sebastian Pölsterl

Institute of Cancer Research

C4RR, 28 June 2017

Challenges in Research #1

  • A piece of code tends to work only for a single point in time and space

Challenges in Research #2

  • Errors are often discovered months after they have been introduced

Challenges in Research #3

  • We want to repeat the same analysis with different parameters and/or data, sometimes 6 months later

The DevOps Approach

  • DevOps engineers are responsible for developing, testing, and operating a piece of software
  • This is only possible with a large set of tools that help automate parts of their work (version control, unit testing, deployment, monitoring)
  • Git and Docker are two of the most fundamental tools

Tools: Jenkins, Bamboo, Chef, Travis CI, Selenium, Docker, Git, Puppet, Snort, Logstash

Docker

  • Allows bundling software and all of its dependencies into an image
  • Built images can be shared easily
  • Docker is more lightweight than virtual machines
  • Easy to get started: Docker Hub has a rich collection of pre-built images

GitLab

  • GitLab is an open-source platform for managing projects, similar to GitHub
  • Features:
    • Source code browser (Git repository)
    • Repository and activity monitoring
    • Issue management
    • Code review
    • Wiki
    • Continuous integration/deployment
    • Docker registry

Continuous integration (CI)

  • Frequently merge all developers' working copies, and identify and resolve problems as early as possible

GitLab CI

  • GitLab CI allows defining pipelines to automate certain tasks
  • Jobs can be triggered manually or automatically on specific events (e.g. a new commit is pushed)
  • GitLab CI supports Docker!
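
As a minimal sketch of the last two points, the hypothetical job below runs inside a Docker container and starts only when someone triggers it by hand. The job name `hello`, the `python:3` image, and the command are illustrative assumptions and not part of the example project.

```yaml
# Hypothetical job, for illustration only
hello:
  image: python:3      # run the job inside this Docker image (needs a Docker-capable runner)
  script:
    - python --version
  when: manual         # do not start automatically; wait for a manual trigger in GitLab
```

Without the `when: manual` line, the job would run automatically on each pipeline, for example after every push.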

Example Project

Learn the distribution of handwritten digits using a Restricted Boltzmann Machine (RBM) and sample from it.

Ingredients:
  1. Code based on Theano's RBM tutorial
  2. Docker image containing the code and its dependencies
  3. GitLab CI configuration

GitLab CI configuration

  • GitLab CI uses a YAML file (.gitlab-ci.yml)
  • Defines a set of jobs with constraints stating when they should be run
  • Jobs are picked up by Runners (shell, Docker, SSH, VirtualBox, ...)
  • Jobs run independently of each other
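
The points above can be sketched as a skeleton `.gitlab-ci.yml`. The job names (`compile`, `check`, `publish`) and the `echo` commands are placeholders chosen for illustration; only the stage names and the `docker` tag match the example project.

```yaml
# Illustrative skeleton; job names and commands are placeholders
stages:              # stages are executed in this order
  - build
  - test
  - deploy

compile:
  stage: build
  script:
    - echo "build step"

check:
  stage: test
  script:
    - echo "test step"

publish:
  stage: deploy
  script:
    - echo "deploy step"
  only:
    - master         # constraint: run only for commits on the master branch
  tags:
    - docker         # constraint: only runners tagged "docker" pick up this job
```

Each job lists its own `script`, so a job does not rely on commands run by another job; files that need to travel between jobs are typically passed as artifacts or through a registry.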

Docker Registry

image: gitlab/dind

before_script:
  - docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN registry.gitlab.com

after_script:
  - docker logout registry.gitlab.com

Build stage

build:
  script:
    - docker build --pull -t $CI_REGISTRY_IMAGE .
    - docker push $CI_REGISTRY_IMAGE
  stage: build
  tags:
    - docker

Docker Image in Registry

[Screenshot: Docker image in the GitLab registry]

Test stage

test:pytest:
  script:
    - docker pull $CI_REGISTRY_IMAGE
    - docker run --name pytest --entrypoint /bin/bash $CI_REGISTRY_IMAGE
      -c "pytest -c /pytest.ini --cov-config /.coveragerc
      --cov-report=html:cov-report rbm_model"
    - docker cp pytest:/data/cov-report - | tar xf -
  artifacts:
    paths:
      - cov-report
  stage: test
  tags:
    - docker

Coverage report

[Screenshot: pytest coverage report]

Deploy stage

train:
  script:
    - docker pull $CI_REGISTRY_IMAGE
    # Train in a container named rbm_trained so later steps can reuse its /data volume
    - docker run --name rbm_trained -v /data $CI_REGISTRY_IMAGE train
      -t /data -o /data/rbm-model.pkl --epochs 15
    - docker run --rm --volumes-from rbm_trained $CI_REGISTRY_IMAGE sample
      -t /data -m /data/rbm-model.pkl -o /data/samples.png
    - docker cp rbm_trained:/data - | tar xf -
  artifacts:
    paths:
      - data/
  stage: deploy
  tags:
    - docker

Artifacts

[Screenshot: job artifacts]

Artifacts - samples.png

[Image: samples.png]

Challenge #1

  • A piece of code tends to work only for a single point in time and space

  • Solution
    • Always use Git, and commit often!
    • Manage your dependencies (e.g. by using Docker)
    • Use continuous integration

Challenge #2

  • Errors are often discovered months after they have been introduced

  • Solution
    • Use continuous integration
    • Unit tests would be ideal, but static code analysis or regression tests are a good start
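
A static analysis job could look like the sketch below, modelled on the `test:pytest` job. It assumes that `flake8` is installed in the project image and that `rbm_model` resolves from the container's working directory; neither is confirmed by the slides, and the job name `test:flake8` is hypothetical.

```yaml
# Hypothetical lint job; assumes flake8 is available inside the project image
test:flake8:
  script:
    - docker pull $CI_REGISTRY_IMAGE
    # Override the entrypoint so the container runs the linter instead of the model
    - docker run --rm --entrypoint /bin/bash $CI_REGISTRY_IMAGE
      -c "flake8 rbm_model"
  stage: test
  tags:
    - docker
```

Because the job is in the `test` stage, a lint failure would stop the pipeline before the deploy stage runs.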

Challenge #3

  • We want to repeat the same analysis with different parameters and/or data, sometimes 6 months later

  • Solution
    • Create a new Docker image for every analysis you perform
    • Automatically push it to a Docker registry (a Dockerfile alone is not enough)
    • GitLab CI can trigger your analysis, then collect and store the results
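
One way to keep a separate image per analysis is to tag each build with the commit that produced it. The sketch below is a variation of the build job, not the configuration used in the talk; it assumes a GitLab version that provides the predefined variable `$CI_COMMIT_SHA`.

```yaml
# Sketch: tag every image with the commit it was built from
build:
  script:
    - docker build --pull -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  stage: build
  tags:
    - docker
```

With this scheme, the image used for an earlier analysis can still be pulled by its commit hash months later, instead of being overwritten by the next build.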

Thanks for your attention!

https://gitlab.com/sebp/devops-approach-to-research/