Data Version Control

What is DVC?

DVC (data version control) is an open source tool for tracking datasets and outputs of Data Science and Machine Learning projects.

DVC is designed to be integrated into your Git workflows. On top of that, the interface of DVC is very similar to Git's, so if you are familiar with Git you will find starting to use DVC very easy.

So why should you start using DVC if you already have Git to track code and an Azure instance to store your datasets and models?

  1. DVC integrates seamlessly with both Git and all major storage systems (AWS, Azure, Google Cloud, HTTP, SSH server, HDFS, ...) and does not require installing and maintaining any databases.
  2. DVC makes your projects and all their related data reproducible and shareable, allowing you to answer questions about how a model or a result was obtained.
  3. DVC lets you describe complex Data Science pipelines much like a Makefile, so that when something changes (e.g. a new training dataset, a new choice of model hyperparameters, ...) only the pipeline stages affected by the change are re-run.

Getting Started

DVC is available on PyPI and it is very easy to install:

$ pip install dvc
Depending on the remote storage you plan to use, you may want to install extra dependencies.
$ pip install dvc[s3]   # support for Amazon S3
$ pip install dvc[ssh]  # support for SSH
$ pip install dvc[all]  # support for all remote types

To use DVC you should work inside a Git repo. If one does not exist yet, we create and initialize it.

$ mkdir ml_project && cd ml_project
$ git init
Then, initializing a DVC project creates a few configuration files and automatically stages them with git add. Note how the syntax is similar to Git!
$ dvc init
$ git status -s
A  .dvc/.gitignore
A  .dvc/config
A  .dvc/plots/confusion.json
A  .dvc/plots/default.json
A  .dvc/plots/scatter.json
A  .dvc/plots/smooth.json
A  .dvcignore

$ git commit -m "Initialize dvc project"

Data Versioning

Let us assume that in our ML project we have 200,000 images that we want to use to train and test a classification model able to distinguish dogs from cats.

data
├── train
│   ├── dogs       # 95,000 pictures
│   └── cats       # 95,000 pictures
└── validation
    ├── dogs       # 5,000 pictures
    └── cats       # 5,000 pictures
Tracking all of these images with Git would be a bad idea, as we are dealing with a lot of binary files. Instead, we can easily track them with DVC.

$ dvc add data/
100% Add|██████████|1/1    [00:30, 30.51s/file]
To track the changes with git, run:
    git add .gitignore data.dvc

$ git add .gitignore  data.dvc
$ git commit -m "Add first version of data/"
$ git tag  -a "v1.0"  -m "data v1.0, 200,000 images"

As you can see, if you are familiar with Git, the syntax of DVC looks straightforward. But what exactly happened when we executed dvc add data/? Actually, quite a lot!

  1. The hash of the content of data/ was computed and written to a new data.dvc file.
  2. DVC updated .gitignore to tell Git not to track the content of data/.
  3. The physical content of data/ (i.e. the 200,000 images) was moved to a cache (by default located under .dvc/cache/).
  4. The files were linked back to the workspace so that it looks like nothing happened (the link type is configurable: hard link, soft link, reflink, or copy).
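
The steps above can be illustrated for a single file with a minimal Python sketch. It is not DVC's implementation: the function names, the cache layout (first two hash characters as a sub-directory) and the fallback to a plain copy are assumptions made for illustration only.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file's content."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(path: Path, cache_dir: Path) -> str:
    """Move a file into a content-addressed cache and link it back."""
    md5 = file_md5(path)
    # Assumed layout: <cache>/<first two hex chars>/<remaining chars>.
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if cached.exists():
        path.unlink()  # identical content is already cached
    else:
        shutil.move(str(path), str(cached))
    try:
        os.link(cached, path)  # hard link back into the workspace
    except OSError:
        shutil.copy2(cached, path)  # fall back to a plain copy
    return md5


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    image = workspace / "cat.jpg"
    image.write_bytes(b"fake image bytes")
    recorded_hash = add_to_cache(image, workspace / "cache")
    # The file is still visible in the workspace...
    still_in_workspace = image.read_bytes() == b"fake image bytes"
    # ...and its content now also lives in the cache, named after its hash.
    cached_copy = workspace / "cache" / recorded_hash[:2] / recorded_hash[2:]
    cache_has_content = cached_copy.read_bytes() == b"fake image bytes"
```

Because the cache entry is named after the content hash, adding the same content twice stores it only once.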

Now, what happens when you modify some DVC-tracked data? Let us assume, for instance, that you replaced some images in data/train/ with higher-definition ones, or that you added new training images to data/train/. We can easily check what has changed with dvc diff:

$ dvc diff
To track these changes we follow the same procedure as before with dvc add, so that DVC re-computes the hash of data/ and knows which version of your repo corresponds to which contents.
$ dvc add data/
$ git diff data.dvc
-- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
+- md5: 21060888834f7220846d1c6f6c04e649.dir
   path: data
$ git commit -am "Add images + Improve quality in data/train/"
$ git tag  -a "v2.0"  -m "data v2.0, 300,000 images in HD"
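
To see why replacing a single image changes the hash recorded in data.dvc, here is a rough Python sketch of a directory hash. It assumes the directory is summarized by hashing a sorted manifest of (relative path, file hash) pairs. The manifest format used here is an illustrative assumption, not a description of DVC's actual format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dir_hash(root: Path) -> str:
    """Hash a directory via a sorted manifest of its files' hashes."""
    manifest = [
        {
            "relpath": path.relative_to(root).as_posix(),
            "md5": hashlib.md5(path.read_bytes()).hexdigest(),
        }
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    # The ".dir" suffix mirrors the hashes shown in the git diff above.
    return hashlib.md5(payload).hexdigest() + ".dir"


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    (data / "train" / "cats").mkdir(parents=True)
    (data / "train" / "cats" / "001.jpg").write_bytes(b"low resolution")
    hash_v1 = dir_hash(data)
    # Replacing one image with a different version changes the manifest.
    (data / "train" / "cats" / "001.jpg").write_bytes(b"high resolution")
    hash_v2 = dir_hash(data)
    # Restoring the original content restores the original hash.
    (data / "train" / "cats" / "001.jpg").write_bytes(b"low resolution")
    hash_restored = dir_hash(data)
```

The hash depends only on file paths and contents, so two workspaces with identical data produce the same value.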
Now that we have several versions of our dataset data/, we can easily switch from one version to another using Git and DVC.

  1. First, we run git checkout to switch to a specific revision of the project. This guarantees that data.dvc contains the hash of the correct version of the dataset we are interested in, but this command does not modify the files in the workspace, i.e. the content of data/.
    $ git checkout v1.0
    $ dvc diff
  2. Second, we fix the mismatch between the hash written in data.dvc and the hash of the content of data/ by running dvc checkout. This command fetches from the DVC cache the correct version of the dataset based on the hash found in the data.dvc file.
    $ dvc checkout
    M       data/
    $ dvc status
    Data and pipelines are up to date.
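
The two-step switch can be mimicked with a small Python sketch for a single file. The helper names store and checkout are hypothetical, the cache is a flat directory keyed by hash, and a copy stands in for the link. All of these are simplifying assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def store(content: bytes, cache_dir: Path) -> str:
    """Save content in the cache under its MD5 hash and return the hash."""
    md5 = hashlib.md5(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / md5).write_bytes(content)
    return md5


def checkout(pointer_hash: str, cache_dir: Path, target: Path) -> None:
    """Make the workspace file match the hash recorded in the pointer."""
    cached = cache_dir / pointer_hash
    if not cached.exists():
        raise FileNotFoundError(f"{pointer_hash} is not in the cache")
    shutil.copyfile(cached, target)  # a copy stands in for the link


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    target = Path(tmp) / "data.bin"
    hash_v1 = store(b"dataset v1", cache)
    hash_v2 = store(b"dataset v2", cache)

    checkout(hash_v2, cache, target)  # the workspace holds v2
    # Switching the Git revision would only put hash_v1 back into the
    # pointer file; the workspace changes once checkout runs with it.
    before = target.read_bytes()
    checkout(hash_v1, cache, target)
    after = target.read_bytes()
```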

Working with Storages

Remote storage is to DVC what GitHub is to Git.

  1. It is used to push and pull files between your workspace and the remote.
  2. It allows easy sharing between developers.
  3. It provides a safe backup in case of catastrophic deletions (like rm -rf *).
All main remote storages are supported (Azure, AWS, Google Cloud, etc.). However, as with Git, nothing prevents us from using a remote storage located on the same filesystem.
$ mkdir -p  ~/tmp/dvc_storage
$ dvc remote add  --default loc_remote  ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.

$ git add .dvc/config  # DVC wrote the remote storage config here
$ git commit -m "Configure remote storage loc_remote"
Now, if we run dvc push, DVC will upload the content of the cache to the remote storage. This is pretty much like a git push.
$ dvc push
300000 files pushed
So, even if we deleted all data from our workspace and cache by mistake, we can retrieve all of it with dvc pull. Again, this is pretty much like git pull.
$ rm -rf   .dvc/cache   data
$ dvc pull       # update .dvc/cache with contents from remote
300000 files fetched

$ dvc checkout   # update workspace, linking data from .dvc/cache
A       data/
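
Conceptually, pushing and fetching both boil down to copying the objects that are missing on the other side. The Python sketch below illustrates this with two local directories. The function name sync_objects and the object names are made up for the example, and real remotes involve far more (authentication, transfers, integrity checks).

```python
import shutil
import tempfile
from pathlib import Path


def sync_objects(source: Path, destination: Path) -> int:
    """Copy objects present in source but missing in destination."""
    copied = 0
    for obj in sorted(source.rglob("*")):
        if not obj.is_file():
            continue
        target = destination / obj.relative_to(source)
        if target.exists():
            continue  # same name means same content in a hash-named store
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    remote = Path(tmp) / "remote"
    (cache / "ab").mkdir(parents=True)
    (cache / "ab" / "cdef").write_bytes(b"object one")
    (cache / "12").mkdir()
    (cache / "12" / "3456").write_bytes(b"object two")

    pushed = sync_objects(cache, remote)        # like a push
    pushed_again = sync_objects(cache, remote)  # nothing new to upload

    shutil.rmtree(cache)                        # simulate losing the cache
    fetched = sync_objects(remote, cache)       # like fetching it back
    restored = (cache / "ab" / "cdef").read_bytes()
```

Since objects are named after their content hash, an object that already exists on the other side never needs to be transferred again.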

Machine Learning Pipelines

So far we have seen how dvc add can be used to track large files or datasets. When we work on Machine Learning projects, trained models are typical examples of large files we would like to track.

However, we also want to track how such models (as well as other outputs of our ML pipeline) were generated, for reproducibility and better tracking. Let us consider the following example of a Machine Learning pipeline for an NLP (Natural Language Processing) sentiment analysis project.

[Figure: ML pipeline graph]

To track each stage of this pipeline with DVC, we can run the pipeline stage and track its output together with all the dependencies with dvc run:

# -n: name of the stage; -p: dependency (parameter); -d: dependency (file);
# -o: output; the last line is the command to run
$ dvc run -n prepare \
    -p prepare.seed \
    -p prepare.split \
    -d src/prepare.py \
    -d data/data.xml \
    -o data/prepared \
    python src/prepare.py data/data.xml
The parameter dependencies declared above are read from a params.yaml file that could look like this:
prepare:
  split: 0.20
  seed: 20170428

featurize:
  max_features: 500
  ngrams: 1
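
A parameter dependency such as prepare.seed can be read as a dotted path into the parameters file. The following Python sketch assumes the file nests each value under a per-stage section (prepare, featurize) and uses a plain dictionary in place of the parsed YAML. The helper resolve_param is hypothetical.

```python
from typing import Any, Mapping

# Stand-in for the parsed params.yaml, assuming per-stage sections.
PARAMS = {
    "prepare": {"split": 0.20, "seed": 20170428},
    "featurize": {"max_features": 500, "ngrams": 1},
}


def resolve_param(params: Mapping[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as 'prepare.seed' through nested mappings."""
    value: Any = params
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if the parameter is missing
    return value


seed = resolve_param(PARAMS, "prepare.seed")
split = resolve_param(PARAMS, "prepare.split")
```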


There are several advantages to using dvc run instead of running the pipeline stages by hand and then tracking the output artifacts with dvc add.

  1. Outputs are automatically tracked (i.e. saved in .dvc/cache).
  2. Pipeline stages and their parameter names are saved in a dvc.yaml file.
  3. Dependency files and parameters as well as outputs are all hashed and tracked in a dvc.lock file.
  4. The file dvc.yaml works like a Makefile, in the sense that we can reproduce a stage of the pipeline with dvc repro (e.g. dvc repro prepare), which re-runs a stage only if one of its dependencies has changed.
Let's have a closer look at dvc.yaml, which describes the graph structure of the data pipeline, similar to how a Makefile describes a software build.
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
As for the dvc.lock file generated by dvc run, it is a bit like the *.dvc files generated by dvc add. In particular, dvc.lock matches dvc.yaml and describes the latest pipeline state in order to:
  1. track intermediate and final artifacts (like a *.dvc file)
  2. allow DVC to detect when stage definitions or dependencies have changed, to trigger a re-run when dvc repro is called.
prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
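
The way a lock file enables change detection can be sketched in a few lines of Python: a stage needs to be re-run when its command, the hash of any dependency, or the value of any parameter differs from what was recorded. The dictionary layout and the function stage_changed are simplified assumptions, not DVC's internal data model.

```python
# Recorded state of the 'prepare' stage, using the values shown above.
LOCKED = {
    "cmd": "python src/prepare.py data/data.xml",
    "deps": {
        "data/data.xml": "a304afb96060aad90176268345e10355",
        "src/prepare.py": "285af85d794bb57e5d09ace7209f3519",
    },
    "params": {"prepare.seed": 20170428, "prepare.split": 0.2},
}


def stage_changed(locked: dict, cmd: str, deps: dict, params: dict) -> bool:
    """Return True when the current state differs from the locked state."""
    return (
        locked["cmd"] != cmd
        or locked["deps"] != deps
        or locked["params"] != params
    )


# Nothing changed: the stage can be skipped.
unchanged = stage_changed(
    LOCKED, LOCKED["cmd"], dict(LOCKED["deps"]), dict(LOCKED["params"])
)

# A new seed in the parameters file: the stage must be re-run.
new_params = {**LOCKED["params"], "prepare.seed": 42}
param_edit = stage_changed(LOCKED, LOCKED["cmd"], LOCKED["deps"], new_params)

# An edited script has a different hash (placeholder value here).
new_deps = {**LOCKED["deps"], "src/prepare.py": "0" * 32}
code_edit = stage_changed(LOCKED, LOCKED["cmd"], new_deps, LOCKED["params"])
```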

If dvc.yaml is like a Makefile, running dvc repro is similar to running the make command. Once we have run all stages of the pipeline, if we change one of the dependencies and run dvc repro train, only the affected stages will be re-run.

$ sed -i -e "s@max_features: 500@max_features: 1500@g" params.yaml
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize' with command:
	python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train' with command:
	python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
To track the changes with git, run:
	git add dvc.lock
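
The log above can be reproduced in spirit with a small Python sketch that decides which stages to re-run after a parameter change. The stage names come from the log. The dependency graph, the empty parameter list for train, and the rule that a re-run stage always invalidates its downstream stages are assumptions made to keep the example short.

```python
# Assumed pipeline description, listed in execution order.
PIPELINE = {
    "prepare": {"needs": [], "params": ["prepare.seed", "prepare.split"]},
    "featurize": {
        "needs": ["prepare"],
        "params": ["featurize.max_features", "featurize.ngrams"],
    },
    "train": {"needs": ["featurize"], "params": []},
}


def stages_to_rerun(pipeline: dict, changed_params: set) -> list:
    """List the stages affected by a set of changed parameters."""
    rerun = []
    for name, stage in pipeline.items():  # relies on execution order
        own_change = any(p in changed_params for p in stage["params"])
        upstream_change = any(dep in rerun for dep in stage["needs"])
        if own_change or upstream_change:
            rerun.append(name)
    return rerun


rerun = stages_to_rerun(PIPELINE, {"featurize.max_features"})
nothing = stages_to_rerun(PIPELINE, set())
```

With featurize.max_features changed, prepare is skipped while featurize and train are re-run, matching the order of the messages in the log.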

Finally, a useful feature of DVC pipelines is that they make it easy to compare the parameters and scores of models corresponding to different versions.

$ dvc params diff
Path         Param                   Old    New
params.yaml  featurize.max_features  500    1500

$ dvc metrics diff
Path         Metric    Value    Change
scores.json  auc       0.61314  0.07139
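
The idea behind a parameter comparison can be sketched in Python as flattening two versions of the parameters into dotted names and keeping the entries that differ. The functions flatten and params_diff are hypothetical helpers, and the two dictionaries stand in for two versions of the parameters file.

```python
def flatten(params: dict, prefix: str = "") -> dict:
    """Turn nested sections into dotted names, e.g. featurize.ngrams."""
    flat = {}
    for key, value in params.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def params_diff(old: dict, new: dict) -> list:
    """Return (name, old value, new value) for every changed parameter."""
    old_flat, new_flat = flatten(old), flatten(new)
    rows = []
    for name in sorted(old_flat.keys() | new_flat.keys()):
        before, after = old_flat.get(name), new_flat.get(name)
        if before != after:
            rows.append((name, before, after))
    return rows


OLD = {"featurize": {"max_features": 500, "ngrams": 1}}
NEW = {"featurize": {"max_features": 1500, "ngrams": 1}}
diff_rows = params_diff(OLD, NEW)
```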


  1. DVC is a simple yet powerful version control system for large ML data and artifacts.
  2. DVC integrates with Git through *.dvc and dvc.lock files, to version files and pipelines, respectively.
  3. To track raw ML data files, use dvc add (e.g. for an input dataset).
  4. To track intermediate or final results of an ML pipeline, use dvc run (e.g. for model weights or processed datasets).
  5. To easily compare parameters and results of two versions of a pipeline, use dvc params diff and dvc metrics diff, respectively.