DVC (data version control) is an open source tool for tracking datasets and outputs of Data Science and Machine Learning projects.
DVC is designed to be integrated into your Git workflows. On top of that, its interface is very similar to Git's, so if you are already familiar with Git, getting started with DVC will feel very natural.
So why should you start using DVC if you already have Git to track code and an Azure instance to store your datasets and models?
DVC is available on PyPI and it is very easy to install:
pip install dvc
Depending on the remote storage you will use, you may want to install specific dependencies:
$ pip install dvc[s3] # support Amazon S3
$ pip install dvc[ssh] # support ssh
$ pip install dvc[all] # support for all remote types
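Note that some shells (zsh, for example) try to expand the square brackets, so you may need to quote the package specifier; a minimal sketch:
$ pip install "dvc[s3]" # quotes keep the shell from interpreting the brackets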
To use DVC you should work inside of a Git repo. If it does not exist yet, we create and initialize one.
$ mkdir ml_project && cd ml_project
$ git init
Then, initializing a DVC project creates a few configuration files and automatically runs git add on them. Note how similar the syntax is to Git's!
$ dvc init
$ git status -s
A .dvc/.gitignore
A .dvc/config
A .dvc/plots/confusion.json
A .dvc/plots/default.json
A .dvc/plots/scatter.json
A .dvc/plots/smooth.json
A .dvcignore
$ git commit -m "Initialize dvc project"
Let us assume that in our ML project we have 200,000 images that we want to use to train and test a
classification model able to distinguish dogs from cats.
Tracking all of these images with Git would be a bad idea, as we are dealing with a lot of binary
files. Instead, we can easily track them with DVC.
data
├── train
│   ├── dogs        # 95,000 pictures
│   └── cats        # 95,000 pictures
└── validation
    ├── dogs        # 5,000 pictures
    └── cats        # 5,000 pictures
$ dvc add data/
100% Add|██████████|1/1 [00:30, 30.51s/file]
To track the changes with git, run:
git add .gitignore data.dvc
$ git add .gitignore data.dvc
$ git commit -m "Add first version of data/"
$ git tag -a "v1.0" -m "data v1.0, 200,000 images"
As you can see, if you are familiar with Git, the syntax of DVC looks straightforward.
But what exactly has happened when we executed dvc add data/? Actually, quite a lot of things!
- The hash of the content of data/ was computed and added to a new data.dvc file.
- DVC updated .gitignore to tell Git not to track the content of data/.
- The content of data/ (i.e. the 200,000 images) has been moved to a cache (by default the cache is located under .dvc/cache/).
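To make this concrete, here is roughly what the generated files contain (the hash is the one from this example; yours will differ):
$ cat data.dvc
outs:
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
$ cat .gitignore
/data
Git only ever sees the small data.dvc text file, while the images themselves live in the DVC cache.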
Now, what happens when you modify some DVC-tracked data? Let us assume, for instance, that you replaced some images in data/train/ with higher definition ones, or that you added new training images to data/train/. We can easily check what has changed with dvc diff:
$ dvc diff
Modified:
data/
To track these changes we follow the same procedure as before with dvc add, so that DVC will re-compute the hash of data/ to know which version of your repo corresponds to which contents.
$ dvc add data/
$ git diff data.dvc
outs:
-- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
+- md5: 21060888834f7220846d1c6f6c04e649.dir
path: data
$ git commit -am "Add images + Improve quality in data/train/"
$ git tag -a "v2.0" -m "data v2.0, 300,000 images in HD"
Now that we have several versions of our dataset data/, we can easily switch from one version to another using Git and DVC. First, we use git checkout to switch to a specific revision of the project. This guarantees that data.dvc contains the hash of the correct version of the dataset we are interested in, but this command does not modify the files in the workspace, i.e. the content of data/.
$ git checkout v1.0
$ dvc diff
Modified:
data/
Then, we synchronize the content of data/ in the workspace with the hash stored in data.dvc by running dvc checkout. This command fetches from the DVC cache the correct version of the dataset, based on the hash found in the data.dvc file.
$ dvc checkout
M data/
$ dvc status
Data and pipelines are up to date.
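To move back to the latest version of both the code and the data, the same two steps apply, in the same order; a quick sketch (assuming the default branch is called master):
$ git checkout master
$ dvc checkout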
A remote storage is for DVC what GitHub is for Git: it is where you store and share copies of your data, so that you can recover them even after an accidental rm -rf *. For this example we simply use a local directory as remote storage.
$ mkdir -p ~/tmp/dvc_storage
$ dvc remote add --default loc_remote ~/tmp/dvc_storage
Setting 'loc_remote' as a default remote.
$ git add .dvc/config # DVC wrote the remote storage config here
$ git commit -m "Configure remote storage loc_remote"
Now, if we run dvc push, DVC will upload the content of the cache to the remote storage. This is pretty much like a git push.
$ dvc push
300000 files pushed
So, even if by mistake we deleted all the data from our workspace and cache, we can retrieve all of it with dvc pull. Again, this is pretty much like git pull.
$ rm -rf .dvc/cache data
$ dvc pull # update .dvc/cache with contents from remote
300000 files fetched
$ dvc checkout # update workspace, linking data from .dvc/cache
A data/
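You can also check at any time whether the local cache and the remote storage are in sync by passing the --cloud flag to dvc status (the exact output message depends on your DVC version):
$ dvc status --cloud # compare the content of the cache with the default remote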
So far we have seen how dvc add
can be used to track large files or datasets. When we
work on Machine Learning projects, trained models are typical examples of large files we would like
to track.
However, we also want to track how such models (as well as the other outputs of our ML pipeline) were generated, for the sake of reproducibility. Let us consider the following example of a Machine Learning pipeline for an NLP (Natural Language Processing) sentiment analysis project.
To track each stage of this pipeline with DVC, we can run the stage and track its output together with all of its dependencies using dvc run:
$ dvc run -n prepare \
          -p prepare.seed \
          -p prepare.split \
          -d src/prepare.py \
          -d data/data.xml \
          -o data/prepared \
          python src/prepare.py data/data.xml
Here -n gives the stage a name, -p declares parameter dependencies, -d declares file dependencies, -o declares the output, and the final argument is the command to run.
The parameter dependencies declared above are read from a params.yaml
file that could look
like this:
prepare:
  split: 0.20
  seed: 20170428
featurize:
  max_features: 500
  ngrams: 1
...
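For completeness, the remaining stages of the pipeline can be registered in the same way. The following is only a sketch: the exact parameters, dependencies and outputs depend on what src/featurization.py and src/train.py actually read and write.
$ dvc run -n featurize \
          -p featurize.max_features \
          -p featurize.ngrams \
          -d src/featurization.py \
          -d data/prepared \
          -o data/features \
          python src/featurization.py data/prepared data/features
$ dvc run -n train \
          -d src/train.py \
          -d data/features \
          -o model.pkl \
          python src/train.py data/features model.pkl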
There are several advantages of using this dvc run approach instead of running the pipeline stages manually and then tracking the output artifacts with dvc add:
- The output of the stage (here data/prepared) is automatically tracked by DVC and stored in the cache (.dvc/cache).
- The stage definition (name, command, dependencies, outputs) is recorded in a dvc.yaml file.
- The exact state of the pipeline (the hashes of dependencies and outputs, and the parameter values) is recorded in a dvc.lock file.
Moreover, dvc.yaml works as a Makefile, in the sense that we can reproduce each stage of the pipeline with dvc repro prepare, which automatically decides to re-run a stage only if one of its dependencies has changed.
Here is the content of the generated dvc.yaml, which, as mentioned, describes the graph structure of the data pipeline, similar to how a Makefile works for building software:
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
    - data/data.xml
    - src/prepare.py
    params:
    - prepare.seed
    - prepare.split
    outs:
    - data/prepared
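If you want to visualize this graph from the command line, recent DVC versions provide the dvc dag command; a rough sketch of its output while only the prepare stage is defined:
$ dvc dag
+---------+
| prepare |
+---------+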
As for the dvc.lock file generated by dvc run, it is a bit like the *.dvc files generated by dvc add. In particular, dvc.lock describes the latest pipeline state in order to:
- track the intermediate and final outputs of the pipeline (similarly to a *.dvc file);
- let DVC detect whether stage definitions or their dependencies have changed, and hence which stages need to be re-run when dvc repro is called.
prepare:
  cmd: python src/prepare.py data/data.xml
  deps:
  - path: data/data.xml
    md5: a304afb96060aad90176268345e10355
  - path: src/prepare.py
    md5: 285af85d794bb57e5d09ace7209f3519
  params:
    params.yaml:
      prepare.seed: 20170428
      prepare.split: 0.2
  outs:
  - path: data/prepared
    md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
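Both dvc.yaml and dvc.lock are small text files meant to be versioned with Git, together with the .gitignore entry that DVC adds for the stage output. A quick sketch (the commit message is of course arbitrary):
$ git add dvc.yaml dvc.lock data/.gitignore
$ git commit -m "Add prepare stage of the pipeline"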
If dvc.yaml is like a Makefile, running dvc repro is similar to running the make command. Once we have run all stages of the pipeline, if we change one of the dependencies and run dvc repro train, only the affected stages will be re-run.
$ sed -i -e "s@max_features: 500@max_features: 1500@g" params.yaml
$ dvc repro train
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Running stage 'featurize' with command:
python src/featurization.py data/prepared data/features
Updating lock file 'dvc.lock'
Running stage 'train' with command:
python src/train.py data/features model.pkl
Updating lock file 'dvc.lock'
To track the changes with git, run:
git add dvc.lock
Finally, a useful feature of DVC pipelines is that they make it easy to compare the parameters and the metrics of models corresponding to different versions.
$ dvc params diff
Path Param Old New
params.yaml featurize.max_features 500 1500
$ dvc metrics diff
Path Metric Value Change
scores.json auc 0.61314 0.07139
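By default these commands compare the current workspace against the last Git commit. If you just want to print the metrics of the current version, without any comparison, dvc metrics show can be used; a minimal sketch:
$ dvc metrics show # print the metrics (here, the content of scores.json) for the workspace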
To sum up, DVC relies on *.dvc and dvc.lock files to version files and pipelines, respectively. Large files and datasets can be tracked with dvc add (e.g. an input dataset), while the outputs of pipeline stages can be tracked with dvc run (e.g. model weights or a processed dataset). The parameters and metrics of different versions can then be compared with dvc params diff and dvc metrics diff, respectively.