Introduction
Despite all of DVC's benefits, very few developers know about the tool, so a short introduction will not be superfluous. DVC is an open-source data version control system that is well suited to machine learning. It differs from Git in two main ways: first, it offers a broader and more convenient toolkit for ML projects; second, it is designed for versioning data rather than code. For the most part, that is where the major differences end. In what follows I will try to explain why DVC is so good and why Git alone is not enough for ML.
Reproducibility crisis
«Reproducibility crisis» ( . – « »), , , , , .
? , 98.5%, ?
, . . , . – , , , , , .
, – . , / . , .
Git . , / - , , , GitHub. . , , . – , - joblib. , . – Git-LFS
Git-LFS [] Git , Git. – / , . . . . , :
- Git-LFS – 1 GitHub ( ), Gitlab Atlassian . , LFS .
- , .
- Git-LFS . LFS .
- Git-LFS .
Data Version Control
DVC Git. , (, Git). DVC + Git :
Github’ - . ( ) , . .
DVC . , - , - «- 0 1». DVC «1» . – : «0 0 1», «1 1 2» «2 2 ». 6 . , DVC . , Make, DVC .
DVC features:
- ;
- ;
- Creating pipelines for processing datasets and visualizing them in the console;
- Saving and tracking all metrics;
- Switching between file versions;
- Reproducing models from the created pipelines.
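To make "switching between file versions" more concrete, here is a toy sketch in Python. It is not DVC's code and does not call DVC. It only assumes the general idea of keeping every version of a data file in a cache keyed by a hash of its content, so that a short hash (small enough to commit to Git) identifies each version. The helpers `add` and `checkout`, the `cache` directory, and the `data.csv` file are hypothetical names chosen for the illustration, and MD5 is used simply as a convenient content hash.

```python
import hashlib
import tempfile
from pathlib import Path


def add(data_file: Path, cache_dir: Path) -> str:
    """Copy the file into the cache under its MD5 digest and return the digest."""
    content = data_file.read_bytes()
    digest = hashlib.md5(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / digest).write_bytes(content)
    return digest


def checkout(digest: str, data_file: Path, cache_dir: Path) -> None:
    """Restore the version identified by `digest` into the working file."""
    data_file.write_bytes((cache_dir / digest).read_bytes())


# Hypothetical workspace: a throwaway directory with one small dataset.
workspace = Path(tempfile.mkdtemp())
cache = workspace / "cache"
dataset = workspace / "data.csv"

# Version 1 of the dataset.
dataset.write_text("x,y\n1,2\n")
v1 = add(dataset, cache)

# Version 2: the same file with one more row.
dataset.write_text("x,y\n1,2\n3,4\n")
v2 = add(dataset, cache)

# Switch the working file back to version 1 using only its short hash.
checkout(v1, dataset, cache)
restored = dataset.read_text()
```

In this sketch the large file never needs to live in the code repository: only the digests `v1` and `v2` would be recorded next to the code, which is the division of labor the article attributes to the DVC and Git pairing.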