DVC vs Git: why Git isn't enough for machine learning projects






Introduction



Despite all its benefits, DVC is still unknown to most developers, so a short introduction is in order. DVC is an open-source data version control system that is well suited to machine learning. It differs from Git in two main ways: first, it offers a broader and more convenient toolkit for ML projects; second, it is designed to version data rather than code. For the most part, that is where the major differences end. Below I will try to explain why DVC is so useful and why Git alone is not enough for ML.









Reproducibility crisis



«Reproducibility crisis» ( . – « »), , , , , .







? , 98.5%, ?

























Git . , / - , , , GitHub. . , , . – , - joblib. , . – Git-LFS







Git-LFS [] Git , Git. – / , . . . . , :







  • Git-LFS – 1 GitHub ( ), Gitlab Atlassian . , LFS .
  • , .
  • Git-LFS . LFS .
  • Git-LFS .




Data Version Control



DVC Git. , (, Git). DVC + Git :











Github’ - . ( ) , . .







DVC . , - , - «- 0 1». DVC «1» . – : «0 0 1», «1 1 2» «2 2 ». 6 . , DVC . , Make, DVC .







What DVC offers:







  • Creation of pipelines for processing datasets and their visualization in the console;
  • Saving and tracking all metrics;
  • Switching between file versions;
  • Reproduction of models on the created pipelines.
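
To make "switching between file versions" more concrete, here is a minimal Python sketch of the general idea behind data version control: the large file goes into a content-addressed cache, and only a tiny pointer file (the thing you would commit to Git) records which version is current. This is an illustration under simplifying assumptions, not DVC's actual implementation. The `track` and `checkout` helpers, the `.ptr` extension, the JSON pointer layout and the use of SHA-256 are all hypothetical choices made for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return a hex digest of the file contents (SHA-256 chosen for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy the file into a content-addressed cache and write a small pointer file.

    The pointer is a few bytes of JSON, so it is cheap to keep in Git,
    while the data itself lives in the cache under its digest.
    """
    digest = file_digest(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(path, cached)
    pointer = path.with_name(path.name + ".ptr")  # hypothetical extension
    pointer.write_text(json.dumps({"digest": digest, "path": path.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file version that the pointer file refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["digest"], target)
    return target


def demo() -> dict:
    """Track two versions of a dataset, then switch back and forth between them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "train.csv"

        # Version 1 of the dataset.
        data.write_text("x,y\n1,2\n")
        pointer = track(data, cache)
        v1_pointer = pointer.read_text()

        # Version 2: the dataset grows, the pointer now names a new digest.
        data.write_text("x,y\n1,2\n3,4\n")
        track(data, cache)
        v2_pointer = pointer.read_text()

        # Switching versions = restoring an old pointer (the job Git would do)
        # and then materialising the matching data from the cache.
        pointer.write_text(v1_pointer)
        v1 = checkout(pointer, cache).read_text()

        pointer.write_text(v2_pointer)
        v2 = checkout(pointer, cache).read_text()

        return {
            "v1": v1,
            "v2": v2,
            "cached_versions": len(list(cache.iterdir())),
        }


if __name__ == "__main__":
    print(demo())
```

The design point is the split of responsibilities: Git only ever sees the small pointer file, so the repository stays light, while the cache holds every version of the data exactly once.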








