
Data Analysis

Data Version Control Explained

If you intend to study and reproduce the results of machine learning research, you'll need to implement a version control system for your data. Here's an overview of how data version control works.

6 October 2021

by Nimra Ejaz

In today’s data-driven world, machine learning experts and data scientists deal with a large volume of datasets, files, and metrics to carry out day-to-day operations. The varying versions of these artifacts need to be tracked and managed as experiments are performed on them in multiple iterations. Data Version Control is a great practice for managing numerous datasets, machine learning models, and files in addition to keeping a record of multiple iterations – i.e. when, why, and what was altered.

Introduction to Data Version Control

DVC is an open-source system that ensures reproducibility within machine learning experiments, since its users do not have to manually remember which data model uses which dataset and what actions were conducted to get the desired result.

Furthermore, DVC users do not have to rebuild previous models or repeat data modeling steps to return to an earlier state of their results. Even when a project contains many models and metrics, DVC removes the effort of working out which model was trained with which type or version of data, because a distinct record is kept for each iteration.

DVC consists of a bundle of tools and processes that track changing versions of data and collections of previous data (in other words, no more digging around for files with names like “old2-v2.html”). Repositories in DVC usually refer to the files or directories that are tracked by the version control system. A categorized state is recorded for each change committed to a file (e.g. add, delete, move, or modify).
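The idea of a categorized state per change can be illustrated with a minimal Python sketch. It is not DVC's implementation: the `classify_changes` function and the snapshot format (a mapping from file path to content hash) are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot format: file path -> content hash.
Snapshot = Dict[str, str]


def classify_changes(old: Snapshot, new: Snapshot) -> Dict[str, list]:
    """Label each difference between two snapshots as add, delete, move, or modify."""
    removed = {p: h for p, h in old.items() if p not in new}
    added = {p: h for p, h in new.items() if p not in old}

    # A removed path whose hash reappears under a new path is treated as a move.
    moves: List[Tuple[str, str]] = []
    for old_path, content_hash in sorted(removed.items()):
        match = next((p for p in sorted(added) if added[p] == content_hash), None)
        if match is not None:
            moves.append((old_path, match))
            del added[match]
    moved_from = {src for src, _ in moves}

    return {
        "add": sorted(added),
        "delete": sorted(p for p in removed if p not in moved_from),
        "move": moves,
        # Same path in both snapshots but different content.
        "modify": sorted(p for p in old if p in new and old[p] != new[p]),
    }
```

Comparing hashes rather than file names is what lets a renamed file show up as a move instead of an unrelated delete and add.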

A Brief History of DVC

DVC was released in 2017 as a simple command line tool and most recently released version 1.11.2. It has been adopted by thousands of users and boasts 150+ contributors. Development on DVC is guided by discussions between community members, most of whom are ML engineers, software developers, and data scientists.

It took almost three years of planning to release DVC (which bills itself as “Git for Machine Learning Projects”) with stabilized commands and file formats. An upgraded version of DVC is under active development, which aims to further improve the data management layer and make the tool less complicated to use.

Accessing and Preserving Large Datasets before DVC

Controlling large-scale data without DVC or a similar tool is almost inconceivable today. Before these tools evolved, ML statistics were handled manually by good old CTRL-C, CTRL-V, and conventional file trees. Here were a few tactics used to manage unwieldy large data files:

Effect of DVC on Workflows

To ensure accuracy in projects, data scientists frequently spend weeks and months on time-consuming experimentation. They perform the tedious task of configuring which model to train with what dataset. DVC impacts this process in the following ways:

DVC Tools and Frameworks

In a productive ML environment, scientists face many challenges like versioning in a collaborative environment and maintaining enough storage space. To simplify data management and tackle these issues head-on, you can utilize the following tools:

| Tool | Pros | Cons | Open Source | Convenient to Use | Supports Cloud |
|------|------|------|-------------|-------------------|----------------|
| DVC | Lightweight pipelines, supports cloud storage | Tightly coupled, redundancy | Yes | Yes | Yes |
| DOLT | SQL interface, lightweight | Does not support images and free text, still evolving | Yes | No | No |
| Pachyderm | Portable, robust, and offers scalability options | Integrating with existing structure is complicated | Yes | No | Yes |
| Delta Lake | Effective for data processing, allows ACID transactions | Less flexible, built for Spark and big data | Yes | No | Yes |
| Git LFS | Smooth integration, same permissions as for Git repository | Non-scalable servers | Yes | Yes | No |

DVC on Git

DVC takes advantage of Git and runs on top of it. It uses remote storage such as Google Cloud Storage, Azure, or Amazon S3 for large files. Put simply, Git provides version control for code and DVC provides it for data: DVC = “Machine Learning Git.”

DVC uses Git as a foundation to track how a model was produced and which commands were used to produce its metrics. Cloning a Git repository also brings down its .dvc files. Small data files are meant for Git, while large data files go into DVC's remote storage. Git does not have to be paired with DVC, as DVC can work effectively even without it.
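The split between small files in Git and large files in separate storage can be sketched in Python. This is a simplified illustration, not DVC's actual file format or cache layout: the `track` and `restore` helpers, the `.pointer.json` suffix, and the choice of SHA-256 are all assumptions of the sketch.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small pointer file.

    The pointer is the part small enough to commit to Git; the cache
    entry is the part that would live in remote storage.
    """
    checksum = file_hash(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"sha256": checksum, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file named by a pointer from the cache."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["sha256"][:2] / meta["sha256"][2:]
    target = pointer.with_name(meta["path"])
    shutil.copy2(cached, target)
    return target
```

Calling `track` on a dataset leaves a pointer of roughly a hundred bytes next to it. Deleting the dataset and calling `restore` on the pointer brings back identical bytes, which is the behavior a checkout of an older pointer relies on.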


Advantages of DVC

A critical challenge in deep learning experiments is managing, storing, and reusing models and algorithms. DVC reduces the complexity of these tasks for data scientists in the following ways:

1. Share Models via Cloud Storage

By centralizing data storage, teams find it easier to perform experiments using a shared single machine, which in turn promotes better resource utilization. DVC allows teams to manage a development server for shared data usage.

Servers in this case can be any type of remote storage (Microsoft Azure, Amazon S3, Google Cloud Storage, an SSH server, etc.). Just as we run git checkout to switch versions of our code, we can run dvc checkout to switch versions of our data and models. DVC switches versions and restores the workspace quickly, so all users can share models through the cloud.

2. Track & Visualize ML Models

Data science features in DVC are versioned in data repositories. Versioning is achieved through regular Git workflows such as pull requests. To store all ML artifacts, DVC uses a built-in cache, which is further synchronized with remote cloud storage. This way, DVC allows for the tracking of data and models for further versioning. A basic step to build artifacts by tracking ML models is to write a dvc.yaml file.

3. Reproducibility

When using ML models in cross-project experiments, DVC data registries can be helpful. These act like a package management system that boosts reproducibility and reusability. DVC repositories store the history of all artifacts, including what was changed and when, and an artifact can be updated with a single commit. A simple command line interface lets users reuse and organize artifacts such as feature stores with the dvc get and dvc import commands.

4. Organized ML data

Data is the main asset for ML engineers, so proper organization of data is necessary to train models effectively. DVC uses the concept of a data pipeline to version data using Git. These pipelines are lightweight and allow you to organize and reproduce your workflows. Dataset versioning promotes automation, reproducibility, and CI/CD for machine learning.
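A lightweight pipeline can reproduce a workflow cheaply by comparing the content hashes of each stage's dependencies with those recorded on the previous run. The Python sketch below shows this under stated assumptions: `run_stage` and the JSON lock file are hypothetical stand-ins, not DVC's own API or file format.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Iterable


def digest(path: Path) -> str:
    """Hash a file's bytes so content changes can be detected."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(
    name: str,
    deps: Iterable[Path],
    action: Callable[[], None],
    lock_file: Path,
) -> bool:
    """Run action only if a declared dependency changed since the last run.

    Returns True if the stage ran and False if it was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(dep): digest(dep) for dep in deps}
    if lock.get(name) == current:
        return False  # recorded hashes match, so the outputs are still valid
    action()
    lock[name] = current
    lock_file.write_text(json.dumps(lock, indent=2))
    return True
```

Running the same stage twice in a row does the work once. Editing a dependency between runs makes the next call run the action again and refresh the recorded hashes.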

5. Increase the pace of data science

A stack of modernized features enables fast-paced machine learning innovation. The features include versioning metafiles, fast tracking of metrics in simple text form, switching, sharing data through a centralized development server, lightweight pipelines and data-driven navigation through the directory.

Imagine switching between versions of a 100GB file with a simple git checkout command, using git clone to bring large metafiles and models into view within seconds, or using sets of similar commands to train systems in less time and generate results faster.

Disadvantages

DVC is not a one-size-fits-all solution for every ML problem. It comes with its own set of pitfalls, which are described below:

1. Redundancy

Using a separate pipeline tool can cause redundancy because DVC is firmly coupled with pipeline management.

2. Incorrect Configuration Risk

DVC carries a risk of incorrect pipeline configuration (for example, if your team forgets to add an output file). It is false to assume that a DVC-produced version of a project from a year ago will work the same way in current circumstances. Checking for missing dependencies in DVC is tough, because the resulting problem does not readily surface as an error.
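This pitfall can be made concrete with a short Python sketch. It assumes a tool that only watches the files a stage declares. The helper names `snapshot` and `changed_dependencies` are hypothetical and are not part of DVC.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def snapshot(paths: Iterable[Path]) -> Dict[str, str]:
    """Record a content hash for each declared dependency."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in paths}


def changed_dependencies(declared: Iterable[Path], recorded: Dict[str, str]) -> List[str]:
    """Return the declared dependencies whose content differs from the recorded hashes."""
    current = snapshot(declared)
    return sorted(p for p, h in current.items() if recorded.get(p) != h)
```

Suppose a training stage reads both train.csv and params.json but declares only train.csv. After params.json is edited, `changed_dependencies` returns an empty list. The stage looks up to date, nothing fails, and the stale output goes unnoticed.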

3. Poor Performance in Sloppy Architecture

DVC works along with Git, so without the proper definition of metrics and datasets for a given architecture, teams will not be able to get the full benefit of this version control system. Teams may have to manually develop extra features in DVC to meet certain demands of ML.

Note: DVC does not help you with full system design and does not control non-deterministic behavior of your model.

Case Study

Christopher Samiullah, a freelance software consultant, shared his experience using the DVC platform and posted changes that occurred after incorporating DVC in his project. His model was a convolutional neural network for image classification taking data from a plant seedlings dataset.

Working Without DVC

Working With DVC

The modifications he made to his workflow are listed below:

Conclusion

DVC is useful wherever datasets must be managed with less storage space and changes made by multiple team members at the same time must be tracked. If you run a large ML team working with complex datasets, it’s recommended to implement Data Version Control. If your model outputs require debugging, consider adopting one of the available data version control tools to improve reproducibility.

If you’re an ML team, you should know that Crowdbotics provides managed app development services by vetted developers, including ML expertise and business intelligence implementations. Our developers can add ML features and analytics to an existing product or build ML and data tools (including a DVC pipeline) from the ground up. Get in touch with our experts today to learn more.