</code></pre></div></div> <p>This solution has some points to consider:
</p> <ul> <li>I strongly recommend grouping the files in a way that makes sense for your project. In our case, we used metadata (e.g., the labeling session) that naturally partitions the dataset. This way, we could navigate the dataset and find the images we needed without downloading or updating the whole dataset.
</li> <li>Adding new data to the dataset should not modify the existing archives. A modified archive has a different hash, so DVC would store and upload it again in full. Put the new data in a new archive instead. This is easy when the dataset is partitioned by metadata that grows over time: with labeling sessions, each new session becomes a new zip and the existing ones stay untouched.
</li> </ul> <p>Another solution was proposed in
<a href="
https://fizzylogic.nl/2023/01/13/did-you-know-dvc-doesn-t-handle-large-datasets-neither-did-we-and-here-s-how-we-fixed-it"
rel="
external nofollow noopener"
target="
_blank"
>this article
</a>, which uses
<code class="
language-plaintext highlighter-rouge"
>Parquet
</code> for partitioning the data instead of zipping. This clever solution is more efficient in some cases, but it requires more effort to implement and may not apply to computer vision (CV) datasets.
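</p> <p>Returning to the zip-based approach, the two points above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not necessarily the script used for this post: it assumes a hypothetical layout with one folder per labeling session (<code class="language-plaintext highlighter-rouge">images/session_name/</code>), and the function name <code class="language-plaintext highlighter-rouge">archive_sessions</code> is made up for illustration. It writes one zip per session and skips any archive that already exists, so adding a session never touches the archives that are already tracked.
</p>

```python
import zipfile
from pathlib import Path


def archive_sessions(images_dir: Path, archives_dir: Path) -> list[Path]:
    """Zip each session folder once; existing archives are left untouched.

    Assumes (hypothetically) that images_dir contains one sub-folder per
    labeling session. Returns the archives created by this call.
    """
    archives_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for session_dir in sorted(p for p in images_dir.iterdir() if p.is_dir()):
        archive = archives_dir / f"{session_dir.name}.zip"
        if archive.exists():
            # Never rewrite an existing archive: an unchanged file keeps
            # its hash, so it does not need to be stored or uploaded again.
            continue
        # ZIP_STORED (no compression) is an assumption: JPEG/PNG images are
        # already compressed, so recompressing them rarely pays off.
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
            files = sorted(f for f in session_dir.rglob("*") if f.is_file())
            for image in files:
                # Store paths relative to the session folder.
                zf.write(image, arcname=image.relative_to(session_dir).as_posix())
        created.append(archive)
    return created
```

<p>Each newly created archive would then be tracked and pushed as usual (for example with <code class="language-plaintext highlighter-rouge">dvc add</code> and <code class="language-plaintext highlighter-rouge">dvc push</code>), while the untouched archives are not uploaded again.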
</p> <h2 id="
summary"
>Summary
</h2> <p>DVC is a great tool for managing data in machine learning projects. However, it struggles with datasets made up of a very large number of files. To overcome this limitation, we zipped the images in groups and uploaded the archives to the remote storage. This significantly improved upload and download times because it reduced the number of files being tracked. I hope this post helps you avoid similar pitfalls when working with large datasets in DVC. If you have any questions or suggestions, feel free to leave a comment below. I’d love to hear from you!
</p> </div> </article> </div> </div> <footer class="
fixed-bottom"
role="
contentinfo"
> <div class="
container mt-0"
> © Copyright 2025 Bruno A. Bruno Baruffaldi. Powered by
<a href="
https://jekyllrb.com/"
target="
_blank"
rel="
external nofollow noopener"
>Jekyll
</a> with
<a href="
https://github.com/alshedivat/al-folio"
rel="
external nofollow noopener"
target="
_blank"
>al-folio
</a> theme. Hosted by
<a href="
https://pages.github.com/"
target="
_blank"
rel="
external nofollow noopener"
>GitHub Pages
</a>.
</div> </footer> <script src="
https://cdn.jsdelivr.net/npm/[email protected]/dist/jquery.min.js"
integrity="
sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4="
crossorigin="
anonymous"
></script> <script src="
/assets/js/bootstrap.bundle.min.js"
></script> <script src="
https://cdn.jsdelivr.net/npm/[email protected]/js/mdb.min.js"
integrity="
sha256-NdbiivsvWt7VYCt6hYNT3h/th9vSTL4EDWeGs5SN3DA="
crossorigin="
anonymous"
></script> <script defer src="
https://cdn.jsdelivr.net/npm/[email protected]/dist/masonry.pkgd.min.js"
integrity="
sha256-Nn1q/fx0H7SNLZMQ5Hw5JLaTRZp0yILA/FRexe19VdI="
crossorigin="
anonymous"
></script> <script defer src="
https://cdn.jsdelivr.net/npm/[email protected]/imagesloaded.pkgd.min.js"
integrity="
sha256-htrLFfZJ6v5udOG+3kNLINIKh2gvoKqwEhHYfTTMICc="
crossorigin="
anonymous"
></script> <script defer src="
/assets/js/masonry.js?a0db7e5d5c70cc3252b3138b0c91dcaf"
type="
text/javascript"
></script> <script defer src="
https://cdn.jsdelivr.net/npm/[email protected]/dist/medium-zoom.min.js"
integrity="
sha256-ZgMyDAIYDYGxbcpJcfUnYwNevG/xi9OHKaR/8GK+jWc="
crossorigin="
anonymous"
></script> <script defer src="
/assets/js/zoom.js?85ddb88934d28b74e78031fd54cf8308"
></script> <script src="
/assets/js/no_defer.js?2781658a0a2b13ed609542042a859126"
></script> <script defer src="
/assets/js/common.js?e0514a05c5c95ac1a93a8dfd5249b92e"
></script> <script defer src="
/assets/js/copy_code.js?c8a01c11a92744d44b093fc3bda915df"
type="
text/javascript"
></script> <script defer src="
/assets/js/jupyter_new_tab.js?d9f17b6adc2311cbabd747f4538bb15f"
></script> <script async src="
https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"
></script> <script async src="
https://badge.dimensions.ai/badge.js"
></script> <script defer type="
text/javascript"
id="
MathJax-script"
src="
https://cdn.jsdelivr.net/npm/[email protected]/es5/tex-mml-chtml.js"
integrity="
sha256-MASABpB4tYktI2Oitl4t+78w/lyA+D7b/s9GEP0JOGI="
crossorigin="
anonymous"
></script> <script src="
/assets/js/mathjax-setup.js?a5bb4e6a542c546dd929b24b8b236dfd"
></script> <script defer src="
https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6"
crossorigin="
anonymous"
></script> <script defer src="
/assets/js/progress-bar.js?2f30e0e6801ea8f5036fa66e1ab0a71a"
type="
text/javascript"
></script> <script src="
/assets/js/vanilla-back-to-top.min.js?eaf77346e117baa09987a278a117b9a7"
></script> <script>