Commit a240d88

Deploying to gh-pages from @ 6c70558 🚀
1 parent b9d7bd7 commit a240d88

12 files changed (+29, -13 lines)

assets/css/main.css

Lines changed: 3 additions & 3 deletions
Some generated files are not rendered by default.

assets/js/search-data.js

Lines changed: 12 additions & 1 deletion
@@ -37,7 +37,18 @@ ninja.data = [{
 handler: () => {
 window.location.href = "/cv/";
 },
-},{id: "post-from-scikit-learn-to-faiss-migrating-pca-for-scalable-vector-search",
+},{id: "post-amazon-s3-vectors-what-it-is-where-it-fits-and-the-gotchas-nobody-tells-you",
+
+title: "Amazon S3 Vectors: What It Is, Where It Fits, and the Gotchas Nobody...",
+
+description: "",
+section: "Posts",
+handler: () => {
+
+window.location.href = "/blog/2025/s3-vectors/";
+
+},
+},{id: "post-from-scikit-learn-to-faiss-migrating-pca-for-scalable-vector-search",

 title: "From scikit-learn to Faiss: Migrating PCA for Scalable Vector Search",

blog/2024/dvc-fix/index.html

Lines changed: 1 addition & 1 deletion
@@ -50,6 +50,6 @@
 esac

 exit 0
-</code></pre></div></div> <p>This solution has some points to consider:</p> <ul> <li>I strongly recommend grouping the files in a way that makes sense for your project. In our case, we use some metadata (eg. labeling session) that naturally partition the dataset. This way, we could easily navigate through the dataset and find the images we needed without having to download/update the whole dataset.</li> <li>Adding new data to the dataset should not modify the existing archives. This way, DVC will not store the same files twice. The new data should be added to a new archive. This is easy to achieve based on the metadata used to partition the dataset. By using the labeling sessions, you can add new zips without changing the existing ones.</li> </ul> <p>Another solution was proposed at <a href="https://fizzylogic.nl/2023/01/13/did-you-know-dvc-doesn-t-handle-large-datasets-neither-did-we-and-here-s-how-we-fixed-it" rel="external nofollow noopener" target="_blank">this article</a>, which uses <code class="language-plaintext highlighter-rouge">Parquet</code> for partitioning the data instead of zipping. This clever solution is more efficient for some cases, but it requires more effort to implement and may not apply to CV datasets.</p> <h2 id="summary">Summary</h2> <p>DVC is a great tool for managing data in machine learning projects. However, it struggles with large datasets containing a large number of files. To overcome this limitation, we zipped the images in groups and uploaded them to the remote storage. This change significantly improved the upload and download times as it reduced the number of files being tracked. I hope this post helps you avoid similar pitfalls when working with large datasets in DVC. If you have any questions or suggestions, feel free to leave a comment below. 
I’d love to hear from you!</p> </div> </article> <br> <hr> <br> <ul class="list-disc pl-8"></ul> <h2 class="text-3xl font-semibold mb-4 mt-12">Enjoy Reading This Article?</h2> <p class="mb-2">Here are some more articles you might like to read next:</p> <li class="my-2"> <a class="text-pink-700 underline font-semibold hover:text-pink-800" href="/blog/2025/sklearn-faiss/">From scikit-learn to Faiss: Migrating PCA for Scalable Vector Search</a> </li> <li class="my-2"> <a class="text-pink-700 underline font-semibold hover:text-pink-800" href="/blog/2024/start-ml-project/">How to Start a Machine Learning Project Before Starting a Machine Learning Project</a> </li> </div> </div> <footer class="fixed-bottom" role="contentinfo"> <div class="container mt-0"> © Copyright 2025 Bruno A. Bruno Baruffaldi. Powered by <a href="https://jekyllrb.com/" target="_blank" rel="external nofollow noopener">Jekyll</a> with <a href="https://github.com/alshedivat/al-folio" rel="external nofollow noopener" target="_blank">al-folio</a> theme. Hosted by <a href="https://pages.github.com/" target="_blank" rel="external nofollow noopener">GitHub Pages</a>. 
</div> </footer> <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/jquery.min.js" integrity="sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" crossorigin="anonymous"></script> <script src="/assets/js/bootstrap.bundle.min.js"></script> <script src="https://cdn.jsdelivr.net/npm/[email protected]/js/mdb.min.js" integrity="sha256-NdbiivsvWt7VYCt6hYNT3h/th9vSTL4EDWeGs5SN3DA=" crossorigin="anonymous"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/masonry.pkgd.min.js" integrity="sha256-Nn1q/fx0H7SNLZMQ5Hw5JLaTRZp0yILA/FRexe19VdI=" crossorigin="anonymous"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/imagesloaded.pkgd.min.js" integrity="sha256-htrLFfZJ6v5udOG+3kNLINIKh2gvoKqwEhHYfTTMICc=" crossorigin="anonymous"></script> <script defer src="/assets/js/masonry.js?a0db7e5d5c70cc3252b3138b0c91dcaf" type="text/javascript"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/medium-zoom.min.js" integrity="sha256-ZgMyDAIYDYGxbcpJcfUnYwNevG/xi9OHKaR/8GK+jWc=" crossorigin="anonymous"></script> <script defer src="/assets/js/zoom.js?85ddb88934d28b74e78031fd54cf8308"></script> <script src="/assets/js/no_defer.js?2781658a0a2b13ed609542042a859126"></script> <script defer src="/assets/js/common.js?e0514a05c5c95ac1a93a8dfd5249b92e"></script> <script defer src="/assets/js/copy_code.js?c8a01c11a92744d44b093fc3bda915df" type="text/javascript"></script> <script defer src="/assets/js/jupyter_new_tab.js?d9f17b6adc2311cbabd747f4538bb15f"></script> <script async src="https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"></script> <script async src="https://badge.dimensions.ai/badge.js"></script> <script defer type="text/javascript" id="MathJax-script" src="https://cdn.jsdelivr.net/npm/[email protected]/es5/tex-mml-chtml.js" integrity="sha256-MASABpB4tYktI2Oitl4t+78w/lyA+D7b/s9GEP0JOGI=" crossorigin="anonymous"></script> <script 
src="/assets/js/mathjax-setup.js?a5bb4e6a542c546dd929b24b8b236dfd"></script> <script defer src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6" crossorigin="anonymous"></script> <script defer src="/assets/js/progress-bar.js?2f30e0e6801ea8f5036fa66e1ab0a71a" type="text/javascript"></script> <script src="/assets/js/vanilla-back-to-top.min.js?eaf77346e117baa09987a278a117b9a7"></script> <script>
+</code></pre></div></div> <p>This solution has some points to consider:</p> <ul> <li>I strongly recommend grouping the files in a way that makes sense for your project. In our case, we use some metadata (eg. labeling session) that naturally partition the dataset. This way, we could easily navigate through the dataset and find the images we needed without having to download/update the whole dataset.</li> <li>Adding new data to the dataset should not modify the existing archives. This way, DVC will not store the same files twice. The new data should be added to a new archive. This is easy to achieve based on the metadata used to partition the dataset. By using the labeling sessions, you can add new zips without changing the existing ones.</li> </ul> <p>Another solution was proposed at <a href="https://fizzylogic.nl/2023/01/13/did-you-know-dvc-doesn-t-handle-large-datasets-neither-did-we-and-here-s-how-we-fixed-it" rel="external nofollow noopener" target="_blank">this article</a>, which uses <code class="language-plaintext highlighter-rouge">Parquet</code> for partitioning the data instead of zipping. This clever solution is more efficient for some cases, but it requires more effort to implement and may not apply to CV datasets.</p> <h2 id="summary">Summary</h2> <p>DVC is a great tool for managing data in machine learning projects. However, it struggles with large datasets containing a large number of files. To overcome this limitation, we zipped the images in groups and uploaded them to the remote storage. This change significantly improved the upload and download times as it reduced the number of files being tracked. I hope this post helps you avoid similar pitfalls when working with large datasets in DVC. If you have any questions or suggestions, feel free to leave a comment below. 
I’d love to hear from you!</p> </div> </article> <br> <hr> <br> <ul class="list-disc pl-8"></ul> <h2 class="text-3xl font-semibold mb-4 mt-12">Enjoy Reading This Article?</h2> <p class="mb-2">Here are some more articles you might like to read next:</p> <li class="my-2"> <a class="text-pink-700 underline font-semibold hover:text-pink-800" href="/blog/2025/s3-vectors/">Amazon S3 Vectors: What It Is, Where It Fits, and the Gotchas Nobody Tells You</a> </li> <li class="my-2"> <a class="text-pink-700 underline font-semibold hover:text-pink-800" href="/blog/2025/sklearn-faiss/">From scikit-learn to Faiss: Migrating PCA for Scalable Vector Search</a> </li> <li class="my-2"> <a class="text-pink-700 underline font-semibold hover:text-pink-800" href="/blog/2024/start-ml-project/">How to Start a Machine Learning Project Before Starting a Machine Learning Project</a> </li> </div> </div> <footer class="fixed-bottom" role="contentinfo"> <div class="container mt-0"> © Copyright 2025 Bruno A. Bruno Baruffaldi. Powered by <a href="https://jekyllrb.com/" target="_blank" rel="external nofollow noopener">Jekyll</a> with <a href="https://github.com/alshedivat/al-folio" rel="external nofollow noopener" target="_blank">al-folio</a> theme. Hosted by <a href="https://pages.github.com/" target="_blank" rel="external nofollow noopener">GitHub Pages</a>. 
</div> </footer> <script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/jquery.min.js" integrity="sha256-/xUj+3OJU5yExlq6GSYGSHk7tPXikynS7ogEvDej/m4=" crossorigin="anonymous"></script> <script src="/assets/js/bootstrap.bundle.min.js"></script> <script src="https://cdn.jsdelivr.net/npm/[email protected]/js/mdb.min.js" integrity="sha256-NdbiivsvWt7VYCt6hYNT3h/th9vSTL4EDWeGs5SN3DA=" crossorigin="anonymous"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/masonry.pkgd.min.js" integrity="sha256-Nn1q/fx0H7SNLZMQ5Hw5JLaTRZp0yILA/FRexe19VdI=" crossorigin="anonymous"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/imagesloaded.pkgd.min.js" integrity="sha256-htrLFfZJ6v5udOG+3kNLINIKh2gvoKqwEhHYfTTMICc=" crossorigin="anonymous"></script> <script defer src="/assets/js/masonry.js?a0db7e5d5c70cc3252b3138b0c91dcaf" type="text/javascript"></script> <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/medium-zoom.min.js" integrity="sha256-ZgMyDAIYDYGxbcpJcfUnYwNevG/xi9OHKaR/8GK+jWc=" crossorigin="anonymous"></script> <script defer src="/assets/js/zoom.js?85ddb88934d28b74e78031fd54cf8308"></script> <script src="/assets/js/no_defer.js?2781658a0a2b13ed609542042a859126"></script> <script defer src="/assets/js/common.js?e0514a05c5c95ac1a93a8dfd5249b92e"></script> <script defer src="/assets/js/copy_code.js?c8a01c11a92744d44b093fc3bda915df" type="text/javascript"></script> <script defer src="/assets/js/jupyter_new_tab.js?d9f17b6adc2311cbabd747f4538bb15f"></script> <script async src="https://d1bxh8uas1mnw7.cloudfront.net/assets/embed.js"></script> <script async src="https://badge.dimensions.ai/badge.js"></script> <script defer type="text/javascript" id="MathJax-script" src="https://cdn.jsdelivr.net/npm/[email protected]/es5/tex-mml-chtml.js" integrity="sha256-MASABpB4tYktI2Oitl4t+78w/lyA+D7b/s9GEP0JOGI=" crossorigin="anonymous"></script> <script 
src="/assets/js/mathjax-setup.js?a5bb4e6a542c546dd929b24b8b236dfd"></script> <script defer src="https://cdnjs.cloudflare.com/polyfill/v3/polyfill.min.js?features=es6" crossorigin="anonymous"></script> <script defer src="/assets/js/progress-bar.js?2f30e0e6801ea8f5036fa66e1ab0a71a" type="text/javascript"></script> <script src="/assets/js/vanilla-back-to-top.min.js?eaf77346e117baa09987a278a117b9a7"></script> <script>
 addBackToTop();
</script> <script type="module" src="/assets/js/search/ninja-keys.min.js?f8abf2f636f242d077f24149a0a56c96"></script> <ninja-keys hidebreadcrumbs noautoloadmdicons placeholder="Type to start searching"></ninja-keys> <script src="/assets/js/search-setup.js?6c304f7b1992d4b60f7a07956e52f04a"></script> <script src="/assets/js/search-data.js"></script> <script src="/assets/js/shortcut-key.js?6f508d74becd347268a7f822bca7309d"></script> </body> </html>
