Merged

Changes from all 34 commits
74f7633
Turn FooterBanner into InfoBanner, add to all views
naglepuff Apr 17, 2025
aa7934e
Fix oauth2 setting
mvandenburgh Apr 18, 2025
7426499
Merge pull request #2335 from dandi/fix-setting-name
mvandenburgh Apr 18, 2025
ca4b478
Don't override oauth2_provider settings dict
mvandenburgh Apr 18, 2025
0817a5d
Merge pull request #2337 from dandi/fix-overwritten-settings
mvandenburgh Apr 18, 2025
b703541
Revert "Don't override oauth2_provider settings dict"
mvandenburgh Apr 18, 2025
964712d
Revert "Fix oauth2 setting"
mvandenburgh Apr 18, 2025
0259d05
Revert "Switch staging back to builtin oauth `Application`"
mvandenburgh Apr 18, 2025
57eb078
Merge pull request #2338 from dandi/revert-oauth-model
mvandenburgh Apr 18, 2025
799ec3a
Merge pull request #2329 from dandi/2302-move-banner
naglepuff Apr 18, 2025
af5d20f
auto shipit - CHANGELOG.md etc
dandibot Apr 18, 2025
9df9b6c
Auto-allow people with `@nih.gov` and `@janelia.hhmi.org` email addre…
kabilar Apr 20, 2025
a76574d
Fix format
kabilar Apr 20, 2025
c1b7f40
Add some documentation for playwright test data
mvandenburgh Apr 21, 2025
65014af
Regenerate playwright test fixture w/ excludes
mvandenburgh Apr 21, 2025
6048fe2
Merge pull request #2341 from dandi/fix-test-data
mvandenburgh Apr 21, 2025
d38100a
Convert StagingApplication to a proxy model
jjnesbitt Apr 18, 2025
875f776
Split up GC service into multiple modules
mvandenburgh Apr 21, 2025
d663ad4
Refactor GC module to allow for dry-run
mvandenburgh Apr 21, 2025
b1bf772
Hook up GC service layer to `collect-garbage` script
mvandenburgh Apr 21, 2025
07e392c
Add additional confirmation to `collect_garbage.py`
mvandenburgh Apr 21, 2025
f6327dc
Merge pull request #2343 from dandi/add-gc-management-command
mvandenburgh Apr 21, 2025
600aaea
auto shipit - CHANGELOG.md etc
dandibot Apr 21, 2025
538e330
Refactor email patterns into `any()` call
waxlamp Apr 22, 2025
390d1bd
Merge pull request #2340 from kabilar/auto-email
kabilar Apr 23, 2025
12227a2
Remove import_dandisets command from docs
asmacdo Apr 23, 2025
24f6813
Merge pull request #2351 from asmacdo/dev-doc-cleanup-import-dandisets
waxlamp Apr 24, 2025
f8bbd24
Merge pull request #2339 from dandi/proxy-staging-application
jjnesbitt Apr 24, 2025
85ff137
Revert "Convert StagingApplication to a proxy model"
jjnesbitt Apr 24, 2025
426d879
Merge pull request #2357 from dandi/revert-2339-proxy-staging-applica…
jjnesbitt Apr 24, 2025
75271ea
Check to see if cookies are enabled for message
naglepuff Apr 25, 2025
02c594f
Make eslint fail on warning
mvandenburgh Apr 28, 2025
37c7075
Merge pull request #2359 from dandi/2271-cookies-disabled-banner
naglepuff Apr 28, 2025
9dbe20b
Merge pull request #2360 from dandi/web-fail-lint-on-warning
mvandenburgh Apr 28, 2025
55 changes: 55 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,58 @@
# v0.8.1 (Mon Apr 21 2025)

#### 🏠 Internal

- Integrate `garbage_collection` service into `collect_garbage.py` [#2343](https://github.com/dandi/dandi-archive/pull/2343) ([@mvandenburgh](https://github.com/mvandenburgh))

#### 🧪 Tests

- Remove unneeded data from playwright test fixture [#2341](https://github.com/dandi/dandi-archive/pull/2341) ([@mvandenburgh](https://github.com/mvandenburgh))

#### Authors: 1

- Mike VanDenburgh ([@mvandenburgh](https://github.com/mvandenburgh))

---

# v0.8.0 (Fri Apr 18 2025)

#### 🚀 Enhancement

- Move banner with info blurb to top of all pages [#2329](https://github.com/dandi/dandi-archive/pull/2329) ([@naglepuff](https://github.com/naglepuff))

#### 🐛 Bug Fix

- Don't override oauth2_provider settings dict [#2337](https://github.com/dandi/dandi-archive/pull/2337) ([@mvandenburgh](https://github.com/mvandenburgh))
- Fix oauth2 setting [#2335](https://github.com/dandi/dandi-archive/pull/2335) ([@mvandenburgh](https://github.com/mvandenburgh))
- Require minimum version of 2.0 for django-oauth-toolkit [#2326](https://github.com/dandi/dandi-archive/pull/2326) ([@jjnesbitt](https://github.com/jjnesbitt))

#### 🏠 Internal

- Revert OAuth model change [#2338](https://github.com/dandi/dandi-archive/pull/2338) ([@mvandenburgh](https://github.com/mvandenburgh))
- Switch from `runtime.txt` to `.python-version` [#2332](https://github.com/dandi/dandi-archive/pull/2332) ([@mvandenburgh](https://github.com/mvandenburgh))
- Switch staging back to builtin oauth `Application` [#2331](https://github.com/dandi/dandi-archive/pull/2331) ([@mvandenburgh](https://github.com/mvandenburgh))
- Update swagger/redocs urls to align with Resonant [#2327](https://github.com/dandi/dandi-archive/pull/2327) ([@mvandenburgh](https://github.com/mvandenburgh))

#### 📝 Documentation

- DOC: fixup description of the interaction with auto for releases based on labels [#2285](https://github.com/dandi/dandi-archive/pull/2285) ([@yarikoptic](https://github.com/yarikoptic) [@waxlamp](https://github.com/waxlamp))

#### 🔩 Dependency Updates

- Clean up `setup.py` [#2324](https://github.com/dandi/dandi-archive/pull/2324) ([@mvandenburgh](https://github.com/mvandenburgh))
- Update Heroku Python runtime [#2323](https://github.com/dandi/dandi-archive/pull/2323) ([@mvandenburgh](https://github.com/mvandenburgh))
- Unpin `django-oauth-toolkit`, generate migrations for downstream `StagingApplication` [#2320](https://github.com/dandi/dandi-archive/pull/2320) ([@mvandenburgh](https://github.com/mvandenburgh))

#### Authors: 5

- Jacob Nesbitt ([@jjnesbitt](https://github.com/jjnesbitt))
- Michael Nagler ([@naglepuff](https://github.com/naglepuff))
- Mike VanDenburgh ([@mvandenburgh](https://github.com/mvandenburgh))
- Roni Choudhury ([@waxlamp](https://github.com/waxlamp))
- Yaroslav Halchenko ([@yarikoptic](https://github.com/yarikoptic))

---

# v0.7.0 (Wed Apr 16 2025)

#### 🚀 Enhancement
34 changes: 0 additions & 34 deletions DEVELOPMENT.md
@@ -146,40 +146,6 @@ This creates a dummy dandiset with valid metadata and a single dummy asset.
The dandiset should be valid and publishable out of the box.
This script is a simple way to get test data into your DB without having to use dandi-cli.

### import_dandisets
```
python manage.py import_dandisets [API_URL] --all
```

This imports all dandisets (versions + metadata only, no assets) from the dandi-api deployment
living at `API_URL`. For example, to import all dandisets from the production server into your
local dev environment, run `python manage.py import_dandisets https://api.dandiarchive.org` from
your local terminal. Note that if a dandiset with the same identifier as the one being imported
already exists, that dandiset will not be imported.

```
python manage.py import_dandisets [API_URL] --all --replace
```

Same as the previous example, except if a dandiset with the same identifier as the one being imported
already exists, the existing dandiset will be replaced with the one being imported.

```
python manage.py import_dandisets [API_URL] --all --offset 100000
```

This imports all dandisets (versions + metadata only, no assets) from the dandi-api deployment
living at `API_URL` and offsets their identifiers by 100000. This is helpful if you want to import
a dandiset that has the same identifier as one already in your database.

```
python manage.py import_dandisets [API_URL] --identifier 000005
```

This imports dandiset 000005 from `API_URL` into your local dev environment. Note that if there is already
a dandiset with an identifier of 000005, nothing will happen. Use the --replace flag to have the script
overwrite it instead if desired.

## Abbreviations

- DLP: Dataset Landing Page (e.g. https://dandiarchive.org/dandiset/000027)
12 changes: 12 additions & 0 deletions dandiapi/api/fixtures/README.md
@@ -0,0 +1,12 @@
# Playwright Test Data Fixture

This directory contains a [Django fixture](https://docs.djangoproject.com/en/5.2/topics/db/fixtures/) with test data for the Playwright-based e2e tests.

## How was this data generated?

To generate this data, a local DB was populated with test data and then dumped to a Django fixture using `manage.py dumpdata`. The `--exclude` flags are important here: they prevent unneeded and/or deployment-specific DB tables from being included in the dump.

```bash
./manage.py dumpdata --output dandiapi/api/fixtures/playwright.json.xz --exclude auth.permission --exclude authtoken --exclude contenttypes --exclude oauth2_provider --exclude sites
```
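For local use, the fixture can likely be loaded back with Django's `loaddata` command (Django 3.2+ accepts xz-compressed fixtures). The exact command used by the Playwright test setup is not shown in this diff, so treat the following as a sketch:

```bash
# Assumes a migrated local database; the fixture path is relative to the repo root.
./manage.py loaddata dandiapi/api/fixtures/playwright.json.xz
```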
Binary file modified dandiapi/api/fixtures/playwright.json.xz
33 changes: 24 additions & 9 deletions dandiapi/api/management/commands/collect_garbage.py
@@ -1,14 +1,29 @@
from __future__ import annotations

from django.db.models import Sum
import djclick as click

from dandiapi.api.garbage import stale_assets
from dandiapi.api.services import garbage_collection


def echo_report():
    click.echo(f'Assets: {stale_assets().count()}')
    click.echo('AssetBlobs: Coming soon')
    click.echo('Uploads: Coming soon')
    garbage_collectable_assets = stale_assets()
    assets_count = garbage_collectable_assets.count()

    garbage_collectable_asset_blobs = garbage_collection.asset_blob.get_queryset()
    asset_blobs_count = garbage_collectable_asset_blobs.count()
    asset_blobs_size_in_bytes = garbage_collectable_asset_blobs.aggregate(Sum('size'))['size__sum']

    garbage_collectable_uploads = garbage_collection.upload.get_queryset()
    uploads_count = garbage_collectable_uploads.count()

    click.echo(f'Assets: {assets_count}')
    click.echo(
        f'AssetBlobs: {asset_blobs_count} ({asset_blobs_size_in_bytes} bytes / '
        f'{asset_blobs_size_in_bytes / (1024 ** 3):.2f} GB)'
    )
    click.echo(f'Uploads: {uploads_count}')
    click.echo('S3 Blobs: Coming soon')


@@ -24,13 +39,13 @@ def collect_garbage(*, assets: bool, assetblobs: bool, uploads: bool, s3blobs: b
    if doing_deletes:
        echo_report()

    if assetblobs:
        raise click.NoSuchOption('Deleting AssetBlobs is not yet implemented')
    if uploads:
        raise click.NoSuchOption('Deleting Uploads is not yet implemented')
    if s3blobs:
    if assetblobs and click.confirm('This will delete all AssetBlobs. Are you sure?'):
        garbage_collection.asset_blob.garbage_collect()
    if uploads and click.confirm('This will delete all Uploads. Are you sure?'):
        garbage_collection.upload.garbage_collect()
    if s3blobs and click.confirm('This will delete all S3 Blobs. Are you sure?'):
        raise click.NoSuchOption('Deleting S3 Blobs is not yet implemented')
    if assets:
    if assets and click.confirm('This will delete all Assets. Are you sure?'):
        assets_to_delete = stale_assets()
        if click.confirm(f'This will delete {assets_to_delete.count()} assets. Are you sure?'):
            assets_to_delete.delete()
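For reference, a possible invocation of the updated management command is sketched below. The flag names are assumed from the keyword arguments in the signature above (the `djclick` option declarations are not shown in this diff), so treat them as illustrative rather than authoritative.

```bash
# Report what is eligible for garbage collection, then delete stale assets,
# asset blobs, and uploads after interactive confirmation (assumed flag names).
./manage.py collect_garbage --assets --assetblobs --uploads
```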
91 changes: 4 additions & 87 deletions dandiapi/api/services/garbage_collection/__init__.py
@@ -1,21 +1,13 @@
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor, wait
from datetime import timedelta
import json

from celery.utils.log import get_task_logger
from django.core import serializers
from django.db import transaction
from django.utils import timezone
from more_itertools import chunked

from dandiapi.api.models import (
    AssetBlob,
    GarbageCollectionEvent,
    GarbageCollectionEventRecord,
    Upload,
)
from dandiapi.api.models import GarbageCollectionEvent
from dandiapi.api.services.garbage_collection import asset_blob, upload
from dandiapi.api.storage import DandiMultipartMixin

logger = get_task_logger(__name__)
@@ -33,85 +25,10 @@
) # TODO: pick this up from env var set by Terraform to ensure consistency?


def _garbage_collect_uploads() -> int:
    qs = Upload.objects.filter(
        created__lt=timezone.now() - UPLOAD_EXPIRATION_TIME,
    )
    if not qs.exists():
        return 0

    deleted_records = 0
    futures: list[Future] = []

    with transaction.atomic(), ThreadPoolExecutor() as executor:
        event = GarbageCollectionEvent.objects.create(type=Upload.__name__)
        for uploads_chunk in chunked(qs.iterator(), GARBAGE_COLLECTION_EVENT_CHUNK_SIZE):
            GarbageCollectionEventRecord.objects.bulk_create(
                GarbageCollectionEventRecord(
                    event=event, record=json.loads(serializers.serialize('json', [u]))[0]
                )
                for u in uploads_chunk
            )

            # Delete the blobs from S3
            futures.append(
                executor.submit(
                    lambda chunk: [u.blob.delete(save=False) for u in chunk],
                    uploads_chunk,
                )
            )

            deleted_records += Upload.objects.filter(
                pk__in=[u.pk for u in uploads_chunk],
            ).delete()[0]

        wait(futures)

    return deleted_records


def _garbage_collect_asset_blobs() -> int:
    qs = AssetBlob.objects.filter(
        assets__isnull=True,
        created__lt=timezone.now() - ASSET_BLOB_EXPIRATION_TIME,
    )
    if not qs.exists():
        return 0

    deleted_records = 0
    futures: list[Future] = []

    with transaction.atomic(), ThreadPoolExecutor() as executor:
        event = GarbageCollectionEvent.objects.create(type=AssetBlob.__name__)
        for asset_blobs_chunk in chunked(qs.iterator(), GARBAGE_COLLECTION_EVENT_CHUNK_SIZE):
            GarbageCollectionEventRecord.objects.bulk_create(
                GarbageCollectionEventRecord(
                    event=event, record=json.loads(serializers.serialize('json', [a]))[0]
                )
                for a in asset_blobs_chunk
            )

            # Delete the blobs from S3
            futures.append(
                executor.submit(
                    lambda chunk: [a.blob.delete(save=False) for a in chunk],
                    asset_blobs_chunk,
                )
            )

            deleted_records += AssetBlob.objects.filter(
                pk__in=[a.pk for a in asset_blobs_chunk],
            ).delete()[0]

        wait(futures)

    return deleted_records


def garbage_collect():
    with transaction.atomic():
        garbage_collected_uploads = _garbage_collect_uploads()
        garbage_collected_asset_blobs = _garbage_collect_asset_blobs()
        garbage_collected_uploads = upload.garbage_collect()
        garbage_collected_asset_blobs = asset_blob.garbage_collect()

        GarbageCollectionEvent.objects.filter(
            timestamp__lt=timezone.now() - RESTORATION_WINDOW
71 changes: 71 additions & 0 deletions dandiapi/api/services/garbage_collection/asset_blob.py
@@ -0,0 +1,71 @@
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor, wait
from datetime import timedelta
import json
from typing import TYPE_CHECKING

from celery.utils.log import get_task_logger
from django.core import serializers
from django.db import transaction
from django.utils import timezone
from more_itertools import chunked

from dandiapi.api.models import (
    AssetBlob,
    GarbageCollectionEvent,
    GarbageCollectionEventRecord,
)

if TYPE_CHECKING:
    from django.db.models import QuerySet

logger = get_task_logger(__name__)

ASSET_BLOB_EXPIRATION_TIME = timedelta(days=7)


def get_queryset() -> QuerySet[AssetBlob]:
"""Get the queryset of AssetBlobs that are eligible for garbage collection."""
return AssetBlob.objects.filter(
assets__isnull=True,
created__lt=timezone.now() - ASSET_BLOB_EXPIRATION_TIME,
)


def garbage_collect() -> int:
    from . import GARBAGE_COLLECTION_EVENT_CHUNK_SIZE

    qs = get_queryset()

    if not qs.exists():
        return 0

    deleted_records = 0
    futures: list[Future] = []

    with transaction.atomic(), ThreadPoolExecutor() as executor:
        event = GarbageCollectionEvent.objects.create(type=AssetBlob.__name__)
        for asset_blobs_chunk in chunked(qs.iterator(), GARBAGE_COLLECTION_EVENT_CHUNK_SIZE):
            GarbageCollectionEventRecord.objects.bulk_create(
                GarbageCollectionEventRecord(
                    event=event, record=json.loads(serializers.serialize('json', [a]))[0]
                )
                for a in asset_blobs_chunk
            )

            # Delete the blobs from S3
            futures.append(
                executor.submit(
                    lambda chunk: [a.blob.delete(save=False) for a in chunk],
                    asset_blobs_chunk,
                )
            )

            deleted_records += AssetBlob.objects.filter(
                pk__in=[a.pk for a in asset_blobs_chunk],
            ).delete()[0]

        wait(futures)

    return deleted_records
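The companion `upload.py` module referenced by `garbage_collection.upload.get_queryset()` and `upload.garbage_collect()` is not included in this view; presumably it mirrors `asset_blob.py` with `Upload` in place of `AssetBlob`. A hypothetical sketch of its queryset helper, inferred from the `UPLOAD_EXPIRATION_TIME` usage removed from `__init__.py` above (the actual expiration value is not shown in this diff):

```python
# Hypothetical sketch of dandiapi/api/services/garbage_collection/upload.py;
# the real module is not part of this diff view.
from __future__ import annotations

from datetime import timedelta
from typing import TYPE_CHECKING

from django.utils import timezone

from dandiapi.api.models import Upload

if TYPE_CHECKING:
    from django.db.models import QuerySet

# Expiration window; assumed value, the actual constant is defined in the real module.
UPLOAD_EXPIRATION_TIME = timedelta(days=1)


def get_queryset() -> QuerySet[Upload]:
    """Return Uploads old enough to be eligible for garbage collection."""
    return Upload.objects.filter(
        created__lt=timezone.now() - UPLOAD_EXPIRATION_TIME,
    )
```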