Conversation

@sravfeyn (Member)

Technical Summary

  • Creates a CSV of ConnectUser records and uploads it to Superset via its REST API.
  • Uses the instance itself to store and delete the CSV file, so a 100k-user limit was added (current user count is 6k). A separate task will be filed for a proper file backend, either mounted on the host or an S3 backend like Connect, though that is probably not necessary in the short term.

I have created a role-restricted user on Superset for this purpose and will update the env once this gets deployed.

Logging and monitoring

Added logs; exceptions are sent to Sentry if the upload fails.

Safety Assurance

Safety story

Tested the full flow using a test CSV and a test table.

  • I am confident that this change will not break current and/or previous versions of CommCare apps

Automated test coverage

QA Plan

NA

Labels & Review

  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@coderabbitai bot commented Dec 25, 2025

Walkthrough

Adds configuration and a scheduled job to export ConnectUser records to Superset. New environment variables and a SUPERSET_UPLOAD_CONFIG setting were introduced. A Celery beat schedule entry runs a new task users.tasks.upload_connect_users_to_superset daily at midnight. The new users/tasks.py implements ConnectUserSupersetExporter which generates a CSV of user records, authenticates with Superset (login and CSRF handling), uploads the CSV to a Superset database endpoint, and handles errors and cleanup. Unit tests for CSV generation, upload request composition, and authentication header handling were added.
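
For reference, the login-plus-CSRF handshake described above typically looks like the sketch below. This is not the PR's exporter code; it is a minimal illustration with the requests library, where the base URL, credentials, and endpoint paths follow Superset's public security API and are assumptions rather than the repository's values.

# Sketch only: a common Superset login + CSRF handshake with requests.
# BASE_URL and the credentials are hypothetical; the real task reads them from settings.
import requests

BASE_URL = "https://superset.example.com"
session = requests.Session()

# 1. Log in to obtain a JWT access token.
login = session.post(
    f"{BASE_URL}/api/v1/security/login",
    json={"username": "svc-user", "password": "secret", "provider": "db", "refresh": True},
    timeout=60,
)
login.raise_for_status()
session.headers["Authorization"] = f"Bearer {login.json()['access_token']}"

# 2. Fetch a CSRF token, which Superset requires for mutating requests such as the CSV upload.
csrf = session.get(f"{BASE_URL}/api/v1/security/csrf_token/", timeout=60)
csrf.raise_for_status()
session.headers["X-CSRFToken"] = csrf.json()["result"]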

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately summarizes the main change: a daily midnight task to upload a ConnectUser table to Superset, which is the central objective of all added code.
  • Description check: ✅ Passed. The description is directly related to the changeset, providing technical context about CSV generation, Superset upload, file handling, logging, and safety testing.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
      • Create PR with unit tests
      • Post copyable unit tests in a comment
      • Commit unit tests in branch sr/userupload



@coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (1)
users/tests/test_tasks.py (1)

84-97: Consider adding tests for error scenarios.

The authentication test is well-structured. For more comprehensive coverage, consider adding tests for:

  • SuperusetUserUploadException when login fails (missing access_token)
  • SuperusetUserUploadException when CSRF token is missing
  • The MAX_USER_UPLOAD_LIMIT check in generate_csv()

Would you like me to generate additional test cases for these error scenarios?
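
For illustration, the limit check could be exercised along these lines. This is a sketch only: it assumes pytest with unittest.mock and that ConnectUserSupersetExporter can be instantiated without arguments, which may not match the real constructor.

from unittest import mock

import pytest

from users.tasks import ConnectUserSupersetExporter, SuperusetUserUploadException


def test_generate_csv_rejects_oversized_user_table():
    # Hypothetical test: pretend the table exceeds the 100k MAX_USER_UPLOAD_LIMIT.
    exporter = ConnectUserSupersetExporter()  # adjust to the real constructor signature
    with mock.patch("users.tasks.ConnectUser.objects") as objects:
        objects.count.return_value = 100_001
        with pytest.raises(SuperusetUserUploadException):
            exporter.generate_csv()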

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ace11e6 and ddc77fd.

📒 Files selected for processing (5)
  • .env_template
  • connectid/celery_app.py
  • connectid/settings.py
  • users/tasks.py
  • users/tests/test_tasks.py
🧰 Additional context used
🧬 Code graph analysis (2)
users/tasks.py (1)
users/models.py (1)
  • ConnectUser (30-118)
users/tests/test_tasks.py (1)
users/tasks.py (5)
  • ConnectUserSupersetExporter (38-135)
  • generate_csv (67-77)
  • _serialize_value (128-135)
  • upload (106-119)
  • authenticate (79-83)
🪛 GitHub Actions: CI
connectid/celery_app.py

[error] 31-31: TypeError: 'hour' is an invalid keyword argument for timedelta. In connectid/celery_app.py line 31, replace timedelta(hour=0, minute=0) with timedelta(hours=0, minutes=0) or use the appropriate keyword. This caused the pytest/Django setup to fail during import.

🪛 Ruff (0.14.10)
users/tasks.py

64-64: Do not catch blind exception: Exception

(BLE001)


73-73: Avoid specifying long messages outside the exception class

(TRY003)


93-93: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (5)
.env_template (1)

40-46: LGTM!

The new environment variable placeholders for Superset configuration follow the existing patterns in the template and are appropriately left empty.

connectid/settings.py (1)

350-359: LGTM!

The SUPERSET_UPLOAD_CONFIG dictionary is well-structured. The empty defaults for username and password ensure the upload won't accidentally run without proper configuration, and the task implementation guards against this.
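
As a rough illustration of the pattern being described (not the repository's exact settings.py; the key names and env-var names are assumptions based on this review), the dictionary would look roughly like:

import environ

env = environ.Env()

# Sketch of the described pattern; empty defaults keep the upload disabled until configured.
SUPERSET_UPLOAD_CONFIG = {
    "base_url": env("SUPERSET_BASE_URL", default=""),
    "username": env("SUPERSET_USERNAME", default=""),
    "password": env("SUPERSET_PASSWORD", default=""),
    "database_id": env.int("SUPERSET_DATABASE_ID", default=0),
    "table_name": env("SUPERSET_TABLE_NAME", default=""),
}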

users/tests/test_tasks.py (2)

24-60: LGTM!

Good test coverage for CSV generation. The test validates field serialization, row content, and column ordering correctly.


63-81: LGTM!

The upload test effectively validates the HTTP request structure, payload, and file handling by mocking _request.

users/tasks.py (1)

138-146: LGTM!

The Celery task properly captures exceptions to Sentry, logs them, and re-raises to ensure task failure is recorded. This is good practice for observability and debugging.
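
The pattern being praised here, capture to Sentry, log, then re-raise so Celery records the failure, looks roughly like the following sketch (not the file's exact code; the task body is elided and the decorator may differ):

import logging

import sentry_sdk
from celery import shared_task

logger = logging.getLogger(__name__)


@shared_task
def upload_connect_users_to_superset():
    try:
        ...  # run ConnectUserSupersetExporter here
    except Exception as e:
        # Capture to Sentry and log, then re-raise so Celery marks the task as failed.
        sentry_sdk.capture_exception(e)
        logger.exception("ConnectUser upload to Superset failed")
        raise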

Comment on lines 29 to 33
"upload_connect_users_to_superset": {
"task": "users.tasks.upload_connect_users_to_superset",
"schedule": timedelta(hour=0, minute=0),
},

⚠️ Potential issue | 🔴 Critical

Fix invalid timedelta arguments and incorrect schedule interval.

Two issues here:

  1. Syntax error (causes pipeline failure): hour and minute are invalid keyword arguments for timedelta. The correct keywords are hours and minutes (plural).

  2. Logic error: Even with corrected keywords, timedelta(hours=0, minutes=0) equals zero duration, which would cause the task to run continuously. Based on the PR title "midnight daily", this should run once every 24 hours.

🔎 Proposed fix

For a daily interval:

     "upload_connect_users_to_superset": {
         "task": "users.tasks.upload_connect_users_to_superset",
-        "schedule": timedelta(hour=0, minute=0),
+        "schedule": timedelta(hours=24),
     },

Alternatively, for running at exactly midnight UTC, use crontab:

from celery.schedules import crontab

# In beat_schedule:
"upload_connect_users_to_superset": {
    "task": "users.tasks.upload_connect_users_to_superset",
    "schedule": crontab(hour=0, minute=0),  # Runs at midnight UTC
},
🧰 Tools
🪛 GitHub Actions: CI

[error] 31-31: TypeError: 'hour' is an invalid keyword argument for timedelta. In connectid/celery_app.py line 31, replace timedelta(hour=0, minute=0) with timedelta(hours=0, minutes=0) or use the appropriate keyword. This caused the pytest/Django setup to fail during import.

🤖 Prompt for AI Agents
In connectid/celery_app.py around lines 29 to 32, the schedule uses invalid
timedelta keywords (hour, minute) and even if corrected would be a zero-duration
causing continuous execution; change the schedule to a 24-hour interval (e.g.,
timedelta(days=1)) or, if you need the task to run exactly at midnight UTC,
replace the schedule with celery.schedules.crontab(hour=0, minute=0) and add the
necessary import for crontab at the top of the file.

Comment on lines +34 to +35
class SuperusetUserUploadException(Exception):
    pass

⚠️ Potential issue | 🟡 Minor

Typo in exception class name.

SuperusetUserUploadException should be SupersetUserUploadException (missing 's' in "Superset").

🔎 Proposed fix
-class SuperusetUserUploadException(Exception):
+class SupersetUserUploadException(Exception):
     pass

Also update the references on lines 73, 93, and 103.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-class SuperusetUserUploadException(Exception):
-    pass
+class SupersetUserUploadException(Exception):
+    pass
🤖 Prompt for AI Agents
In users/tasks.py around lines 34–35, the exception class name has a typo:
rename SuperusetUserUploadException to SupersetUserUploadException (add the
missing "s"), and update all references to that class on lines 73, 93, and 103
to use the new correct name; ensure any imports or string mentions are updated
consistently and run tests/lint to confirm no remaining references.

Comment on lines +67 to +77
    def generate_csv(self) -> Path:
        tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
        writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
        writer.writeheader()
        MAX_USER_UPLOAD_LIMIT = 100000
        if ConnectUser.objects.count() > MAX_USER_UPLOAD_LIMIT:
            raise SuperusetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
        for user in ConnectUser.objects.all().values(*CONNECT_USER_DUMP_FIELDS).iterator(chunk_size=500):
            writer.writerow({k: self._serialize_value(user[k]) for k in CONNECT_USER_DUMP_FIELDS})
        tmp.close()
        return Path(tmp.name)

⚠️ Potential issue | 🟠 Major

Temp file not cleaned up if an exception occurs during CSV generation.

If generate_csv() raises an exception (e.g., when MAX_USER_UPLOAD_LIMIT is exceeded at line 73, or during iteration), the temporary file won't be cleaned up because run() only cleans up after generate_csv() returns successfully.

🔎 Proposed fix using context manager pattern
     def generate_csv(self) -> Path:
-        tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
-        writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
-        writer.writeheader()
         MAX_USER_UPLOAD_LIMIT = 100000
         if ConnectUser.objects.count() > MAX_USER_UPLOAD_LIMIT:
-            raise SuperusetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
-        for user in ConnectUser.objects.all().values(*CONNECT_USER_DUMP_FIELDS).iterator(chunk_size=500):
-            writer.writerow({k: self._serialize_value(user[k]) for k in CONNECT_USER_DUMP_FIELDS})
-        tmp.close()
-        return Path(tmp.name)
+            raise SupersetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
+        
+        tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
+        try:
+            writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
+            writer.writeheader()
+            for user in ConnectUser.objects.all().values(*CONNECT_USER_DUMP_FIELDS).iterator(chunk_size=500):
+                writer.writerow({k: self._serialize_value(user[k]) for k in CONNECT_USER_DUMP_FIELDS})
+        except Exception:
+            tmp.close()
+            Path(tmp.name).unlink(missing_ok=True)
+            raise
+        tmp.close()
+        return Path(tmp.name)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    def generate_csv(self) -> Path:
-        tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
-        writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
-        writer.writeheader()
-        MAX_USER_UPLOAD_LIMIT = 100000
-        if ConnectUser.objects.count() > MAX_USER_UPLOAD_LIMIT:
-            raise SuperusetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
-        for user in ConnectUser.objects.all().values(*CONNECT_USER_DUMP_FIELDS).iterator(chunk_size=500):
-            writer.writerow({k: self._serialize_value(user[k]) for k in CONNECT_USER_DUMP_FIELDS})
-        tmp.close()
-        return Path(tmp.name)
+    def generate_csv(self) -> Path:
+        MAX_USER_UPLOAD_LIMIT = 100000
+        if ConnectUser.objects.count() > MAX_USER_UPLOAD_LIMIT:
+            raise SupersetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
+        tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
+        try:
+            writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
+            writer.writeheader()
+            for user in ConnectUser.objects.all().values(*CONNECT_USER_DUMP_FIELDS).iterator(chunk_size=500):
+                writer.writerow({k: self._serialize_value(user[k]) for k in CONNECT_USER_DUMP_FIELDS})
+        except Exception:
+            tmp.close()
+            Path(tmp.name).unlink(missing_ok=True)
+            raise
+        tmp.close()
+        return Path(tmp.name)
🧰 Tools
🪛 Ruff (0.14.10)

73-73: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In users/tasks.py around lines 67 to 77, the temporary file created by
NamedTemporaryFile is not removed if generate_csv() raises an exception (e.g.,
limit check or during iteration); change the implementation to ensure the temp
file is always cleaned up on error by either creating the temp filepath and
using a try/finally to remove the file on exception or by using a context
manager that guarantees deletion (or TemporaryDirectory) while writing;
specifically, open/write the CSV inside a try block and in the finally block
close and delete the tempfile if an exception occurred, leaving successful
return behavior unchanged.

Comment on lines +121 to +125
    def _request(self, method: str, path: str, **kwargs):
        url = f"{self.base_url}{path}"
        response = self.session.request(method, url, timeout=60, **kwargs)
        response.raise_for_status()
        return response.json()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "tasks.py" -path "*/users/*" | head -20

Repository: dimagi/connect-id

Length of output: 77


🏁 Script executed:

cat -n ./users/tasks.py | sed -n '115,135p'

Repository: dimagi/connect-id

Length of output: 936


🏁 Script executed:

cat -n ./users/tasks.py | sed -n '100,125p'

Repository: dimagi/connect-id

Length of output: 1146


🌐 Web query:

Superset API csv_upload endpoint response format success

💡 Result:

Short answer: use Superset's OpenAPI (Swagger) — the CSV upload endpoints follow the standard REST patterns and return JSON on success (HTTP 201/200) with a success message plus the created resource info (ID/metadata). See Superset API docs and the shipped openapi.json for exact response schema for your Superset version (inspect /swagger/v1 on your instance or the repo openapi.json). [1][2]

Sources

  • Superset API docs (use /swagger/v1 on your Superset instance). [1]
  • Superset openapi.json (contains the upload/file-related schemas and response definitions). [2]

References
[1] https://superset.apache.org/docs/6.0.0/api/
[2] Superset openapi.json (docs/static/resources/openapi.json in the Superset repo).


🌐 Web query:

Superset /api/v1/database/csv_upload endpoint response body empty

💡 Result:

Short troubleshooting checklist — common causes and fixes when /api/v1/database/csv_upload (or related CSV-upload endpoints) return an empty body:

  1. Wrong endpoint or method — recent Superset versions expose CSV upload via the Database REST API (check /swagger/v1 on your instance or docs). Confirm you’re calling the correct POST URL and path for your Superset version. [1]

  2. Permissions / DB not enabled for uploads — the target database must have “Allow file uploads to database” enabled and schemas_allowed_for_csv_upload configured correctly; if not, the UI/backend can fail silently. Check the database’s extra JSON (schemas_allowed_for_csv_upload) and toggle file uploads in the DB settings. [2][3]

  3. Request format / headers — upload must be sent as multipart/form-data with the file field and other form fields expected by the API (Authorization header, database id, schema, table name). Incorrect Content-Type or missing fields can produce empty/no-content responses. [1][4]

  4. Proxy / size limits — Nginx/Proxy or Flask limits can drop the body: check nginx client_max_body_size, any reverse-proxy timeouts, and Flask MAX_CONTENT_LENGTH. If the file is large the server/proxy may return an empty response. [5]

  5. Temporary file handling / server filesystem — Superset writes uploaded files to a tmp/uploads path; FileNotFoundError or permission issues on that path will break upload. Inspect Superset logs for FileNotFoundError or permission errors. [6]

  6. Inspect Superset logs and request/response — enable debug logging and check superset-worker and webserver logs for stack traces. Capture the full HTTP request/response (curl -v or tcpdump) to see status code and headers; empty body often accompanies 4xx/5xx responses with details in logs. [6][4]

  7. API changes or client bug — if you’re scripting against /api/v1/dataset/ vs database upload endpoints, try the official upload flow (UI) and replicate the exact request with curl to find differences. Search recent GitHub issues for version-specific bugs. [2][7]

If you want, tell me:

  • Superset version,
  • Exact curl/postman request you’re sending (headers + body),
  • Status code and any webserver logs — and I’ll give a targeted fix.

Sources: Superset API/docs and upload docs, GitHub issues and community posts, and StackOverflow. [1][2][3][4][6][7]

References
[1] Superset API docs and swagger (API versions vary).
[2] GitHub issue and discussion about enabling CSV upload / schemas_allowed_for_csv_upload.
[3] Superset docs: enabling file uploads in Database Connections.
[4] Postman/StackOverflow examples of API CSV upload problems.
[5] Common nginx/Flask upload limits (proxy config).
[6] StackOverflow issue: FileNotFoundError during CSV upload.
[7] Recent GitHub discussions about CSV upload behavior.


🌐 Web query:

Superset csv_upload 201 success response schema JSON

💡 Result:

Do you want the OpenAPI/Swagger response schema (exact JSON) for the 201 success from your Superset instance's csv-upload endpoint, or an example 201 response body? If the former I can fetch the OpenAPI spec from your Superset server (e.g. GET /swagger/v1) — please provide the Superset base URL and version (or allow me to try localhost). If you want an example response, tell me which Superset version.


🌐 Web query:

site:github.com/apache/superset csv_upload endpoint response

💡 Result:

I found GitHub issues, discussions and docs showing common responses/errors and behavior for Superset's CSV upload endpoint (Upload a CSV):

  • "Failed to fetch schemas allowed for csv upload in this database! Please contact your Superset Admin!" — reported in issue #15708. [1]
  • "No database is allowed for your csv upload" / guidance on enabling CSV upload via the Database > Advanced > Security > Allow file uploads to database setting (docs/discussion). [2][6]
  • FileNotFoundError when the uploads directory is missing in container (traceback shown in StackOverflow). [3]
  • Several threads about programmatic CSV import and API limitations / migration permissions referencing related endpoints and permissions. [7][5]

Key docs showing how CSV upload is enabled and used in the UI: Superset docs "Exploring Data" (Upload CSV instructions). [6]

Sources:
[1] github.com/apache/superset/issues/15708
[2] github.com/apache/superset/discussions/14143
[3] stackoverflow.com/questions/63244063
[5] github.com/apache/superset/discussions/23399
[6] superset.apache.org/docs/6.0.0/using-superset/exploring-data
[7] github.com/apache/superset/issues/8796

Would you like me to fetch the exact server-side handler/source file for the csv upload endpoint in the Apache Superset repo?


🏁 Script executed:

rg -i "csv_upload|upload.*test" --type py -A 5 -B 5

Repository: dimagi/connect-id

Length of output: 2234


🏁 Script executed:

cat -n ./users/tests/test_tasks.py

Repository: dimagi/connect-id

Length of output: 4208


🏁 Script executed:

rg "_request\(" --type py -B 2 -A 2

Repository: dimagi/connect-id

Length of output: 5623


🏁 Script executed:

cat -n ./users/tasks.py | sed -n '80,110p'

Repository: dimagi/connect-id

Length of output: 1384


🌐 Web query:

site:github.com/apache/superset "csv_upload" POST response json

💡 Result:

I couldn't find a single repository file named "csv_upload" that shows a direct POST response JSON. The CSV upload-related API/permission is described in Superset's OpenAPI spec (schemas_allowed_for_file_upload / csv upload settings) and in PR/discussion notes about upload endpoints. See the OpenAPI spec and the PR discussion. (fossies.org)


🌐 Web query:

site:github.com/apache/superset database_upload endpoint handler response

💡 Result:

The database file-upload endpoints the frontend calls are:

  • POST /api/v1/database/upload_metadata/ — returns parsed file metadata (columns, sheets, etc.). [1]
  • POST /api/v1/database/upload/ — uploads the file to be ingested. [1]

On the backend these are implemented/handled in the databases API code (see superset/databases/api.py) — that file contains the schema-access / upload-related handlers used by the UI. There are also recent changes noted in the 5.0 release notes that affect legacy per-file endpoints (see UPDATING). [2][3]

Sources:

  1. Frontend calls to the endpoints (UploadDataModel component). [Superset frontend — UploadDataModel]. [4]
  2. Code-review / API discussion referencing superset/databases/api.py upload/schema-access logic. [superset/databases/api.py review]. [3]
  3. UPDATING.md (notes about removal/changes to data-upload endpoints in 5.0). [2]

🌐 Web query:

Superset "csv_upload" endpoint HTTP 200 empty response OR "204" OR "202"

💡 Result:

Short answer

  • An empty HTTP 200/204/202 from Superset's csv upload usually means the request reached Superset but no synchronous JSON payload was returned (either the server accepted the upload and responded with no body, or the request failed earlier — permission/config or a CORS/preflight problem). See Superset issues and StackOverflow examples. [1][2][3]

What to check (quick checklist)

  1. Confirm CSV upload is enabled on the target database: "Allow CSV Upload" and proper "Extra" settings. If not allowed Superset will not process uploads. [3][4]
  2. Inspect response headers (Content-Length, Content-Type, Location) and server logs — a 202/204 often indicates asynchronous acceptance (no body) while errors appear in the server log. Check Superset logs for traceback. [1][5]
  3. Permissions / filesystem: permission denied errors when Superset writes temp files cause silent failures; check process user and upload directory permissions. [2]
  4. CORS / browser preflight: browser clients can show “empty” responses when preflight or CORS blocks the real response — verify network tab (request vs preflight) and server CORS settings. [1]
  5. Use the API endpoints documented for CSV/dataset import (api/v1/dataset or dataset/import) with correct multipart/form-data or JSON format; examples and recent issues show people hitting empty responses when using the wrong endpoint or parameters. [5][6]
  6. If using overwrite/import flows, note some API calls return 202/204 on accepted async jobs — check task status endpoints or the database table to confirm. See related GitHub issues. [6]

Helpful links

  • GitHub issues and discussions about csv upload behavior and errors. [1][3][6]
  • StackOverflow thread about permission-related CSV upload failures. [2]
  • Superset handbook / docs on uploading CSVs. [4]

If you want, paste the exact curl/Postman request (headers + response headers + status) and a short excerpt of the Superset log and I’ll identify the likely cause and exact fix.

References
[1] apache/superset issue and discussion search results (CSV upload).
[2] "Apache Superset permissions issue upload csv" — StackOverflow.
[3] apache/superset GitHub issues about CSV upload failures.
[4] "Upload a CSV — Superset Handbook" documentation.
[5] StackOverflow: posting to /api/v1/dataset for CSV upload (example and pitfalls).
[6] GitHub issues re: dataset import async behavior.


Handle empty or non-JSON responses from Superset's csv_upload endpoint

Superset's csv_upload endpoint can return empty responses (HTTP 204) or responses without JSON body in certain scenarios (e.g., asynchronous job acceptance). Calling response.json() unconditionally will raise a JSONDecodeError in these cases, causing the upload to fail.

Since upload() discards the return value from _request(), apply this fix:

    def _request(self, method: str, path: str, **kwargs):
        url = f"{self.base_url}{path}"
        response = self.session.request(method, url, timeout=60, **kwargs)
        response.raise_for_status()
+       if not response.content:
+           return {}
        return response.json()
🤖 Prompt for AI Agents
In users/tasks.py around lines 121 to 125, the helper always calls
response.json() which raises on empty (204) or non-JSON responses from
Superset's csv_upload endpoint; update the method to handle these cases by:
after response.raise_for_status(), check for response.status_code == 204 or
response.content is empty and return None, or inspect Content-Type (not
containing "application/json") and return None; otherwise call response.json()
inside a try/except catching json.JSONDecodeError and return None on decode
failure so upload() can proceed without expecting a JSON body.


@coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
users/tasks.py (3)

34-35: Typo in exception class name. SuperusetUserUploadException should be SupersetUserUploadException; same fix as proposed in the first review, including the references on lines 73, 93, and 103.

67-77: Temp file not cleaned up if an exception occurs during CSV generation; same try/finally cleanup as proposed in the first review.

121-125: Handle empty or non-JSON responses from Superset's csv_upload endpoint; same empty-content guard in _request as proposed in the first review.
🧹 Nitpick comments (1)
users/tasks.py (1)

71-71: Consider extracting magic number to a class or module constant.

The MAX_USER_UPLOAD_LIMIT is defined inline within the method. For better maintainability and reusability, consider defining it as a class or module-level constant.

🔎 Proposed refactor

At the module level (after imports):

MAX_USER_UPLOAD_LIMIT = 100000

Then in the method:

     def generate_csv(self) -> Path:
         tmp = tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv", delete=False)
         writer = csv.DictWriter(tmp, fieldnames=CONNECT_USER_DUMP_FIELDS)
         writer.writeheader()
-        MAX_USER_UPLOAD_LIMIT = 100000
         if ConnectUser.objects.count() > MAX_USER_UPLOAD_LIMIT:
             raise SuperusetUserUploadException(f"ConnectUser count is greater than {MAX_USER_UPLOAD_LIMIT}")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ddc77fd and 4504547.

📒 Files selected for processing (5)
  • .env_template
  • connectid/celery_app.py
  • connectid/settings.py
  • users/tasks.py
  • users/tests/test_tasks.py
🧰 Additional context used
🧬 Code graph analysis (2)
users/tasks.py (1)
users/models.py (1)
  • ConnectUser (30-118)
users/tests/test_tasks.py (1)
users/tasks.py (5)
  • ConnectUserSupersetExporter (38-135)
  • generate_csv (67-77)
  • _serialize_value (128-135)
  • upload (106-119)
  • authenticate (79-83)
🪛 Ruff (0.14.10)
users/tasks.py

64-64: Do not catch blind exception: Exception

(BLE001)


73-73: Avoid specifying long messages outside the exception class

(TRY003)


93-93: Avoid specifying long messages outside the exception class

(TRY003)


103-103: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
.env_template (1)

41-46: LGTM!

The new environment variable placeholders are well-named and align with the Superset upload configuration defined in connectid/settings.py.

connectid/celery_app.py (2)

5-5: LGTM!

The crontab import is correctly added to support the midnight schedule.


30-33: Previous issue resolved.

The schedule now correctly uses crontab(hour=0, minute=0) to run at midnight UTC daily. This fixes the previous syntax error with invalid timedelta keyword arguments.

connectid/settings.py (1)

351-359: LGTM!

The Superset upload configuration is well-structured with appropriate defaults. The use of env.int() for database_id ensures proper type conversion.

users/tests/test_tasks.py (1)

1-97: Excellent test coverage!

The tests comprehensively validate:

  • CSV generation with correct field serialization
  • Upload request payload construction
  • Authentication header setup

The use of mocking to isolate concerns and fixtures for configuration is appropriate.
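
For example, a configuration fixture of the kind referred to here might look like the following hypothetical sketch using pytest-django's settings fixture; the real tests may structure this differently and use different config keys.

import pytest


@pytest.fixture
def superset_config(settings):
    # pytest-django's `settings` fixture lets tests point the exporter at throwaway config.
    settings.SUPERSET_UPLOAD_CONFIG = {
        "base_url": "https://superset.example.com",
        "username": "test-user",
        "password": "test-pass",
        "database_id": 1,
        "table_name": "connect_users",
    }
    return settings.SUPERSET_UPLOAD_CONFIG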

users/tasks.py (1)

138-146: LGTM!

The Celery task properly wraps the exporter, captures exceptions to Sentry, and logs errors appropriately.
