Commit c068e3f

Merge branch 'dev' into mc_484_eliminate_contact_info_and_agency_meta_recrd_type
# Conflicts:
#   pyproject.toml
#   src/external/pdap/client.py
#   src/security/manager.py
#   tests/manual/external/pdap/test_match_agency.py
#   uv.lock
2 parents cebf085 + ae77cb4 commit c068e3f

1,001 files changed, +19079 -4467 lines changed

ENV.md

Lines changed: 25 additions & 15 deletions
@@ -57,19 +57,31 @@ Note that some tasks/subtasks are themselves enabled by other tasks.
 
 ### Scheduled Task Flags
 
-| Flag | Description |
-|-------------------------------------|-------------------------------------------------------------------------------|
-| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. Disabling disables all other scheduled tasks. |
-| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
-| `POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG` | Populates the backlog snapshot. |
-| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
-| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
-| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
-| `IA_SAVE_TASK_FLAG` | Saves URLs to Internet Archives. |
-| `MARK_TASK_NEVER_COMPLETED_TASK_FLAG` | Marks tasks that were started but never completed (usually due to a restart). |
-| `DELETE_STALE_SCREENSHOTS_TASK_FLAG` | Deletes stale screenshots for URLs already validated. |
-| `TASK_CLEANUP_TASK_FLAG` | Cleans up tasks that are no longer needed. |
-| `REFRESH_MATERIALIZED_VIEWS_TASK_FLAG` | Refreshes materialized views. |
+| Flag | Description |
+|--------------------------------------------|-------------------------------------------------------------------------------|
+| `SCHEDULED_TASKS_FLAG` | All scheduled tasks. Disabling disables all other scheduled tasks. |
+| `PUSH_TO_HUGGING_FACE_TASK_FLAG` | Pushes data to HuggingFace. |
+| `POPULATE_BACKLOG_SNAPSHOT_TASK_FLAG` | Populates the backlog snapshot. |
+| `DELETE_OLD_LOGS_TASK_FLAG` | Deletes old logs. |
+| `RUN_URL_TASKS_TASK_FLAG` | Runs URL tasks. |
+| `IA_PROBE_TASK_FLAG` | Extracts and links Internet Archives metadata to URLs. |
+| `IA_SAVE_TASK_FLAG` | Saves URLs to Internet Archives. |
+| `MARK_TASK_NEVER_COMPLETED_TASK_FLAG` | Marks tasks that were started but never completed (usually due to a restart). |
+| `DELETE_STALE_SCREENSHOTS_TASK_FLAG` | Deletes stale screenshots for URLs already validated. |
+| `TASK_CLEANUP_TASK_FLAG` | Cleans up tasks that are no longer needed. |
+| `REFRESH_MATERIALIZED_VIEWS_TASK_FLAG` | Refreshes materialized views. |
+| `UPDATE_URL_STATUS_TASK_FLAG` | Updates the status of URLs. |
+| `DS_APP_SYNC_AGENCY_ADD_TASK_FLAG` | Adds new agencies to the Data Sources App. |
+| `DS_APP_SYNC_AGENCY_UPDATE_TASK_FLAG` | Updates existing agencies in the Data Sources App. |
+| `DS_APP_SYNC_AGENCY_DELETE_TASK_FLAG` | Deletes agencies in the Data Sources App. |
+| `DS_APP_SYNC_DATA_SOURCE_ADD_TASK_FLAG` | Adds new data sources to the Data Sources App. |
+| `DS_APP_SYNC_DATA_SOURCE_UPDATE_TASK_FLAG` | Updates existing data sources in the Data Sources App. |
+| `DS_APP_SYNC_DATA_SOURCE_DELETE_TASK_FLAG` | Deletes data sources in the Data Sources App. |
+| `DS_APP_SYNC_META_URL_ADD_TASK_FLAG` | Adds new meta URLs to the Data Sources App. |
+| `DS_APP_SYNC_META_URL_UPDATE_TASK_FLAG` | Updates existing meta URLs in the Data Sources App. |
+| `DS_APP_SYNC_META_URL_DELETE_TASK_FLAG` | Deletes meta URLs in the Data Sources App. |
+| `DS_APP_SYNC_USER_FOLLOWS_GET_TASK_FLAG` | Gets user follows from the Data Sources App. |
+| `INTEGRITY_MONITOR_TASK_FLAG` | Runs integrity checks. |
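The relationship between `SCHEDULED_TASKS_FLAG` and the per-task flags in the table above can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `flag_enabled` and `scheduled_task_enabled`, the default-to-enabled behavior, and the set of values treated as "disabled" are all assumptions. Only the flag names come from the table.

```python
import os
from typing import Mapping

# Assumption: these strings mean "disabled"; the project's real parsing may differ.
_FALSY = {"0", "false", "no", "off"}


def flag_enabled(name: str, env: Mapping[str, str]) -> bool:
    """Return True unless the flag is explicitly set to a falsy value (assumed default: enabled)."""
    return env.get(name, "1").strip().lower() not in _FALSY


def scheduled_task_enabled(task_flag: str, env: Mapping[str, str] = os.environ) -> bool:
    """A scheduled task runs only if the master flag and its own flag are both enabled."""
    return flag_enabled("SCHEDULED_TASKS_FLAG", env) and flag_enabled(task_flag, env)


env = {"SCHEDULED_TASKS_FLAG": "1", "INTEGRITY_MONITOR_TASK_FLAG": "0"}
print(scheduled_task_enabled("INTEGRITY_MONITOR_TASK_FLAG", env))       # False
print(scheduled_task_enabled("DS_APP_SYNC_AGENCY_ADD_TASK_FLAG", env))  # True
```

Checking the master flag first means a single variable can switch off every scheduled task without touching the individual flags.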

 ### URL Task Flags
 
@@ -81,7 +93,6 @@ URL Task Flags are collectively controlled by the `RUN_URL_TASKS_TASK_FLAG` flag
 | `URL_HTML_TASK_FLAG` | URL HTML scraping task. |
 | `URL_RECORD_TYPE_TASK_FLAG` | Automatically assigns Record Types to URLs. |
 | `URL_AGENCY_IDENTIFICATION_TASK_FLAG` | Automatically assigns and suggests Agencies for URLs. |
-| `URL_SUBMIT_APPROVED_TASK_FLAG` | Submits approved URLs to the Data Sources App. |
 | `URL_MISC_METADATA_TASK_FLAG` | Adds misc metadata to URLs. |
 | `URL_AUTO_RELEVANCE_TASK_FLAG` | Automatically assigns Relevances to URLs. |
 | `URL_PROBE_TASK_FLAG` | Probes URLs for web metadata. |
@@ -90,7 +101,6 @@ URL Task Flags are collectively controlled by the `RUN_URL_TASKS_TASK_FLAG` flag
 | `URL_AUTO_VALIDATE_TASK_FLAG` | Automatically validates URLs. |
 | `URL_AUTO_NAME_TASK_FLAG` | Automatically names URLs. |
 | `URL_SUSPEND_TASK_FLAG` | Suspends URLs meeting suspension criteria. |
-| `URL_SUBMIT_META_URLS_TASK_FLAG` | Submits meta URLs to the Data Sources App. |
 
 ### Agency ID Subtasks

README.md

Lines changed: 68 additions & 0 deletions
@@ -156,3 +156,71 @@ if it detects any missing docstrings or type hints in files that you have modifi
 These will *not* block any Pull request, but exist primarily as advisory comments to encourage good coding standards.
 
 Note that `python_checks.yml` will only function on pull requests made from within the repo, not from a forked repo.
+
+# Syncing to Data Sources App
+
+The Source Manager (SM) is part of a two-app system, with the other app being the Data Sources (DS) App.
+
+
+## Add, Update, and Delete
+
+These are the core synchronization actions.
+
+In order to propagate changes to DS, we synchronize additions, updates, and deletions of the following entities:
+- Agencies
+- Data Sources
+- Meta URLs
+
+Each action for each entity occurs through a separate task. At the moment, there are nine tasks total.
+
+Each task gathers requisite information from the SM database and sends a request to one of nine corresponding endpoints in the DS API.
+
+Each DS endpoint uses the following format:
+
+```text
+/v3/sync/{entity}/{action}
+```
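Three entities combined with three actions give the nine endpoints described above. The sketch below enumerates them from the `/v3/sync/{entity}/{action}` template. The entity path segments used here (`agencies`, `data-sources`, `meta-urls`) are placeholders chosen for illustration; the DS API's actual segments are not stated in this README.

```python
from itertools import product

# Assumption: these path segments are illustrative; the DS API's real slugs may differ.
ENTITIES = ("agencies", "data-sources", "meta-urls")
ACTIONS = ("add", "update", "delete")


def sync_path(entity: str, action: str) -> str:
    """Fill in the /v3/sync/{entity}/{action} template."""
    return f"/v3/sync/{entity}/{action}"


paths = [sync_path(e, a) for e, a in product(ENTITIES, ACTIONS)]
print(len(paths))  # 9
print(paths[0])    # /v3/sync/agencies/add
```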
+
+Synchronizations are designed to occur on an hourly basis.
+
+Here is a high-level description of how each action works:
+
+### Add
+
+Adds the given entities to DS.
+
+These are denoted with the `/{entity}/add` path in the DS API.
+
+When an entity is added, the endpoint returns a unique DS ID, which is mapped to the internal SM database ID via the DS app link tables.
+
+For an entity to be added, it must meet preconditions, which differ by entity:
+- Agencies: Must have an agency entry in the database and be linked to a location.
+- Data Sources: Must be a URL that has been internally validated as a data source and linked to an agency.
+- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency.
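The add flow described above (send eligible entities, then record the returned DS IDs against SM IDs) might look roughly like this. Everything here is a hypothetical in-memory stand-in: `LinkTable`, `run_add_task`, and `fake_post_add` are illustrative names, and treating "has no link yet" as the signal that an entity still needs adding is an assumption rather than something the README states.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LinkTable:
    """Stand-in for a DS app link table: SM database ID -> DS ID."""
    sm_to_ds: dict[int, int] = field(default_factory=dict)


def run_add_task(
    eligible_sm_ids: list[int],
    links: LinkTable,
    post_add: Callable[[list[int]], dict[int, int]],
) -> list[int]:
    """Send entities that have no DS link yet and record the DS IDs that come back."""
    pending = [sm_id for sm_id in eligible_sm_ids if sm_id not in links.sm_to_ds]
    if not pending:
        return []
    links.sm_to_ds.update(post_add(pending))
    return pending


def fake_post_add(sm_ids: list[int]) -> dict[int, int]:
    """Hypothetical stand-in for the DS add endpoint: assigns DS IDs starting at 1000."""
    return {sm_id: 1000 + i for i, sm_id in enumerate(sm_ids)}


links = LinkTable(sm_to_ds={1: 900})
sent = run_add_task([1, 2, 3], links, fake_post_add)
print(sent)            # [2, 3]
print(links.sm_to_ds)  # {1: 900, 2: 1000, 3: 1001}
```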
+
+### Update
+
+Updates the given entities in DS.
+
+These are denoted with the `/{entity}/update` path in the DS API.
+
+An update consists of submitting the updated entities (in full) to the requisite endpoint and updating the local app link to indicate that the update occurred. All updates are designed to be full overwrites of the entity.
+
+For an entity to be updated, it must meet preconditions, which differ by entity:
+- Agencies: Must have either an agency row updated or an agency/location link updated or deleted.
+- Data Sources: One of the following must be updated:
+  - The URL table
+  - The record type table
+  - The optional data sources metadata table
+  - The agency link table (either an addition or deletion)
+- Meta URLs: Must be a URL that has been internally validated as a meta URL and linked to an agency. Either the URL table or the agency link table (addition or deletion) must be updated.
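A rough sketch of the update flow, assuming that "changed since the last sync" is detected by comparing timestamps. That mechanism, the `Link` class, the `last_synced_at` field, and the payload shape are all assumptions made for illustration. The README only says that updates are full overwrites and that the local app link records that the update occurred.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Link:
    ds_id: int
    last_synced_at: int  # assumption: the link row stores when it was last synced


def run_update_task(
    entities: dict[int, dict[str, Any]],
    links: dict[int, Link],
    now: int,
) -> list[dict[str, Any]]:
    """Build full-overwrite payloads for changed entities and mark their links as synced."""
    payloads = []
    for sm_id, entity in entities.items():
        link = links.get(sm_id)
        if link is None or entity["updated_at"] <= link.last_synced_at:
            continue  # never added to DS, or unchanged since the last sync
        # Full overwrite: send every field, not only the ones that changed.
        payloads.append({"ds_id": link.ds_id, **entity["fields"]})
        # A real task would record this only after the DS request succeeds.
        link.last_synced_at = now
    return payloads


entities = {
    1: {"updated_at": 50, "fields": {"name": "Agency A", "location_ids": [7]}},
    2: {"updated_at": 10, "fields": {"name": "Agency B", "location_ids": []}},
}
links = {1: Link(ds_id=900, last_synced_at=20), 2: Link(ds_id=901, last_synced_at=20)}
print(run_update_task(entities, links, now=60))
# [{'ds_id': 900, 'name': 'Agency A', 'location_ids': [7]}]
```

Sending the whole entity each time keeps the DS side from having to merge partial changes, at the cost of larger requests.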
+
+### Delete
+
+Deletes the given entities from DS.
+
+These are denoted with the `/{entity}/delete` path in the DS API.
+
+This consists of submitting a set of DS IDs to the requisite endpoint and removing the associated DS app link entry in the SM database.
+
+When an entity with a corresponding DS App Link is deleted from the Source Manager, the core data is removed but a deletion flag is added to the DS App Link entry, indicating that the entry has not yet been removed from the DS App. The deletion task uses this flag to identify entities to be deleted, submits the deletion request to the DS API, and removes both the flag and the DS App Link.
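The flag-then-delete flow described above can be sketched with an in-memory link table. The names `AppLink`, `pending_deletion`, and `run_delete_task` are hypothetical, and leaving the links in place when the request fails is one plausible design, not a documented behavior.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AppLink:
    ds_id: int
    pending_deletion: bool = False  # hypothetical name for the deletion flag


def run_delete_task(
    links: dict[int, AppLink],
    post_delete: Callable[[list[int]], None],
) -> list[int]:
    """Submit flagged DS IDs for deletion, then drop their link entries."""
    flagged = {sm_id: link.ds_id for sm_id, link in links.items() if link.pending_deletion}
    if not flagged:
        return []
    ds_ids = sorted(flagged.values())
    post_delete(ds_ids)  # if this raises, the links (and flags) stay in place for a retry
    for sm_id in flagged:
        del links[sm_id]
    return ds_ids


submitted: list[list[int]] = []
links = {1: AppLink(900), 2: AppLink(901, pending_deletion=True)}
print(run_delete_task(links, submitted.append))  # [901]
print(sorted(links))                             # [1]
```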

alembic/Jenkinsfile

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+pipeline {
+    agent {
+        dockerfile {
+            filename 'Dockerfile'
+            args '-e POSTGRES_USER=POSTGRES_USER -e POSTGRES_PASSWORD=POSTGRES_PASSWORD -e POSTGRES_DB=POSTGRES_DB -e POSTGRES_HOST=POSTGRES_HOST -e POSTGRES_PORT=POSTGRES_PORT'
+        }
+    }
+
+    stages {
+        stage('Migrate using Alembic') {
+            steps {
+                echo 'Building..'
+                sh 'python apply_migrations.py'
+            }
+        }
+    }
+    post {
+        failure {
+            script {
+                def payload = """{
+                    "content": "🚨 Build Failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}"
+                }"""
+
+                sh """
+                    curl -X POST -H "Content-Type: application/json" -d '${payload}' ${env.WEBHOOK_URL}
+                """
+            }
+        }
+    }
+}
Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
+"""Add anonymous annotation tables
+
+Revision ID: 7aace6587d1a
+Revises: 43077d7e08c5
+Create Date: 2025-10-13 20:07:18.388899
+
+"""
+from typing import Sequence, Union
+
+from alembic import op
+import sqlalchemy as sa
+
+from src.util.alembic_helpers import url_id_column, agency_id_column, created_at_column, location_id_column, enum_column
+
+# revision identifiers, used by Alembic.
+revision: str = '7aace6587d1a'
+down_revision: Union[str, None] = '43077d7e08c5'
+branch_labels: Union[str, Sequence[str], None] = None
+depends_on: Union[str, Sequence[str], None] = None
+
+
+def upgrade() -> None:
+    op.create_table(
+        "anonymous_annotation_agency",
+        url_id_column(),
+        agency_id_column(),
+        created_at_column(),
+        sa.PrimaryKeyConstraint('url_id', 'agency_id')
+    )
+    op.create_table(
+        "anonymous_annotation_location",
+        url_id_column(),
+        location_id_column(),
+        created_at_column(),
+        sa.PrimaryKeyConstraint('url_id', 'location_id')
+    )
+    op.create_table(
+        "anonymous_annotation_record_type",
+        url_id_column(),
+        enum_column(
+            column_name="record_type",
+            enum_name="record_type"
+        ),
+        created_at_column(),
+        sa.PrimaryKeyConstraint('url_id', 'record_type')
+    )
+    op.create_table(
+        "anonymous_annotation_url_type",
+        url_id_column(),
+        enum_column(
+            column_name="url_type",
+            enum_name="url_type"
+        ),
+        created_at_column(),
+        sa.PrimaryKeyConstraint('url_id', 'url_type')
+    )
+
+
+def downgrade() -> None:
+    pass
