Merged
15 changes: 15 additions & 0 deletions .claude/settings.local.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
{
  "permissions": {
    "allow": [
      "Bash(tee:*)",
      "Bash(git checkout:*)",
      "Bash(pip install:*)",
      "Bash(gh issue view:*)",
      "Bash(pytest:*)",
      "Bash(pip search:*)",
      "Bash(psql:*)"
    ],
    "deny": [],
    "ask": []
  }
}
61 changes: 55 additions & 6 deletions CHANGELOG.md
@@ -4,27 +4,76 @@

### Added

- ...
- **RSS/Atom feed harvesting support** (`publications/tasks.py`)
  - `parse_rss_feed_and_save_publications()` function for parsing RSS/Atom feeds
  - `harvest_rss_endpoint()` function for complete RSS harvesting workflow
  - Support for RDF-based RSS feeds (Scientific Data journal)
  - DOI extraction from multiple feed fields (prism:doi, dc:identifier)
  - Duplicate detection by DOI and URL
  - Abstract/description extraction from feed content
- feedparser library integration (v6.0.12)
  - Added to requirements.txt for RSS/Atom feed parsing
  - Supports RSS 1.0/2.0, Atom, and RDF feeds
- Django management command `harvest_journals` enhanced for RSS/Atom feeds
  - Added Scientific Data journal with RSS feed support
  - Support for both OAI-PMH and RSS/Atom feed types
  - Automatic feed type detection based on journal configuration
  - Now supports 4 journals: ESSD, AGILE-GISS, GEO-LEO (OAI-PMH), Scientific Data (RSS)
- Comprehensive RSS harvesting tests (`RSSFeedHarvestingTests`)
  - 7 test cases covering RSS parsing, duplicate detection, error handling
  - Test fixture with sample RDF/RSS feed (`tests/harvesting/rss_feed_sample.xml`)
  - Tests for max_records limit, invalid feeds, and HTTP errors
- Django management command `harvest_journals` for harvesting real journal sources
  - Command-line options for journal selection, record limits, and source creation
  - Detailed progress reporting with colored output
  - Statistics for spatial/temporal metadata extraction
- Integration tests for real journal harvesting (`tests/test_real_harvesting.py`)
  - 6 tests covering ESSD, AGILE-GISS, GEO-LEO, and EssOAr
  - Tests skipped by default (use `SKIP_REAL_HARVESTING=0` to enable)
  - Max records parameter to limit harvesting for testing
- Comprehensive error handling tests for OAI-PMH harvesting (`HarvestingErrorTests`)
  - 10 test cases covering malformed XML, missing metadata, HTTP errors, network timeouts
  - Test fixtures for various error conditions in `tests/harvesting/error_cases/`
  - Verification of graceful error handling and logging
- pytest configuration with custom markers (`pytest.ini`)
  - `real_harvesting` marker for integration tests
  - Configuration for Django test discovery

### Changed

- ...
- Fixed OAI-PMH harvesting test failures by updating response format parameters
  - Changed from invalid 'structured'/'raw' to valid 'geojson'/'wkt'/'wkb' formats
  - Updated test assertions to expect GeoJSON FeatureCollection
- Fixed syntax errors in `publications/tasks.py`
  - Fixed import statement typo
  - Fixed indentation in `extract_timeperiod_from_html` function
  - Fixed misplaced return statement in `regenerate_geopackage_cache` function
- Fixed test setup method in `tests/test_harvesting.py`
  - Removed incorrect `@classmethod` decorator from `setUp` method
- Fixed `test_regular_harvesting.py` to include `max_records` parameter in mock function
- Updated README.md with comprehensive documentation for:
  - Integration test execution
  - `harvest_journals` management command usage
  - Journal harvesting workflows

### Fixed

- ...
- Docker build for geoextent installation (added git dependency to Dockerfile)
- 18 geoextent API test failures due to invalid response format values
- 8 test setup errors in OAI-PMH harvesting tests
- Test harvesting function signature mismatch

### Deprecated

- ...
- None.

### Removed

- ...
- None.

### Security

- ...
- None.

## [0.2.0] - 2025-10-09

104 changes: 104 additions & 0 deletions README.md
@@ -144,6 +144,11 @@ python manage.py qcluster
# If you want to use the predefined feeds for continents and oceans we need to load the geometries for global regions
python manage.py load_global_regions

# Harvest publications from real journal sources (OAI-PMH or RSS/Atom)
python manage.py harvest_journals --list # List available journals
python manage.py harvest_journals --all --max-records 20 # Harvest all journals (limited to 20 records each)
python manage.py harvest_journals --journal essd --journal geo-leo # Harvest specific journals

# Start the Django development server
python manage.py runserver

@@ -233,6 +238,66 @@ OPTIMAP_EMAIL_PORT=5587

Visit the URL - http://127.0.0.1:8000/articles/links/

### Harvest Publications from Real Journals

The `harvest_journals` management command harvests publications from real journal sources (OAI-PMH or RSS/Atom) directly into your database. This is useful for:

- Populating your database with real data for testing and development
- Testing harvesting functionality against live endpoints
- Initial data loading for production deployment

**List available journals**:

```bash
python manage.py harvest_journals --list
```

**Harvest all configured journals** (with record limit):

```bash
python manage.py harvest_journals --all --max-records 50
```

**Harvest specific journals**:

```bash
# Single journal
python manage.py harvest_journals --journal essd --max-records 100

# Multiple journals
python manage.py harvest_journals --journal essd --journal geo-leo --journal agile-giss
```

**Create source entries automatically**:

```bash
python manage.py harvest_journals --journal essd --create-sources
```

**Associate with specific user**:

```bash
python manage.py harvest_journals --all --user-email [email protected]
```
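
The flags used in the examples above map naturally onto `argparse`, which Django management commands expose through `add_arguments`. The following is a minimal sketch rather than the command's actual definition, which is not shown here: the function name `build_parser`, the defaults, and the help texts are assumptions, and only the flag names are taken from the examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="harvest_journals")
    parser.add_argument("--list", action="store_true", help="List available journals")
    parser.add_argument("--all", action="store_true", help="Harvest all configured journals")
    # action="append" lets --journal be given several times; the values
    # are collected into a list in the order they appear.
    parser.add_argument("--journal", action="append", default=[], metavar="KEY")
    # Assumed default: no limit when --max-records is omitted.
    parser.add_argument("--max-records", type=int, default=None)
    parser.add_argument("--create-sources", action="store_true")
    parser.add_argument("--user-email", default=None)
    return parser


# Equivalent to: harvest_journals --journal essd --journal geo-leo --max-records 100
args = build_parser().parse_args(
    ["--journal", "essd", "--journal", "geo-leo", "--max-records", "100"]
)
print(args.journal, args.max_records)
```

Using a repeatable `--journal` flag instead of a comma-separated value keeps each journal key a separate shell argument, which is what the multi-journal example relies on.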

**Currently configured journals**:

- `essd` - Earth System Science Data (OAI-PMH) ([Issue #59](https://github.com/GeoinformationSystems/optimap/issues/59))
- `agile-giss` - AGILE-GISS conference series (OAI-PMH) ([Issue #60](https://github.com/GeoinformationSystems/optimap/issues/60))
- `geo-leo` - GEO-LEO e-docs repository (OAI-PMH) ([Issue #13](https://github.com/GeoinformationSystems/optimap/issues/13))
- `scientific-data` - Scientific Data (RSS/Atom) ([Issue #58](https://github.com/GeoinformationSystems/optimap/issues/58))

The command supports both OAI-PMH and RSS/Atom feeds, automatically detecting the feed type for each journal.
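
One way such per-journal dispatch could work is a lookup on a feed type stored in the journal configuration. This is a hedged sketch only: the `JOURNALS` dictionary, its `feed_type` key, the placeholder URLs, and the two stand-in harvester functions are all hypothetical and do not reflect the project's real configuration or task functions.

```python
from typing import Callable, Dict, Optional

# Hypothetical configuration; keys, field names, and URLs are placeholders.
JOURNALS: Dict[str, Dict[str, str]] = {
    "essd": {"feed_type": "oai-pmh", "url": "https://example.org/oai"},
    "scientific-data": {"feed_type": "rss", "url": "https://example.org/feed.rss"},
}


def fake_oai_harvester(url: str, max_records: Optional[int] = None) -> str:
    """Stand-in for an OAI-PMH harvesting routine."""
    return f"oai-pmh:{url}:{max_records}"


def fake_rss_harvester(url: str, max_records: Optional[int] = None) -> str:
    """Stand-in for an RSS/Atom harvesting routine."""
    return f"rss:{url}:{max_records}"


HARVESTERS: Dict[str, Callable[..., str]] = {
    "oai-pmh": fake_oai_harvester,
    "rss": fake_rss_harvester,
}


def select_harvester(journal_key: str) -> Callable[..., str]:
    """Pick the harvester matching the journal's configured feed type."""
    config = JOURNALS[journal_key]  # KeyError for an unknown journal
    feed_type = config["feed_type"]
    try:
        return HARVESTERS[feed_type]
    except KeyError:
        raise ValueError(f"Unsupported feed type: {feed_type!r}") from None


def harvest(journal_key: str, max_records: Optional[int] = None) -> str:
    """Run the selected harvester against the journal's configured URL."""
    harvester = select_harvester(journal_key)
    return harvester(JOURNALS[journal_key]["url"], max_records=max_records)
```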

The command provides detailed progress reporting including:

- Number of publications harvested
- Harvesting duration
- Spatial and temporal metadata statistics
- Success/failure status for each journal

Running the command multiple times is safe: the regular harvesting process detects duplicates, so only publications not already in the database are added.
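
The changelog describes duplicate detection by DOI and URL, with the DOI read from several feed fields. The sketch below illustrates that idea on plain dictionaries and is not the project's implementation: the entry keys `prism_doi`, `dc_identifier`, and `link`, as well as both function names, are assumptions made for illustration.

```python
from typing import Iterable, List, Mapping, Optional


def extract_doi(entry: Mapping[str, str]) -> Optional[str]:
    """Return the first DOI found in the assumed fields, without URL/`doi:` prefix."""
    for key in ("prism_doi", "dc_identifier"):  # assumed field names
        value = (entry.get(key) or "").strip()
        if not value:
            continue
        lowered = value.lower()
        for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
            if lowered.startswith(prefix):
                value = value[len(prefix):]
                break
        # DOIs start with the directory indicator "10."
        if value.startswith("10."):
            return value
    return None


def select_new_entries(
    entries: Iterable[Mapping[str, str]],
    known_dois: Iterable[str],
    known_urls: Iterable[str],
) -> List[Mapping[str, str]]:
    """Keep entries whose DOI and URL have not been seen before."""
    seen_dois, seen_urls = set(known_dois), set(known_urls)
    new_entries = []
    for entry in entries:
        doi = extract_doi(entry)
        url = entry.get("link")
        if (doi and doi in seen_dois) or (url and url in seen_urls):
            continue  # duplicate by DOI or by URL
        new_entries.append(entry)
        # Remember identifiers so duplicates within the same feed are skipped too.
        if doi:
            seen_dois.add(doi)
        if url:
            seen_urls.add(url)
    return new_entries
```

In a Django setting the `known_dois` and `known_urls` sets would presumably come from a database query; they are passed in explicitly here so the logic stays testable without a database.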

### Create Superusers/Admin

Superusers or administrators can be created using the `createsuperuser` command. This user will have access to the Django admin interface.
@@ -265,6 +330,10 @@ UI tests are based on [Helium](https://github.com/mherrmann/selenium-python-heli
pip install -r requirements-dev.txt
```

#### Unit Tests

Run all unit tests:

```bash
python manage.py test tests

@@ -275,6 +344,41 @@ python -Wa manage.py test
OPTIMAP_LOGGING_LEVEL=WARNING python manage.py test tests
```

#### Integration Tests (Real Harvesting)

Integration tests that harvest from live OAI-PMH endpoints are disabled by default to avoid network dependencies and slow test execution. These tests verify harvesting from real journal sources.

Run all integration tests:

```bash
# Enable real harvesting tests
SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting
```

Run a specific journal test:

```bash
# Test ESSD harvesting
SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting.RealHarvestingTest.test_harvest_essd

# Test GEO-LEO harvesting
SKIP_REAL_HARVESTING=0 python manage.py test tests.test_real_harvesting.RealHarvestingTest.test_harvest_geo_leo
```

Show skipped tests (these are skipped by default):

```bash
# Run with verbose output to see skip reasons
python manage.py test tests.test_real_harvesting -v 2
```
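
An environment-variable switch like `SKIP_REAL_HARVESTING` is commonly implemented with `unittest.skipIf`. The following is a sketch under assumptions, not the contents of `tests/test_real_harvesting.py`: the helper `real_harvesting_disabled`, the example test class, and the rule that only the value `0` enables the tests are all hypothetical.

```python
import os
import unittest
from typing import Mapping


def real_harvesting_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """Assumed rule: tests stay skipped unless SKIP_REAL_HARVESTING is exactly "0"."""
    return env.get("SKIP_REAL_HARVESTING", "1") != "0"


@unittest.skipIf(
    real_harvesting_disabled(),
    "Real harvesting tests are disabled; set SKIP_REAL_HARVESTING=0 to enable",
)
class ExampleRealHarvestingTest(unittest.TestCase):
    """Hypothetical test class; a real one would contact a live endpoint."""

    def test_placeholder(self) -> None:
        self.assertTrue(True)
```

Because the decorator is evaluated when the module is imported, the variable has to be set before the test runner starts, as in the commands above.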

**Supported journals**:

- Earth System Science Data (ESSD) - [Issue #59](https://github.com/GeoinformationSystems/optimap/issues/59)
- AGILE-GISS - [Issue #60](https://github.com/GeoinformationSystems/optimap/issues/60)
- GEO-LEO e-docs - [Issue #13](https://github.com/GeoinformationSystems/optimap/issues/13)
- ESS Open Archive (EssOAr) - [Issue #99](https://github.com/GeoinformationSystems/optimap/issues/99) _(endpoint needs confirmation)_

### Run UI tests

Running UI tests requires either the compose configuration or `manage.py runserver` running in a separate shell.
2 changes: 1 addition & 1 deletion optimap/__init__.py
@@ -1,2 +1,2 @@
-__version__ = "0.2.0"
+__version__ = "0.3.0"
 VERSION = __version__