Skip to content

Feat/obtain appstream data and use keywords for search#202

Open
m1rm wants to merge 45 commits intomainfrom
feat/obtain-appstream-data
Open

Feat/obtain appstream data and use keywords for search#202
m1rm wants to merge 45 commits intomainfrom
feat/obtain-appstream-data

Conversation

@m1rm
Copy link
Collaborator

@m1rm m1rm commented Dec 26, 2025

This PR:

  • adds a command and data fetcher to obtain app data for packages
  • adds a migration to add columns to the package table to hold the new information
    • new column "keywords" contains description (en and de), categories and keywords from app-data to be used for improving search functionality
  • keywords are used for OpenSearch (searching with German terms now works)

@m1rm m1rm self-assigned this Dec 26, 2025
@m1rm m1rm marked this pull request as draft December 26, 2025 08:58
@m1rm m1rm marked this pull request as ready for review January 1, 2026 15:56
@m1rm m1rm changed the title Feat/obtain appstream data Feat/obtain appstream data and use keywords for search Jan 10, 2026
@m1rm m1rm requested a review from pierres January 10, 2026 20:28
@pierres
Copy link
Member

pierres commented Feb 26, 2026

Note: This review was performed by Claude (AI assistant).

Overview

This PR adds appstream metadata (descriptions, categories, keywords) to packages and uses them as a new keywords field in OpenSearch to improve search (notably enabling German-language search). CI is green. The overall idea is solid and well-scoped.

Issues

Bugs

  1. KeywordProcessorTest is broken (api/tests/Service/KeywordProcessorTest.php:29,52): sort() returns bool, not the sorted array. The test is comparing true === true and would pass regardless of the actual output. Should use $this->assertEqualsCanonicalizing($expected, $actual) or sort in-place first.

  2. ini_set('memory_limit', '8G') in the command (UpdateAppStreamDataCommand.php:40): This is a code smell left from debugging. If the command truly needs 8G, there's an architectural problem (the data is streamed via generators, so it shouldn't). If it's no longer needed, it should be removed. If it is needed, it should at least be configurable or documented why.

  3. error_log() instead of logger (AppStreamDataFetcher.php:54): The command injects a PSR LoggerInterface, but the fetcher uses raw error_log(). Should use the injected logger for consistency and testability.

Design Concerns

  1. AppStreamDataDenormalizer materializes everything into an array (AppStreamDataDenormalizer.php:20-31): The generator pattern in AppStreamDataFetcher is negated because the denormalizer spreads the generator into an array with [...(function () { yield ... })()]. This loads all components of a repo into memory at once. This is likely the reason for the 8G memory limit. Consider returning a generator/iterator instead.

  2. AppStreamDataDenormalizer::getKeywords() is public (AppStreamDataDenormalizer.php:128): Should be private like the other helper methods.

  3. $this->lock() without checking the return value (UpdateAppStreamDataCommand.php:39): LockableTrait::lock() returns false if the lock cannot be acquired. The return value should be checked to avoid running duplicate instances.

  4. Description in configure() has a leading newline and excessive indentation (UpdateAppStreamDataCommand.php:33-34): Minor but noticeable in --help output.

  5. Extra blank line (UpdateAppStreamDataCommand.php:59): Double blank line before return.

Architecture / Suggestions

  1. Hardcoded 'extra' and 'x86_64' in AppStreamDataHelper::obtainAppStreamDataVersion() (AppStreamDataHelper.php:23-26): The repo and architecture are hardcoded to look up the archlinux-appstream-data package version. This works today but is fragile. Consider making these configurable or at least documented.

  2. findOneByName returns an arbitrary package when multiple repos have the same name (PackageRepository.php:243-254): It filters by testing = false and takes setMaxResults(1) without ordering. If a package exists in both core and extra, the result is non-deterministic. This may be fine for keyword assignment but worth noting.

  3. No batch flushing (UpdateAppStreamDataCommand.php:57): All persisted packages are flushed in a single call at the end. For large datasets, periodic flushing (e.g., every 100-500 entities) with $entityManager->clear() would reduce memory pressure.

  4. Exposing OpenSearch port 9200 (docker/app.yml): This was added for debugging and should probably not be merged to main (or at least be documented as dev-only).

Minor / Nits

  1. The docs/opensearch.md file is fine as a dev note, but the HTTP URL in the curl example assumes the default docker setup.

  2. The phpstan.dist.neon change excludes config/reference.php — this seems unrelated to this PR.

Summary

The feature is a good addition. The main things to address before merging:

  • Fix the broken test (sort() return value)
  • Replace error_log() with logger
  • Check $this->lock() return value
  • Consider the memory issue (denormalizer materialization vs. the 8G ini_set)
  • Remove the exposed OpenSearch port or make it intentional

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants