Filter / separate out CI downloads where possible #1372

@brynpickering

PyPI download stats are heavily skewed by downloads made from CI pipelines. The public BigQuery table does distinguish between these download sources. E.g., a query that filters out CI downloads:

SELECT
  COUNT(*) AS num_downloads,
  DATE_TRUNC(DATE(timestamp), MONTH) AS `month`,
  file.project AS `project`
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
  -- Only user downloads, not downloads as part of CI pipelines
  details.ci IS NULL
  -- Only query the last 12 months of history (from the start of the month one year ago)
  AND DATE(timestamp)
    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 YEAR), MONTH)
    AND CURRENT_DATE()
GROUP BY `month`, `project`
ORDER BY `month` DESC

Sadly, querying this table gets expensive quickly: BigQuery bills by the amount of data scanned, and the table is very large.

The same filtering is possible with the Julia package stats tables, using their "client_type" column.
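
As a rough illustration of the Julia side, here is a minimal Python sketch that sums request counts per package while skipping CI rows. The column names (`package_uuid`, `client_type`, `request_count`), the `"ci"` marker value, and the sample rows are all assumptions made for the example; check them against the real stats file before relying on this. Rows with an empty `client_type` are kept, mirroring the `details.ci IS NULL` filter in the PyPI query.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape assumed for the Julia package request stats.
# Column names and values are illustrative, not taken from the real dataset.
SAMPLE_CSV = """package_uuid,status,client_type,date,request_count
aaaa,200,user,2024-01-01,10
aaaa,200,ci,2024-01-01,90
aaaa,200,,2024-01-02,5
bbbb,200,user,2024-01-01,3
bbbb,200,ci,2024-01-02,40
"""


def count_non_ci_requests(csv_text, ci_value="ci"):
    """Sum request_count per package, ignoring rows whose client_type is CI."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip only rows explicitly marked as CI; unknown/empty types are kept.
        if row["client_type"] == ci_value:
            continue
        totals[row["package_uuid"]] += int(row["request_count"])
    return dict(totals)


totals = count_non_ci_requests(SAMPLE_CSV)
print(totals)
```

In this made-up sample most of the raw request volume comes from the CI rows, which is the kind of skew described above.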
