This repository was archived by the owner on Jan 21, 2026. It is now read-only.

The RSS parser is a command-line utility that simplifies the process of parsing RSS feeds, enriching the extracted data, exporting it to JSON files, enabling content classification, and providing English translation.

mitjafelicijan/rssperado


rssperado

This project is under active development and is still a work in progress. It is not yet ready for production use; use it at your own risk.


The RSS parser is a command-line utility that simplifies the process of parsing RSS feeds, enriching the extracted data, exporting it to JSON files, enabling content classification, and providing English translation.

Feeds are processed one after another, while the stories within each feed are parsed in parallel to speed things up. Make sure your machine has enough RAM to handle the load (2 GB should be enough, but keep an eye on usage).
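The processing order described above can be sketched roughly as follows. This is only an illustration, not the tool's actual code: `parse_story`, `process_feeds`, the use of threads, and the `max_workers` value are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_story(entry):
    """Hypothetical per-story work (image fetching, translation, ... would go here)."""
    return {"title": entry["title"].strip()}


def process_feeds(feeds, max_workers=4):
    """Walk the feeds one at a time and fan each feed's stories out to a thread pool."""
    results = {}
    for feed_url, entries in feeds.items():  # feeds: strictly sequential
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map runs parse_story concurrently and keeps the input order
            results[feed_url] = list(pool.map(parse_story, entries))
    return results


if __name__ == "__main__":
    demo = {"https://example.com/feed.rss": [{"title": " First "}, {"title": "Second"}]}
    print(process_feeds(demo))
```

Bounding the pool size is what keeps memory use predictable, since only `max_workers` stories are in flight at any moment.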

Translations are done with argostranslate, which uses OpenNMT (https://opennmt.net/). The translation model is downloaded on the first run and cached for future use; it is not included in the repository due to its size.

Named-entity recognition (NER) is done with Hugging Face Transformers (https://huggingface.co/transformers/) using the dbmdz/bert-large-cased-finetuned-conll03-english model. The model is downloaded on the first run and cached for future use; it is not included in the repository due to its size. This feature requires translation to be enabled, because it always takes the English translation as input.

The content-extraction option (--extract-content) visits each story's page and tries to extract the article content from it. It also uses NLP to summarize the full content in a few sentences.

These models can take up a lot of disk space, so make sure enough is available (10 GB on a VM should be enough).

Each feed is exported to a JSON file named after the MD5 hash of the feed URL, md5(feed-url), so the filename is predictable and easy to integrate into your own code. An example of such a filename is 26a13287efe864d840c98b840f16465d.json.
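To locate a feed's output file from your own code, the filename can be recomputed from the feed URL. The sketch below assumes the URL is hashed as UTF-8 text exactly as it appears in the input file, with no normalization; the helper name `feed_output_filename` is made up for this example.

```python
import hashlib


def feed_output_filename(feed_url: str) -> str:
    """Return '<md5 hex digest of the URL>.json' for a feed URL."""
    # Assumption: the URL is hashed as-is, encoded as UTF-8
    digest = hashlib.md5(feed_url.encode("utf-8")).hexdigest()
    return f"{digest}.json"


if __name__ == "__main__":
    print(feed_output_filename("https://www.spiegel.de/index.rss"))
```

If the computed name does not match a file in the output directory, check for differences such as a trailing slash or surrounding whitespace in the URL.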


Features

  • Fetches RSS feeds in parallel
  • Fetches images from feed
  • Fetches images from OG meta tags as fallback
  • Resizes images to specific dimensions
  • Translates each story to English
  • Enables NER classification
  • Detects language of the story
  • Allows extraction of content from each story
  • Distinguishes between articles and podcasts
  • Podcasts have accompanying URL to audio file
  • Images can be resized to fit or contain

Installation

Requirements:

  • Python 3.8+
  • pip
  • virtualenv

On Debian or Ubuntu, the prerequisites can be installed with apt:

$ [sudo] apt install build-essential python3 python3-venv python3-pip

And now the actual installation:

$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt

If you have issues installing PyTorch, please refer to the official installation guide. For a CPU-only installation, use the following command:

$ source .venv/bin/activate
$ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Usage

$ source .venv/bin/activate
$ python3 rssperado.py --input-feeds urls.txt --resize --image-width 200 --image-height 200 --ner --translate

Help command output

$ source .venv/bin/activate
$ python3 rssperado.py --help

usage: rssperado.py [-h] --input-feeds INPUT_FEEDS [--output-dir OUTPUT_DIR] [--max-entries MAX_ENTRIES]
                    [--fetch-images | --no-fetch-images] [--og-images | --no-og-images] [--resize | --no-resize]
                    [--image-width IMAGE_WIDTH] [--image-height IMAGE_HEIGHT] [--image-quality IMAGE_QUALITY]
                    [--image-fit | --no-image-fit] [--image-contain | --no-image-contain]
                    [--extract-content | --no-extract-content] [--translate | --no-translate] [--ner | --no-ner]
                    [--verbose | --no-verbose] [--version]

Available options

Option                                    Description
-h, --help                                Show this help message and exit
--input-feeds INPUT_FEEDS                 Specify input file with the list of RSS feeds
--output-dir OUTPUT_DIR                   Specify output directory (default: ./output)
--max-entries MAX_ENTRIES                 Specify max feed entries to parse (default: 50)
--fetch-images, --no-fetch-images         Fetches images from feed
--og-images, --no-og-images               Fetches images from OG meta tags as fallback
--resize, --no-resize                     Resizes images to specific dimensions
--image-width IMAGE_WIDTH                 Specify resized image width (default: 800)
--image-height IMAGE_HEIGHT               Specify resized image height (default: 600)
--image-quality IMAGE_QUALITY             Specify resized image quality (default: 90)
--image-fit, --no-image-fit               Fit image to canvas (default)
--image-contain, --no-image-contain       Contain image in canvas
--extract-content, --no-extract-content   Extracts content from each story
--translate, --no-translate               Translate each story to English
--ner, --no-ner                           Enables NER classification
--verbose, --no-verbose                   Make the operation more talkative
--version                                 Show program's version number and exit
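The options do not spell out how "fit" and "contain" differ. One common reading, and the one used by Pillow's ImageOps.fit and ImageOps.contain, is that "fit" scales the image to cover the canvas and crops the overflow, while "contain" scales it to sit entirely inside the canvas. The sketch below only computes the resulting dimensions under that assumption; the function names are hypothetical and rssperado itself may behave differently.

```python
def contain_size(width, height, box_w, box_h):
    """Largest size with the original aspect ratio that fits inside the box."""
    scale = min(box_w / width, box_h / height)
    return round(width * scale), round(height * scale)


def fit_size_and_crop(width, height, box_w, box_h):
    """Scale the image to cover the box; return (scaled size, centered crop box)."""
    scale = max(box_w / width, box_h / height)
    scaled_w, scaled_h = round(width * scale), round(height * scale)
    left = (scaled_w - box_w) // 2
    top = (scaled_h - box_h) // 2
    return (scaled_w, scaled_h), (left, top, left + box_w, top + box_h)


if __name__ == "__main__":
    # A 1600x900 source image and the default 800x600 canvas
    print(contain_size(1600, 900, 800, 600))
    print(fit_size_and_crop(1600, 900, 800, 600))
```

Under this reading, "contain" never crops but may produce an image smaller than the canvas in one dimension, whereas "fit" always yields exactly the requested size at the cost of cropping.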

Example input file

Each line in the input file represents a single RSS feed URL.

https://newatlas.com/index.rss
https://mitjafelicijan.com/feed.rss
https://www.spiegel.de/index.rss
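
A file in this format can be read with a few lines of Python, for instance when generating or validating the feed list from another script. This is a minimal sketch: skipping blank lines and trimming whitespace are assumptions, not documented behavior of rssperado, and `read_feed_urls` is a hypothetical helper.

```python
def read_feed_urls(text: str) -> list[str]:
    """Return one feed URL per non-empty line, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = """https://newatlas.com/index.rss
https://mitjafelicijan.com/feed.rss

https://www.spiegel.de/index.rss
"""
    for url in read_feed_urls(sample):
        print(url)
```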

Example output file

[{
  "guid": "efb152d38b66ec229e8d654e0235181b",
  "link": "https://www.spiegel.de/ausland/flugzeugabsturz-bei-washington-d-c-pilot-sass-zusammengesackt-im-cockpit-a-b5f70396-b3f4-48b1-b6ca-f2063fb02ed0",
  "title": {
    "origin": "Flugzeugabsturz bei Washington D.C.: Pilot saß zusammengesackt im Cockpit",
    "en": "Aircraft crash at Washington D.C.: Pilot sat down in cockpit"
  },
  "summary": {
    "origin": "Neue Details im rätselhaften Fall des abgestürzten Kleinflugzeugs nahe der US-Hauptstadt: So war der Pilot der Cessna vor dem Crash offenbar bewusstlos – möglicherweise wegen eines Sauerstoffproblems.",
    "en": "New details in the mysterious case of the crashed small aircraft near the US capital: Thus, the pilot of the Cessna before the crash was apparently unconscious – possibly due to an oxygen problem."
  },
  "published": {
    "ago": "7 hours ago",
    "dt": "Tue, 6 Jun 2023 02:36:08 +0200"
  },
  "ner": [
    {
      "entity_group": "LOC",
      "score": "0.51140416",
      "word": "Hauptstadt",
      "start": 152,
      "end": 162
    }, {
      "entity_group": "ORG",
      "score": "0.7654832",
      "word": "Cessna",
      "start": 185,
      "end": 191
    }
  ],
  "image_filename": "efb152d38b66ec229e8d654e0235181b.jpg",
  "image_url": "https://cdn.prod.www.spiegel.de/images/61ebb31e-d8b6-4c8a-8be0-6150d6a98963_w520_r2.08_fpx49.5_fpy49.98.jpg",
  "podcast_url": null,
  "source": "www.spiegel.de",
  "type": "story",
  "origin_language": "de"
}]
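
Consuming such a file from your own code could look like the sketch below, which keeps only the named entities above a confidence threshold. Note that in the example output the score is serialized as a string, so it is converted to a float first. The helper name `confident_entities` and the 0.6 threshold are arbitrary choices for illustration.

```python
import json


def confident_entities(stories, min_score=0.6):
    """Yield (guid, word, entity_group) for entities scoring at least min_score."""
    for story in stories:
        for entity in story.get("ner") or []:
            # The example output stores the score as a string, e.g. "0.7654832"
            if float(entity["score"]) >= min_score:
                yield story["guid"], entity["word"], entity["entity_group"]


if __name__ == "__main__":
    # Trimmed-down copy of the example output above
    sample = json.loads("""
    [{"guid": "efb152d38b66ec229e8d654e0235181b",
      "ner": [
        {"entity_group": "LOC", "score": "0.51140416", "word": "Hauptstadt"},
        {"entity_group": "ORG", "score": "0.7654832", "word": "Cessna"}
      ]}]
    """)
    print(list(confident_entities(sample)))
    # prints [('efb152d38b66ec229e8d654e0235181b', 'Cessna', 'ORG')]
```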

Available translations

  • Arabic → English
  • Azerbaijani → English
  • Catalan → English
  • Chinese → English
  • Czech → English
  • Danish → English
  • Dutch → English
  • Esperanto → English
  • Finnish → English
  • French → English
  • German → English
  • Greek → English
  • Hebrew → English
  • Hindi → English
  • Hungarian → English
  • Indonesian → English
  • Irish → English
  • Italian → English
  • Japanese → English
  • Korean → English
  • Persian → English
  • Polish → English
  • Portuguese → English
  • Russian → English
  • Slovak → English
  • Spanish → English
  • Swedish → English
  • Thai → English
  • Turkish → English
