A Ruby gem for manipulating GTFS feeds as DataFrames, built on Polars (ruby-polars).
This project was created to bring the power of partridge to Ruby.
⚠️ Warning: This gem is not ready for production use. It is currently in active development and the API may change without notice.
Install the gem and add it to the application's Gemfile by executing:
bundle add gtfs_df
If bundler is not being used to manage dependencies, install the gem by executing:
gem install gtfs_df
require 'gtfs_df'
# Load from a zip file
feed = GtfsDf::Reader.load_from_zip('path/to/gtfs.zip')
# Or, load from a directory
feed = GtfsDf::Reader.load_from_dir('path/to/gtfs_dir')
# Parse times as seconds since midnight instead of string
feed = GtfsDf::Reader.load_from_dir('path/to/gtfs_dir', parse_times: true)
# Access dataframes for each GTFS file
puts feed.agency.head
puts feed.routes.head
puts feed.trips.head
puts feed.stop_times.head
puts feed.stops.head
The library supports filtering feeds by any field in any table. The filter automatically cascades through the dependency graph to ensure referential integrity.
# Filter by agency
filtered_feed = feed.filter({ 'agency' => { 'agency_id' => 'MTA' } })
# Filter by route
filtered_feed = feed.filter({ 'routes' => { 'route_id' => ['1', '2', '3'] } })
# Filter by a service
filtered_feed = feed.filter({ 'calendar' => { 'service_id' => 'WEEKDAY' } })
# Multiple filters
filtered_feed = feed.filter({
  'agency' => { 'agency_id' => 'MTA' },
  'routes' => { 'route_type' => 1 } # Filter to subway routes
})
When you filter by a field, the library automatically:
- Filters the specified table
- Cascades to related tables by following foreign key relationships
- Keeps only the data that maintains referential integrity
For example, filtering by agency_id will automatically filter routes, trips, stop_times, and stops to only include data for that agency.
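The cascade described above can be illustrated with a minimal sketch in plain Ruby. This is not how gtfs_df is implemented (the gem works on Polars DataFrames); the `cascade_filter` helper, the toy feed, and the list of foreign-key edges are all assumptions made up for illustration, and the sketch only propagates the filter downward from parent tables to child tables.

```ruby
# Toy feed: each table is an array of row hashes (illustrative data only).
feed = {
  'agency' => [{ 'agency_id' => 'MTA' }, { 'agency_id' => 'LIRR' }],
  'routes' => [
    { 'route_id' => '1', 'agency_id' => 'MTA' },
    { 'route_id' => 'PW', 'agency_id' => 'LIRR' }
  ],
  'trips' => [
    { 'trip_id' => 'T1', 'route_id' => '1' },
    { 'trip_id' => 'T2', 'route_id' => 'PW' }
  ],
  'stop_times' => [
    { 'trip_id' => 'T1', 'stop_id' => 'A' },
    { 'trip_id' => 'T1', 'stop_id' => 'B' },
    { 'trip_id' => 'T2', 'stop_id' => 'C' }
  ],
  'stops' => [{ 'stop_id' => 'A' }, { 'stop_id' => 'B' }, { 'stop_id' => 'C' }]
}

# Hypothetical foreign-key edges, listed from parent to child:
# [parent table, child table, shared key column].
edges = [
  %w[agency routes agency_id],
  %w[routes trips route_id],
  %w[trips stop_times trip_id],
  %w[stop_times stops stop_id]
]

# Filters one table, then walks the edges in order so that each child table
# keeps only rows whose key still exists in its (already filtered) parent.
def cascade_filter(feed, edges, table, column, values)
  wanted = Array(values)
  result = feed.dup # shallow copy; the original tables are not modified
  result[table] = feed[table].select { |row| wanted.include?(row[column]) }
  edges.each do |parent, child, key|
    allowed = result[parent].map { |row| row[key] }.uniq
    result[child] = result[child].select { |row| allowed.include?(row[key]) }
  end
  result
end

filtered = cascade_filter(feed, edges, 'agency', 'agency_id', 'MTA')
puts filtered['routes'].inspect
puts filtered['stops'].inspect
```

In this toy feed, filtering to the MTA agency leaves route 1, trip T1, its two stop_times rows, and stops A and B, while the original `feed` hash is left untouched.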
By default gtfs_df treats trips as the atomic unit of GTFS. Therefore, if we filter to one stop referenced by TripA, we will preserve all stops referenced by TripA.
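As a rough illustration of treating trips as the atomic unit, the sketch below uses plain Ruby hashes rather than the gem's DataFrames. The `stops_kept_with_whole_trips` helper and the sample rows are hypothetical; they only model the idea that a stop filter first selects trips, and every stop on those trips is then kept.

```ruby
# Illustrative stop_times rows: TripA serves stops 1 and 2, TripB serves stop 3.
stop_times = [
  { 'trip_id' => 'TripA', 'stop_id' => '1' },
  { 'trip_id' => 'TripA', 'stop_id' => '2' },
  { 'trip_id' => 'TripB', 'stop_id' => '3' }
]

# Returns the stop_ids that remain when whole trips are preserved:
# 1. find the trips that call at any of the requested stops,
# 2. keep every stop those trips call at.
def stops_kept_with_whole_trips(stop_times, stop_ids)
  trip_ids = stop_times
             .select { |row| stop_ids.include?(row['stop_id']) }
             .map { |row| row['trip_id'] }
             .uniq
  stop_times
    .select { |row| trip_ids.include?(row['trip_id']) }
    .map { |row| row['stop_id'] }
    .uniq
end

kept = stops_kept_with_whole_trips(stop_times, ['1'])
puts kept.inspect # stop 2 is kept as well, because TripA calls at it
```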
To avoid this behavior, pass filter_only_children: true. In this case, only the children of the filtered table are pruned, and trip integrity is not maintained. In the example below, stop 1 and related stop_times will be pruned.
filtered_feed = feed.filter({ 'stops' => { 'stop_id' => ['1'] } }, filter_only_children: true)
# Write to a new zip file
GtfsDf::Writer.write_to_zip(filtered_feed, 'output/filtered_gtfs.zip')
# Write to a directory
GtfsDf::Writer.write_to_dir(filtered_feed, 'output/filtered_gtfs')
See examples/split-by-agency for a complete example that splits a multi-agency GTFS feed into separate files per agency.
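The general shape of a per-agency split can be sketched using only the `filter` and `write_to_zip` calls shown earlier. This is not the code in examples/split-by-agency. The `split_by_agency` helper, its parameters, and the output file naming are assumptions for illustration, and the list of agency ids is passed in explicitly rather than read from the feed.

```ruby
# Hypothetical helper, not part of gtfs_df.
# feed:       any object responding to #filter, such as a loaded feed
# agency_ids: the agency ids to split out (supplied by the caller)
# out_dir:    directory for the output zip files (assumed to exist)
# writer:     any object responding to .write_to_zip(feed, path)
def split_by_agency(feed, agency_ids, out_dir, writer:)
  agency_ids.map do |agency_id|
    path = File.join(out_dir, "#{agency_id}.zip")
    # Filter the feed down to one agency, then write it to its own file.
    filtered = feed.filter({ 'agency' => { 'agency_id' => agency_id } })
    writer.write_to_zip(filtered, path)
    path
  end
end

# Possible usage with the gem loaded (paths and ids are placeholders):
#   feed = GtfsDf::Reader.load_from_zip('path/to/gtfs.zip')
#   split_by_agency(feed, ['MTA'], 'output', writer: GtfsDf::Writer)
```

Passing the writer as an argument keeps the helper independent of the gem, so it can be tried out with stand-in objects before running it on a real feed.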
This project manages its development environment with Nix, specifically devenv.
After checking out the repo:
- Install devenv: https://devenv.sh/getting-started/
- Enable the environment in one of two ways:
  - Use direnv to enable the environment as soon as you enter the project's path.
  - Enable it only when you need it by running: devenv shell
- Run bin/setup to install the gem dependencies.
Run rake spec to run the tests.
You can also run bin/console for an interactive prompt that will allow you to experiment.
bin/bump-version
- Bumps the version in lib/gtfs_df/version.rb
- Updates the CHANGELOG.md using the git log since the last version
- Creates and pushes a new release branch with those changes
- Creates a PR for that release
bin/create-tag
Creates and pushes the git tag for the release. This triggers the GitHub Actions workflow .github/workflows/publish.yml, which publishes the gem to RubyGems.
Bug reports and pull requests are welcome on GitHub at https://github.com/davidmh/ruby-gtfs_df.
The gem is available as open source under the terms of the MIT License.