Line Server Problem

Motivation

This project is one solution to the Line Server Problem: given a text file with a large number of lines, build a server that can efficiently retrieve a specific line without loading the entire file into memory. To make that possible, the file is preprocessed once before any requests are served.

Raspberry Pi & Cloudflare

If you would like to test the GET /lines/<line_index> endpoint remotely, I have a Raspberry Pi 5 with 16GB of RAM running this project through a Cloudflare tunnel (as long as my Pi is powered on!).

You can find the GET /lines/<line_index> endpoint here:

lsp.run.foo/lines/<line_index>
Please note that you may need to refresh the browser if Cloudflare/Pi are not behaving as expected.

Alternatively, you can run this project locally (see below).

Installation

This project uses Ruby 3.2.2 and Ruby on Rails 8.0.2. To set up the project, follow these steps:

Build

Run the following command to check if you have the required dependencies installed. The build.sh script will check for the Ruby version and install the required gems. It will also run the tests with RSpec to ensure everything is working correctly.

  ./build.sh

Run

You can then run the application using the following command.

  ./run.sh /path/to/file

This will start the Rails server. During initialization, a background job, PreprocessFileJob, preprocesses the file. Once the file is preprocessed, the server is ready to serve requests.

The server is available at http://localhost:3000. You can access the endpoints in Postman or any other API testing tool. Please note that the indexing of the lines begins at 0.

  GET "/lines/<line_index>"

This will return the line at the specified index with the status code 200 if the line exists:

  {
    "line": "1: This is a sample ASCII line for testing purposes.",
    "status": 200
  }

If the line index is out of range, it will return the status code 413 with the following message:

  {
    "error": "Requested line index is outside file bounds"
  }

File Preprocessing

You can use these example files from Google Drive to test the application:

During preprocessing, the application will create a directory called /files_in_bytesize in the root of the project. This directory takes up approximately 1/10 of the size of the original file, whatever the size of that file:

The 10MB file takes approximately 1MB of space in the /files_in_bytesize directory. Processing time is around 250ms.
The 1GB file takes approximately 100MB of space in the /files_in_bytesize directory. Processing time is around 27,500ms (27.5 seconds).
The 10GB file takes approximately 1GB of space in the /files_in_bytesize directory. Processing time is around 250,000ms (250 seconds).

The PreprocessFile service efficiently preprocesses a large .txt file by creating an index of byte offsets for each line without loading the entire file into memory. It stores this index and relevant metadata in a series of binary files within a /files_in_bytesize directory, split into batches of 100_000 offsets per file.

This enables faster and more memory-efficient access to specific lines with the exposed endpoint. The service ensures any previous preprocessing output is removed before starting, handles errors gracefully, and logs the entire process for visibility.
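The indexing pass described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual PreprocessFile code: the helper names build_offset_index and write_batch and the offsets_<n>.bin file naming are assumptions made for the example, and the metadata the real service stores is left out.

```ruby
require "fileutils"

BATCH_SIZE = 100_000 # offsets per index file, as described above

# Hypothetical helper: scans source_path one line at a time and writes the
# byte offset of every line start into Marshal-dumped batch files inside
# index_dir. Returns the number of lines indexed.
def build_offset_index(source_path, index_dir, batch_size: BATCH_SIZE)
  FileUtils.rm_rf(index_dir)  # remove any previous preprocessing output
  FileUtils.mkdir_p(index_dir)

  offsets = []
  batch_number = 0
  total_lines = 0

  File.open(source_path, "rb") do |file|
    until file.eof?
      offsets << file.pos     # byte offset where the next line starts
      file.gets               # advance one line; only this line is held in memory
      total_lines += 1

      next unless offsets.size == batch_size

      write_batch(index_dir, batch_number, offsets)
      batch_number += 1
      offsets = []
    end
  end

  # Flush the final, partially filled batch
  write_batch(index_dir, batch_number, offsets) unless offsets.empty?
  total_lines
end

# Hypothetical helper: the file naming scheme is an assumption
def write_batch(index_dir, batch_number, offsets)
  File.binwrite(File.join(index_dir, "offsets_#{batch_number}.bin"), Marshal.dump(offsets))
end
```

Because only one batch of offsets is held at a time, memory use stays bounded by the batch size rather than by the size of the source file.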

Line Retrieval

The LineRetriever service retrieves a specific line from the preprocessed file using the byte offset index. It reads the relevant binary file containing the index of byte offsets and uses the offset to seek directly to the line in the original file. This approach minimizes memory usage and improves performance by avoiding the need to load the entire file into memory.
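As a rough sketch of the lookup, assuming index files named offsets_<n>.bin that each hold a Marshal-dumped array of up to 100_000 offsets (the file naming and the fetch_line helper are hypothetical, not taken from the project), a line index is split into a batch number and a position inside that batch. For example, line 250_123 lives in batch 2 at position 50_123.

```ruby
BATCH_SIZE = 100_000 # offsets per index file

# Hypothetical helper: returns the line at the 0-based line_index,
# or nil when the index is outside the file bounds.
def fetch_line(source_path, index_dir, line_index, batch_size: BATCH_SIZE)
  return nil if line_index.negative?

  # Which index file holds the offset, and where inside that file
  batch_number, position = line_index.divmod(batch_size)
  batch_path = File.join(index_dir, "offsets_#{batch_number}.bin")
  return nil unless File.exist?(batch_path)

  offsets = Marshal.load(File.binread(batch_path))
  offset = offsets[position]
  return nil if offset.nil?

  File.open(source_path, "rb") do |file|
    file.seek(offset)  # jump straight to the start of the requested line
    file.gets&.chomp   # read just that one line
  end
end
```

Only one index file and one line of the source file are read per request, which is what keeps memory usage low regardless of the size of the original file.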

File modification validation

The LineRetriever service also validates that the original text file has not been modified since it was preprocessed. If it has been modified, an exception is raised.
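One way such a check could work is to record the file's size and modification time at preprocessing and compare them before each read. The sketch below assumes that approach. The names FileModifiedError, file_fingerprint and ensure_unmodified! are hypothetical, and the project may detect changes differently.

```ruby
# Hypothetical error class for a source file that changed after indexing
class FileModifiedError < StandardError; end

# Hypothetical helper: a cheap fingerprint taken at preprocessing time
def file_fingerprint(path)
  stat = File.stat(path)
  { size: stat.size, mtime: stat.mtime }
end

# Hypothetical helper: raises if the file no longer matches the fingerprint
# recorded when it was preprocessed; returns nil otherwise.
def ensure_unmodified!(path, recorded)
  return if file_fingerprint(path) == recorded

  raise FileModifiedError, "#{path} has changed since it was preprocessed"
end
```

Comparing size and mtime is cheap but not airtight. A checksum would catch more cases at the cost of reading the whole file again.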

Marshal

I read this AppSignal blog post on using Marshal.

The PreprocessFile service uses Marshal.dump to serialize the index of byte offsets for each line. I have not worked with Marshal before, so this was a nice learning experience. Perhaps it is not really needed, but I thought it would be a good idea to use it to serialize the index of byte offsets for each line.

Marshal is a Ruby built-in module that provides a way to serialize and deserialize Ruby objects. It is used to convert Ruby objects into a byte stream that can be stored in a file or sent over a network.

The LineRetriever service uses Marshal.load to deserialize the index of byte offsets for each line.
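For illustration, here is a minimal round trip of an offsets array through Marshal. The helper names serialize_offsets and deserialize_offsets are made up for this example and are not the project's.

```ruby
# Hypothetical helper: turns an array of byte offsets into a binary string
def serialize_offsets(offsets)
  Marshal.dump(offsets)
end

# Hypothetical helper: restores the array from the binary string
def deserialize_offsets(bytes)
  Marshal.load(bytes)
end

offsets = [0, 52, 104, 156]
bytes = serialize_offsets(offsets)     # binary String, safe to pass to File.binwrite
restored = deserialize_offsets(bytes)  # equal to the original array
```

Marshal.load should only be given data the application wrote itself, which is the case here since the index files are generated during preprocessing.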

Cache memory store

The application uses Rails.cache to store the relevant bytesize (index) file. If the endpoint is called with a line index that falls within a batch of 100,000 offsets that has already been loaded, the application does not need to read that index file again. It checks the cache first, and only if the file is not in the cache does it read the file from disk and store it in the cache.
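The read-through behaviour can be sketched in plain Ruby. The BatchCache class below is only an in-memory stand-in that mimics the block form of Rails.cache.fetch so the example runs without Rails. The cache key format and the offsets_for helper are assumptions for illustration.

```ruby
BATCH_SIZE = 100_000 # offsets per index file

# Minimal stand-in for the block form of Rails.cache.fetch:
# returns the cached value, or runs the block and stores its result.
class BatchCache
  def initialize
    @store = {}
  end

  def fetch(key)
    return @store[key] if @store.key?(key)

    @store[key] = yield
  end
end

# Hypothetical helper: returns the offsets batch covering line_index,
# calling the given loader block only on a cache miss.
def offsets_for(line_index, cache, &load_batch)
  batch_number = line_index / BATCH_SIZE
  cache.fetch("offsets/#{batch_number}") { load_batch.call(batch_number) }
end

cache = BatchCache.new
disk_reads = 0
loader = lambda do |batch_number|
  disk_reads += 1   # stands in for reading and unmarshalling an index file
  [batch_number]
end

offsets_for(5, cache, &loader)        # miss: loads batch 0
offsets_for(99_999, cache, &loader)   # hit: batch 0 is already cached
offsets_for(100_000, cache, &loader)  # miss: loads batch 1
puts disk_reads                       # prints 2
```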

File comments

There are extensive comments in the code to explain what each part of the code does.

Falcon

The application uses Falcon as the HTTP server. Falcon is a fast, concurrent HTTP server for Ruby that uses fibres instead of threads. I first read about Falcon in Vladimir Dementyev's book Layered Design for Ruby on Rails Applications and thought it would be a good idea to try it out. It is a great alternative to Puma, but I would need to take this implementation further to see how it performs in the real world. For example, we would probably need to use the async-http gem as the HTTP client for it to work well with Falcon.

Logging

The application uses Rails.logger to log the requests and responses. The logs are stored in the log/development.log file. You can check the logs in the terminal and see the requests and responses being processed by the server.

Testing

The project uses RSpec for testing. You can run the tests using the following command:

  rspec

The tests will also be run automatically when you run the build.sh script. The tests are located in the spec directory and cover the main functionality of the application.

Endpoints

The application provides the following endpoints:

GET /lines/<line_index>: Returns the line at the specified index. The index param is required and should be a valid integer. The server will return the line if it exists, otherwise it will return a 413 status code with an error message.
This endpoint is fast and responds in around 15-20ms.

Grape API is used to define the endpoints and handle the requests. The API is defined in the app/api/v1/lines.rb file.

Libraries

rails: Rails 8 web framework used for building the application.
redis: The Redis client for Ruby, used for queue management.
rspec: The testing framework used for writing and running tests.
falcon: The fibre HTTP server used for serving the application (this replaces puma threads).
dotenv: The gem used for loading environment variables from a .env file.
grape: The API framework used for building the RESTful API.

A note on Bash and GPT 4o

I used bash to write the build.sh and run.sh scripts. I have not worked with bash before, so this was a nice learning experience. Disclaimer: I used GPT 4o to help me write the scripts.

Critique and notes on this exercise

Firstly, this was a really fun project to work on. I learned a lot more about Ruby, Marshal, and File I/O. I also learned a lot about how to use Rails.cache and Falcon.

I am not sure if the PreprocessFile service is the best way to preprocess the file or if there is a better way to do it. One alternative approach could be to use a database to store the lines and their offsets. This would be much more efficient but a lot less interesting.

I think the current approach is a good balance between performance and simplicity. And it certainly gives us something to discuss...

