Skip to content

Minimal FastAPI project to test storing and searching image vector data using the Lance format on Cloudflare R2.

License

Notifications You must be signed in to change notification settings

gordonmurray/fastapi-lance-r2-mvp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MVP Test Plan for Lance-Based Vector Store

This project explores using the Lance file format to store and search image vector data in a minimal FastAPI-based image ingestion pipeline. All data is stored on Cloudflare R2.

Goals

  • Validate Lance for storing image metadata and vector embeddings
  • Understand file growth and structure over time
  • Assess search capability, latency, and performance
  • Identify limitations in deletion or update workflows

Checklist

✅ Core Functionality

  • Store a single image with its vector embedding in a Lance dataset
  • Verify vector is correctly stored and retrievable
  • Observe which files/folders are created in the Lance directory (e.g., data/, index/, manifest/)

📈 Appending Data

  • Add 5–10 more image+vector entries to the same Lance dataset
  • Confirm that appends succeed without data corruption
  • Observe which files are modified or appended
  • Note any increase in latency for reads or writes

❌ Deleting Data

  • Attempt to logically delete one or more entries from the dataset
  • Verify whether deleted entries still appear in searches
  • Inspect the resulting files for deletion vector metadata

🔍 Vector Search

  • Convert a text prompt (e.g. "blue car") to a vector using the same CLIP/BLIP model
  • Run a similarity search against the dataset
  • Confirm that expected results appear and are ranked appropriately
  • Measure average search latency without any index
  • Create an index (ivf_pq or similar) and measure search latency again
  • Compare search accuracy before and after indexing

🚀 Storage Performance

  • Store the Lance dataset in Cloudflare R2
  • Measure time to:
    • Append a record
    • Run a vector search
  • Compare to:
    • Local disk
    • AWS S3
    • S3 Express One Zone

🔁 Concurrency & Durability

  • Simulate concurrent uploads (2–3 at once)
  • Confirm that no data loss or corruption occurs
  • Try uploading a record while another is being queried

📦 Partial Access Behavior

  • Attempt to retrieve only the vector column or only the metadata
  • Measure how many bytes are downloaded (R2 / S3)
  • Confirm expected columnar behavior (partial reads instead of full dataset)

Future Experiments

  • Explore time travel/versioning capabilities of Lance
  • Build a simple UI to test image search UX (using SvelteKit)
  • Wrap Lance search into a FastAPI endpoint with caching

Testing

The files and data will end up in a direct structure as follows:

r2-bucket-name/
├── images/
│   └── {sha256}.{ext}               # Original uploaded image (e.g. jpg, png)
├── images.lance/
│   ├── _transactions/
│   │   └── *.txn                    # Transaction logs (append-only)
│   ├── _versions/
│   │   └── *.manifest               # Manifest files for each version
│   └── data/
│       └── *.lance                  # Binary fragments containing vector rows
  • Each uploaded image is saved under images/{sha256}.{ext}, ensuring no duplicates.
  • The images.lance/ directory is a Lance dataset storing all vector metadata and is append-only.
  • Lance maintains internal versioning and transactional integrity via _transactions/ and _versions/.

Single image test

curl -X POST \
  -F "file=@your_image.jpg" \
  https://fastapi-lance-r2-mvp.fly.dev/vectorize_and_store

Multiple image test

#!/bin/bash

ENDPOINT="https://fastapi-lance-r2-mvp.fly.dev/vectorize_and_store"

for file in *.jpg; do
  if [[ -f "$file" ]]; then
    echo "Uploading: $file"

    time_start=$(date +%s%3N)

    response=$(curl -s -w "\nHTTP_STATUS:%{http_code}" -F "file=@$file" "$ENDPOINT")

    time_end=$(date +%s%3N)
    duration=$((time_end - time_start))

    http_status=$(echo "$response" | grep HTTP_STATUS | cut -d':' -f2)
    body=$(echo "$response" | sed '/HTTP_STATUS/d')

    echo "Status: $http_status"
    echo "Response: $body"
    echo "Time taken: ${duration} ms"
    echo "-----------------------------"
  fi
done

Search results

Perform a query string search as follows:

https://fastapi-lance-r2-mvp.fly.dev/search?text=golf%20ball

{
  "query": "golf ball",
  "results": [
    {
      "id": "images/b82883249aa34e702b733a4bbeb5e4b22c32422ad4b605c631b606d6298d2691.jpg",
      "path": "s3://fastapi-lance-r2-mvp/images/b82883249aa34e702b733a4bbeb5e4b22c32422ad4b605c631b606d6298d2691.jpg",
      "_distance": 1.4301637411117554
    }
  ]
}

Timing Search results

Performing searches using Curl

curl -o /dev/null -s -w "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TTFB: %{time_starttransfer}s | Total: %{time_total}s\n" "https://fastapi-lance-r2-mvp.fly.dev/search?text=drink"

Geting back response times such as:

DNS: 0.001157s | Connect: 0.021120s | TTFB: 1.142971s | Total: 1.143055s

About

Minimal FastAPI project to test storing and searching image vector data using the Lance format on Cloudflare R2.

Topics

Resources

License

Stars

Watchers

Forks