MiniDB is a small SQL database implemented from scratch in C.
The goal of this project is to learn how databases work internally: storage engine, B+Tree indexing, query optimization, transaction management, and crash recovery.
SQL Operations
- DDL:
CREATE TABLE,CREATE INDEX - DML:
SELECT,INSERT,UPDATE,DELETE - Joins:
INNER JOINwith multi-table support - Query Modifiers:
WHERE,ORDER BY,LIMIT - Aggregations:
COUNT(),SUM(),AVG(),MAX(),MIN() - Query Analysis:
EXPLAINfor execution plans
Storage Engine
- B+Tree index on primary key with automatic node splitting
- Secondary indexes on any column for fast lookups
- Page-based storage (4 KB pages) with efficient caching
- Multi-table support with schema persistence
- Custom on-disk file format
Transaction Management
- Write-Ahead Log (WAL) for crash recovery
- Automatic checkpoint and log compaction
- Durability guarantees with fsync()
- Transaction statistics tracking
Query Optimization
- Cost-based optimizer (index search vs full scan)
- Secondary index utilization
- Query plan visualization with
EXPLAIN - Performance statistics collection
This is a single binary, file-backed database built for learning database internals.
Requirements: gcc (C99), make, Linux/macOS/WSL.
git clone https://github.com/<your-username>/minidb.git
cd minidb
make
./minidb database.dbThe database file will be created if it doesn't exist.
minidb> create table users (id int primary key, username varchar(32), email varchar(64))
Table 'users' created successfully.
minidb> insert 1 alice alice@example.com
Executed.
minidb> insert 2 bob bob@example.com
Executed.
minidb> insert 3 charlie charlie@example.com
Executed.
-- Point lookup using B+Tree
minidb> select * where id = 2
(2, bob, bob@example.com)
Executed.
-- Update records
minidb> update set email = bob_new@example.com where id = 2
Executed.
-- Delete records
minidb> delete where id = 3
Executed.
minidb> select count(*)
COUNT: 2
Executed.-- Create index for fast username lookups
minidb> create index on users (username)
Created index on users.username
Building index on users.username...
Index built: 2 entries indexed
Executed.
-- Use secondary index (O(log n) instead of O(n))
minidb> select * where username = alice
Using secondary index on username
(1, alice, alice@example.com)
Executed.minidb> select count(*)
COUNT: 2
minidb> select sum(id)
SUM: 3
minidb> select avg(id)
AVG: 1.50
minidb> select max(id)
MAX: 2
minidb> select min(id)
MIN: 1-- Order by descending
minidb> select * order by id desc
(2, bob, bob_new@example.com)
(1, alice, alice@example.com)
Executed.
-- Limit results
minidb> select * order by id desc limit 1
(2, bob, bob_new@example.com)
Executed.minidb> create table orders (id int primary key, user_id int, product varchar(32))
minidb> insert 100 1 laptop
minidb> insert 101 2 mouse
minidb> insert 102 1 keyboard
-- Inner join between users and orders
minidb> select * from users inner join orders on users.id = orders.user_id
(1, alice, alice@example.com, 100, 1, laptop)
(2, bob, bob_new@example.com, 101, 2, mouse)
(1, alice, alice@example.com, 102, 1, keyboard)
Executed.minidb> explain select * where id = 1
=== Query Plan ===
Scan Type: INDEX SEARCH (B+Tree)
Index Used: id (Primary Key)
Estimated Rows: 1
Estimated Cost: 5 (O(log n) - Binary Search)
==================
minidb> explain select * where email = '[email protected]'
=== Query Plan ===
Scan Type: FULL TABLE SCAN
Estimated Rows: ALL
Estimated Cost: 250 (O(n) - Sequential Scan)
Recommendation: Consider creating an index on 'email'
==================.schema -- Show all table schemas
.btree -- Display B+Tree structure
.stats -- Show query execution statistics
.indexes -- List all secondary indexes
.checkpoint -- Force WAL checkpoint
.constants -- Display internal constants
.exit -- Exit databaseSQL Query
↓
┌─────────────────┐
│ Lexer + Parser │ → Tokenize and parse SQL
└─────────────────┘
↓
┌─────────────────┐
│ Query Optimizer │ → Choose index vs full scan
└─────────────────┘
↓
┌─────────────────┐
│ Executor │ → Execute SELECT/INSERT/UPDATE/DELETE/JOIN
└─────────────────┘
↓
┌─────────────────────────────────────┐
│ Storage Layer │
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ B+Tree │ │ Pager │ │ WAL │ │
│ └────────┘ └────────┘ └────────┘ │
└─────────────────────────────────────┘
↓
Disk I/O
Components:
src/main.c– REPL, meta commands, and CLI interfacesrc/parser/– Lexer and recursive descent parser for SQLsrc/storage/– Table, pager, schema, multi-table managersrc/index/– B+Tree (primary) and secondary index structuressrc/transaction/– Write-ahead logging and recoverysrc/optimizer/– Cost estimation and query plan generation
- Structure: Self-balancing tree with data in leaf nodes
- Page Size: 4 KB (4096 bytes)
- Leaf Capacity: 13 cells (key + serialized row)
- Operations: All O(log n) - insert, search, delete, update
- Node Splitting: Automatic when capacity exceeded
Example Tree:
[Root: 10, 20]
/ |
[1,5,8] [10,15] [20,25,30] ← Leaf nodes with data
- Hash-based index mapping column value → primary key
- Binary search within index for O(log m) lookup
- Two-step process: index lookup → B+Tree primary key lookup
- Total cost: O(log m + log n)
Frame Format:
[Header: 24 bytes] [Page Data: 4096 bytes]
- page_number (4 bytes)
- db_size (4 bytes)
- salt1, salt2 (8 bytes)
- checksum1, checksum2 (8 bytes)
Recovery Process:
- On startup, scan WAL file
- Verify checksums for each frame
- Replay valid frames to restore state
- Checkpoint periodically to compact log
Statistics Tracked:
- Full scans vs index searches
- Average rows scanned per query
- Query efficiency ratio
minidb/
├── src/
│ ├── main.c # CLI and REPL
│ ├── parser/
│ │ ├── lexer.c # SQL tokenization
│ │ └── parser.c # AST generation
│ ├── storage/
│ │ ├── table.c # Table operations
│ │ ├── pager.c # Page cache management
│ │ ├── schema.c # Schema persistence
│ │ └── table_manager.c # Multi-table support
│ ├── index/
│ │ ├── btree.c # B+Tree implementation
│ │ └── secondary_index.c # Secondary indexes
│ ├── transaction/
│ │ └── wal.c # Write-ahead logging
│ └── optimizer/
│ └── optimizer.c # Query optimization
├── Makefile
└── README.md
Create test files in tests/ directory:
tests/basic_ops.sql:
create table test (id int primary key, name varchar(32), value varchar(64))
insert 1 test1 value1
insert 2 test2 value2
insert 3 test3 value3
select count(*)
select * order by id desc
update set value = updated where id = 2
delete where id = 3
select count(*)
.exitRun tests:
./minidb test.db < tests/basic_ops.sqlThis project demonstrates understanding of:
- Storage Systems: Custom on-disk layouts, page management, caching
- Data Structures: B+Tree implementation with automatic balancing
- Query Processing: Parsing, optimization, execution
- Transaction Management: WAL, crash recovery, durability
- Systems Programming: Low-level C, file I/O, memory management
It showcases systems programming skills relevant to database, storage, and distributed systems roles.
Possible extensions (not yet implemented):
- LEFT/RIGHT/OUTER JOIN support
- Subqueries and nested SELECT
- GROUP BY and HAVING clauses
- Transactions with BEGIN/COMMIT/ROLLBACK
- Multi-threading for concurrent queries
- Client-server architecture
- B-tree node compression
- Query result caching