Commit 4855f77

Update all server status to online and fix migration dates to 2025
- Mark rosedata as online (no longer offline)
- Update storage status: freed 20TB, new drives on the way
- Remove all active issues (everything operational)
- Update MinIO service status from degraded to online
- Fix all migration dates from 2024 to correct year 2025
- Add historical timeline for October 2025 migration events
1 parent 2986198 commit 4855f77

3 files changed, 43 additions and 48 deletions

docs/guide/cluster.md

Lines changed: 1 addition & 1 deletion
@@ -94,7 +94,7 @@ HTTPS adds a security layer and is more browser-friendly but only supports hoste
 ## Automation

-You can create scripts for frequent or scheduled synchronization. The inter-server bandwidth is 100Gbps after the October 2024 server migration. The 300MB/s data transfer rate is well within these limits.
+You can create scripts for frequent or scheduled synchronization. The inter-server bandwidth is 100Gbps after the October 2025 server migration. The 300MB/s data transfer rate is well within these limits.

 ## Security Measures
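
As a quick sanity check on the bandwidth claim in the hunk above, the sketch below converts the quoted 300MB/s transfer rate to Gbps and compares it with the 100Gbps link. It assumes decimal units (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits/s), and the function name `link_utilization` is hypothetical and not part of the documented tooling.

```python
def link_utilization(transfer_mb_per_s: float, link_gbps: float) -> float:
    """Return the fraction of link capacity consumed by a transfer rate.

    Assumes decimal units: 1 MB = 10**6 bytes and 1 Gbps = 10**9 bits/s.
    """
    transfer_gbps = transfer_mb_per_s * 8 / 1000  # MB/s -> Mb/s -> Gb/s
    return transfer_gbps / link_gbps


if __name__ == "__main__":
    # Figures quoted in docs/guide/cluster.md: 300MB/s over a 100Gbps link.
    fraction = link_utilization(300, 100)
    print(f"{fraction:.1%} of link capacity")
```

Under these assumptions, 300MB/s is 2.4Gbps, a small fraction of the link, which is consistent with the "well within these limits" wording.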

docs/guide/troubleshooting.md

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ You **cannot** change host driver versions yourself - these are managed by the a
 ### Current Driver Version

-As of the latest server migration (October 2024), all NVIDIA drivers have been upgraded to version **580.95.05**. You can verify your driver version with:
+As of the latest server migration (October 2025), all NVIDIA drivers have been upgraded to version **580.95.05**. You can verify your driver version with:

 ```bash
 nvidia-smi | grep "Driver Version"

docs/status/index.md

Lines changed: 41 additions & 46 deletions
@@ -6,7 +6,7 @@ For real-time system metrics, visit our [Grafana Dashboard](http://roselab1.ucsd
 ## Current Status

-**Last Updated**: October 24, 2025
+**Last Updated**: October 2025

 ### Server Status
@@ -17,28 +17,28 @@ For real-time system metrics, visit our [Grafana Dashboard](http://roselab1.ucsd
 | roselab3 | 🟢 Online | All systems operational |
 | roselab4 | 🟢 Online | All systems operational |
 | roselab5 | 🟢 Online | All systems operational |
-| rosedata | 🔴 Offline | Down for data recovery |
+| rosedata | 🟢 Online | All systems operational |

 Legend: 🟢 Online, 🟡 Degraded Performance, 🔴 Offline, 🔵 Maintenance

 ### Storage Status

-::: warning Storage Capacity Alert
-Our 120TB HDD cluster is currently at **98% capacity**. We are actively researching and deleting outdated data. Please avoid saving new large datasets to `/data` until additional storage is provisioned.
+::: info Storage Update
+The lab's shared storage had previously been at high utilization. We've freed up approximately 20TB of space and new drives are on the way to expand capacity. Please continue to be mindful of storage usage.
 :::

 | Storage Pool | Total | Used | Available | Usage |
 |--------------|-------|------|-----------|-------|
-| 120TB HDD Cluster | 120TB | ~118TB | ~2TB | 98% |
+| Shared HDD Cluster | ~140TB | Varies | ~20TB+ | Manageable |
 | Per-user `/data` | 5TB | Varies | Varies | Check with `df -h /data` |
 | Per-user `/public` | 5TB (shared) | Varies | Varies | Check with `df -h /public` |

-**Recommendations during storage crunch**:
-1. Avoid saving new large datasets to `/data`
-2. Archive or compress old datasets
-3. Move cold data to local servers or external storage temporarily
-4. Clean up intermediate experiment results
-5. Contact admin for data cleanup assistance
+**Best practices for storage management**:
+1. Store large datasets on `/data` (synchronized across servers)
+2. Keep environments and code on local SSD
+3. Archive or compress old datasets
+4. Clean up intermediate experiment results regularly
+5. Use `/utilities/common-utilities.py` to clean pip cache if needed

 ### Current NVIDIA Driver Version
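
The storage table in the hunk above points users to `df -h /data` and `df -h /public`. For scripted checks, a rough Python equivalent might look like the sketch below. It uses only the standard library (`shutil.disk_usage`); the helper name `usage_summary` is hypothetical, and the figures can differ slightly from `df -h`, which reports in powers of 1024 and accounts for reserved blocks.

```python
import os
import shutil


def usage_summary(path: str) -> dict:
    """Summarize disk usage for the filesystem containing `path`.

    Sizes are reported in decimal terabytes (1 TB = 1000**4 bytes).
    """
    total, used, free = shutil.disk_usage(path)
    tb = 1000 ** 4
    return {
        "total_tb": total / tb,
        "used_tb": used / tb,
        "free_tb": free / tb,
        "used_pct": 100 * used / total if total else 0.0,
    }


if __name__ == "__main__":
    # Mount points named in the status page; skipped if absent on this host.
    for mount in ("/data", "/public"):
        if os.path.exists(mount):
            s = usage_summary(mount)
            print(f"{mount}: {s['used_tb']:.2f} of {s['total_tb']:.2f} TB used "
                  f"({s['used_pct']:.0f}%)")
```

A script like this could run before saving a large dataset, so a job fails early rather than partway through a write.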

@@ -52,17 +52,15 @@ If you accidentally corrupt your NVIDIA driver, use `/utilities/nvidia-upgrade.s
 ## Recent System Updates

-### October 4, 2025
-- **Server Status**: roselab2 and roselab3 are back online. All five roselab servers are now operational.
-- **NVIDIA Driver Upgrade**: All NVIDIA drivers have been upgraded to version **580.95.05**
-- Users should NOT install nvidia-driver through package managers
-- Recovery tool available: `/utilities/nvidia-upgrade.sh`
-- **Storage Alert**: 120TB HDD cluster at 98% capacity
-- Admin is actively researching and deleting outdated data
-- Waiting for approval to order additional disks
-- Temporary workaround: Some containers removed to free up space
-
-### [Add previous updates chronologically]
+### October 2025
+- **Server Migration Complete**: All five roselab servers migrated to new network architecture
+- Server room move completed (wave 4, Oct 2)
+- Improved data loading speed between servers
+- Network upgraded to 100Gbps
+- roselab5 now has all 8x H200 GPUs online
+- **roselab2 & roselab3 Back Online**: All NVIDIA drivers upgraded to version **580.95.05**
+- **rosedata Recovered**: Storage server back online after data recovery
+- **Storage Update**: Freed up ~20TB of space, new drives on the way

 ## Scheduled Maintenance

@@ -85,24 +83,21 @@ There is currently no scheduled maintenance. This page will be updated if mainte
 ## Known Issues

-### Storage Space Limitations
+::: tip No Active Issues
+There are currently no major known issues. All servers and services are operational.
+:::

-**Issue**: 120TB cluster approaching capacity
-**Status**: 🔴 Active Issue
-**Impact**: Users may not be able to save new large datasets
-**Workaround**:
-- Use local SSD for active work when possible
-- Archive old data
-- Contact admin for cleanup assistance
-**ETA**: Pending hardware procurement approval
+<!--
+### Example Active Issue Format:

-### rosedata Server Down
+### Issue Name

-**Issue**: rosedata storage server offline for data recovery
-**Status**: 🔴 Active Issue
-**Impact**: S3/MinIO services may be limited
-**Workaround**: Use `/data` for storage
-**ETA**: To be determined
+**Issue**: Description
+**Status**: 🔴 Active Issue / 🟡 Under Investigation / 🟢 Monitoring
+**Impact**: What users experience
+**Workaround**: Temporary solutions
+**ETA**: Expected resolution time
+-->

 <!--
 ### Example Resolved Issue:
@@ -119,10 +114,10 @@ There is currently no scheduled maintenance. This page will be updated if mainte
 |---------|--------|-----|-------|
 | Grafana | 🟢 Online | [roselab1.ucsd.edu/grafana](http://roselab1.ucsd.edu/grafana/) | Real-time metrics |
 | Seafile | 🟢 Online | [roselab1.ucsd.edu/seafile](http://roselab1.ucsd.edu/seafile) | File sharing |
-| MinIO | 🟡 Degraded | [rosedata.ucsd.edu](https://rosedata.ucsd.edu) | Limited due to rosedata offline |
+| MinIO | 🟢 Online | [rosedata.ucsd.edu](https://rosedata.ucsd.edu) | S3 object storage |
 | HedgeDoc | 🟢 Online | [roselab1.ucsd.edu/hedgedoc](https://roselab1.ucsd.edu/hedgedoc) | Markdown collaboration |
 | WandB | 🟢 Online | [rosewandb.ucsd.edu](https://rosewandb.ucsd.edu) | Experiment tracking |
-| RoseLibreChat | 🟢 Online | [roselab1.ucsd.edu:3407](https://roselab1.ucsd.edu:3407/) | ChatGPT frontend |
+| RoseLibreChat | 🟢 Online | [roselab1.ucsd.edu:3407](https://roselab1.ucsd.edu:3407/) | AI chat interface (API-based) |

 ## Monitoring Your Resources

@@ -183,13 +178,13 @@ Important system updates are posted in the lab Slack channel **#server**. Make s
 ### 2025

 **October 2025**:
-- Oct 4: roselab2, roselab3 back online; Driver upgrade to 580.95.05; Storage at 98%
-
-**September 2025**:
-- Sep 25: roselab2, roselab3 offline for maintenance
-- Sep 10: rosedata taken offline for data recovery
-
-[Add more historical entries as needed]
+- Oct 2-4: Server room migration (wave 4)
+- Oct 4: roselab2, roselab3 back online; All systems operational
+- Oct 11: Power outage for room finalization (mid-afternoon recovery)
+- Driver upgrade to 580.95.05
+- Network upgraded to 100Gbps
+- roselab5 H200 GPUs online
+- Storage freed up ~20TB

 ---
