Commit 42aea1e

SenseNet Index Rebuilder Console Application (#2221)
* Add SnIndexRebuilder console application

  - Implements standalone SenseNet index rebuilder console application
  - Uses IsOuterSearchEngineEnabled special working mode to prevent processing old indexing activities
  - Clears legacy indexing activities from database before rebuild
  - Successfully rebuilds index from scratch using ClearAndPopulateAllAsync
  - Provides progress monitoring and error handling
  - Tested successfully: indexed 62,685 nodes in ~28 minutes
  - No core SenseNet modifications required - uses existing infrastructure only

  Features:
  - Service registration using AddSenseNet pattern from integration tests
  - Proper repository startup with indexing disabled during initialization
  - Automatic cleanup of old IndexingActivities table entries
  - Clean index rebuild from current database state
  - Comprehensive logging and progress tracking

  This approach solves the issue of old indexing activities being processed during normal repository startup, enabling truly clean index rebuilds.

* feat: enhance index rebuilder with dual ETA estimation and comprehensive progress tracking

  Features:
  - Add dual ETA display showing both average and worst-case time estimates
  - Implement IndexingProgressTracker class with advanced progress monitoring
  - Add comprehensive CLI argument parsing (--clear-activities, --help)
  - Add Serilog integration for dual console+file logging
  - Implement two rebuild approaches:
    1. Clean rebuild without clearing activities (default)
    2. Complete clean rebuild with activities table clearing (--clear-activities)
  - Add phantom activities issue resolution with TRUNCATE + DBCC CHECKIDENT
  - Add index directory clearing to remove cached LastActivityId
  - Enhanced error handling and user feedback
  - Professional progress display with total node counts and completion times

  Technical improvements:
  - Real-time progress updates every 100 nodes or 5 seconds
  - Worst-case scenario tracking using maximum time per node
  - Convergent ETA estimation as process stabilizes
  - Structured logging for troubleshooting and monitoring
  - Complete SQL identity seed management
  - Comprehensive help documentation

* Fix: Add missing using directives for Queue and LINQ

  - Add System.Collections.Generic for Queue<double>
  - Add System.Linq for Average() extension method
  - Complete conservative ETA estimation implementation

* Refactor: Eliminate code duplication between clear/non-clear index rebuild paths

  - Extract common functionality into helper methods:
    - ClearIndexingActivitiesAsync() for database cleanup
    - ClearIndexDirectoryAsync() for file system cleanup
    - PerformIndexRebuildAsync() for shared rebuild logic
  - Consolidate duplicate progress tracking, error handling, and populator setup
  - Reduce code duplication from ~150 lines to ~20 lines of shared logic
  - Maintain identical functionality for both --clear-activities and default modes
  - Fix Task ambiguity by using fully qualified System.Threading.Tasks.Task

* Fix --clear-activities mode and eliminate code duplication

  - Refactored Program.cs to eliminate 143 lines of duplicate code
  - Extracted helper methods: ClearIndexingActivitiesAsync, ClearIndexDirectoryAsync, PerformIndexRebuildAsync
  - Fixed --clear-activities mode by letting ClearAndPopulateAllAsync handle indexing engine startup internally
  - Resolved Lucene29 compatibility issue with explicit indexing engine startup
  - Updated comprehensive documentation with latest behavioral findings
  - Confirmed identity reset behavior: --clear-activities resets to ID 1, default mode preserves counter
  - Both modes now work correctly with streamlined codebase
1 parent baf2bff commit 42aea1e

File tree

7 files changed

+810
-3
lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -263,3 +263,5 @@ paket-files/
 **/App_Data/IndexBackup
 **/App_Data/Logs
 /src/WebApps/**/install-services-core
+.github/*.md
+.vscode/*

docs/index-rebuilder-console.md

Lines changed: 308 additions & 0 deletions
@@ -0,0 +1,308 @@
# SenseNet Index Rebuilder Console Application

The **SenseNet Index Rebuilder** is a standalone console application designed for rebuilding search indexes in sensenet environments. This tool provides advanced progress tracking, dual time estimation, and comprehensive logging capabilities for better operational visibility during index rebuild operations.

## Overview

The Index Rebuilder console application offers a professional alternative to traditional index rebuilding methods with enhanced features:

- **Real-time progress tracking** with percentage completion and node counts
- **Dual ETA estimation** showing both average and worst-case completion times
- **Professional logging** with Serilog integration for monitoring and troubleshooting
- **Phantom activities resolution** to eliminate ghost indexing activities
- **Two rebuild approaches** for different operational needs

## Getting Started

### Prerequisites

- .NET runtime compatible with sensenet
- Valid sensenet database connection
- Write access to the index directory
- Proper connection string configuration

### Configuration

Before running the tool, ensure you have a valid connection string configured in one of these ways:

1. **appsettings.json** file in the application directory
2. **User secrets** for development environments
3. **Environment variables** for containerized deployments

Example appsettings.json:
```json
{
  "ConnectionStrings": {
    "SnCrMsSql": "Data Source=localhost;Initial Catalog=sensenet;User ID=sa;Password=yourpassword;TrustServerCertificate=True"
  }
}
```
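
For the environment-variable option, a minimal sketch follows. It assumes the tool builds its configuration with the standard .NET environment-variable provider, which this document does not confirm. Under that provider a double underscore in the variable name stands for the `:` section separator, so `ConnectionStrings__SnCrMsSql` would supply `ConnectionStrings:SnCrMsSql`. The server, database, and credentials are the same placeholders as in the JSON example.

```bash
# Assumption: the rebuilder reads environment variables through the standard
# .NET configuration provider, where "__" stands for the ":" section separator.
export ConnectionStrings__SnCrMsSql='Data Source=localhost;Initial Catalog=sensenet;User ID=sa;Password=yourpassword;TrustServerCertificate=True'

# The variable is inherited by processes started from this shell, for example:
#   (cd src/Tools/SnIndexRebuilder && dotnet run)

# Show only the server and catalog parts, keeping the credentials off screen.
printenv ConnectionStrings__SnCrMsSql | cut -d';' -f1-2
```

Single quotes keep the shell from interpreting the semicolons or any special characters in the password.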

### Running the Application

Navigate to the application directory and execute:

```bash
cd src/Tools/SnIndexRebuilder
dotnet run [options]
```

## Command Line Arguments

### Available Options

| Option | Description |
|--------|-------------|
| `--clear-activities` | Performs complete clean rebuild with manual IndexingActivities table pre-clearing |
| `--help`, `-h` | Shows comprehensive help documentation |
| *(no arguments)* | **Default**: Clean rebuild with automatic IndexingActivities clearing after rebuild |

### Examples

**Standard Clean Rebuild (Recommended)**
```bash
dotnet run
```
This is the default and recommended approach for most scenarios. It performs a clean index rebuild and clears the IndexingActivities table after the rebuild completes (via ClearAndPopulateAllAsync).

**Complete Clean Rebuild**
```bash
dotnet run --clear-activities
```
Use this approach when experiencing phantom activities issues or when a completely fresh start is required. This will:
- Clear the IndexingActivities table using TRUNCATE + DBCC CHECKIDENT
- Remove cached index directories
- Perform a full rebuild from scratch

**Show Help Documentation**
```bash
dotnet run --help
```

## Rebuild Approaches

### Approach 1: Clean Rebuild (Default)

This is the **recommended approach** for most scenarios:

- Uses `ClearAndPopulateAllAsync` with controlled indexing engine startup
- Prevents phantom activities by disabling indexing during repository startup
- **Automatically clears IndexingActivities table** after the rebuild completes
- Safer option: it does not truncate the table or reset its identity counter before the rebuild
- Suitable for regular maintenance and updates

**IndexingActivities Behavior:**
- The table is **automatically cleared** after the rebuild via `ClearAndPopulateAllAsync`
- This happens because all content has been freshly indexed, making pending activities obsolete
- Results in an empty IndexingActivities table at completion

**When to use:**
- Regular index maintenance
- After content updates or schema changes
- When you want a clean rebuild without pre-clearing activities

### Approach 2: Complete Clean Rebuild (`--clear-activities`)

This approach provides a **complete reset** of the indexing system:

- **Pre-clears** the IndexingActivities table using TRUNCATE + DBCC CHECKIDENT
- Clears the index directory to remove cached data
- Eliminates phantom activities (ghost indexing entries) before the rebuild starts
- **Also clears IndexingActivities again** after the rebuild via `ClearAndPopulateAllAsync`
- Provides the cleanest possible rebuild with double cleanup

**IndexingActivities Behavior:**
- Table is cleared **twice**: once at the start (manual), once at the end (automatic)
- Manual clearing uses TRUNCATE + identity reset for complete cleanup
- Automatic clearing happens via `ClearAndPopulateAllAsync`, as in default mode
- Results in an empty IndexingActivities table at completion (same as default mode)

**When to use:**
- Experiencing phantom activities issues (for example, 1M+ ghost activities)
- After major system migrations or upgrades
- When troubleshooting unexplained indexing behavior
- For establishing a known clean baseline with pre-cleanup
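
The manual pre-clearing step can be pictured as the T-SQL below, wrapped in a shell function so it can be reviewed before anything is sent to the server. This is a sketch, not the tool's actual statements: the table name `IndexingActivities` comes from this document, while the `dbo` schema, the reseed value, and the `sqlcmd` invocation in the comment are assumptions.

```bash
# Prints the T-SQL for a manual IndexingActivities reset.
# Assumptions: the table lives in the dbo schema; a reseed value of 1 is used
# so that the first row inserted after the TRUNCATE gets ID 1.
clear_activities_sql() {
  cat <<'SQL'
TRUNCATE TABLE dbo.IndexingActivities;
DBCC CHECKIDENT ('dbo.IndexingActivities', RESEED, 1);
SQL
}

# Review the statements first.
clear_activities_sql

# To execute them (hypothetical server and database names):
#   sqlcmd -S localhost -d sensenet -Q "$(clear_activities_sql)"
```

TRUNCATE on its own already returns the identity to the table's original seed; the explicit reseed only makes the starting value unambiguous.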

## Progress Tracking Features

### Real-time Progress Display

The application provides comprehensive progress information:

```
Indexed 25,500 / 62,681 nodes (40.7%) - Elapsed: 00:05:36 - ETA: 00:08:10 (avg) / 00:11:26 (worst)
```

**Progress Information Includes:**
- **Current/Total Nodes**: Number of nodes processed and total count
- **Percentage**: Completion percentage with decimal precision
- **Elapsed Time**: Time spent so far in HH:MM:SS format
- **Dual ETA**: Both average and worst-case time estimates

### Dual ETA Estimation

The tool calculates two time estimates:

- **Average ETA**: Based on current average processing speed
- **Worst-Case ETA**: Based on the maximum time per node encountered

As the process progresses, both estimates typically converge, providing increasingly accurate completion time predictions.
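
As a worked example of the two estimates, the sketch below applies the plain formulas "elapsed / done x remaining" and "max seconds per node x remaining". The function names are hypothetical and the tool's real calculation may differ (it reportedly also uses a sliding window and a safety multiplier), but the plain average happens to reproduce the `00:08:10` shown in the sample progress line above when rounded to the nearest second.

```bash
# fmt_hms SECONDS: round to the nearest second and print as HH:MM:SS.
fmt_hms() {
  awk -v s="$1" 'BEGIN {
    s = int(s + 0.5)
    printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
  }'
}

# avg_eta DONE TOTAL ELAPSED_SECONDS: remaining nodes times the average
# seconds per node so far.
avg_eta() {
  awk -v d="$1" -v t="$2" -v e="$3" 'BEGIN { print (e / d) * (t - d) }'
}

# worst_eta DONE TOTAL MAX_SECONDS_PER_NODE: remaining nodes times the
# slowest per-node time seen so far.
worst_eta() {
  awk -v d="$1" -v t="$2" -v m="$3" 'BEGIN { print m * (t - d) }'
}

# 25,500 of 62,681 nodes after 00:05:36 (336 seconds):
fmt_hms "$(avg_eta 25500 62681 336)"      # 00:08:10
# With a hypothetical slowest node of 0.02 s:
fmt_hms "$(worst_eta 25500 62681 0.02)"   # 00:12:24
```

The worst-case figure is always at least as large as the average one, which is why it is the safer number for planning.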

### ETA Fluctuation Behavior

**Why ETAs fluctuate:**
- The system uses a **sliding window of the last 1,000 nodes** for recent performance calculation
- A **safety multiplier of 1.8** (80% margin) is applied for conservative estimates
- Content complexity varies significantly (simple folders vs. documents with large binaries)
- Text extraction and binary processing create processing time spikes

**Normal fluctuation patterns:**
- ETAs may swing from ~10 minutes to ~50+ minutes and back
- Large jumps often occur when processing complex documents
- Estimates stabilize as more content is processed
- This is **mathematically correct behavior**, not a bug

**Understanding the algorithm:**
- Uses dual calculation: recent average vs. overall average
- Recent performance is prioritized when sufficient data is available
- Conservative estimates prevent disappointing users with overly optimistic times
- Fluctuations reflect real content complexity variations in the repository
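
The sliding-window idea can be sketched as follows. This is an illustration of the technique described above, not the tool's `IndexingProgressTracker` code: the function name, the input format (one per-node duration in seconds per line), and the way the window and multiplier are combined are all assumptions. Only the defaults of 1,000 nodes and 1.8 come from this document.

```bash
# conservative_eta REMAINING_NODES [WINDOW] [MULTIPLIER]
# Reads one per-node duration (in seconds) per line on stdin and prints the
# estimated remaining seconds based on the most recent WINDOW samples.
conservative_eta() {
  awk -v remaining="$1" -v window="${2:-1000}" -v mult="${3:-1.8}" '
    {
      slot = (NR - 1) % window            # ring-buffer position for this sample
      if (NR > window) sum -= buf[slot]   # drop the sample leaving the window
      buf[slot] = $1
      sum += $1
    }
    END {
      n = (NR < window) ? NR : window     # samples currently in the window
      if (n == 0) { print "unknown"; exit }
      printf "%.1f\n", (sum / n) * mult * remaining
    }'
}

# Three fast nodes followed by two slow ones, window of 2, 10 nodes left:
printf '%s\n' 1 1 1 3 3 | conservative_eta 10 2 1.8   # 54.0
```

Because only the latest samples count, a short run of slow nodes moves the estimate sharply (here from 18.0 to 54.0 seconds), which is the kind of swing described above.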

### Update Frequency

Progress updates occur:
- Every **100 nodes** processed
- Every **5 seconds** (whichever comes first)

This provides regular feedback without overwhelming the console output.
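
The "whichever comes first" rule amounts to a simple either-or check. The sketch below uses a hypothetical helper name and takes the counters as arguments; how the tool tracks them internally is not shown in this document.

```bash
# should_report NODES_SINCE_LAST_UPDATE SECONDS_SINCE_LAST_UPDATE
# Succeeds (exit status 0) when a progress line is due.
should_report() {
  local nodes_since=$1 secs_since=$2
  [ "$nodes_since" -ge 100 ] || [ "$secs_since" -ge 5 ]
}

if should_report 42 6; then
  echo "update due"   # 6 seconds have passed, even though only 42 nodes were done
fi
```

On fast content the node threshold triggers the update; on slow content (large binaries, text extraction) the timer does.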

## Logging and Monitoring

### Serilog Integration

The application uses Serilog for professional logging with:

- **Console output**: Real-time progress and status messages
- **File logging**: Persistent logs for audit and troubleshooting
- **Structured logging**: JSON-formatted logs for monitoring systems

### Error Handling

Comprehensive error handling includes:

- **Database connection issues**: Graceful handling with clear error messages
- **Index directory problems**: Automatic fallback and warnings
- **Processing errors**: Individual node errors don't stop the entire process
- **Configuration issues**: Clear guidance for resolution

### Log Locations

- **Console**: Real-time output during execution
- **Log Files**: Check the application directory for generated log files
- **Event Logs**: Integration with Windows Event Log (if configured)
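
To find the most recent log file after a run, something like the helper below may be handy. It assumes the files end in `.log`, which this document does not state, so adjust the pattern to whatever the tool actually writes.

```bash
# latest_log [DIR]: print the most recently modified *.log file in DIR
# (default: current directory). Prints nothing if there is none.
# Assumption: log files use the .log extension.
latest_log() {
  local dir="${1:-.}"
  ls -t "$dir"/*.log 2>/dev/null | head -n 1
}

# Example: follow the newest log in the application directory.
#   tail -f "$(latest_log src/Tools/SnIndexRebuilder)"
```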

## Performance Considerations

### Resource Usage

- **Memory**: Stable memory usage throughout the process
- **CPU**: Moderate CPU usage with efficient processing
- **Disk I/O**: Sequential read/write patterns for optimal performance
- **Database**: Optimized queries with minimal connection overhead

### Large Repositories

For repositories with large node counts:

- Monitor the **worst-case ETA** for realistic time planning
- Consider running during **maintenance windows**
- Ensure adequate **disk space** for index files
- Plan for **extended execution times** (hours for very large repositories)

## Troubleshooting

### Common Issues

**Connection String Problems**
- Verify connection string format and credentials
- Check database server accessibility
- Ensure proper permissions for the database user

**ETA Fluctuation Questions**
- **ETA jumping wildly is normal behavior** - not a bug
- It reflects real content complexity variations (folders vs. large documents)
- The sliding window algorithm with its conservative safety multiplier causes the fluctuations
- Estimates stabilize as more content is processed

**Phantom Activities**
- Use the `--clear-activities` option to pre-clear ghost activities
- Check for previous incomplete rebuild operations
- Note: both modes clear IndexingActivities; the difference is timing

**Performance Issues**
- Monitor system resources during execution
- Check disk space availability
- Consider database performance tuning

### Getting Help

Use the built-in help system:
```bash
dotnet run --help
```

This provides comprehensive documentation about all available options and usage examples.

## Best Practices

1. **Always back up** your repository before running index rebuilds
2. **Stop the web application** before running the rebuilder to avoid conflicts
3. **Use the default mode** unless experiencing specific phantom activity issues
4. **Monitor progress** using the dual ETA estimates for time planning
5. **Don't be alarmed by ETA fluctuations** - this is normal and mathematically correct
6. **Expect an empty IndexingActivities table** after completion in both modes
7. **Check logs** after completion for any reported issues
8. **Verify search functionality** after the rebuild completes
9. **Plan for extended time** on large repositories (hours for very large content)

## Technical Details

### Code Architecture (Recent Improvements)

The application has been refactored to eliminate code duplication and improve maintainability:

**Refactored Helper Methods:**
- `ClearIndexingActivitiesAsync()`: Handles manual IndexingActivities table cleanup
- `ClearIndexDirectoryAsync()`: Manages index directory cleanup
- `PerformIndexRebuildAsync()`: Common rebuild logic shared between both modes

**Benefits:**
- **Eliminated 143 lines of duplicate code** between execution modes
- **Reduced total codebase** from 442 to 416 lines while maintaining functionality
- **Improved maintainability** through DRY (Don't Repeat Yourself) principle
- **Consistent behavior** between modes for shared operations

**Index Activities Cleanup Clarification:**
- **Both modes result in empty IndexingActivities tables**
- Default mode: Cleanup happens once via `ClearAndPopulateAllAsync`
- Clear-activities mode: Cleanup happens twice (manual + automatic)
- The difference is WHEN cleanup occurs, not WHETHER it occurs

### Dependencies

- **SenseNet.ContentRepository**: Core repository functionality
- **SenseNet.Search.Lucene29**: Lucene search engine integration
- **Serilog**: Professional logging framework
- **Microsoft.Extensions**: Configuration and dependency injection

### Architecture

The application follows clean architecture principles:

- **IndexingProgressTracker**: Handles progress monitoring and ETA calculation with sliding window algorithm
- **Repository Startup**: Controlled initialization with indexing disabled initially
- **Index Population**: Uses sensenet's built-in index population mechanisms
- **Error Handling**: Comprehensive exception management throughout
- **Task Disambiguation**: Fully qualified `System.Threading.Tasks.Task` usage to avoid conflicts

This tool provides a robust, professional solution for sensenet index rebuilding operations with enhanced visibility and control over the process.