# Binary Search Progress Display Manual

## Overview

The binary search feature in db-sync provides real-time progress feedback when searching for records to synchronize. This manual explains how to read and understand the progress output.

## Basic Format

```text
delete binary search 179196 records, search 5000 recent first, need to find 2:
2.8%↑5000 97%↓174197 49%↑87097 24%↑43547 12%↓21775 6.1%↑10886 3.0%↑5442 1.5%↓2722 0.8%↓1362 0.4%↓682 0.2%↓342
```

## Understanding the Output

### Initial Line

- **179196 records**: Total number of records to search through
- **search 5000 recent first**: Will prioritize the most recent 5000 records
- **need to find 2**: Target number of records based on count difference (may find more if there are both additions and deletions)

### Progress Indicators

Each progress indicator shows: `percentage + direction + count`

#### Percentage

- **≥10%**: Shows whole numbers (e.g., `24%`, `49%`)
- **<10%**: Shows one decimal place (e.g., `2.8%`, `6.1%`, `0.4%`)
- Represents what portion of the total dataset this range covers

#### Direction Arrows

- **↑ (up arrow)**: Searching upper portion (newer/recent records)
- **↓ (down arrow)**: Searching lower portion (older records)

#### Count

- Exact number of records in this search range
- For "recent first", the first range will always show the configured recent count (e.g., 5000)
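The three parts of an indicator can be put together in a few lines of Python. This is only an illustrative sketch: db-sync's own formatting code is not shown in this manual, `format_percent` and `format_indicator` are hypothetical names, and the rounding behavior is an assumption that happens to reproduce the sample line above.

```python
def format_percent(count, total):
    """Format count/total as a percentage of the whole dataset."""
    pct = 100.0 * count / total
    # Whole numbers at 10% and above, one decimal place below 10%.
    return f"{pct:.0f}%" if pct >= 10 else f"{pct:.1f}%"


def format_indicator(count, total, newer):
    """Build one indicator: percentage + direction arrow + record count."""
    arrow = "↑" if newer else "↓"
    return f"{format_percent(count, total)}{arrow}{count}"


total = 179196
print(format_indicator(5000, total, newer=True))     # 2.8%↑5000
print(format_indicator(174197, total, newer=False))  # 97%↓174197
print(format_indicator(10886, total, newer=True))    # 6.1%↑10886
```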

## Search Strategy

### Recent First Logic

1. **First search**: `2.8%↑5000` - Searches the most recent 5000 records from the source
2. **Remaining search**: `97%↓174197` - Searches the remaining older records from the source
3. **Subdivisions**: Each large range is split in half recursively

### Binary Division

The algorithm uses binary search to efficiently narrow down ranges:

- Large ranges (>500 records, the default batch size) are split in half
- Small ranges (≤500 records) are processed directly
- Each split adds an indicator for the half searched next, so successive counts roughly halve (e.g., `49%↑87097` followed by `24%↑43547`)
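As a rough illustration of the halving, the Python sketch below follows one half of a range at every split until the range fits in a single batch. `halving_sizes` is a hypothetical helper, not part of db-sync, and it assumes exact positional halves. The counts in the sample output differ from these by a record or two, so the real ranges are evidently not exact halves.

```python
def halving_sizes(range_size, batch_size=500):
    """Sizes visited when one half of a range is followed at every split."""
    sizes = []
    while range_size > batch_size:
        # Round up so an odd-sized range keeps its middle record.
        range_size = (range_size + 1) // 2
        sizes.append(range_size)
    return sizes


print(halving_sizes(174197))
# [87099, 43550, 21775, 10888, 5444, 2722, 1361, 681, 341]
```

Nine halvings bring the 174,197-record range down to a batch-sized range, which matches the nine indicators that follow `97%↓174197` in the sample line.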

## Reading Examples

### Example 1: Recent First Search

```text
2.8%↑5000 97%↓174197
```

- Search 5000 recent records from the source (2.8% of total)
- Then search remaining 174,197 older records from the source (97% of total)

### Example 2: Binary Subdivision

```text
49%↑87097 24%↑43547
```

- `49%↑87097`: the upper half of the previous range (87,097 records from the source, 49% of total)
- `24%↑43547`: that range split again, keeping its upper half (43,547 records from the source, 24% of total)

### Example 3: Final Small Batches

```text
0.8%↓1362 0.4%↓682 0.2%↓342
```

- Processing small batches directly from the source
- Shows precise decimal percentages for granular progress
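When a progress line needs to be read by a script rather than by eye, the indicators can be pulled apart with a regular expression. The sketch below is an assumption based only on the sample lines in this manual; `parse_indicators` is a hypothetical helper and the pattern may need adjusting if the real output contains other tokens.

```python
import re

# percentage, direction arrow, record count (e.g. "0.8%↓1362")
INDICATOR_RE = re.compile(r"(\d+(?:\.\d+)?)%([↑↓])(\d+)")


def parse_indicators(line):
    """Return (percent, direction, count) tuples from a progress line."""
    return [
        (float(pct), "newer" if arrow == "↑" else "older", int(count))
        for pct, arrow, count in INDICATOR_RE.findall(line)
    ]


line = "0.8%↓1362 0.4%↓682 0.2%↓342"
for pct, direction, count in parse_indicators(line):
    print(f"{count:>6} records, {direction} side, {pct}% of total")
```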

## Final Results

```text
read 342 found 1 - total 2/2.
```

- **read 342**: Processed 342 records in final batch
- **found 1**: Located 1 target record in this batch
- **total 2/2**: Found all 2 required records

### Finding More Than Expected

It's possible to find more records than the "need to find" count indicates:

```text
need to find 19 to add
...
Binary search found 20/19 records to add
```

**Why this happens**:

- The "need to find" count is the NET difference: `sourceCount - targetCount`
- In practice, there may be both additions AND deletions
- Example: 25 records added and 6 records deleted give a net difference of 19
- Binary search then finds all 25 additions (the actual differences), not just 19
- In the output above, 20 additions and 1 deletion would give the same net difference of 19
- This ensures complete synchronization even with complex change patterns

**This is correct behavior**: The algorithm must find ALL records that differ between databases, not just enough to account for the count difference. Stopping at the expected count would leave databases inconsistent.
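The difference between the net count and the actual changes is easy to check with sets. The record IDs below are made up for illustration; the real tool compares database rows, not Python sets.

```python
# Hypothetical record IDs, chosen to match the 25-added / 6-deleted example.
target = set(range(1, 101))      # 100 records already in the target
deleted = set(range(1, 7))       # 6 records removed from the source
added = set(range(101, 126))     # 25 records new in the source
source = (target - deleted) | added

net_difference = len(source) - len(target)  # what "need to find" is based on
to_add = source - target                    # in the source, missing from the target
to_delete = target - source                 # in the target, gone from the source

print(net_difference, len(to_add), len(to_delete))  # 19 25 6
```

Stopping after 19 additions would leave 6 of the new records unsynchronized.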

## Summary Statistics

```text
delete binary search count 1: 00:00:12 (16 iterations, 46 queries, depth 10, found 2 records to delete, recent first 5000, batch size 500)
```

- **00:00:12**: Total time taken
- **16 iterations**: Number of search iterations
- **46 queries**: Database queries executed
- **depth 10**: Maximum search tree depth
- **found 2 records**: Final result count (may be more than expected if there are both adds and deletes)
- **batch size 500**: Records processed per final batch
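For monitoring or log analysis, the summary line can be turned into a dictionary. The pattern below is a sketch derived from the single sample line above, so the exact wording of other summary lines is an assumption, and `parse_summary` is a hypothetical helper.

```python
import re

SUMMARY_RE = re.compile(
    r"(?P<elapsed>\d+:\d{2}:\d{2}) \("
    r"(?P<iterations>\d+) iterations, (?P<queries>\d+) queries, "
    r"depth (?P<depth>\d+), found (?P<found>\d+) records to (?P<action>\w+), "
    r"recent first (?P<recent_first>\d+), batch size (?P<batch_size>\d+)\)"
)


def parse_summary(line):
    """Return the summary fields as a dict, or None if the line does not match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    # Numeric fields become ints; "elapsed" and "action" stay strings.
    stats = {k: int(v) if v.isdigit() else v for k, v in m.groupdict().items()}
    hours, minutes, seconds = (int(p) for p in stats["elapsed"].split(":"))
    stats["elapsed_seconds"] = hours * 3600 + minutes * 60 + seconds
    return stats


sample = ("delete binary search count 1: 00:00:12 (16 iterations, 46 queries, "
          "depth 10, found 2 records to delete, recent first 5000, batch size 500)")
stats = parse_summary(sample)
print(stats["queries"] / stats["iterations"])  # queries per iteration
```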

### Understanding Result Counts

The result count may exceed the initial "need to find" estimate:

- **Expected count**: Based on NET difference between source and target
- **Actual count**: ALL records that differ between databases
- **Example**: "need to find 19" but "found 20/19 records to add"
- **Reason**: Count difference shows net change, but sync needs all actual changes

## Configuration

### Recent First Setting

- Configured via `binary_search_recent_first` preference
- Typical values: 1000, 5000, 10000
- Set to 0 to disable recent-first behavior

### Batch Size

- Determines when to stop subdividing and process directly
- Default: 500 records
- Smaller values = more granular progress, more queries
- Larger values = less granular progress, fewer queries
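The trade-off can be estimated with a logarithm: each split halves the range, so reaching a batch-sized range takes about log2(range size / batch size) splits. The function below is a back-of-the-envelope sketch with a hypothetical name. It counts splits along a single search path only and says nothing about the total number of queries db-sync issues.

```python
import math


def estimated_splits(range_size, batch_size):
    """Approximate number of halvings before a range fits in one batch."""
    if range_size <= batch_size:
        return 0
    return math.ceil(math.log2(range_size / batch_size))


for batch_size in (100, 500, 2000):
    print(batch_size, estimated_splits(174197, batch_size))
# 100 11
# 500 9
# 2000 7
```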

## Benefits

1. **Real-time feedback**: See search progress as it happens
2. **Performance insight**: Understand where time is spent
3. **Debugging aid**: Identify slow search ranges
4. **User confidence**: Visual confirmation that sync is progressing
5. **Accurate results**: Finds all differences, not just net count

## Important Notes

### Count Differences vs. Actual Operations

The binary search algorithm is designed to find **all records that differ** between databases, which may be more than the count difference suggests:

- **Count difference**: Shows net change (additions - deletions)
- **Binary search result**: Shows all records needing synchronization
- **Why it matters**: Ensures complete data consistency

### Example Scenario

```text
Initial state:
  Source: 100 records
  Target: 100 records

After changes:
  Source: 105 records (10 added, 5 deleted)
  Target: 100 records (unchanged)

Count difference: 105 - 100 = 5 records
Binary search finds: 10 records to add (correct!)

Why: The 5 deletions reduced the net count, but we still need
to add all 10 new records to maintain synchronization.
```

This behavior ensures that the synchronization is complete and accurate, even in complex scenarios with concurrent additions and deletions.

---

*This progress display helps users understand and monitor the binary search process during database synchronization operations.*

## AI development chat

```text
offset for data query is in line 'offset = (startId == "" or startId == nil)' but offset should be set in addRange() and use that range offset for data query, fix it
can we use offset on <= searches? we need to check with offset calculations that offset + limit does not exceed table record count on go to zero (or 1?)
offset calculations should be done correctly when calling pushRange(), that code is only for error checking
sql OFFSET starts from 0 == no offset, we need to calculate offset correctly when calling pushRange, fix
```

**Question:**

Should we use offset when searching down or up? Do we have the correct startId and endId for the upper and lower ranges after finding pivotId?

**Answer:**

GPT5-Mini:

I'll modify `pushRange` so offsets are only computed for fully positional ranges (both `startId` and `endId` empty). This avoids incorrect offset semantics for end-only (descending) searches. I'll apply the patch and re-run the file checks.

`db-sync-binary-search.lua` (+6 -1)

I'll run an error check to ensure no new issues.

Checked `db-sync-binary-search.lua`, no problems found.

**Quick answer and what I changed next**

Short:

- No: don't rely on SQL `OFFSET` for end-only (descending) searches; use offsets only for fully positional ranges (both `startId` and `endId` empty).
- Yes: the code sets the correct `startId`/`endId` for the upper and lower ranges after pivot selection (upper uses `startId = pivotId`, lower uses `endId = pivotId`).
- I updated the code so `pushRange` computes and limits offsets only for fully positional ranges and leaves end-only searches to id-bounds.

**Details / reasoning**

`OFFSET` semantics are 0-based and apply after `ORDER BY`. That makes offset fragile for descending (end-only) searches:

- For end-only searches we use ordering '<' (descending) so `LIMIT 1` returns the last row <= `endId`. If you mix `OFFSET` in here, you can easily get unexpected rows because `OFFSET` is applied after ordering and the meaning of the offset shifts.
- It's safer to use id-based bounds (`startId`/`endId`) for deterministic splits and only use `OFFSET` when we truly mean a positional read (we have no id boundary and we want, e.g., the Nth..Mth rows).
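The effect of `OFFSET` under ascending and descending order can be checked with a small in-memory SQLite table. This is an illustration only: the `records` table and the `first_id` helper are hypothetical, and SQLite is used here just because it ships with Python, not because db-sync necessarily uses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO records (id) VALUES (?)",
                 [(i,) for i in (10, 20, 30, 40, 50)])


def first_id(sql, params=()):
    """Run a query and return the first column of the first row, if any."""
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


# OFFSET 0 means "no offset": the first row in the requested order.
asc0 = first_id("SELECT id FROM records ORDER BY id ASC LIMIT 1 OFFSET 0")
# OFFSET 2 skips two rows, so the third row in ascending order comes back.
asc2 = first_id("SELECT id FROM records ORDER BY id ASC LIMIT 1 OFFSET 2")
# End-only search: descending order returns the last row <= endId.
desc0 = first_id("SELECT id FROM records WHERE id <= ? "
                 "ORDER BY id DESC LIMIT 1", (40,))
# The same OFFSET 2 now counts backwards from endId.
desc2 = first_id("SELECT id FROM records WHERE id <= ? "
                 "ORDER BY id DESC LIMIT 1 OFFSET 2", (40,))

print(asc0, asc2, desc0, desc2)  # 10 30 40 20
```

The same offset value selects a different row depending on the sort direction, which is why mixing positional offsets with end-only searches is easy to get wrong.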
**What the code now does (and why it's correct)**

- `pushRange` now computes a 0-based offset only when both `startId` and `endId` are empty (fully positional range). Candidate offset = `lowerPos - 1`, clamped using `countInRange` to ensure candidate + `range.count` doesn't exceed rows.
- For upper and lower halves after `pivotId`:
  - The upper half is pushed as `startId = pivotId`, `endId = old endId` (so searches for the upper half use startId-bound queries).
  - The lower half is pushed as `startId = old startId`, `endId = pivotId` (so searches for the lower half use endId-bound queries).
  - That means those halves use id-based bounds rather than positional offsets (unless both bounds become empty).
- `getPivotId` still gets the pivot by calling `getIdArrayInRange` with `limit=1` and `midPosition` as offset. This is intentional: we use positional pivot selection; `pushRange` logic avoids setting offsets for end-only ranges, but pivot selection (single-record read) uses the offset passed to `getIdArrayInRange`.
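The offset rule and the pivot split described above can be sketched as follows. This is a Python illustration of the description, not the Lua code from `db-sync-binary-search.lua`: `compute_offset` and `split_at_pivot` are hypothetical names, the 1-based `lower_pos` is inferred from the "`lowerPos - 1`" wording, and `total_rows` stands in for whatever `countInRange` returns.

```python
def compute_offset(start_id, end_id, lower_pos, range_count, total_rows):
    """Return a 0-based SQL OFFSET, or None when id bounds should be used."""
    if start_id or end_id:
        return None                # id-bounded range: no positional offset
    candidate = lower_pos - 1      # 1-based position -> 0-based OFFSET
    # Clamp so that candidate + range_count stays within the table.
    candidate = min(candidate, total_rows - range_count)
    return max(candidate, 0)


def split_at_pivot(start_id, end_id, pivot_id):
    """Return (upper, lower) id-bounded ranges around the pivot."""
    upper = {"startId": pivot_id, "endId": end_id}
    lower = {"startId": start_id, "endId": pivot_id}
    return upper, lower


print(compute_offset("", "", 1, 500, 1000))      # 0
print(compute_offset("", "", 800, 500, 1000))    # 500 (clamped from 799)
print(compute_offset("a1", "", 800, 500, 1000))  # None
```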
**Edge cases to watch / suggestions**

- Pivot offset vs id-bound semantics: `getPivotId` currently calls `getIdArrayInRange(..., limit=1, offset=midPosition)`. That works when you want the record at a given position. Keep it, but be aware that offsets passed here are positional and must be valid (we added debug checks to warn if the offset is too large).
- If your database rows have non-dense or non-monotonic id ordering (holes, gaps, out-of-order inserts), positional offsets can be surprising. In that situation prefer id-based bounds wherever possible.
- If you want absolute safety (no COUNT queries in `pushRange`), we can avoid clamping by letting the query return fewer rows and rely on merge/validation logic, but that may create confusing logs. The current approach does one COUNT when clamping is possible, an acceptable trade-off for correctness.
- You might want a consistent policy for the initial `pushRange`: I left the initial call passing `initialOffset`. We can remove that and allow `pushRange` to compute it (recommended for consistency).
**Concrete next steps I can make now (pick one)**

1. Remove the `initialOffset` argument in the initial `pushRange(...)` call so offsets are always computed by `pushRange`.
2. Add debug logging when `pushRange` clamps or sets offsets (helps track why an offset was chosen).
3. Add a small unit test harness (mock `sync.runQuery` and `countInRange`) to validate `pushRange`/`getIdArrayInRange` behavior for:
   - fully positional ranges (offset computed and clamped)
   - end-only ranges (no offset, id-bounds used)
   - pivot selection with offsets

Which of these would you like me to implement next? If you prefer, I can just remove the `initialOffset` argument and add a one-line debug log when `range.offset` is set/clamped.
