Skip to content

Add Vector benchmarks to suite#7397

Closed
connortsui20 wants to merge 18 commits intodevelopfrom
ct/vector-bench
Closed

Add Vector benchmarks to suite#7397
connortsui20 wants to merge 18 commits intodevelopfrom
ct/vector-bench

Conversation

@connortsui20
Copy link
Copy Markdown
Contributor

@connortsui20 connortsui20 commented Apr 11, 2026

Summary

Tracking issue: #7297

Adds a vector-search-bench crate similar to VectorDBBench.

The benchmark brute-forces cosine similarity search over public VectorDBBench datasets (Cohere, OpenAI, etc).

Since Vortex is not a database, we do not measure things like vector inserts and deletes. instead we just measure storage size and throughput for four targets:

  • Hand-rolled Rust baseline over &[f32]
  • Uncompressed (canonical) Vortex
  • Default compressed (which ends up being ALPrd)
  • TurboQuant (plus Recall@10 for the lossy TurboQuant path).

Every variant goes through a correctness check against the uncompressed scan before timing.

Not sure if this makes sense to have on a PR benchmark yet since this can only happen on a very specific array tree that will have a very specific optimized implementation.

Testing

N/A

claude added 18 commits April 11, 2026 12:33
Promote `compress_turboquant`, `build_constant_query_vector`, and
`build_similarity_search_tree` from the bench-only
`similarity_search_common` module into a new public
`vortex_tensor::vector_search` module so downstream benchmark crates
(and library users doing brute-force vector search) can reuse them
without duplicating the canonical recipe.

The existing similarity_search bench now delegates to these helpers
instead of inlining them, and adds three unit tests for the new
public surface: constant-query vector construction, end-to-end tree
execution to a BoolArray, and a TurboQuant-roundtrip ranking check.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Adds `Format::VortexTurboQuant` as a new Format enum variant, parallel to
`VortexCompact`, so downstream dashboards can distinguish TurboQuant-encoded
Vortex results from the generic BtrBlocks-compressed OnDiskVortex layout
via the existing `Target { engine, format }` grouping key.

Also adds `vortex_bench::conversions::list_to_vector_ext`: a utility that
rewraps a `List<float>` column (the shape an `emb` column takes when
ingested from parquet via `parquet_to_vortex_chunks`) as
`Extension<Vector>(FixedSizeList<T, D>)`. This is the only piece of
conversion glue needed to bring a VectorDBBench-style parquet embedding
column into a form the vector-search scalar functions expect. It handles
both single-list and chunked inputs, validates uniform stride, and
rejects nullable, non-float, or mismatched-length inputs.

`vortex-bench` gains a direct dependency on `vortex-tensor` for the
`Vector` extension type.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Introduce `VectorDataset`, a new `Dataset`-trait implementor that wires
public VectorDBBench-hosted embedding corpora into the vortex-bench
data-management machinery. The first variant is `CohereSmall`
(Cohere wiki-22-12, 100K rows × 768 dims, cosine metric, ~150 MB
zstd-parquet), targeting a CI-friendly size. Follow-up variants
(CohereMedium, OpenAI*, SIFT*, etc.) can be added alongside without
structural change.

Each variant carries its upstream `assets.zilliz.com` parquet URL,
dimension, row count, and curated distance metric. The download path
reuses the existing `download_data` / `idempotent_async` infrastructure
from `DownloadableDataset`, caching under `{name}/{name}.parquet` in
the local benchmark data directory.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Introduce a new standalone benchmark crate at
`benchmarks/vector-search-bench/` that measures brute-force cosine
similarity search over VectorDBBench-style public embedding corpora.
This commit wires up three Vortex variants:

- `vortex-uncompressed`: raw `Vector<dim, f32>` with no encoding-level
  compression applied. Baseline for Vortex variants.
- `vortex-default`: `BtrBlocksCompressor::default()` applied to the FSL
  storage — generic lossless compression for float vectors.
- `vortex-turboquant`: the full `L2Denorm(SorfTransform(FSL(Dict)))`
  pipeline via `compress_turboquant`. Lossy; in-memory `.nbytes()` is
  used as the reported size until `L2Denorm` serialization lands.

Four measurements are emitted per `(dataset, variant)` pair:
1. Compressed size (bytes) — on-disk `.vortex` file size for the
   lossless variants, in-memory footprint for TurboQuant.
2. Full-scan decode time — executes the column to a materialized
   `FixedSizeListArray`, forcing the entire decompression/dequantization
   path for each variant.
3. Cosine-similarity execute time — runs `CosineSimilarity(data, query)`
   into a materialized f32 array.
4. Filter execute time — runs `Binary(Gt, [cosine, threshold])` into a
   `BoolArray`. This is the end-to-end "query pushdown" path.

Output goes through the existing `vortex_bench::measurements` types so
results flow through the standard `gh-json` pipeline and appear on the
CI dashboard alongside compress-bench / random-access-bench results.
`Variant::VortexUncompressed` and `Variant::VortexDefault` both report
as `Format::OnDiskVortex`; `Variant::VortexTurboQuant` reports as
`Format::VortexTurboQuant`, letting downstream consumers distinguish
them via the existing `Target { engine, format }` grouping key.

The crate ships with a unit test that builds a 64×128 synthetic
`Vector` array, runs all three variants through `prepare_variant` +
`run_timings`, and verifies every variant reports a non-zero size and
non-zero timings. Parquet-Arrow baseline and Recall@10 quality
measurements are deferred to a follow-up commit; this commit
intentionally keeps the Vortex-only core focused.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
…ment

This completes the v1 scope of the vector-search benchmark by adding
two orthogonal pieces of functionality:

1. **Parquet-Arrow hand-rolled cosine baseline** (`parquet_baseline.rs`)
   reads the canonical parquet file via `parquet::arrow`, decodes the
   `emb` column to a flat `Vec<f32>`, and runs a straightforward Rust
   cosine loop. This is the "what you'd do without Vortex" external
   floor — necessary for the benchmark to tell a credible story about
   Vortex's value vs. raw parquet. Reports size, decode time, cosine
   time, and filter time in the same format as the Vortex variants,
   under `Format::Parquet`. Handles both `List<f32>` and
   `FixedSizeList<f32>` parquet schemas.

2. **Recall@K quality measurement** (`recall.rs`) computes the fraction
   of the exact top-K nearest neighbours that the lossy TurboQuant
   variant recovers, using the uncompressed Vortex scan as local ground
   truth (not VectorDBBench's shipped `neighbors.parquet`, which would
   require an index and is out of scope). Samples are deterministic —
   rows picked at uniform intervals across the dataset — so results are
   stable across runs. Only TurboQuant is checked; lossless variants
   are trivially 1.0. Emitted as a `CustomUnitMeasurement` with unit
   "recall" so it flows through the existing gh-json pipeline alongside
   the size/timing measurements.

The binary picks this all up behind two new CLI flags: `--parquet-baseline`
(default true) and `--recall-queries N` (default 100), plus `--recall-k K`
(default 10). Four unit tests cover the new functionality:
- `parquet_baseline_reads_fsl_column` — end-to-end parquet write +
  read + cosine loop on a 3×3 fixture.
- `uncompressed_has_perfect_self_recall` — sanity check that recall@10
  of a variant against itself is exactly 1.0.
- `turboquant_recall_is_reasonable_for_synthetic_data` — loose lower
  bound (>= 0.3) on TurboQuant recall for a 64×128 random dataset to
  catch total regressions without being flaky.

A small `test_utils` submodule was extracted from the lib's existing
tests module so the recall tests can reuse the synthetic vector
generator.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Adds `vector-search-bench` as a new entry in the post-merge `bench`
matrix alongside `random-access-bench` and `compress-bench`. Uses the
same build-bin / bench-taskset / gh-json / S3-upload pipeline so
results flow into the existing benchmarks dashboard without any
dashboard-side changes. The four formats passed on the CLI
(`parquet,vortex-uncompressed,vortex-default,vortex-turboquant`) surface
in the emitted JSON as three different `target.format` values —
`Format::Parquet` for the hand-rolled baseline, `Format::OnDiskVortex`
for the two lossless variants, and `Format::VortexTurboQuant` for the
lossy one — so dashboards can render them as distinct bars per metric.

Also promotes the `--variants` clap flag on the binary to `--formats`
for naming consistency with the other standalone benchmarks (the CI
script pipes `--formats ${{ matrix.benchmark.formats }}` uniformly).
The underlying `Variant` enum is unchanged; only the CLI surface moved.

Finally, adds `benchmarks/vector-search-bench/README.md` documenting
what the benchmark measures, the four formats, local-run instructions,
and — importantly — the dataset-mirror caveat. The current dataset URL
points at `assets.zilliz.com` (Zilliz's public anonymous-readable
bucket). Running this matrix entry on every develop merge will create
recurring egress traffic on a third-party bucket; the README explains
how to swap the URL in `VectorDataset::parquet_url` to an internal
mirror before turning this on for a fork with heavy merge frequency.
Skipping `bench-pr.yml` (PR runs) deliberately — v1 is develop-only
until the mirror / egress question is resolved.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Extends `VectorDataset` with five additional cosine-metric corpora from
the VectorDBBench collection:

- CohereMedium: 1M × 768d (wiki-22-12)
- OpenAiSmall: 50K × 1536d (OpenAI-on-C4)
- OpenAiMedium: 500K × 1536d (OpenAI-on-C4)
- BioasqMedium: 1M × 1024d (biomedical)
- GloveMedium: 1M × 200d (word embeddings)

All six built-in datasets are exported via a new `ALL_VECTOR_DATASETS`
slice and exposed through the `vector-search-bench` `--datasets` flag
so local runs can pick any of them.

Adds an `all_datasets_have_consistent_metadata` test that verifies every
built-in variant has a unique kebab-cased name, a `train.parquet` URL
under `assets.zilliz.com/benchmark/`, a dimension above the TurboQuant
minimum (128), and cosine metric (v1 only wires cosine). L2 / IP
datasets (SIFT, GIST, LAION) remain out of scope until vortex-tensor
gains an L2-distance scalar function.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
…generator

Fixes two issues discovered while running the benchmark end-to-end
against a real parquet file in a no-network environment:

1. `parquet_to_vortex_chunks` produces `ListView` arrays for list
   columns by default, so `list_to_vector_ext` was rejecting them with
   "expects a List array, got list(f32)". Added a fast path that
   applies `recursive_list_from_list_view` before the offset-stride
   validation kicks in.

2. `list_to_vector_ext` returns a `ChunkedArray<Extension<Vector>>`
   when given a chunked input (the normal shape after parquet ingest),
   but downstream `extract_query_row` expects a single non-chunked
   `Extension<Vector>`. `prepare_dataset` now calls
   `wrapped.execute::<ExtensionArray>(&mut ctx)` to materialize the
   chunked wrapper into a single unchunked extension array.

Also adds a new `gen_synthetic_dataset` helper binary that writes a
VectorDBBench-shape parquet file (`id: int64`, `emb: list<float32>`,
zstd-compressed) with deterministic xorshift-generated vectors. This
is useful for:

- Local dev runs of `vector-search-bench` without needing outbound
  network access to `assets.zilliz.com`.
- Populating the `vortex-bench/data/<name>/<name>.parquet` cache path
  so the benchmark's idempotent download step skips the HTTP fetch.
- Sanity-checking the benchmark pipeline in CI environments that
  block outbound HTTPS to third-party buckets.

Verified end-to-end against 5000×768 synthetic Cohere-small and
5000×1536 synthetic OpenAI-small inputs — every measurement pipeline
(size, decode, cosine, filter, recall@10) produces valid gh-json and
TurboQuant recall@10 lands at 0.91 on both fixtures, consistent with
the loose recall floor the unit test enforces.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
…press measurements

A thorough audit of the benchmark's outputs found three issues that together
made the results hard to interpret:

1. **"Size" was lying.** The old `measure_on_disk_size` helper wrote every
   variant through `SESSION.write_options()`, which uses
   `WriteStrategyBuilder::default()` — and that default applies BtrBlocks
   compression regardless of the in-memory tree. So both
   `vortex-uncompressed` and `vortex-default` ended up writing identical
   ALP-RD-compressed bytes to disk, reporting the same "size" number even
   though the two in-memory trees were different. Switched to reporting
   `ArrayRef::nbytes()` of the in-memory variant tree instead — the honest
   footprint of what the compute measurements actually operate on.

   Running the diagnostic `print_variant_trees` test now shows what each
   variant really is:

   ```
   VortexUncompressed   nbytes=15,360,000  Extension → FSL → Primitive<f32>
   VortexDefault        nbytes=13,072,050  Extension → FSL → alprd(f32)
                                             [bitpacked u16/u32 + patches]
   VortexTurboQuant     nbytes= 5,141,024  L2Denorm → SorfTransform →
                                             FSL → Dict(u8 codes + f32[256])
   ```

   So BtrBlocks really does find a ~15% lossless saving on random f32 via
   ALP-RD, and the previously-observed 3.5× cosine slowdown is the real
   cost of decoding ALP-RD on the fly inside `CosineSimilarity`.

2. **"Decompress time" for the compressed variants was a no-op.** The old
   `decode_full_scan` only executed the Extension shell and the FSL
   storage; it never forced the element buffer to materialize. Since FSL
   canonical form can keep its elements in their encoded representation,
   this reported ~60 ns for `vortex-default` even though reading any value
   would actually have to run the ALP-RD unpacking. Renamed to
   `decompress_full_scan` and added a final
   `fsl.elements().execute::<PrimitiveArray>()` call that forces the lazy
   decode to completion. `vortex-default`'s decompress time now shows
   ~6 ms on 5000×768 rows — the actual ALP-RD cost.

3. **No correctness verification.** The benchmark reported throughput
   numbers with no guarantee that the variants were producing the same
   answers. A new `verify.rs` module now computes cosine-similarity
   scores for one query against every variant *before* timing, compares
   them to the uncompressed baseline, and bails the run if any lossless
   variant drifts beyond 1e-4 max-abs-diff or any lossy variant drifts
   beyond 0.2 max-abs-diff. The `execute_cosine` implementation is
   promoted to `pub` and shared between the timing loop and verification
   so both paths exercise the exact same expression tree.

Also adds a new `compress time` measurement alongside `decompress time`,
so the dashboard can show the full round-trip cost per variant
(previously we only timed one side). The TurboQuant verification is
emitted as a warning instead of a hard error — the lossy tolerance is a
quality-budget alert, not a correctness invariant.

New unit tests (13 total in vector-search-bench, up from 4):
- `verify::*` — 8 tests covering the `compare_scores`, `verify_scores`,
  and `verify_variant` paths including NaN handling and end-to-end
  lossless/lossy checks on synthetic 64×128 data.
- `tests::uncompressed_decompress_is_fast` — regression guard ensuring
  the uncompressed variant's decompress pass is >5× faster than
  TurboQuant's so future refactors can't accidentally make the
  "uncompressed" path take the slow path.
- `tests::print_variant_trees` — `#[ignore]` diagnostic test that dumps
  `display_tree_encodings_only()` for each variant. This is what was
  used to root-cause the size-lying issue.

Verified end-to-end on synthetic 5000×768 Cohere-small:
  parquet              nbytes=14,511,306  cosine= 2.9 ms  correctness=0.00e0
  vortex-uncompressed  nbytes=15,360,000  cosine= 7.9 ms  correctness=0.00e0
  vortex-default       nbytes=13,072,050  cosine=20.5 ms  correctness=0.00e0
  vortex-turboquant    nbytes= 5,141,024  cosine=59.1 ms  correctness=5.18e-3
  recall@10 (tq)       0.9150

All lossless variants now produce bit-identical cosine scores (max diff
exactly 0.0) against the uncompressed baseline, confirming the ALP-RD
and hand-rolled parquet paths are correct. TurboQuant's 5.18e-3 drift
is well within the 0.2 lossy tolerance.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
The compute side of the baseline isn't really a parquet implementation —
it's a **hand-rolled Rust scalar cosine loop over a flat `Vec<f32>`**
that happens to have been decoded from a parquet file via
`parquet-rs` / `arrow-rs`. Labeling it as "parquet" in CLI flags,
metric names, and docs suggested the baseline was measuring a real
parquet-based analytical engine, when in reality it's the minimum cost
a Rust programmer could get away with given no query engine.

User-facing rename: everywhere the baseline was called "parquet" it now
says "handrolled" — the CLI format, the metric labels, the CI matrix
entry, the README, the type and function names in the baseline module,
and the module filename itself (`parquet_baseline.rs` →
`handrolled_baseline.rs`).

Unchanged:
- `read_parquet_embedding_column` keeps its name because it really does
  read parquet. Only the baseline-level wrappers take the `handrolled`
  label.
- `target.format` stays `Format::Parquet` — the storage side really is
  parquet on disk, and dashboards grouping by format should still put
  the baseline's size/decompress measurements under "parquet". The
  `CompressionTimingMeasurement::to_json` prefix therefore still reads
  `"parquet_rs-zstd "`, which honestly describes the parquet reader
  stack. Only the compute label in the metric `name` string reflects
  the hand-rolled compute path.

The module-level doc comment on `handrolled_baseline.rs` now explicitly
frames the baseline as a compute-cost *floor*, not a fair DBMS
comparison, and points at DuckDB `list_cosine_similarity` and
DataFusion vector UDFs as better "what real users pay for
parquet+cosine" baselines for future work.

Verified end-to-end against the 5000×768 synthetic cohere-small
fixture. All 13 existing tests still pass (the baseline test rename
from `parquet_baseline_reads_fsl_column` to
`handrolled_baseline_reads_fsl_column` is the only functional change to
tests). Correctness verification still reports:

  handrolled          max_abs_diff = 0.000e0 (bit-identical, lossless)
  vortex-uncompressed max_abs_diff = 0.000e0
  vortex-default      max_abs_diff = 0.000e0
  vortex-turboquant   max_abs_diff = 5.18e-3 (within 0.2 lossy tolerance)

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Review caught four real bugs plus a stylistic stray import. All are
small, targeted fixes.

1. `build_similarity_search_tree` (vortex-tensor) and `execute_cosine`
   (vector-search-bench) wrapped `CosineSimilarity::try_new_array` in
   `vortex_expect`, which panics on the `Err` branch. Both functions'
   doc comments claimed they *returned* errors on dimension mismatch,
   so the behavior contradicted the contract: a caller passing a
   wrong-dimension query vector would crash instead of seeing a clean
   error. Fixed by replacing `.vortex_expect(...)` with `?`.

2. The progress bar in `main.rs` was initialized with
   `datasets.len() * args.formats.len()` (which counts the handrolled
   baseline) but only `inc(1)`'d inside the Vortex-variants loop, so
   the bar finished at `N-1/N` whenever handrolled was selected. Fixed
   by incrementing inside the handrolled block and recomputing
   `total_work` from the actual increment sites.

3. `extract_query_row` blindly called `PrimitiveArray::as_slice::<f32>()`
   without checking the underlying ptype. The benchmark only ever uses
   f32 today, but the `Vector` extension type supports f16/f32/f64, so
   a future caller that accidentally passes an f64 `Vector` would get
   a mis-typed slice. Fixed by asserting `elements.ptype() == PType::F32`
   before the cast with a clear error message. While there, simplified
   the convoluted bounds check (was `row * dim + dim > len * dim`,
   reduces to `row >= len` without the multiplication).

4. `gen_synthetic_dataset.rs` and the README had shell examples using
   `--bin gen-synthetic-dataset` with hyphens, but cargo's default bin
   name is the filename minus extension — i.e., underscores. The
   example would fail to resolve. Fixed to `gen_synthetic_dataset`.

5. `use std::io::Write;` was stranded at the bottom of `main.rs` as a
   single-line import after the `main` function. Moved to the top
   import block for idiomatic style.

All 13 vector-search-bench tests still pass. 222 vortex-tensor tests
still pass. clippy and rustfmt clean. Public API unchanged — no
public-api.lock regen needed.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Cleanup of seven dead-or-suppressed items the review flagged:

1. `QueryLen` type alias (`lib.rs`) — defined but only "used" via
   `let _ = QueryLen::default;`, a warning-suppression hack. The doc
   comment claimed it made "broadcast semantics obvious at call sites"
   but no call site actually referenced it. Both the alias and the
   suppression `let` are gone.

2. `async-trait` and `itertools` Cargo dependencies — listed in
   `vector-search-bench/Cargo.toml` but not referenced anywhere in the
   crate's source tree (probably copied from `compress-bench` when the
   crate was bootstrapped). Removed.

3. `VERIFICATION_QUERY_ROW` constant (`verify.rs`) — defined as `0`
   and documented as "the row we verify against" but never referenced
   from anywhere; `main.rs` passes `prepared.query` directly, and
   `prepared.query` is extracted at `DEFAULT_QUERY_ROW` time inside
   `prepare_dataset`. Removed.

4. Dead-code-suppression hack in `vortex_bench::conversions::tests`:

       // ...suppress the dead-code warning by referencing them in a
       // no-op expression.
       let _ = (Nullability::NonNullable, PType::I32);

   Both imports were never used in the test body. Removed the imports
   and the dead let.

5. `let _ = emb_idx;` in `handrolled_baseline::read_parquet_embedding_column`.
   Replaced with destructuring via `let (_, emb_field) = ...`.

6. `HandrolledBaselineData::file_size` — populated by `read_parquet_
   embedding_column` via a second `File::metadata()` call, but no
   caller reads it: the benchmark's "handrolled size" measurement
   reads `PreparedDataset::parquet_bytes` (populated once in
   `prepare_dataset`). Keeping the duplicated state was a recipe for
   drift. Removed the field and its populating code, and added a doc
   comment on the struct pointing at `parquet_bytes` as the canonical
   file-size source.

All 13 vector-search-bench tests + 16 vortex-bench tests still pass.
clippy / rustfmt clean.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Four related cleanups the review flagged for modularity & clarity:

1. Deleted the duplicate `extract_query_row` helper from
   `recall.rs` (30 lines near-identical to the one in `lib.rs`). Both
   call sites now go through a single `pub(crate)` helper in `lib.rs`.
   The lib.rs helper also gets a fresh doc comment explicitly calling
   out its f32-only requirement (landed in the previous commit) so the
   deduped version is not just shorter but more correct.

2. Added `verify::verify_and_report_scores`: a score-vector-first
   variant of `verify_variant` that takes pre-computed scores as
   `&[f32]` instead of an `ArrayRef`. `verify_variant` now delegates
   to it. The point is to make both main-loop paths share a single
   error-handling / logging / bail-on-lossless codepath:
   - Vortex variants compute scores via `compute_cosine_scores(array)`
     and call `verify_variant`, which then calls
     `verify_and_report_scores` under the hood.
   - The handrolled baseline computes scores via `cosine_loop(&[f32])`
     and calls `verify_and_report_scores` directly.
   Previously `main.rs` hand-rolled the handrolled path's check/log/
   bail triplet, duplicating logic in two places that could drift.

3. Dropped the `compress_turboquant` and `build_similarity_search_tree`
   wrapper shims from `vortex-tensor/benches/similarity_search_common/
   mod.rs`. They existed only to re-expose the public helpers under
   local names for bench call-site compatibility. The bench now
   `pub use`s them directly, which is one fewer indirection layer.
   `compress_default` stays in the bench module because its
   `BtrBlocksCompressor` dependency lives in vortex-tensor's
   dev-dependencies, not main deps, so lifting it to the public
   `vector_search` module would pull `vortex-btrblocks` into
   vortex-tensor's main dep graph.

4. Replaced the stale "in future commits" doc comment on
   `PreparedDataset::uncompressed` — recall is already wired.

All 13 vector-search-bench + 222 vortex-tensor tests pass. clippy /
rustfmt clean.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Two micro-refactors for correctness quality:

1. `list_to_vector_ext` offset validation walks `ListArrayExt::offset_at`
   N+1 times (once per row boundary) instead of 2N times (once per start
   and once per end of every row). Each iteration now carries the
   previous `end` into the next iteration's `prev_end`, so we only
   fetch each offset boundary once. For a canonical list produced by
   `parquet_to_vortex_chunks` the offsets are a `Primitive` array and
   `offset_at` takes the fast slice-index path, so this halves the
   offset lookups from ~200K to ~100K on a 100K-row column. The loop
   body is still O(1) per row either way.

2. `run_timings` now interleaves its decompress / cosine / filter
   stages inside a single outer loop over iterations, instead of
   running the three stages as back-to-back sequential loops. The
   back-to-back shape gives each stage an asymmetric cache profile —
   the second+ iterations of the cosine stage run on cache lines left
   behind by the previous *cosine* iteration, while the second+
   iterations of the filter stage run on cache lines left behind by
   the *cosine* stage. Interleaving makes each iteration of each stage
   see roughly the same cache state, removing the asymmetry.

Each stage still gets a fresh `ExecutionCtx` so no cached scalar-fn
state leaks between stages within a single iteration. The doc comment
on `run_timings` now spells out why interleaving matters.

All 13 vector-search-bench + 16 vortex-bench tests still pass.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
…bail

Two review items related to testing coverage and trait-contract
honesty:

1. `list_to_vector_ext` has five unit tests but none of them exercised
   the `ListView` → `List` fast path, even though that path is the one
   the benchmark actually hits after `parquet_to_vortex_chunks` (which
   canonicalizes list columns into `ListView`, not `List`). Added a
   `list_view_input_is_rewrapped_as_vector_extension` test with a
   `list_view_f32(dim, rows)` helper that constructs a
   `ListViewArray<f32>` with uniform-stride offsets and sizes, then
   asserts `list_to_vector_ext` returns an `Extension<Vector>` whose
   storage is a `FixedSizeList<f32, dim>`. A regression to the
   ListView handling would now be caught by unit tests rather than by
   the end-to-end benchmark smoke test.

   `vortex-bench` tests are now 17 (was 16).

2. `VectorDataset::to_vortex_array` previously implemented the
   `Dataset` trait by downloading the parquet file, writing it through
   the default Vortex write path, and reading it back. The returned
   array was a `StructArray` with `{ id: int64, emb: list<float> }` —
   **not** the `Extension<Vector>(FixedSizeList<...>)` shape the
   vector-search benchmark actually operates on. The benchmark worked
   around this by bypassing `to_vortex_array` entirely and rebuilding
   the Vector extension via `parquet_to_vortex_chunks` +
   `list_to_vector_ext`. But that left the trait method as a trap for
   future callers who expect it to return the same shape the
   benchmark measures.

   Replaced the body with `bail!` and a message that points callers at
   the parquet path + list_to_vector_ext sequence. This makes the
   contract unambiguous: either use the parquet path, or get a clear
   error instead of a semantically-wrong array.

   The unused imports (`tokio::fs::File`, `ArrayStreamExt`,
   `OpenOptionsSessionExt`, `WriteOptionsSessionExt`, `SESSION`,
   `parquet_to_vortex_chunks`, `idempotent_async`, `IntoArray`) are
   dropped too.

13 vector-search-bench + 17 vortex-bench tests pass.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Five small style-only cleanups identified during the review pass.
Pure documentation + formatting — no behavior change.

1. `list_to_vector_ext` had one error message missing the
   `"list_to_vector_ext: ..."` prefix every other error used
   (`"list_to_vector_ext expects a List array ..."` → `"list_to_vector_ext:
   expected a List array ..."`). The test that asserts the substring is
   updated too.

2. `handrolled_baseline::filter_loop` now has an explicit comment
   flagging the `>` comparison as the *strict* greater-than that must
   stay in sync with the Vortex-side `Operator::Gt` in
   `build_similarity_search_tree`. A divergence between the two would
   show up as a max-abs-diff correctness failure for the lossless
   variants, but the comment makes the invariant explicit so the next
   person to touch either side thinks about it.

3. `gen_synthetic_dataset.rs` had three magic-number constants in the
   per-element value computation (`0.00013`, `0.00007`, `0.25`, plus
   `1.0/32768.0` for the random scale). Promoted them to named
   constants (`POS_FREQ_ROW`, `POS_FREQ_COL`, `POS_AMPLITUDE`,
   `RAND_SCALE`) with a comment explaining the intent: the sinusoid
   mixes the `(row, col)` position so vectors stay distinct even at
   low bit widths after quantization, and the frequency constants are
   chosen small and coprime to avoid short-period aliasing over the
   100K-row × 1536-col domain.

4. `gen_synthetic_dataset.rs`'s module doc said "bit-identical across
   runs". That overpromises: the generator uses `f32::sin`, whose last
   few ULPs are libm/CPU-dependent. Changed to "deterministic... on
   the same machine" with an explicit note about cross-machine
   variance.

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
…ests

A second pass over the branch caught several items I missed in the
first review cleanup. All code-style and correctness-level, no
behavior changes beyond the missed f32 assertion.

## Missed bug

`verify::compute_cosine_scores` had the same blind
`PrimitiveArray::as_slice::<f32>()` call I already fixed in
`extract_query_row` last round — I only patched the lib.rs copy, not
the verify.rs one. Added the `scores.ptype() != PType::F32` check
with a clear error message before the slice cast. Matches the doc
contract that the function returns `Vec<f32>` and only supports f32
Vector columns today.

## CLAUDE.md style violations

CLAUDE.md's import rule: "All imports must be at the top of the
module, never inside functions. The only exception is `#[cfg(test)]`
blocks, where imports should be at the top of the test module."

I had function-scoped `use` statements sprinkled across five
functions in `lib.rs` and inside an `extract_row_zero` test helper in
`verify.rs`, all against the guidance. Moved every function-scoped
`use` into the top-level import block:

- `prepare_dataset`: 1 import moved
- `extract_emb_column`: 5 imports moved
- `extract_query_row`: 4 imports moved
- `decompress_full_scan`: 3 imports moved
- `execute_cosine`: 2 imports moved
- `PreparedDataset::dim`: 2 fully-qualified `vortex::dtype::DType::...`
  paths replaced with plain `DType::...` from top-level import

## Test duplication

`verify::tests::extract_row_zero` was a 10-line reimplementation of
`extract_query_row` inside the test module. Deleted it and made
`verify::tests::make_prepared` call `extract_query_row` directly,
which also sidestep the need for function-scoped imports in the test.
Same rework applied to `lib::tests::prepare_variant_produces_...`
(which had its own inline f32 extraction) and `recall::tests::...`
(which was building a `PreparedDataset` with an empty query).
A new `test_prepared` helper in `lib.rs` tests replaces the three
separate inline constructions with a single shared builder.

## New test coverage

Added 8 new tests across the three modules:

- `vortex-bench::conversions::all_invalid_list_validity_is_rejected`:
  builds a `List<f32>` with `Validity::AllInvalid` and confirms the
  rejection path fires. Previously unexercised.

- `vector-search-bench::tests::extract_query_row_returns_the_right_slice`:
  verifies row 0 extraction is idempotent and different rows differ.

- `vector-search-bench::tests::extract_query_row_rejects_out_of_bounds_row`:
  verifies the `row >= len` bounds check.

- `vector-search-bench::verify::tests::verify_scores_fails_on_nan_in_baseline`:
  symmetric NaN case — the existing `verify_scores_fails_on_nan` test
  only covered the variant-side NaN.

- `verify::tests::verify_and_report_scores_is_ok_for_identical_inputs`:
  direct test for the helper that main.rs's handrolled path uses.

- `verify::tests::verify_and_report_scores_bails_for_lossless_mismatch`:
  lossless failures must hard-error with a message that includes the
  variant name.

- `verify::tests::verify_and_report_scores_warns_for_lossy_mismatch_without_bailing`:
  lossy failures must NOT bail — they log a warning and return the
  failing report so the caller can still emit it as a measurement.

- `recall::tests::top_k_indices_handles_nan_without_panicking`: NaN
  scores used to sort via `partial_cmp(...).unwrap_or(Equal)`, which
  produced arbitrary orderings. Switched to `f32::total_cmp` for a
  proper total order; test confirms non-NaN scores still rank in
  descending order and the sort doesn't panic.

## Minor

- Stale comment on `main.rs::total_work` calculation: referenced
  `args.formats.len()` but the expression uses a different formula
  after commit 74aaf5a. Fixed the comment.

- `top_k_indices` switched from
  `scores[b].partial_cmp(&scores[a]).unwrap_or(Ordering::Equal)` to
  `scores[b].total_cmp(&scores[a])`. Added a doc comment on the
  function explaining the NaN-safety rationale.

## Test counts after this commit

- vortex-bench: 18 tests (was 17, +1 for `all_invalid_list_validity`)
- vector-search-bench: 20 tests passing + 1 ignored (was 13 + 1,
  +7 for the new coverage above)
- vortex-tensor: 222 tests (unchanged)

Clippy / rustfmt clean. End-to-end smoke run on synthetic
cohere-small still reports:
- all four correctness-max-diff values coherent (0.000e0 for the three
  lossless variants, 5.18e-3 for TurboQuant)
- Recall@10 = 0.91 for TurboQuant

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
… reads

Three targeted fixes from the third review pass.

## 1. Drop vestigial `session: &VortexSession` parameters

The benchmark always passed `&vortex_bench::SESSION` to every function
that took a session parameter, so the parameter was plumbing with zero
value. More importantly it was *inconsistent* — some internal helpers
(like `extract_query_row`) already hardcoded the global, so the API
suggested a custom session was respected end-to-end when it wasn't.

Removed the `session` parameter from:
- `lib::prepare_variant(prepared, variant)`
- `lib::run_timings(variant_array, query, iterations)`
- `verify::compute_cosine_scores(data, query)`
- `verify::verify_variant(name, array, query, baseline, kind)`
- `recall::measure_recall_at_k(uncompressed, compressed, num_queries, top_k)`

Each function now calls `vortex_bench::SESSION.create_execution_ctx()`
directly where needed. Doc comments note the "uses the global SESSION"
contract explicitly. `execute_cosine`, `decompress_full_scan`, and
`execute_filter` still take `&mut ExecutionCtx` because they're the
low-level primitives that `run_timings` constructs fresh contexts for
on each iteration.

All call sites in `main.rs` and the test modules updated.

## 2. Fix the handrolled double-parquet-read

`main.rs` previously did:

1. Read parquet → run cosine → verify against baseline → bail on fail
2. Call `run_handrolled_baseline_timings`, which reads parquet N more
   times inside its own loop

Total: N+1 parquet reads per dataset per run. The extra read was
wasted work — `cosine_loop` is deterministic, so the scores from
iteration N of the timing loop are bit-identical to a one-shot
pre-timing computation.

Restructured `run_handrolled_baseline_timings` to return a new
`HandrolledBaselineResult { timings, last_scores }` where `last_scores`
is the `Vec<f32>` produced by the final iteration. `main.rs` now:

1. Call `run_handrolled_baseline_timings` → gets timings + last_scores
2. Call `verify_and_report_scores` with those last_scores vs baseline
   → bails on lossless mismatch

Total: N parquet reads per dataset per run. One less read, and the
verification now uses the exact same scores the timing loop computed
rather than a separate pass.

Order reversal note: verification now runs *after* timing instead of
before. The handrolled loop is cheap enough (~5 ms cosine + ~40 ms
parquet read per iter) that running it once even for a broken variant
is acceptable, and the correctness bail still prevents the run from
reporting wrong-answer numbers.

`run_handrolled_baseline_timings` now asserts `iterations >= 1` up
front (previously relied on `last_scores` being populated after the
loop).

## 3. Nullable element type test

`list_to_vector_ext` has a rejection path for `List<f32?>` — element
dtype is a nullable f32 (even when every value is present). Added
`conversions::tests::nullable_element_dtype_is_rejected` to exercise
this path. Passing `Validity::AllValid` to `PrimitiveArray::new::<f32>`
produces the nullable dtype the rejection path is looking for.

## Also: housekeeping

- Dropped `use vortex_bench::SESSION` from the `lib.rs` inner test
  module (was left over from the session-parameter removal).
- Replaced two fully-qualified `vortex::error::vortex_panic!` calls in
  `PreparedDataset::dim()` with a plain `vortex_panic!` sourced from a
  top-level import.
- Refreshed the doc comment on `PreparedDataset` — it previously
  described the struct as carrying an "execution session / context",
  which is no longer accurate.
- Added two new tests for `run_handrolled_baseline_timings`:
  - `run_handrolled_baseline_timings_returns_last_iteration_scores`:
    verifies the `last_scores` contract on a 3-row fixture.
  - `run_handrolled_baseline_timings_panics_on_zero_iterations`:
    regression guard for the iterations>=1 assert.

## Test counts

- vortex-bench: 19 tests (was 18, +1 for nullable_element_dtype)
- vector-search-bench: 22 passing + 1 ignored (was 20+1, +2 for the
  handrolled timing tests)
- vortex-tensor: 222 tests (unchanged)
- Total: 263 passing, 3 ignored

clippy / rustfmt clean. End-to-end smoke run still coherent:
- correctness-max-diff = 0.000e0 for all three lossless variants
- correctness-max-diff = 5.18e-3 for TurboQuant (within 0.2 tolerance)
- Recall@10 for TurboQuant = 0.9267

Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20
Copy link
Copy Markdown
Contributor Author

claude code web fell over so this doesnt even compile

@connortsui20
Copy link
Copy Markdown
Contributor Author

Superseded by #7399

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants