Rust term frequency-inverse document frequency (TF-IDF) tool with pipe-friendly capabilities
  • Rust 88%
  • Shell 12%
Andrew Briscoe 2ba8fe839e
feat(tfidf+viz): add Rust TF-IDF engine and literate 3D similarity viz
**Capabilities**

- Adds a Rust-based `tfidf` CLI (via `src/main.rs` rust-script) that:
  - Builds a binary TF-IDF index from files/dirs (`save`).
  - Lists indexed docs, top terms, and similar docs (`query list|terms|similar`).
  - Uses XDG-aware stopword config keyed by `$TFIDF_APPNAME` and a fast custom hasher.
- Adds a Rust k‑NN graph generator (`generate-graph.rs` in `LIT-SPEC.EXAMPLE.md`) that:
  - Reads the binary TF-IDF index directly.
  - Emits an undirected similarity graph as JSON (`nodes` / `links`).
- Adds HTML + JS 3D visualisation wiring (via litblocks):
  - 3D force-directed graph using `3d-force-graph` + three.js.
  - Bloom controls (strength/radius/threshold) and window-resize handling.
  - Node labels as sprites positioned above nodes.
- Introduces literate specs and shell helpers (`design-docs/litspec-webviz`) for:
  - Building indices, deriving graphs, and serving HTML with `webdoc-pipe`.
  - Defining path→index naming and “index algebra” (union, intersection, diff, cross-corpus edges).
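
The scoring behind `query terms` and `query similar` can be illustrated with a minimal, std-only sketch. The commit does not state which weighting variant the engine uses, so the `tf * ln(N / df)` formula, the whitespace tokeniser, and the function names `term_counts`, `tfidf_vectors`, and `cosine` are all assumptions made for illustration, not the code in `src/main.rs`.

```rust
use std::collections::HashMap;

/// Raw term counts for one document.
/// Assumption: naive lowercase whitespace tokenisation, no stopword filtering.
fn term_counts(text: &str) -> HashMap<String, f64> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0.0) += 1.0;
    }
    counts
}

/// One sparse TF-IDF vector per document, weighted as tf * ln(N / df).
/// Assumption: this is one common variant; the real engine may differ.
fn tfidf_vectors(docs: &[&str]) -> Vec<HashMap<String, f64>> {
    let counts: Vec<HashMap<String, f64>> = docs.iter().map(|d| term_counts(d)).collect();

    // Document frequency: in how many documents does each term occur?
    let mut df: HashMap<&str, f64> = HashMap::new();
    for c in &counts {
        for term in c.keys() {
            *df.entry(term.as_str()).or_insert(0.0) += 1.0;
        }
    }

    let n = docs.len() as f64;
    counts
        .iter()
        .map(|c| {
            c.iter()
                .map(|(term, tf)| {
                    let d = df.get(term.as_str()).copied().unwrap_or(1.0);
                    (term.clone(), tf * (n / d).ln())
                })
                .collect::<HashMap<String, f64>>()
        })
        .collect()
}

/// Cosine similarity between two sparse vectors; 0.0 if either has zero norm.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, w)| b.get(t).map(|v| w * v)).sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|w| w * w).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

With this weighting, a term that occurs in every document gets weight zero, so documents that share only ubiquitous terms score as dissimilar.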

**Operations / Workflows**

- Developer flow:
  `fd → tfidf save <index> → tfidf query list/terms/similar`.
- Viz flow (Rust example):
  `fd → tfidf save docs.index → generate-graph.rs docs.index → graph.json → jq → index.html → webdoc-pipe open`.
- Viz flow (shell):
  `./generate-graph.sh <index> [top-n] [output.json]` → JSON graph with same schema.
- Webdoc loop:
  `litblock viz.html.{1,3} → cat viz.html.1 <graph> viz.html.3 | webdoc-pipe html` then `webdoc-pipe open`.
- Cross-corpus flow:
  Use `get_index_mapping`, `realize_index`, and TSV algebra to manage multiple indices and detect “bridge” documents across corpora.
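
The "index algebra" in the cross-corpus flow can be sketched with ordered sets. The example below assumes `path<TAB>index` rows and, for the bridge check, that each path belongs to exactly one index. The helpers `parse_mappings`, `set_ops`, and `bridge_docs` are hypothetical names for illustration; they are not the `get_index_mapping` or `realize_index` shell helpers from the spec.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Parses `path<TAB>index` rows into index name -> set of paths.
/// Assumption: two tab-separated columns, path first; malformed rows are skipped.
fn parse_mappings(tsv: &str) -> BTreeMap<String, BTreeSet<String>> {
    let mut out: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for line in tsv.lines() {
        if let Some((path, index)) = line.split_once('\t') {
            out.entry(index.to_string())
                .or_default()
                .insert(path.to_string());
        }
    }
    out
}

/// Union, intersection, and difference (a minus b) of two path sets.
fn set_ops(
    a: &BTreeSet<String>,
    b: &BTreeSet<String>,
) -> (BTreeSet<String>, BTreeSet<String>, BTreeSet<String>) {
    (
        a.union(b).cloned().collect(),
        a.intersection(b).cloned().collect(),
        a.difference(b).cloned().collect(),
    )
}

/// Documents with at least one edge to a document in a different index.
/// Edges whose endpoints are missing from `index_of` are ignored.
fn bridge_docs(edges: &[(&str, &str)], index_of: &BTreeMap<&str, &str>) -> BTreeSet<String> {
    let mut bridges = BTreeSet::new();
    for &(a, b) in edges {
        if let (Some(ia), Some(ib)) = (index_of.get(a), index_of.get(b)) {
            if ia != ib {
                bridges.insert(a.to_string());
                bridges.insert(b.to_string());
            }
        }
    }
    bridges
}
```

`BTreeSet` and `BTreeMap` keep results sorted, so the output order is deterministic when the sets are written back out as TSV.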

**Systems & Integration**

- New/updated components:
  - `tfidf` rust-script CLI using Rayon, bincode, and a custom hasher.
  - Rust `generate-graph.rs` and shell `generate-graph.sh`.
  - Web viz stack: `3d-force-graph`, `three`, `UnrealBloomPass`, `webdoc-pipe`.
- Index resolution precedence: `--index` → `$TFIDF_INDEX` → XDG data path → `./tfidf.index`.
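
The precedence chain above could be expressed as a small pure function. This is a sketch under stated assumptions: the XDG layout `$XDG_DATA_HOME/<app>/tfidf.index` is a guess, as is treating empty values as unset. The commit does not say whether the tool checks that the XDG file exists before falling back, so the sketch leaves that decision to the caller via the `Option`. `resolve_index` and `xdg_candidate` are hypothetical names.

```rust
use std::path::PathBuf;

/// Resolves the index path in the documented order:
/// `--index` flag, then `$TFIDF_INDEX`, then an XDG data path, then `./tfidf.index`.
/// Inputs are passed in explicitly so the function stays pure and testable.
fn resolve_index(
    cli_flag: Option<&str>,
    env_index: Option<&str>,
    xdg_candidate: Option<PathBuf>,
) -> PathBuf {
    if let Some(p) = cli_flag {
        return PathBuf::from(p);
    }
    // Assumption: an empty environment value counts as unset.
    if let Some(p) = env_index.filter(|s| !s.is_empty()) {
        return PathBuf::from(p);
    }
    if let Some(p) = xdg_candidate {
        return p;
    }
    PathBuf::from("./tfidf.index")
}

/// Hypothetical XDG layout: `$XDG_DATA_HOME/<app>/tfidf.index`, using the
/// spec's default of `$HOME/.local/share` when `XDG_DATA_HOME` is unset.
fn xdg_candidate(xdg_data_home: Option<&str>, home: Option<&str>, app: &str) -> Option<PathBuf> {
    let base = match xdg_data_home.filter(|s| !s.is_empty()) {
        Some(dir) => PathBuf::from(dir),
        None => PathBuf::from(home?).join(".local/share"),
    };
    Some(base.join(app).join("tfidf.index"))
}
```

A caller would typically feed in `std::env::var("TFIDF_INDEX").ok().as_deref()` and the parsed `--index` argument, and pass `None` for the XDG candidate if it should be skipped.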

**Data & Information**

- Binary index: `Index { docs: Vec<Document>, term_df }` with TF-IDF vectors derived per doc.
- Graph JSON: `nodes[{id,name,group}], links[{source,target,value}]`, with stable ordering.
- TSV structures for `(path, index)` mappings and edges, enabling algebraic set operations.
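
To make the `links[{source,target,value}]` shape and the "stable ordering" property concrete, here is a minimal sketch that collapses directed nearest-neighbour pairs into undirected links. It is not the code in `generate-graph.rs`: the `Link` struct, the helper names, the keep-the-higher-score rule, and the hand-rolled JSON writer (which escapes only quotes and backslashes) are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// One entry of the `links` array: node ids plus a similarity score.
#[derive(Debug, Clone, PartialEq)]
struct Link {
    source: String,
    target: String,
    value: f64,
}

/// Collapses directed (doc, neighbour, score) triples into undirected links.
/// Each pair is keyed as (smaller id, larger id), so A->B and B->A merge;
/// the BTreeMap yields links sorted by that key, giving a stable order.
fn undirected_links(pairs: &[(&str, &str, f64)]) -> Vec<Link> {
    let mut best: BTreeMap<(String, String), f64> = BTreeMap::new();
    for &(a, b, score) in pairs {
        if a == b {
            continue; // drop self-loops
        }
        let key = if a < b {
            (a.to_string(), b.to_string())
        } else {
            (b.to_string(), a.to_string())
        };
        // Assumption: when both directions are present, keep the higher score.
        let slot = best.entry(key).or_insert(score);
        if score > *slot {
            *slot = score;
        }
    }
    best.into_iter()
        .map(|((source, target), value)| Link { source, target, value })
        .collect()
}

/// Minimal JSON string literal: escapes backslashes and double quotes only.
fn json_str(s: &str) -> String {
    format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
}

/// Serialises links as a JSON array in the order given.
fn links_to_json(links: &[Link]) -> String {
    let items: Vec<String> = links
        .iter()
        .map(|l| {
            format!(
                "{{\"source\":{},\"target\":{},\"value\":{}}}",
                json_str(&l.source),
                json_str(&l.target),
                l.value
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

A real generator would more likely use a JSON library for serialisation; the manual writer here only keeps the sketch dependency-free.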

**Project Impact**

- Establishes a reproducible, documented pipeline from text corpora to interactive 3D similarity graphs.
- Consolidates architecture/docs under `design-docs`, including a reference literate spec.
- Prepares the ground for:
  - Packaging `tfidf` as a conventional binary.
  - Rich multi-index analysis using the “index algebra” patterns.

**Issues / Risks**

- `src/main.rs` is rust-script style but documented as `src/bin/tfidf.rs`; needs layout alignment.
- `generate-graph.sh` hardcodes `/usr/local/bin/tfidf` and processes documents serially; the Rust version is preferred for performance and portability.
- Some referenced TSVs (e.g. `labeled_mappings.tsv`) are conceptual and need concrete generators.

---
2026-01-04 13:54:14 +08:00
| Name | Last commit | Date |
|---|---|---|
| design-docs | feat(tfidf+viz): add Rust TF-IDF engine and literate 3D similarity viz | 2026-01-04 13:54:14 +08:00 |
| docs | feat(tfidf+viz): add Rust TF-IDF engine and literate 3D similarity viz | 2026-01-04 13:54:14 +08:00 |
| src | feat(tfidf+viz): add Rust TF-IDF engine and literate 3D similarity viz | 2026-01-04 13:54:14 +08:00 |
| tools | feat(tooling): bootstrap tfidf-rs crate with local install and context scripts | 2026-01-04 13:45:21 +08:00 |
| .gitignore | feat(cli,docs,tooling): add composable tfidf CLI, mdBook docs, and AI commit tooling | 2026-01-01 14:42:16 +08:00 |
| Cargo.toml | feat(tooling): bootstrap tfidf-rs crate with local install and context scripts | 2026-01-04 13:45:21 +08:00 |
| install.sh | feat(tooling): bootstrap tfidf-rs crate with local install and context scripts | 2026-01-04 13:45:21 +08:00 |
| tfidf.rs | feat(cli,docs,tooling): add composable tfidf CLI, mdBook docs, and AI commit tooling | 2026-01-01 14:42:16 +08:00 |