Rust term frequency-inverse document frequency (TF-IDF) CLI with pipe-friendly capabilities
- Rust 88%
- Shell 12%
**Capabilities**
- Adds a Rust-based `tfidf` CLI (via `src/main.rs` rust-script) that:
  - Builds a binary TF-IDF index from files/dirs (`save`).
  - Lists indexed docs, top terms, and similar docs (`query list|terms|similar`).
  - Uses XDG-aware stopword config keyed by `$TFIDF_APPNAME` and a fast custom hasher.
- Adds a Rust k-NN graph generator (`generate-graph.rs` in `LIT-SPEC.EXAMPLE.md`) that:
  - Reads the binary TF-IDF index directly.
  - Emits an undirected similarity graph as JSON (`nodes` / `links`).
- Adds HTML + JS 3D visualisation wiring (via litblocks):
  - 3D force-directed graph using `3d-force-graph` + three.js.
  - Bloom controls (strength/radius/threshold) and window-resize handling.
  - Node labels as sprites positioned above nodes.
- Introduces literate specs and shell helpers (`design-docs/litspec-webviz`) for:
  - Building indices, deriving graphs, and serving HTML with `webdoc-pipe`.
  - Defining path→index naming and “index algebra” (union, intersection, diff, cross-corpus edges).
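
The k-NN graph step can be sketched roughly as follows. This is a minimal illustration, not the code in `generate-graph.rs`: it assumes cosine similarity over sparse term-weight maps, numeric node ids, and a constant `group` of 0, and the function names (`cosine`, `knn_edges`, `graph_json`) are hypothetical. It takes already-derived vectors as input instead of reading the binary index.

```rust
use std::collections::{BTreeMap, HashMap};

/// Cosine similarity between two sparse term-weight vectors.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a.iter().filter_map(|(t, wa)| b.get(t).map(|wb| wa * wb)).sum();
    let na = a.values().map(|w| w * w).sum::<f64>().sqrt();
    let nb = b.values().map(|w| w * w).sum::<f64>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Link each document to its `k` most similar peers. Edges are keyed by
/// (lower index, higher index), so A-B and B-A collapse into one undirected
/// edge, and the BTreeMap keeps the output order deterministic.
fn knn_edges(vectors: &[HashMap<String, f64>], k: usize) -> BTreeMap<(usize, usize), f64> {
    let mut edges = BTreeMap::new();
    for i in 0..vectors.len() {
        let mut scored: Vec<(usize, f64)> = (0..vectors.len())
            .filter(|&j| j != i)
            .map(|j| (j, cosine(&vectors[i], &vectors[j])))
            .filter(|&(_, s)| s > 0.0) // skip pairs with no shared terms
            .collect();
        // Highest similarity first; ties broken by document position.
        scored.sort_by(|x, y| y.1.total_cmp(&x.1).then(x.0.cmp(&y.0)));
        for (j, s) in scored.into_iter().take(k) {
            edges.insert((i.min(j), i.max(j)), s);
        }
    }
    edges
}

/// Render `nodes` / `links` JSON by hand. No string escaping is done, so
/// this assumes names without quotes or backslashes.
fn graph_json(names: &[&str], edges: &BTreeMap<(usize, usize), f64>) -> String {
    let nodes: Vec<String> = names.iter().enumerate()
        .map(|(i, n)| format!(r#"{{"id":{i},"name":"{n}","group":0}}"#))
        .collect();
    let links: Vec<String> = edges.iter()
        .map(|(&(s, t), v)| format!(r#"{{"source":{s},"target":{t},"value":{v:.3}}}"#))
        .collect();
    format!(r#"{{"nodes":[{}],"links":[{}]}}"#, nodes.join(","), links.join(","))
}
```

Calling `knn_edges(&vectors, k)` and passing the result to `graph_json` yields a single JSON object in the `nodes` / `links` shape described above. A real generator would more likely use a JSON library for escaping.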
**Operations / Workflows**
- Developer flow:
  `fd → tfidf save <index> → tfidf query list/terms/similar`.
- Viz flow (Rust example):
  `fd → tfidf save docs.index → generate-graph.rs docs.index → graph.json → jq → index.html → webdoc-pipe open`.
- Viz flow (shell):
  `./generate-graph.sh <index> [top-n] [output.json]` → JSON graph with same schema.
- Webdoc loop:
  `litblock viz.html.{1,3} → cat viz.html.1 <graph> viz.html.3 | webdoc-pipe html` then `webdoc-pipe open`.
- Cross-corpus flow:
  Use `get_index_mapping`, `realize_index`, and TSV algebra to manage multiple indices and detect “bridge” documents across corpora.
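
The TSV algebra in the cross-corpus flow can be illustrated with plain set operations. The sketch below is not the `get_index_mapping` or `realize_index` helper (those are shell functions in the literate spec); it assumes rows of the form `path<TAB>index`, and `parse_mappings` and `algebra` are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Parse `path<TAB>index` rows into a map of index name -> set of paths.
/// Blank lines and rows without a tab are skipped.
fn parse_mappings(tsv: &str) -> BTreeMap<String, BTreeSet<String>> {
    let mut by_index: BTreeMap<String, BTreeSet<String>> = BTreeMap::new();
    for line in tsv.lines().filter(|l| !l.trim().is_empty()) {
        if let Some((path, index)) = line.split_once('\t') {
            by_index
                .entry(index.to_string())
                .or_default()
                .insert(path.to_string());
        }
    }
    by_index
}

/// Return (union, intersection, difference a - b) of two path sets,
/// each as a sorted list.
fn algebra(
    a: &BTreeSet<String>,
    b: &BTreeSet<String>,
) -> (Vec<String>, Vec<String>, Vec<String>) {
    (
        a.union(b).cloned().collect(),
        a.intersection(b).cloned().collect(),
        a.difference(b).cloned().collect(),
    )
}
```

Under this reading, a path that appears in the intersection of two indices belongs to both corpora. Whether that matches the spec's notion of a “bridge” document, which may instead be defined through cross-corpus edges, is left open here.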
**Systems & Integration**
- New/updated components:
  - `tfidf` rust-script CLI using Rayon, bincode, and a custom hasher.
  - Rust `generate-graph.rs` and shell `generate-graph.sh`.
  - Web viz stack: `3d-force-graph`, `three`, `UnrealBloomPass`, `webdoc-pipe`.
- Index resolution precedence: `--index` → `$TFIDF_INDEX` → XDG data path → `./tfidf.index`.
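
The precedence chain above could be expressed as a small pure function. This is a sketch under assumptions: the function and parameter names are hypothetical, the caller is expected to read `--index` and `$TFIDF_INDEX` and to decide whether the XDG data path applies, and treating an empty environment value as unset is a choice made here, not something the source states.

```rust
use std::path::PathBuf;

/// Resolve the index path: `--index`, then `$TFIDF_INDEX`, then the XDG
/// data path, then `./tfidf.index`. Inputs are passed in explicitly rather
/// than read from the process environment, which keeps the logic testable.
fn resolve_index(
    cli_index: Option<&str>,         // value of `--index`, if given
    env_index: Option<&str>,         // value of `$TFIDF_INDEX`, if set
    xdg_data_index: Option<PathBuf>, // XDG data path candidate, if applicable
) -> PathBuf {
    if let Some(p) = cli_index {
        return PathBuf::from(p);
    }
    // Assumption: an empty environment variable counts as unset.
    if let Some(p) = env_index.filter(|s| !s.is_empty()) {
        return PathBuf::from(p);
    }
    if let Some(p) = xdg_data_index {
        return p;
    }
    PathBuf::from("./tfidf.index")
}
```

A caller might pass `std::env::var("TFIDF_INDEX").ok().as_deref()` as the second argument.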
**Data & Information**
- Binary index: `Index { docs: Vec<Document>, term_df }` with TF-IDF vectors derived per doc.
- Graph JSON: `nodes[{id,name,group}], links[{source,target,value}]`, with stable ordering.
- TSV structures for `(path, index)` mappings and edges, enabling algebraic set operations.
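
To make the index shape concrete, here is a minimal sketch of how TF-IDF vectors might be derived from such a structure. Only the names `Index`, `Document`, `docs`, and `term_df` come from the description above. The field types, the `path` and `term_counts` fields, whitespace tokenisation, and the `tf * ln(N / df)` weighting are all assumptions, and the real CLI may differ (for example in stopword handling or smoothing).

```rust
use std::collections::HashMap;

// Hypothetical layout: the source only names `docs` and `term_df`.
struct Document {
    path: String,
    term_counts: HashMap<String, u32>, // raw term counts (assumed field)
}

struct Index {
    docs: Vec<Document>,
    term_df: HashMap<String, u32>, // number of documents containing each term
}

/// Build an index from (path, text) pairs using lowercase whitespace tokens.
fn build_index(inputs: &[(&str, &str)]) -> Index {
    let mut docs = Vec::new();
    let mut term_df: HashMap<String, u32> = HashMap::new();
    for (path, text) in inputs {
        let mut term_counts: HashMap<String, u32> = HashMap::new();
        for tok in text.split_whitespace() {
            *term_counts.entry(tok.to_lowercase()).or_insert(0) += 1;
        }
        // Each distinct term in this document adds one to its document frequency.
        for term in term_counts.keys() {
            *term_df.entry(term.clone()).or_insert(0) += 1;
        }
        docs.push(Document { path: path.to_string(), term_counts });
    }
    Index { docs, term_df }
}

/// Derive one document's TF-IDF vector with tf = count / document length
/// and idf = ln(N / df).
fn tfidf_vector(index: &Index, doc: &Document) -> HashMap<String, f64> {
    let n = index.docs.len() as f64;
    let total: u32 = doc.term_counts.values().sum();
    doc.term_counts
        .iter()
        .map(|(term, &count)| {
            let tf = count as f64 / total as f64;
            let df = index.term_df[term] as f64;
            (term.clone(), tf * (n / df).ln())
        })
        .collect()
}
```

With this weighting, a term that occurs in every document gets a weight of zero, so it contributes nothing to similarity scores.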
**Project Impact**
- Establishes a reproducible, documented pipeline from text corpora to interactive 3D similarity graphs.
- Consolidates architecture/docs under `design-docs`, including a reference literate spec.
- Prepares the ground for:
  - Packaging `tfidf` as a conventional binary.
  - Rich multi-index analysis using the “index algebra” patterns.
**Issues / Risks**
- `src/main.rs` is rust-script style but documented as `src/bin/tfidf.rs`; needs layout alignment.
- `generate-graph.sh` hardcodes `/usr/local/bin/tfidf` and is serial; Rust version is preferred for performance/portability.
- Some referenced TSVs (e.g. `labeled_mappings.tsv`) are conceptual and need concrete generators.
---
| Name |
|---|
| design-docs |
| docs |
| src |
| tools |
| .gitignore |
| Cargo.toml |
| install.sh |
| tfidf.rs |