This document crystallizes everything into a coherent, architecturally grounded specification — not just a README, but a context dump for future evolution, grounded in category theory, resource flow, and Unix philosophy.
It serves as:
- ✅ User Manual — What each flag does, with examples.
- ✅ Architectural Blueprint — Wire diagrams, morphisms, colimits, presheaves.
- ✅ Deployment Spec — XDG, Arch PKGBUILD, pipeline integration.
- ✅ Design Rationale — Why we chose defaults, what’s deferred, what’s essential.
- ✅ Theory of Coherence — How relationships remain valid under transformation.
📜 tfidf — Structured Semantic Inference Engine
“From bytes to meaning — without magic.”
A fast, safe, multicore, std-only Rust toolkit for exploratory semantic analysis of file collections. Computes TF-IDF vectors, supports topological slicing, similarity search, and term discovery — all composable via Unix pipes.
🧭 Table of Contents
- User Guide — Flags, Defaults, Examples
- Architecture & Theory — Morphisms, Colimits, Presheaves
- Deployment & Integration — XDG, PKGBUILD, Pipelines
- Design Rationale — Defaults, Deferred Config, Minimalism
- Future Evolution — Clustering, Topic Modeling, Natural Transformations
🧭 1. User Guide — tfidf-calc & tfidf-explore
🚀 Core Philosophy
- `tfidf-calc`: Pure function `Files → Vectors`. Side effect: writes TSV to stdout.
- `tfidf-explore`: Pure function `Vectors + Slice → Insights`. Reads TSV from stdin or file.
Everything composes. Nothing is stateful (except optional XDG cache).
📥 tfidf-calc — Generate TF-IDF Vectors
tfidf-calc [FLAGS] > vectors.tsv
Flags & Effects
| Flag | Type | Default | Effect | Resource Wire |
|---|---|---|---|---|
| `--root DIR` | Path | `.` | Sets root directory to scan. | `RootPath → FileSet` |
| `--exclude GLOB` | String | `[]` | Exclude files matching glob (repeatable). | `FileSet → FilteredFileSet` |
| `--idf MODE` | Enum | `frac` | IDF formula: `frac`, `smooth`, `prob`, `log`. Changes score distribution. | `DocFreq → IDFVector` |
| `--scale N` | u128 | `100000` | Multiplier for final score. Larger = more precision, risk of overflow. | `RawScore → ScaledScore` |
| `--jobs N` | usize | `auto` | Number of worker threads. `auto` = CPU cores. | `FileSet → ConcurrentWorkers → DocStream` |
| `--stopwords FILE` | Path | None | File with stopwords (one per line). Filters tokens. | `TokenStream → FilteredTokenStream` |
| `--include-hidden` | Bool | `false` | Include dotfiles (`.gitignore`, etc.). | `FilePath → Bool` |
| `--df-base MODE` | Enum | `all` | Document count basis: `all` or `nonempty`. Affects IDF denominator. | `DocSet → D` |
| `--status ADDR` | String | None | Serve live status at ADDR (e.g., `127.0.0.1:9999`). Side effect only. | `Counters → TCPStream` |
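The exact formulas behind each `--idf` mode are not pinned down above; below is a minimal sketch assuming the conventional definitions, shown with `f64` for brevity even though the tool itself emits integer scores via `--scale`. The smoothing constants are assumptions, not the shipped behavior.

```rust
/// Sketch of the four IDF modes, assuming conventional definitions.
/// `n` = document count (per --df-base), `df` = documents containing the term.
/// The shipped binary may use different smoothing constants.
fn idf(mode: &str, n: f64, df: f64) -> f64 {
    match mode {
        "frac" => n / df,                        // plain ratio, no logarithm
        "smooth" => (1.0 + n / (1.0 + df)).ln(), // damped, never negative
        "prob" => ((n - df) / df).ln(),          // probabilistic IDF
        "log" => (n / df).ln(),                  // classic log IDF
        _ => panic!("unknown idf mode"),
    }
}
```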
Output Schema (TSV)
term file_path term_freq doc_length doc_freq tf idf score formula
This is your document-term matrix. Pipe it to `tfidf-explore` or load it into Pandas.
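For programmatic consumers, each line maps onto a small record. Here is a sketch of parsing one row with std only; the field names follow the schema above, while the numeric types are assumptions:

```rust
/// One row of the tfidf-calc output, matching the TSV schema above.
/// The numeric types are assumptions; `formula` labels the IDF mode used.
struct Row<'a> {
    term: &'a str,
    file_path: &'a str,
    term_freq: u64,
    doc_length: u64,
    doc_freq: u64,
    tf: f64,
    idf: f64,
    score: u128, // assumed integer after --scale
    formula: &'a str,
}

fn parse_row(line: &str) -> Option<Row<'_>> {
    let mut f = line.split('\t');
    Some(Row {
        term: f.next()?,
        file_path: f.next()?,
        term_freq: f.next()?.parse().ok()?,
        doc_length: f.next()?.parse().ok()?,
        doc_freq: f.next()?.parse().ok()?,
        tf: f.next()?.parse().ok()?,
        idf: f.next()?.parse().ok()?,
        score: f.next()?.parse().ok()?,
        formula: f.next()?,
    })
}
```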
Examples
# Basic: all .rs files, log IDF, high scale
tfidf-calc --root ./src --exclude "target,node_modules" --idf log --scale 1000000 > rust.tsv
# With stopwords
tfidf-calc --stopwords ./english.txt > filtered.tsv
# Live monitoring
tfidf-calc --status 127.0.0.1:9999 > vectors.tsv &
curl 127.0.0.1:9999 # in another terminal
🔍 tfidf-explore — Query & Discover
tfidf-explore vectors.tsv [QUERY_FLAGS]
Flags & Effects
| Flag | Type | Default | Effect | Resource Wire |
|---|---|---|---|---|
| `--slice QUERY` | String | None | Filter documents: `ext:rs dir:src !test`. | `DocVector → SlicedDocVector` |
| `--top-terms N` | usize | None | Show top N terms by aggregate score in slice. | `SlicedDocVector → RankedTermList` |
| `--similar-to PATH` | Path | None | Find files similar to PATH using cosine similarity. | `TargetDoc × DocVector → SimilarityRankedDocList` |
| (no flags) | — | — | Show global summary, top terms, top docs. | `DocVector → SummaryReport` |
Slice Query Syntax
- `ext:rs,md` — Only `.rs` or `.md` files.
- `dir:src,test` — Only under `src/` or `test/`.
- `!target,!node_modules` — Exclude these paths.
- `glob:*.conf` — Include only files matching glob.
Combines with AND logic. Order-independent.
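A sketch of how such a query could be matched against a file path; this is illustrative only (the `glob:` clause is omitted) and not the parser shipped in `tfidf-explore`:

```rust
/// Illustrative matcher for the slice query mini-language.
/// Clauses combine with AND; each clause's comma list combines with OR.
fn matches_slice(query: &str, path: &str) -> bool {
    query.split_whitespace().all(|clause| {
        if let Some(exts) = clause.strip_prefix("ext:") {
            exts.split(',').any(|e| path.ends_with(&format!(".{e}")))
        } else if let Some(dirs) = clause.strip_prefix("dir:") {
            dirs.split(',').any(|d| path.split('/').any(|seg| seg == d))
        } else if let Some(excl) = clause.strip_prefix('!') {
            excl.split(',').all(|x| !path.contains(x))
        } else {
            true // unknown clause: ignore rather than reject
        }
    })
}
```

For example, `matches_slice("ext:rs dir:src !test", "src/main.rs")` returns `true`.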
Examples
# Top 10 terms in Rust files
tfidf-explore rust.tsv --slice "ext:rs" --top-terms 10
# Files similar to main.rs
tfidf-explore rust.tsv --similar-to "src/main.rs"
# Compare configs vs source
tfidf-calc --root . --exclude "target" > all.tsv
tfidf-explore all.tsv --slice "glob:*.toml,*.yaml" --top-terms 5 > config_terms.txt
tfidf-explore all.tsv --slice "ext:rs" --top-terms 5 > rust_terms.txt
comm -12 <(cut -f1 config_terms.txt | sort) <(cut -f1 rust_terms.txt | sort) # shared terms (comm needs sorted input)
🧱 2. Architecture & Theory — Morphisms, Colimits, Presheaves
🎯 Goal: Minimize Coupling, Maximize Coherence
We model the system as a category where:
- Objects: `FileSet`, `TokenStream`, `DocVector`, `SlicedVector`, `RankedTermList`, `SimilarityList`
- Morphisms: `gather_files`, `tokenize`, `compute_tfidf`, `slice`, `top_terms`, `similar_to`
A morphism is valid if it preserves structure under transformation.
🔗 Resource Theory Wire Diagram
```
[RootPath] ──(gather_files)──> [FileSet]
[FileSet] ──(exclude_globs)──> [FilteredFileSet]
[FilteredFileSet] ──(analyse_file)──> [DocStream]
[StopwordFile] ──(read_stopwords)──> (stopword filter, applied on the next wire)
[DocStream] ──(tokenize + stopword filter)──> [TokenStream]
[TokenStream] ──> [TermFreqMap]
[TokenStream] ──(compute_df)──> [DocFreqMap]
[TermFreqMap] + [DocFreqMap] ──(score)──> [DocVector (TSV)]
[DocVector] ──(slice)──> [SlicedVector] ──(top_terms)──> [RankedTermList]
[DocVector] ──(similar_to)──> [SimilarityList]
```
Each wire is a resource. Each box is a pure, composable morphism.
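In code, each box corresponds roughly to a plain function whose signature names only its input and output wires. A sketch with placeholder types follows (bodies elided; these are not the actual internal APIs):

```rust
use std::path::{Path, PathBuf};

// Placeholder types standing in for the wires in the diagram.
struct FileSet(Vec<PathBuf>);
struct DocVector(Vec<String>); // one TSV row per entry
struct SlicedVector(Vec<String>);
struct RankedTermList(Vec<(String, u128)>);

// Each morphism names only its input and output wires; no shared state.
fn gather_files(root: &Path) -> FileSet {
    todo!("walk {root:?}")
}
fn slice(vectors: &DocVector, query: &str) -> SlicedVector {
    todo!("filter {} rows by {query}", vectors.0.len())
}
fn top_terms(sliced: &SlicedVector, n: usize) -> RankedTermList {
    todo!("aggregate {} rows, keep top {n}", sliced.0.len())
}
```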
🧮 Colimits & Optimized Architecture
- Colimit: The "gluing" of `DocVector` + `Slice` + `Query` into a single `Insight`.
- We minimize coupling by ensuring each morphism only depends on its input wire.
- We maximize coherence by ensuring the output of one is valid input to the next — no side channels.

Example: the `slice` morphism only needs `file_path` from `DocVector`. It doesn't care about `score` or `idf`. This is structural preservation.
🌀 Natural Transformations & Frames of Reference
- A natural transformation here is changing the `idf_mode` or `scale`.
- The relationship "term A is more important than term B" should hold across transformations (if `idf_mode` changes, the ranking might change — but that's the point of exploration).
- Coherence across frames: whether you slice before or after computing vectors, the meaning of "top terms in src/" is preserved — even if the numbers differ (due to global vs. local IDF).
⚠️ Important: Global IDF (computed once) is not coherent under slicing. For true coherence, recompute IDF per slice (future work).
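A hypothetical worked example of the caveat, in `frac` mode:

```rust
fn main() {
    // Hypothetical corpus: global IDF vs. the IDF a slice would get if recomputed.
    let (n_global, df_global) = (1000.0_f64, 50.0); // whole corpus
    let (n_slice, df_slice) = (60.0_f64, 40.0);     // the src/ slice only
    assert_eq!(n_global / df_global, 20.0); // IDF baked into the cached TSV
    assert_eq!(n_slice / df_slice, 1.5);    // IDF under a per-slice recomputation
}
```

Slicing a globally scored TSV keeps the 20; only a recomputation yields the 1.5.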
🧬 Presheaf as Homset Generator
- Define objects inferentially: a "document" is anything that can be tokenized and has a `file_path`.
- The presheaf `F(U)` = "the set of all valid queries on slice U".
- Homset generator: for any two slices U, V, `Hom(U, V)` is the set of morphisms (e.g., `intersect`, `diff`) that transform insights from U to V.

This lets us define "clustering" or "topic modeling" later as new morphisms on the same objects — without changing the core.
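A minimal sketch of what two such homset elements could look like, treating an insight as a set of notable terms; the names `Insight`, `intersect`, and `diff` here are illustrative, not part of the current binaries:

```rust
use std::collections::BTreeSet;

/// An insight over a slice, reduced here to its set of notable terms.
type Insight = BTreeSet<String>;

/// Two elements of Hom(U, V): morphisms that carry insights between slices.
fn intersect(u: &Insight, v: &Insight) -> Insight {
    u.intersection(v).cloned().collect()
}

fn diff(u: &Insight, v: &Insight) -> Insight {
    u.difference(v).cloned().collect()
}
```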
📦 3. Deployment & Integration — XDG, PKGBUILD, Pipelines
🏠 XDG Base Directory Compliance
Cache vectors and terms in:
$XDG_CACHE_HOME/tfidf/ # default: ~/.cache/tfidf/
- `vectors/` — Cached TSV outputs, keyed by a hash of root + flags.
- `terms/` — Term ID mappings for compact mode (future).

This enables `tfidf-explore` to be near-instant on repeated queries.
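A sketch of how a cache entry could be keyed; `cache_path` is a hypothetical helper and the hashing scheme is an assumption, not the shipped behavior:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Hypothetical cache-key builder: hash the root plus the flag string.
fn cache_path(root: &str, flags: &str) -> PathBuf {
    let mut h = DefaultHasher::new();
    root.hash(&mut h);
    flags.hash(&mut h);
    let base = std::env::var("XDG_CACHE_HOME")
        .unwrap_or_else(|_| format!("{}/.cache", std::env::var("HOME").unwrap_or_default()));
    PathBuf::from(base)
        .join("tfidf/vectors")
        .join(format!("{:016x}.tsv", h.finish()))
}
```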
🐧 Arch Linux PKGBUILD
# PKGBUILD for tfidf
pkgname=tfidf-git
pkgver=0.1.0
pkgrel=1
pkgdesc="Fast, multicore TF-IDF calculator and explorer for file collections"
arch=('x86_64')
url="https://github.com/yourname/tfidf"
license=('MIT')
depends=()
makedepends=('cargo' 'git')
source=("git+https://github.com/yourname/tfidf.git")
sha256sums=('SKIP')
pkgver() {
cd "$srcdir/tfidf"
git describe --tags --long | sed 's/^v//;s/\([^-]*-g\)/r\1/;s/-/./g'
}
build() {
cd "$srcdir/tfidf"
cargo build --release
}
package() {
cd "$srcdir/tfidf"
install -Dm755 "target/release/tfidf-calc" "$pkgdir/usr/bin/tfidf-calc"
install -Dm755 "target/release/tfidf-explore" "$pkgdir/usr/bin/tfidf-explore"
install -Dm644 "README.md" "$pkgdir/usr/share/doc/tfidf/README.md"
}
Install with:
makepkg -si
🔄 Pipeline Integration
# Full pipeline: compute → slice → top terms → visualize
tfidf-calc --root ./project --idf log --scale 1000000 \
| tfidf-explore --slice "ext:rs dir:src" --top-terms 20 \
| awk '{print $1}' \
| visidata # or your favorite viz tool
TSV is the universal interchange format. It works with `awk`, `sort`, `join`, `jq -R`, Python, and R.
🧠 4. Design Rationale — Defaults, Deferred Config, Minimalism
✅ Why These Defaults?
- `--include-hidden=false`: Safety first. Don't accidentally index `.env` or `.git`.
- `--idf=frac`: Simple, intuitive, no logs or floats.
- `--scale=100000`: Good precision for most corpora, avoids overflow.
- `--jobs=auto`: Maximize hardware utilization.
- `--df-base=all`: Simpler and more predictable than excluding empty docs.
Defaults are chosen for safety, simplicity, and performance — not theoretical purity.
📄 Deferred: Config File
No config file yet. Why?
- Unix Philosophy: Flags and env vars are composable and scriptable.
- YAGNI: You ain’t gonna need it — until you have 10 projects with complex slices.
- Future: When needed, config will live in `$XDG_CONFIG_HOME/tfidf/config.toml`.
# Future config.toml
[default]
root = "."
exclude = ["target", "node_modules"]
idf = "log"
scale = 1000000
[profiles.rust]
root = "./src"
include = ["*.rs"]
exclude = ["target"]
⚡ Performance Guarantees
- Multicore: File I/O and tokenization parallelized.
- Zero-copy where possible: `String` reuse, `&str` keys in maps.
- Buffered I/O: 256 KB buffers for reads and writes.
- Fast hasher: a custom `FastHasher` avoids SipHash overhead.
Handles 100K+ files, 10M+ tokens, in seconds to minutes.
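The actual `FastHasher` is not reproduced here; a plausible std-only stand-in is an FNV-1a hasher, shown below as a sketch of the idea (trading DoS resistance for speed on short term keys):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// FNV-1a: a tiny non-cryptographic hash, illustrating the FastHasher idea.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
}

type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;
```

Usage: `let mut counts: FastMap<&str, u64> = FastMap::default();`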
🔮 5. Future Evolution — Clustering, Topic Modeling, Natural Transformations
🧩 Clustering as a Morphism
// Future: tfidf-cluster
tfidf-calc ... | tfidf-cluster --k 5 --output clusters.tsv
- Input: `DocVector`
- Output: `doc_id<TAB>cluster_id`
- Algorithm: K-means or HDBSCAN on TF-IDF vectors.
A new morphism in our category — composes with existing ones.
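A sketch of the assignment step such a tool might perform, assuming dense per-document TF-IDF vectors and plain k-means; the names and layout are illustrative:

```rust
/// Squared Euclidean distance between two dense TF-IDF vectors.
fn dist2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// One k-means assignment step: each document joins its nearest centroid.
fn assign(docs: &[Vec<f64>], centroids: &[Vec<f64>]) -> Vec<usize> {
    docs.iter()
        .map(|doc| {
            let (mut best, mut best_d) = (0usize, f64::INFINITY);
            for (i, c) in centroids.iter().enumerate() {
                let d = dist2(doc, c);
                if d < best_d {
                    best = i;
                    best_d = d;
                }
            }
            best
        })
        .collect()
}
```

The centroid-update step and the HDBSCAN variant are left out; the point is that the new tool consumes the same `DocVector` wire.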
📚 Topic Modeling as a Natural Transformation
- Transform `DocVector → DocTopicMatrix` via LDA or NMF.
- Natural because it preserves document-term structure while adding latent dimensions.
This is a functor from the category of TF-IDF vectors to the category of topic models.
♻️ Coherence Under Refactoring
If we refactor `Document` to add a `meta: HashMap<String, String>` field, the `similar_to` morphism still works — it only uses `term_freq` and `path`.
Structural preservation: The “neighbor” relationship (cosine similarity) is defined by term overlap, not internal fields.
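A sketch of that similarity, assuming each document is reduced to a sparse term-weight map; any extra fields on `Document` simply never enter the computation:

```rust
use std::collections::HashMap;

/// Cosine similarity over sparse term-weight maps: only shared terms
/// contribute to the dot product, so added struct fields are irrelevant.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(term, wa)| b.get(term).map(|wb| wa * wb))
        .sum();
    let norm = |m: &HashMap<String, f64>| m.values().map(|w| w * w).sum::<f64>().sqrt();
    let (na, nb) = (norm(a), norm(b));
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}
```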
✅ Conclusion — A Coherent, Composable Foundation
You now have:
- A robust, documented toolset (`tfidf-calc`, `tfidf-explore`).
- A formal architecture grounded in category theory and resource flow.
- A deployment strategy (XDG, PKGBUILD, pipelines).
- A design rationale that prioritizes safety, performance, and Unix composability.
- A roadmap for future morphisms (clustering, topics) that preserve coherence.
This is not just code — it’s a semantic inference engine built for exploration, composition, and emergence.
“The Tao gave birth to machine language. Machine language gave birth to the assembler. The assembler gave birth to the compiler. Now there are ten thousand languages. Each language has its purpose, however humble. Each language expresses the Yin and Yang of software. Each language has its place within the Tao.” — The Tao of Programming, 5.1
Let the composition begin.
TF-IDF Calc: Command Line Specification
Command
tfidf-calc [FLAGS] [OPTIONS] [--] [PATH]...
Positional Arguments
[PATH]...: One or more root directories or files to process. If omitted, defaults to current directory (.).
Flags (Boolean)
- `-h`, `--help`: Print help.
- `-V`, `--version`: Print version.
- `-H`, `--hidden`: Include hidden files and directories. (Default: false)
- `-I`, `--no-ignore`: Do not respect `.gitignore`, `.ignore`, etc. (Default: respects them)
- `-s`, `--skip-empty`: Skip files that contain zero tokens after tokenization.
- `-S`, `--stream`: Enable streaming, line-delimited output (ideal for `fzf`).
Options (Key-Value)
- `-t`, `--type <filetype>`: Filter by file type: `file`, `directory`, `symlink`. (Repeatable, e.g., `-t f -t l`)
- `-e`, `--extension <ext>`: Filter by file extension. (Repeatable, e.g., `-e rs -e md`)
- `-g`, `--glob <glob>`: Include files matching glob pattern. Uses `**` for recursive matching. (Repeatable)
- `-E`, `--exclude <glob>`: Exclude files matching glob pattern. (Repeatable)
- `--ignore-file <path>`: Add a custom ignore file (e.g., `.myignore`).
- `-w`, `--stopwords <file>`: Path to a file containing stopwords (one per line).
- `-i`, `--idf <mode>`: IDF mode: `frac` (default), `smooth`, `prob`, `log`.
- `-c`, `--scale <factor>`: Scaling factor for the final score (default: 100000).
- `-j`, `--threads <num>`: Number of threads to use (default: number of CPU cores).
- `-o`, `--output <format>`: Output format: `full` (default TSV), `terms` (term, count, file), `files` (file list), `json`.
Environment Variables
- `TFIDF_STOPWORDS`: Path to default stopwords file. Overrides the built-in list if set.
- `TFIDF_THREADS`: Default number of threads if `--threads` is not provided.
Configuration Files
- `~/.config/tfidf/config.toml`: User configuration file for default flags.
- `./.tfidf.toml`: Project-specific configuration file.
Examples
# Basic usage in current dir
tfidf-calc
# Scan src/ for Rust files, stream to fzf
tfidf-calc -e rs --stream src/ | fzf
# Use custom stopwords, output only top terms for Markdown files
tfidf-calc -e md -w ./my_stops.txt -o terms
# Complex query: Find terms in Rust or Python files, excluding tests, in src/
tfidf-calc -e rs -e py -E "*/test/*" -o terms src/