2 Notes
Andrew Briscoe edited this page 2025-09-19 13:25:40 +08:00

This page consolidates everything into a coherent, architecturally grounded specification: not just a README, but a context dump for future evolution, grounded in category theory, resource flow, and Unix philosophy.

This document will serve as:

  • User Manual — What each flag does, with examples.
  • Architectural Blueprint — Wire diagrams, morphisms, colimits, presheaves.
  • Deployment Spec — XDG, Arch PKGBUILD, pipeline integration.
  • Design Rationale — Why we chose defaults, what’s deferred, what’s essential.
  • Theory of Coherence — How relationships remain valid under transformation.

📜 tfidf — Structured Semantic Inference Engine

“From bytes to meaning — without magic.”

A fast, safe, multicore, std-only Rust toolkit for exploratory semantic analysis of file collections. Computes TF-IDF vectors, supports topological slicing, similarity search, and term discovery — all composable via Unix pipes.


🧭 Table of Contents

  1. User Guide — Flags, Defaults, Examples
  2. Architecture & Theory — Morphisms, Colimits, Presheaves
  3. Deployment & Integration — XDG, PKGBUILD, Pipelines
  4. Design Rationale — Defaults, Deferred Config, Minimalism
  5. Future Evolution — Clustering, Topic Modeling, Natural Transformations

🧭 1. User Guide — tfidf-calc & tfidf-explore

🚀 Core Philosophy

  • tfidf-calc: Pure function Files → Vectors. Side-effect: writes TSV to stdout.
  • tfidf-explore: Pure function Vectors + Slice → Insights. Reads TSV from stdin or file.

Everything composes. Nothing is stateful (except optional XDG cache).


📥 tfidf-calc — Generate TF-IDF Vectors

tfidf-calc [FLAGS] > vectors.tsv

Flags & Effects

| Flag | Type | Default | Effect | Resource Wire |
|---|---|---|---|---|
| --root DIR | Path | . | Sets root directory to scan. | RootPath → FileSet |
| --exclude GLOB | String | [] | Exclude files matching glob (repeatable). | FileSet → FilteredFileSet |
| --idf MODE | Enum | frac | IDF formula: frac, smooth, prob, log. Changes score distribution. | DocFreq → IDFVector |
| --scale N | u128 | 100000 | Multiplier for final score. Larger = more precision, risk overflow. | RawScore → ScaledScore |
| --jobs N | usize | auto | Number of worker threads. auto = CPU cores. | FileSet → ConcurrentWorkers → DocStream |
| --stopwords FILE | Path | None | File with stopwords (one per line). Filters tokens. | TokenStream → FilteredTokenStream |
| --include-hidden | Bool | false | Include dotfiles (.gitignore, etc). | FilePath → Bool |
| --df-base MODE | Enum | all | Document count basis: all or nonempty. Affects IDF denominator. | DocSet → D |
| --status ADDR | String | None | Serve live status at ADDR (e.g., 127.0.0.1:9999). Side-effect only. | Counters → TCPStream |
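
To make the --idf and --scale flags concrete, here is a minimal sketch of how a score could be derived. This page does not spell out the formula behind each mode, so the ones below are the common textbook variants and are an assumption. The names IdfMode, idf, and scaled_score are hypothetical, and floats are used for brevity even though the tool may use integer arithmetic.

```rust
/// Hypothetical mirror of the --idf modes. The formulas are the usual
/// textbook variants and are NOT confirmed by this page.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum IdfMode {
    Frac,
    Smooth,
    Prob,
    Log,
}

/// Inverse document frequency for a term that occurs in `doc_freq`
/// of `total_docs` documents. Returns 0.0 when the term occurs nowhere.
pub fn idf(mode: IdfMode, total_docs: u64, doc_freq: u64) -> f64 {
    if doc_freq == 0 {
        return 0.0;
    }
    let d = total_docs as f64;
    let df = doc_freq as f64;
    match mode {
        IdfMode::Frac => d / df,
        IdfMode::Log => (d / df).ln(),
        IdfMode::Smooth => ((1.0 + d) / (1.0 + df)).ln() + 1.0,
        // Clamped at zero so terms present in most documents never go negative.
        IdfMode::Prob => ((d - df) / df).ln().max(0.0),
    }
}

/// tf * idf * scale, rounded to an integer score.
/// tf is assumed to be term_freq / doc_length.
pub fn scaled_score(term_freq: u64, doc_length: u64, idf: f64, scale: u128) -> u128 {
    if doc_length == 0 {
        return 0;
    }
    let raw = term_freq as f64 * idf * scale as f64 / doc_length as f64;
    raw.max(0.0).round() as u128
}
```

Under these assumptions a term seen twice in a ten-token file, present in 2 of 10 documents, scores 100000 with the default frac mode and scale.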

Output Schema (TSV)

term	file_path	term_freq	doc_length	doc_freq	tf	idf	score	formula

This is your document-term matrix. Pipe it to tfidf-explore or load into Pandas.
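
For consumers written in Rust, one row of the schema above could be parsed roughly as follows. This is a sketch: the Row struct and parse_row function are hypothetical names, the integer widths are guesses, and tf and idf are kept as raw strings because this page does not say how they are encoded.

```rust
/// One row of the TSV schema. Field types are assumptions; `tf` and `idf`
/// stay as text because their encoding is not specified here.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub term: String,
    pub file_path: String,
    pub term_freq: u64,
    pub doc_length: u64,
    pub doc_freq: u64,
    pub tf: String,
    pub idf: String,
    pub score: u128,
    pub formula: String,
}

/// Parses a single tab-separated line into a Row.
/// A header line (if the tool emits one) fails to parse and should be skipped by the caller.
pub fn parse_row(line: &str) -> Result<Row, String> {
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let cols: Vec<&str> = line.split('\t').collect();
    if cols.len() != 9 {
        return Err(format!("expected 9 columns, found {}", cols.len()));
    }
    // Small helper so each integer column reports its own name on failure.
    let int = |name: &str, s: &str| -> Result<u64, String> {
        s.parse::<u64>().map_err(|e| format!("{}: {}", name, e))
    };
    Ok(Row {
        term: cols[0].to_string(),
        file_path: cols[1].to_string(),
        term_freq: int("term_freq", cols[2])?,
        doc_length: int("doc_length", cols[3])?,
        doc_freq: int("doc_freq", cols[4])?,
        tf: cols[5].to_string(),
        idf: cols[6].to_string(),
        score: cols[7]
            .parse::<u128>()
            .map_err(|e| format!("score: {}", e))?,
        formula: cols[8].to_string(),
    })
}
```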

Examples

# Basic: scan ./src, log IDF, high scale
tfidf-calc --root ./src --exclude "target" --exclude "node_modules" --idf log --scale 1000000 > rust.tsv

# With stopwords
tfidf-calc --stopwords ./english.txt > filtered.tsv

# Live monitoring
tfidf-calc --status 127.0.0.1:9999 > vectors.tsv &
curl 127.0.0.1:9999 # in another terminal

🔍 tfidf-explore — Query & Discover

tfidf-explore vectors.tsv [QUERY_FLAGS]

Flags & Effects

| Flag | Type | Default | Effect | Resource Wire |
|---|---|---|---|---|
| --slice QUERY | String | None | Filter documents: ext:rs dir:src !test. | DocVector → SlicedDocVector |
| --top-terms N | usize | None | Show top N terms by aggregate score in slice. | SlicedDocVector → RankedTermList |
| --similar-to PATH | Path | None | Find files similar to PATH using cosine similarity. | TargetDoc × DocVector → SimilarityRankedDocList |
| (no flags) | | | Show global summary, top terms, top docs. | DocVector → SummaryReport |
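
As an illustration of --top-terms, the sketch below sums scores per term and keeps the N largest. Reading "aggregate score" as a plain sum is an assumption, and top_terms is a hypothetical name; ties are broken alphabetically so the output is deterministic.

```rust
use std::collections::HashMap;

/// Sums scores per term over (term, score) pairs and returns the `n`
/// highest totals, largest first; ties are ordered alphabetically.
pub fn top_terms(rows: &[(&str, u128)], n: usize) -> Vec<(String, u128)> {
    let mut totals: HashMap<&str, u128> = HashMap::new();
    for (term, score) in rows {
        *totals.entry(*term).or_insert(0) += *score;
    }
    let mut ranked: Vec<(String, u128)> = totals
        .into_iter()
        .map(|(term, total)| (term.to_string(), total))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked.truncate(n);
    ranked
}
```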

Slice Query Syntax

  • ext:rs,md — Only .rs or .md files.
  • dir:src,test — Only under src/ or test/.
  • !target,!node_modules — Exclude these paths.
  • glob:*.conf — Include only files matching glob.

Combines with AND logic. Order-independent.
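
The matching rules above can be sketched as a small predicate over file paths. This is an illustration under stated assumptions, not the tool's parser: Slice, parse_slice, and matches are hypothetical names, paths are assumed to be relative to the scan root without a leading ./, an exclusion is assumed to match any whole path component, and glob: clauses are ignored to keep the example std-only.

```rust
use std::path::Path;

/// Parsed form of a slice query such as "ext:rs dir:src !test".
#[derive(Debug, Default)]
pub struct Slice {
    pub exts: Vec<String>,
    pub dirs: Vec<String>,
    pub excludes: Vec<String>,
}

/// Splits the query on whitespace and sorts each clause into its bucket.
/// `glob:` clauses are skipped in this sketch.
pub fn parse_slice(query: &str) -> Slice {
    let mut slice = Slice::default();
    for clause in query.split_whitespace() {
        if let Some(list) = clause.strip_prefix("ext:") {
            slice.exts.extend(list.split(',').map(str::to_string));
        } else if let Some(list) = clause.strip_prefix("dir:") {
            slice
                .dirs
                .extend(list.split(',').map(|d| d.trim_end_matches('/').to_string()));
        } else if clause.starts_with('!') {
            slice
                .excludes
                .extend(clause.split(',').map(|p| p.trim_start_matches('!').to_string()));
        }
    }
    slice
}

/// True when `path` satisfies every clause (AND across clause kinds,
/// OR within a comma-separated list).
pub fn matches(slice: &Slice, path: &str) -> bool {
    let p = Path::new(path);
    let ext_ok = slice.exts.is_empty()
        || p.extension()
            .and_then(|e| e.to_str())
            .map_or(false, |e| slice.exts.iter().any(|x| x.as_str() == e));
    let dir_ok = slice.dirs.is_empty() || slice.dirs.iter().any(|d| p.starts_with(d));
    let components: Vec<&str> = p
        .components()
        .filter_map(|c| c.as_os_str().to_str())
        .collect();
    let excluded = slice
        .excludes
        .iter()
        .any(|x| components.iter().any(|c| *c == x.as_str()));
    ext_ok && dir_ok && !excluded
}
```

Because each clause kind is checked independently and the results are ANDed, the order of clauses in the query does not matter.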

Examples

# Top 10 terms in Rust files
tfidf-explore rust.tsv --slice "ext:rs" --top-terms 10

# Files similar to main.rs
tfidf-explore rust.tsv --similar-to "src/main.rs"

# Compare configs vs source
tfidf-calc --root . --exclude "target" > all.tsv
tfidf-explore all.tsv --slice "glob:*.toml,*.yaml" --top-terms 5 > config_terms.txt
tfidf-explore all.tsv --slice "ext:rs" --top-terms 5 > rust_terms.txt
comm -12 <(cut -f1 config_terms.txt | sort) <(cut -f1 rust_terms.txt | sort) # shared terms (comm needs sorted input)

🧱 2. Architecture & Theory — Morphisms, Colimits, Presheaves

🎯 Goal: Minimize Coupling, Maximize Coherence

We model the system as a category where:

  • Objects: FileSet, TokenStream, DocVector, SlicedVector, RankedTermList, SimilarityList
  • Morphisms: gather_files, tokenize, compute_tfidf, slice, top_terms, similar_to

A morphism is valid if it preserves structure under transformation.
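
In code, "morphisms compose" just means that the output type of one function is the input type of the next. A toy sketch follows; compose is a hypothetical helper, and the tokenizer (lowercase, split on non-alphanumerics) is an assumption rather than the tool's actual rule.

```rust
use std::collections::HashMap;

/// Composes two morphisms: run `f`, then feed its output to `g`.
pub fn compose<A, B, C>(f: impl Fn(A) -> B, g: impl Fn(B) -> C) -> impl Fn(A) -> C {
    move |x| g(f(x))
}

/// Text -> TokenStream (toy rule: lowercase, split on non-alphanumerics).
pub fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// TokenStream -> TermFreqMap.
pub fn term_freq(tokens: Vec<String>) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in tokens {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}
```

With these definitions, compose(tokenize, term_freq) is itself a morphism from text to a term-frequency map, and the type checker rejects any composition whose wires do not line up.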


🔗 Resource Theory Wire Diagram

[RootPath] ───(gather_files)───> [FileSet]
                                     │
                                     ├──(exclude_globs)────> [FilteredFileSet]
                                     │                       │
                                     │                       └──(analyse_file)──> [DocStream]
                                     │                           │
[StopwordFile] ───(read_stopwords)──┘                            │
                                                                 ▼
                                                     [TokenStream] ──(compute_df)──> [DocFreqMap]
                                                                      │                  │
                                                                      ▼                  ▼
                                                [TermFreqMap] ──(score)──────> [DocVector (TSV)]
                                                                               │
                                                                               ├──(slice)──────> [SlicedVector] ──(top_terms)──> [RankedTermList]
                                                                               │
                                                                               └──(similar_to)─> [SimilarityList]

Each wire is a resource. Each box is a pure, composable morphism.


🧮 Colimits & Optimized Architecture

  • Colimit: The “gluing” of DocVector + Slice + Query into a single Insight.
  • We minimize coupling by ensuring each morphism only depends on its input wire.
  • We maximize coherence by ensuring output of one is valid input to the next — no side channels.

Example: slice morphism only needs file_path from DocVector. It doesn’t care about score or idf. This is structural preservation.


🌀 Natural Transformations & Frames of Reference

  • A natural transformation here is a change of idf_mode or scale.
  • Changing scale multiplies every score by the same positive constant, so the relationship “term A is more important than term B” is preserved (up to rounding). Changing idf_mode can reorder terms; comparing those rankings is the point of exploration.
  • Coherence across frames: Whether you slice before or after computing vectors, the meaning of “top terms in src/” is preserved, even though the numbers differ (global vs. local IDF).

⚠️ Important: Global IDF (computed once) is not coherent under slicing. For true coherence, recompute IDF per slice (future work).
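
A small sketch shows why the two frames disagree: the same term gets a different document count D and document frequency df depending on whether they are taken over the whole corpus or over the slice. The functions doc_counts and frac_idf are hypothetical, and frac is assumed to mean D / df.

```rust
/// Returns (D, df) over the documents whose path passes `keep`:
/// D is the number of kept documents, df how many of them contain `term`.
pub fn doc_counts(
    docs: &[(&str, Vec<&str>)],
    term: &str,
    keep: impl Fn(&str) -> bool,
) -> (usize, usize) {
    let mut d = 0;
    let mut df = 0;
    for (path, tokens) in docs {
        if keep(*path) {
            d += 1;
            if tokens.iter().any(|t| *t == term) {
                df += 1;
            }
        }
    }
    (d, df)
}

/// Assumed meaning of the `frac` mode: D / df, or 0.0 for unseen terms.
pub fn frac_idf(d: usize, df: usize) -> f64 {
    if df == 0 {
        0.0
    } else {
        d as f64 / df as f64
    }
}
```

In a four-file corpus where "fn" appears in both files under src/ and nowhere else, the global IDF is 4 / 2 = 2.0, but recomputed inside the src/ slice it is 2 / 2 = 1.0: the term is distinctive for the corpus yet carries no information within the slice.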


🧬 Presheaf as Homset Generator

  • Define objects inferentially: A “document” is anything that can be tokenized and has a file_path.
  • The presheaf F(U) = “set of all valid queries on slice U”.
  • Homset-generator: For any two slices U, V, Hom(U,V) is the set of morphisms (e.g., intersect, diff) that transform insights from U to V.

This lets us define “clustering” or “topic modeling” later as new morphisms on the same objects — without changing core.


📦 3. Deployment & Integration — XDG, PKGBUILD, Pipelines

🏠 XDG Base Directory Compliance

Cache vectors and terms in:

$XDG_CACHE_HOME/tfidf/  # default: ~/.cache/tfidf/
  • vectors/ — Cached TSV outputs, keyed by hash of root + flags.
  • terms/ — Term ID mappings for compact mode (future).

Enables tfidf-explore to be near-instant on repeated queries.
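
Resolving the cache directory could look like the sketch below. It follows the XDG Base Directory rules as I understand them (an unset, empty, or relative XDG_CACHE_HOME falls back to $HOME/.cache); resolve_cache_dir and cache_dir are hypothetical names, and the hashing of root + flags into a cache key is left out.

```rust
use std::env;
use std::path::PathBuf;

/// Pure resolution logic, kept separate from the environment so it is easy to test.
/// Returns None when neither a usable XDG_CACHE_HOME nor HOME is available.
pub fn resolve_cache_dir(xdg_cache_home: Option<&str>, home: Option<&str>) -> Option<PathBuf> {
    let base = match xdg_cache_home {
        // The spec treats empty or relative values as invalid.
        Some(dir) if !dir.is_empty() && PathBuf::from(dir).is_absolute() => PathBuf::from(dir),
        _ => PathBuf::from(home.filter(|h| !h.is_empty())?).join(".cache"),
    };
    Some(base.join("tfidf"))
}

/// Reads the real environment and delegates to `resolve_cache_dir`.
pub fn cache_dir() -> Option<PathBuf> {
    let xdg = env::var("XDG_CACHE_HOME").ok();
    let home = env::var("HOME").ok();
    resolve_cache_dir(xdg.as_deref(), home.as_deref())
}
```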


🐧 Arch Linux PKGBUILD

# PKGBUILD for tfidf
pkgname=tfidf-git
pkgver=0.1.0
pkgrel=1
pkgdesc="Fast, multicore TF-IDF calculator and explorer for file collections"
arch=('x86_64')
url="https://github.com/yourname/tfidf"
license=('MIT')
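depends=('gcc-libs')          # runtime: the compiled binaries only need the C runtime
makedepends=('git' 'cargo')   # build-time only: the Rust toolchain is not needed after install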
source=("git+https://github.com/yourname/tfidf.git")
sha256sums=('SKIP')

pkgver() {
  cd "$srcdir/tfidf"
  git describe --tags --long | sed 's/^v//;s/\([^-]*-g\)/r\1/;s/-/./g'
}

build() {
  cd "$srcdir/tfidf"
  cargo build --release
}

package() {
  cd "$srcdir/tfidf"
  install -Dm755 "target/release/tfidf-calc" "$pkgdir/usr/bin/tfidf-calc"
  install -Dm755 "target/release/tfidf-explore" "$pkgdir/usr/bin/tfidf-explore"
  install -Dm644 "README.md" "$pkgdir/usr/share/doc/tfidf/README.md"
}

Install with:

makepkg -si

🔄 Pipeline Integration

# Full pipeline: compute → slice → top terms → visualize
tfidf-calc --root ./project --idf log --scale 1000000 \
  | tfidf-explore --slice "ext:rs dir:src" --top-terms 20 \
  | awk -F'\t' '{print $1}' \
  | visidata  # or your favorite viz tool

TSV is the universal interchange format. Works with awk, sort, join, jq -R, Python, R.


🧠 4. Design Rationale — Defaults, Deferred Config, Minimalism

Why These Defaults?

  • --include-hidden=false: Safety first. Don’t accidentally index .env or .git.
  • --idf=frac: Simple, intuitive, no logs or floats.
  • --scale=100000: Good precision for most corpora, avoids overflow.
  • --jobs=auto: Maximize hardware utilization.
  • --df-base=all: Simpler, more predictable than excluding empty docs.

Defaults are chosen for safety, simplicity, and performance — not theoretical purity.


📄 Deferred: Config File

No config file yet. Why?

  1. Unix Philosophy: Flags and env vars are composable and scriptable.
  2. YAGNI: You ain’t gonna need it — until you have 10 projects with complex slices.
  3. Future: When needed, config will live in $XDG_CONFIG_HOME/tfidf/config.toml.

# Future config.toml
[default]
root = "."
exclude = ["target", "node_modules"]
idf = "log"
scale = 1000000

[profiles.rust]
root = "./src"
include = ["*.rs"]
exclude = ["target"]

Performance Guarantees

  • Multicore: File I/O and tokenization parallelized.
  • Zero-copy where possible: String reuse, &str keys in maps.
  • Buffered I/O: 256KB buffers for reads and writes.
  • Fast hasher: Custom FastHasher avoids SipHash overhead.

Handles 100K+ files, 10M+ tokens, in seconds to minutes.
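
The multicore bullet can be illustrated with scoped threads from the standard library. This is a sketch of the general shard-and-merge pattern, not the tool's worker pool: parallel_term_counts is a hypothetical name, documents are in-memory strings rather than files, and tokenization is a plain whitespace split.

```rust
use std::collections::HashMap;
use std::thread;

/// Counts tokens across `docs` using up to `jobs` worker threads.
/// Each worker builds a private map for its chunk; the maps are merged at the end,
/// so no locking is needed while counting.
pub fn parallel_term_counts(docs: &[String], jobs: usize) -> HashMap<String, usize> {
    let jobs = jobs.max(1);
    // Ceiling division so every document lands in exactly one chunk.
    let chunk_size = ((docs.len() + jobs - 1) / jobs).max(1);

    let partials: Vec<HashMap<String, usize>> = thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|part| {
                scope.spawn(move || {
                    let mut local: HashMap<String, usize> = HashMap::new();
                    for doc in part {
                        for token in doc.split_whitespace() {
                            *local.entry(token.to_lowercase()).or_insert(0) += 1;
                        }
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });

    let mut total: HashMap<String, usize> = HashMap::new();
    for partial in partials {
        for (term, count) in partial {
            *total.entry(term).or_insert(0) += count;
        }
    }
    total
}
```

The result is independent of the jobs value, which is what lets --jobs be a pure performance knob.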


🔮 5. Future Evolution — Clustering, Topic Modeling, Natural Transformations

🧩 Clustering as a Morphism

# Future: tfidf-cluster
tfidf-calc ... | tfidf-cluster --k 5 --output clusters.tsv

  • Input: DocVector
  • Output: doc_id<TAB>cluster_id
  • Algorithm: K-means or HDBSCAN on TF-IDF vectors.

A new morphism in our category — composes with existing ones.


📚 Topic Modeling as a Natural Transformation

  • Transform DocVector → DocTopicMatrix via LDA or NMF.
  • Natural because it preserves document-term structure while adding latent dimensions.

This is a functor from the category of TF-IDF vectors to the category of topic models.


♻️ Coherence Under Refactoring

If we refactor Document to add a field meta: HashMap<String, String>, the morphism similar_to still works, because it only uses term_freq and path.

Structural preservation: The “neighbor” relationship (cosine similarity) is defined by term overlap, not internal fields.
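
A sketch of cosine similarity over sparse term vectors makes the point concrete: the function sees only a map from term to weight, so extra fields on Document cannot affect it. The function name cosine and the use of f64 weights are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse term-weight vectors.
/// Only terms present in both maps contribute to the dot product.
/// Returns 0.0 if either vector is empty or all-zero.
pub fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(term, x)| b.get(term).map(|y| x * y))
        .sum();
    let norm_a = a.values().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.values().map(|y| y * y).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

Two documents with no shared terms score exactly 0.0, and a document compared with itself scores 1.0 up to floating-point error.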


Conclusion — A Coherent, Composable Foundation

You now have:

  • A robust, documented toolset (tfidf-calc, tfidf-explore).
  • A formal architecture grounded in category theory and resource flow.
  • A deployment strategy (XDG, PKGBUILD, pipelines).
  • A design rationale that prioritizes safety, performance, and Unix composability.
  • A roadmap for future morphisms (clustering, topics) that preserve coherence.

This is not just code — it’s a semantic inference engine built for exploration, composition, and emergence.

“The Tao gave birth to machine language. Machine language gave birth to the assembler. The assembler gave birth to the compiler. Now there are ten thousand languages. Each language has its purpose, however humble. Each language expresses the Yin and Yang of software. Each language has its place within the Tao.” (The Tao of Programming, 1.2)

Let the composition begin.


TF-IDF Calc: Command Line Specification

Command

tfidf-calc [FLAGS] [OPTIONS] [--] [PATH]...

Positional Arguments

  • [PATH]...: One or more root directories or files to process. If omitted, defaults to current directory (.).

Flags (Boolean)

  • -h, --help: Print help.
  • -V, --version: Print version.
  • -H, --hidden: Include hidden files and directories. (Default: false)
  • -I, --no-ignore: Do not respect .gitignore, .ignore, etc. (Default: respects them)
  • -s, --skip-empty: Skip files that contain zero tokens after tokenization.
  • -S, --stream: Enable streaming, line-delimited output (ideal for fzf).

Options (Key-Value)

  • -t, --type <filetype>: Filter by file type: file, directory, symlink. (Repeatable, e.g., -t f -t l)
  • -e, --extension <ext>: Filter by file extension. (Repeatable, e.g., -e rs -e md)
  • -g, --glob <glob>: Include files matching glob pattern. Uses ** for recursive. (Repeatable)
  • -E, --exclude <glob>: Exclude files matching glob pattern. (Repeatable)
  • --ignore-file <path>: Add a custom ignore-file (e.g., .myignore).
  • -w, --stopwords <file>: Path to a file containing stopwords (one per line).
  • -i, --idf <mode>: IDF mode: frac (default), smooth, prob, log.
  • -c, --scale <factor>: Scaling factor for final score (default: 100000).
  • -j, --threads <num>: Number of threads to use (default: number of CPU cores).
  • -o, --output <format>: Output format: full (default TSV), terms (term, count, file), files (file list), json.

Environment Variables

  • TFIDF_STOPWORDS: Path to default stopwords file. Overrides built-in if set.
  • TFIDF_THREADS: Default number of threads if --threads is not provided.
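
The precedence implied above (the --threads flag, then TFIDF_THREADS, then the number of CPU cores) could be resolved as in this sketch. resolve_threads is a hypothetical name, and treating 0 or an unparsable value as "not set" is an assumption.

```rust
use std::thread;

/// Picks the worker count: explicit flag first, then the environment value,
/// then the number of available CPU cores (falling back to 1 if unknown).
/// Zero and unparsable values are treated as unset.
pub fn resolve_threads(flag: Option<usize>, env_value: Option<&str>) -> usize {
    flag.filter(|n| *n > 0)
        .or_else(|| {
            env_value
                .and_then(|v| v.trim().parse::<usize>().ok())
                .filter(|n| *n > 0)
        })
        .unwrap_or_else(|| {
            thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1)
        })
}
```

A caller would pass the parsed flag together with std::env::var("TFIDF_THREADS").ok().as_deref().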

Configuration Files

  • ~/.config/tfidf/config.toml: User configuration file for default flags.
  • ./.tfidf.toml: Project-specific configuration file.

Examples

# Basic usage in current dir
tfidf-calc

# Scan src/ for Rust files, stream to fzf
tfidf-calc -e rs --stream src/ | fzf

# Use custom stopwords, output only top terms for Markdown files
tfidf-calc -e md -w ./my_stops.txt -o terms

# Complex query: Find terms in Rust or Python files, excluding tests, in src/
tfidf-calc -e rs -e py -E "*/test/*" -o terms src/