Build a training corpus from your projects
Phase 4 of the roadmap is a small language model fine-tuned on extracted KGs. Self-hosters can run that pipeline today: opt projects in, watch the counts grow on /settings, then run the CLI when you have enough data. This page walks through that workflow. The dashboard answers "what do I have?"; this page answers "what do I do with it?".
1. Opt projects in
Training-corpus contribution is per-project opt-in (default off). Open the kebab menu next to a project title → "Training data…" → toggle on. The dashboard at /settings counts only opted-in projects: projects that aren't opted in are still visible in the total but contribute zero to the histograms.
2. Watch the counts on /settings
The Training corpus card shows total nodes, total edges, kind/relation histograms, and an estimated training-sample count (one sample per node + one per edge). The reference range (typical LoRA fine-tunes on Phi-4-mini land between 3,000 and 10,000 samples) is a soft band, not a threshold. Diversity matters too: a corpus that's 95% one kind won't produce a useful general-purpose extractor.
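If you'd rather check those numbers from a shell than from the dashboard, something like the sketch below works. It assumes the corpus endpoint returns JSON with top-level `nodes` and `edges` arrays whose node entries carry a `kind` field; that shape, the host, and the API key are assumptions to adapt, not the documented response format.

```bash
# Sketch: estimate the training-sample count and kind diversity from the
# public corpus endpoint. The JSON shape below is an assumption; adjust
# the jq paths to your actual payload.
BASE_URL="https://openmind.example.com"   # placeholder host
API_KEY="ok_…"                            # your API key

curl -s -H "Authorization: Bearer ${API_KEY}" \
  "${BASE_URL}/api/public/v1/corpus" -o corpus.json

# One sample per node + one per edge, mirroring the dashboard estimate.
jq '{nodes: (.nodes | length),
     edges: (.edges | length),
     est_samples: ((.nodes | length) + (.edges | length))}' corpus.json

# Rough diversity check: the five most common node kinds and their counts.
jq -r '[.nodes[].kind] | group_by(.) | map({kind: .[0], n: length})
       | sort_by(-.n) | .[0:5][] | "\(.n)\t\(.kind)"' corpus.json
```

If one kind dominates (the 95% case above), opt in more varied projects before spending GPU time.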
3. Run the CLI when ready
`scripts/finetune_slm.py --dry-run --corpus <(curl -H 'Authorization: Bearer ok_…' /api/public/v1/corpus)` validates the corpus without touching the GPU. Drop `--dry-run` and add `--base-model microsoft/Phi-4-mini-instruct` to actually run on an H100 or 2× A10. Merge the LoRA, convert to GGUF, register with Ollama via `scripts/templates/Modelfile.openmind.template`.
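Spelled out end to end, the flow looks roughly like the sketch below. The `finetune_slm.py` flags are the ones named above; the host, output paths, the GGUF conversion via llama.cpp's `convert_hf_to_gguf.py`, and the Ollama model name `openmind-slm` are assumptions about one common route, not the project's prescribed commands (docs/training.md is authoritative).

```bash
# Placeholders: point these at your instance and API key.
BASE_URL="https://openmind.example.com"
API_KEY="ok_…"

# 1. Validate the corpus without touching the GPU (flags from this page).
scripts/finetune_slm.py --dry-run \
  --corpus <(curl -s -H "Authorization: Bearer ${API_KEY}" \
                  "${BASE_URL}/api/public/v1/corpus")

# 2. Real run (H100 or 2x A10). Saving the corpus to disk first keeps the
#    run reproducible; see docs/training.md for hyperparameters and sizing.
curl -s -H "Authorization: Bearer ${API_KEY}" \
  "${BASE_URL}/api/public/v1/corpus" -o corpus.json
scripts/finetune_slm.py \
  --base-model microsoft/Phi-4-mini-instruct \
  --corpus corpus.json

# 3. Merge the LoRA and convert to GGUF. The merged-checkpoint path is a
#    placeholder; the conversion assumes a llama.cpp checkout, which is
#    one common route, not the only one.
python convert_hf_to_gguf.py ./merged-checkpoint --outfile openmind-slm.gguf

# 4. Register with Ollama via the repo's Modelfile template (edit its
#    FROM line to point at the GGUF if the template doesn't already).
cp scripts/templates/Modelfile.openmind.template Modelfile
ollama create openmind-slm -f Modelfile
```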
4. Eval before promotion
`scripts/eval_finetuned.py` runs the eval-tale harness against both the baseline (whatever LLM_DEFAULT_PROVIDER/MODEL is set to) and the candidate model, printing a side-by-side promotion verdict. Promote the candidate only if its pass rate meets or beats the baseline's. The harness is the same one the eval suite uses, so the comparison is apples-to-apples.
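A minimal invocation might look like the sketch below. The `--candidate` flag and the split of LLM_DEFAULT_PROVIDER/MODEL into two environment variables are guesses at the interface, not documented options; check `scripts/eval_finetuned.py --help` and docs/training.md for the real flags.

```bash
# Baseline: whatever the instance already uses (example values, not defaults).
export LLM_DEFAULT_PROVIDER=ollama
export LLM_DEFAULT_MODEL=llama3.1:8b

# Candidate: the model registered with Ollama in the previous step.
# --candidate is a hypothetical flag name; confirm with --help.
scripts/eval_finetuned.py --candidate openmind-slm
```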
Full operator depth — hyperparameters, GPU sizing, eval interpretation — lives at docs/training.md in the repo. This page is the on-ramp; the doc is the manual.