We split corpora into 10 folds, balanced by the number of words (tokens), using dynamic programming. Folds should start on a header or the beginning of a paragraph.
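The balancing step can be sketched as a small dynamic program over paragraph-sized units. This is a minimal sketch, not the code in scripts/split.py: the function name `balanced_folds`, the input `unit_lengths` (one token count per header or paragraph), and the objective (minimising the squared deviation of each fold from the ideal size) are all assumptions made for illustration.

```python
from itertools import accumulate


def balanced_folds(unit_lengths, k):
    """Cut a sequence of units into k contiguous, non-empty folds.

    unit_lengths: token count of each unit (assumed: one per paragraph/header).
    Returns a list of (start, end) unit indices, end exclusive.
    Assumed objective: minimise the sum of squared deviations from total / k.
    """
    n = len(unit_lengths)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of units")

    prefix = [0, *accumulate(unit_lengths)]  # prefix[i] = tokens in units[:i]
    target = prefix[-1] / k
    inf = float("inf")

    # cost[j][i]: best cost of cutting the first i units into j folds
    # back[j][i]: where the last of those j folds starts
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0

    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                c = cost[j - 1][s] + (prefix[i] - prefix[s] - target) ** 2
                if c < cost[j][i]:
                    cost[j][i] = c
                    back[j][i] = s

    # Walk the back-pointers from the end to recover the fold boundaries.
    bounds = []
    i = n
    for j in range(k, 0, -1):
        s = back[j][i]
        bounds.append((s, i))
        i = s
    return bounds[::-1]
```

Because cuts are only ever placed between units, every fold starts on a header or paragraph boundary by construction. Mapping the unit indices back to sids is left out here, since it depends on how the corpus stores paragraphs.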
Create a train/dev/test split (60/20/20) based on the folds.
Output as TOML.
scripts/split.py
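The grouping and output steps above could look roughly like the following. This is a sketch under stated assumptions, not the contents of scripts/split.py: the helper names `merge_folds` and `to_toml` are hypothetical, the 6/2/2 grouping of folds is read off the Speckled Band sample, and the writer only handles flat tables of integers and strings (the Python standard library has no TOML writer).

```python
import json
import re

# Assumed grouping, matching the sample: folds 0-5 train, 6-7 dev, 8-9 test.
GROUPS = {"train": (0, 6), "dev": (6, 8), "test": (8, 10)}


def merge_folds(folds, groups=GROUPS):
    """Combine consecutive folds into named splits.

    folds: list of dicts with "first", "last" (sids) and "length" (tokens).
    groups: split name -> (first fold index, last fold index + 1).
    """
    splits = {}
    for name, (lo, hi) in groups.items():
        part = folds[lo:hi]
        splits[name] = {
            "first": part[0]["first"],
            "last": part[-1]["last"],
            "length": sum(f["length"] for f in part),
        }
    return splits


def _key(key):
    # Bare keys may contain only letters, digits, "_" and "-"; quote the rest.
    return key if re.fullmatch(r"[A-Za-z0-9_-]+", key) else json.dumps(key)


def _value(value):
    # Only integers and strings are supported in this sketch.
    if isinstance(value, int):
        return str(value)
    return json.dumps(value, ensure_ascii=False)


def to_toml(tables):
    """Render {table name: {key: int or str}} as TOML text."""
    chunks = []
    for name, table in tables.items():
        lines = [f"[{name}]"]
        lines += [f"{_key(k)} = {_value(v)}" for k, v in table.items()]
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"
```

Since the splits are unions of whole folds, they inherit the folds' paragraph boundaries, which is why the proportions come out only approximately 60/20/20.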
first, last are the sids (sentence identifiers)
length is the number of tokens in the fold or split
Speckled Band
[meta]
doc = "spec"
docid = 440
length = 11741
first = 10000
last = 10598
split = "60/20/20 -- split on paragraphs so lengths are slightly uneven"
"dcterms:created" = "2024-08-31"

[train]
first = 10000
last = 10336
length = 6905

[dev]
first = 10337
last = 10490
length = 2425

[test]
first = 10491
last = 10598
length = 2411

[fold0]
first = 10000
last = 10052
length = 1213

[fold1]
first = 10053
last = 10089
length = 1126

[fold2]
first = 10090
last = 10154
length = 1181

[fold3]
first = 10155
last = 10205
length = 992

[fold4]
first = 10206
last = 10277
length = 1182

[fold5]
first = 10278
last = 10336
length = 1211

[fold6]
first = 10337
last = 10414
length = 1216

[fold7]
first = 10415
last = 10490
length = 1209

[fold8]
first = 10491
last = 10552
length = 1215

[fold9]
first = 10553
last = 10598
length = 1196
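A file in this layout can be sanity-checked after loading. The sketch below is illustrative only: `check_split_file` is a hypothetical helper, the default fold grouping (0-5 train, 6-7 dev, 8-9 test) and the expectation that sids run contiguously from one fold to the next are assumptions read off the Speckled Band sample, and `tomllib` requires Python 3.11 or later (the `tomli` backport is tried otherwise).

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API


def check_split_file(text, groups=(("train", 0, 6), ("dev", 6, 8), ("test", 8, 10))):
    """Return a list of inconsistencies found in a split file (empty if none).

    groups: (split name, first fold index, last fold index + 1) triples;
    the default mirrors the sample and is an assumption, not a fixed rule.
    """
    doc = tomllib.loads(text)
    n_folds = max(hi for _, _, hi in groups)
    folds = [doc[f"fold{i}"] for i in range(n_folds)]
    problems = []

    # Assumed invariant: each fold starts right after the previous one ends.
    for i, (prev, nxt) in enumerate(zip(folds, folds[1:])):
        if nxt["first"] != prev["last"] + 1:
            problems.append(f"gap or overlap between fold{i} and fold{i + 1}")

    # Each split should span exactly its folds and sum their token counts.
    for name, lo, hi in groups:
        part, split = folds[lo:hi], doc[name]
        if (split["first"], split["last"]) != (part[0]["first"], part[-1]["last"]):
            problems.append(f"{name}: sid range does not match its folds")
        if split["length"] != sum(f["length"] for f in part):
            problems.append(f"{name}: length does not match its folds")

    # The document total should equal the sum over all folds.
    if doc["meta"]["length"] != sum(f["length"] for f in folds):
        problems.append("meta: length does not match the folds")
    return problems
```

Calling `check_split_file(open(path, encoding="utf-8").read())` on a generated file returns an empty list when the folds, splits, and meta table agree with one another.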