Modulo:spec-splitter-en
--[[
Specification of the lemma splitter syntax
==========================================

About
-----

The lemma is read from the pagename. The splitter can divide it into useful fragments and link to those, and optionally the page can be added to compound (morpheme) categories for them. A split control parameter and an extra parameter are available.

Split strategies
----------------

The 7 split and processing strategies selectable with the split control parameter sent to the template are:
- automatic multiword split (default)
- assisted split (assisted automatic multiword split)
- manual split
- simple root split (mortyp:s N + U)
- simple bare root (mortyp M or N)
- large letter split (mortyp M)
- no split

Incoming information
--------------------

- boolean information whether compound categories are desired
- lemma (may NOT be empty)
- split control parameter ("fra=", may be empty)
- extra parameter ("ext=", may be and usually is empty)
- language stuff (code and some variants of language name)
- word class (reduced to 2 questions)

The extra parameter creates extra category includes, but does NOT affect the actual split.

The language stuff is built into category names, but it does NOT affect the actual split. The splitter is language-independent.

The word class is reduced to 2 boolean YES-vs-NO questions:
- "Is it KA ie sentence?" (affects category name in some cases)
- "Is it NR ie nonstandalone root?" (affects mortyp for "simple bare root" strategy)

Thus the word class is needed but does NOT affect the actual split.
Output
------

- wikitext intended to be sent to screen, usually with wikilinks ie [[...|...]]
- list of category names without prefix ("Category:"), without rectangular brackets, and without sorting hint text
- list of bool values parallel to the list of category names revealing what category include is to be formed as main page of the category by means of sorting hint "|-"

Overall limits
--------------

- length of the incoming lemma : 1...120 octet:s
- total number of compound categories created : 18 (this is less than 16 + 4)

Limits for the split control parameter
--------------------------------------

- length of the split control parameter : 1...120 octet:s (example: "-")

Limits for the extra parameter
------------------------------

- length of the extra parameter : 2 or 5...120 octet:s (example: "&X" or "[U:u]")
- number of extra fragments : min 1 max 4 (unless "&"-syntax is used)

Limits for assisted split
-------------------------

- length of the split control parameter : 2...120 octet:s (example: "%0")
- length of explicitly provided link target : 1...40 octet:s
- number of blocked input boundaries : max 8
- number of accessible output fragments : max 16 (numbered "0"..."F")

Limits for manual split
-----------------------

- length of the split control parameter : 4...120 octet:s (example: "b[o]")
- number of output fragments : min 1 or 2 and max 16 (numbered "0"..."F")

Split control parameter
-----------------------

The split control parameter is evaluated only if splitting is globally enabled in the module, otherwise it is useless and ignored, and no error occurs, and the splitter is not called at all.
Base syntax of the split control parameter:
- special value "-" : no split
- sequence of tuning commands, beginning with "%" or "#" : assisted split
- special value "$S" : simple root split
- special value "$B" : simple bare root
- special value "$H" : large letter split
- any other content : considered as request for manual split

No split
--------

No split can be achieved by the special value "-". The raw lemma is shown, there are no links, and no compound (morpheme) category includes are created automatically, but the extra parameter is still available.

Automatic multiword split and assisted split
--------------------------------------------

If the lemma is multiword then it is automatically split at detected split boundaries. Such a boundary consists of one or multiple qualifying char:s, namely space and punctuation (5 char:s: ! , . ; ?). Note that particularly dash "-" and apo "'" do NOT count as punctuation, thus for example words "berjalan-jalan" or "o'clock" will remain together (but "there's" will too, see below). The fragments are linked by default but there are various options to tune the linking.

Assisted split
--------------

During the split work 2 separate ZERO-based counters are maintained and commands in the split control parameter refer to those.
- input boundary counter : Counts boundaries between incoming words forming the lemma. Multiple consecutive qualifying char:s count as one boundary, this applies even to leading and trailing position. For example the text "Apples, ? bananas and beer." contains 4 boundaries numbered from 0 to 3, the string ", ? " (4 char:s) receives index 0. The string "?va?" contains 2 boundaries.
- output fragment counter : Counts generated fragments. For example "pembangkit listrik tenaga surya" contains 3 boundaries (see above) and will by default generate 4 fragments. If you disable breaking at boundaries 0 and 2, then the result will be only 2 fragments "pembangkit listrik" and "tenaga surya" instead, numbered 0 and 1.

The counters are referenced by one-digit numbers, "0" to "9", and "A" to "F" (must be uppercase) for rarely needed indexes "10"..."15", thus actually hex numbers.

For assisted split the split control parameter contains a sequence of tuning commands separated by spaces, or even only one command:
- "%" followed by 1...8 ascending hex digits : do not split at listed input word boundaries
- "#" followed by a hex digit followed by "N" or "I" or "A" : tune at pointed output fragment index
  - "N" do not link the fragment (this blocks the categorization too)
  - "I" convert beginning letter to lowercase ("I" minusklo) for link target
  - "A" convert beginning letter to uppercase ("A" majusklo) for link target
- "#" followed by a hex digit followed by colon ":" followed by string : link to that target instead

The "#"-items (ZERO or ONE or more permitted) must be ascending but need not be consecutive, and they must follow the single "%"-item if it is present.

For example "%3A #2N #5A #7N #8:test" will:
- avoid breaking at input boundaries 3 and 10
- avoid linking of fragments 2 and 7
- link fragment 5 to target with uppercase letter
- link fragment 8 to "test"

The most common use will be "#0I" fixing the case of the word at beginning of a sentence lemma, for example "Yes we can." will link to "yes", not to "Yes", besides "we" and "can".
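The two counters and the "%"-command can be illustrated with a minimal sketch in Python. It is not the code of the module; the names count_boundaries, parse_block_command and split_lemma are hypothetical, and the treatment of a blocked leading boundary is an assumption, since the text above does not define it.

```python
import re

# Qualifying boundary characters named above: space and the 5 punctuation
# characters ! , . ; ? (dash and apostrophe are deliberately excluded).
BOUNDARY = re.compile(r"([ !,.;?]+)")
HEX_DIGITS = "0123456789ABCDEF"  # uppercase only, as the text requires


def count_boundaries(lemma):
    """Count input boundaries; a run of qualifying characters counts once."""
    return len(BOUNDARY.findall(lemma))


def parse_block_command(command):
    """Turn a "%"-command such as "%3A" into boundary indexes (3 and 10)."""
    digits = command[1:]
    if not command.startswith("%") or not 1 <= len(digits) <= 8:
        raise ValueError("expected '%' followed by 1...8 hex digits")
    if any(d not in HEX_DIGITS for d in digits):
        raise ValueError("invalid hex digit (lowercase is not accepted)")
    indexes = [HEX_DIGITS.index(d) for d in digits]
    if any(a >= b for a, b in zip(indexes, indexes[1:])):
        raise ValueError("digits after '%' must be ascending")
    return indexes


def split_lemma(lemma, blocked=()):
    """Split at every boundary whose ZERO-based index is not in `blocked`."""
    fragments, current = [], ""
    # Splitting with a capturing group alternates word, boundary, word, ...
    for position, part in enumerate(BOUNDARY.split(lemma)):
        if position % 2 == 0:
            current += part                 # ordinary word text
        elif position // 2 in blocked:
            current += part                 # blocked boundary stays inside
        else:
            if current:
                fragments.append(current)   # close the fragment here
            current = ""
    if current:
        fragments.append(current)
    return fragments
```

With this sketch, split_lemma("pembangkit listrik tenaga surya", parse_block_command("%02")) gives the 2 fragments "pembangkit listrik" and "tenaga surya", matching the example above. The "#"-commands are left out to keep the sketch short.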
Too high positions of boundaries and fragments are ignored, but other errors are not and they do result in an error, most notably:
- messing up the order of "%" and "#", ie putting "#" before "%", for example "#2N %3A #5A #7N #8:test"
- numbers after "%" are not ascending, for example "%A3 #2N #5A #7N #8:test"
- "#"-items are not ascending, for example "%3A #2N #5A #7N #6:test"
- invalid char:s or missing spaces, for example "%3A #2N#5A #7N #8:test"

A too high number of boundaries or fragments occurring in the lemma does not cause an error, but it is not possible to tune those with index >= 16 anymore.

Manual split
------------

The manual split shares some ideas with the "very raw manual split" carried out with wiki syntax but there are some crucial differences. Fragments are enclosed by single rectangular brackets (as opposed to double ones in "very raw manual split"), slash "/" is the primary field separator instead of wall "|" and there is a secondary (and early) separator denoted by colon ":".

There is a "sum check" feature that ensures that the visible text (sum or concatenation of all fragments) still equals the lemma, otherwise an error occurs. Moving (ie renaming) a lemma page where this manual split is applied will inevitably generate an error and force the contributor to adjust the split control parameter.

Besides the usual syntax with one field separator (here slash instead of wall) there is a syntax with one colon, and a syntax with both one colon and one slash allowing to specify the morpheme type of the fragment.

Spaces are permitted in the lemma and to some degree in the predefined split fragments. It is possible to apply manual split to multiword lemmas, but in practice this is rarely useful (see above and below, automatic multiword split or assisted split is much better).

If the morpheme type is specified but link target is not, then the splitter will construct the link target (see below).
It is permitted to use the plus "+" sign as fragment separator (adjacent to a rectangular bracket, except at the very beginning or end of the split control parameter) in the manual split that will be visible, but is excluded from the "sum check".

Types of fragments:
- F000 : no brackets, no colon, no slash (visible text no link)
- F200 : 2 brackets, no colon, no slash (combo field target visible text)
- F201 : 2 brackets, no colon, 1 slash (target / visible text)
- F210 : 2 brackets, 1 colon, no slash (mortyp : combo field target visible text)
- F211 : 2 brackets, 1 colon, 1 slash (mortyp : target / visible text)

Deleted substrings: Arc brackets are permitted to some degree in the visible text and combo field for types F200 F201 F210 F211 allowing to specify deleted substrings (usually single letters, even single space), but prohibited in the target field (left of the slash "/"). Such deleted substrings are excluded from the "sum check", and with pseudo mortyp "L" (see below) from categorization, but never from linking.

Spaces are permitted with some restrictions:
- a field may not begin nor end with a space ("[U:-are /ar(e)]" is bad)
- a deleted substring may not begin nor end with a space ("[M:loep( a)]" is bad)
- deleted spaces are prohibited after "L:" but otherwise permitted ("[L:fingr( )]" is bad but "[M:kereta( )api]" is good)

Special magic features of the type F210:
- automatic dash adding for mortyp:s I P U for linking and categorization
- pseudo mortyp "L" (-> "N")

Note that omitting deleted letters for the "sum check" is NOT restricted to the fragment type F210, and it is performed much earlier during prevalidation of the split control parameter.
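The "sum check" described above can be sketched in Python as follows. This is only an illustration under assumptions: the names sum_check_text and passes_sum_check are hypothetical, and the sketch performs none of the validation of the module (for example it does not verify that a plus is adjacent to a bracket).

```python
import re

# Deleted substring in arc brackets, such as "(o)" or "( )".
ARC_DELETED = re.compile(r"\([^()]*\)")


def sum_check_text(control):
    """Return the text a manual split contributes to the "sum check"."""
    pieces = []
    pos = 0
    while pos < len(control):
        char = control[pos]
        if char == "+":
            pos += 1                        # visible separator, not summed
        elif char == "[":
            end = control.index("]", pos)   # ValueError if unbalanced
            body = control[pos + 1:end]
            # Optional mortyp prefix: one uppercase ASCII letter plus colon.
            if len(body) >= 2 and "A" <= body[0] <= "Z" and body[1] == ":":
                body = body[2:]
            # With a slash only the right field is visible (F201, F211).
            shown = body.split("/", 1)[-1]
            pieces.append(ARC_DELETED.sub("", shown))
            pos = end + 1
        else:
            end = pos
            while end < len(control) and control[end] not in "[+":
                end += 1
            pieces.append(control[pos:end])  # F000 bare text
            pos = end
    return "".join(pieces)


def passes_sum_check(lemma, control):
    """True if the visible text without deleted substrings equals the lemma."""
    return sum_check_text(control) == lemma
```

For the lemma "perkeretaapian" the control string "[C:per-...-an/per][M:kereta( )api][C:per-...-an/an]" passes in this sketch, whereas "[C:per-...-an/per][M:kereta api][C:per-...-an/an]" does not, because the space is not marked as deleted.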
Restrictions:
- at least 2 fragments, or ONE fragment if it contains a slash (ONE fragment with colon but without slash is NOT sufficient)
- max 16 fragments
- two fragments of type F000 may not follow each other
- leading and trailing spaces are prohibited inside rectangular brackets (but spaces inside lemma parts are possible), ie leading and trailing spaces are prohibited except for type F000
- leading and trailing spaces are prohibited inside arc brackets if "L" is used with type F210
- empty content of a field (left and right of slash "/") is prohibited

List of 6+1+1+1 selectable morpheme type codes:

  C   circumfix            cirkumfikso
  I   infix                infikso                        (-eo-: -o- -et- -ist- | -en-: -fucking-)
  M   standalone root      memstara radiko                (-eo-: tri dek post | -en-: hole)
  N   nonstandalone root   nememstara radiko              (-eo-: fer voj | -en-: lingon)
  P   prefix               prefikso
  U   suffix               sufikso (postfikso, finajxo)   (-eo-: -a -j -n | -en-: -ist)
  -------
  W   word                 vorto
  -------
  L   same as "N" but changes categorization behavior (only in F210, see below)
  -------
  X   only after "&" in the extra parameter (converted to M plus W and pagename)

These mortyp:s can be used in the split control parameter before colon ":" with manual split, and in the extra parameter (see below), but then "L" is prohibited (thus C I M N P U W are left plus maybe X), either after "&", or in fragments before ":" or "!". The default mortyp with manual split is none (link but do not put the page into any compound (morpheme) category) as long as nothing else is specified. These types are partially ignored if compound categories are entirely not desired for some reason, but affix types I P U still affect the linking.

The letter "L" belongs to the same morpheme type as "N" and is categorized as "N" but changes the categorization behavior of the splitter. It is to be used in fragment type F210 only together with deleted letters (not spaces) and causes the long form to be linked but the short form categorized (this cannot be achieved by any other means with the split control parameter only, code "N" links and categorizes the long form only, but can with the extra parameter), and the short form is fed into the "sum check" (much earlier). This is useful with Esperanto where most of the vocabulary consists of nonstandalone roots. Instead of for example "[N:pomo/pom(o)][N:arbo/arb(o)][U:o]" (complicated and wrongly categorizing "pomo" instead of intended "pom") we can write "[L:pom(o)][L:arb(o)][U:o]" resulting in lemma links to "pomo", "arbo" and "-o" and categories type "N" of "pom", type "N" of "arb", and type "U" of "o".

Colon evaluation:
- colon is only regarded as a control char and can cause an error if:
  - it is preceded by an uppercase letter ("A"..."Z") and
  - those 2 char:s are located in the beginning of a fragment and inside [...]
- otherwise it is considered to be an ordinary letter

For example:
- "+[M:crap]" is regarded and valid (although maybe useless)
- "+[A:crap]" is regarded and an error
- "+[m:A:crap]" and "A:crap" is maybe nonsense but ignored and not an error against this specification

Examples of legal syntax of fragments:
- "blah"  -- F000 show "blah" no link
- " blah"  -- F000
- " "  -- F000
- "[blah]"  -- F200 show "blah" link to "blah"
- "[blah/Blah]"  -- F201 show "Blah" link to "blah"
- "[bl ah/Bl ah]"  -- F201 show "Bl ah" link to "bl ah" (inner spaces are legal)
- "[I:il]"  -- F210 show "il" link to "-il-" categorize "-il-" as mortyp "I" and feed only "il" to the "sum check"
- "[M:preter]"  -- F210 show "preter" link to "preter" categorize "preter" as mortyp "M" and feed "preter" to the "sum check"
- "[M:(k)irim]"  -- F210 show "(k)irim" link to "kirim" categorize "kirim" as mortyp "M" and feed only "irim" to the "sum check"
- "[M:kereta( )api]"  -- F210 show "kereta( )api" link to "kereta api" categorize "kereta api" as mortyp "M" and feed only "keretaapi" to the "sum check"
- "[L:kat(o)]"  -- F210 show "kat(o)" link to "kato" categorize only "kat" as mortyp "N" and feed only "kat" to the "sum check"
- "[P:blah-/Blah]"  -- F211 show "Blah" link to "blah-" categorize "blah-" as mortyp "P" and feed "Blah" to the "sum check"

Examples of dubious syntax (not against this specification but not useful):
- "[blah/blah]"  -- F201 unnecessary slash ("target" = "visible text", use "[blah]" instead)
- "[blah-/blah]"  -- F201 show "blah" link to "blah-" ("[P:blah]" is usually better)
- "[N:kat(o)]"  -- F210 show "kat(o)" link to "kato" categorize "kato" as mortyp "N" and feed "kat" to the "sum check" (dubious effect, "[L:kat(o)]" is probably what we intended)
- "[P:blah-/blah]"  -- F211 show "blah" link to "blah-" and select mortyp "P" (unnecessary, "[P:blah]" is sufficient)
- "[N:kato/kat(o)]"  -- F211 show "kat(o)" link to "kato" categorize "kato" as mortyp "N" and feed only "kat" to the "sum check", this does have the same effect as "[N:kat(o)]" but not as "[L:kat(o)]" (dubious effect, "[L:kat(o)]" is probably what we intended)
- "[N:kat/kat(o)]"  -- F211 show "kat(o)" link to only "kat" categorize only "kat" as mortyp "N" and feed only "kat" to the "sum check", this does not have the same effect as "[L:kat(o)]" (dubious effect, "[L:kat(o)]" is probably what we intended)
- "[M:kirim/(k)irim]"  -- F211 show "(k)irim" link to "kirim" categorize "kirim" as mortyp "M" and feed only "irim" to the "sum check", this does have the same effect as "[M:(k)irim]" (unnecessary, "[M:(k)irim]" is sufficient)
- "[I:il/il]"  -- F211 show "il" link to "il" categorize "il" as mortyp "I" and feed "il" to the "sum check", no auto-added dashes here (dubious effect, "[I:il]" is probably what we intended)

Examples of illegal syntax of fragments:
- "[[blah/Blah]]"  -- double or multiple brackets
- "[blah/Bl[a]h]"  -- nested or unbalanced brackets
- "[blah|Blah]"  -- illegal genuine wall
- "[blah/Blah ]"  -- illegal space
- "[ blah/Blah]"  -- illegal space
- "[blah/]"  -- illegal empty content (invisible link makes no sense)
- "[/blah]"  -- illegal empty content (use "blah" instead)
- "[M:/blah]"  -- illegal empty content (use "M:blah" instead)
- "[N:kat(o)/kat]"  -- arc brackets are illegal in the target field
- "[L:katrol]"  -- "L" used but no arc brackets
- "[L:kat/kat(o)]"  -- "L" used with type F211 ie 2 fields
- "[L:kat(r )ol]"  -- "L" used with leading or trailing space inside arc brackets
- "[A:blah-/blah]"  -- illegal mortyp (only some selected uppercase letters are permitted)

Examples of complete syntax of the split control parameter:
  lemma "pertidaksamaan" -> [C:per-...-an/per][M:tidak][M:sama][C:per-...-an/an]
  lemma "perkeretaapian" -> [C:per-...-an/per][M:kereta][M:api][C:per-...-an/an]
                         -> [C:per-...-an/per][M:kereta( )api][C:per-...-an/an]
                         -> [C:per-...-an/per]+[M:kereta( )api]+[C:per-...-an/an]
  lemma "mengirim" -> [P:meN-/meng][M:(k)irim] (dash "-" is required)
  lemma "icke-binaer" -> [P:icke-][M:binaer]
  lemma "kingdom" -> [M:king][U:dom]
  lemma "kingdom" -> [M:king]+[U:dom]
  lemma "hallon" -> hall[U:on]
  lemma "hallon" -> hall+[U:on]
  lemma "God" -> [god/God] (no category)
  lemma "tridek" -> [M:tri][M:dek]
  lemma "tridek" -> [M:tri]+[M:dek]
  lemma "tridek" -> [tri][dek] (no category)
  lemma "fervojo" -> [L:fer(o)][L:voj(o)][U:o]
  lemma "kungadoeme" -> [M:kung]+a+[M:doeme]

Examples of dubious syntax (not against this specification but not useful):
  lemma "perkeretaapian" -> [C:per-...-an/per][M:keretaapi][C:per-...-an/an] (links to wrong morpheme "keretaapi")
                         -> [C:per-...-an/per][M:kereta api/kereta( )api][C:per-...-an/an] (unnecessarily complicated)
  lemma "mengirim" -> [P:meng][N:irim] (links to wrong morphemes)
                   -> [P:meng][irim] (links to wrong morphemes, "irim" not categorized)
                   -> [P:meN-/men][M:kirim/girim] (shows one dubious and one wrong morpheme)
                   -> [meN-/meng][M:(k)irim] ("meN-" is not categorized)
  lemma "hallon" -> hall +[U:-on/on] (junk space precedes the plus)
                 -> hall[U:-on/on] (unnecessary, "hall[U:on]" is sufficient)
                 -> hall+[U:-on/on] (unnecessary, "hall+[U:on]" is sufficient)
  lemma "God" -> [W:god/God] (better use "$B")
  lemma "God" -> [M:god/God] (better use "$B")
  lemma "dek tri" -> [dek] [tri] (better use automatic)
  lemma "dek tri" -> [W:dek] [W:tri] (better use automatic)

Examples of illegal syntax of the split control parameter:
  lemma "perkeretaapian" -> [C:per-...-an/per][M:kereta api][C:per-...-an/an] (fails the "sum check")
  lemma "icke-binaer" -> [P:icke][M:binaer] (fails the "sum check")
  lemma "mengirim" -> [P:meN][M:kirim/girim] (fails the "sum check")
  lemma "hallon" -> hall++[U:-on/on] (double plus)
  lemma "hallon" -> hall+ [U:-on/on] (junk space beside the plus)
  lemma "hallon" -> +hall[U:-on/on] (leading plus)
  lemma "hallon" -> hall[U:-on/on]+ (trailing plus)
  lemma "hallon" -> ha+ll[U:-on/on] (plus is not adjacent to a bracket)

Field processing:

The left field (left of the slash but after a possible colon) is for the link and the category. The right one (right of the slash) is shown on the screen and fed into the "sum check". If only one field is given (combo field), then it is assumed to be primarily the right one and the left one is auto-generated by copying from the right one with removing brackets but not content between them (for example "fer(o)" -> "fero"). If the morpheme type is specified but link target (left field) is not, thus fragment type is F210, then the splitter will in some cases (I P U) additionally enhance the left field by adding "-" if it is not yet present (for example for I: "il" -> "-il-" and for P "re" -> "re-" and "icke-" -> "icke-"). The strings used for link and category are always exactly the same except when pseudo-type "L" is used (note that the simple root split also breaks against this principle). Other rules apply to the right field where the string is displayed literally but bracketed part is removed before feeding it into the "sum check" allowing to display deleted letters and even deleted spaces.

Simple root split
-----------------

The syntax "$S" selects the simple root split strategy. It is fully automatic and cannot be tuned in any way. The pagename must consist of at least 2 letters and the last one must be ASCII lowercase "a"..."z", otherwise an error is triggered. The last letter of the lemma is separated. The remaining root with beginning letter changed to lowercase if needed is used to brew the category include with type "N" (nonstandalone) and as the main page of the category. If the beginning letter was uppercase, then a link to the full lemma with beginning letter changed to lowercase is created, otherwise there is no link (because it would be a self-link). The last letter becomes a suffix with a dash and is linked and the category include receives the type "U".
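The simple root split can be pictured with a short Python sketch. The names lower_first and simple_root_split and the shape of the returned dictionary are hypothetical. The sketch assumes that the category key of the root is always the lowercased root, as stated in the rule above.

```python
def lower_first(text):
    """Lowercase only the beginning letter, leaving the rest untouched."""
    return text[:1].lower() + text[1:]


def simple_root_split(lemma):
    """Sketch of "$S": separate the last letter and categorize the root."""
    if len(lemma) < 2 or not "a" <= lemma[-1] <= "z":
        raise ValueError("need at least 2 letters, the last one ASCII lowercase")
    root, suffix = lemma[:-1], lemma[-1]
    lowered = lower_first(lemma)
    return {
        "shown": [root, suffix],
        # A link for the root exists only if lowercasing changed the lemma,
        # otherwise it would be a self-link.
        "root_link": lowered if lowered != lemma else None,
        "suffix_link": "-" + suffix,
        # Tuples of (mortyp, category key, main page of the category).
        "categories": [("N", lower_first(root), True), ("U", "-" + suffix, False)],
    }
```

In this sketch "suno" yields no link for the root, while "Suno" yields a link to "suno". Both are categorized under the root "sun" with type "N" as main page, and under "-o" with type "U".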
This is intended for Esperanto words built from and representing nonstandalone roots including proper nouns having the lowercase variant (for example "Suno" and "suno") (see below under "Examples and selecting the optimal strategy"). This assumes that all Esperanto roots are denoted with lowercase letters, for example "sun" for "suno" and "Suno" and there is no root "Sun". For Esperanto proper nouns without any lowercase variant no split or manual split is the choice, and if desired the extra parameter to categorize the main root (see below example word "GXakarto" under "Examples and selecting the optimal strategy").

Simple bare root
----------------

The syntax "$B" selects the simple bare root strategy. It is fully automatic and cannot be tuned in any way. But it depends on the word class, or more precisely whether it is "NR" ie nonstandalone root. The root with beginning letter changed to lowercase (unless "NR") if needed is used to brew the category include with type "M" (standalone) or "N" (nonstandalone, if "NR") and as the main page of the category. If the beginning letter was uppercase, then a link to the lemma with beginning letter changed to lowercase is created, otherwise (and with "NR") there is no link (because it would be a self-link).

This is intended for standalone roots including proper nouns having the lowercase variant (for example "Sun" and "sun") (see below under "Examples and selecting the optimal strategy").

Note that "fra=$B" has the very same effect as "ext=&M" if the word class is not "NR" and the lemma does not begin with uppercase, and "fra=$B" has the very same effect as "ext=&N" if the word class is "NR" irrespective of letter case.

Large letter split
------------------

The syntax "$H" selects the large letter split strategy. It is fully automatic and cannot be tuned in any way. The lemma is split into single letters. This is most useful for but not restricted to Chinese ones. The lemma must not contain punctuation, spaces, dash "-", apo "'" and so-called combining accents, and it may be at most 16 characters long. Decorated Latin letters such as Swedish or Esperanto ones are tolerable but probably not useful to feed into this split. The resulting single letters are linked and categorized as mortyp "M". If another mortyp is needed, then the manual split must be used instead.

Words with only one fragment
----------------------------

For standalone words that are also roots (for example "sun") use the simple bare root strategy with syntax "$B". This will categorize the page as type "M" and main page of the category (sorting hint "-") but not link.

In Esperanto it might be desirable to have the lemma (for example "suno") with its "native" suffix but the category without it, this can be achieved by the "$S"-syntax and simple root split. This will categorize the page as type "N" and main page of the category (sorting hint "-") but not link.

This works even for standalone words differing from the root by case (for example "Sun" or "Suno"). This will categorize in the very same way but link to "sun" or "suno". Thus both "sun" and "Sun" will be main pages of the category under "-" as opposed to ordinary words as for example "sunshine" under "S" and (theoretically) "insunity" under "I".

For Esperanto standalone roots (prepositions, numerals, subordinators, some adverbs, ...) "$B" is the preferred solution.

For non-Esperanto proper nouns without any lowercase counterpart no split is the choice, and if desired the extra parameter to categorize the root (see below example word "Inverness" under "Examples and selecting the optimal strategy").

For affix lemmas (for example "meN-", "-kan", "-il-") use no split and the extra parameter with "&"-syntax. This will categorize the page as selected type (for example "I") and main page of the category (sorting hint "-").
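For the single-fragment words handled by "$B", the simple bare root rules given earlier can be sketched in Python as below. The name simple_bare_root, the flag is_nr (the answer to "Is it NR ie nonstandalone root?") and the returned dictionary are hypothetical and only illustrate the rules as worded in this specification.

```python
def simple_bare_root(lemma, is_nr=False):
    """Sketch of "$B": one category include, sometimes one link."""
    if is_nr:
        # Nonstandalone root: type "N", letter case kept, never a link.
        return {"category": ("N", lemma, True), "link": None}
    lowered = lemma[:1].lower() + lemma[1:]
    return {
        # Standalone root: type "M", main page of the category.
        "category": ("M", lowered, True),
        # Link only if the lemma began with uppercase (no self-link).
        "link": lowered if lowered != lemma else None,
    }
```

In this sketch both "sun" and "Sun" end up as main pages of the category of the root "sun" with type "M", and only "Sun" receives a link to "sun".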
Syntax of the extra parameter
-----------------------------

The extra parameter is evaluated only if compound (morpheme) categories are desired (globally enabled in the module and further conditions met), otherwise it is useless and ignored, and no error occurs, but the splitter may still be called and generate links.

The syntax of the extra parameter is similar to the syntax of the split control parameter requesting manual split, but there are some crucial differences and restrictions, and 3 enhancements ("!" and "&" and "X"). The difference is that nothing is visible, nothing is linked and there is no "sum check". The purpose of the extra parameter is to create extra compound (morpheme) categories. For example with the split control parameter we can link the lemma "perkeretaapian" to either "kereta api" or "kereta" and "api" but never both. To get both we need the extra parameter.

The extra parameter consists either of 1 to 4 fragments similar to those for manual split, or 2 char:s of a special value.

Restrictions for fragments:
- only type F210 (2 brackets, 1 colon/exclam, no slash) is permitted
- morpheme type must be specified, L is prohibited, only C I M N P U W left
- no arc brackets "(" ")", no plus "+", no slash "/"

Enhancement for fragments:
- separator after morpheme type can be not only colon ":" but also exclam "!" to request main page in category

Special value:
- char "&" followed by mortyp code (pseudo type "X" permitted here) to add page to compound (morpheme) category by pagename and as main page

The char "&" followed by a valid uppercase letter (one of 8, same as for the fragment syntax plus "X", but obviously not "L") creates one or two extra compound (morpheme) category includes. The type of the morpheme must be specified (see also the list above), namely one of "&C" "&I" "&M" "&N" "&P" "&U" "&W" creating one include, and "&X" creating two, namely types "M" and "W". Note that combinations other than "M" + "W" are obviously useless, and that pseudo type "L" is prohibited. The lemma is marked as main page of the category (key "-"). The "&"-syntax is useful for affix lemmas (note that dashes "-" must come from the lemma itself then, the splitter does not add any), whereas for standalone and nonstandalone roots the simple bare root and simple root split strategies are preferable.

Difficult cases
---------------

The splitter is designed to automate common cases, save typing work and minimize the risk of errors. But there are situations where the restrictions and sanity checks block seemingly easy solutions and create almost unsolvable challenges.

There are 4 versions of the morpheme that have to be managed, exemplified on the lemma "perkeretaapian" and its root "keretaapi" and last 2 letters "an":
- morpheme fed into the "sum check" cut from the pagename (here "keretaapi" and "an", anything else would fail the "sum check")
- linked morpheme (here "kereta api" and "-an", for example "keretaapi" is not a valid word or root)
- categorized morpheme (here "kereta api" with type "M" and "per-...-an" with type "C", for example categorizing "-an" only would be definitely inferior)
- shown morpheme (here "kereta( )api" and "an")

Here are the workarounds needed for such cases.
# link and categorize
  - the default and easy case, use "$B", or type F210 ie morpheme type + colon + morpheme
# link but do NOT categorize
  - use type F200, omit the morpheme type and colon
# categorize but do NOT link
  - put the morpheme type and morpheme into the extra parameter, type F210, and bare text if supposed to be visible into the split control parameter, type F000
# do NOT categorize and do NOT link
  - put bare text into the split control parameter, type F000
# link to several alternatives
  - put the main alternative into the split control parameter and further one or several alternatives into the extra parameter
  - this is frequently needed for "middle-level" compounds
# link and categorize but with different names
  - use the pseudo-type "L" if useful, otherwise use type F200 in the split control parameter and category in the extra parameter
  - the former is needed for all nonstandalone roots in Esperanto, the latter for compounds involving proper nouns or assimilations, for example "penektomio" and "ekstertero"

Examples and selecting the optimal strategy
-------------------------------------------

# "pembangkit listrik" (-id-) (-en-: "power plant")
  - default automatic split gives perfect result
# "pembangkit listrik tenaga surya" (-id-) (-en-: "solar power plant")
  - default automatic split gives good result
  - assisted split "fra=%0" gives a maybe better result and costs only 2 char:s
# "pertidaksamaan" (-id-)
  - note that there is most likely no lemma "tidak sama"
  - default automatic split gives no result
  - assisted split does not help either (it is not possible to add boundaries, only to block such)
  - use manual split "fra=[C:per-...-an/per][M:tidak][M:sama][C:per-...-an/an]" or "fra=[C:per-...-an/per]+[M:tidak]+[M:sama]+[C:per-...-an/an]"
# "perkeretaapian" (-id-)
  - note that there most likely is a lemma "kereta api" we want to link to
  - default automatic split gives no result
  - tempting solution "fra=[C:per-...-an/per][M:kereta api][C:per-...-an/an]" is invalid and fails the "sum check"
  - tempting solution "fra=[C:per-...-an/per][M:keretaapi][C:per-...-an/an]" is not against this specification but links to invalid morpheme "keretaapi"
  - tempting solution "fra=[C:per-...-an/per][M:kereta api/keretaapi][C:per-...-an/an]" is not against this specification and links to morpheme "kereta api" but shows invalid morpheme "keretaapi" on the screen
  - possible manual split "fra=[C:per-...-an/per][M:kereta][M:api][C:per-...-an/an]" links to "kereta" and "api" but not "kereta api"
  - possible manual split "fra=[C:per-...-an/per][M:kereta( )api][C:per-...-an/an]" links to "kereta api" and shows "kereta( )api" but less nice
  - probably better way to do "fra=[C:per-...-an/per]+[M:kereta( )api]+[C:per-...-an/an]" links to "kereta api" and shows "kereta( )api" and the extra parameter can be used to add categorization for "kereta" and "api" besides "kereta api"
# "mengirim" (-id-) (-en-: "send")
  - we want to link to "meN-" and "kirim"
  - default automatic split gives no result
  - tempting solution "fra=[P:meN][M:kirim/girim]" is invalid and fails the "sum check"
  - tempting solution "fra=[P:meng][N:irim]" links to wrong morphemes "meng-" and "irim"
  - tempting solution "fra=[P:meng][irim]" links to two wrong morphemes and "irim" is not categorized
  - tempting solution "fra=[P:meN-/men][M:kirim/girim]" shows one dubious ("men") and one wrong ("girim") morpheme
  - tempting solution "fra=[meN-/meng][M:(k)irim]" "meN-" is not categorized
  - tempting solution "fra=[P:meN/meng][M:(k)irim]" links to "meN", dash is not added since this is type F211
  - correct way to do "fra=[P:meN-/meng][M:(k)irim]"
  - correct way to do and nicer "fra=[P:meN-/meng]+[M:(k)irim]"
# "penectomy" (-en-)
  - the root is "penis" but is reduced to "pen"
  - correct way to do "fra=[M:pen(is)]+[U:ectomy]"
# "When in a hole, stop digging." (-en-)
  - default automatic split gives inferior result linking to "When"
  - use assisted split "fra=#0I" costs 3 char:s
# "When in Rome, do as the Romans do." (-en-)
  - default automatic split gives inferior result linking to "When" and "Romans"
  - assisted split "fra=#0I" is insufficient as it fixes "When" only but not "Romans"
  - use assisted split "fra=#0I #6" costs 6 char:s
  - manual split is possible but far more expensive "fra=[when/When] [in] [Rome], [do] [as] [the] [Roman/Romans] [do]." costs 61 char:s and note that "very raw manual split" would be even worse
# "sun" (-en-)
  - there is nothing to split and nothing to link to
  - use simple bare root strategy "$B"
# "Sun" (-en-)
  - we want to link to "sun"
  - use simple bare root strategy "$B"
# "Inverness" (-en-)
  - there is nothing to split and nothing to link to
  - use no split (default)
  - if you badly want to categorize the root "Inverness" (with uppercase "I") then use the extra parameter "" and "&M" no link but category of type "M" and main page in it
# "polvosucxilo" (-eo-)
  - we want to link to "polvo" and "sucxi" but categorize "polv" and "sucx"
  - default automatic split gives no result
  - tempting solution "[N:polv(o)]+[I:o]+[N:sucx(i)]+[I:il]+[U:o]" categorizes "polvo" and "sucxi", probably not what we intended
  - tempting solution "[N:polvo/polv(o)]+[I:o]+[N:sucxi/sucx(i)]+[I:il]+[U:o]" categorizes "polvo" and "sucxi", probably not what we intended and a bit horrible to type
  - tempting solution "[N:polv/polv(o)]+[I:o]+[N:sucx/sucx(i)]+[I:il]+[U:o]" links to "polv" and "sucx", probably not what we intended and a bit horrible to type
  - correct way to do "[L:polv(o)]+[I:o]+[L:sucx(i)]+[I:il]+[U:o]" links to "polvo" and "sucxi" but categorizes "polv" and "sucx"
  - note that same string "o" is used twice with different types of morpheme
# "penektomio" (-eo-)
  - the root is "penis(o)" but is reduced to "pen"
  - here the assimilation and nonstandalone root together create a dilemma that cannot be solved without the extra parameter
  - correct way to do "fra=[peniso/pen(iso)]+[I:ektomi]+[U:o]|ext=[N:penis]"
  - alternative way to do "fra=[L:penektomi(o)]+[U:o]|ext=[N:penis][I:ektomi]"
# "suno" (-eo-) (-en-: sun)
  - there is nothing to link to for the main root "sun"
  - most likely there is a page "sun" but it is not a valid Esperanto word, it's English and we do not want to link it here (maybe in the translation section instead)
  - tempting solution "fra=[N:sun(o)]+[U:o]" shows "sun(o) + o", tries to link to "suno" (self-link) and categorizes "suno", probably not what we intended
  - tempting solution "fra=[L:sun(o)]+[U:o]" shows "sun(o) + o", tries to link to "suno" (self-link)
  - use simple root split "fra=$S" shows "sun + o", "sun" is not linked, "o" links to "-o", "sun" is categorized as type "N" and main, "-o" as type "U"
# "Suno" (-eo-) (-en-: Sun)
  - we want to link to "suno"
  - use simple root split "fra=$S" shows "Sun + o", "Sun" links to "suno", "o" links to "-o", "Sun" is categorized as type "N" and main, "-o" as type "U"
# "terpomo" (-eo-) (-en-: "potato")
  - very typical compound
  - use manual split "fra=[L:ter(o)]+[L:pom(o)]+[U:o]"
# "ekstertero" (-eo-) (-en-: "outer space")
  - very tough compound
  - we want to link to "Tero" (-en-: "Earth", "planet Earth"), not to "tero" (-en-: "soil", "ground", "earth")
  - the root "ter" is common for "terpomo" and "ekstertero", making it case sensitive would be good for this word, but otherwise cause much more trouble than benefit, thus both "terpomo" and "ekstertero" will appear at both "tero" and "Tero"
  - tempting solution "[L:ekster(a)]+[L:ter(o)]+[U:o]" is suboptimal as it links to "tero"
  - tempting solution "[L:ekster(a)]+[L:Ter(o)]+[U:o]" is invalid and fails the "sum check"
  - correct way to do "[L:ekster(a)]+[Tero/ter(o)]+[U:o]" and "[N:ter]" links to "Tero", this is a unique tough case where the extra parameter is needed to circumvent the restrictions of the splitter
# "etfingro" (-eo-)
  - here an infix (originally intended to be located near the
end of a word and frequently even called "suffix" although never tolerable at very end of a word) became a prefix - correct way to do "[I:et]+[L:fingr(o)]+[U:o]" "GXakarto" (-eo-) (-en-: "Jakarta", city in Indonesia) - the root could be "gxakart" but we do not want to link to it - tempting solution "gxakart+[U:o]" is invalid and fails the "sum check" - tempting solution "$S" shows "GXakart + o", "GXakart" links to "gxakarto" (invalid target) - use no split (default) - if you badly want to categorize the suffix "-o" then use manual split "GXakart+[U:o]" - if you badly want to categorize both the root "gxakart" (with lowercase "gx") and the suffix "-o" then use manual split and the extra parameter "GXakart+[U:o]" and "[N:gxakart]" "-ist-" (-eo-) - there is nothing to split and nothing to link to - let the automatic split fail, alternatively use "-" for no split - use the extra parameter with "&"-syntax, type is "I" "" and "&I" "-ist" (en, sv) - there is nothing to split and nothing to link to - let the automatic split fail, alternatively use "-" for no split - use the extra parameter with "&"-syntax, type is "U" "" and "&U" "skolbok" (-sv-) - default automatic split gives no result - tempting solution "[M:skola][M:bok]" is invalid and fails the "sum check" - tempting solution "[M:skol][M:bok]" links to invalid morpheme "skol" - tempting solution "[L:skol(a)][M:bok]" links as supposed to "skola" but categorizes "skol" as type "N" - correct way to do "[M:skola/skol(a)][M:bok]" "[M:skola/skol(a)]+[M:bok]" - pseudo-type "L" is of no use here "varumaerke" (-sv-) - default automatic split gives no result - tempting solution "[L:var(a)u][M:maerke]" links to invalid morpheme "varau", and categorizes "varu" as type "N" that was probably not intended either - correct way to do "[M:vara/var(a)u][M:maerke]" "[M:vara/var(a)u]+[M:maerke]" - pseudo-type "L" is of no use here "loeparsko" (-sv-) (-en-: running shoe) - here two letters are stolen in the assimilation, one of them in a 
suffix - default automatic split gives no result - correct way to do "[M:loep(a)]+[U:-are/ar(e)]+[M:sko]" "#@" (-zh-) (note that chars "#" and "@" represent 2 Chinese letters here) - default automatic split gives no result - use the large letter split strategy "$H" if the morphemes are not standalone then manual split is needed Limitations and (lack of) further automatization (rationale) ------------------------------------------------------------ The splitter is language-independent (and somewhat script-independent) and does not know any grammar. Thus it cannot in the above example "When in Rome, do as the Romans do." change "Romans" to "Roman" although it looks like a trivial task. Adding (partial) support for grammar in some language (EN, EO, SV, ...) would start a never-ending mess and dissatisfaction. Automatic removing of some "common" affixes (-en-: plural or 3rd person "-s", -eo-: "-j" and "-n", -sv-: "-ar", ...) would cause false positives and much trouble. The simple principle is: no grammar and no dictionary. Similarly, the word "berjalan-jalan" will remain together, like all other words containing a dash. If you want to split for example the lemma "user-friendly" you have to use the (optimized) manual split. The word "o'clock" will remain together and so "there's" will, in order to split the latter to "there"+"'s" the manual approach is required. Also the splitter does not try to guess the earliest (linkable) morpheme to be a prefix or the last one to be a suffix.
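
Illustration of the "sum check" (sketch only)
---------------------------------------------

Several manual split examples above are rejected because they fail the "sum check". The Lua sketch below shows one way such a check could behave. It is not the module's real code. It assumes, based only on the examples above, that the visible text of every fragment (the part after "/" if present, with parenthesized letters dropped) together with the raw text between fragments (with "+" separators dropped) must add up to the lemma. The function names "visible_sum" and "sum_check" are hypothetical.

```lua
-- Hypothetical sketch, NOT the real splitter code.
-- Assumed rule: visible fragment texts, minus parenthesized parts,
-- must concatenate to the lemma; "+" between fragments is ignored.

local function visible_sum (fra)
  local out = {}
  local pos = 1
  while pos <= #fra do
    local ch = fra:sub(pos, pos)
    if ch == "[" then
      -- plain (non-pattern) search for the closing bracket
      local close = fra:find("]", pos, true)
      if not close then
        return nil -- unbalanced bracket
      end
      local body = fra:sub(pos + 1, close - 1)
      body = body:gsub("^%u:", "")  -- drop type prefix such as "M:"
      body = body:gsub("^.*/", "")  -- keep only visible text after "/"
      body = body:gsub("%b()", "")  -- drop parenthesized (hidden) letters
      out[#out + 1] = body
      pos = close + 1
    elseif ch == "+" then
      pos = pos + 1                 -- separator, not part of the lemma
    else
      out[#out + 1] = ch            -- raw text between fragments
      pos = pos + 1
    end
  end
  return table.concat(out)
end

local function sum_check (lemma, fra)
  return visible_sum(fra) == lemma
end

print(sum_check("mengirim", "[P:meN-/meng]+[M:(k)irim]")) -- true
print(sum_check("mengirim", "[P:meN][M:kirim/girim]"))    -- false
print(sum_check("skolbok", "[M:skola][M:bok]"))           -- false
```

Under these assumptions the sketch accepts and rejects the same manual splits as the examples above; whether the real splitter treats "+" and raw text exactly this way is not stated in this section.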
--]]