-- Modulo:utf8debug
-- Self-test is available on the page Ŝablono:debu.
--[===[
MODULE "UTF8DEBUG" (debug UTF8 text)
"eo.wiktionary.org/wiki/Modulo:utf8debug" <!--2024-Oct-18-->
"id.wiktionary.org/wiki/Modul:utf8debug"
"sv.wiktionary.org/wiki/Modul:utf8debug"
"eo.wikipedia.org/wiki/Modulo:Utf8debug"
"id.wikipedia.org/wiki/Modul:Utf8debug"
Purpose: allows debugging an incoming UTF8 string (directly submitted or
generated by a template) by splitting it into isolated chars,
checking the validity of the UTF8 stream and displaying chars and codes,
or by performing a "hard nowiki" and displaying the complete text
including spaces and line breaks
Utilo: ebligas sencimigi enirantan UTF8-signocxenon (rekte enigitan aux
generitan far sxablono) per dispecigo farigxante apartaj signoj,
kontrolante validecon de la UTF8-vico kaj montrante signojn kaj kodojn,
aux per efektivigo de "hard nowiki" kaj montrado de kompleta teksto
inkluzive spacojn kaj liniorompojn
Manfaat: memungkinkan ...
Syfte: moejliggoer att debugga en inkommande UTF8 straeng (direkt oeverlaemnad
eller ...
Used by templates / Uzata far sxablonoj:
* only "debu" (not to be called from any other place, to be
used only for debugging, see below)
Required submodules / Bezonataj submoduloj:
* none / neniuj
Required images:
* "File:Return arrow.svg", Public Domain
This module can accept parameters whether sent to itself (own frame) or
to the caller (caller's frame). If there is a parameter "caller=true"
on the own frame then that own frame is discarded in favor of the
caller's one.
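
The frame selection described above can be sketched as follows. This is
only an illustration: the function name "pickargs" and the mock frames are
hypothetical, whereas "args" and "getParent" are the names used by the
module's own code further below.

```lua
-- Hypothetical helper: pick the args of the own frame, or those of the
-- caller's frame if the own frame carries "caller=true".
local function pickargs (frame)
  local args = frame.args
  if (args['caller']=='true') then
    args = frame:getParent().args -- own frame discarded
  end
  return args
end

-- Mock frames (assumption: plain tables standing in for real frame objects).
local parentframe = { args = { 'text from template', outctl = 'nw' } }
local ownframe = {
  args = { caller = 'true' },
  getParent = function (self) return parentframe end
}
local directframe = { args = { 'text sent directly' } }
```

With these mocks, "pickargs (ownframe)" yields the caller's parameters and
"pickargs (directframe)" yields the parameters sent to the module itself.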
Incoming: * one anonymous and obligatory parameter
* input string (empty is legal but not very
useful, missing ie "nil" same as empty, 64 KiO max)
* two named and optional parameters
* "outctl=" output type selection control string (4 digits,
boolean or fourstate)
* show octet bloat ("0" or "1")
* show big boxes for single char:s ("0" or "1")
* show hard nowiki ("0", or "1" (no color), or "2" (colored),
or "3" (colored and split UTF8))
* show UTF8 char bloat ("0" or "1")
default is "1101", "0000" is prohibited, "nw" is synonymous
with "0010", empty input switches the type to "1000"
unless "empsil=1"
* "empsil=1" makes empty input return an empty string instead of
the default big red box
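
The meaning of the four digits in "outctl=" can be sketched as below. This
is a simplified illustration and not the module's actual validation code:
the function name "parseoutctl" and the field names of the returned table
are hypothetical, and the handling of empty input and "empsil=" is left out.

```lua
-- Hypothetical helper: turn the control string into named flags,
-- return "nil" for a prohibited or malformed string.
local function parseoutctl (strctl)
  if (strctl==nil) then
    strctl = '1101' -- default
  end
  if (strctl=='nw') then
    strctl = '0010' -- alias
  end
  if (strctl=='0000') then
    return nil -- prohibited
  end
  if (not string.match (strctl, '^[01][01][0-3][01]$')) then
    return nil -- wrong length or digit out of range
  end
  local numhnw = string.byte (strctl,3) - 48 -- fourstate digit 0...3
  return {
    octetbloat = (string.sub (strctl,1,1)=='1'),
    bigboxes   = (string.sub (strctl,2,2)=='1'),
    hardnowiki = (numhnw>=1),
    colored    = (numhnw>=2),
    splitutf8  = (numhnw==3),
    utf8bloat  = (string.sub (strctl,4,4)=='1'),
  }
end
```

For example "parseoutctl ('nw')" enables only the uncolored hard nowiki,
and "parseoutctl ('0030')" enables the colored hard nowiki with split UTF8.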
Returned: * large text with complicated wikicode, empty possible
This module is unbreakable (when called with correct module name
and function name).
Cxi tiu modulo estas nerompebla (kiam vokita kun gxustaj nomo de modulo
kaj nomo de funkcio).
This module is special in that it can seem unused and useless. Do not
delete it just because no pages transclude it. Its purpose is not to be used
in article, lemma, appendix or whatever pages. It is intended to be used
temporarily when debugging UTF8 text, preferably from the sandbox. With the
option "hard nowiki" it can even be used for documentation and selftest of
modules and templates. Then the proxy template "debu" can be classed as
a documentation template. Still the template "pate" is a better choice
for this purpose.
Note that "<nowiki>" does NOT work in wikitext generated by a module. We
must dec-encode instead. This works for the common problem char:s ":#*='[]"
(there is no problem with curly "{}"). But dec-encoding does NOT work for UTF8
multi-octet char:s. So we dec-encode only some ANSI/ASCII char:s $00...$7F
and let the remaining ones pass unchanged (both for big boxes mode and hard
nowiki mode). Note that dec-encoding does NOT work for LF either. We catch LF
separately in any case, and in the big boxes mode we show its name "LF",
whereas in the hard nowiki mode we show an arrow as an image.
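
A minimal sketch of the dec-encoding idea, under simplifying assumptions:
the function name "decencode" is hypothetical, and the special handling of
LF, repeated spaces and codes below 32 done by the real module is omitted
here. Only printable ASCII other than 0...9 A...Z a...z is encoded, all
other octets (including UTF8 multi-octet char:s) pass unchanged.

```lua
-- Hypothetical helper: dec-encode risky printable ASCII char:s.
local function decencode (strin)
  local tabout = {}
  for numpos = 1, string.len (strin) do
    local numoct = string.byte (strin, numpos)
    local boosafe = ((numoct>=48) and (numoct<=57))  -- 0...9
                 or ((numoct>=65) and (numoct<=90))  -- A...Z
                 or ((numoct>=97) and (numoct<=122)) -- a...z
    if ((numoct>32) and (numoct<127) and (not boosafe)) then
      tabout [#tabout+1] = '&#' .. tostring (numoct) .. ';'
    else
      tabout [#tabout+1] = string.char (numoct) -- pass unchanged
    end
  end
  return table.concat (tabout)
end
```

For example "decencode ('[[a]]')" gives "&#91;&#91;a&#93;&#93;", which is
displayed as plain brackets instead of being parsed as a link.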
In text coming from a module some evil stuff (invalid UTF8 sequence, ZERO,
FF/12, ZWSP, LRM, RLM) is replaced with U+FFFD by MediaWiki, whereas
other dubious content (TAB, CR, NBSP, BOM) survives.
Color coding of the result in the big boxes mode:
  1  white         ordinary ANSI/ASCII char
  2  light grey    valid 2-octet UTF8 with some exceptions
  3  grey          valid 3-octet UTF8 with some exceptions
  4  dark grey     valid 4-octet UTF8 (with no exceptions yet)
  5  red           code ZERO or invalid UTF8 sequence (plus empty input,
                   not limited to the big boxes mode)
  6  yellow        dubious TAB CR NBSP ZWSP LRM RLM BOM
  7  light yellow  invisible LF SPACE
  8  light blue    initial octet bloat report (blue except on empty
                   input) and final UTF8 bloat report
Error <<FATAL in "utf8debug" : internal error or invalid
parameter>> is NOT included in the above list, possible causes:
* internal error
* input string too long
* extraneous anonymous parameter
* parameter "outctl=" or "empsil=" bad
Some interesting UTF8 codepoints:
  --------- ----------- ------------------------ ------ ----------------------
  codepoint codepoint   UTF-8                    short  official name and
  HEX       DEC         encoding                 name   silly notes
  --------- ----------- ------------------------ ------ ----------------------
  $0000     #00'000                              ZERO
  $0009     #00'009                              TAB
  $000A     #00'010                              LF
  $000D     #00'013                              CR
  $0020     #00'032                              SPACE
  $007F     #00'127                                     inclusive end of 1-oct
  $0080     #00'128     $C2,$80                         begin of 2-oct
  $00A0     #00'160     $C2,$A0                  NBSP   don't break me
  $00BF     #00'191     $C2,$BF                         inclusive end of $C2,xx
  $00C0     #00'192     $C3,$80                         begin of $C3,xx
  $00FF     #00'255     $C3,$BF                         inclusive end of $C3,xx
  $0100     #00'256     $C4,$80                         begin of $C4,xx
  $0200     #00'512     $C8,$80                         uppercase "A" with something above
  $0300     #00'768     $CC,$80                         strange horizontally misplaced apo
  $034F     #00'847     $CD,$8F                         COMBINING GRAPHEME JOINER
  $0401     #01'025     $D0,$81                         CCCP letter with case delta $50
  $0451     #01'105     $D1,$91                         CCCP letter with case delta $50
  $07FF     #02'047     $DF,$BF                         inclusive end of 2-oct
  $0800     #02'048     $E0,$A0,$80                     begin of 3-oct
  $200B     #08'203     $E2,$80,$8B              ZWSP   ZERO WIDTH SPACE
  $200C     #08'204     $E2,$80,$8C              ZWNJ   ZERO WIDTH NON-JOINER
  $200D     #08'205     $E2,$80,$8D              ZWJ    ZERO WIDTH JOINER
  $200E     #08'206     $E2,$80,$8E              LRM    LEFT-TO-RIGHT MARK
  $200F     #08'207     $E2,$80,$8F              RLM    RIGHT-TO-LEFT MARK
  $2060     #08'288     $E2,$81,$A0                     absurd "WORD JOINER"
  $2068     #08'296     $E2,$81,$A8              FSI    FIRST STRONG ISOLATE
  $20AC     #08'364     $E2,$82,$AC                     EURO (bank robbery sign)
  $D7FF     #55'295     $ED,$9F,$BF                     last before banned range
  $D800     #55'296     ($ED,$A0,$80)                   begin of banned range
  $DFFF     #57'343     ($ED,$BF,$BF)                   inclusive end of banned range
  $E000     #57'344     $EE,$80,$80                     begin of legal range again
  $FEFF     #65'279     $EF,$BB,$BF 239,187,191  BOM    absurd "BOM" sigi
  $FFFD     #65'533     $EF,$BF,$BD 239,191,189         REPLACEMENT CHARACTER
  $FFFE     #65'534     $EF,$BF,$BE 239,191,190         invalid (last 2)
  $FFFF     #65'535     $EF,$BF,$BF 239,191,191         invalid (last 2), inclusive end of 3-oct
  $01'0000  #65'536     $F0,$90,$80,$80                 begin of 4-oct
  $01'0348  #66'376     $F0,$90,$8D,$88                 one of few somewhat known
  $0F'FFFF  #1'048'575  $F3,$BF,$BF,$BF                 one Mi almost reached
  $10'0000  #1'048'576  $F4,$80,$80,$80                 one Mi reached here and no end yet
  $10'FFFE  #1'114'110  $F4,$8F,$BF,$BE                 invalid (last 2)
  $10'FFFF  #1'114'111  $F4,$8F,$BF,$BF                 invalid (last 2), inclusive end of unicode
  $11'0000  #1'114'112  ($F4,$90,$80,$80)               invalid (finally out of range)
  --------- ----------- ------------------------ ------ ----------------------
* UTF8 is defined by "RFC 3629" from 2003-Nov (though it already
existed before)
* UTF8 sigi AKA BOM : HEX: $EF $BB $BF | DEC: 239 187 191 | ABS: $FEFF
* absolute unicode range has 17 (seventeen !!!) planes of 65'536 values each
* totally 1'114'112 codepoints, most of them are unused, plane ZERO is
somewhat full, other ones are almost or totally empty
* official notation: "U+0000" ... "U+10FFFF"
* codepoint range ZERO to 31 is valid by RFC but mostly useless, same for
127, range 128 to 159, whereas 160 AKA "&nbsp;" does appear in wikitext
* range "U+D800" to "U+DFFF" is invalid by RFC
* UTF8 starting octet can be only $C2 to $DF , $E0 to $EF , $F0 to $F4
giving a continuous range from $C2 to $F4 of size $33 = #51 values
* UTF8 subsequent octet:s (1 or 2 or 3) can be only $80 to $BF
(6 bit:s, 64 possible values)
* octet values $C0, $C1 and $F5 to $FF may never appear in a UTF8 stream
Abs. char number range | UTF8 octet sequence            | beginning octet
(hexadecimal)          | (binary)                       |
-----------------------+--------------------------------+------------------
0000'0000 to 0000'007F | 0xxxxxxx                       | $00 to $7F
0000'0080 to 0000'07FF | 110xxxxx 10xxxxxx              | $C0 -> $C2 to $DF
0000'0800 to 0000'FFFF | 1110xxxx 10xxxxxx 10xxxxxx     | $E0 to $EF
0001'0000 to 0010'FFFF | 11110xxx 10xxxxxx 10xxxxxx ... | $F0 to $F7 -> $F4
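
The bit layout in the table above can be tried out in the opposite
direction (codepoint to octets) with the sketch below. It is not part of
the module, which only decodes; the function name "utf8encode" is
hypothetical. It uses plain arithmetic only, since bitwise operators
are avoided throughout this module.

```lua
-- Hypothetical helper: encode one codepoint into a table of 1...4 octets,
-- return "nil" for the banned range $D800...$DFFF and for out of range.
local function utf8encode (numcp)
  local function cont (numval) -- subsequent octet from the low 6 bit:s
    return 128 + (numval % 64)
  end
  if ((numcp<0) or (numcp>1114111)) then
    return nil -- outside "U+0000" ... "U+10FFFF"
  end
  if ((numcp>=55296) and (numcp<=57343)) then
    return nil -- banned range
  end
  if (numcp<128) then
    return {numcp} -- 0xxxxxxx
  end
  if (numcp<2048) then
    return {192 + math.floor (numcp/64), cont (numcp)} -- 110xxxxx
  end
  if (numcp<65536) then
    return {224 + math.floor (numcp/4096),
            cont (math.floor (numcp/64)), cont (numcp)} -- 1110xxxx
  end
  return {240 + math.floor (numcp/262144), cont (math.floor (numcp/4096)),
          cont (math.floor (numcp/64)), cont (numcp)} -- 11110xxx
end
```

For example "utf8encode (8364)" gives the octets 226, 130, 172, ie
$E2,$82,$AC, matching the encoding of $20AC listed earlier.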
]===]
local exporttable = {}
------------------------------------------------------------------------
---- CONSTANTS [O] ----
------------------------------------------------------------------------
-- constant strings (error circumfixes)
local constrelabg = '<span class="error"><b>' -- lagom whining begin
local constrelaen = '</b></span>' -- lagom whining end
local constrlaxhu = ' # # ' -- lagom -> huge circumfix
-- HTML stuff for our tiny table and background around every char
local constrtabu3 = '<table style="display:inline-block; vertical-align:middle; margin:0.15em; padding:0.15em; border:0.15em solid #000000; text-align:center; background-color:#' -- color code still missing here, the element is closed by "constrtabu4"
local constrtabu4 = ';"><tr><td>'
local constrtabu5 = '</td></tr></table>'
local constrbkg3 = '<span style="font-size:160%;background-color:#E0A0FF;"> '
local constrbkg4 = ' </span>'
local constrpilen = '[[File:Return arrow.svg|20px|link=]]' -- the file is Public Domain
-- color for "lfiultencode"
local contabempatwarna = {[0]='FFA0A0','D0FFD0','A0A0FF','D0D0D0'} -- red, light green, blue, light grey
-- color for main for big boxes and summary boxes
-- 1 white default, 2...4 grey getting darker,
-- 5 red (also bloat box), 6 yellow, 7 light yellow, 8 light blue
-- fill the gap between "constrtabu3" and "constrtabu4" always with
-- help of this table, do NOT put hardcoded color values there
local contabwar8na = {'FFFFFF','E8E8E8','D0D0D0','B8B8B8','FF6060','FFFF60','FFFFB0','C8C8FF'} -- (index 1...8)
-- known codepoints
-- invalid sequence or codepoint ZERO -> "R" -> "red class error"
-- TAB CR NBSP ZWSP LRM RLM BOM -> "Y" -> "yellow class error"
-- LF SPACE -> "L" -> "light yellow class char"
local contabcodepoints = {}
contabcodepoints [ -1] = {'' , 'R'} -- pseudo codepoint, name not used -> "constrinvalid"
contabcodepoints [ 0] = {'ZERO' , 'R'}
contabcodepoints [ 9] = {'TAB' , 'Y'}
contabcodepoints [ 10] = {'LF' , 'L'}
contabcodepoints [ 13] = {'CR' , 'Y'}
contabcodepoints [ 32] = {'SPACE' , 'L'}
contabcodepoints [ 160] = {'NBSP' , 'Y'}
contabcodepoints [ 8203] = {'ZWSP' , 'Y'}
contabcodepoints [ 8206] = {'LRM' , 'Y'}
contabcodepoints [ 8207] = {'RLM' , 'Y'}
contabcodepoints [65279] = {'BOM' , 'Y'}
-- constant strings EN vs EO vs ID vs SV
-- local constrkosong = 'empty string submitted' -- EN
local constrkosong = 'malplena signocxeno transdonita' -- EO
-- local constrkosong = 'string datang bersifat kosong' -- ID
-- local constrkosong = 'inkommen string aer tom' -- SV
-- local constrinvalid = 'invalid UTF8 value sequence' -- EN
local constrinvalid = 'nevalida sekvo de UTF8-valoroj' -- EO
-- local constrinvalid = 'rantai nilai UTF8 bersifat invalid' -- ID
-- local constrinvalid = 'ogiltig sekvens av UTF8-vaerden' -- SV
------------------------------------------------------------------------
---- MATH FUNCTIONS [E] ----
------------------------------------------------------------------------
-- Local function MATHDIV
local function mathdiv (xdividend, xdivisor)
local resultdiv = 0 -- Lua 5.1 has no integer DIV operator :-(
resultdiv = math.floor (xdividend / xdivisor)
return resultdiv
end--function mathdiv
-- Local function MATHMOD
local function mathmod (xdividendo, xdivisoro)
local resultmod = 0 -- MOD operator is "%", bitwise AND operator is lacking too
resultmod = xdividendo % xdivisoro
return resultmod
end--function mathmod
------------------------------------------------------------------------
-- Local function MATHXOR
-- Depends on functions :
-- [E] mathdiv mathmod
local function mathxor (xa, xb)
local resultxor = 0
local crap6 = 0
local crap7 = 0
local crap8 = 1 -- single bit value 1 -> 2 -> 4 -> 8 ...
while true do
if ((xa==0) and (xb==0)) then
break -- we have run out of bits on both
end--if
crap6 = mathmod (xa,2) -- pick bit before dividing
crap7 = mathmod (xb,2) -- pick bit before dividing
xa = mathdiv (xa,2) -- shift right
xb = mathdiv (xb,2) -- shift right
if (crap6~=crap7) then
resultxor = resultxor + crap8 -- add one bit rtl only if true
end--if
crap8 = crap8 * 2
end--while
return resultxor
end--function mathxor
------------------------------------------------------------------------
---- NUMBER CONVERSION FUNCTIONS [N] ----
------------------------------------------------------------------------
-- Local function LFDEC1DIGIT
-- Convert 1 decimal ASCII digit to integer 0...9 (255 if invalid).
local function lfdec1digit (num1digit)
num1digit = num1digit - 48 -- may become invalid
if ((num1digit<0) or (num1digit>9)) then
num1digit = 255
end--if
return num1digit
end--function lfdec1digit
------------------------------------------------------------------------
-- Local function LFNUINT8TOHEX
-- Convert UINT8 (0...255) to a 2-digit hex string.
-- Depends on functions :
-- [E] mathdiv mathmod
local function lfnuint8tohex (numinclow)
local strheksulo = ''
local numhajhaj = 0
numhajhaj = mathdiv (numinclow,16)
numinclow = mathmod (numinclow,16)
if (numhajhaj>9) then
numhajhaj = numhajhaj + 7 -- now 0...9 or 17...22
end--if
if (numinclow>9) then
numinclow = numinclow + 7 -- now 0...9 or 17...22
end--if
strheksulo = string.char (numhajhaj+48) .. string.char (numinclow+48)
return strheksulo
end--function lfnuint8tohex
------------------------------------------------------------------------
-- Local function LFUINT32TOHEX
-- Convert UINT32 (0 ... $FFFF'FFFF = #4'294'967'295) to
-- a (2 or 4 or 6 or 8)-digit hex string.
-- Depends on functions :
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod
local function lfuint32tohex (numincom)
local strheksulego = ''
while true do
strheksulego = lfnuint8tohex ( mathmod (numincom,256) ) .. strheksulego
numincom = mathdiv (numincom,256)
if (numincom==0) then
break
end--if
end--while
return strheksulego
end--function lfuint32tohex
------------------------------------------------------------------------
---- LOW LEVEL STRING FUNCTIONS [G] ----
------------------------------------------------------------------------
-- test whether char is an ASCII digit "0"..."9", return boolean
local function lfgtestnum (numkaad)
local boodigit = false
boodigit = ((numkaad>=48) and (numkaad<=57))
return boodigit
end--function lfgtestnum
------------------------------------------------------------------------
-- test whether char is an ASCII uppercase letter, return boolean
local function lfgtestuc (numkode)
local booupperc = false
booupperc = ((numkode>=65) and (numkode<=90))
return booupperc
end--function lfgtestuc
------------------------------------------------------------------------
-- test whether char is an ASCII lowercase letter, return boolean
local function lfgtestlc (numcode)
local boolowerc = false
boolowerc = ((numcode>=97) and (numcode<=122))
return boolowerc
end--function lfgtestlc
------------------------------------------------------------------------
-- Local function LFGIS62SAFE
-- Test whether incoming ASCII char is very safe (0...9 A...Z a...z).
-- Depends on functions :
-- [G] lfgtestnum lfgtestuc lfgtestlc
local function lfgis62safe (numcxair)
local booguud = false
booguud = lfgtestnum (numcxair) or lfgtestuc (numcxair) or lfgtestlc (numcxair)
return booguud
end--function lfgis62safe
------------------------------------------------------------------------
---- SOME FUNCTIONS ---- !!!FIXME!!!
------------------------------------------------------------------------
-- Local function LFHEXDEC
-- Example output : "$FE=#254" (we have to save text width)
-- Depends on "lfnuint8tohex"
local function lfhexdec (numkodo)
local strrezulto = ''
strrezulto = "$" .. lfnuint8tohex (numkodo) .. "=#" .. tostring (numkodo)
return strrezulto
end--function lfhexdec
------------------------------------------------------------------------
-- Local function LFNUMTODECBUN
-- Convert non-negative integer to decimal string with bunching.
-- Depends on functions :
-- [E] mathdiv mathmod
local function lfnumtodecbun (numnomoriin)
local strnomorut = ''
local numindeex = 0
local numcaar = 0
numnomoriin = math.floor (numnomoriin) -- transcendental numbers suck
if (numnomoriin<0) then
numnomoriin = 0 -- negative numbers suck
end--if
while true do
numcaar = mathmod(numnomoriin,10) + 48 -- get digit moving right to left
numnomoriin = mathdiv(numnomoriin,10)
if (numindeex==3) then
strnomorut = "'" .. strnomorut -- ueglstr apo
numindeex = 0
end--if
strnomorut = string.char(numcaar) .. strnomorut -- ueglstr digit
numindeex = numindeex + 1
if (numnomoriin==0) then
break
end--if
end--while
return strnomorut
end--function lfnumtodecbun
------------------------------------------------------------------------
---- UTF8 FUNCTIONS [U] ----
------------------------------------------------------------------------
-- Local function LFULNUTF8CHAR
-- Evaluate length of a single UTF8 char in octet:s.
-- Input : * numbgoctet -- beginning octet of a UTF8 char
-- Output : * numlen1234x -- number 1...4 or ZERO if invalid
-- Does NOT thoroughly check the validity, looks at 1 octet only.
local function lfulnutf8char (numbgoctet)
local numlen1234x = 0
if (numbgoctet<128) then
numlen1234x = 1 -- $00...$7F -- ANSI/ASCII
end--if
if ((numbgoctet>=194) and (numbgoctet<=223)) then
numlen1234x = 2 -- $C2 to $DF
end--if
if ((numbgoctet>=224) and (numbgoctet<=239)) then
numlen1234x = 3 -- $E0 to $EF
end--if
if ((numbgoctet>=240) and (numbgoctet<=244)) then
numlen1234x = 4 -- $F0 to $F4
end--if
return numlen1234x
end--function lfulnutf8char
------------------------------------------------------------------------
-- Local function LFUTF8DEKO
-- Decode a single UTF8 char, return ZERO length if invalid.
-- Output : * "tabresult" -- LUA table [0] length and [1] codepoint
-- Depends on functions :
-- [E] mathdiv mathmod mathxor
local function lfutf8deko (num0, num1, num2, num3)
local tabresult = {}
local numlength = 0 -- preASSume invalid
local numkodepoin = 0 -- preASSume invalid
num1 = mathxor (num1,128) -- XOR 3 of 4
num2 = mathxor (num2,128) -- XOR 3 of 4
num3 = mathxor (num3,128) -- XOR 3 of 4
while true do -- fake loop
if ((num0>193) and (num1>63)) then
break -- to join mark
end--if
if ((num0>223) and (num2>63)) then
break -- to join mark
end--if
if ((num0>239) and (num3>63)) then
break -- to join mark
end--if
if (num0<128) then -- ZERO to $7F
numkodepoin = num0
numlength = 1
break -- to join mark
end--if
if ((num0>193) and (num0<224)) then -- $C0 # $C2 to $DF
numkodepoin = (mathxor(num0,192)) * 64 + num1
if ((numkodepoin>127) and (numkodepoin<2048)) then
numlength = 2
end--if
break -- to join mark
end--if
if ((num0>223) and (num0<240)) then -- $E0 to $EF
numkodepoin = (mathxor(num0,224)) * 4096 + num1 * 64 + num2
if (((numkodepoin>2047) and (numkodepoin<55296)) or ((numkodepoin>57343) and (numkodepoin<65536))) then
numlength = 3
end--if
break -- to join mark
end--if
if ((num0>239) and (num0<245)) then -- $F0 to $F7 # $F4
numkodepoin = (mathxor(num0,240)) * 262144 + num1 * 4096 + num2 * 64 + num3
if ((numkodepoin>65535) and (numkodepoin<1114112)) then
numlength = 4
end--if
break -- to join mark
end--if
break -- finally to join mark
end--while -- fake loop -- join mark
tabresult [0] = numlength
tabresult [1] = numkodepoin
return tabresult
end--function lfutf8deko
------------------------------------------------------------------------
---- HIGH LEVEL STRING FUNCTIONS [I] ----
------------------------------------------------------------------------
-- Local function LFIULTENCODE
-- Generously encode char:s to prevent parsing and show hex if needed, make
-- single chars visible, bypass all wiki parsing and HTML parsing. Our cool
-- module has brewed something with "[["..."]]" and repeated spaces but we
-- want to see plain text for debugging purposes. Thus we dec-encode some
-- char:s, use NBSP to fix spaces, workaround EOL, and maybe add color.
-- Input : * strkrampuj : string, empty tolerable, but type "nil" is NOT
-- * nummxwidth : maximal width of text (20...200, default 80)
-- * boowarrna : "true" to enable color
-- * boosplitutf : "true" to split UTF8 char:s into hex numbers
-- Output : * strkood : string, empty in worst case
-- Depends on functions :
-- [U] lfulnutf8char
-- [G] lfgtestnum lfgtestuc lfgtestlc lfgis62safe
-- [N] lfnuint8tohex
-- [E] mathdiv mathmod
-- Depends on constants :
-- * string constrpilen [[File:...]]
-- * table contabempatwarna 0...3
-- This helps with:
-- * "[["..."]]", "["..."]", "*", "#", ":" (note that there is no
-- problem with plain "{{"..."}}")
-- * multiple spaces (they are no longer reduced to one piece due to HTML)
-- * EOL:s (they do not vanish in favor of spaces due to HTML, instead
-- the EOL arrow is shown)
-- * too long lines (they are force-broken)
-- * codes below 32 other than EOL
-- There is also "mw.text.nowiki" with some limitations, most notably
-- about multiple spaces and EOL:s.
-- In order to fix EOL we show the EOL arrow (preceded by space) for every
-- incoming LF, but do a "<br>" only once after multiple subsequent LF:s.
-- We must be UTF8-aware. A UTF8 char must be either split into hex codes,
-- or preserved over its complete length ie not split nor encoded at all.
-- Note that this causes BLOAT. The caller is responsible for
-- adding "<big>"..."</big>" if desired.
local function lfiultencode (strkrampuj,nummxwidth,boowarrna,boosplitutf)
local stronechar = ''
local strkolorr = ''
local strkood = ''
local numstrlne = 0
local numpeekynx = 1 -- ONE-based index
local numcahr = 0
local numcxxhr = 0
local numutf8len = 0
local numaccuwidth = 0 -- accumulated width
local numcolor = 0 -- 0,1,2,3 -- red, light green, blue, light grey
local boonbsp = true -- "true" needed for junk lines containing only space
local boosplnow = false -- allow forced split in some cases
local boofickpilen = false -- true after LF arrow causes "<br>" later
if (type(nummxwidth)~='number') then
nummxwidth = 80
end--if
if ((nummxwidth<20) or (nummxwidth>200)) then
nummxwidth = 80
end--if
numstrlne = string.len (strkrampuj)
while true do -- outer genuine loop
if (numpeekynx>numstrlne) then
break
end--if
numcahr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1 -- ONE-based index
while true do -- inner fake loop
if (numcahr==10) then
break -- to join mark -- inner fake loop -- special processing for LF
end--if
if (numcahr==32) then
if (boonbsp) then
stronechar = '&nbsp;' -- NBSP, this prevents space reduction
else
stronechar = ' '
end--if
boonbsp = not boonbsp
break -- to join mark -- inner fake loop
end--if
if (numcahr<32) then
stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}' -- always as hex
break -- to join mark -- inner fake loop
end--if
if (numcahr>127) then
boosplnow = boosplitutf
numutf8len = lfulnutf8char (numcahr)
if (numutf8len==0) then
boosplnow = true -- forced split for broken UTF8 sequence
else
numutf8len = numutf8len - 1 -- more char:s to pick
end--if
if ((numpeekynx+numutf8len)>(numstrlne+1)) then
boosplnow = true -- forced split for truncated UTF8 sequence
end--if
if (boosplnow) then
stronechar = '{$' .. lfnuint8tohex (numcahr) .. '}'
else
stronechar = string.char (numcahr) -- preserve "numcahr" below
while true do -- deep loop copy UTF8 char
if (numutf8len==0) then
break
end--if
numcxxhr = string.byte (strkrampuj,numpeekynx,numpeekynx)
numpeekynx = numpeekynx + 1
numutf8len = numutf8len - 1
stronechar = stronechar .. string.char (numcxxhr)
end--while -- deep loop copy UTF8 char
end--if
break -- to join mark
end--if (numcahr>127) then
if (lfgis62safe(numcahr)) then -- safe ASCII ie 0...9 A...Z a...z
stronechar = string.char (numcahr) -- do NOT encode safe char:s
break -- to join mark
end--if
stronechar = '&#' .. tostring (numcahr) .. ';' -- dec-encode some ASCII
break -- finally to join mark
end--while -- inner fake loop -- join mark
if (numcahr==10) then
if (numaccuwidth>=nummxwidth) then
strkood = strkood .. '<br>'
numaccuwidth = 0
boonbsp = true -- "true" needed for junk lines containing only space
end--if
strkood = strkood .. ' ' .. constrpilen
numaccuwidth = numaccuwidth + 2 -- counts doubly
boofickpilen = true
else
if (boofickpilen or (numaccuwidth>=nummxwidth)) then
strkood = strkood .. '<br>'
numaccuwidth = 0
boonbsp = true -- "true" needed for junk lines containing only space
end--if
if (boowarrna) then
strkolorr = contabempatwarna [numcolor]
numcolor = mathmod ((numcolor+1),4) -- index 0...3
strkood = strkood .. '<span style="background-color:#' .. strkolorr .. ';">' .. stronechar .. '</span>'
else
strkood = strkood .. stronechar
end--if
numaccuwidth = numaccuwidth + 1
boofickpilen = false
end--if (numcahr==10) else
end--while -- outer genuine loop
return strkood
end--function lfiultencode
------------------------------------------------------------------------
-- Local function LFIVALIUMDCTLSTR
-- Validate control string against restrictive pattern (dec).
-- Input : * strresdpat -- restrictive pattern (max 200 char:s)
-- * strctldstr -- incoming suspect
-- Output : * numbadpos -- bad position, or 254 wrong length, or 255 success
-- Depends on functions :
-- [N] lfdec1digit
-- Content of restrictive pattern:
-- * "." -- skip check
-- * "-" and "?" -- must match literally
-- * digit "1"..."9" ("0" invalid) -- inclusive upper limit (min ZERO)
local function lfivaliumdctlstr (strresdpat, strctldstr)
local numlenresdpat = 0
local numldninkom = 0
local numcomperindex = 0 -- ZERO-based
local numead2 = 0
local numead3 = 0
local numbadpos = 254 -- preASSume guilt (len differ or too long or ...)
local booddaan = false
numlenresdpat = string.len(strresdpat)
numldninkom = string.len(strctldstr)
if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then
while true do
if (numcomperindex==numlenresdpat) then
numbadpos = 255
break -- success
end--if
numead2 = string.byte(strresdpat,(numcomperindex+1),(numcomperindex+1)) -- rest
numead3 = string.byte(strctldstr,(numcomperindex+1),(numcomperindex+1)) -- susp
booddaan = false
if ((numead2==45) or (numead2==63)) then
if (numead2~=numead3) then
numbadpos = numcomperindex
break -- "-" and "?" must match literally
end--if
booddaan = true -- position OK
end--if
if (numead2==46) then -- skip for dot "."
booddaan = true -- position OK
end--if
if (not booddaan) then
numead2 = lfdec1digit(numead2) -- rest
if (numead2>9) then -- limit defined or bad ??
numbadpos = 254
break -- bad restrictive pattern
else
numead3 = lfdec1digit(numead3) -- susp
if (numead3>numead2) then
numbadpos = numcomperindex
break -- value limit violation
end--if
end--if (numead2>9) else
end--if (not booddaan) then
numcomperindex = numcomperindex + 1
end--while
end--if ((numlenresdpat<=200) and (numlenresdpat==numldninkom)) then
return numbadpos
end--function lfivaliumdctlstr
------------------------------------------------------------------------
---- VARIABLES [R] ----
------------------------------------------------------------------------
function exporttable.ek (arxframent)
-- general unknown type
local vartamp = 0 -- variable without type
-- special type "args" AKA "arx"
local arxsomons = 0 -- metaized "args" from our own or caller's "frame"
-- general "tab"
local tabutf8dec = {}
-- general "str"
local strinctx = '' -- incoming text from anon parameter
local strctrl = '' -- from optional parameter "outctl="
local strmytemp = ''
local strret = '' -- final output string
-- general "num"
local numinctx = 0 -- length of incoming text in octets
local numchrlen = 0 -- number of UTF8 char:s
local numtymp = 0
-- general "boo"
local boocrap = false
local boopendlf = false -- pending LF between sections
-- more "boo" from parameters
local booempsil = false -- from "empsil=1"
local boooktblo = false -- from "outctl="
local boobigbox = false -- from "outctl=" show big boxes
local boohardnw = false -- from "outctl=" fourstate "true" from "1" "2" "3"
local boohnwcol = false -- from "outctl=" fourstate "true" from "2" "3"
local boohnwspt = false -- from "outctl=" fourstate "true" from "3" only
local booutfblo = false -- from "outctl=" show UTF8 char bloat
------------------------------------------------------------------------
---- MAIN [Z] ----
------------------------------------------------------------------------
---- GUARD AGAINST INTERNAL ERROR ----
-- "constrkosong" and "constrinvalid" must be uncommented and assigned
-- note that reporting of this error may NOT depend on uncommentable strings
boocrap = ((type(constrkosong)~='string') or (type(constrinvalid)~='string'))
---- GET THE ARX (ONE OF TWO) ----
if (not boocrap) then
arxsomons = arxframent.args -- "args" from our own "frame"
vartamp = arxsomons ['caller']
if (vartamp=='true') then
arxsomons = arxframent:getParent().args -- "args" from caller's "frame"
end--if
end--if
---- CHECK ----
if (not boocrap) then
if (type(arxsomons[2])=='string') then
boocrap = true -- too much
end--if
end--if
---- SEIZE ONE ANONYMOUS AND OBLIGATORY PARAMETER ----
-- on success assign "strinctx" and "numinctx" (not to be touched later)
if (not boocrap) then
vartamp = arxsomons [1]
if (type(vartamp)=="string") then
numinctx = string.len (vartamp)
if (numinctx>65536) then
boocrap = true -- too big, encoding this would cause excessive bloat
else
strinctx = vartamp
end--if
end--if (type(vartamp)=="string") then
end--if
---- SEIZE AND CHECK NAMED AND OPTIONAL PARAMETER WITH CONTROL STRING ----
-- default is "1101", "0000" is prohibited, "nw" is synonymous
-- with "0010", empty input switches the type to "1000"
if (not boocrap) then
do -- scope
local vartumip = 0
local numsilur = 0
strctrl = '1101' -- default
vartumip = arxsomons ['outctl']
if (type(vartumip)=='string') then
if (vartumip=='nw') then -- alias
vartumip = '0010'
end--if
if (vartumip=='0000') then
boocrap = true
else
numsilur = lfivaliumdctlstr ('1131',vartumip) -- 255 is OK
if (numsilur==255) then
strctrl = vartumip
else
boocrap = true
end--if
end--if
end--if (type(vartumip)=='string') then
end--do scope
end--if (not boocrap) then
---- SEIZE AND CHECK NAMED AND OPTIONAL PARAMETER WITH BOOLEAN ----
if (not boocrap) then
vartamp = arxsomons ['empsil']
if (type(vartamp)=='string') then
if (vartamp=='1') then
booempsil = true
else
boocrap = true
end--if
end--if
end--if
---- EMPTINESS ----
if ((not boocrap) and (numinctx==0)) then
if (booempsil) then
strctrl = '0000' -- empty input switches type to silly "0000"
else
strctrl = '1000' -- empty input switches type to "1000"
end--if
end--if
---- PROCESS CONTROL STRING TO BOOLEANS ----
if (not boocrap) then
numtymp = string.byte(strctrl,1,1)
boooktblo = (numtymp==49) -- show octet bloat
numtymp = string.byte(strctrl,2,2)
boobigbox = (numtymp==49) -- big boxes mode
numtymp = string.byte(strctrl,3,3) -- subtypes of hard nowiki mode
boohardnw = (numtymp>=49) -- "true" from "1" or "2" or "3"
boohnwcol = (numtymp>=50) -- "true" from "2" or "3"
boohnwspt = (numtymp==51) -- "true" from "3" only
numtymp = string.byte(strctrl,4,4)
booutfblo = (numtymp==49) -- show UTF8 char bloat
end--if
---- WHINE IF YOU MUST ----
-- note that reporting of this error may NOT depend on uncommentable strings
if (boocrap) then
strmytemp = 'FATAL in "utf8debug" : internal error or invalid parameter'
strret = constrlaxhu .. constrelabg .. strmytemp .. constrelaen .. constrlaxhu
end--if
---- SHOW OCTET BLOAT ----
-- empty input switches type to "1000" ie only "boooktblo" is
-- true, or to "0000" (invalid from caller)
if ((not boocrap) and boooktblo) then
if (numinctx==0) then
numtymp = 5 -- red on empty string (only 5 or 8 here)
strmytemp = constrkosong
else
numtymp = 8 -- light blue (only 5 or 8 here)
strmytemp = "number of<br>octet:s : " .. lfnumtodecbun(numinctx)
end--if
strret = constrtabu3 .. contabwar8na [numtymp] .. constrtabu4 .. strmytemp .. constrtabu5
boopendlf = true -- the earliest one, "boopendlf" not assigned above
end--if
---- PROCESS UTF8 AND GENERATE BIG BOXES ----
-- incoming "strinctx" and "numinctx"
-- we brew a private HTML table with just one cell for every single char
-- this runs both for "boobigbox" (use generated string) and for "booutfblo"
-- alone (discard generated string, only "numchrlen" is wanted then)
numchrlen = 0 -- counts UTF8 char:s, pass to below
if ((not boocrap) and (boobigbox or booutfblo)) then
do -- scope
local varkop = 0 -- entry from "contabcodepoints" (table or "nil")
local strchname = '' -- name of char if known
local strchkolr = '' -- color letter "R" or "Y" or "L" if any
local strsngchar = '' -- one char (raw octets), shown inside "span" background
local strchrblok = '' -- prebrewed block with table for one char
local strbunch = '' -- full report with big boxes
local numindx = 0 -- counts octet:s
local numreserv = 0 -- octet:s left from "numindx" on
local numutfone = 0 -- length of ONE UTF8 char
local numdecode = 0 -- decoded "codepoint" value
local numoct = 0 -- 1st octet of char
local numodt = 0 -- 2nd octet or ZERO
local numoet = 0 -- 3rd octet or ZERO
local numoft = 0 -- 4th octet or ZERO
local numwarna = 0 -- color index into "contabwar8na"
while true do
if (numindx>=numinctx) then
break
end--if
numreserv = numinctx - numindx -- at least 1
numoct = string.byte (strinctx,(numindx+1),(numindx+1))
numodt = 0
numoet = 0
numoft = 0
if (numreserv>=2) then
numodt = string.byte (strinctx,(numindx+2),(numindx+2))
end--if
if (numreserv>=3) then
numoet = string.byte (strinctx,(numindx+3),(numindx+3))
end--if
if (numreserv>=4) then
numoft = string.byte (strinctx,(numindx+4),(numindx+4))
end--if
tabutf8dec = lfutf8deko (numoct,numodt,numoet,numoft)
numutfone = tabutf8dec [0] -- ZERO invalid or 1...4
if (numutfone==0) then
numdecode = -1 -- pseudo codepoint for invalid sequence
else
numdecode = tabutf8dec [1] -- have valid codepoint
end--if
varkop = contabcodepoints [numdecode] -- risk for type "nil"
strchname = ''
strchkolr = '' -- "R" or "Y" or "L"
if (type(varkop)=='table') then
strchname = varkop[1] or ''
strchkolr = varkop[2] or ''
end--if
numwarna = numutfone -- presume, ZERO invalid or 1...4
if (strchkolr=='R') then
numwarna = 5 -- red on code ZERO or invalid sequence
end--if
if (strchkolr=='Y') then
numwarna = 6 -- yellow on TAB CR NBSP ZWSP LRM RLM BOM
end--if
if (strchkolr=='L') then
numwarna = 7 -- light yellow on LF SPACE
end--if
strchrblok = constrtabu3 .. contabwar8na [numwarna] .. constrtabu4 .. "<small>index</small> " .. lfnumtodecbun(numindx)
strchrblok = strchrblok .. "<br><small>beg code</small> " .. lfhexdec (numoct)
if (numutfone==0) then
strchrblok = strchrblok .. "<br>" .. constrinvalid -- color already set above
else
strchrblok = strchrblok .. "<br><small>length</small> " .. tostring (numutfone)
strsngchar = string.char (numoct) -- maybe we will need it
if (numutfone>=2) then
strchrblok = strchrblok .. "<br><small>extra</small> $" .. lfnuint8tohex (numodt)
strsngchar = strsngchar .. string.char (numodt)
if (numutfone>=3) then
strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoet)
strsngchar = strsngchar .. string.char (numoet)
end--if
if (numutfone==4) then
strchrblok = strchrblok .. ",$" .. lfnuint8tohex (numoft)
strsngchar = strsngchar .. string.char (numoft)
end--if
strchrblok = strchrblok .. "<br><small>codepoint</small> U+$" .. lfuint32tohex (numdecode)
strchrblok = strchrblok .. "<br><small>dec</small> #" .. lfnumtodecbun(numdecode)
end--if (numutfone>=2) then
if (strchname~='') then
strchrblok = strchrblok .. "<br>" .. strchname -- known by name
else
strchrblok = strchrblok .. "<br>" .. constrbkg3 -- begin char background
if (numutfone==1) then
strchrblok = strchrblok .. "&#" .. tostring (numoct) .. ";" -- dec-encode, "strsngchar" is not used here
else
strchrblok = strchrblok .. strsngchar -- let wiki software & browser bother
end--if
strchrblok = strchrblok .. constrbkg4 -- close char background
end--if
end--if (numutfone==0) else
strchrblok = strchrblok .. constrtabu5 -- close table
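if (numutfone==0) then
numindx = numindx + 1 -- invalid sequence: skip one octet, otherwise the loop would never end
else
numindx = numindx + numutfone -- ZERO-based index
end--if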
numchrlen = numchrlen + 1 -- invalid char:s do count too, the big prey
if (boobigbox) then
strbunch = strbunch .. strchrblok -- later use or discard
end--if
end--while
if (boobigbox) then -- else just discard it ;-)
if (boopendlf) then
strret = strret .. "<br>"
end--if
strret = strret .. strbunch
boopendlf = true
end--if
end--do scope
end--if ((not boocrap) and (boobigbox or booutfblo)) then
---- HARD NOWIKI ----
-- incoming "strinctx" and "numinctx"
-- boohardnw "true" from "1" "2" "3" -- do "hard nowiki"
-- boohnwcol "true" from "2" "3" -- requested color
-- boohnwspt "true" from "3" only -- split UTF8
-- restrict the width to 100 char:s (HTML parser breaks on spaces and some
-- other chars, but unreasonably long words cause trouble, we break at 100)
if ((not boocrap) and boohardnw) then
if (boopendlf) then
strret = strret .. "<br>"
end--if
strret = strret .. "<big>" .. lfiultencode (strinctx,100,boohnwcol,boohnwspt) .. "</big>"
boopendlf = true
end--if
---- UTF8 BLOAT ----
-- incoming "numchrlen" cannot be ZERO if "booutfblo" is "true"
if ((not boocrap) and booutfblo) then
if (boopendlf) then
strret = strret .. "<br>" -- the last one, "boopendlf" not needed below
end--if
strmytemp = "number of UTF8<br>char:s : " .. lfnumtodecbun(numchrlen)
strret = strret .. constrtabu3 .. contabwar8na [8] .. constrtabu4 .. strmytemp .. constrtabu5
end--if
---- RETURN THE JUNK STRING ----
return strret
end--function
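---- ILLUSTRATION ONLY: SKETCH OF A FOUR-OCTET UTF8 DECODER ----
-- The loop above relies on "lfutf8deko" returning a table with
-- [0] = length (ZERO invalid or 1...4) and [1] = codepoint. The sketch
-- below shows one possible way to meet that contract. It is NOT the real
-- "lfutf8deko", and it is an assumption that the real one rejects the same
-- sequences. The names "lfsketchutf8" and "lfiscont" are hypothetical, the
-- function is private to the "do" block and is never called.
do -- scope
local function lfsketchutf8 (numoct, numodt, numoet, numoft)
local numlen = 0 -- length of char, ZERO means invalid
local numcode = 0 -- decoded codepoint
local function lfiscont (numx) -- continuation octet is $80...$BF
return ((numx>=128) and (numx<=191))
end--function
if (numoct<128) then
numlen = 1
numcode = numoct
elseif ((numoct>=194) and (numoct<=223) and lfiscont(numodt)) then
numlen = 2
numcode = (numoct-192)*64 + (numodt-128)
elseif ((numoct>=224) and (numoct<=239) and lfiscont(numodt) and lfiscont(numoet)) then
numlen = 3
numcode = (numoct-224)*4096 + (numodt-128)*64 + (numoet-128)
elseif ((numoct>=240) and (numoct<=244) and lfiscont(numodt) and lfiscont(numoet) and lfiscont(numoft)) then
numlen = 4
numcode = (numoct-240)*262144 + (numodt-128)*4096 + (numoet-128)*64 + (numoft-128)
end--if
-- reject overlong forms, surrogates and values above U+10FFFF
if ((numlen==3) and ((numcode<2048) or ((numcode>=55296) and (numcode<=57343)))) then
numlen = 0
end--if
if ((numlen==4) and ((numcode<65536) or (numcode>1114111))) then
numlen = 0
end--if
return {[0]=numlen,[1]=numcode}
end--function
-- usage: lfsketchutf8 (195,164,0,0) gives length 2 and codepoint 228 (U+00E4),
-- missing octets are passed as ZERO just like in the loop above
end--do scope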
---- RETURN THE JUNK LUA TABLE ----
return exporttable