
first divider regexp
  use beginning or end of regexp match?
  extra match constraints
  handling of first element (null or skip if empty or white)


find-type-name-start-regexp + beginning-or-end (relative to start?)
find-body-start-regexp + beginning-or-end  (relative to start?)

also need a way to get the name from style frobs

subsequent dividers regexp
  regexp + beginning-or-end
  extra constraints, referring to last divider position constraints

skip-trailers-regexp + beginning-or-end relative to end of body



  constraints:
    the section header constraint

    the style frobs constraints

    the paragraph separator constraint

    the null constraint




also need an automatic way to identify bodies for subnodes
  when they are just a substring of the region for them.


outline
        first_divider = beginning of regexp(section_header)
        next_divider = matched beginning of regexp(section_header)

	"matched" means same frob character?
                        and same or less length
                        and same or less indentation


dash3-thingie
	the same except without the additional restrictions
          on "matched"

lines
        first_divider end of regexp (this_line)
        next_divider end of regexp (this_line)
        no additional restrictions on "matched"


literal-text
        there is no first_divider --- literal text has
        no subnodes


section
	first_divider beginning of regexp (blank_line)
        next_divider = <none>
        (so, always two subnodes?)



styled-text
        context makes this harder

        first_divider = beginning of regexp (opener_frob)
	   but "matched" means satisfying context constraints

        next_divider = match _end_ of a regexp(closer_frob)
           but "matched" means satisfying context and length
             requirements (including those determined by the
             full extent of the earlier match).

        atomic frobs (e.g., `{ref}') just match the end of the
          divider


        can styled text like that handle inline displays?
          (fairly trivially, no?  the "same-indent" constraint
            for a ]] terminator can be generalized to "same-indent"
            or "end-of-same-line")


---- so far, then, each is just two regexps plus one of a very
     small number of (highly idiosyncratic) "context constraints",
     where those constraints can refer to results of the previous
     match

essay
    start is beginning of any non-blank line

    next is beginning of a blank line, a || line or a ~~ line
      must match: || or ~~ same or less indented


    that's close but misses the case where a paragraph starts 
      with [[

    start is beginning of any non-blank line

    next is beginning of a blank line, a || line, a ~~ line,
        or non-blank line

      matching rule:
        any suitably indented line terminates [[

	~~ on a line by itself terminates a similarly indented 
          || paragraph


        ||, suitably indented, terminates an earlier ||.

        
     before next iteration, skip over any ~~ or ]] line.








    is a list of paragraphs

    a paragraph can be just a solitary display

    or it can be a bunch of styled text, possibly with
    explicit line breaks, and possibly containing nested
    displays

    this makes really text-prejudiced trees for 
      nested displays, though:

	my-disp
          paragraph
            sub-display-1
          paragraph
            sub-display-2
          ...

    from parsing 

         [[my-disp
            [[sub-display-1
               ...
            ]]
            [[sub-display-2
               ...
            ]]
            ...
          ]]

    as an essay.    some way is also desirable to parse it getting

	my-disp
            sub-display-1
            sub-display-2
          ...


    ||(type) dfljksdkf
    ~~

    [[my-disp
       ||(sub-display-1) 
          ...

       ||(sub-display-2)
          ...
    ]]


     first_divider = beginning of regexp (non-blank-line)
		thus, the first section will be entirely 
                  blank (and should not generate a subnode)

     next_divider = if the last divider match wasn't for ||
                       beginning of regexp (blank_line)
                    otherwise
                       beginning of match for ~~ or ||
	  the catch is that if ~~ is matched, then 
          the beginning of the match is the region end for
          the subnode, but end of that match (next line)
          is the place to start the next iteration.
          on the other hand, if || is matched, then the
          beginning of that line is both the end of the 
          subnode region and the beginning for the next
          iteration.


 



beginning of regexp (blank-line|-- line| || line)
        with a complicated context constraint
          if the previous match matched ||, then blank line matches
            don't count.    

          blank line must not be within a display
          


        finding a paragraph separator is not regular because 
          of displays (skipping over a display requires examining
          indentation).

        
rethinking paragraphs

   blank lines separate paragraphs --except--

   || begins a new paragraph which must be explicitly
     terminated.

   it's terminated by a similarly indented || or solitary --


  	|| yadda yadda
        blah blah blah

        [[inline-display
           ....
        ]]

        more yadda he haw
        foo bar
        --

	normal paragraph
        which ends right
        here

        next normal paragraph

	||yadda yadda

        ||next yadda yadda
        --
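
A minimal sketch of the rethought paragraph rule, in Python (not part
of the original notes).  The name `split_paragraphs' is hypothetical.
The sketch deliberately ignores the "similarly indented" requirement:
any `||' line ends an open `||' paragraph and any solitary `--' line
closes one.  Separator lines (blank lines, `--') are dropped rather
than kept in the output.

```python
def split_paragraphs(text):
    """Split text into paragraphs.

    Blank lines separate paragraphs, except inside a paragraph opened
    by a line starting with '||', which runs until the next '||' line
    or a solitary '--' line.
    """
    paragraphs = []
    current = []
    explicit = False  # True while inside a '||' paragraph

    def flush():
        nonlocal current
        if any(line.strip() for line in current):
            paragraphs.append("\n".join(current))
        current = []

    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith("||"):
            flush()                 # '||' terminates any earlier paragraph
            explicit = True
            current.append(line)
        elif explicit and stripped == "--":
            flush()                 # solitary '--' closes a '||' paragraph
            explicit = False
        elif not explicit and not stripped:
            flush()                 # blank line ends a normal paragraph
        else:
            current.append(line)    # includes blank lines inside '||'
    flush()
    return paragraphs
```

Because blank lines are kept while a `||' paragraph is open, an inline
display surrounded by blank lines stays inside its paragraph.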













starting regexp, ending regexp w/ indentation and depth adjustment,
trailer lines adjustment.

  except that, in this stuff, ^[[:blank:]] should
   be ^([.,[:blank:]])

  outline:
    start		/^([*+]+.|[[:blank:]](\*+|\++)\.)/
    matching-end        change leading spaces to indent regexp
                          change last char to space
                          escape
                          change last char to [^X] where X is
                           the second-to-last char
    recurse just for additional sections

  styled_text
    start 		... big regexp of open frobs ...
	(what about word context??)
        need an inverse regexp --- the end of the match
          is the end of the lead rather than the 
          beginning
    match-end	maybe substitute character (' for `) and
                escape
                context can matter, again.
    full tail recurse

  lines
    start		^
    match-end		newline
    either recurse works

  dash3-thingie
    start		/^[[:blank:]]---+(:ID)?[[:blank:]]*.*$/
    match-end           same regexp as start
    half recurse

  essay
    start		/NOT([...])\n\n/
      the lead section ends at the beginning of the 
        last line of a match

    that's not right.  doesn't handle displays properly.

    pattern is like (display|text-graph) ([...] (display|text-graph))*
    but DISPLAY is not a regular language




   is there a regexp for a paragraph break?


   how about ", " as the first non-blank?

   This is a paragraph
   blah blah blah

,  [[inline-thing
   	.sdlfkjsdlf 
   ]]

,  is foo.


   and ". " for explicit breaks:
    

   This is a paragraph
   blah blah blah
.
.  [[inline-thing
   	.sdlfkjsdlf 
   ]]
.
.  is foo.

   but regexps can't hack the inlined displays

   also: single line displays





indented line regexps:

  eight column indent, beginning from a 
    tab stop:

  /(        |( ? ? ? ? ? ? ?\t))/

 given a line:

    |     ***. some section |

 you can produce a regexp:

    /$SIMILAR_INDENTATION\*\*\*[^*]/
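
One way to generate such a regexp, sketched in Python (an
illustration, not the notes' own code).  The helper names are
hypothetical.  The sketch assumes tab stops every eight columns and
builds a pattern for exactly the same indentation and exactly the same
marker length; the looser "same or less" rule for outlines is not
attempted here.

```python
import re

# One eight-column indent starting at a tab stop: eight spaces, or up
# to seven spaces followed by a tab.
TAB_STOP = r"(?: {8}| {0,7}\t)"


def indent_columns(line):
    """Width, in columns, of the leading whitespace of line."""
    leading = line[:len(line) - len(line.lstrip(" \t"))]
    return len(leading.expandtabs(8))


def similar_indentation(columns):
    """Regexp source matching whitespace that is `columns` wide."""
    full, rest = divmod(columns, 8)
    return TAB_STOP * full + " {%d}" % rest


def matching_header_regexp(header_line):
    """Regexp for headers with the same indent, marker char and length."""
    body = header_line.lstrip(" \t")
    marker = body[0]                                # e.g. '*' or '+'
    marker_len = len(body) - len(body.lstrip(marker))
    return re.compile(
        "^"
        + similar_indentation(indent_columns(header_line))
        + re.escape(marker) * marker_len
        + "[^" + re.escape(marker) + "]"
    )
```

For the sample line above, the generated pattern is five spaces, three
escaped stars, and a final non-star character.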



parser outline (lead_type, section_type)
{
  first_section_start =
    /^([*+]|[[:blank:]](\*+|\++)\.)/;

  end_first_section = 
    /^$SIMILAR_INDENTATION ...same length of same character ...[^samechar]/

  type_start regexp
  type_end regexp
  body_start regexp
  body_end regexp
}

parser styled_text (lead_type, section_type)
{
  first_section_start =
    ,\<word-boundary>([/`'...]),

  end_first_section = 
    corresponding regexps of same length

  type_start regexp
  type_end regexp
  body_start regexp
  body_end regexp
}






section header regexp

  to parse 

     sdlfkjsdklfj
     * sdflj
     ...


  search for a section header regexp
  deal with lead
  search for a substrlength/column-constrained comparable section header regexp
  search from start to start of type name, if any
  search past type name or marker
  subparse


--------

  styled text

  

  
	


depth by length of header marker

~ <title-
   part>

  <body>

*#:<type> <title-
          part>

  <body>

+.:<type> <title-
           part>

  <body>



 depth by indentation

 [[type <title-
         part>

	<body>
 ]]





Typical frob:

	```hello world'''
          ^     ^
          |     body
          empty title part


Hypertext frobs:


	{{{"......" -- foo}}}
	      ^		^
              |       body
           title part














parsing algorithm

  pos = start_of_first_explicit_subnode ();

  if (start .. pos is not all whitespace or there is a mandatory lead)
    parse (subnode[0], lead, start, pos);


  while (pos is not at end)
   {
     type = assert_or_extract_type (pos);
     section_end = find_explicit_subnode_end (type, pos);
     set_pos (subnode[++n]) = <pos, section_end>;
     set_type (subnode[n]) = type;
     .... what subrange of that string actually gets parsed? ....
     parse (subnode[n], type, pos, section_end);
     pos = section_end;
   }
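
The loop above can be sketched as runnable Python.  This is only an
illustration under stated assumptions: the three hooks are passed in
as plain functions, nodes are recorded as (type, start, end) tuples
instead of being parsed recursively, and the demo hooks recognise only
column-0 `* ' headers.  All names are hypothetical.

```python
import re


def dissect(text, first_start, section_end, section_type, lead_type="lead"):
    """Return (type, start, end) tuples for the lead and each section."""
    nodes = []
    pos = first_start(text)
    if text[:pos].strip():              # lead is not all whitespace
        nodes.append((lead_type, 0, pos))
    while pos < len(text):
        node_type = section_type(text, pos)
        end = section_end(text, pos)    # must be greater than pos
        nodes.append((node_type, pos, end))
        pos = end
    return nodes


# Demo hooks (assumption): top-level headers are "* " at column 0.
HEADER = re.compile(r"^\* ", re.M)


def first_header(text):
    m = HEADER.search(text)
    return m.start() if m else len(text)


def next_header(text, pos):
    # Start one character past pos so the current header is skipped.
    m = HEADER.search(text, pos + 1)
    return m.start() if m else len(text)


def always_section(text, pos):
    return "section"
```

The open question in the pseudocode (which subrange of a section is
actually parsed) is left open here too: each tuple covers the whole
section, header line included.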


----------------------------


  outlines:
    start_of_first_explicit_subnode is
     a line matching /^(\*+|[[:blank:]]*\*+\.)/

   find_explicit_subnode_end is a line matching that
     is not more indented
     does not have more '*' characters at front

   (generalize both to permit '+')

   extract_type is the identifier starting after /^[[:blank:]]*\*+\.?:/
     or else use the default


---------------------

  dash3-thingie
    start_
	"^[[:blank:]]*---+(:{ID})?[[:blank:]]\n"
    _end
        same as start or end of input

---------------------

  essay
    start_
      first non-blank line
    end
      if first line is a display start, then by indentation
        and maybe matching ]]
      if first line is a section start, then until the first
        same-indented thing which is not a section start or
        the first less-indented thing of any sort
      if the first line is text (i.e., otherwise)
        then scan til a blank line.   if preceded by
        [...] then skip blanks and keep scanning as if starting
          a new paragraph
        else 
          if followed by space and [...] then goto just past the [...]
           if looking at blanks and eol, goto next line, and again,
           continue as if starting a new paragraph

---------------------

  styled text
    start
      first frob start (regexp + syntax context)
    end
      closing frob (regexp + syntax context + length of match)
   
---------------------
  lines
    start: newline
    end: newline or end

 to break a string into
  optional front goo then 
  "sections":

     outline
     paragraph
     dash-line
     lines
     list-outline
     styled-text-list  


useful text assignments are always
       optional special case for first optional element
       either fixed or parsed-from type for "sections"




  

* Fully Table Driven Wiki Parsing

  Here is a technique for implementing a table-driven wiki parser.

  The resulting parser is easy to extend by adding new parsing
  primitives.

  The resulting collection of primitives is easy to configure 
  into a wide range of customized syntaxes.


** The Class of Wiki Parsing Primitives

  A *wiki parsing primitive* is a program which implements a function.
  The function takes as its input a string and a list of symbols.  It
  produces as its output a partition of that string into substrings,
  each substring labeled with one of the symbols passed as a parameter.

  So, what does that mean?

  Consider a string of text, like this:

  [[tty
     "A *famous* program prints `hello world' when run. "
  ]]

  And here is a list of symbols:

  [[tty
    plain
    code
    stress
    ...
  ]]

  I might pass that string and list of symbols to a wiki parsing
  primitive and it could return a partition of the string into
  five parts, labeled as illustrated:

  [[tty
     plain:             "A "
     stress:            "*famous*"
     plain:             " program prints "
     code:              "`hello world'"
     plain:             " when run. "
  ]]

  Any function I write to do something like that is, by definition,
  a *wiki parsing primitive*.   The only restriction is that 
  the substrings in the partition have to "add up to" the original
  string, and the symbol labels returned must all be chosen from
  the ones passed as parameters.
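
A minimal sketch of one such primitive in Python, to make the contract
concrete.  The name `styled_text' is borrowed from the tables later in
these notes, but this implementation is only an assumption: it knows
two frobs (`*...*' and the backquote-quote pair), does not handle
nesting or context, and checks both restrictions with assertions.

```python
import re

# Only two frobs are recognised in this sketch.
FROB = re.compile(r"(?P<stress>\*[^*]+\*)|(?P<code>`[^']*')")


def styled_text(text, symbols):
    """Partition text into (label, substring) pairs."""
    parts = []
    pos = 0
    for m in FROB.finditer(text):
        if m.lastgroup not in symbols:
            continue                    # leave it inside the plain text
        if m.start() > pos:
            parts.append(("plain", text[pos:m.start()]))
        parts.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(text):
        parts.append(("plain", text[pos:]))
    # The two restrictions on a wiki parsing primitive:
    assert "".join(s for _, s in parts) == text
    assert all(label in symbols for label, _ in parts)
    return parts
```

Calling it with a symbol list that omits `stress' simply leaves the
starred text inside a `plain' part, since labels may only come from
the list passed in.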


** Wiki Parser Tables

  Every wiki parsing primitive has a symbolic name.  That name
  is useful in a wiki parser definition, such as this example:


  [[tty

  	outline:		lead_then_sections (essay, 
                                                    outline_section);


	essay:			list_of_paragraphs (paragraph);

	outline_section:	outline_title_and_body (section_title,
                                                        outline);

        section_title:		styled_lines (main_title, subtitle);

        paragraph:		styled_text (plain, stress, emph,
					     code, link, ...);
  ]]


  That table defines *named parsing rules*.  The names are in the left
  column and the definition of each is in the right column.

  The "functions" named in the function calls on the right are the
  names of wiki parsing primitives.   The "arguments" to those
  functions are lists of symbol names.

  Thus, to parse a string as an `outline', one would first call the
  wiki primitive `lead_then_sections'.   It will partition the
  input string into a "lead section" and then the various
  top-level "outline sections".   It will label the lead-section
  substring `essay' and the other substrings `outline_section'.

  Then, recursively, programs can parse those components.   To parse
  the first substring of an `outline' as an `essay', the primitive
  `list_of_paragraphs' is called, with the singleton list of 
  symbols `paragraph'.

  Recursion stops at nodes that wind up with labels for which no
  rule is defined.
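
The recursion can be sketched in Python as a dictionary from rule
names to (primitive, symbols) pairs.  This is an illustration, not the
awiki implementation: the two primitives are simplified stand-ins, the
table holds only two rules, and nodes are plain dictionaries.

```python
import re

FROB = re.compile(r"(?P<stress>\*[^*]+\*)|(?P<code>`[^']*')")


def list_of_paragraphs(text, symbols):
    """Split at runs of blank lines; each chunk keeps its trailing blanks."""
    chunks = re.findall(r".+?(?:\n\n+|\Z)", text, re.S)
    return [(symbols[0], chunk) for chunk in chunks]


def styled_text(text, symbols):
    """Split into plain text and (non-nested) stress and code frobs."""
    parts, pos = [], 0
    for m in FROB.finditer(text):
        if m.lastgroup not in symbols:
            continue
        if m.start() > pos:
            parts.append(("plain", text[pos:m.start()]))
        parts.append((m.lastgroup, m.group()))
        pos = m.end()
    if pos < len(text):
        parts.append(("plain", text[pos:]))
    return parts


# rule name -> (primitive, symbols)
TABLE = {
    "essay": (list_of_paragraphs, ["paragraph"]),
    "paragraph": (styled_text, ["plain", "stress", "code"]),
}


def parse(label, text, table):
    """Build a tree; recursion stops at labels with no rule."""
    node = {"label": label, "text": text, "children": []}
    rule = table.get(label)
    if rule is not None:
        primitive, symbols = rule
        for sub_label, sub_text in primitive(text, symbols):
            node["children"].append(parse(sub_label, sub_text, table))
    return node
```

Changing the syntax then means editing `TABLE' rather than the driver:
`parse' never mentions any particular rule name.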


** Wiki Trees

  Recursive execution of a parsing table can give rise, in 
  a natural way, to a tree.   Nodes of the tree are labeled
  with symbols from parsing rules, and with the substring
  they refer to:

  [[tty

    input string:

 	"I said *Wow it's a `hello world' program*!"


    top-level (non-recursive) parse:

	styled-text:  "I said *Wow it's a `hello world' program*!"
	  plain-text: "I said "
	  stress:     "*Wow it's a `hello world' program*"
	  plain-text: "!"


    full (recursive) parse:

	styled-text:  "I said *Wow it's a `hello world' program*!"
	  plain-text: "I said "
	  stress:     "*Wow it's a `hello world' program*"
	      plain-text:    "Wow it's a "
	      code:	     "`hello world'"
	      plain-text:    " program"
	  plain-text: "!"

  ]]



what kind of parsers are needed? how can they be parameterized
(e.g., for "composite style macros" like ` */foo/* ' meaning "filename
 foo")


outline parser:
  find top-depth section headers, label each a section
  if there is stuff at the beginning, label that a lead

title parser:
  separate into a list of dash-separated things
  label the first one main-title
  label the others subtitle

abstract parser:
  split into a pair of dash-separated things
  label the first one abstract-title
  label the others essay


title parser and abstract parser are almost the same thing

they are exactly the same thing if abstracts can have multiple
"essay" nodes (and why not?)

both are also the same as an outline parser, to the extent that
finding a dash-line separator is the same kind of thing as finding
top-level section header lines

authors:
  split into lines
   each line is styled-text   (alt.: first is main-author, others coauthor, etc.)

main-title:: like authors
subtitle:: like authors
abstract-title:: like authors

authors is also like the title parser, abstract parser, and
outline parser.  it's "degenerate" in that the first
substring parsed is labeled the same as the rest of them.
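
That observation can be sketched as a single Python function
parameterized by a separator regexp and two labels.  The names and the
exact separator patterns are assumptions made for illustration.
Unlike a strict partition, this sketch drops the separators and any
whitespace-only pieces.

```python
import re

# Assumed separators: a line of three or more dashes, or a newline.
DASH_LINE = re.compile(r"^[ \t]*---+[ \t]*\n", re.M)
NEWLINE = re.compile(r"\n")


def split_first_and_rest(text, separator, first_type, rest_type):
    """Split text at separator; label the first piece differently."""
    pieces, pos = [], 0
    for m in separator.finditer(text):
        pieces.append(text[pos:m.start()])
        pos = m.end()
    pieces.append(text[pos:])
    pieces = [p for p in pieces if p.strip()]   # skip empty/white pieces
    types = [first_type] + [rest_type] * (len(pieces) - 1)
    return list(zip(types, pieces))


def title_parser(text):
    return split_first_and_rest(text, DASH_LINE, "main-title", "subtitle")


def abstract_parser(text):
    return split_first_and_rest(text, DASH_LINE, "abstract-title", "essay")


def authors_parser(text):
    # The "degenerate" case: first and rest get the same label.
    return split_first_and_rest(text, NEWLINE, "styled-text", "styled-text")
```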

essay:
  split into paragraphs
  all labeled paragraph


section:
  weird one -- find end of section title, label that
  styled-lines, 

  then treat the rest as per outline

section might not be that weird.  for an `outline', also, the first
separator is "special".

but, it gets weird again if '*(type)' is supported.

paragraph:
  ugh.

  find frob or frob-opener

  if there's stuff to the left, append a plain-text node

  find the frob extent

  get the frob type

  append the thus-typed node

  goto ugh






how does that compare to an outline?

  find any top-level section marker

  if there's stuff to the left, append a lead node

  find the next same-level section marker or end of input

  parse the `*(type)' or use `section'

  append the thus-typed node

  goto loop :-)




those steps:

  find the (possibly preceded by non-whitespace)
  separator:

	outline-style:

        style-frob: worst case is a GNU regexp, possibly
                    just a regexp


  







composite style macros
  */foo/*

  stress
      emph

  there are only a few characters that can be used that way

  longest opener is easy enough to match

  longest simple (non composite) opener is very easy to match

  closer is easy enough to find

  given that, rematch for longest composite 

 
reconsider outline indenting
  make it mandatory?
  not trivially so, because * and ** bodies are indented the same amount

  outline section is everything indented or
  containing a *^n which is longer

  *(type)
  parse the section as `type' rather than `section'

  finding outline things in indented blocks?


outline parser:
  scan for section header or end
  intervening stuff not just whitespace?
	yes: parse that as lead
  while (next section header or end)
    ...


leaf subsections?

  //list

  //

parsed as essays  

 
can an essay serve as a list?
essay is a bunch of paragraphs
list is a bunch of logical divisions
possibly with a generated part (e.g., item number)




* text style

** Add Quoting

  Figure out a suitable quoting mechanism for text style frobs.
  Perhaps

  [[tty

        Those pesky "Either _/ Or" Questions.

  ]]

  /Rationale for `_'/ is that it has no conventional meaning other
  than as an underline character and that use isn't likely to be
  easily confused with this one.

  /Rationale for quoting at all/ is that the only other choice is that
  there are some (logical) text strings that your text *must not*
  contain.   In other words, the alternative to having a quoting
  mechanism is to have a severe restriction on content.   


** Enumerate the Finite Supply of Text Frobs

  [[cartouche

    /Style Text Markup Frobs/

    [[tty


        syntax          repeated form   name
        ------          -------------   ----

        ^???^           ^^???^^         stress-a
        *???*           **???**         stress-b

        /???/           //???//         emph-a
        \???\           \\???\\         emph-b


        `___'           ``___''         code

        |???|           ||???||         user-a

        ~___~           ~~___~~         user-b

        <LINK>                          link
        {ANCHOR-REF}                    aref
        {*ANCHOR-DEF}                   adef

        --                              mdash

        [...]                           cont

        . (in column 0)                 break


    ]]

    *`???' indicates recursively nested styled text.

    *`___' indicates nested literal text.


  ]]


** Allow a User-defined Markup Mapping

  [[cartouche

    /Sample (Simple) Text Style Translation Table/

    [[tty

        markup type     html       html 
                        element    class
        -----------     -------    -----
        stress-a        i
        stress-b        i

        emph-a          b
        emph-b          b

        code            code

        user-a          div         index-entry
        user-b          code        user-name

    ]]
  ]]

    */etc./*


** Allow Composite Configurations


  E.g.:

  [[cartouche

    /Sample (Complex) Text Style Translation Table/

    [[tty


        markup type     html       html 
                        element    class
        -----------     -------    -----
        code            code
        stress-a+code   code       filename
        stress-b+code   code       program-identifier
        ...
    ]]
  ]]

  Implying translations:

  [[cartouche

    /Sample Table-driven Translations/

    *Note the context sensitive translations of composite
     markups like ``` /`main.c'/ '''

    [[tty

      Awiki Source:
        `hello world'
      Awiki Syntax Tree:
        (code "hello world")
      HTML:
        <code>hello world</code>


      Awiki Source:
        /`main.c'/
      Awiki Syntax Tree:
        (stress-a (code "main.c"))
      HTML:
        <code class="filename">main.c</code>


      Awiki Source:
        *`printf'*
      Awiki Syntax Tree:
        (stress-b (code "printf"))
      HTML:
        <code class="program-identifier">printf</code>
    ]]
  ]]
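
A sketch of how such a table could drive the translation, in Python.
It is an illustration only: syntax trees are modelled as nested tuples
mirroring the s-expressions above, the table merges the simple and
complex sample tables, and a composite key such as `stress-a+code' is
tried only when a node's sole child is itself a markup node.

```python
import html

# markup type -> (html element, html class or None)
TABLE = {
    "code": ("code", None),
    "stress-a": ("i", None),
    "stress-b": ("i", None),
    "stress-a+code": ("code", "filename"),
    "stress-b+code": ("code", "program-identifier"),
}


def to_html(node):
    """Translate a (type, child, ...) tuple or a plain string to HTML."""
    if isinstance(node, str):
        return html.escape(node)
    kind, *children = node
    # Composite lookup: collapse parent and only child into one element.
    if len(children) == 1 and not isinstance(children[0], str):
        composite = kind + "+" + children[0][0]
        if composite in TABLE:
            kind, children = composite, list(children[0][1:])
    element, css_class = TABLE[kind]
    attrs = ' class="%s"' % css_class if css_class else ""
    inner = "".join(to_html(child) for child in children)
    return "<%s%s>%s</%s>" % (element, attrs, inner, element)
```

When no composite entry exists, the node falls back to ordinary
nesting, so removing a composite row from the table degrades the
output gracefully instead of failing.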


* Parameterize the Essay and Outline Parsers

  E.g., one parameter to control `<p>' vs. some
  other markup for paragraphs.  Another to 
  say which parser to use for the contents
  of a paragraph.


* Displays as, Generally, Div

  Display kinds are divided up into:

  /literal content/ e.g., `tty' displays.

  /essay content/ e.g., `blockquote' displays.

  /nested outline content/ not currently supported because of
  questions about how to handle outlines with their section header
  lines indented.

  /dash-separated-content/ as in titles and abstracts.
  This one is tricky because it implies a combinatorics:
  each section of a dash-separated multi-part display
  must itself be recursively parsed in any of the N ways.

  /purely sub-display content?/  A display could be 
  expected to contain nothing but a list of sub-displays.
  This is necessary to make displays a general syntax
  for trees.

  Can that be expressed as a type system so that users
  can explicitly define the syntax of the contents of
  a display?

  
* Parser Types

  Parsers should always, reliably, consume all of their
  input.

  Some parsers yield a single node.

  Some yield multiple nodes.

  Parsers pair with scanners, sort of, but not very cleanly.

  Scanning styled text for frobs resembles scanning
  a dash-separated display.

  Those also resemble scanning a whole text for top-level
  outline sections.

  A dissector takes a string and returns a list of disjoint
  substrings whose append yields the original string.
  For each element of the list, the dissector returns the
  identifier name to pass to parse_for_type to parse
  that substring.


  general parsing algorithm

  [[tty

    parse_for_type (node, input)

        set the node type
        set its start/end to the entire input

        if (!recursive_type)
          return

        dissection = dissector_for_my_type ();

        for (x in dissection)
          append a new subnode
          subnode <- parse x.substring for type x.type


  ]]

  specifying dissectors:

  dissectors fall into generic classes, all of which identify the
  same substrings but which differ about subtypes.

  [[tty
    dissector = substring_partitioner + type_assignment_rule;
  ]]
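
Read literally, that equation is function composition.  A minimal
Python sketch, with hypothetical names throughout: a partitioner maps
a string to a list of pieces that append back to the string, and a
type assignment rule maps (index, piece) to a type name.

```python
def make_dissector(partitioner, type_rule):
    """Combine a substring partitioner with a type assignment rule."""
    def dissector(text):
        pieces = partitioner(text)
        # A dissection must append back to the original string.
        assert "".join(pieces) == text
        return [(type_rule(i, piece), piece)
                for i, piece in enumerate(pieces)]
    return dissector


def by_line(text):
    """Partitioner: one piece per line, newline kept with its line."""
    return text.splitlines(keepends=True)


def uniform(type_name):
    """Rule: every piece gets the same type."""
    def rule(index, piece):
        return type_name
    return rule


def first_then_rest(first_type, rest_type):
    """Rule: the first piece is special, the others share a type."""
    def rule(index, piece):
        return first_type if index == 0 else rest_type
    return rule


# Two dissectors sharing one partitioner, differing only in subtypes.
authors_plain = make_dissector(by_line, uniform("styled-text"))
authors_ranked = make_dissector(by_line,
                                first_then_rest("main-author", "coauthor"))
```

Both dissectors identify the same substrings and disagree only about
the labels, which is the sense in which dissectors fall into generic
classes.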

  [[

  Consider the way that the type assignment rule and substring
  partitioning interact in "abstract" nodes with their optional 
  title.



  So:

  [[cartouche

    /User Configurable Wiki Syntax/

    [[tty

      node_type:                dissector:
      ----------                ----------
      paper                     outline-structure
      title                     
      outline-section           outline-section-structure
      section-title             explicit-linebreaks-structure
      essay                     paragraph-structure
      



    ]]


  ]]
   

        
       





  outline needs a dissector that breaks up into lead and
  top-level sections, saying to parse the lead with `essay'
  and each section with `outline-section'

  essay needs a dissector that breaks up into paragraph regions
  (honoring the `[...]' joiner).   each to be parsed with
  `paragraph'

  literal displays don't need a dissector -- no children nodes.

  title: use a dash3 dissector, assigning types MAIN_TITLE SUBTITLE

  abstract: use a dash3 dissector, assigning types ABSTRACT_TITLE
    ABSTRACT_BODY_DIVISION

  authors: styled-lines    MAIN_AUTHOR , COAUTHOR*

  blockquote: essay
  



* Lists, Enumerations, Isolated Bullet Points, and Leaf-section
  Headers


; arch-tag: Tom Lord Fri Dec  3 14:11:23 2004 (awiki/cleanup)

