This manual page is for Mac OS X version 10.9.




htmlparse(n)                                     HTML Parser                                    htmlparse(n)



____________________________________________________________________________________________________________

NAME
       htmlparse - Procedures to parse HTML strings

SYNOPSIS
       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline  1.1

       package require htmlparse  ?1.2?

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

____________________________________________________________________________________________________________

DESCRIPTION
       The  htmlparse  package  provides  commands  that allow libraries and applications to parse HTML in a
       string into a representation of their choice.

       The following commands are available:

       ::htmlparse::parse ?-cmd cmd? ?-vroot tag? ?-split n? ?-incvar var? ?-queue q? html
              This command is the basic parser for HTML. It takes an HTML string, parses it  and  invokes  a
              command  prefix  for  every  tag encountered. It is not necessary for the HTML to be valid for
              this parser to function. It is the responsibility of the command  invoked  for  every  tag  to
              check  this.  Another  responsibility of the invoked command is the handling of tag attributes
              and character entities (escaped  characters).  The  parser  provides  the  un-interpreted  tag
              attributes  to  the  invoked command to aid in the former, and the package at large provides a
              helper command, ::htmlparse::mapEscapes, to aid in the handling of the latter. The parser
              ignores leading DOCTYPE declarations and all valid HTML comments it encounters.
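
              A minimal sketch of a call in normal mode, assuming tcllib is installed. The callback
              name ::recordTag and the variable ::events are hypothetical and not part of the package;
              the callback only records the four arguments the parser appends for each tag.

```tcl
package require htmlparse

# Hypothetical callback (not part of htmlparse). In normal mode the parser
# appends four arguments: tag, slash, param and textBehindTheTag.
proc ::recordTag {tag slash param text} {
    lappend ::events [list $tag $slash $param $text]
}

set ::events {}
::htmlparse::parse -cmd ::recordTag {<p>See <a href="x.html">this page</a></p>}

# Print one line per callback invocation. Closing tags carry a slash.
foreach event $::events {
    foreach {tag slash param text} $event break
    puts "tag=$slash$tag param=[list $param] text=[list $text]"
}
```

              The attributes arrive un-interpreted in param, so splitting them into names and values
              is left to the callback.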

              All information beyond the HTML string itself is specified via options, which are explained
              below.

              Some background information about the parser helps in understanding the options.

              It is capable of detecting incomplete tags in the HTML string given to it. Under normal
              circumstances this will cause the parser to throw an error, but if the option -incvar is used
              to specify a global (or namespace) variable, the parser will store the incomplete part of the
              input into this variable instead. This will aid greatly in the handling of incrementally
              arriving HTML, as the parser will handle whatever it can and defer the handling of the
              incomplete part until more data has arrived.

              Another feature of the parser is its two possible modes of operation. The normal mode is
              activated if the option -queue is not present on the command line invoking the parser. If it
              is present, the parser will go into the incremental mode instead.

              The main difference is that a parser in normal mode will immediately invoke the command prefix
              for each tag it encounters. In incremental mode however the parser will generate a  number  of
              scripts  which  invoke the command prefix for groups of tags in the HTML string and then store
              these scripts in the specified queue. It is then the  responsibility  of  the  caller  of  the
              parser to ensure the execution of the scripts in the queue.

              Note:  The  queue  object  given  to the parser has to provide the same interface as the queue
              defined in tcllib -> struct. This means, for example, that all queues created via that  tcllib
              module  can  be  immediately  used  here. Still, the queue doesn't have to come from tcllib ->
              struct as long as the same interface is provided.

              In both modes the parser will return an empty string to the caller.

              The -split option may be given to a parser in incremental mode to  specify  the  size  of  the
              groups  it  creates.  In  other  words, -split 5 means that each of the generated scripts will
              invoke the command prefix for 5 consecutive tags in the HTML string. A parser in  normal  mode
              will ignore this option and its value.

              The option -vroot specifies a virtual root tag. A parser in normal mode will invoke the
              command prefix for it immediately before and after it processes the tags in the HTML, thus
              simulating that the HTML string is enclosed in a <vroot> </vroot> combination. In incremental
              mode, however, the parser is unable to provide the closing virtual root as it never knows
              when the input is complete. In this case the first script generated by each invocation of the
              parser will contain an invocation of the command prefix for the virtual root as its first
              command.

              The following options are available:

              -cmd cmd
                     The command prefix to invoke for every tag in the HTML string. Defaults to
                     ::htmlparse::debugCallback.

              -vroot tag
                     The virtual root tag to add around the HTML in normal mode. In incremental mode  it  is
                     the first tag in each chunk processed by the parser, but there will be no closing tags.
                     Defaults to hmstart.

              -split n
                     The size of the groups produced by an incremental mode parser. Ignored when  in  normal
                     mode. Defaults to 10. Values <= 0 are not allowed.

              -incvar var
                     The name of the variable in which to store any incomplete HTML. This makes most sense
                     for the incremental mode. The parser will throw an error if it sees incomplete HTML and
                     has no place to store it. This makes sense for the normal mode. Only incomplete tags
                     are detected, not missing tags. Optional, defaults to 'no variable'.


              Interface to the command prefix
                     In normal mode the parser will invoke the command prefix with four arguments  appended.
                     See ::htmlparse::debugCallback for a description.

                     In incremental mode, however, the generated scripts will invoke the command prefix with
                     five arguments appended. The last four of these are the same as those mentioned above.
                     The first is a placeholder string (@win@) for a clientdata value to be supplied later,
                     during the actual execution of the generated scripts. This could be a tk window path,
                     for example. This allows the user of this package to preprocess HTML strings without
                     committing them to a specific window or object during parsing; this connection can be
                     made later. It also means that preprocessed HTML can be cached. Of course, nothing
                     prevents the user of the parser from replacing the placeholder with an empty string.
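
                     A sketch of incremental mode, assuming a queue from the tcllib struct::queue
                     package. The names ::onTag, ::seen, ::pending and the clientdata value .view are
                     hypothetical. Substituting the placeholder with string map is one possible
                     approach, not something the package prescribes.

```tcl
package require struct::queue
package require htmlparse

# Hypothetical callback: in incremental mode the clientdata value comes first,
# followed by the usual four arguments.
proc ::onTag {clientdata tag slash param text} {
    lappend ::seen [list $clientdata $tag $slash]
}

set ::seen {}
::struct::queue ::pending

# Parsing only fills the queue with scripts, each covering up to two tags.
::htmlparse::parse -cmd ::onTag -queue ::pending -split 2 {<p>one</p><p>two</p>}
set ::ranDuringParse [llength $::seen]

# Drain the queue: replace the @win@ placeholder, then run each script.
while {[::pending size] > 0} {
    set script [string map {@win@ .view} [::pending get]]
    uplevel #0 $script
}
::pending destroy

puts "callbacks during parse: $::ranDuringParse, after draining: [llength $::seen]"
```

                     Because the substitution is purely textual, the replacement value should be a
                     single word, and it will also replace any @win@ that occurs in the HTML text.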

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
              This  command  is  the  standard callback used by the parser in ::htmlparse::parse if none was
              specified by the user. It simply dumps its arguments to stdout.  This callback can be used for
              both  normal  and  incremental  mode of the calling parser. In other words, it accepts four or
              five arguments. The last four arguments are described below. The optional additional
              argument, clientdata, comes first and contains the value passed to the callback by a parser
              in incremental mode. All callbacks have to follow the signature of this command in the last
              four arguments, and callbacks used in incremental parsing have to follow this signature in
              all five arguments.

              The  first  argument, clientdata, is optional and present only if this command is invoked by a
              parser in incremental mode. It contains whatever the user of this package wishes.

              The second argument, tag, contains the name of the tag which is  currently  processed  by  the
              parser.

              The third argument, slash, is either empty or contains a slash character. It allows the
              callback to distinguish between opening (slash is empty) and closing tags (slash contains a
              slash character).

              The fourth argument, param, contains the un-interpreted list of parameters to the tag.

              The  fifth  and  last argument, textBehindTheTag, contains the text found by the parser behind
              the tag named in tag.

       ::htmlparse::mapEscapes html
              This command takes an HTML string, substitutes all escape sequences with their actual
              characters and then returns the resulting string. HTML strings which do not contain escape
              sequences are returned unchanged.
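
              A short sketch of the substitution, assuming tcllib is installed. The sample strings are
              made up for illustration.

```tcl
package require htmlparse

# Named entities such as &amp; and &lt; are replaced by their characters.
set raw {Fish &amp; chips &lt; 5 pounds}
set decoded [::htmlparse::mapEscapes $raw]
puts $decoded

# A string without escape sequences comes back unchanged.
set plain [::htmlparse::mapEscapes {no escapes here}]
puts $plain
```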

       ::htmlparse::2tree html tree
              This command is a wrapper around ::htmlparse::parse which takes an HTML string (in  html)  and
              converts  it  into a tree containing the logical structure of the parsed document. The name of
              the tree is given to the command as its second argument (tree). The command does not  generate
              the tree by itself but expects that the caller provided it with an existing and empty tree. It
              also expects that the specified tree object follows the same interface as the tree  object  in
              tcllib  ->  struct.  It doesn't have to be from tcllib -> struct, but it must provide the same
              interface.

              The internal callback does some basic checking of HTML validity and tries to recover from  the
              most  basic  errors. The command returns the contents of its second argument. Side effects are
              the creation and manipulation of a tree object.

              Each node in the generated tree represents one tag in the input. The name of the tag is
              stored in the attribute type of the node. Any html attributes coming with the tag are stored
              unmodified in the attribute data of the node. In other words, the command does not parse html
              attributes into their names and values.

              If a tag contains text, its node will have children of type PCDATA containing this text. The
              text will be stored in the attribute data of these children.
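
              A sketch of building and inspecting such a tree. It assumes the version 2 interface of
              the tcllib struct::tree package (get, keyexists, children, rootname). The helper
              ::describe and the variable ::lines are hypothetical.

```tcl
package require struct::tree
package require htmlparse

# The caller has to supply an existing, empty tree.
set t [::struct::tree]
::htmlparse::2tree {<html><body><p class="intro">Hello</p></body></html>} $t

# Hypothetical helper: depth-first listing of each node's type and, where
# present, its data attribute (raw tag attributes, or the text for PCDATA).
proc ::describe {t node depth} {
    set line "[string repeat {  } $depth][$t get $node type]"
    if {[$t keyexists $node data]} {
        append line ": " [$t get $node data]
    }
    lappend ::lines $line
    foreach child [$t children $node] {
        ::describe $t $child [expr {$depth + 1}]
    }
}

set ::lines {}
::describe $t [$t rootname] 0
puts [join $::lines \n]
$t destroy
```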

       ::htmlparse::removeVisualFluff tree
              This command walks a tree as generated by ::htmlparse::2tree and removes all the  nodes  which
              represent  visual tags and not structural ones. The purpose of the command is to make the tree
              easier to navigate without getting bogged down in  visual  information  not  relevant  to  the
              search. Its only argument is the name of the tree to cut down.
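
              A sketch comparing the node types before and after the cleanup, assuming the version 2
              interface of struct::tree and that b and font count as visual tags. The helper
              ::nodeTypes is hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: depth-first list of the type attribute of every node.
proc ::nodeTypes {t node} {
    set types [list [$t get $node type]]
    foreach child [$t children $node] {
        foreach type [::nodeTypes $t $child] { lappend types $type }
    }
    return $types
}

set t [::struct::tree]
::htmlparse::2tree {<p>Plain <b>bold</b> and <font color="red">red</font> text</p>} $t

set ::before [::nodeTypes $t [$t rootname]]
::htmlparse::removeVisualFluff $t
set ::after [::nodeTypes $t [$t rootname]]

puts "before: $::before"
puts "after:  $::after"
$t destroy
```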

       ::htmlparse::removeFormDefs tree
              Like ::htmlparse::removeVisualFluff, this command cuts down the size of the tree generated
              by ::htmlparse::2tree. It removes all nodes representing forms and form elements. Its only
              argument is the name of the tree to cut down.
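
              A sketch of the effect on a small document, assuming the version 2 interface of
              struct::tree. The helper ::nodeTypes and the sample HTML are hypothetical.

```tcl
package require struct::tree
package require htmlparse

# Hypothetical helper: depth-first list of the type attribute of every node.
proc ::nodeTypes {t node} {
    set types [list [$t get $node type]]
    foreach child [$t children $node] {
        foreach type [::nodeTypes $t $child] { lappend types $type }
    }
    return $types
}

set t [::struct::tree]
::htmlparse::2tree {<form action="/search"><input type="text" name="q"></form><p>Results</p>} $t

::htmlparse::removeFormDefs $t
set ::remaining [::nodeTypes $t [$t rootname]]

puts "remaining node types: $::remaining"
$t destroy
```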


BUGS, IDEAS, FEEDBACK
       This document, and the package it describes, will undoubtedly contain bugs and other problems.
       Please report such in the category htmlparse of the Tcllib SF Trackers
       [http://sourceforge.net/tracker/?group_id=12883]. Please also report any ideas for enhancements you
       may have for either package and/or documentation.

SEE ALSO
       struct::tree

KEYWORDS
       html, parsing, queue, tree

CATEGORY
       Text processing



htmlparse                                            1.2                                        htmlparse(n)
