\documentclass[11pt]{article}
%\decimalpoint
\tolerance 10000
\textheight 24cm
\textwidth 16cm
\oddsidemargin 1mm
\topmargin -20mm
\parindent 0mm
\begin{document}
\title{xmlf90: A parser for XML in Fortran90}
\author{Alberto Garc\'{\i}a \\
           Departamento de F\'{\i}sica de la Materia Condensada \\
           Facultad de Ciencia y Tecnolog\'{\i}a \\
           Universidad del Pa\'{\i}s Vasco\\
           Apartado 644, 48080 Bilbao, Spain\\
           http://lcdx00.wm.lc.ehu.es/ag/xml/}
\date{30 January 2004 --- xmlf90 Version 1.1}

\maketitle
\section{Introduction}

{\bf NOTE: This version of the User Guide and Tutorial does not
cover either the WXML printing library or the new DOM API
conceived by Jon Wakelin. See the html reference material and the
relevant example subdirectories.}
\bigskip

This tutorial documents the user interface of \texttt{xmlf90}, a
native Fortran90 XML parser. The parser was designed to be a useful
tool in the extraction and analysis of data in the context of
scientific computing, and thus the priorities were efficiency and the
ability to deal with very large XML files while maintaining a small
memory footprint. There are two programming interfaces. The first is
based on the very successful SAX (Simple API for XML) model: the
parser calls routines provided by the user to handle certain events,
such as the encounter of the beginning of an element, or the end of an
element, or the reading of character data.  The other is based on the
XPATH standard. Only a very limited set of the full XPATH
specification is offered, but it is already quite useful.

Some familiarity with XML is assumed. Apart from the examples discussed
in this tutorial (chosen for their simplicity), the interested reader
can refer to the \texttt{Examples/} directory in the \texttt{xmlf90}
distribution.



\section{The SAX interface}
\subsection{A simple example}

To illustrate the working of the SAX interface, consider the following
XML snippet

\begin{verbatim}
        <item id="003">
           <description>Washing machine</description>
           <price currency="euro">1500.00</price>
        </item>
\end{verbatim}
%
When the parser processes this snippet, it carries out the sequence of calls:

\begin{enumerate}
\item call to \texttt{begin\_element\_handler} with name="item" and
    attributes=(Dictionary with the pair (id,003))
\item call to \texttt{begin\_element\_handler} with name="description" and an
    empty attribute dictionary.
\item  call to \texttt{pcdata\_chunk\_handler} with pcdata="Washing machine"
\item call to \texttt{end\_element\_handler} with name="description"
\item call to \texttt{begin\_element\_handler} with name="price" and
    attributes=(Dictionary with the pair (currency,euro))
\item  call to \texttt{pcdata\_chunk\_handler} with pcdata="1500.00"
\item call to \texttt{end\_element\_handler} with name="price"
\item call to \texttt{end\_element\_handler} with name="item"
\end{enumerate}

The handler routines are written by the user and passed to the parser
as procedure arguments. A simple program that parses the above XML
fragment (assuming it resides in file \textsl{inventory.xml}) and
prints out the names of the elements and any \textsl{id} attributes as
they are found, is:

\begin{verbatim}
program simple
use flib_sax

type(xml_t)        :: fxml      ! XML file object (opaque)
integer            :: iostat    ! Return code (0 if OK)

call open_xmlfile("inventory.xml",fxml,iostat)
if (iostat /= 0) stop "cannot open xml file"

call xml_parse(fxml, begin_element_handler=begin_element_print)

contains !---------------- handler subroutine follows

subroutine begin_element_print(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes

   character(len=3)  :: id
   integer           :: status

   print *, "Start of element: ", name
   if (has_key(attributes,"id")) then
        call get_value(attributes,"id",id,status)
        print *, "  Id attribute: ", id
   endif
end subroutine begin_element_print

end program simple
\end{verbatim}
%
To access the XML parsing functionality, the user only needs to \texttt{use}
the module \texttt{flib\_sax}, open the XML file, and call the main routine
\texttt{xml\_parse}, providing it with the appropriate event handlers.

The subroutine interfaces are:

\begin{verbatim}
subroutine open_xmlfile(fname,fxml,iostat)
character(len=*), intent(in)  :: fname     ! File name
type(xml_t), intent(out)      :: fxml      ! XML file object (opaque)
integer, intent(out)          :: iostat    ! Return code (0 if OK)


subroutine xml_parse(fxml,                   &
                     begin_element_handler,  &
                     end_element_handler,    &
                     pcdata_chunk_handler ....
                     .... MORE OPTIONAL HANDLERS  )

\end{verbatim}

The handlers are OPTIONAL arguments (in the above example we just
specify \texttt{begin\_element\_handler}). If no handlers are given,
nothing useful will happen, except that any errors are detected and
reported.  The interfaces for the most useful handlers are:

\begin{verbatim}
   subroutine begin_element_handler(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes
   end subroutine begin_element_handler

   subroutine end_element_handler(name)
   character(len=*), intent(in)     :: name
   end subroutine end_element_handler

   subroutine pcdata_chunk_handler(chunk)
   character(len=*), intent(in) :: chunk
   end subroutine pcdata_chunk_handler
\end{verbatim}

The attribute information in an element tag is represented as a
dictionary of name/value pairs, held in a \texttt{dictionary\_t}
abstract type.  The information in it can be accessed through a set of
dictionary methods such as \texttt{has\_key} and \texttt{get\_value}
(full interfaces to be found in Sect.~\ref{sec:reference}).

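As a minimal sketch of how these dictionary methods combine in
practice, the following handler tests for the \texttt{currency}
attribute before reading it. The module and routine names
(\texttt{m\_report}, \texttt{begin\_element\_report}) are hypothetical
and not part of the library. Only \texttt{has\_key} and
\texttt{get\_value} are used, and the reading of a nonzero
\texttt{status} as a failure is an assumption based on the usage
elsewhere in this tutorial.
%
\begin{verbatim}
module m_report
use flib_sax
private
public :: begin_element_report

contains

subroutine begin_element_report(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes

   character(len=40) :: currency
   integer           :: status

   if (name /= "price") return      ! only 'price' elements matter here

   if (has_key(attributes,"currency")) then
      call get_value(attributes,"currency",currency,status)
      if (status == 0) then
         print *, "Price given in: ", trim(currency)
      else
         print *, "Could not read the currency attribute"
      endif
   else
      print *, "Price element without a currency attribute"
   endif
end subroutine begin_element_report

end module m_report
\end{verbatim}
%
The routine can be passed to \texttt{xml\_parse} as the
\texttt{begin\_element\_handler} argument, exactly as in the program
above.
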
\subsection{Monitoring the sequence of events}
The above example is too simple and not very useful if what we want is
to extract information in a coherent manner. For example, assume we
have a more complete inventory of appliances such as
%
\begin{verbatim}
<inventory>
        <item id="003">
           <description>Washing machine</description>
           <price currency="euro">1500.00</price>
        </item>
        <item id="007">
           <description>Microwave oven</description>
           <price currency="euro">300.00</price>
        </item>
        <item id="011">
           <description>Dishwasher</description>
           <price currency="swedish crown">10000.00</price>
        </item>
</inventory>
\end{verbatim}
%
and we want to print the items with their prices in the form:
%
\begin{verbatim}
003 Washing machine : 1500.00 euro
007 Microwave oven : 300.00 euro
011 Dishwasher : 10000.00 swedish crown
\end{verbatim}

We begin by writing the following module

\begin{verbatim}
module m_handlers
use flib_sax
private
public :: begin_element, end_element, pcdata_chunk
!
! The state flags start as .false. so that they are defined
! before the first element is seen
!
logical, private            :: in_item = .false.
logical, private            :: in_description = .false., in_price = .false.
character(len=40), private  :: what, price, currency, id
!
contains !-----------------------------------------
!
subroutine begin_element(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes

   integer  :: status

   select case(name)
     case("item")
       in_item = .true.
       call get_value(attributes,"id",id,status)

     case("description")
       in_description = .true.

     case("price")
       in_price = .true.
       call get_value(attributes,"currency",currency,status)

   end select

end subroutine begin_element
!---------------------------------------------------------------
subroutine pcdata_chunk(chunk)
   character(len=*), intent(in) :: chunk

   if (in_description) what = chunk
   if (in_price) price = chunk

end subroutine pcdata_chunk
!---------------------------------------------------------------
subroutine end_element(name)
   character(len=*), intent(in)     :: name

   select case(name)
     case("item")
       in_item = .false.
       write(unit=*,fmt="(5(a,1x))") trim(id), trim(what), ":", &
                                     trim(price), trim(currency)

     case("description")
       in_description = .false.

     case("price")
       in_price = .false.

   end select

end subroutine end_element
!---------------------------------------------------------------
end module m_handlers
\end{verbatim}
%
PCDATA chunks are passed back as simple Fortran character variables,
and we assign them to \texttt{what} or \texttt{price} depending on the
context, which we monitor through the logical variables
\texttt{in\_description} and \texttt{in\_price}, updated as we enter and leave
different elements. (The variable \texttt{in\_item} is not strictly
necessary.)

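Because the chunks arrive as character data, any numerical use of them
requires a conversion. A minimal sketch using a standard Fortran
internal \texttt{read} is shown here. The program and variable names
are illustrative only, and the \texttt{iostat} check guards against
text that is not a number.
%
\begin{verbatim}
program chunk_to_real
implicit none
character(len=40)  :: price
real               :: value
integer            :: ios

price = "1500.00"                         ! as delivered to the handler
read(unit=price,fmt=*,iostat=ios) value   ! internal read: text -> real
if (ios /= 0) then
   print *, "Not a number: ", trim(price)
else
   print *, "Twice the price: ", 2.0*value
endif
end program chunk_to_real
\end{verbatim}
%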
The program to parse the file just needs to use the functionality in
the module \texttt{m\_handlers}:
%
\begin{verbatim}
program inventory
use flib_sax
use m_handlers

type(xml_t)        :: fxml      ! XML file object (opaque)
integer            :: iostat

call open_xmlfile("inventory.xml",fxml,iostat)
if (iostat /= 0) stop "cannot open xml file"

call xml_parse(fxml, begin_element_handler=begin_element, &
                     end_element_handler=end_element,     &
                     pcdata_chunk_handler=pcdata_chunk )

end program inventory

\end{verbatim}
%
\subsubsection{Exercises}
\begin{enumerate}
\item Code the above Fortran files and the XML file on your
computer. Compile and run the program and check that the output is
correct. (Compilation instructions are provided in
Sect.~\ref{sec:compiling}.)
\item Edit the XML file and remove one of the \texttt{</item>}
lines. What happens? This is an example of an XML file that is not
\textsl{well-formed}. The parser can detect it and complain about it.
\item Edit the XML file and remove the \texttt{currency} attribute
from one of the elements. What happens? In this case, the parser
cannot detect the missing attribute (it is not a \textsl{validating
parser}). However, the user can detect early
that something is wrong by checking the value of the \texttt{status}
variable after the call to \texttt{get\_value}.
\item Modify the program to print the prices in euros (1 euro buys
approximately 9.2 Swedish crowns).
\end{enumerate}

\subsection{Other tags and their handlers}

The parser can also process comments, XML declarations (formally known
as ``processing instructions''), and SGML declarations, although the
latter two are not acted upon in any way (in particular, no attempt at
validation of the XML document is done).

\begin{itemize}

\item
An \textbf{empty element} tag of the form
%
\begin{verbatim}
        <name att="value"...  />
\end{verbatim}
%
can be handled as successive calls to \texttt{begin\_element\_handler}
and \texttt{end\_element\_handler}.  However, if the optional handler
\texttt{empty\_element\_handler} is present, it is called instead. Its
interface is exactly the same as that of
\texttt{begin\_element\_handler}:
%
\begin{verbatim}
   subroutine empty_element_handler(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes
   end subroutine empty_element_handler
\end{verbatim}
%
\item
\textbf{Comments} are sections of the XML file contained between the markup
\texttt{<!{-}-} and \texttt{{-}->},
and are handled by the optional argument \texttt{comment\_handler}
%
\begin{verbatim}
   subroutine comment_handler(comment)
   character(len=*), intent(in) :: comment
   end subroutine comment_handler
\end{verbatim}
%
\item
\textbf{XML declarations} can be processed
in the same way as elements, with the ``target'' being the element name, etc.
For example, in
%
\begin{verbatim}
        <?xml version="1.0"?>
\end{verbatim}
%
\textsl{xml} would be the ``element name'', \textsl{version} an
attribute name, and \textsl{1.0} its value. The optional handler
interface is:
%
\begin{verbatim}
   subroutine xml_declaration_handler(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes
   end subroutine xml_declaration_handler
\end{verbatim}
%
\item
\textbf{SGML declarations} such as entity declarations or doctype
specifications are treated basically as comments. Interface:
%
\begin{verbatim}
   subroutine sgml_declaration_handler(sgml_declaration)
   character(len=*), intent(in) :: sgml_declaration
   end subroutine sgml_declaration_handler
\end{verbatim}
%
\end{itemize}
In the current version of the parser, overly long comments and SGML
declarations might be truncated.


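To make the use of these optional handlers concrete, here is a minimal
sketch that counts the comments and empty elements in a file. The
module and program names (\texttt{m\_count}, \texttt{count\_extras})
and the two counters are illustrative. The keyword names passed to
\texttt{xml\_parse} are assumed to be the handler names used in the
interfaces above.
%
\begin{verbatim}
module m_count
use flib_sax
private
public :: count_comment, count_empty
integer, public, save :: ncomments = 0, nempty = 0

contains

subroutine count_comment(comment)
   character(len=*), intent(in) :: comment
   ncomments = ncomments + 1          ! the comment text is ignored
end subroutine count_comment

subroutine count_empty(name,attributes)
   character(len=*), intent(in)     :: name
   type(dictionary_t), intent(in)   :: attributes
   nempty = nempty + 1                ! name and attributes are ignored
end subroutine count_empty

end module m_count

program count_extras
use flib_sax
use m_count

type(xml_t)  :: fxml
integer      :: iostat

call open_xmlfile("inventory.xml",fxml,iostat)
if (iostat /= 0) stop "cannot open xml file"

call xml_parse(fxml, comment_handler=count_comment, &
                     empty_element_handler=count_empty)

print *, "Comments: ", ncomments, " Empty elements: ", nempty
end program count_extras
\end{verbatim}
%
Since no \texttt{begin\_element\_handler} is given, ordinary elements
are parsed but otherwise ignored.
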
\section{The XPATH interface}

\textsl{NOTE: The current implementation gets its inspiration from
XPATH, but it is by no means a complete implementation of the
standard, or even of a subset of it. Since it is built on top of the SAX
interface, it uses a ``stream'' paradigm which is completely alien to
the XPATH specification. It is nevertheless still quite useful. The
author is open to suggestions to refine the interface.}

\bigskip

This API is based on the concept of an XML path. For example:
%
\begin{verbatim}
/inventory/item
\end{verbatim}
%
represents an 'item' element which is a child of the root element
'inventory'. Paths can contain special wildcard markers such as
\texttt{//} and \texttt{*}. The following are examples of valid paths:
%
\begin{verbatim}
  //a       : Any occurrence of element 'a', at any depth.
  /a/*/b    : Any 'b' which is a grand-child of 'a'
  ./a       : A relative path (with respect to the current path)
  a         : (same as above)
  /a/b/./c  : Same as /a/b/c (the dot (.) is a dummy)
  //*       : Any element.
  //a/*//b  : Any 'b' under any children of 'a'.

\end{verbatim}
%
\subsection{Simple example}
Using the XPATH interface it is possible to search for any element
directly, and to recover its attributes or character content. For
example, to print the names of all the appliances in the inventory:
%
\begin{verbatim}
program simple
use flib_xpath

type(xml_t) :: fxml

integer  :: status
character(len=100)  :: what

call open_xmlfile("inventory.xml",fxml,status)
if (status /= 0) stop "cannot open xml file"
!
do
      call get_node(fxml,path="//description",pcdata=what,status=status)
      if (status < 0)   exit
      print *, "Appliance: ", trim(what)
enddo
end program simple
\end{verbatim}
%
Repeated calls to \texttt{get\_node} return the character content of
the 'description' elements (at any depth). We exit the loop when the
\texttt{status} variable is negative on return from the call. This
indicates that there are no more elements matching the
\texttt{//description} path pattern.\footnote{Returning a negative
value for an end-of-file or end-of-record condition follows the
standard practice. Positive return values signal malfunctions.}

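The same sign convention is used by the \texttt{iostat} specifier of
the standard Fortran \texttt{read} statement, as this small
self-contained sketch illustrates. It does not use \texttt{xmlf90} at
all, and the scratch file and its contents are purely illustrative.
%
\begin{verbatim}
program eof_convention
implicit none
integer :: ios, n, value

! Write three numbers to a scratch file and go back to its start
open(unit=10,status="scratch",form="formatted")
write(unit=10,fmt=*) 3
write(unit=10,fmt=*) 7
write(unit=10,fmt=*) 11
rewind(unit=10)

n = 0
do
   read(unit=10,fmt=*,iostat=ios) value
   if (ios < 0) exit                  ! end of file: normal termination
   if (ios > 0) stop "read error"     ! malfunction
   n = n + 1
enddo
print *, "Values read: ", n
close(unit=10)
end program eof_convention
\end{verbatim}
%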
Apart from path patterns, we can narrow our search by specifying
conditions on the attribute list of the element. For example, to print
only the prices which are given in euros we can use the
\texttt{att\_name} and \texttt{att\_value} optional arguments:
%
\begin{verbatim}
program euros
use flib_xpath

type(xml_t) :: fxml

integer  :: status
character(len=100)  :: price

call open_xmlfile("inventory.xml",fxml,status)
!
do
  call get_node(fxml,path="//price", &
                att_name="currency",att_value="euro", &
                pcdata=price,status=status)
  if (status < 0)   exit
  print *, "Price (euro): ", trim(price)
enddo
end program euros
\end{verbatim}
%
We can zero in on any element in this fashion, but we apparently give
up the all-important context. What happens if we want to print
\textsl{both} the appliance description and its price?
%
\begin{verbatim}
program twoelements
use flib_xpath

type(xml_t) :: fxml

integer  :: status
character(len=100)  :: what, price, currency
type(dictionary_t)  :: attributes     ! attribute list of 'price'

call open_xmlfile("inventory.xml",fxml,status)
!
do
  call get_node(fxml,path="//description", &
                pcdata=what,status=status)
  if (status < 0)   exit                   ! No more items
  !
  ! Price comes right after description...
  !
  call get_node(fxml,path="//price", &
                attributes=attributes,pcdata=price,status=status)
  if (status /= 0) stop "missing price element!"

  call get_value(attributes,"currency",currency,status)
  if (status /= 0) stop "missing currency attribute!"

  write(unit=*,fmt="(6a)") "Appliance: ", trim(what), &
                            ". Price: ", trim(price), " ", trim(currency)
enddo
end program twoelements
\end{verbatim}
%
\subsubsection{Exercises}
\begin{enumerate}
\item Modify the above programs to print only the appliances priced in
euros.
\item Modify the order of the 'description' and 'price' elements in an
item. What happens to the 'twoelements' program output?
\item The full XPATH specification allows the query for a particular
element among a set of elements with the same path, based on the
ordering of the element. For example, \texttt{/inventory/item[2]} will refer
to the second 'item' element in the XML file. Write a routine that
implements this feature and returns the element's attribute
dictionary.
\item Queries for paths can be issued in any order, and so some
mechanism for ``rewinding'' the XML file is necessary. It is provided by
the appropriately named \texttt{rewind\_xmlfile} subroutine (see full
interface in the Reference section). Use it to implement a silly
program that prints items from the inventory at random. (Extra points
for including logic to minimize the number of rewinds.)
\end{enumerate}
%

\subsection{Contexts and restricted searches}

The logic of the \texttt{twoelements} program in the previous section
 follows from the assumption that the 'price' element follows the
 'description' element in a typical 'item'. If the DTD says so, and
 the XML file is valid (in the technical sense of conforming to the
 DTD), the assumption should be correct. However, since the parser is
 non-validating, it might be unreasonable to expect the proper
 ordering in all cases. What we should expect (as a minimum) is that
 both the price and description elements are children of the 'item'
 element. In the following version we make use of the \textbf{context}
 concept to achieve a more robust solution.
%
\begin{verbatim}
program item_context
use flib_xpath

type(xml_t) :: fxml, context

integer  :: status
character(len=100)  :: what, price, currency
type(dictionary_t)  :: attributes     ! attribute list of 'price'

call open_xmlfile("inventory.xml",fxml,status)
!
do
  call mark_node(fxml,path="//item",status=status)
  if (status < 0)   exit      ! No more items
  context = fxml               ! Save item context
  !
  ! Search relative to context
  !
  call get_node(fxml,path="price", &
                attributes=attributes,pcdata=price,status=status)
  call get_value(attributes,"currency",currency,status)
  if (status /= 0) stop "missing currency attribute!"
  !
  ! Rewind to beginning of context
  !
  fxml = context
  call sync_xmlfile(fxml)
  !
  ! Search relative to context
  !
  call get_node(fxml,path="description",pcdata=what,status=status)
  write(unit=*,fmt="(6a)") "Appliance: ", trim(what), &
                            ". Price: ", trim(price), " ", trim(currency)
enddo
end program item_context
\end{verbatim}
%
The call to \texttt{mark\_node} positions the parser's file handle
\texttt{fxml} right after the end of the starting tag of the next
'item' element. We save that position as a ``context marker'' to which
we can return later on. The calls to \texttt{get\_node} use path
patterns that do not start with a \texttt{/}: they are
\textbf{searches relative to the current context}. After getting the
information about the 'price' element, we restore the parser's file
handle to the appropriate position at the beginning of the 'item'
context, and search for the 'description' element. In the following
iteration of the loop, the parser will find the next 'item' element,
and the process will be repeated until there are no more 'item's.


Contexts come in handy to encapsulate parsing tasks in re-usable
subroutines. Suppose you are going to find the basic 'item' element
content in a whole lot of different XML files. The following
subroutine extracts the description and price information:
%
\begin{verbatim}
subroutine get_item_info(context,what,price,currency)
use flib_xpath
type(xml_t), intent(in)       :: context
character(len=*), intent(out) :: what, price, currency

!
! Local variables
!
type(xml_t)        :: ff
integer            :: status
type(dictionary_t) :: attributes

  !
  ! context is read-only, so make a copy and sync just in case
  !
  ff = context
  call sync_xmlfile(ff)
  !
  call get_node(ff,path="price", &
                attributes=attributes,pcdata=price,status=status)
  call get_value(attributes,"currency",currency,status)
  if (status /= 0) stop "missing currency attribute!"
  !
  ! Rewind to beginning of context
  !
  ff = context
  call sync_xmlfile(ff)
  !
  call get_node(ff,path="description",pcdata=what,status=status)

end subroutine get_item_info
\end{verbatim}
%
Using this routine, the parsing is much more compact:
%
\begin{verbatim}
program item_context
use flib_xpath

type(xml_t) :: fxml

integer  :: status
character(len=100)  :: what, price, currency

call open_xmlfile("inventory.xml",fxml,status)
!
do
  call mark_node(fxml,path="//item",status=status)
  if (status /= 0)   exit      ! No more items
  call get_item_info(fxml,what,price,currency)
  write(unit=*,fmt="(6a)") "Appliance: ", trim(what), &
                            ". Price: ", trim(price), " ", trim(currency)
  call sync_xmlfile(fxml)
enddo
end program item_context
\end{verbatim}
%
It is extremely important to understand the meaning of the call to
\texttt{sync\_xmlfile}. The file handle \texttt{fxml} holds parsing
context \textbf{and} a physical pointer to the file position
(basically a variable counting the number of characters read so
far). When the context is passed to the subroutine and the parsing
carried out, the context and the file position get out of
sync. Synchronization means to re-position the physical file pointer
to the place where it was when the context was first created.


\subsubsection{Exercises}
\begin{enumerate}
\item Modify the above programs to print only the appliances priced in
euros.
\item Write a program that prints only the most expensive
item. (Assume that the inventory is very large and it is not feasible
to hold everything in memory...)
\item Use the \texttt{get\_item\_info} subroutine to print
descriptions and price information from the following XML file:
%
\begin{verbatim}
<vacations>
        <trip>
           <description>Mediterranean cruise</description>
           <price currency="euro">1500.00</price>
        </trip>
        <trip>
           <description>Week in Majorca</description>
           <price currency="euro">300.00</price>
        </trip>
        <trip>
           <description>Wilderness Route</description>
           <price currency="swedish crown">10000.00</price>
        </trip>
</vacations>
\end{verbatim}
%
(Note that the routine does not care what the context name is: it
could be 'item' or 'trip'. It is only the fact that the children
('description' and 'price') are the same that matters.)
\end{enumerate}

\section{Handling of scientific data}

\subsection{Numerical datasets}

While the ASCII form is not the most efficient for the storage of
numerical data, the portability and flexibility offered by the XML
format makes it attractive for the interchange of scientific
datasets. There are a number of efforts under way to standardize this
area, and presumably we will have nifty tools for the creation and
visualization of files in the near future. Even then, however, it will
be necessary to be able to read numerical information into Fortran
programs. The \texttt{xmlf90} package offers limited but useful
functionality in this regard, making it possible to build numerical
arrays on the fly as the XML file containing the data is parsed. As an
example, consider the dataset:
%
\begin{verbatim}
<data>
 8.90679398599 8.90729421510 8.90780189594 8.90831710494
 8.90883991832 8.90937041202 8.90990866166 8.91045474255
 8.91100872963 8.91157069732 8.91214071958 8.91271886986
 8.91330522098 8.91389984506 8.91450281355 8.91511419713
 8.91573406560 8.91636248785 8.91699953183 8.91764526444
 8.91829975142 8.91896305734 8.91963524555 8.92031637799
 8.92100651514 8.92170571605 8.92241403816 8.92313153711
 8.92385826683 8.92459427943 8.92533962491 8.92609435120
 8.92685850416 8.92763212726 8.92841526149 8.92920794545
</data>
\end{verbatim}
%
and the following fragment of a \texttt{m\_handlers} module for SAX parsing:
%
\begin{verbatim}

real, dimension(1000)  :: x        ! numerical array to hold data
integer                :: ndata    ! number of values stored so far
logical                :: in_data = .false.

subroutine begin_element(name,attributes)
 ...
   select case(name)
     case("data")
       in_data = .true.
       ndata = 0
     ...
   end select

end subroutine begin_element
!---------------------------------------------------------------
subroutine pcdata_chunk_handler(chunk)
   character(len=*), intent(in) :: chunk

   if (in_data) call build_data_array(chunk,x,ndata)
   ...

end subroutine pcdata_chunk_handler
!-------------------------------------------------------------
subroutine end_element(name)
 ...
   select case(name)
     case("data")
       in_data = .false.
       print *, "Read ", ndata, " data elements."
       print *, "X: ", x(1:ndata)
     ...
   end select

end subroutine end_element
\end{verbatim}
%
When the \texttt{<data>} tag is encountered by the parser, the
variable \texttt{ndata} is initialized. Any PCDATA chunks found from
then on and until the \texttt{</data>} tag is seen are passed to the
\texttt{build\_data\_array} generic subroutine, which converts the
character data to the numerical format (integer, default real, double
precision) implied by the array \texttt{x}. The array is filled with
data and the \texttt{ndata} variable increased accordingly.

If the data is known to represent a multi-dimensional array (something
that could be encoded in the XML as attributes to the 'data' element,
for example), the user can employ the Fortran \texttt{reshape}
intrinsic to obtain the final form.

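A minimal sketch of this last step follows. It assumes that the
dimensions were carried in the XML as attributes (here simply
hard-wired as \texttt{nrows} and \texttt{ncols}, both hypothetical
names) and that the values were written in column-major order. The
array \texttt{x} stands in for the one filled by
\texttt{build\_data\_array}.
%
\begin{verbatim}
program reshape_example
implicit none
integer, parameter            :: nrows = 2, ncols = 3
real, dimension(6)            :: x
real, dimension(nrows,ncols)  :: a
integer                       :: i

x = (/ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 /)   ! stand-in for the parsed data
a = reshape(x(1:nrows*ncols), (/ nrows, ncols /))

do i = 1, nrows
   print *, a(i,:)                       ! one row of the matrix per line
enddo
end program reshape_example
\end{verbatim}
%
With this ordering the first row of \texttt{a} holds \texttt{x(1)},
\texttt{x(3)} and \texttt{x(5)}. Data stored row by row would need an
additional \texttt{transpose}.
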
There is absolutely no limit to the size of the data (apart from
filesystem size and total memory constraints) since the parser only
holds in memory at any given time a small chunk of character data (the
default is to split the character data stream and call the
\texttt{pcdata\_chunk\_handler} routine at the end of a line, or at
the end of a token if the line is too long). This is one of the most
useful features of the SAX approach to XML parsing.

In order to read numerical data with the XPATH interface in its
current implementation, one must first read the PCDATA into the
\texttt{pcdata} optional argument of \texttt{get\_node}, and then call
\texttt{build\_data\_array}. However, there is an internal limit to
the size of the PCDATA buffer, so this method cannot be safely used
for large datasets at this point. In a forthcoming version there will
be a generic subroutine \texttt{get\_node} with a \texttt{data}
numerical array optional argument which will be filled by the parser
on the fly.




\subsubsection{Exercises}
\begin{enumerate}
\item Generate an XML file containing a large dataset, and write a
program to read the information back. You might want to include
somewhere in the XML file information about the number of data
elements, so that an array of the proper size can be used.
\item Devise a strategy to read a dataset without knowing in advance
the number of data elements. (Some possibilities: re-sizable
allocatable arrays, two-pass parsing...).
\item Suggest a possible encoding for the storage of two-dimensional
arrays, and write a program to read the information from the XML file
and create the appropriate array.
\item Write a program that could read a 10~GB Monte Carlo simulation
dataset and print the average and standard deviation of the data. (We
are not advocating the use of XML for such large datasets. NetCDF
would be much more efficient in this case.)
\end{enumerate}

810\subsection{Mapping of XML elements to derived types}
811
812After the parsing, the data has to be put somewhere. A good strategy
813to handle structured content is to try to replicate it within data
814structures inside the user program. For example, an element of the
815form
816%
817\begin{verbatim}
818<table units="nm" npts="100">
819<description>Cluster diameters</description>
820<data>
8212.3 4.5 5.6 3.4 2.3 1.2 ...
822...
823...
824</data>
825</table>
826\end{verbatim}
827%
828could be mapped onto a derived type of the form:
829%
830\begin{verbatim}
831type :: table
832  character(len=50)            :: description
833  character(len=20)            :: units
834  integer                      :: npts
835  real, dimension(:), pointer  :: data
836end type table
837\end{verbatim}
838%
There could even be parsing and output subroutines associated with
this derived type, so that the user can handle the XML production and
reading transparently. The \texttt{Examples/} directory in the
\texttt{xmlf90} distribution contains some code along these lines.
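
As a minimal sketch of such an output routine (it is not code from
the distribution), the module below pairs the \texttt{table} type with
a hypothetical subroutine \texttt{write\_table\_xml} that writes the
element shown above. The module, routine and program names, the
numerical output format, the use of unit 6 for standard output, and
the Fortran~95 \texttt{i0} edit descriptor are all assumptions made
for illustration. No escaping of special characters such as
\texttt{<} or \texttt{\&} is attempted.
%
\begin{verbatim}
module m_table
implicit none
private
public :: table, write_table_xml

type :: table
  character(len=50)            :: description
  character(len=20)            :: units
  integer                      :: npts
  real, dimension(:), pointer  :: data
end type table

contains

subroutine write_table_xml(unit,t)
!
! Writes t as a <table> element to an already opened unit
!
integer, intent(in)     :: unit
type(table), intent(in) :: t

write(unit,"(3a,i0,a)") '<table units="', trim(t%units), &
                        '" npts="', t%npts, '">'
write(unit,"(3a)") "<description>", trim(t%description), &
                   "</description>"
write(unit,"(a)") "<data>"
! Six values per line; the format is reused for longer arrays
write(unit,"(6(1x,g14.6))") t%data(1:t%npts)
write(unit,"(a)") "</data>"
write(unit,"(a)") "</table>"
end subroutine write_table_xml

end module m_table

program table_demo
use m_table
implicit none
type(table) :: t

t%description = "Cluster diameters"
t%units = "nm"
t%npts = 3
allocate(t%data(t%npts))
t%data = (/ 2.3, 4.5, 5.6 /)
call write_table_xml(6,t)      ! unit 6: usually standard output
deallocate(t%data)
end program table_demo
\end{verbatim}
%
A matching reading routine would fill the same components from the
attribute dictionary and the PCDATA of the element.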
843
844\subsubsection{Exercises}
845%
846\begin{enumerate}
847\item Study the \texttt{pseudo} example in \texttt{Examples/sax/} and
848\texttt{Examples/xpath/}. Now, with your own application in mind,
849write derived-type definitions and parsing routines to handle your XML
850data (which would also need to be \textsl{designed} somehow).
851
852\end{enumerate}
853%
854
855
856\section{REFERENCE: Subroutine interfaces}
857\label{sec:reference}
858
859\subsection{Dictionary handling}
860
861Attribute lists are handled as instances of a derived type
862\texttt{dictionary\_t}, loosely inspired by the Python type.  The
863terminology is more general: keys and entries instead of names and
864attributes.
865
866\begin{itemize}
867\item
868%
869\begin{verbatim}
870function number_of_entries(dict) result(n)
871!
872! Returns the number of entries in the dictionary
873!
874type(dictionary_t), intent(in)   :: dict
875integer                          :: n
876\end{verbatim}
877%
878\item
879%
880\begin{verbatim}
881function has_key(dict,key) result(found)
882!
883! Checks whether there is an entry with
884! the given key in the dictionary
885!
886type(dictionary_t), intent(in)   :: dict
887character(len=*), intent(in)     :: key
888logical                          :: found
889\end{verbatim}
890\item
891%
892\begin{verbatim}
893subroutine get_value(dict,key,value,status)
894!
895! Gets values by key
896!
897type(dictionary_t), intent(in)            :: dict
898character(len=*), intent(in)              :: key
899character(len=*), intent(out)             :: value
900integer, intent(out)                      :: status
901\end{verbatim}
902%
903\item
904%
905\begin{verbatim}
906subroutine get_key(dict,i,key,status)
907!
908! Gets keys by their order in the dictionary
909!
910type(dictionary_t), intent(in)            :: dict
911integer, intent(in)                       :: i
912character(len=*), intent(out)             :: key
913integer, intent(out)                      :: status
914
915\end{verbatim}
916%
917\item
918%
919\begin{verbatim}
920subroutine print_dict(dict)
921!
922! Prints the contents of the dictionary to stdout
923!
924type(dictionary_t), intent(in)   :: dict
925\end{verbatim}
926\end{itemize}
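
The routines above can be combined to walk through all the attributes
of an element. The following sketch of a \texttt{begin\_element}
handler prints every key and its value. It assumes that the dictionary
routines are made available by the \texttt{flib\_sax} module, as in
the tutorial examples, and that \texttt{status} is zero on success.
The module name \texttt{m\_attr\_handlers}, the handler name, and the
buffer length of 100 characters are hypothetical.
%
\begin{verbatim}
module m_attr_handlers
use flib_sax
private
public :: begin_element_print

contains

subroutine begin_element_print(name,attributes)
character(len=*), intent(in)     :: name
type(dictionary_t), intent(in)   :: attributes

character(len=100)  :: key, value    ! assumed to be long enough
integer             :: i, status

print *, "Element: ", trim(name)
do i = 1, number_of_entries(attributes)
   call get_key(attributes,i,key,status)
   if (status /= 0) cycle             ! assumed: 0 means success
   call get_value(attributes,trim(key),value,status)
   if (status == 0) print *, "   ", trim(key), " = ", trim(value)
enddo
end subroutine begin_element_print

end module m_attr_handlers
\end{verbatim}
%
The handler would be passed to \texttt{xml\_parse} through the
\texttt{begin\_element\_handler} keyword argument.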
927
928\subsection{SAX interface}
929
930\begin{itemize}
931\item
932\begin{verbatim}
933subroutine open_xmlfile(fname,fxml,iostat)
934!
935! Opens the file "fname" and creates an xml handle fxml
936! iostat /= 0 on error.
937!
938character(len=*), intent(in)      :: fname
939integer, intent(out)              :: iostat
940type(xml_t), intent(out)          :: fxml
941\end{verbatim}
942\item
943\begin{verbatim}
944subroutine xml_parse(fxml, begin_element_handler,    &
945                           end_element_handler,      &
946                           pcdata_chunk_handler,     &
947                           comment_handler,          &
948                           xml_declaration_handler,  &
949                           sgml_declaration_handler, &
950                           error_handler,            &
951                           signal_handler,           &
952                           verbose,                  &
953                           empty_element_handler)
954
955type(xml_t), intent(inout), target  :: fxml
956
957optional                            :: begin_element_handler
958optional                            :: end_element_handler
959optional                            :: pcdata_chunk_handler
960optional                            :: comment_handler
961optional                            :: xml_declaration_handler
962optional                            :: sgml_declaration_handler
963optional                            :: error_handler
964optional                            :: signal_handler  ! see XPATH code
965logical, intent(in), optional       :: verbose
966optional                            :: empty_element_handler
967
968\end{verbatim}
969\item Interfaces for handlers follow:
970
971\begin{verbatim}
972   subroutine begin_element_handler(name,attributes)
973   character(len=*), intent(in)     :: name
974   type(dictionary_t), intent(in)   :: attributes
975   end subroutine begin_element_handler
976
977   subroutine end_element_handler(name)
978   character(len=*), intent(in)     :: name
979   end subroutine end_element_handler
980
981   subroutine pcdata_chunk_handler(chunk)
982   character(len=*), intent(in) :: chunk
983   end subroutine pcdata_chunk_handler
984
985   subroutine comment_handler(comment)
986   character(len=*), intent(in) :: comment
987   end subroutine comment_handler
988
989   subroutine xml_declaration_handler(name,attributes)
990   character(len=*), intent(in)     :: name
991   type(dictionary_t), intent(in)   :: attributes
992   end subroutine xml_declaration_handler
993
994   subroutine sgml_declaration_handler(sgml_declaration)
995   character(len=*), intent(in) :: sgml_declaration
996   end subroutine sgml_declaration_handler
997
998   subroutine error_handler(error_info)
999   type(xml_error_t), intent(in)            :: error_info
1000   end subroutine error_handler
1001
1002   subroutine signal_handler(code)
1003   logical, intent(out) :: code
1004   end subroutine signal_handler
1005
1006   subroutine empty_element_handler(name,attributes)
1007   character(len=*), intent(in)     :: name
1008   type(dictionary_t), intent(in)   :: attributes
1009   end subroutine empty_element_handler
1010\end{verbatim}
1011\end{itemize}
1012
1013Other file handling routines (some of them really only useful within
1014the XPATH interface):
1015
1016\begin{itemize}
1017\item
1018\begin{verbatim}
1019subroutine REWIND_XMLFILE(fxml)
1020!
1021! Rewinds the physical file associated to fxml and clears the data
1022! structures used in parsing.
1023!
1024type(xml_t), intent(inout) :: fxml
1025\end{verbatim}
1026
1027\item
1028\begin{verbatim}
1029subroutine SYNC_XMLFILE(fxml,status)
1030!
1031! Synchronizes the physical file associated to fxml so that reading
1032! can resume at the exact point in the parsing saved in fxml.
1033!
1034type(xml_t), intent(inout) :: fxml
1035integer, intent(out)       :: status
1036
1037\end{verbatim}
1038\item
1039\begin{verbatim}
1040subroutine CLOSE_XMLFILE(fxml)
1041!
! Closes the file handle fxml (and the associated OS file object)
1043!
1044type(xml_t), intent(inout) :: fxml
1045\end{verbatim}
1046\end{itemize}
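
The following sketch shows how the SAX routines fit together: the
file is opened, parsed with two user handlers, and closed. It assumes
that the interface is accessed through the \texttt{flib\_sax} module,
as in the tutorial examples. The file name \texttt{data.xml}, the
module \texttt{m\_count\_handlers} and the handler names are
hypothetical. Handlers live in a module because Fortran90 does not
allow internal procedures to be passed as actual arguments.
%
\begin{verbatim}
module m_count_handlers
use flib_sax
private
public :: count_begin, count_pcdata

integer, public, save :: n_elements = 0   ! start tags seen
integer, public, save :: n_chars    = 0   ! PCDATA characters seen

contains

subroutine count_begin(name,attributes)
character(len=*), intent(in)     :: name
type(dictionary_t), intent(in)   :: attributes
n_elements = n_elements + 1
end subroutine count_begin

subroutine count_pcdata(chunk)
character(len=*), intent(in) :: chunk
n_chars = n_chars + len(chunk)
end subroutine count_pcdata

end module m_count_handlers

program count_elements
use flib_sax
use m_count_handlers

type(xml_t)  :: fxml
integer      :: iostat

call open_xmlfile("data.xml",fxml,iostat)
if (iostat /= 0) stop "cannot open xml file"

! Only the handlers of interest need to be supplied
call xml_parse(fxml, begin_element_handler=count_begin, &
                     pcdata_chunk_handler=count_pcdata)
call close_xmlfile(fxml)

print *, "Elements:          ", n_elements
print *, "PCDATA characters: ", n_chars
end program count_elements
\end{verbatim}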
1047
1048\subsection{XPATH interface}
1049%
1050\begin{itemize}
1051\item
1052\begin{verbatim}
1053subroutine MARK_NODE(fxml,path,att_name,att_value,attributes,status)
1054!
1055! Performs a search of a given element (by path, and/or presence of
1056! a given attribute and/or value of that attribute), returning optionally
1057! the element's attribute dictionary, and leaving the file handle fxml
1058! ready to process the rest of the element's contents (child elements
1059! and/or pcdata).
1060!
1061! Side effects: it sets a "path_mark" in fxml to enable its use as a
1062! context.
1063!
1064! If the argument "path" is present and evaluates to a relative path (a
1065! string not beginning with "/"), the search is interrupted after the end
1066! of the "ancestor_element" set by a previous call to "mark_node".
1067! If not earlier, the search ends at the end of the file.
1068!
1069! The status argument, if present, will hold a return value,
1070! which will be:
1071!
1072!   0 on success,
1073!   negative in case of end-of-file or end-of-ancestor-element, or
1074!   positive in case of other malfunction
1075!
1076type(xml_t), intent(inout), target           :: fxml
1077character(len=*), intent(in), optional       :: path
1078character(len=*), intent(in), optional       :: att_name
1079character(len=*), intent(in), optional       :: att_value
1080type(dictionary_t), intent(out), optional    :: attributes
1081integer, intent(out), optional               :: status
1082\end{verbatim}
1083
1084\item
1085\begin{verbatim}
1086subroutine GET_NODE(fxml,path,att_name,att_value,attributes,pcdata,status)
1087!
1088! Performs a search of a given element (by path, and/or presence of
1089! a given attribute and/or value of that attribute), returning optionally
1090! the element's attribute dictionary and any PCDATA characters contained
1091! in the element's scope (but not child elements). It leaves the file handle
1092! physically and logically positioned:
1093!
1094!     after the end of the element's start tag if 'pcdata' is not present
1095!     after the end of the element's end tag if 'pcdata' is present
1096!
1097! If the argument "path" is present and evaluates to a relative path (a
1098! string not beginning with "/"), the search is interrupted after the end
1099! of the "ancestor_element" set by a previous call to "mark_node".
1100! If not earlier, the search ends at the end of the file.
1101!
1102! The status argument, if present, will hold a return value,
1103! which will be:
1104!
1105!   0 on success,
1106!   negative in case of end-of-file or end-of-ancestor-element, or
1107!   positive in case of a malfunction (such as the overflow of the
1108!   user's pcdata buffer).
1109!
1110type(xml_t), intent(inout), target           :: fxml
1111character(len=*), intent(in), optional       :: path
1112character(len=*), intent(in), optional       :: att_name
1113character(len=*), intent(in), optional       :: att_value
1114type(dictionary_t), intent(out), optional    :: attributes
1115character(len=*), intent(out), optional      :: pcdata
1116integer, intent(out), optional               :: status
1117\end{verbatim}
1118\end{itemize}
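
The interplay between \texttt{mark\_node} and \texttt{get\_node} can
be sketched as follows, using the \texttt{table} element discussed
earlier. Each \texttt{table} is marked in turn, and its
\texttt{description} child is then retrieved with a relative path, so
that the search cannot run past the end of the marked element. The
sketch assumes that \texttt{flib\_xpath} also exports the file
handling and dictionary routines, as in the tutorial examples. The
file name \texttt{tables.xml} and the program name are hypothetical.
%
\begin{verbatim}
program read_tables
use flib_xpath

type(xml_t)         :: fxml
type(dictionary_t)  :: attributes
character(len=20)   :: units
character(len=50)   :: description
integer             :: status

call open_xmlfile("tables.xml",fxml,status)
if (status /= 0) stop "cannot open xml file"

do
   ! Leave the handle just after the start tag of the next <table>
   call mark_node(fxml,path="//table",attributes=attributes, &
                  status=status)
   if (status /= 0) exit     ! negative: no more tables; positive: error

   call get_value(attributes,"units",units,status)
   if (status /= 0) units = "unknown"

   ! Relative path: the search ends with the marked <table>
   call get_node(fxml,path="description",pcdata=description, &
                 status=status)
   if (status == 0) then
      print *, trim(description), " (", trim(units), ")"
   endif
enddo

call close_xmlfile(fxml)
end program read_tables
\end{verbatim}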
1119%
1120\subsection{PCDATA conversion routines}
1121\begin{itemize}
1122\item
1123
1124\begin{verbatim}
1125subroutine build_data_array(str,x,n)
1126!
1127! Incrementally builds the data array x from
1128! character data contained in str. n holds
1129! the number of entries of x set so far.
1130!
1131character(len=*), intent(in)                ::  str
1132NUMERIC TYPE, dimension(:), intent(inout)   ::    x
1133integer, intent(inout)                      ::    n
1134!
1135! NUMERIC TYPE can be any of:
1136!            integer
1137!            real
1138!            real(kind=selected_real_kind(14))
1139!
1140\end{verbatim}
1141\end{itemize}
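
A short sketch of the incremental use of
\texttt{build\_data\_array} follows. The counter must be set to zero
before the first call, and each later call appends to the entries
already stored. The sketch assumes that the routine is exported by
\texttt{flib\_sax}; the program name and the array size of 10 are
hypothetical. In a real application the strings would be the chunks
received by a \texttt{pcdata\_chunk\_handler}.
%
\begin{verbatim}
program build_demo
use flib_sax

real, dimension(10)  :: x
integer              :: n

n = 0                                      ! nothing stored yet
call build_data_array("2.3 4.5 5.6",x,n)   ! first chunk
call build_data_array("3.4 2.3",x,n)       ! appended after the first
print *, "Entries read: ", n
print *, x(1:n)
end program build_demo
\end{verbatim}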
1142
1143\subsection{Other utility routines}
1144\begin{itemize}
1145\item
1146
1147\begin{verbatim}
1148function xml_char_count(fxml) result (nc)
1149!
1150! Provides the value of the processed-characters counter
1151!
1152type(xml_t), intent(in)          :: fxml
1153integer                          :: nc
1154
1155nc = nchars_processed(fxml%fb)
1156
1157end function xml_char_count
1158\end{verbatim}
1159\end{itemize}
1160
1161\section{Other parser features, limitations, and design issues}
1162
1163\subsection{Features}
1164\begin{itemize}
1165\item 
The parser can detect badly formed documents, by default reporting
the error together with the line and column where it occurred. It
also accepts an \texttt{error\_handler} routine as an optional
argument, for finer control by the user. In the SAX interface, if the
optional logical argument \texttt{verbose} is present and
\texttt{.true.}, the parser prints detailed information about its
inner workings. In the XPATH interface, a pair of routines,
\texttt{enable\_debug} and \texttt{disable\_debug}, controls
verbosity. See \texttt{Examples/xpath/} for examples.
1175
1176\item
It ignores PCDATA outside of element context (and warns about it).
1178
1179\item
1180Attribute values can be specified using both single and double
1181quotes (as per the XML specs).
1182
1183\item 
It processes the predefined entities \&gt;, \&amp;, \&lt;, \&apos; and
\&quot;, as well as decimal and hexadecimal character references (for
example, \&\#123; and \&\#x4E;). The processing is not ``on the fly'':
it takes place after chunks of PCDATA have been read.
1188
1189\item
1190Understands and processes CDATA sections (transparently passed as
1191PCDATA to the handler).
1192
1193\end{itemize}
1194
1195See \texttt{Examples/sax/features} for an illustration of the above
1196features.
1197
1198\subsection{Limitations}
1199\begin{itemize}
1200
1201\item It is not a validating parser.
1202
1203\item It accepts only single-byte encodings for characters.
1204
1205\item Currently, there are hard-wired limits on the length of element
1206  and attribute identifiers, and the length of attribute values and
1207  unbroken (i.e., without whitespace) PCDATA sections.  The limit is
1208  set in \texttt{sax/m\_buffer.f90} to \texttt{MAX\_BUFF\_SIZE=300}.
1209
1210\item Overly long comments and SGML declarations can also be
1211truncated, but the effect is currently harmless since the parser does
1212not make use of that information. In a future version there could be a
1213more robust retrieval mechanism.
1214
\item The number of attributes is limited to \texttt{MAX\_ITEMS=20}
  in \texttt{sax/m\_dictionary.f90}.
1217
1218 
\item In the XPATH interface, returned PCDATA character buffers
 cannot be larger than the internal limit
 \texttt{MAX\_PCDATA\_SIZE=65536} set in \texttt{xpath/m\_path.f90}.
1222
1223
1224\end{itemize}
1225
1226\subsection{Design Issues}
1227
1228See \texttt{\{sax,xpath\}/Developer.Guide}.
1229
1230The parser is actually written in the \texttt{F} subset of Fortran90,
1231for which inexpensive compilers are available. (See
1232\texttt{http://fortran.com/imagine1/}).
1233 
1234There are two other projects aimed at parsing XML in Fortran: those of
1235Mart Rentmeester (\texttt{http://nn-online.sci.kun.nl/fortran/}) and
1236Arjen Markus (\texttt{http://xml-fortran.sourceforge.net/}). Up to
1237this point the three projects have progressed independently, but it is
1238anticipated that there will be a pooling of efforts in the near
1239future.
1240
1241\newpage
1242\section{Installation Instructions}
1243%
1244There is extensible built-in support for arbitrary compilers.  The
1245setup discussed below is taken from the author's \texttt{flib}
1246project\footnote{There seems to be other projects with that very obvious
1247name...}  The idea is to have a configurable repository of useful
1248modules and library objects which can be accessed by fortran
1249programs. Different compilers are supported by tailored macros.
1250
1251\texttt{xmlf90} is just one of several packages in \texttt{flib},
1252hence the \texttt{flib\_} prefix in the package's visible module
1253names.
1254
1255To install the package, follow this steps:
1256
1257\begin{verbatim}
1258
1259 * Create a directory somewhere containing a copy of the stuff in the
1260   subdirectory 'macros':
1261
1262        cp -rp macros $HOME/flib
1263
1264 * Define the environment variable FLIB_ROOT to point to that directory.
1265
1266        FLIB_ROOT=$HOME/flib  ; export FLIB_ROOT     (sh-like shells)
1267        setenv FLIB_ROOT $HOME/flib                  (csh-like shells)
1268
1269
1270 * Go into $FLIB_ROOT, look through the fortran-XXXX.mk files,
1271   and see if one of them applies to your computer/compiler combination.
1272   If so, copy it or make a (symbolic) link to 'fortran.mk':
1273
1274        ln -sf fortran-lf95.mk fortran.mk
1275
1276   If none of the .mk files look useful, write your own, using the
1277   files provided as a guide. Basically you need to figure out the
1278   name and options for the compiler,  the extension assigned to
1279   module files, and the flag used to identify the module search path.
1280
1281 The above steps need only be done once.
1282
1283 * Go into subdirectory 'sax' and type 'make'.
1284 * Go into subdirectory 'xpath' and type 'make'.
1285 * Go into subdirectory 'Tutorial' and try the exercises in this guide
1286  (see the next section for compilation details).
1287 * Go into subdirectory 'Examples' and explore.
1288
1289\end{verbatim}
1290%
1291\section{Compiling user programs}
1292\label{sec:compiling}
1293
1294After installation, the appropriate modules and library files should
1295already be in \texttt{\$FLIB\_ROOT/modules} and
1296\texttt{\$FLIB\_ROOT/lib}, respectively. To compile user programs, it
1297is suggested that the user create a separate directory to hold the
1298program files and prepare a \texttt{Makefile} following the template
1299(taken from \texttt{Examples/sax/simple/}):
1300
1301\begin{verbatim}
1302#---------------------------------------------------------------
1303#
1304default: example
1305#
1306#---------------------------
1307MK=$(FLIB_ROOT)/fortran.mk
1308include $(MK)
1309#---------------------------
1310#
1311# Uncomment the following line for debugging support
1312#
1313FFLAGS=$(FFLAGS_DEBUG)
1314#
1315LIBS=$(LIB_PREFIX)$(LIB_STD) -lflib
1316#
1317OBJS= m_handlers.o example.o
1318     
1319example:  $(OBJS)
1320        $(FC) $(LDFLAGS) -o $@ $(OBJS)  $(LIBS)
1321#
1322clean:
1323        rm -f *.o example *$(MOD_EXT)
1324#
1325#---------------------------------------------------------------
1326\end{verbatim}
1327%
1328Here it is assumed that the user has two source files,
1329\texttt{example.f90} and \texttt{m\_handlers.f90}. Simply typing
1330\texttt{make} will compile \texttt{example}, pulling in all the needed
1331modules and library objects.
1332
1333
1334\end{document}