Xcerpt


Data Terms

Data terms represent XML documents and data items in semistructured databases. They correspond to ground functional programming expressions and ground logical atoms. Apart from the special constructs for ordered/unordered term specification and the Xcerpt reference mechanism, data terms are just a simplified syntax for XML, or "XML in disguise". Data terms are not restricted to representing XML data or semistructured expressions: they are meant as an abstraction of many of the available formalisms for rooted, graph-structured data, such as data represented in OEM or ACeDB, but also Lisp S-expressions or RDF graphs.
      <data-term> := ( oid "@" )? <ns-label> <list> .
       <ns-label> := (<ns-prefix> ":")? label .
      <ns-prefix> := label | '"' iri '"' .
           <list> := <ordered-list> | <unordered-list> .
   <ordered-list> := "[" <attributes>? <data-subterms>? "]" .
 <unordered-list> := "{" <attributes>? <data-subterms>? "}" .
  <data-subterms> := <data-subterm> ( "," <data-subterm> )* .
   <data-subterm> := <data-term> | '"' string '"' | number | "^" oid .
     <attributes> := "attributes" "{" <attribute> ( "," <attribute> )* "}" .
      <attribute> := <ns-label> "{" '"' string '"' "}" .

Expressions between < and > are non-terminal symbols (or variables). Expressions enclosed in the quotation characters " or ' are terminal symbols. oid, label, and string are character sequences: oid denotes an object identifier (corresponding to an XML identifier), label denotes an expression label (a tag name), and string corresponds to text content. number is an arbitrary integer or floating point number. iri is an internationalised resource identifier as defined in [iri].
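
The grammar can be mirrored by a small in-memory model. The following Python sketch is only an illustration and is not part of Xcerpt: the names Term, Ref and render are hypothetical, and the serialiser assumes that the attributes group is separated from the following subterms by a comma, as in the examples further down this page.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ref:
    """Referring occurrence of an identifier, written ^oid."""
    oid: str


@dataclass
class Term:
    """One <data-term>: optional oid, a label, and an ordered or unordered list."""
    label: str
    children: list = field(default_factory=list)    # Term, Ref, str or number
    ordered: bool = True                             # True -> [ ], False -> { }
    oid: Optional[str] = None
    attributes: dict = field(default_factory=dict)   # attribute label -> string


def render(node) -> str:
    """Serialise a node in the concrete data term syntax."""
    if isinstance(node, Ref):
        return "^" + node.oid
    if isinstance(node, str):
        return '"' + node + '"'
    if isinstance(node, (int, float)):
        return str(node)
    parts = []
    if node.attributes:
        attrs = ", ".join(f'{k} {{ "{v}" }}' for k, v in node.attributes.items())
        parts.append(f"attributes {{ {attrs} }}")
    parts.extend(render(child) for child in node.children)
    open_b, close_b = ("[", "]") if node.ordered else ("{", "}")
    prefix = f"{node.oid} @ " if node.oid else ""
    return f"{prefix}{node.label} {open_b} {', '.join(parts)} {close_b}"


book = Term("book", [Term("title", ["Boken Om Vikingarna"])], ordered=False)
print(render(book))  # book { title [ "Boken Om Vikingarna" ] }
```

String escaping and namespace prefixes are deliberately left out to keep the sketch short.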

If a data term t is of the form label[t1,...,tn] or label{t1,...,tn}, then the ti are called immediate subterms of t. Subterms of the ti are called indirect subterms of t. If neither "immediate" nor "indirect" is specified, the term subterm usually only refers to the immediate subterms of a term. In analogy to the XML terminology, t is the parent term of its subterms, (immediate) subterms are sometimes also referred to as child terms, and the topmost parent term is called the root term. In an expression of the form attributes{label1{...},...,labeln{...}}, the labels must be different, because XML attributes need to have different names.

Consider again the publication list from this document. The representation of this semistructured data item as a data term (or semistructured expression) is shown first, followed by an equivalent representation (except for subterm ordering) as an XML document. Note that the document prologue is omitted for brevity.

publications {
  book {
    title [ "Folket i Birka på Vikingarnas Tid" ],
    authors [ 
      author [ "Mats Wahl" ],
      author [ "Sven Nordqvist" ],
      author [ "Björn Ambrosiani" ]
    ]
  },

  book {
    title [ "Boken Om Vikingarna" ],
    authors [
      author [ "Catharina Ingelman-Sundberg" ]
    ]
  }
}

<publications>
  <book>
    <title>Folket i Birka på Vikingarnas Tid</title>
    <authors>
      <author>Mats Wahl</author>
      <author>Sven Nordqvist</author>
      <author>Björn Ambrosiani</author>
    </authors>
  </book>

  <book>
    <title>Boken Om Vikingarna</title>
    <authors>
      <author>Catharina Ingelman-Sundberg</author>
    </authors>
  </book>
</publications>

In this example, the terms with label book are immediate subterms or child terms of the term with label publications, which is also the root term. The term with label publications is thus the parent term of the terms with label book. The terms labelled author are immediate subterms of the respective terms labelled authors, and indirect subterms of e.g. the respective terms labelled book.
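
The terminology can be made concrete with a short Python sketch. It is an illustration under stated assumptions: terms are modelled as hypothetical (label, children) tuples, string content is not counted as a subterm, and the function names are invented for this example.

```python
def immediate_subterms(term):
    """Child terms of `term`; string content is skipped in this sketch."""
    _label, children = term
    return [c for c in children if isinstance(c, tuple)]


def descendants(term):
    """All subterms of `term` at any depth, in document order."""
    out = []
    for child in immediate_subterms(term):
        out.append(child)
        out.extend(descendants(child))
    return out


def indirect_subterms(term):
    """Subterms of the immediate subterms, i.e. everything below the children."""
    return [d for child in immediate_subterms(term) for d in descendants(child)]


publications = ("publications", [
    ("book", [
        ("title", ["Boken Om Vikingarna"]),
        ("authors", [("author", ["Catharina Ingelman-Sundberg"])]),
    ]),
])

print([t[0] for t in immediate_subterms(publications)])  # ['book']
print([t[0] for t in indirect_subterms(publications)])   # ['title', 'authors', 'author']
```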

Data terms may be used as an abstraction for many other formalisms that represent hierarchical or graph-structured data. The following two examples show the publication list as a Lisp S-expression and in the Object Exchange Model (OEM).

(publications
  (book
    (title "Folket i Birka på Vikingarnas Tid")
    (authors
      (author "Mats Wahl")
      (author "Sven Nordqvist")
      (author "Björn Ambrosiani")
    )
  )

  (book
    (title "Boken Om Vikingarna")
    (authors
      (author "Catharina Ingelman-Sundberg")
    )
  )
)

{ publications:
  { book:
     { title: "Folket i Birka på Vikingarnas Tid",
       authors: 
         { author: "Mats Wahl",
           author: "Sven Nordqvist",
           author: "Björn Ambrosiani" 
         }
     },  

    book:
      { title: "Boken Om Vikingarna",
        authors:
          { author: "Catharina Ingelman-Sundberg" }
      }
  }
}
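
To illustrate how one term structure can be rendered in several of these formalisms, here is a minimal Python sketch. The (label, children) tuple model and the function names to_sexpr and to_xml are assumptions made for this example; the distinction between ordered and unordered lists is not represented and is therefore lost in both outputs.

```python
from xml.sax.saxutils import escape


def to_sexpr(node):
    """Render a term as a Lisp S-expression."""
    if isinstance(node, str):
        return '"' + node + '"'
    label, children = node
    return "(" + " ".join([label] + [to_sexpr(c) for c in children]) + ")"


def to_xml(node):
    """Render a term as an XML fragment (no prologue, no attributes)."""
    if isinstance(node, str):
        return escape(node)
    label, children = node
    return f"<{label}>" + "".join(to_xml(c) for c in children) + f"</{label}>"


book = ("book", [
    ("title", ["Boken Om Vikingarna"]),
    ("authors", [("author", ["Catharina Ingelman-Sundberg"])]),
])

print(to_sexpr(book))
print(to_xml(book))
```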

Term Specifications

Like semistructured expressions, data terms allow the specification of ordered and unordered lists of subterms. These properties are expressed by using different kinds of braces to parenthesise the subterms.

  • Square brackets (i.e. [ ]) denote ordered term specification, i.e. the order of subterms in the list is significant. An ordered term specification allows subterms to be selected by position and is important e.g. in text documents.
  • Curly braces (i.e. { }) denote unordered term specification, i.e. the order of subterms in the term is insignificant, although they are stored in a particular sequence. An unordered term specification allows subterms in the list to be rearranged, e.g. for building an index for faster access, or for more efficient use of a storage system (like grouping several small subterms in a single page of background memory while storing large subterms in an individual page each). Unordered term specification is commonly found in semistructured databases.

In the example above, the term with label publications has an unordered term specification, meaning that the order of the book subterms is irrelevant, i.e. the storage system might choose to rearrange them in a different order. The terms with label authors have ordered term specification, meaning that the order of the list of author elements is significant (e.g. for proper citing).
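
One way to read "order is insignificant" is that two unordered terms count as the same term when their subterms agree as multisets, whereas ordered terms must agree position by position. The Python sketch below illustrates this reading. It is an assumption made for illustration, not the definition used by Xcerpt, and the (label, ordered, children) tuples and function names are hypothetical.

```python
def canonical(node):
    """Normal form: children of unordered terms are sorted, ordered ones are kept."""
    if isinstance(node, str):
        return ("str", node)
    label, ordered, children = node
    kids = [canonical(c) for c in children]
    if not ordered:
        kids = sorted(kids, key=repr)  # any deterministic total order will do
    return (label, ordered, tuple(kids))


def same_term(a, b):
    """Compare two terms, ignoring subterm order inside unordered lists only."""
    return canonical(a) == canonical(b)


wahl = ("author", True, ["Mats Wahl"])
nordqvist = ("author", True, ["Sven Nordqvist"])
book1 = ("book", False, [("authors", True, [wahl, nordqvist])])
book2 = ("book", False, [("authors", True, [nordqvist])])

# Swapping the books inside the unordered publications term changes nothing ...
print(same_term(("publications", False, [book1, book2]),
                ("publications", False, [book2, book1])))   # True
# ... but swapping authors inside the ordered authors term does.
print(same_term(("authors", True, [wahl, nordqvist]),
                ("authors", True, [nordqvist, wahl])))      # False
```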

Terms with different term specifications may be nested (i.e. subterms of a term may have a term specification different from the parent term's), but nesting of term specifications within the same list of subterms is not permitted. For example, the term f{g["a","b"],h{"c","d"}} is a data term, but f{"a",["b","c"],"d"} is not.

References

References are used for representing graph structures in a textual syntax. In Xcerpt data terms, subterms of the form oid @ t (read: "oid at t") are defining occurrences of oid and associate the identifier oid with the subterm t. Subterms of the form ^oid (read: "reference to oid") are referring occurrences of oid and refer to the subterm associated with the identifier oid. As with semistructured expressions, every identifier may occur at most once in a defining occurrence, and an identifier used in a referring occurrence must also occur in a defining occurrence somewhere.

References in data terms are a unified representation for the various linking mechanisms available for XML (and other formalisms), like ID/IDREF, XPointer, XLink and URIs, and serve to simplify their representation in Xcerpt. Note that Xcerpt is not limited to its own reference mechanism: ID/IDREF, for example, can easily be dereferenced using an appropriate query. Unlike other query languages, Xcerpt automatically dereferences its references when querying, i.e. a reference can be treated like a parent-child relationship.

The following two terms are considered to be equivalent:

f {
  b { &o1 @ d {} },
  c { ^&o1 }
}

f {
  b { ^&o1 },
  b { ^&o1 },
  c { &o1 @ d {} }
}
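
The two constraints on identifiers and the equivalence of the two terms can be checked with a small Python sketch. It is an illustration only: the Term and Ref classes and the functions definitions and expand are hypothetical, subterm order is kept as written, and the expansion assumes acyclic references (a cyclic reference would recurse forever).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ref:
    oid: str                                        # referring occurrence ^oid


@dataclass
class Term:
    label: str
    children: list = field(default_factory=list)   # Term, Ref or str
    oid: Optional[str] = None                       # defining occurrence oid @ t


def definitions(node, table=None):
    """Collect defining occurrences; each identifier may be defined only once."""
    table = {} if table is None else table
    if isinstance(node, Term):
        if node.oid is not None:
            if node.oid in table:
                raise ValueError(f"{node.oid} is defined more than once")
            table[node.oid] = node
        for child in node.children:
            definitions(child, table)
    return table


def expand(node, table):
    """Replace every reference by the term it refers to (acyclic input only)."""
    if isinstance(node, str):
        return node
    if isinstance(node, Ref):
        if node.oid not in table:
            raise ValueError(f"{node.oid} has no defining occurrence")
        return expand(table[node.oid], table)
    return (node.label, tuple(expand(c, table) for c in node.children))


left = Term("f", [Term("b", [Term("d", oid="&o1")]), Term("c", [Ref("&o1")])])
right = Term("f", [Term("b", [Ref("&o1")]), Term("c", [Term("d", oid="&o1")])])

print(expand(left, definitions(left)) == expand(right, definitions(right)))  # True
```

Expanding references into copies discards the information that both positions share one node, so this comparison is coarser than a true graph comparison.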

Attributes

Unlike XML, Xcerpt does not have a special representation for attributes. Instead, XML attributes are treated as subterms of a term, with the specific restriction that the value may not be structured content. An attribute of the form key = "value" is represented in Xcerpt as a term of the form key{"value"}.

In order to separate attributes from child elements and thus retain the possibility of one-to-one transformations between Xcerpt and XML, Xcerpt groups them in a special subterm with the label attributes. Since attributes in XML are always unordered, this special subterm always has an unordered term specification (see above). By convention, every data term should contain at most one attributes subterm, and this subterm, if present, should be the first subterm in the list of subterms (even if the parent term is unordered). Also, all attributes of a term need to have different labels.

Each book in the bib.xml database contains an attribute year in the XML syntax. Consider for example the following book:

<book year="1995">
  <title>Vikinga Blot</title>
  <authors>
    <author>
      <last> Ingelman-Sundberg </last>
      <first> Catharina </first>
    </author>
  </authors>
  <publisher> Richters </publisher>
  <price> 5.95 </price>
</book>

In Xcerpt syntax, this book can be represented as follows. Note in particular that the element itself is ordered (as it is a representation of an XML document) while the attributes are unordered:

book [ 
  attributes { year { "1995" } },
  title [ "Vikinga Blot" ],
  authors [ 
    author [
      last [ "Ingelman-Sundberg" ], 
      first [ "Catharina" ] 
    ]
  ],
  publisher [ "Richters" ],
  price [ "5.95" ]
]
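
The mapping from an XML element to a data term with an attributes subterm can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative sketch, not the converter used by Xcerpt: the function name to_data_term is hypothetical, every element is emitted as an ordered term, surrounding whitespace in text content is stripped, and quotes inside text are not escaped.

```python
import xml.etree.ElementTree as ET


def to_data_term(elem):
    """Serialise an XML element as an ordered data term."""
    parts = []
    if elem.attrib:
        # Attributes are grouped in one unordered subterm placed first.
        attrs = ", ".join(f'{k} {{ "{v}" }}' for k, v in sorted(elem.attrib.items()))
        parts.append(f"attributes {{ {attrs} }}")
    if elem.text and elem.text.strip():
        parts.append(f'"{elem.text.strip()}"')
    for child in elem:
        parts.append(to_data_term(child))
        if child.tail and child.tail.strip():
            parts.append(f'"{child.tail.strip()}"')   # text after a child element
    return f"{elem.tag} [ {', '.join(parts)} ]"


xml = '<book year="1995"><title>Vikinga Blot</title><price> 5.95 </price></book>'
print(to_data_term(ET.fromstring(xml)))
# book [ attributes { year { "1995" } }, title [ "Vikinga Blot" ], price [ "5.95" ] ]
```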

This treatment of attributes has the main advantage that no exceptions are needed in the definition of Xcerpt extensions like variables or regular expressions. Instead, since attributes are represented in the same term structure as elements, it is possible to use the standard constructs for all occurrences of attributes.

Namespaces

Xcerpt supports namespaces in a straightforward manner that closely follows the use of namespaces in XML. As in XML, namespaces are URIs (uniform resource identifiers) or IRIs (internationalised resource identifiers). Namespace prefixes can be declared and are then separated from term labels by a colon. As an extension to XML namespaces, it is also possible to use the namespace URI itself as a prefix. In XML, this is not admissible due to syntactic restrictions; Xcerpt does not need to adhere to such restrictions, as it is not necessary to retain backwards compatibility with applications that are not namespace aware.

Namespace Declarations

Namespace prefixes are declared with the keyword ns-prefix followed by the defined prefix, a = and the namespace IRI. The default namespace (i.e. the namespace of all subterms that do not have an explicit namespace prefix) can be defined with the keyword ns-default, followed by = and the namespace IRI of the default namespace.

<ns-declaration> ::= "ns-prefix" <ns-prefix> "=" '"' iri '"' 
                   | "ns-default" "=" '"' iri '"'    .

As a simplification over XML namespaces, namespace declarations are currently allowed only outside terms. This restriction rules out nested namespace declarations and shadowing, so a syntactic one-to-one mapping between XML documents and Xcerpt terms that preserves the namespace prefixes is not always possible, although the two approaches are equally expressive (both allow namespace IRIs to be associated with term/element labels). Transforming XML documents that use nested namespace declarations into data terms and vice versa is nevertheless possible, because the namespaces themselves are preserved and only the namespace prefixes might get lost. Further refinements of namespaces that take into account both nested declarations and shadowing are currently being investigated.
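
Because declarations appear only outside terms, they can be read in a single pass before the term itself. The Python sketch below illustrates this for the two declaration forms. It is a simplification under stated assumptions: the function name parse_declarations is hypothetical, each declaration is expected on its own line, and only label prefixes (not quoted IRIs) are accepted after ns-prefix.

```python
import re

# One declaration per line: ns-prefix <label> = "<iri>"  or  ns-default = "<iri>"
DECL = re.compile(
    r'^\s*(?:ns-prefix\s+(?P<prefix>[\w-]+)\s*=|ns-default\s*=)\s*"(?P<iri>[^"]*)"\s*$'
)


def parse_declarations(lines):
    """Return (prefix -> IRI mapping, default namespace IRI or None)."""
    prefixes, default = {}, None
    for line in lines:
        match = DECL.match(line)
        if match is None:
            raise ValueError(f"not a namespace declaration: {line!r}")
        if match.group("prefix") is None:
            default = match.group("iri")
        else:
            prefixes[match.group("prefix")] = match.group("iri")
    return prefixes, default


prefixes, default = parse_declarations([
    'ns-prefix a = "http://www.myschemas.org/address-book"',
    'ns-default = "http://www.myschemas.org/address-book"',
])
print(prefixes, default)
```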

Namespaces in Data Terms

In Xcerpt terms, namespaces are used almost as in XML. The most significant difference from XML is that the namespace IRI may also be used as a namespace prefix. In this case, it is not necessary to declare the namespace in advance.

<ns-prefix> := label | '"' iri '"' .

Namespaces in Xcerpt (needs revision)

Consider again the earlier XML namespaces example, which illustrated the use of namespaces in XML by adding a remarks element to address book entries that might contain HTML elements for markup. It uses the namespace prefix a to refer to the address book schema, and the namespace prefix b to refer to the XHTML schema. As a data term, this document might be represented as follows:

ns-prefix a = "http://www.myschemas.org/address-book"
ns-prefix b = "http://www.w3.org/2002/06/xhtml2"

a:address-book {
  &o1 @ a:person {
    a:name {
      a:first { "Mickey" },
      a:last { "Mouse" }
    },

    a:phone { 
     attributes {
       a:type { "home" }
     },
     "19281118"
    },
    a:knows { ^&o2 },

    a:remarks {
      b:strong{"Note:"}, "The phone number is also the", b:em{"birthday"},"!"
    }
  },

  &o2 @ a:person {
    a:name {
      a:first { "Donald" },
      a:last { "Duck" }
    }

  }

}

Instead of declaring the namespace prefix b, it would also be possible to use the namespace URI directly, as in the following example. Note also the use of the default namespace declaration.

ns-default = "http://www.myschemas.org/address-book"

address-book {

  &o1 @ person {
    name {
      first { "Mickey" },
      last { "Mouse" }
    },

    phone { 
     attributes {
       type { "home" }
     },
     "19281118"
    },
    knows { ^&o2 },

    remarks {
      "http://www.w3.org/2002/06/xhtml2":strong{"Note:"}, 
      "The phone number is also the", 
      "http://www.w3.org/2002/06/xhtml2":em{"birthday"},"!"
    }
  },

  &o2 @ person {
    name {
      first { "Donald" },
      last { "Duck" }
    }
  }

}
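
Both variants associate the same namespace IRIs with the same labels. The following Python sketch shows how a qualified label could be resolved in the three cases (declared prefix, quoted IRI used as prefix, default namespace). The function name resolve and its parameters are assumptions made for this example, not part of Xcerpt.

```python
def resolve(label, prefixes, default=None):
    """Return (namespace IRI, local label) for a possibly qualified label."""
    if label.startswith('"'):
        # Quoted IRI used directly as prefix: "iri":local
        end = label.index('"', 1)
        if label[end + 1:end + 2] != ":":
            raise ValueError(f"expected ':' after quoted IRI in {label!r}")
        return label[1:end], label[end + 2:]
    if ":" in label:
        prefix, local = label.split(":", 1)
        if prefix not in prefixes:
            raise KeyError(f"undeclared namespace prefix: {prefix}")
        return prefixes[prefix], local
    return default, label  # unprefixed labels fall into the default namespace


ADDR = "http://www.myschemas.org/address-book"
XHTML = "http://www.w3.org/2002/06/xhtml2"

# First variant: both prefixes declared.
print(resolve("b:em", {"a": ADDR, "b": XHTML}))
# Second variant: default namespace plus the IRI itself as prefix.
print(resolve('"http://www.w3.org/2002/06/xhtml2":em', {}, default=ADDR))
print(resolve("person", {}, default=ADDR))
```

The quoted form is checked first because an IRI itself contains colons.
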
Created by wastl
Last modified 2005-07-14 01:57 PM
 
