Skip to content

Xcerpt

Sections
Personal tools
You are here: Home » Documentation » Language » Query Terms
Xcerpt Team
Researchers:
Sebastian Schaffert
François Bry
Sacha Berger
Tim Furche
Paula Patranjan
Michael Eckert

Students:
Mira Blazheva
Oliver Bolzer
Michael Brade
Raja Gigova
Clemens Ley
Inna Romanenko
Andreas Schroeder
Christoph Wieser
 

Query Terms

Document Actions
Query terms are (possibly incomplete) patterns matched against Web resources represented by data terms. A pattern is like a form augmented by variables acting as place holders for data retrieved from data terms, very similar to (non-ground) atoms in logic programming. Query terms build upon data terms, but may contain variables, constructs for expressing incompleteness, as well as position specifications, subterm negation, and subterm exclusion.

Incompleteness

As discussed in Section \ref{sec:principles:incomplete}, query patterns need to support incomplete query specifications, because data represented on the Web has a much more flexible schema than data represented e.g. in relational databases. Query terms may contain constructs for expressing incompleteness in breadth, in depth, with respect to order, and with respect to optional subterms. The terms ``breadth'' and ``depth'' refer to the graph induced by a data term or semistructured expression (cf. Sections \ref{sec:ssd-graphs} and \ref{sec:ground-query-terms}). Note that the constructs described here together realise requirement \ref{maier.2.4} (``no schema required'') of David Maier's database desiderata (cf. Section \ref{sec:database-desiderata}).

Incompleteness in Breadth: Partial Term Specifications

Incompleteness in breadth (i.e. within the subterms of the same parent term) is expressed by using so-called partial and total term specifications:

  • double square or curly braces (i.e. [[ ]] or {{ }}) denote partial term specifications, i.e. a data term matched by the query term may contain additional subterms not matched by subterms of the query term.
  • single square or curly braces (i.e. [ ] or { } as in data terms) denote total term specifications, i.e. a data term matched by the query term must not contain additional subterms that are not matched by subterms of the query term.

Consequently, a data term that is used as a query term matches only itself (and all such terms that are equivalent with respect to subterm ordering in case of unordered term specifications), whereas a query term containing partial term specifications matches possibly infinitely many data terms. As with ordered/unordered term specifications, subterms with different term specifications may be nested, but nesting within the same list of subterms is disallowed.

Total/Partial Term Specifications

Consider the bib.xml document of the bookstore example from Section \ref{sec:examples:bookstore}. The following two are query terms for this database:

    
bib {
  book {
    title { "Boken Om Vikingarna" }
  }
}

This query term does not match with the data term, as its total term specification requires that there is exactly one book with exactly one title element.

      
bib {{
  book {{
    title {{ "Boken Om Vikingarna" }}
  }}
}}

This query term will match with the data term, as it allows for additional books and additional elements inside the book element.

Incompleteness wrt. Order: Unordered Term Specifications

Like data terms, query terms may contain both ordered term specifications (square brackets [ ] and [[ ]]), and unordered term specifications (curly braces { } and {{ }}). Let t1 be a query term and let t2 be a data term:

  • if t1 has an ordered term specification, then it matches with t2 only if t2 also has an ordered term specification. Furthermore, all subterms of t1 must match subterms in t2 in the same order of appearance.
  • if t2 has an unordered term specification, then it matches with t2, if t2 has either an ordered term specification or an unordered term specification. All subterms of t1 must match subterms in $t_2$ in arbitrary order.

In case a query term uses ordered and partial term specification, the matched data term has to contain corresponding subterms in the same order as the subterms of the query term, but there may be additional subterms in between.

Ordered/Unordered Term Specifications

Consider the bib.xml example of Section \ref{sec:examples:bookstore}. Recall that in this example the list of authors for each book uses an ordered term specification. The following two query terms show the difference between ordered and unordered term specifications in query terms:

    
bib {{
  book {{
    authors [[
      author { 
        first [ "Björn" ], 
        last [ "Ambrosiani" ]  },
      author { 
        first [ "Sven" ], 
        last [ "Nordqvist"]    }
    ]]
  }}
}}

Match with all books where the author "Björn Ambrosiani" appears before the author "Sven Nordqvist".

This query term does not match with the data term, as the authors in the database do not have the same order as in the query term.

    
bib {{
  book {{
    authors {{
      author { 
        first [ "Björn" ], 
        last [ "Ambrosiani" ]        },
      author { 
        first [ "Sven"], 
        last [ "Nordqvist"]    }
    }}
  }}
}}

Match with all books that have (at least) the two authors "Björn Ambrosiani" and "Sven Nordqvist" in any order.

This query term will match with the database, as the query term does not enforce a particular order on authors.

Incompleteness in Depth: Descendant

Incompleteness in depth is expressed using the descendant construct. A query term of the form desc t (read: ``descendant t'') matches with all data terms that contain a subterm that is matched by t at an arbitrary depth (including zero). It is the counterpart to the Kleene star operator of regular path expressions and to XPath's descendant (in short notation: //) construct (cf. Section \ref{sec:xpath}).

Descendant

The following query term matches with a text document (like the one introduced in Section \ref{sec:examples:text}), if at arbitrary depth below the root term, the data term representing the text document contains a section term with a title subterm containing the string ``Data Terms'', i.e. either a section, a subsection, a sub-subsection, etc.

report {{
 desc section {{
    title {{ "Data Terms" }},
  }}
}}

Currently, the descendant construct is unrestricted, i.e. it ``matches'' with any path. Extensions are being considered that allow restrictions to these paths, e.g. using regular expressions over labels, or sets of admissible term labels.

Incompleteness wrt. Optional Subterms: Optional

Terms containing a subterm of the form optional t specify to match the subterm t with a subterm of the data term if possible (and yield variable bindings for the variables in t accordingly); otherwise, the evaluation of the query does not fail, but does not yield any bindings for the variables in t.

Consider in the following the student database example introduced in Section \ref{sec:examples:studentdb}. The following query term retrieves student names (variable Name) and student ids (variable MatrNr). If both exist, both are returned. If only the name exists, the evaluation does not fail (i.e. the query term still matches), but binds only the variable Name. If there is no name in the data term, the query term fails to match it.

students {{
  student {{
    name { var Name },
    optional matrnr { var MatrNr }
  }}
}}

The construct optional is not strictly necessary as the same queries can be expressed by using several query terms instead of only one. However, it is a convenient construct in many practical examples of semistructured databases and XML documents, as the schema languages of such formats often allow optional elements.

Term Variables, Label/Namespace Variables and the "->"-Construct

Variables act as "handles" for those subterms of the data term that match with the subterm the variable is "attached to". If a query term matches with a data term, the variables are bound to the corresponding subterms. They can thus be used to retrieve data from a data term and assemble it in a new structure (with the help of construct terms, see next section). As in logic programming, a single variable can occur at several positions in a term. Of course, bindings to such variables have to be consistent for all occurrences, i.e. all occurrences of the same variable must have the same binding.

Matching a query term with a data term yields a set of alternative substitutions, each of which represents a possible binding for the variables in the query term such that the resulting ground instance matches with the data term. Obviously, the use of unordered and partial term specifications allows several alternative bindings for the variables that all fulfill this requirement.

In Xcerpt query terms, the following variable notions are used:

  • Variables without restriction are expressed using the keyword var followed by an identifier (variable name). They can be bound to any subterm in the data term and are thus very similar to the variables in logic programming, i.e. they act as place holders.
  • Variables with restriction are expressed like a variable without restriction followed by the symbol -> or $\leadsto$ (read ``as'') and a query term. They can only be bound to subterms of the data term that match with the pattern they are restricted to. Note that variable restrictions are also used in the language XMAS.
  • Label Variables are, like variables without restrictions, expressed by using the keyword var followed by an identifier, but they occur at the position of a label in a query term. They can be bound to any label of a subterm of the data term that matches with the remaining term specification.
  • Namespace Variables are similar to label variables. They occur at the position of a namespace prefix in a query term. Namespace variables are always bound to the namespace URI/IRI, not to the namespace prefix.

Note that in logic programming, variable restrictions are represented using external constraints. The advantage of constraining a variable to certain subterms within a query term instead of outside the query is to better convey the overall structure of the considered query. Arguably, restricting variables inside the query term more appropriately realises the concept of query patterns.

Substitutions

In the student database (Section \ref{sec:examples:studentdb}), the query term given on the left hand side matches the variable Name with the student name and variable Email with the email address. The right hand side lists different substitutions that yield ground instances of the query term that match with the data term given in students.xml.

students {{
  student {{
    name  {{\varc var Name }}
    email {{\varc var Email }}
  }}
}}
Substitution σ1:
Name Donald Duck
Email donald@duck.org

Substitution σ2:
Name Mickey Mouse
Email mickey@mouse.org
Substitution σ3:
Name Goofy
Email goofy@goofy.org

Substitution σ4:
Name Goofy
Email goofy@disney.com

Note in particular that Goofy is listed twice as the data term contains two possible email addresses that can be bound to the variable Email.

Pattern Restrictions

The following query terms for the bib.xml database of Section \ref{sec:examples:bookstore} illustrate the difference between variables without and with pattern restrictions.

bib {{
  book {{
    var X,
    authors {{ var AUTHOR }}
  }}
}}
bib {{
  book {{
    var X -> title {{ }},
    authors {{ var AUTHOR }}
  }}
}}
In this query term, the occurrence of the variable X is unrestricted. Thus, the variable X might be bound to any subterm of the book element (besides authors), e.g. to price or title, since the variable X occurs without restriction. In this query term, the occurrence of the variable X is restricted to such subterms that are matched by the query term title {{ }}. Thus, the variable X can only be bound to the title element.

The use of the keyword var to introduce a variable is not strictly necessary. It is often possible to determine from the context whether a term is a variable or not. In particular, extensions of the Xcerpt syntax are investigated that allow to declare variables in a context block. However, using the keyword var simplifies the syntax in particular for the programmer, as it allows to easily identify variables without having to look at the context. Label variables are useful to retrieve structural information that is unknown in advance, e.g. when transforming an XML document into an HTML representation displaying the structure of the XML document (as e.g. in the implementation of the visual language visXcerpt \cite{sacha-da,visxcerpt-vldb,visxcerpt-ppswr}).

Label Variables

Consider the student database of figure \ref{fig:stud-db}. The following query term retrieves the label of the element containing the string ``Goofy'' in the variable X:

students {{
  student {{
    var X {{ "Goofy" }},
  }}
}}

Position Specification and Positional Variables

In some applications it is desirable to query subterms only at a certain position while still being able to use partial query patterns, or to query the position of subterms in the database. For example, a query to a report in XML format could select the second paragraph of all sections.

In query terms, subterms of the form position X t denote that the query term only matches with data terms that have at position X a subterm t' that is matched by t. The position specification X is either a positive number, or a negative number, or a variable:

  • a positive number specifies the position of the matched subterm below its parent term, where 1 is the position of the first subterm
  • a negative number specifies the position relative to the last subterm of the parent, where -1 is the position of the last subterm.
  • a variable matches with a subterm with any position and binds the variable to the position of this subterm as a positive integer number.

Position specification is admissible in all kinds of term specifications, i.e. ordered and unordered as well as total and partial query terms. Note, however, that it is possible to express patterns that are contradictory and thus impossible to match, as position specifications might conflict with ordered or total term specifications (e.g. f[[position 2 a, position 1 b]] or f{position 2 a}).

Note that a term containing a subterm with position specification can never match against a data term with unordered term specification, as in such cases there is no information available about the position of elements.

Position Specification

Consider an HTML document containing a table with books and prices, like the following:

<table>
  <th>
    <td>No.</td>
    <td>Title</td>
    <td>Price</td>
  </th>
  <tr>
    <td>3675</td>
    <td>Vikinga Blot</td>
    <td>5.95</td>
  </tr>
  <tr>
    <td>6743</td>
    <td>Boken Om Vikingarna</td>
    <td>22.95</td>
  </tr>
</table>
table [
  th [
    td ["No."], 
    td ["Title"], 
    td ["Price"]
  ],
  tr [
    td ["3675"], 
    td ["Vikinga Blot"], 
    td ["5.95"]
  ],
  tr [
    td ["6743"], 
    td ["Boken Om Vikingarna"], 
    td ["22.95"]
  ]
]

Now suppose you want to select the titles and prices of books in this HTML table. Since there is no possibility to determine this via subterm labels, it is necessary to explicitly specify the position in the selection, as in the following query term:

table {{
  td {{
    position 2 td { var Title },
    position 3 td { var Price }
  }}
}}

A solution that is even more flexible takes advantage of the column labels in the table headings and uses variables in the position specification to select the positions of the columns with label Title and Price. The same variables are then used in place of the positions 2 and 3 of the example above.

table {{
  th {{
    position var TPos td { "Title" },
    position var PPos td { "Price" }
  }},
  td {{
    position var TPos td { var Title },
    position var PPos td { var Price }
  }}
}}

Note that this query term does not assume that the price column comes after the title column!

Subterm Negation: without

Subterms of the form without t denote so-called subterm negation. Subterm negation allows to express that a data term should not contain subterms matching a certain query pattern. It is only applicable to subterms and may not be used at the root level. Furthermore, subterm negation is only reasonable in partial term specifications, and order does not have influence on the negated subterms (only on all positive subterms).

This kind of negation is useful in semistructured data, as the schema of such data often allows to omit subterms. For example, a query might ask for "all students that did not submit their homework" (i.e. all student elements that do not contain an element indicating that they submitted their homework). Note that in relational database systems, this negation is very similar to querying for NULL values.

Recall the student database from Section \ref{sec:examples:studentdb}. The following query term retrieves students that did not submit exercise 2 in the variable S.

students {{
  var S -> student {{
    without exercise {{ number { 2 } }}
  }}
}}

The query matches if there is at least one student that does not have an exercise element with number 2.

Subterms negated by without may contain variables, but such an occurrence can never yield variable bindings, i.e. it is not possible to retrieve all subterms that do not occur. Accordingly, all variables that occur in the scope of a without have to appear elsewhere outside the scope of a negation construct (cf. Section \ref{sec:range-restrictedness}). Nonetheless, using variables in negated subterms can be useful, as shown in the following example.

Given a text document like the PhD thesis described in Section \ref{sec:examples:text}. The following query term uses subterm negation to retrieve all references to citations that have no corresponding entry in the bibliography in the variable Citation (note the representation of attributes in the Xcerpt term notation):

report {{
  desc var Citation -> cite { 
    attributes { ref { var Ref } }
  },
  desc bibliography {{
    without entry {{
      attributes {{ id { var Ref } }}
    }}
  }}
}}

As there is no bibliography entry with an id of rdf, the result of evaluating this rule against the sample document is:

var Citation =  cite { attributes { ref{"rdf"} } }

Subterm negation is an existential negation: As long as there exists at least one term which does not contain the negated subterm (and matches with the remainder of the pattern), all other terms are irrelevant. There might be terms that contain the negated subterm, those simply do not match. Note that although subterm negation might appear less expressive than full negation as failure, it does in fact share the same problems if it occurs in combination with the all construct introduced below.

Regular Expressions

Query terms provide advanced text processing capabilities using regular expressions (abbreviated: RE). In language theory, a regular expression is a means to define a regular language (see e.g. \cite{hopcroft-ullman}) and matches with a character sequence, if the character sequence is a word of the language defined by the regular expression. In Xcerpt, regular expressions may be used either in place of strings or in place of subterm labels, and take the form

/<regexp>/

where <regexp> is a regular expression based on the syntax defined in POSIX (Portable Operating System Interface) with some Xcerpt-specific extensions (see below). POSIX regular expressions are very widespread (they are e.g. used in the languages Perl, Python, and Java) and thus well known to many programmers.

Regular Expressions

The following query term against a text document (like the document described in Section "Examples") selects all sections that contain the substring "XML" in their title, i.e. where an arbitrary number of characters appears before and after a substring "XML" (expressed by .*).

report {{
  desc var S -> section {{
    title {{ /.*XML.*/ }}
  }}
}}

POSIX Regular Expressions

As POSIX regular expressions are very well-known, this thesis only provides a brief summary over the major constructs used for building regular expressions. The language definition is available here, and many introductory books into programming languages provide a thorough treatment of the topic.

In POSIX, an underlying character set is assumed (e.g. Unicode). Valid characters are all characters of the character set, where an ordinary (i.e. not special) character usually matches only itself, and the special character . matches all characters.

Special Characters

Special characters are not matched. The following special characters are used in POSIX regular expressions:

%$
character(s) description
* arbitrary repetition (0--) of the preceding character or subexpression
+ arbitrary repetition, but at least one (1--) of the preceding character or subexpression
? optional occurrence (0--1) of the preceding character or subexpression
| separates alternatives
{n} exactly n occurrences of the preceding character or subexpression
{n,m} between $n$ and $m$ occurences of the preceding character or subexpression
^ anchor (beginning of line)
$ anchor (end of line)
( and ) enclose subexpressions (see below)
[ and ] define character classes (see below)
\ quote special characters

If a special character is to match instead of being interpreted, it has to be quoted using the prefix symbol \. For instance, \. matches the point and \+ matches the plus character.

Character Classes

Square brackets are used to define character classes, i.e. sets of characters. A character class matches with all characters that are part of the class. The following character classes are predefined by POSIX:

[:alnum:]   [:cntrl:]   [:lower:]   [:space:]
[:alpha:]   [:digit:]   [:print:]   [:upper:]
[:blank:]   [:graph:]   [:punct:]   [:xdigit:]

For example, the character class [:alnum:] contains all alphanumeric characters, [:upper:] contains all upper case characters, and [:blank:] contains all whitespace characters of the character set.

It is also possible to define new character classes by enclosing all matched characters in square brackets. The character class [abc] for instance matches with the characters a, b, and c. Character classes may contain range expressions using the hyphen character, like in [a-zA-Z] (matching all Latin lower/upper case letters), where the range is defined depending on the underlying character set (e.g. ASCII or Unicode).

Character classes are negated if the first character after the opening bracket is ^. For instance, [^-+] denotes all characters except - and +. Note the different meaning of ^ in regular expressions as negation of character classes and as beginning of line anchor, and in Xcerpt terms as reference to an identifier.

Subexpressions.

POSIX allows to specify subexpressions in regular expressions in order to retrieve specific parts from the matched text. A subexpression is enclosed in parentheses ( and ) (often also \( and \), e.g. in the search function of the editor Emacs). Subexpressions can later be referred to by their position. If subexpressions are nested, the position is determined by counting the opening parentheses.

POSIX Regular Expressions

The following regular expression matches with date strings of the form "1999-12-23" (i.e. in ISO syntax) and retrieves the year, month and day in the subexpressions 1, 2 and 3.

([1-9][0-9]{3})-([01][0-9])-([0-3][0-9])

Backreferences

Backreferences are denoted by \n, where n is a single digit other than 0. A backreference matches a literal copy of whatever was matched by the corresponding n'th subexpression of the pattern. Note that matching with backreferences is NP-hard, as it is possible to encode the 3-SAT problem in a regular expression with backreferences.

The regular expression matches e.g. the strings
(.*)-\1
a-a
go-go
wiki-wiki

Note that "regular expressions with backreferences" are (strictly speaking) not regular, they describe a subset of context free languages.

Xcerpt Extensions

In POSIX, the (substrings matched by) subexpressions are referred to by position after the matching is evaluated. This approach is well suited for imperative languages like Perl or Java, where the evaluation is sequential. For example, the following Perl program retrieves the year, month, and day from the string "2004-03-04" by referring to the positions of the subexpressions after the regular expression is matched. Subexpressions are addressed by counting opening braces from left to right.

if("2004-03-04" =~ /([1-9][0-9]{3})-([01][0-9])-([0-3][0-9])/) {
  $year = $1;
  $month = $2;
  $day = $3;
}

In Xcerpt, referring to subexpressions by position is not feasible: it is incompatible with pattern-matching as it requires a specific control flow and does not fit with Xcerpt's notion of variable binding. Instead, Xcerpt introduces variables (-> restrictions) into regular expression patterns, similar to the way variables are part of term patterns. With this extension, subexpressions can take the form

(<name>->...)

where ... denotes the regular expression pattern and <name> is the name of the variable restricted by this pattern. When the regular expression is matched against a character sequence, the variable is bound to the part of the character sequence that is matched by the subexpression. Note that this approach is similar to the extensions of regular expressions in the language Python, where a group may have the form (?P<name>...). The substring matched by this group is later accessible via the symbolic name <name>. For example, the following query term binds the year, month and day of a date string to the variables Y, M and D:

/(Y->[1-9][0-9]{3})-(M->[01][0-9])-(D->[0-3][0-9])/

Note that subexpressions of the form (...) are still possible as a means for structuring the expression, but the character sequences they matched with cannot be retrieved.

Variables in Regular Expressions

The following query term retrieves student names and email addresses from the file students.xml and separates the local name (User) from the domain name (Domain) of the email addresses. The first subexpression binds everything from the beginning of the string (indicated by ^) up till the first appearance of @ (indicated by the character class [^@] matching all characters except @) to the variable User. The second subexpression binds every alphanumeric character (including . and -) after the @ to the variable Domain.

in {
  resource { "file:students.xml" },
  students {{
    student {{
      name { var Name},
      email {
        /?(var User ->[?@]+)@(var Domain ->[a-zA-Z0-9.-]+)/
      }
    }}
  }}
}

Such a separation could be useful for rendering email addresses on Web pages in a "spamvertised form", i.e. not easily recognisable by automatic email address harvesters: the variable bindings for User and Domain could be reassembled in a construct term (see below) with a suitable representation (e.g. separate User and Domain by <at>).

Created by wastl
Last modified 2005-06-23 02:35 PM
 

Powered by Plone

This site conforms to the following standards: