Announcements Readings: • Simeon, Wadler: The essence of XML. POPL 2003. Review due Wednesday • J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J. DeWitt, J. F. Naughton. Relational Databases for Querying XML Documents: Limitations and Opportunities VLDB 1999. Review due on Monday • Other suggested readings: – http://www.xml.com/pub/a/98/10/guide0.html – http://www.w3.org/TR/2003/WD-xquery-use-cases-20030502/
CSE544 XML, XQuery Monday+Wednesday, April 11+13, 2004
• Guest Lecturer: Alan Fekete, Transactions, Monday, Wednesday 1
2
Outline
XML Syntax • • • • • •
• XML: syntax, semantics, data, DTDs • XPath • Xquery
3
XML Syntax
tags: book, title, author, … start tag: , end tag: elements: …,… elements are nested empty element: abbrv. an XML document: single root element
well formed XML document: if it has matching tags 4
XML Syntax
Foundations of Databases Abiteboul … 1995
Jane Mary John
attributes are alternative ways to represent data 5
oids and references in XML are just syntax 6
1
XML Semantics: a Tree ! Mary Maple 345 Seattle John Thailand 23456
XML Data Element node
Attribute node
• XML is self-describing • Schema elements become part of the data
data person person
id address
name
name
address phone
o555 Mary
street
no
city
Thai John
Maple
345
23456 Text node
Seattle
Relational Data as XML row
phone 3634 6343 6363
name “John”
8
XML is Semi-structured Data
person
person name John Sue Dick
• Consequence: XML is much more flexible • XML = semistructured data
7
Order matters !!!
XML:
– Reational schema: persons(name,phone) – In XML , , are part of the data, and are repeated many times
row
phone name phone 3634 “Sue”
• Missing attributes:
row
name
6343 “Dick”
phone 6363
John 3634 Sue 6343 Dick 6363 9
XML is Semi-structured Data
John 1234 Joe
• Could represent in a table with nulls
no phone !
name
phone
John
1234
Joe
10
XML is Semi-structured Data • Attributes with different types in different objects
• Repeated attributes Mary 2345 3456
John Smith 1234
two phones !
• Impossible in tables: name
phone
Mary
2345
structured name !
• Nested collections (non 1NF) • Heterogeneous collections: 3456
??? 11
– contains both s and s 12
2
Document Type Definitions DTD
Very Simple DTD
• Part of the original XML specification • To be replaced by XML Schema
]>
– Much more complex
• An XML document may have a DTD • XML document: well-formed = if tags are correctly closed Valid = if it has a DTD and conforms to it
• Validation is useful in data exchange 13
14
Very Simple DTD
DTD: The Content Model
Example of valid XML document:
123456789 John B432 1234 987654321 Jim B123 ... ...
• Content model: – – – – –
content model
Complex = a regular expression over other elements Text-only = #PCDATA Empty = EMPTY Any = ANY Mixed content = (#PCDATA | A | B | C)*
15
DTD: Regular Expressions DTD
Very Simple DTD
XML
sequence
. . . . . . . . . .
optional
Kleene star
alternation
16
. . . . . . . . . . . . . . . . . . . . ...... 17
]> 18
3
Very Simple DTD
Attributes in DTDs
Example of valid XML document:
123456789 John B432 1234 987654321 Jim B123 ... ...
.... ...
19
Attributes in DTDs
Attributes in DTDs
.... ... 21
Attributes in DTDs Kind: • #REQUIRED • #IMPLIED • value • value #FIXED
20
Types: • CDATA = string • ID = key • IDREF = foreign key • IDREFS = foreign keys separated by space • (Monday | Wednesday | Friday) = enumeration • NMTOKEN = must be a valid XML name • NMTOKENS = multiple valid XML names • ENTITY = you don’t want to know this
22
Using DTDs • Must include in the XML document • Either include the entire DTD: –
= optional = default value = the only value allowed
• Or include a reference to it: –
• Or mix the two... (e.g. to override the external definition) 23
24
4
Sample Data for Queries
Querying XML Data • XPath = simple navigation through the tree • XQuery = the SQL of XML • XSLT = recursive traversal – will not discuss in class 25
Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998
Data Model for XPath
XPath: Simple Expressions /bib/book/year
The root bib
book
publisher
Addison-Wesley
Result: 1995 1998
The root element book
author
26
/bib/paper/year
. . . .
Result: empty
(there were no papers)
Serge Abiteboul 27
28
XPath: Restricted Kleene Closure //author
Xpath: Text Nodes /bib/book/author/text()
Result: Serge Abiteboul
Result:
Rick Hull Victor Vianu Jeffrey D. Ullman
Serge Abiteboul Victor Vianu Jeffrey D. Ullman
Rick Hull doesn’t appear because he has firstname, lastname
Functions in XPath: – text() = matches the text value – node() = matches any node (= * or @* or text()) – name() = returns the name of the current tag
Result: Rick /bib//first-name 29
30
5
Xpath: Wildcard
Xpath: Attribute Nodes /bib/book/@price
//author/*
Result: “55”
Result: Rick
@price means that price is has to be an attribute
Hull
* Matches any element 31
Xpath: Predicates
32
Xpath: More Predicates
/bib/book/author[firstname]
/bib/book/author[firstname][address[.//zip][city]]/lastname
Result: Rick Hull Result: …
…
33
Xpath: More Predicates
34
Xpath: Summary bib matches a bib element * matches any element / matches the root element /bib matches a bib element under root bib/paper matches a paper in bib bib//paper matches a paper in bib, at any depth //paper matches a paper at any depth paper|book matches a paper or a book @price matches a price attribute bib/book/@price matches price attribute in book, in bib bib/book/[@price 1995 RETURN $x/title
39
FOR-WHERE-RETURN
Result: abc def ghi
40
FOR-WHERE-RETURN
Equivalently (perhaps more geekish)
• Find all book titles and the year when they were published:
FOR $x IN document("bib.xml")/bib/book[year/text() > 1995] /title FOR $x IN document("bib.xml")/bib/book RETURN { $x/title/text() } { $x/year/text() }
RETURN $x
And even shorter:
document("bib.xml")/bib/book[year/text() > 1995] /title
We can construct whatever XML results we want ! 41
42
7
Answer How to cook a Turkey 2003 Cooking While Watching TV 2004 Turkeys on TV 2002 .....
FOR-WHERE-RETURN • Notice the use of “{“ and “}” • What is the result without them ? FOR $x IN document("bib.xml")/bib/book RETURN $x/title/text() $x/year/text()
43
XQuery: Nesting
44
Result:
For each author of a book by Morgan Kaufmann, list all books she published: FOR $b IN document(“bib.xml”)/bib, $a IN $b/book[publisher /text()=“Morgan Kaufmann”]/author RETURN { $a, FOR $t IN $b/book[author/text()=$a/text()]/title RETURN $t }
XQuery
Jones abc def Smith ghi
In the RETURN clause comma concatenates XML fragments 45
46
Aggregates
Aggregates
Find all books with more than 3 authors:
Same thing:
FOR $x IN document("bib.xml")/bib/book WHERE count($x/author)>3 RETURN $x
count = a function that counts avg = computes the average sum = computes the sum distinct-values = eliminates duplicates
FOR $x IN document("bib.xml")/bib/book[count(author)>3] RETURN $x
47
48
8
Aggregates
Aggregates
Print all authors who published more than 3 books
Be aware of duplicates !
FOR $b IN document("bib.xml")/bib, $a IN distinct-values($b/book/author/text()) WHERE count($b/book[author/text()=$a)>3 RETURN { $a }
FOR $b IN document("bib.xml")/bib, $a IN $b/book/author/text()) WHERE count($b/book[author/text()=$a)>3 RETURN { $a }
What’s wrong ?
49
XQuery
50
FOR-WHERE-RETURN
Find books whose price is larger than average: FOR $b in document(“bib.xml”)/bib LET $a:=avg($b/book/price/text()) FOR $x in $b/book WHERE $x/price/text() > $a RETURN $x
• “Flatten” the authors, i.e. return a list of (author, title) pairs FOR $b IN document("bib.xml")/bib/book, $x IN $b/title/text(), $y IN $b/author/text() RETURN { $x } { $y }
LET binds a variable to one value; FOR iterates a variable over a list of values We will come back to that
Answer: abc efg abc hkj
51
FOR-WHERE-RETURN • For each author, return all book titles he/she wrote Answer:
FOR $b IN document("bib.xml")/bib, efg $x IN $b/book/author/text() abc klm RETURN . ... { $x } { FOR $y IN $b/book[author/text()=$x]/title What about RETURN $y } duplicate authors ? 53
52
FOR-WHERE-RETURN • Same, but eliminate duplicate authors: FOR $b IN document("bib.xml")/bib LET $a := distinct-values($b/book/author/text()) FOR $x IN $a RETURN $x { FOR $y IN $b/book[author/text()=$x]/title RETURN $y } 54
9
FOR-WHERE-RETURN
SQL and XQuery Side-by-side Product(pid, name, maker, price)
• Same thing:
SELECT x.name, x.price FROM Product x ORDER BY x.price
FOR $b IN document("bib.xml")/bib, $x IN distinct-values($b/book/author/text()) RETURN $x { FOR $y IN $b/book[author/text()=$x]/title RETURN $y }
Find all product names, prices, sort by price
FOR $x in document(“db.xml”)/db/Product/row ORDER BY $x/price/text() RETURN { $x/name, $x/price }
SQL XQuery 55
56
Producing a Well-Formed Answer
Answers Name
Price
abc
7
def
23
...
...
....
abc 7
def 23
Notice: this is NOT a well-formed document ! (WHY ???)
57
Xquery’s Answer ....
{ FOR $x in document(“db.xml”)/db/Product/row ORDER BY $x/price/text() RETURN { $x/name, $x/price } }
58
SQL and XQuery Side-by-side Product(pid, name, maker, price) Company(cid, name, city, revenues) Find all products made in Seattle
abc 7
Now it is well-formed !
def 23
SELECT x.name FROM Product x, Company y WHERE x.maker=y.cid and y.city=“Seattle” SQL
59
Cool XQuery
FOR $r in document(“db.xml”)/db, $x in $r/Product/row, $y in $r/Company/row WHERE $x/maker/text()=$y/cid/text() and $y/city/text() = “Seattle” RETURN { $x/name }
XQuery FOR $y in /db/Company/row[city/text()=“Seattle”], $x in /db/Product/row[maker/text()=$y/cid/text()] 60 RETURN { $x/name }
10
SQL and XQuery Side-by-side
123 abc efg …. … ... ....
For each company with revenues < 1M count the products over $100 SELECT y.name, count(*) FROM Product x, Company y WHERE x.price > 100 and x.maker=y.cid and y.revenue < 1000000 GROUP BY y.cid, y.name
61
SQL and XQuery Side-by-side
FOR $r in document(“db.xml”)/db, $y in $r/Company/row[revenue/text()100]) } 62
FOR v.s. LET
Find companies with at least 30 products, and their average price SELECT y.name, avg(x.price) An element FROM Product x, Company y WHERE x.maker=y.cid GROUP BY y.cid, y.name FOR $r in document(“db.xml”)/db, HAVING count(*) > 30 $y in $r/Company/row LET $p := $r/Product/row[maker/text()=$y/cid/text()] WHERE count($p) > 30 RETURN A collection { $y/name/text() } avg($p/price/text())
FOR • Binds node variables
iteration
LET • Binds collection variables
one value
63
FOR v.s. LET FOR $x IN /bib/book RETURN { $x }
64
XQuery
Returns: ... ... ... ...
Summary: • FOR-LET-WHERE-RETURN = FLWR FOR/LET Clauses List of tuples
LET $x := /bib/book RETURN { $x }
Returns: ... ... ... ... 65
WHERE Clause List of tuples RETURN Clause 66 Instance of Xquery data model
11
Collections in XQuery
XML from/to Relational Data
• Ordered and unordered collections
• XML publishing:
– /bib/book/author/text() = an ordered collection: result is in document order – distinct-values(/bib/book/author/text()) = an unordered collection: the output order is implementation dependent
– relational data
XML
• XML storage: – XML
relational data
• LET $a := /bib/book $a is a collection • $b/author a collection (several authors...) Returns:
RETURN { $b/author }
... ... ... ... 67
68
Client/server DB Apps
XML Publishing Tuple streams
Tuple streams
XML
Network
Relational Database
Application
XML publishing
Relational Database
Web
Xpath/ XQuery
SQL 69
Enroll
SQL
70
XML Publishing
XML Publishing
Student
Application
Course
• Relational schema: Student(sid, name, address) Course(cid, title, room) Enroll(sid, cid, grade) 71
Operating Systems MGH084 John Seattle 3.8 … … Database EE045 Mary Shoreline 3.9 … … …
Group by courses: redundant representation of students
Other representations possible too 72
12
Now we write an XQuery to export relational data Note: result is is the right DTD
XML Publishing First thing to do: design the DTD:
73
XML Publishing
XML
{ FOR $x IN /db/Course/row RETURN { $x/title/text() } { $x/room/text() } { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()] $z IN /db/Student/row[sid/text() = $y/sid/text()] RETURN { $z/name/text() } { $z/address/text() } { $y/grade/text() } } }
74
XML Publishing
Query: find Mary’s grade in Operating Systems XQuery FOR $x IN /xmlview/course[title/text()=“Operating Systems”], $y IN $x/student/[name/text()=“Mary”] RETURN $y/grade/text() Can be done automatically
SQL SELECT Enroll.grade FROM Student, Enroll, Course WHERE Student.name=“Mary” and Course.title=“OS” and Student.sid = Enroll.sid and Enroll.cid = Course.cid
How do we choose the output structure ? • Determined by agreement with partners/users • Or dictated by committees – XML dialects (called applications) = DTDs
• XML Data is often nested, irregular, etc • No normal forms for XML 75
XML Storage
76
XML Storage
• Most often the XML data is small
Three solutions: • Schema derived from DTD • Storing XML as a graph: “Edge relation” • Store it as a BLOB
– E.g. a SOAP message – Parsed directly into the application (DOM API)
• Sometimes XML data is large – need to store/process it in a database
– Simple, boring, inefficient – Won’t discuss in class
• The XML storage problem: – How do we choose the schema of the database ? 77
78
13
Designing a Schema from DTD
Designing a Schema from DTD First, construct the DTD graph:
Design a relational schema for: ]>
company
*
We ignore the order
*
person
product
* ssn
name
office
phone
pid
price
avail.
descr.
79
80
Designing a Schema from DTD Next, design the relational schema, using common sense.
Designing a Schema from DTD What happens to queries:
company
*
*
person
FOR $x IN /company/product[description] RETURN { $x/name, $x/description }
product
* ssn
name
office
phone
pid
price
avail.
descr.
Person(ssn, name, office) Phone(ssn, phone) Product(pid, name, price, avail., descr.) Which attributes may be NULL ? (Look at the DTD)
SELECT Product.name, Product.description FROM Product WHERE Product.description IS NOT NULL 81
82
Storing XML as a Graph
Storing XML as a Graph
0
Can be ANY XML data (don’t know DTD)
Sometimes we don’t have a DTD: • How can we store the XML data ?
db 2
4
3
title
Edge
Every XML instance is a tree • Store the edges in an Edge table • Store the #PCDATA in a Value table 83
5
book author
6
title
1 9
book 7
publisher 8
author author
11
10
title
state
“Complete “Morgan “Chamberlin” “Transaction “Bernstein” “Newcomer”Kaufman” Guide Processing” to DB2”
Source
Tag
Dest
0
db
1
1
book
2
2
title
3
2
author
4
1
book
5
5
title
5
author
...
...
“CA”
Value Source
Val
3
Complete guide . . .
6
4
Chamberlin
7
6
...
...
...
...
84
14
Storing XML as a Graph
Storing XML as a Graph What happens to queries:
What happens to queries:
FOR $x IN /db/book[author/text()=“Chamberlin”] RETURN $x/title
SELECT vtitle.value FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle WHERE xdb.source = 0 and xdb.tag = ‘db’ and xdb.dest = xbook.source and xbook.tag = ‘book’ and xbook.dest = xauthor.source and xauthor.tag = ‘author’ and xbook.dest = xtitle.source and xtitle.tag = ‘title’ and xauthor.dest = vauthor.source and vauthor.value = ‘Chamberlin” and xtitle.dest = vtitle.source
xdb
db xbook xauthor
book
xtitle
author “Chamberlin”
title Return value
vauthor
vtitle
A 6-way join !!!
85
86
Storing XML as a Graph
Other XML Topics
Edge relation summary: • Same relational schema for every XML document:
• Name spaces • XML API:
Edge(Source, Tag, Dest) Value(Source, Val)
– DOM = “Document Object Model”
• XML languages:
• Generic: works for every XML instance • But inefficient:
– XSLT
• XML Schema • Xlink, XPointer • SOAP
– Repeat tags multiple times – Need many joins to reconstruct data
87
Available from www.w3.org (but don’t spend rest of your life reading those standards !) 88
Old&New XML Research at UW • Processing: – Query languages (XML-QL, a precursor of XQuery) – Tukwila – XML updates
• XML publishing/storage – SilkRoute: silkroute.sourceforge.net – STORED
• XML tools – XML Compressor: Xmill - a very popular tool – XML Toolkit (xsort, xagg, xgrep, xtransf, etc): xmltk.sourceforge.net
• Theory: – Typechecking – Xpath, Xquery containment 89
15