Web Distributed Authoring and Versioning (WebDAV) URL constraints
greenbytes GmbH
Hafenweg 16
Muenster, NW 48155
Germany
+49 251 2807760
+49 251 2807761
julian.reschke@greenbytes.de
http://greenbytes.de/tech/webdav/
WEBDAV Working Group
Both WebDAV servers and clients frequently map URI-escaped characters
inside a path segment to non-ASCII characters. These mappings can only
be interoperable if there is a consensus about the appropriate
character encoding. This document specifies a default encoding that
is compatible with both the recommendations for URIs in HTML
content and the "Internationalized Resource Identifiers" (IRI)
specification.
Furthermore, servers that implement a mapping to locally constrained
names frequently do not support specific names, or silently map "similar"
names to the same resource (for instance when content is stored in
a filesystem that is case-preserving, but not case-sensitive). For
these cases, discovery and error signalling features are defined.
Distribution of this document is unlimited. Please send comments to the
Distributed Authoring and Versioning (WebDAV) working group at w3c-dist-auth@w3.org, which may be joined by sending a message with subject
"subscribe" to w3c-dist-auth-request@w3.org.
Discussions of the WEBDAV working group are archived at URL:
.
Both WebDAV servers and clients frequently map URI-escaped characters (see
"Uniform Resource Identifier (URI): Generic Syntax")
inside a path segment to non-ASCII characters. These mappings can only
be interoperable if there is a consensus about the appropriate
character encoding. This document specifies a default encoding that
is compatible with both the recommendations for URIs in HTML
content (see the HTML 4.01 Specification, Appendix B.2.1) and the IRI
specification ("Internationalized Resource Identifiers (IRIs)").
Furthermore, servers that implement a mapping to locally constrained
names frequently do not support specific names, or silently map "similar"
names to the same resource (for instance when content is stored in
a filesystem that is case-preserving, but not case-sensitive). For
these cases, discovery and error signalling features are defined.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
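document are to be interpreted as described in "Key words for use in
RFCs to Indicate Requirement Levels".
The terminology used here follows that in WebDAV ("HTTP Extensions for
Distributed Authoring -- WEBDAV"), HTTP ("Hypertext Transfer Protocol --
HTTP/1.1") and "Versioning Extensions to WebDAV". Definitions of the terms
resource, Uniform Resource Identifier (URI), and Uniform Resource Locator
(URL) are provided in "Uniform Resource Identifier (URI): Generic Syntax".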
This document uses the terms "precondition" and "postcondition" as defined in
"Versioning Extensions to WebDAV". Servers SHOULD report pre-/postcondition
failures as described in section 1.6 of that document.
In proposing a common mapping, the following requirements were taken into
account:
For URL characters inside the US-ASCII range (0..127), the mapping should
be the identity mapping.
The mapping should provide support for all characters defined in the
Unicode character set.
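The only widely-deployed character encoding fulfilling these
requirements is the UTF-8 character encoding, defined in
"UTF-8, a transformation format of ISO 10646". Consequently, it is also
the encoding recommended for URLs in HTML content (HTML 4.01
Specification, Appendix B.2.1) and for IRIs ("Internationalized Resource
Identifiers (IRIs)").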
Therefore, clients and servers SHOULD use
the UTF-8 character encoding to map non-ASCII characters to/from character
sequences in URL segments.
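The two requirements above can be checked mechanically for a candidate encoding. The following is a minimal sketch, not part of this specification; the helper names `encodes_ascii_as_identity` and `can_encode` are hypothetical and chosen only for illustration.

```python
def encodes_ascii_as_identity(encoding: str) -> bool:
    """Requirement 1: characters 0..127 must map to the same byte values."""
    ascii_text = "".join(chr(code) for code in range(128))
    return ascii_text.encode(encoding) == bytes(range(128))


def can_encode(text: str, encoding: str) -> bool:
    """Requirement 2 (spot check): the encoding must be able to represent `text`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for candidate in ("utf-8", "utf-16", "iso-8859-1"):
        print(candidate,
              encodes_ascii_as_identity(candidate),
              can_encode("\u65e5\u672c\u8a9e", candidate))
```

In this sketch UTF-8 passes both checks, UTF-16 fails the identity check, and ISO-8859-1 fails to represent characters outside its repertoire, which is consistent with the reasoning given above.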
When mapping HTTP URL segments (see "Uniform Resource Identifier (URI):
Generic Syntax", section 3.3) to
local storage, the server's behaviour
usually depends on the API used to access that storage. In practice,
two styles are widely deployed: binary and character-based. The sections below discuss
the implications of each and also describe an "identity" mapping.
A typical scenario for this case is when the server does a direct mapping between
URLs and objects in a filesystem, and the filesystem uses filenames based
on byte sequences. This is the case for typical Unix filesystem
implementations.
In this case, mapping between URL segments and local names is straightforward:
To map from URL segments, just apply URL unescaping to obtain a byte
sequence (see "Uniform Resource Identifier (URI): Generic Syntax",
section 2.1).
To map to URL segments, just apply URL escaping to obtain a sequence
of characters suitable for use in a URL segment.
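The byte-based mapping above might be sketched as follows, using Python's standard `urllib.parse` module. The function names `segment_to_local_bytes` and `local_bytes_to_segment` are hypothetical; the sketch assumes that every byte other than unreserved characters is escaped when mapping back to a segment.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def segment_to_local_bytes(segment: str) -> bytes:
    """Unescape a URL segment into the raw byte sequence used as local name."""
    return unquote_to_bytes(segment)


def local_bytes_to_segment(name: bytes) -> str:
    """Escape a raw local name so that it is safe inside a URL segment.

    safe="" ensures that "/" inside a name is escaped as well.
    """
    return quote_from_bytes(name, safe="")


if __name__ == "__main__":
    raw = segment_to_local_bytes("caf%E9")   # one ISO-8859-1 encoded byte
    print(raw)
    print(local_bytes_to_segment(raw))
```

Note that no character encoding is involved at any point: the bytes are stored exactly as the URL carried them.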
The advantage of this simple mapping is that it faithfully stores
whatever the original URL contained. On the other hand, this is a binary
encoding, and programs that display filenames usually have to map the
byte sequence to a character sequence for display. Unless both character
encodings match, the results will be either inaccurate (incorrect
characters) or the display function will break completely (for instance
when an attempt is made to UTF-8-decode a byte stream that was originally
encoded using an incompatible encoding such as ISO-8859-1).
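Both failure modes can be reproduced in a few lines. This is an illustrative sketch only; `try_decode` is a hypothetical helper that returns `None` when decoding fails.

```python
from typing import Optional


def try_decode(raw: bytes, display_encoding: str) -> Optional[str]:
    """Decode a stored name for display; return None if the bytes are invalid."""
    try:
        return raw.decode(display_encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    stored_as_utf8 = "caf\u00e9".encode("utf-8")         # b"caf\xc3\xa9"
    stored_as_latin1 = "caf\u00e9".encode("iso-8859-1")  # b"caf\xe9"

    # Inaccurate display: valid bytes, but the wrong characters.
    print(try_decode(stored_as_utf8, "iso-8859-1"))

    # Broken display: the byte sequence is not valid UTF-8 at all.
    print(try_decode(stored_as_latin1, "utf-8"))
```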
Things get even more complicated when there is no single character encoding
being used on the server. For instance, in a Unix system multiple users
may use different character encodings for filenames. However, the filesystem
does not preserve information about what character encoding the filename was
encoded with; thus, depending on their "locale" settings, different users
will see different names for the same filesystem object.
This scenario is similar to the one discussed in the previous section.
For instance it occurs when objects are stored locally in a way that allows
Unicode characters in names, such as filenames in the Windows filesystem.
However, in addition to the mapping to byte sequences, an additional
mapping to a character sequence is required. As discussed
above, this mapping should use
the UTF-8 character encoding. Thus, here the
mapping can be described as:
To map from URL segments, apply URL unescaping to obtain a byte
sequence (see "Uniform Resource Identifier (URI): Generic Syntax",
section 2.1), then
UTF-8-decode to a sequence of characters.
To map to URL segments, UTF-8-encode the character sequence to
a sequence of bytes, then apply URL escaping to obtain a sequence
of characters suitable for use in a URL segment.
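The two steps in each direction can be sketched as follows. The function names `segment_to_name` and `name_to_segment` are hypothetical, and strict UTF-8 decoding is an assumption of this sketch (a server could choose a different error policy).

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def segment_to_name(segment: str) -> str:
    """URL-unescape to bytes, then UTF-8-decode to characters."""
    raw = unquote_to_bytes(segment)
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input


def name_to_segment(name: str) -> str:
    """UTF-8-encode to bytes, then URL-escape into a segment."""
    raw = name.encode("utf-8")
    return quote_from_bytes(raw, safe="")


if __name__ == "__main__":
    print(segment_to_name("M%C3%BCller"))
    print(name_to_segment("\u65e5\u672c"))
```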
Finally, it's also possible to simply store the URL segments character by
character, in which case no special mapping considerations apply. Note
that this approach may be inefficient if the names contain many
URL-escaped sequences (such as when Asian characters have been encoded
using UTF-8).
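To give a rough idea of the overhead, the sketch below compares the length of a name in characters, in UTF-8 bytes, and in its URL-escaped form. The helper `storage_costs` is hypothetical and exists only for this illustration.

```python
from urllib.parse import quote


def storage_costs(name: str) -> dict:
    """Return the size of `name` under three representations."""
    return {
        "characters": len(name),
        "utf8_bytes": len(name.encode("utf-8")),
        # Identity mapping: the escaped segment is stored character by character.
        "escaped_characters": len(quote(name, safe="", encoding="utf-8")),
    }


if __name__ == "__main__":
    print(storage_costs("\u65e5\u672c\u8a9e"))  # three CJK characters
    print(storage_costs("report.txt"))          # plain ASCII, no overhead
```

Each character that needs three UTF-8 bytes occupies nine characters in escaped form, so such names grow ninefold relative to their character count.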
The non-trivial mappings have the common drawback that certain sets of
legal HTTP URLs cannot be mapped to local names (and therefore usually
need to be rejected). For the byte sequence mapping described above, this
will usually be just the null character.
However, when using the character mapping described
above, whole Unicode
character ranges may either be impossible to represent (such as when the
underlying filesystem supports only a Unicode subset), or
explicitly disallowed (such as non-normalized character sequences, see
"Character Model for the World Wide Web 1.0: Normalization", section 3.2).
In cases like these, servers SHOULD reject operations that attempt
to create those non-mappable URLs. Appropriate precondition names
are defined in .
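A server-side check along these lines might look like the sketch below. It is illustrative only: the exception `UnmappableSegment` and both function names are hypothetical, and requiring Unicode Normalization Form C is an assumption standing in for whatever normalization policy a given server enforces.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


class UnmappableSegment(ValueError):
    """Hypothetical error: the segment cannot be mapped to a local name."""


def check_byte_mapping(segment: str) -> bytes:
    """Byte-based storage: reject names containing the null character."""
    raw = unquote_to_bytes(segment)
    if b"\x00" in raw:
        raise UnmappableSegment("null character in segment")
    return raw


def check_character_mapping(segment: str) -> str:
    """Character-based storage: additionally require valid, normalized UTF-8."""
    raw = check_byte_mapping(segment)
    try:
        name = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise UnmappableSegment("segment is not valid UTF-8") from exc
    # Assumption of this sketch: only NFC-normalized names are accepted.
    if unicodedata.normalize("NFC", name) != name:
        raise UnmappableSegment("segment is not NFC-normalized")
    return name


if __name__ == "__main__":
    for candidate in ("%C3%A9", "e%CC%81", "caf%E9", "a%00b"):
        try:
            print(candidate, "->", repr(check_character_mapping(candidate)))
        except UnmappableSegment as err:
            print(candidate, "rejected:", err)
```

A server using such a check would fail the request that attempts to create the resource, rather than silently storing it under a different name.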
In general, the mappings discussed above
apply to clients as well. Whether a client maps segments to byte or character
sequences usually depends on the platform it runs on, and what system layer
it uses. For instance, a filesystem driver for a Unix system usually will
have to translate to byte sequences (because that's how many Unix systems
internally represent filenames).
However, if the client needs to do any mapping at all, there may be situations
where parts of a URL segment can't be mapped to what the client needs
internally. In cases like these, it is recommended that the client signal the problem
and provide a way to repair it (such as renaming the resource).
The name specified by the HTTP request as path segment is available
for use as a new binding name (see "Binding Extensions to Web Distributed
Authoring and Versioning (WebDAV)", sections 4 and 6).
Servers that use a non-identity mapping may not be able to create new resources
with the URLs specified by the client (such as in an MKCOL or a PUT request).
Clients that use a non-identity mapping may not be able to handle all URLs
returned by a server (such as a result of a PROPFIND request).
All of the security considerations of HTTP/1.1 and the WebDAV Distributed
Authoring Protocol specification also apply to this protocol specification.
TBD: add notes about the inherent security risks when a backend storage
maps multiple notations to the same physical object (file), think uppercase/lowercase,
trailing blanks/dots, resolution of relative paths ("./", "../").
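As an illustration of the aliasing risk noted above, the following sketch groups names that a hypothetical backend would treat as the same object. The folding rule in `alias_key` (case folding plus stripping of trailing blanks and dots) is an assumption made for this example and does not describe any particular filesystem.

```python
from typing import Dict, Iterable, List


def alias_key(name: str) -> str:
    """Assumed backend rule: ignore case and trailing blanks/dots."""
    return name.rstrip(" .").casefold()


def find_aliases(names: Iterable[str]) -> Dict[str, List[str]]:
    """Return groups of distinct names that map to the same stored object."""
    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(alias_key(name), []).append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    print(find_aliases(["Report.txt", "report.TXT", "notes", "notes. ", "other"]))
```

Access control decisions keyed on the URL as sent by the client can be bypassed when several such notations reach the same object, which is why a server may want to detect these groups.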
All internationalization considerations mentioned in also apply to
this document.
There are no IANA Considerations.
Key words for use in RFCs to Indicate Requirement Levels
Harvard University
sob@harvard.edu
General
keyword
HTTP Extensions for Distributed Authoring -- WEBDAV
Microsoft Corporation
yarong@microsoft.com
Dept. Of Information and Computer Science, University of California, Irvine
ejw@ics.uci.edu
Netscape
asad@netscape.com
Novell
srcarter@novell.com
Novell
dcjensen@novell.com
Hypertext Transfer Protocol -- HTTP/1.1
University of California, Irvine
fielding@ics.uci.edu
W3C
jg@w3.org
Compaq Computer Corporation
mogul@wrl.dec.com
MIT Laboratory for Computer Science
frystyk@w3.org
Xerox Corporation
masinter@parc.xerox.com
Microsoft Corporation
paulle@microsoft.com
W3C
timbl@w3.org
Versioning Extensions to WebDAV
Rational Software
geoffrey.clemm@rational.com
IBM
jamsden@us.ibm.com
IBM
tim_ellison@uk.ibm.com
Microsoft
ckaler@microsoft.com
UC Santa Cruz, Dept. of Computer Science
ejw@cse.ucsc.edu
UTF-8, a transformation format of ISO 10646
Alis Technologies
fyergeau@alis.com
Uniform Resource Identifier (URI): Generic Syntax
World Wide Web Consortium
timbl@w3.org
Day Software
fielding@gbiv.com
Adobe Systems Incorporated
LMM@acm.org
HTML 4.01 Specification
W3C
dsr@w3.org
W3C
W3C
Character Model for the World Wide Web 1.0: Normalization
W3C
duerst@w3.org
W3C
shida@w3.org
Reuters Ltd.
misha.wolf@reuters.com
XenCraft
tex@XenCraft.com
WebMethods
aphillips@webmethods.com
Internationalized Resource Identifiers (IRIs)
World Wide Web Consortium
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Microsoft Corporation
michelsu@microsoft.com
http://www.suignard.com
Binding Extensions to Web Distributed Authoring and Versioning (WebDAV)
IBM
20 Maguire Road
Lexington
MA
02421
geoffrey.clemm@us.ibm.com
IBM Research
P.O. Box 704
Yorktown Heights
NY
10598
ccjason@us.ibm.com
greenbytes GmbH
Salzmannstrasse 152
Muenster, NW 48159
Germany
julian.reschke@greenbytes.de
UC Santa Cruz, Dept. of Computer Science
1156 High Street
Santa Cruz
CA
95064
ejw@cse.ucsc.edu