URLs
Terms
String and structure representation [1]:

valid URL string ---> URL parser ---> URL record
valid URL string <--- URL serializer <--- URL record
The Python standard library module urllib.parse offers the functions urlparse / urlunparse to parse and serialize URLs. To support parameters in each path segment and not only the last segment, use urlsplit / urlunsplit instead. [2]
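The difference between the two pairs can be seen in a short sketch. The URL below is a made-up example (not from the sources cited here); it carries a ";param" in two path segments.

```python
from urllib.parse import urlparse, urlunparse, urlsplit, urlunsplit

# Hypothetical URL with parameters in two path segments.
url = "http://example.com/a;x/b;y?q=1#frag"

# urlparse separates the parameters of the last path segment only.
parsed = urlparse(url)
print(parsed.path)    # /a;x/b
print(parsed.params)  # y

# urlsplit does not split parameters; they stay inside the path.
split = urlsplit(url)
print(split.path)     # /a;x/b;y

# Both serializers rebuild the original string for this input.
print(urlunparse(parsed) == url)
print(urlunsplit(split) == url)
```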
Components
The URL record can have the following items [3] :
- scheme
- username
- password
- host (netloc)
- port
- path: can be an ASCII string, “.” or “..”
- fragment
- query
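As a rough illustration, the items above can be read from the result of urlsplit. The URL is a hypothetical example, and the dictionary keys simply mirror the list; note that in urllib.parse the netloc attribute holds the user information, host and port together, while hostname holds the host alone.

```python
from urllib.parse import urlsplit

# Hypothetical URL that contains every component listed above.
parts = urlsplit("https://user:secret@example.com:8443/docs/index.html?lang=en#top")

record = {
    "scheme": parts.scheme,      # "https"
    "username": parts.username,  # "user"
    "password": parts.password,  # "secret"
    "host": parts.hostname,      # "example.com"
    "port": parts.port,          # 8443 (an int)
    "path": parts.path,          # "/docs/index.html"
    "fragment": parts.fragment,  # "top"
    "query": parts.query,        # "lang=en"
}

# netloc is the whole authority part, not only the host.
print(parts.netloc)  # user:secret@example.com:8443
```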
Relative URLs
A relative URL is a compact representation of the location of a resource relative to an absolute base URL.
… it is often the case that a group or “tree” of documents has been constructed to serve a common purpose; the vast majority of URLs in these documents point to locations within the tree rather than outside of it. Similarly, documents located at a particular Internet site are much more likely to refer to other resources at that site than to resources at remote sites.
Relative addressing of URLs allows document trees to be partially independent of their location and access scheme. For instance, it is possible for a single set of hypertext documents to be simultaneously accessible and traversable via each of the “file”, “http”, and “ftp” schemes if the documents refer to each other using relative URLs. Furthermore, document trees can be moved, as a whole, without changing any of the embedded URLs. [4]
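The independence from location and access scheme can be sketched with urljoin from urllib.parse. The host name and paths below are invented for illustration; the same relative reference is resolved against three base URLs that differ in scheme and location.

```python
from urllib.parse import urljoin

# The same relative reference, as it would be embedded in a document.
relative = "chapter2/intro.html"

# Hypothetical locations of the same document tree.
bases = [
    "http://example.org/book/index.html",
    "ftp://example.org/pub/book/index.html",
    "file:///srv/book/index.html",
]

# Each base yields an absolute URL inside its own tree.
resolved = [urljoin(base, relative) for base in bases]
for absolute in resolved:
    print(absolute)
```

The relative reference never changes; only the base URL decides where the result points.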
Base URL
The term “relative URL” implies that there exists some absolute “base URL” against which the relative reference is applied. [5]
If no base URL is embedded and the document is not encapsulated within some other entity (e.g., the top level of a composite entity), then, if a URL was used to retrieve the base document, that URL shall be considered the base URL. [6]
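A minimal sketch of this order of precedence, assuming the caller already knows each candidate. The function name and its three parameters are hypothetical and not part of any library.

```python
from typing import Optional


def establish_base_url(
    embedded_base: Optional[str] = None,
    enclosing_base: Optional[str] = None,
    retrieval_url: Optional[str] = None,
) -> Optional[str]:
    """Return the first available base URL candidate (hypothetical helper).

    Order assumed here: a base URL embedded in the document, then the
    base URL of an encapsulating entity, then the retrieval URL.
    """
    for candidate in (embedded_base, enclosing_base, retrieval_url):
        if candidate:
            return candidate
    return None  # no base URL could be established


# Only the retrieval URL is known, so it becomes the base URL.
print(establish_base_url(retrieval_url="http://a/b/c/d"))
```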
To resolve a relative URL (simplified): [7]
- Establish the base URL.
- Parse the base URL and the relative URL.
- Remove the last path segment of the base URL and append the relative URL path.
- Apply the following operations to the new path:
  a. Remove all occurrences of “./”, where “.” is a complete path segment.
  b. Remove “.” at the end of the path.
  c. Remove all occurrences of “<segment>/../” from left to right, where <segment> is a complete path segment not equal to “..”.
  d. Remove “<segment>/..” at the end of the path.
- Recombine the resulting URL components to obtain the absolute form of the relative URL.
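The simplified steps above can be sketched in Python as follows. This is an illustration only, not the algorithm used by urllib.parse: the helper names normalize_path and resolve_relative are hypothetical, and the sketch assumes the relative URL has no scheme, no host and a non-empty path that does not start with “/”.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_path(path):
    """Apply operations a to d to a merged path (hypothetical helper)."""
    segments = path.split("/")
    out = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # a, b: drop "."; at the end keep the trailing slash.
            if is_last:
                out.append("")
        elif segment == ".." and out and out[-1] not in ("", ".."):
            # c, d: drop ".." together with the segment before it.
            out.pop()
            if is_last:
                out.append("")
        else:
            out.append(segment)
    return "/".join(out)


def resolve_relative(base, relative):
    """Resolve a relative-path reference against a base URL (sketch)."""
    b = urlsplit(base)
    r = urlsplit(relative)
    # Remove the last segment of the base path, append the relative path.
    merged = b.path[: b.path.rfind("/") + 1] + r.path
    # Recombine with the scheme and host of the base URL.
    return urlunsplit((b.scheme, b.netloc, normalize_path(merged), r.query, r.fragment))


# Compare the sketch with urljoin for a few relative-path references.
for ref in ["g", "./g", "../g", "../../g", "g/", ".", "..", "g?y#s"]:
    print(ref, resolve_relative("http://a/b/c/d", ref),
          resolve_relative("http://a/b/c/d", ref) == urljoin("http://a/b/c/d", ref))
```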
urljoin
The urljoin function of urllib.parse can be used to check the normal and abnormal examples of RFC 1808: [8]
>>> from urllib.parse import urljoin
>>> urljoin('http://a/b/c/d', 'g')
'http://a/b/c/g'
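A few more checks in the same spirit, written as a plain script. The base URL is the one used above; the expected values are for the normal cases only. Since Python 3.5, urljoin follows the semantics of RFC 3986, so its results for some abnormal examples (for instance a reference with more “..” segments than the base path has levels) may differ from those listed in RFC 1808; that case is printed here without an expected value.

```python
from urllib.parse import urljoin

base = "http://a/b/c/d"

# Normal cases: relative reference -> expected absolute URL.
normal_cases = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
    "g:h": "g:h",  # already absolute, returned unchanged
}

results = {ref: urljoin(base, ref) for ref in normal_cases}
for ref, absolute in results.items():
    print(f"{ref!r:12} -> {absolute!r}")

# Abnormal case: more ".." segments than the base path has levels.
print(urljoin(base, "../../../g"))
```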
References
- Section 4. URLs of the URL Living Standard.
- urllib.parse in the Python standard library documentation.
- Section 4.1 URL representation of the URL Living Standard.
- Section 1. Introduction of RFC 1808 Relative Uniform Resource Locators.
- Section 3. Establishing a Base URL of RFC 1808 Relative Uniform Resource Locators.
- Section 3.3 Base URL from the Retrieval URL of RFC 1808 Relative Uniform Resource Locators.
- Section 4. Resolving Relative URLs of RFC 1808 Relative Uniform Resource Locators.
- Section 5. Examples and Recommended Practice of RFC 1808 Relative Uniform Resource Locators.