URLs

Terms

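A URL has a string representation and a structured representation, the URL record. The URL parser and the URL serializer convert between the two [1]: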

                      ---->  URL parser      ---->
   valid URL string                                 URL record
                      <----  URL serializer  <----

The urllib.parse module of the Python standard library offers the functions urlparse / urlunparse to parse and serialize URLs. urlparse splits off the parameters of the last path segment only. To support parameters in each path segment, use urlsplit / urlunsplit, which leave the parameters inside the path. [2]
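As a minimal sketch of the parse / serialize round trip, the snippet below runs both parsers on the same string. The URL itself is a made-up example, not one taken from the references.

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

# Hypothetical URL with a parameter on each path segment.
url = "http://example.com/a;x=1/b;y=2?q=1#top"

# URL string -> structured result.
parts = urlsplit(url)

# Structured result -> URL string.
round_trip = urlunsplit(parts)

# urlsplit leaves the parameters inside the path.
split_path = parts.path

# urlparse splits off the parameters of the last segment only.
parsed = urlparse(url)
parsed_path, parsed_params = parsed.path, parsed.params

print(round_trip)
print(split_path)
print(parsed_path, parsed_params)
```

For this input the serialized string equals the original one, and only urlparse reports a separate params value.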

Components

The URL record can have the following items [3] :

  • scheme

  • username

  • password

  • host (hostname in urllib.parse; netloc also includes the username, password and port)

  • port

  • path. A list of segments, each an ASCII string; the segments “.” and “..” are treated specially

  • fragment

  • query
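The items above can be read from the result of urlsplit, as in the following sketch. The URL is a hypothetical example, and the dictionary keys simply mirror the list above; urllib.parse does not itself define a "URL record" type.

```python
from urllib.parse import urlsplit

# Hypothetical URL that uses every component.
parts = urlsplit("https://alice:secret@example.com:8443/docs/a.html?lang=en#intro")

# Map the items of the list above to urllib.parse attributes.
record = {
    "scheme": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,   # host only, without userinfo or port
    "port": parts.port,       # parsed as an int
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in record.items():
    print(f"{name:9} {value!r}")

# netloc is the raw authority: userinfo, host and port together.
print(parts.netloc)
```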

Relative URLs

A relative URL is a compact representation of the location of a resource relative to an absolute base URL.

… it is often the case that a group or “tree” of documents has been constructed to serve a common purpose; the vast majority of URLs in these documents point to locations within the tree rather than outside of it. Similarly, documents located at a particular Internet site are much more likely to refer to other resources at that site than to resources at remote sites.

Relative addressing of URLs allows document trees to be partially independent of their location and access scheme. For instance, it is possible for a single set of hypertext documents to be simultaneously accessible and traversable via each of the “file”, “http”, and “ftp” schemes if the documents refer to each other using relative URLs. Furthermore, document trees can be moved, as a whole, without changing any of the embedded URLs. [4]

Base URL

The term “relative URL” implies that there exists some absolute “base URL” against which the relative reference is applied. [5]

If no base URL is embedded and the document is not encapsulated within some other entity (e.g., the top level of a composite entity), then, if a URL was used to retrieve the base document, that URL shall be considered the base URL. [6]

To resolve a relative URL (simplified): [7]

  1. Establish the base URL

  2. Parse the base and the relative URL

  3. Remove the last path segment of the base URL and append the relative URL path.

  4. Apply the following operations to the new path:

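    a. Remove all occurrences of “./”, where “.” is a complete path segment.

    b. Remove “.” at the end of the path, if it is a complete path segment.

    c. Remove all occurrences of “<segment>/../”, where <segment> is a complete path segment other than “..”, from left to right until none remain.

    d. Remove “<segment>/..” at the end of the path, where <segment> is a complete path segment other than “..”.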

  5. Recombine the resulting URL components to obtain the absolute form of the relative URL.
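The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the full RFC 1808 algorithm: the relative URL is assumed to have no scheme or host and a non-empty path that does not start with “/”. The names normalize_path and resolve_simplified are hypothetical helpers introduced here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_path(path):
    """Apply steps 4a-4d, treating "." and ".." as whole segments."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == ".":
            # 4a / 4b: drop "."; a trailing "." leaves a trailing slash.
            if is_last:
                out.append("")
            continue
        if seg == ".." and out and out[-1] not in ("", ".."):
            # 4c / 4d: drop "<segment>/.." together, leftmost first.
            out.pop()
            if is_last:
                out.append("")
            continue
        out.append(seg)
    return "/".join(out)


def resolve_simplified(base, relative):
    """Resolve a path-only relative URL against an absolute base URL."""
    b, r = urlsplit(base), urlsplit(relative)            # step 2
    # Step 3: drop the last base segment, append the relative path.
    merged = b.path.rsplit("/", 1)[0] + "/" + r.path
    path = normalize_path(merged)                        # step 4
    # Step 5: recombine; query and fragment come from the relative URL.
    return urlunsplit((b.scheme, b.netloc, path, r.query, r.fragment))


print(resolve_simplified("http://a/b/c/d", "../g"))
```

A “..” that has no preceding segment left to remove is kept in the path, so resolving “../../../g” against the same base yields a path of “/../g”.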

urljoin

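The urljoin function of urllib.parse resolves a relative URL against a base URL, so it can be used to check the normal and abnormal examples of the RFC. Because urljoin follows the later RFC 3986 rules, some abnormal examples may give results that differ from those listed in RFC 1808: [8]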

>>> from urllib.parse import urljoin
>>> urljoin('http://a/b/c/d', 'g')
'http://a/b/c/g'
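A few more of the normal examples can be checked in the same way. The sketch below uses the base URL of the RFC examples, which also carries parameters, a query and a fragment; the selection of cases and the names BASE and NORMAL_EXAMPLES are choices made here for illustration.

```python
from urllib.parse import urljoin

# Base URL used by the examples in RFC 1808.
BASE = "http://a/b/c/d;p?q#f"

# Relative URL -> expected absolute URL (a subset of the normal examples).
NORMAL_EXAMPLES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "../g": "http://a/b/g",
    "../..": "http://a/",
    "../../g": "http://a/g",
}

# Collect the actual results so they can be compared with the expectations.
results = {rel: urljoin(BASE, rel) for rel in NORMAL_EXAMPLES}

for rel, absolute in results.items():
    print(f"{rel:8} -> {absolute}")
```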

References

  1. Section 4, “URLs”, of the URL Living Standard.
  2. urllib.parse in the Python standard library documentation.
  3. Section 4.1, “URL representation”, of the URL Living Standard.
  4. Section 1, “Introduction”, of RFC 1808, Relative Uniform Resource Locators.
  5. Section 3, “Establishing a Base URL”, of RFC 1808, Relative Uniform Resource Locators.
  6. Section 3.3, “Base URL from the Retrieval URL”, of RFC 1808, Relative Uniform Resource Locators.
  7. Section 4, “Resolving Relative URLs”, of RFC 1808, Relative Uniform Resource Locators.
  8. Section 5, “Examples and Recommended Practice”, of RFC 1808, Relative Uniform Resource Locators.