Python urllib Module Tutorial

Python provides the urllib package to handle URL-related tasks. URL stands for Uniform Resource Locator, which uniquely addresses a web page or other resource. The urllib package provides the following modules.

  • urllib.request is used to open and read URLs.
  • urllib.parse is used to split URLs into components such as the domain name, path, and query parameters.
  • urllib.error defines the exceptions raised by urllib.request.
  • urllib.robotparser is used to parse robots.txt files.

urllib.request

The urllib.request module is used to open URLs from code, without a browser or other user interface. The URL is passed to the urlopen() function as shown below.

import urllib.request

# urlopen() returns a response object; read() returns the body as bytes
response = urllib.request.urlopen('http://www.pythontect.com')

print(response.read())
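Because read() returns bytes, the body usually needs decoding before it can be treated as text. The following is a minimal sketch rather than a definitive implementation: the helper name fetch_text and the 10-second timeout are assumptions made for illustration, and the data: URL in the last line is used only so the example runs without network access.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Open a URL and return its body decoded to str (hypothetical helper)."""
    # The with statement closes the response even if read() fails
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# A data: URL (assumed here for an offline demo) is handled by urlopen() too
print(fetch_text("data:text/plain;charset=utf-8,Hello%20urllib"))
```

The same helper can be called with an http:// or https:// address; the timeout keeps a slow server from blocking the program indefinitely.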

urllib.parse

The urllib.parse module is used to parse and manipulate a URL and its different parts. A typical URL consists of a scheme, network location, path, parameters, query, and fragment.

import urllib.parse

url = "https://www.pythontect.com/about?user=ismail"

parsed_url = urllib.parse.urlparse(url)

print(parsed_url)
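The result of urlparse() is a named tuple, so each part of the URL can be read by name. The sketch below reuses the URL from the example above; the use of parse_qs() to decode the query string is an addition for illustration.

```python
from urllib.parse import parse_qs, urlparse

url = "https://www.pythontect.com/about?user=ismail"
parsed_url = urlparse(url)

# Each component is available as a named attribute of the result
print(parsed_url.scheme)   # https
print(parsed_url.netloc)   # www.pythontect.com
print(parsed_url.path)     # /about
print(parsed_url.query)    # user=ismail

# parse_qs() turns the query string into a dict that maps names to lists
query = parse_qs(parsed_url.query)
print(query)               # {'user': ['ismail']}
```

Values are returned as lists because a query string may repeat the same parameter name more than once.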

The urllib.parse module also provides the following functions, which can be used to parse, split, or rebuild URLs.

Function                   Description
urllib.parse.urlparse      Separates a URL into its different components
urllib.parse.urlunparse    Joins the different components back into a URL
urllib.parse.urlsplit      Similar to urlparse() but doesn't split the params from the path
urllib.parse.urlunsplit    Combines the tuple elements returned by urlsplit() to form a URL
urllib.parse.urldefrag     If the URL contains a fragment, returns the URL without the fragment, together with the fragment
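The differences between these functions are easiest to see side by side. The following is a small sketch; the URL, with its ;v=1 path parameter and #team fragment, is a hypothetical example chosen so that every component is present.

```python
from urllib.parse import urldefrag, urlparse, urlsplit, urlunsplit

# Hypothetical URL with path parameters (;v=1) and a fragment (#team)
url = "https://www.pythontect.com/about;v=1?user=ismail#team"

parsed = urlparse(url)   # 6 fields: params are split from the path
split = urlsplit(url)    # 5 fields: params stay inside the path
print(parsed.path, parsed.params)
print(split.path)

# urlunsplit() rebuilds the URL from the 5-field result
rebuilt = urlunsplit(split)
print(rebuilt == url)

# urldefrag() returns the URL without the fragment, plus the fragment itself
without_fragment, fragment = urldefrag(url)
print(without_fragment, fragment)
```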

urllib.error

URL-related functions may sometimes raise exceptions. The urllib.error module defines the exception classes used to handle these errors. The two main exception types are URLError and HTTPError.

URLError is raised when an error occurs while fetching the URL, for example a connectivity problem.

HTTPError is raised for HTTP-specific errors, such as an error status code returned by the server (for example 404, or a request for authentication). HTTPError is a subclass of URLError.
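Since HTTPError is a subclass of URLError, the order of the except clauses matters. The following is a minimal sketch of one way to handle both; the helper name describe_fetch and the 10-second timeout are assumptions, and the made-up scheme in the last line is used only to trigger a URLError without network access.

```python
import urllib.error
import urllib.request


def describe_fetch(url, timeout=10):
    """Try to open a URL and report the outcome as text (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: read {len(response.read())} bytes"
    except urllib.error.HTTPError as err:
        # Must come first: HTTPError is a subclass of URLError
        return f"HTTP error {err.code}: {err.reason}"
    except urllib.error.URLError as err:
        # Connection problems, unknown schemes, DNS failures, and so on
        return f"URL error: {err.reason}"


# An unsupported scheme (assumed here for an offline demo) raises URLError
print(describe_fetch("nosuchscheme://example"))
```

If the URLError clause came first, it would also catch every HTTPError, and the status code in err.code would never be reported.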

urllib.robotparser

Web sites provide a robots.txt file to give information and instructions to web scrapers and crawlers. The file is created manually or automatically and lists the paths or URLs of the site that crawlers may or may not fetch.

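import urllib.robotparser

robot = urllib.robotparser.RobotFileParser()

# set_url() and read() both return None; read() downloads and parses the file
robot.set_url('https://www.pythontect.com/robots.txt')

robot.read()

# Ask whether any crawler ("*") is allowed to fetch a given page
print(robot.can_fetch('*', 'https://www.pythontect.com/about'))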
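The parser can also be fed the rules directly with parse(), which is handy for trying out rules without downloading anything. The sketch below is illustrative only: the rules and the /private/ path are made up and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is kept out of /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse() accepts the file content as a list of lines

# can_fetch(useragent, url) reports whether the rules allow the request
blocked = parser.can_fetch("*", "https://www.pythontect.com/private/data")
allowed = parser.can_fetch("*", "https://www.pythontect.com/about")
print(blocked, allowed)
```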
