Python provides the urllib module to handle URL-related tasks. URL is short for Uniform Resource Locator and is used to address web pages and other resources uniquely. The urllib module provides the following submodules and functions.
- urllib.request is used to open and read URLs.
- urllib.parse is used to parse URLs into their components, such as the domain name, path, or query parameters.
- urllib.error is used to handle exceptions raised by urllib.request.
- urllib.robotparser is used to parse robots.txt files.
urllib.request
The urllib.request module is used to open a specified URL without a UI or browser. The URL is passed to the urlopen() method as shown below.
import urllib.request
request_url = urllib.request.urlopen('http://www.pythontect.com')
print(request_url.read())
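The urlopen() call returns an HTTP response object. As a rough sketch, the status code, a header, and the decoded body can be read like this:

import urllib.request

response = urllib.request.urlopen('http://www.pythontect.com')
print(response.status)                       # HTTP status code, e.g. 200
print(response.getheader('Content-Type'))    # a single response header value
html = response.read().decode('utf-8')       # decode the body bytes into text
print(html[:200])                            # print the first 200 characters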
urllib.parse
The urllib.parse module is used to parse and manipulate a URL and its different parts. A typical URL consists of a scheme, network location (netloc), path, parameters, query, and fragment.
import urllib.parse
url = "https://www.pythontect.com/about?user=ismail"
parsed_url = urllib.parse.urlparse(url)
print(parsed_url)
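The urlparse() method returns a named tuple, so the individual components can also be accessed as attributes:

import urllib.parse

url = "https://www.pythontect.com/about?user=ismail"
parsed_url = urllib.parse.urlparse(url)
print(parsed_url.scheme)   # https
print(parsed_url.netloc)   # www.pythontect.com
print(parsed_url.path)     # /about
print(parsed_url.query)    # user=ismail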
The urllib.parse module also provides the following methods, which can be used to parse, split, or rebuild URLs.

| Method | Description |
|---|---|
| urllib.parse.urlparse | Separates a URL into its different components |
| urllib.parse.urlunparse | Joins the different components back into a URL |
| urllib.parse.urlsplit | Similar to urlparse() but does not split out the params component |
| urllib.parse.urlunsplit | Combines the tuple elements returned by urlsplit() to form a URL |
| urllib.parse.urldefrag | If the URL contains a fragment, returns the URL with the fragment removed, together with the fragment |
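As a brief sketch, urlsplit(), urlunsplit(), and urldefrag() can be used like this (the #team fragment is only an example value):

import urllib.parse

url = "https://www.pythontect.com/about?user=ismail#team"

parts = urllib.parse.urlsplit(url)       # (scheme, netloc, path, query, fragment)
print(parts)
print(urllib.parse.urlunsplit(parts))    # rebuild the original URL from the parts
print(urllib.parse.urldefrag(url))       # (URL without the fragment, fragment)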
urllib.error
Sometimes URL-related methods raise errors and exceptions. The urllib.error module is used to handle and manage these errors and exceptions. There are two main exception types, named URLError and HTTPError.

URLError is raised when an error occurs while fetching the URL, for example because of a connectivity problem.

HTTPError is raised for HTTP-related errors, such as error status codes returned by the server. HTTPError is a subclass of URLError.
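A minimal sketch of handling both exception types follows; the /missing-page path is only an example, and HTTPError is caught first because it is a subclass of URLError:

import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.pythontect.com/missing-page')
except urllib.error.HTTPError as e:
    # Raised for HTTP error status codes such as 404 or 500
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # Raised for connectivity problems such as an unreachable host
    print('URL error:', e.reason)
else:
    print(response.status)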
urllib.robotparser
Web sites provide a robots.txt file in order to give information and instructions to web scrapers and crawlers. The robots.txt file is created manually or automatically and lists the paths or URLs of the site that crawlers are allowed or disallowed to visit.
import urllib.robotparser

robot = urllib.robotparser.RobotFileParser()
robot.set_url('https://www.pythontect.com/robots.txt')
robot.read()    # download and parse the robots.txt file
print(robot.can_fetch('*', 'https://www.pythontect.com/about'))
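The can_fetch() method returns True when the given user agent (here '*', meaning any agent) is allowed to fetch the given URL according to the parsed robots.txt rules, and False otherwise.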