Friday, October 8, 2010

Downloading images from a website in Python

From time to time, you might find yourself needing to sift through a website and grab images from it. While this isn't terribly difficult in any language (except, maybe, Perl and Java), it's amazingly easy in Python:


import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://www.yahoo.com"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text" />
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName, 'wb')
        output.write(imgData)
        output.close()
    except:
        pass


As you can see, the code above is incredibly simple. All it does is connect to the URL specified in the 'url' variable, search the page for HTML '<img>' tags using a standard regular expression, and download whatever file each tag's 'src' attribute points to. Earthshaking code? No. Useful? Yes!
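One thing to keep in mind: the 'src' values you scrape are often relative paths (like '/images/logo.gif'), and urllib2 can't open those directly, so as written they simply get skipped by the bare except. If you want to grab them too, urlparse's urljoin can resolve each one against the page URL before the download loop, something like this:


from urlparse import urljoin

# turn relative 'src' values into absolute URLs before downloading
imgUrls = [urljoin(url, imgUrl) for imgUrl in imgUrls]
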

The code could be made better if you added the ability to also parse the URLs on the page and follow them, so you could point the program at one URL and have it automatically explore the linked pages and grab their images too; a rough sketch of that idea is below. Still, even as it is, it's pretty useful.
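If you wanted to try that, here's a minimal one-level-deep sketch. It reuses the same regex approach, resolves relative links with urljoin, and the function name downloadImages is just something made up for illustration:


import urllib2
import re
from os.path import basename
from urlparse import urlsplit, urljoin

def downloadImages(pageUrl):
    # grab one page and save every image it references
    try:
        content = urllib2.urlopen(pageUrl).read()
    except:
        return
    for imgUrl in re.findall('img .*?src="(.*?)"', content):
        imgUrl = urljoin(pageUrl, imgUrl)  # resolve relative paths
        try:
            imgData = urllib2.urlopen(imgUrl).read()
            fileName = basename(urlsplit(imgUrl)[2])
            if fileName:
                open(fileName, 'wb').write(imgData)
        except:
            pass

startUrl = "http://www.yahoo.com"
downloadImages(startUrl)

# follow every link on the start page, one level deep
startContent = urllib2.urlopen(startUrl).read()
for link in set(re.findall('a .*?href="(.*?)"', startContent)):
    downloadImages(urljoin(startUrl, link))
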

** Thanks to the folks at ActiveState for this Python recipe!
