In this blog post, I would like to introduce how the Selenium library can help us fetch content from websites and compare files of different formats in the Oz plan content checker project.
The goal of the Oz plan content checker project is to detect whether the plan content on a given website has changed, compared with the content we collected before.
To do this, we separate it into two steps:
- Get data from website
- Compare them according to different format (Text, Image, PDF)
Get data from website
Selenium is powerful because it can crawl websites whose content is rendered by JavaScript, and it can locate the exact element we want on the page.
There are several things we need to know before fetching an element with Selenium:
- Web URL
- XPath of the element
- Type of data (image, text, PDF)
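For example, the checker's input for one site can be captured in a small record like the one below (the field names and values are illustrative, not the project's actual schema):

```python
# Hypothetical per-site record; field names and values are illustrative only.
site = {
    "url": "https://example.com/plan",     # web URL to visit
    "xpath": "//div[@id='plan-content']",  # XPath of the element to fetch
    "type": "text",                        # one of "image", "text", "pdf"
}
```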
Once we know these, we can start crawling data from the website!
```python
import json
import time
import urllib.request

from selenium import webdriver

driver = webdriver.Chrome()  # use Chrome as the browser to visit the website
driver.get(url)              # open the url
time.sleep(3)                # important! give the website time to render all the content
element = driver.find_element_by_xpath(xpath)
```
There may be dependency issues if Chrome (and a matching ChromeDriver) is not installed on your PC.
If the content is in image format:
```python
import requests

src = element.get_attribute("src")
r = requests.get(src, stream=True)
with open("/{}.png".format(web_name), "wb") as file:
    file.write(r.content)
```
If the content is text:
```python
# text_dict holds the text scraped from the element
with open("/{}.json".format(web_name), "w+") as f:
    json.dump(text_dict, f, ensure_ascii=False)
```
If the website URL ends with .pdf, the page is special: it has no XPath to locate. In that case we can use urllib.request to download the data to a local directory.
```python
response = urllib.request.urlopen(url)
with open("/{}.pdf".format(web_name), "wb") as file:
    file.write(response.read())
```
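Detecting this special case can be done by checking the URL suffix. A minimal helper for that (the function name is my own, not from the project):

```python
def is_pdf_url(url):
    # strip any query string, then check the extension case-insensitively
    return url.split("?")[0].lower().endswith(".pdf")
```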
Then, no matter which format the plan is in, we can save it in our local directory. Step 1 complete!
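The three save paths above can also be unified in one small helper that maps the data type to a file extension (a sketch under my own assumptions; the name and the extension mapping are not from the project):

```python
import os

def save_path(web_name, data_type, out_dir="."):
    # assumed mapping from data type to file extension
    ext = {"image": "png", "text": "json", "pdf": "pdf"}[data_type]
    return os.path.join(out_dir, "{}.{}".format(web_name, ext))
```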
Compare files
To detect whether the website content has changed, we compare today's data with the data collected earlier.
To compare JSON files, we first load them and compare them as dicts:
```python
import json

def json_compare(file_path1, file_path2):
    with open(file_path1) as f1, open(file_path2) as f2:
        data1 = json.load(f1)
        data2 = json.load(f2)
    # the files differ if their key sets differ
    if data1.keys() != data2.keys():
        return False
    # or if any key maps to a different value
    for key in data1:
        if data1[key] != data2[key]:
            return False
    return True
```
To compare image files, the key is to load the images as NumPy matrices and then check whether any pixel differs.
```python
import cv2
import numpy as np

def image_compare(file_path1, file_path2):
    img1 = cv2.imread(file_path1)
    img2 = cv2.imread(file_path2)
    # resize both images to the same size so they can be compared
    img1 = cv2.resize(img1, (256, 256))
    img2 = cv2.resize(img2, (256, 256))
    # absdiff counts differences in both directions
    # (cv2.subtract saturates at 0 and would miss pixels where img2 is brighter)
    diff_img = cv2.absdiff(img1, img2)
    return np.sum(diff_img) == 0
```
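The pixel-difference idea can be demonstrated on raw NumPy arrays without OpenCV (the synthetic "images" and function name here are illustrative only):

```python
import numpy as np

def pixels_identical(img1, img2):
    # cast to a signed type so differences in either direction are counted
    diff = np.abs(img1.astype(np.int16) - img2.astype(np.int16))
    return int(diff.sum()) == 0

# two tiny synthetic "images"
a = np.zeros((4, 4, 3), dtype=np.uint8)
b = a.copy()
b[0, 0, 0] = 5  # change one pixel channel
```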
To compare PDF files, there is a powerful library called diff_pdf_visually; internally it converts each PDF to images and compares those.
```python
from diff_pdf_visually import pdfdiff

def pdf_compare(file_path1, file_path2):
    # pdfdiff returns True when the two PDFs look the same
    try:
        return pdfdiff(file_path1, file_path2)
    except Exception:
        return False
```
Summary
In this blog post, a method to fetch elements from websites and three functions to compare files of different formats were introduced. That's all I want to share; I hope it helps your coding in the future.