Goalist Developers Blog

Using Selenium to crawl websites & compare file differences

In this blog, I would like to introduce how the Selenium library can help us get content from a website, and how to compare files of different formats, in the Oz plan content checker project.

The goal of the Oz plan content checker project is to detect whether the plan content on a certain website has changed, compared with the content we collected before.

To do this, we separate it into two steps:

  1. Get data from website
  2. Compare them according to different format (Text, Image, PDF)
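Since each format is saved under its own file extension, the mapping between format and local file name can be sketched up front. This is a minimal sketch; the function name and the root save directory are our own choices, matching the save snippets later in this post:

```python
def local_file_name(web_name, fmt):
    """Map a content format to the local file name used when saving (sketch)."""
    extensions = {"text": "json", "image": "png", "pdf": "pdf"}
    return "/{}.{}".format(web_name, extensions[fmt])

print(local_file_name("example-site", "image"))  # /example-site.png
```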

Get data from website

Selenium is powerful: it lets us crawl websites whose content is rendered by JavaScript, and it can locate the exact element we want to crawl on the page.

There are several things we need to know before getting an element using Selenium:

  1. Web URL
  2. XPath of the element
  3. Type of data (image, text, PDF)

After knowing these, we can start crawling data from the website!

from selenium import webdriver
import time
import urllib.request
import json

driver = webdriver.Chrome()  # use Chrome as the browser to visit the website
driver.get(url)  # open the url
time.sleep(3)  # important! give the website time to render all the content
element = driver.find_element_by_xpath(xpath)  # xpath locates the target element

There may be some dependency issues if Chrome (and a matching ChromeDriver) is not installed on your PC.

If the content is in image format

import requests

src = element.get_attribute("src")  # the image URL
r = requests.get(src, stream=True)
with open("/{}.png".format(web_name), "wb") as file:
    file.write(r.content)

If the content is in text format

text_dict = {"plan": element.text}  # e.g. collect the scraped text into a dict
with open("/{}.json".format(web_name), 'w+') as f:
    json.dump(text_dict, f, ensure_ascii=False)
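The `ensure_ascii=False` flag matters here: plan pages often contain non-ASCII text (e.g. Japanese), and without it `json.dump` escapes every such character. A quick illustration with made-up text:

```python
import json

text_dict = {"plan": "プランA"}  # hypothetical scraped text

print(json.dumps(text_dict, ensure_ascii=False))  # {"plan": "プランA"}
print(json.dumps(text_dict))                      # {"plan": "\u30d7\u30e9\u30f3A"}
```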

If the website URL ends with .pdf, the page is special because it has no XPath. In that case we can use urllib.request to download the data to a local directory directly.

response = urllib.request.urlopen(url)
with open("/{}.pdf".format(web_name), 'wb') as file:
    file.write(response.read())
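The suffix check itself can be a small helper. This is a sketch: the function name is ours, and it strips any query string before checking the extension:

```python
def is_pdf_url(url):
    """Return True when the URL points directly at a PDF file (sketch)."""
    return url.lower().split("?")[0].endswith(".pdf")

print(is_pdf_url("https://example.com/plan.pdf"))   # True
print(is_pdf_url("https://example.com/plan.html"))  # False
```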

Then, no matter which type of plan the website hosts, we can save it to our local directory. Step 1 complete!

Compare files

Because we need to check whether the content of the website has changed, we need to compare today's data with the data we collected before.

To compare JSON files, we first load them and then compare them as dicts:

import json

def json_compare(file_path1, file_path2):
    with open(file_path1) as f1, open(file_path2) as f2:
        data1 = json.load(f1)
        data2 = json.load(f2)
    # dict equality already checks that both the keys and the values match
    return data1 == data2
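The dict equality that `json_compare` relies on is enough to catch a changed value. For example, with two made-up snapshots written to temporary files:

```python
import json
import os
import tempfile

def dump_tmp(data):
    # helper: write a dict to a temporary JSON file and return its path
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, ensure_ascii=False)
    return path

path_before = dump_tmp({"plan": "Basic", "price": 1000})
path_after = dump_tmp({"plan": "Basic", "price": 1200})

with open(path_before) as f1, open(path_after) as f2:
    same = json.load(f1) == json.load(f2)

print(same)  # False: the price changed
```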

To compare image files, the key is to load each image as a NumPy matrix, resize both to the same shape, and then check whether any pixel differs.

import cv2
import numpy as np

def image_compare(file_path1, file_path2):
    img1 = cv2.imread(file_path1)
    img2 = cv2.imread(file_path2)

    # resize both images to the same size so they can be compared
    img1 = cv2.resize(img1, (256, 256))
    img2 = cv2.resize(img2, (256, 256))

    # absdiff is zero everywhere only when the two images are identical
    # (cv2.subtract would clip negative differences to zero and miss changes)
    diff_img = cv2.absdiff(img1, img2)
    return np.sum(diff_img) == 0
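The pixel-wise idea can be seen with plain NumPy arrays, no OpenCV needed. Here are two tiny made-up 2x2 grayscale "images" where a single pixel differs:

```python
import numpy as np

# two toy grayscale "images"; the bottom-right pixel differs
img1 = np.array([[10, 20], [30, 40]])
img2 = np.array([[10, 20], [30, 41]])

diff = np.abs(img1 - img2)          # per-pixel absolute difference
identical = bool(np.sum(diff) == 0)
print(identical)  # False: one pixel changed
```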

To compare PDF files, there is a powerful library called diff_pdf_visually: internally it converts each PDF file to images and compares those.

from diff_pdf_visually import pdfdiff

def pdf_compare(file_path1, file_path2):
    try:
        # pdfdiff returns True when the two PDFs look visually identical
        return pdfdiff(file_path1, file_path2)
    except Exception:
        # treat unreadable or corrupt files as changed
        return False
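Putting the three comparators together, each saved file can be routed to the right check by its extension. A sketch (the dispatcher and its name are ours; here it returns the name of the comparator to call):

```python
import os

def comparator_for(path):
    """Pick the comparison function from the file extension (sketch)."""
    ext = os.path.splitext(path)[1].lower()
    return {".json": "json_compare", ".png": "image_compare", ".pdf": "pdf_compare"}[ext]

print(comparator_for("/example-site.png"))  # image_compare
```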

Summary

In this blog, a method to get elements from a website and three functions to compare files of different formats were introduced. That's all I want to share; I hope it helps your coding in the future.