Selenium

Selenium can work with:

  • HtmlUnit
  • FirefoxDriver
  • PhantomJS driver, a popular headless browser with JavaScript enabled

PhantomJs

With the following dependencies, the PhantomJS binaries are downloaded to the tmp directory at runtime. The Gradle dependencies are:

    compile ('commons-io:commons-io:2.2')
    compile ('com.github.igor-suhorukov:phantomjs-runner:1.1')
    compile ('com.github.detro:phantomjsdriver:1.2.0')
    compile 'org.codehaus.groovy:groovy:2.4.12'

Setup Selenium and PhantomJs code

 public WebDriver setup() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"

        // Path to the PhantomJS binary fetched by phantomjs-runner
        def phantomJsPath = PhantomJsDowloader.getPhantomJsPath()

        DesiredCapabilities settings = new DesiredCapabilities()
        settings.setJavascriptEnabled(true)
        settings.setCapability("takesScreenshot", true)
        // Command-line flags passed to PhantomJS to relax SSL checks
        String[] allowSsl = [ "--web-security=no", "--ssl-protocol=tlsv1", "--ignore-ssl-errors=true" ]
        settings.setCapability("phantomjs.cli.args", allowSsl)
        settings.setCapability("userAgent", USER_AGENT)

        settings.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, phantomJsPath)

        // Return the driver so the calling code can use it and later quit it
        WebDriver driver = new PhantomJSDriver(settings)
        return driver
 }

Selecting and Deselecting

println driver.getPageSource()                                                  // get the full page source 
def title = driver.findElement(By.tagName("title")).getAttribute("innerText")   // get the title. getText() does not work on title because it is not displayed 
println "Got title : " + title
driver.findElement(By.id("login-email")).sendKeys("rdonovan2004@gmail.com")     // set an input value 
List<WebElement> overview  = driver.findElements(By.cssSelector("> p"))

Creating a Screenshot

  // name and lastVisited are assumed to be defined by the calling code
  String screenshotAs = driver.getScreenshotAs(OutputType.BASE64)
  File resultFile = File.createTempFile("phantomjs", ".html")

  // File.write(text, charset) opens, writes and closes the file, so no separate stream is needed
  resultFile.write(""" <html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head>
            <body><p> ${name} </p><p> ${lastVisited} </p>
            <img alt="Embedded Image" src="data:image/png;base64,${screenshotAs}">
            </body>
            </html> """, "UTF-8")

  println "html ${resultFile} created "

Shutting down the driver

   driver.quit();

Selenium and Python

Ensure the driver, e.g. the default geckodriver, is on the PATH or in a ~/bin directory:

wget https://github.com/mozilla/geckodriver/releases/download/v0.18.0/geckodriver-v0.18.0-linux64.tar.gz

Extract the archive (it is a .tar.gz, so use tar rather than unzip) and put the geckodriver binary on the PATH.

Selenium code can then be run from Python:

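from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# from bs4 import BeautifulSoup
# import re
# import pandas as pd
# import os
 
url = "http://kanview.ks.gov/PayRates/PayRates_Agency.aspx"
 
# create a new Firefox session
driver = webdriver.Firefox()
driver.implicitly_wait(30)   # wait up to 30 seconds when locating elements
driver.get(url)
png_bytes = driver.get_screenshot_as_png()   # PNG image as bytes; driver.save_screenshot(path) writes a file instead
 
# find_element(By.ID, ...) replaces find_element_by_id, which newer Selenium releases no longer provide
python_button = driver.find_element(By.ID, 'MainContent_uxLevel1_Agencies_uxAgencyBtn_33') #FHSU
python_button.click() #click fhsu link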
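
The Groovy screenshot report from the PhantomJs section can be reproduced from Python. The following is a minimal sketch under stated assumptions, not a definitive implementation: `build_report` and its `name` and `last_visited` parameters are hypothetical names chosen for illustration, and the function takes the raw PNG bytes returned by `driver.get_screenshot_as_png()` so that it can be tried without a running browser.

```python
import base64
import html
import tempfile
from pathlib import Path


def build_report(png_bytes: bytes, name: str, last_visited: str) -> Path:
    """Write an HTML page that embeds the screenshot as a base64 data URI.

    build_report, name and last_visited are illustrative names, not part
    of Selenium.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    page = (
        '<html><head><meta http-equiv="content-type" '
        'content="text/html; charset=UTF-8"></head>\n'
        f"<body><p>{html.escape(name)}</p><p>{html.escape(last_visited)}</p>\n"
        f'<img alt="Embedded Image" src="data:image/png;base64,{encoded}">\n'
        "</body></html>\n"
    )
    # delete=False keeps the file on disk after the handle is closed
    with tempfile.NamedTemporaryFile(
        "w", prefix="selenium", suffix=".html", encoding="utf-8", delete=False
    ) as handle:
        handle.write(page)
    return Path(handle.name)
```

With a live session this could be called as `build_report(driver.get_screenshot_as_png(), "FHSU", "2018-12-08")`, and the returned path opened in any browser.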

Beautiful Soup

Install for python3

pip install beautifulsoup4

Lxml Parser

We use the lxml parser with Beautiful Soup. The underlying C libraries libxml2 and libxslt have huge benefits:

  • Standards-compliant XML support.
  • Support for (broken) HTML.
  • Full-featured.
  • Actively maintained by XML experts.
  • fast. fast! FAST!

pip install lxml

and then to use:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')
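
To show what a parsed `soup` object offers, here is a minimal sketch that parses a small hard-coded HTML fragment instead of `driver.page_source`. The sample markup, the `make_soup` and `extract_links` helpers, and the fallback to Python's built-in `html.parser` when lxml is not installed are all assumptions made for illustration; none of them come from the KanView page.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup standing in for driver.page_source
SAMPLE_HTML = """
<html><head><title>Pay Rates</title></head>
<body>
  <a id="agency_1" href="/agency/1">Agency One</a>
  <a id="agency_2" href="/agency/2">Agency Two</a>
</body></html>
"""


def make_soup(markup):
    """Parse with lxml, falling back to the standard-library parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


def extract_links(markup):
    """Return (link text, href) pairs for every anchor that has an href."""
    soup = make_soup(markup)
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


soup = make_soup(SAMPLE_HTML)
print(soup.title.string)
print(extract_links(SAMPLE_HTML))
```

In a real run, `make_soup(driver.page_source)` would take the place of the sample string.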
 
selenium_and_geb.txt · Last modified: 2018/12/08 11:58 by root
 