

Selenium can work with:

  • HtmlUnit (HtmlUnitDriver)
  • FirefoxDriver
  • PhantomJSDriver, which drives PhantomJS, a popular headless, JavaScript-enabled browser


With the following dependencies, the phantomjs-runner library downloads the PhantomJS binaries to the tmp directory. The Gradle dependencies are:

    compile ('commons-io:commons-io:2.2')
    compile ('com.github.igor-suhorukov:phantomjs-runner:1.1')
    compile ('com.github.detro:phantomjsdriver:1.2.0')
    compile 'org.codehaus.groovy:groovy:2.4.12'

Selenium and PhantomJS setup code

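    import com.github.igorsuhorukov.phantomjs.PhantomJsDowloader
    import org.openqa.selenium.WebDriver
    import org.openqa.selenium.phantomjs.PhantomJSDriver
    import org.openqa.selenium.phantomjs.PhantomJSDriverService
    import org.openqa.selenium.remote.DesiredCapabilities

    WebDriver driver   // kept as a field so the snippets below can use it

    public void setup() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
        def phantomJsPath = PhantomJsDowloader.getPhantomJsPath()
        DesiredCapabilities settings = new DesiredCapabilities()
        settings.setCapability("takesScreenshot", true)
        String[] allowSsl = ["--web-security=no", "--ssl-protocol=tlsv1", "--ignore-ssl-errors=true"]
        settings.setCapability("phantomjs.cli.args", allowSsl)
        // PhantomJS reads page settings from capabilities prefixed with "phantomjs.page.settings."
        settings.setCapability("phantomjs.page.settings.userAgent", USER_AGENT)
        settings.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, phantomJsPath)
        driver = new PhantomJSDriver(settings)
    }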

Selecting and Deselecting

println driver.getPageSource()   // print the full page source
// getText() does not work on <title> because the element is not displayed, so read innerText instead
def title = driver.findElement(By.tagName("title")).getAttribute("innerText")
println "Got title : " + title
driver.findElement(By.id("login-email")).sendKeys("")   // set an input value (assumes "login-email" is the element's id)
List<WebElement> overview = driver.findElements(By.cssSelector("> p"))   // collect the matching <p> elements

Creating a Screenshot

  // name and lastVisited are assumed to be defined by the calling code
  String screenshotAs = driver.getScreenshotAs(OutputType.BASE64)
  File resultFile = File.createTempFile("phantomjs", ".html")
  // write the page as UTF-8 to match the charset declared in the meta tag
  resultFile.write(""" <html><head><meta http-equiv="content-type" content="text/html; charset=UTF-8"></head>
            <body><p> ${name} </p><p> ${lastVisited} </p>
            <img alt="Embedded Image" src="data:image/png;base64,${screenshotAs}">
            </body></html> """, "UTF-8")
  println "html ${resultFile} created "
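
For comparison, the same kind of report can be sketched in Python. This is only a minimal sketch and not part of the Groovy code above: build_report and write_report are hypothetical helper names, and the base64 string is assumed to come from Selenium's driver.get_screenshot_as_base64(). Here the PNG signature bytes stand in for a real screenshot so the example runs without a browser.

```python
import base64
import html
import tempfile
from pathlib import Path


def build_report(name: str, last_visited: str, screenshot_b64: str) -> str:
    """Return an HTML page that embeds a base64-encoded PNG as a data URI."""
    return (
        '<html><head><meta http-equiv="content-type" '
        'content="text/html; charset=UTF-8"></head>\n'
        f"<body><p>{html.escape(name)}</p><p>{html.escape(last_visited)}</p>\n"
        f'<img alt="Embedded Image" src="data:image/png;base64,{screenshot_b64}">\n'
        "</body></html>\n"
    )


def write_report(page: str) -> Path:
    """Write the page to a temporary .html file and return its path."""
    with tempfile.NamedTemporaryFile(
        "w", prefix="phantomjs", suffix=".html", encoding="utf-8", delete=False
    ) as handle:
        handle.write(page)
    return Path(handle.name)


# Stand-in for driver.get_screenshot_as_base64(): only the PNG signature bytes.
fake_screenshot = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
report_path = write_report(build_report("Example page", "2018-12-08", fake_screenshot))
print(f"html {report_path} created")
```

Escaping the text fields with html.escape keeps page names containing "&" or "<" from breaking the generated markup.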

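Shutting down the driver

Call driver.quit() when finished. It closes every window opened by the driver and ends the browser session.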


Selenium and Python

Ensure the browser driver (by default geckodriver, for Firefox) is on the PATH or in a ~/bin directory.

Download the latest release, unzip it, and put the executable in a directory on the PATH.
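
Whether the driver can be found may be checked from Python before a session is started. The following is a minimal sketch using only the standard library; find_driver is a hypothetical helper name, and ~/bin is searched explicitly in case it is not on the PATH.

```python
import os
import shutil
from typing import Optional


def find_driver(name: str = "geckodriver") -> Optional[str]:
    """Return the path of the driver executable, or None if it is not found."""
    # Search the PATH first, then fall back to ~/bin.
    found = shutil.which(name)
    if found is None:
        found = shutil.which(name, path=os.path.expanduser("~/bin"))
    return found


location = find_driver()
if location:
    print(f"geckodriver found at {location}")
else:
    print("geckodriver not found; add its directory to the PATH")
```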

Selenium code can then be run from Python:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
# from bs4 import BeautifulSoup
# import re
# import pandas as pd
# import os
url = ""
# create a new Firefox session
driver = webdriver.Firefox()
# load the page to be scraped
driver.get(url)
# locate the FHSU link by its id and click it
python_button = driver.find_element(By.ID, 'MainContent_uxLevel1_Agencies_uxAgencyBtn_33')
python_button.click()

Beautiful Soup

Install for Python 3:

pip install beautifulsoup4

Lxml Parser

We use the lxml parser with Beautiful Soup. lxml is built on the C libraries libxml2 and libxslt, which have huge benefits:

  • Standards-compliant XML support.
  • Support for (broken) HTML.
  • Full-featured.
  • Actively maintained by XML experts.
  • fast. fast! FAST!

Install it with:

pip install lxml

Then pass 'lxml' as the parser name:

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')
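
To illustrate the parsing step without a browser, the sketch below feeds a small, deliberately broken HTML string to Beautiful Soup in place of driver.page_source. The sample markup and the summarise helper are made up for illustration, and the code assumes beautifulsoup4 and lxml are installed as described above.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the <p> tags are intentionally left unclosed.
SAMPLE_HTML = """
<html><head><title>Example page</title></head>
<body><div id="overview"><p>First paragraph<p>Second paragraph</div></body>
</html>
"""


def summarise(page_source: str) -> dict:
    """Extract the page title and the direct-child paragraphs of #overview."""
    soup = BeautifulSoup(page_source, "lxml")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(strip=True) for p in soup.select("#overview > p")]
    return {"title": title, "paragraphs": paragraphs}


print(summarise(SAMPLE_HTML))
```

Because the parser repairs the unclosed tags, both paragraphs are still returned as separate elements.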
selenium_and_geb.txt · Last modified: 2018/12/08 11:58 by root