24 February 2024

Guest blog – Automating Web Scraping using Python, Selenium and Web Drivers

1. Introduction

This tutorial is a practical example of how you might automate interaction with a web browser in situations where you really don’t want to do it manually on a regular basis; for example, where you have a lot of website options to check, or you need to carry out these checks repeatedly over time.

The inspiration for this tutorial is the lure of an amazing train journey: the Dogu Express, a 26-35 hour train journey running between Ankara and Kars in Turkey.  There is a tourist-focussed version of this journey called the “Touristic Dogu Express”, which makes fewer stops and has sleeping cars; it can get booked up quickly, and only releases its tickets up to 30 days in advance of travel.

This tutorial covers automating the checking of availability for a particular journey of interest to me (a single journey on the Touristic Dogu Express from Kars to Ankara) in the next 30 days.  The intention is that, until I am ready to book my train tickets, the system keeps checking the website as more tickets are released and notifies me which berths are available, so I don’t miss out on a chance to book.

The method required for this is to automate web clicking functions in a web browser; in this tutorial this is going to be carried out using Python and Selenium on a Windows 11 computer. 

2. Choosing a Web Driver to use with Selenium

In order to use Selenium, we need a driver to control the web browser we select, and the main options available are shown in Table 1.

Web browser | Driver
Chrome | https://sites.google.com/chromium.org/driver/
Edge | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox | https://github.com/mozilla/geckodriver/releases
Safari | https://webkit.org/blog/6900/webdriver-support-in-safari-10/

Table 1: Common Web Browsers and their respective Driver downloads for use with Selenium

Of these, this tutorial will use the Firefox browser for automation, and therefore install its driver ‘Geckodriver’. 

3. Installing Geckodriver to use with Firefox

The Mozilla Firefox geckodriver can be found on its releases page, https://github.com/mozilla/geckodriver/releases (shown in Figure 1).

Figure 1: List of geckodriver options – choose ‘Show all 13 assets’ to get the full list

If you click ‘Show all 13 assets’ (Figure 1) to look at more Geckodriver downloads, you will see geckodriver-v0.34.0-win64.zip as well, which is what I required for my computer setup.  When you download that geckodriver file, it will be zipped.  Extract the .zip file to anywhere you wish on your computer.  When unzipped, note the full path to your geckodriver.exe file, so you can point to this later in your Python Selenium code e.g. in a Windows environment:

# full local path to your geckodriver.exe

C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe
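Before moving on, it can be worth confirming that the path you noted really points at a file, since a mistyped path only shows up later as a less obvious Selenium error.  The following is a minimal sketch using Python's standard pathlib module; the function name geckodriver_path_is_valid is my own illustration, and the path is the placeholder used in this tutorial rather than a real location.

```python
from pathlib import Path


def geckodriver_path_is_valid(path_string):
    """Return True if path_string points at an existing file."""
    return Path(path_string).is_file()


# placeholder location; replace with the path noted on your own computer
geckodriver_location = r"C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe"

if not geckodriver_path_is_valid(geckodriver_location):
    print("geckodriver.exe not found at: {}".format(geckodriver_location))
```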

4. Installing Python bindings for Selenium

Assuming that you have Python installed on your computer, and also (recommended) that you have set up and activated a Python virtual environment, you are ready to install the Python bindings for Selenium:

# if you are installing via the command line:

pip install selenium

# if you are installing within a Python Jupyter notebook, the '!' is required:

!pip install selenium

Note that the version of Python Selenium installed and used in this tutorial is 4.17.2.  You can check your own version by running a command in the terminal, or from within a Python module or Jupyter notebook.  This is important because many of the Selenium methods were rewritten from version 4.10 onwards, so some of the tutorial examples available on the internet are now out of date.

# checking Python selenium version from the terminal (e.g. within Pycharm)

python -c "import selenium; print(selenium.__version__)"

Within a Python module:

# within a Python module

import selenium

print(selenium.__version__)
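If you would like the script itself to warn you about an older installation, the version string can be compared against a minimum.  This is only a sketch: version_tuple and is_at_least are hypothetical helper names, and the (4, 10) threshold is simply the version mentioned above.

```python
def version_tuple(version_string):
    """Turn '4.17.2' into (4, 17, 2), stopping at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break
    return tuple(parts)


def is_at_least(version_string, minimum=(4, 10)):
    """Return True if version_string is at or above the minimum version."""
    return version_tuple(version_string) >= minimum


print(is_at_least("4.17.2"))  # True
print(is_at_least("4.9.1"))   # False
```

In your own script you would pass selenium.__version__ to is_at_least instead of a literal string.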

5. Web Site and its Automation

The website we will be checking for train tickets is the TCDD website; this is described further at section 5.1 and following.

5.1 Overview of the Website to be Automated

Figure 2: Main Page for the Turkish Railways website TCDD (English version) to be automated with Selenium

The main page of the Turkish Railways website TCDD shown in Figure 2 can be found here.  The page defaults to Turkish, so you have to press the link labelled ‘English’ on the top right of the main page to change language.  On the front page of this website, the following defaults are relevant for the subsequent automation steps:

  • Departure Date is prefilled with today’s date
  • Number of Passengers is prefilled with the default number of ‘1’
  • The default journey search (radio button) is ‘One Way’

5.2 Overview of desired website automation actions

The actions I would like to automate are as follows:

  • Searching from today’s date, run a search 31 times, up to 30 days from today
  • Choose a single journey from Kars to Ankara Gar
  • For two passengers
  • Pressing the search button, which should (if train listings are available) take you to the next page, page 2 (search results).  If no train results are available, you will stay on the main page. 
  • From page 2 (search results), retrieve availability for the Tourist Dogu Express Train (TURİSTİK DOĞU EKS) only, not the regular Dogu Express Train (DOĞU EKSPRESİ)
  • This search on a given date may give rise to: a) a listing for the Tourist Dogu Express train, but not the regular Dogu Express train; b) conversely, a listing for the regular train, but not the Tourist train; c) a listing for both types of train; or d) no train listing at all.

5.3 Set Selenium Imports and Driver settings

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

Below are the various settings required to set up the Firefox webdriver, and point it towards the train website: 

# set the driver to find the correct geckodriver on your computer
# (a raw string, r"...", stops Python treating the backslashes as escapes)
geckodriverpath = r"C:\some-folder\geckodriver-v0.34.0-win64\geckodriver.exe"

# set the Turkish train website for the driver
train_website = "https://ebilet.tcddtasimacilik.gov.tr/view/eybis/tnmGenel/tcddWebContent.jsf"

Using the geckodriverpath and train_website, the driver and wait objects are created:

def set_driver_and_set_wait():
    """
    function to declare the driver (based on the website you are automating)
    and wait object
    :return: driver, wait
    """
    # https://stackoverflow.com/questions/76802588/python-selenium-unexpected-keyword-argument-executable-path
    driver_service = Service(executable_path=geckodriverpath)
    # Set up the Firefox WebDriver for Python Selenium in headless mode
    # (the options.headless attribute was removed in recent Selenium 4 releases,
    # so the headless flag is passed to Firefox as an argument instead)
    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options, service=driver_service)
    # set the website for the driver
    driver.get(train_website)
    # set a wait time in seconds for the driver (this will be used in multiple places)
    wait = WebDriverWait(driver, 10)
    return driver, wait
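A headless browser that is never closed keeps running in the background, so it can help to guarantee that the driver quits even when something goes wrong part way through a run.  One way to do this, sketched below, is a small context manager; managed_browser is a hypothetical name of my own, and it accepts any function that returns a (driver, wait) pair, such as set_driver_and_set_wait above.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(create_driver_and_wait):
    """
    Call the supplied factory, hand back (driver, wait), and always
    quit the driver afterwards, even if an exception was raised.
    """
    driver, wait = create_driver_and_wait()
    try:
        yield driver, wait
    finally:
        driver.quit()


# usage, assuming set_driver_and_set_wait is defined as above:
# with managed_browser(set_driver_and_set_wait) as (driver, wait):
#     ...  # carry out the searches here
```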

5.4 Set the desired journey start and endpoint

As set out in section 5.2, we are looking for a single journey from Kars to Ankara Gar.  Therefore we need to find the elements on the main search page which are the “From” (“Nereden”) and “To” (“Nereye”) text boxes, and insert “Kars” and “Ankara Gar” into these boxes respectively.

5.4.1 Locating the origin and destination boxes by element name

First, find the names of the origin and destination boxes in the page source by right-clicking on the web page and selecting Inspect (Q), as shown in Figure 3:

Figure 3: Right click on webpage and choose Inspect (Q) in Firefox browser to show webpage source code

Clicking inspect and moving around the webpage will highlight the code which corresponds to page elements.  In this case, the element with the name “nereden” (“From”) is located in the webpage source code shown in Figure 4:

Figure 4: Looking at webpage source code elements using inspect to look at element ids, names

The code from the elements relating to “nereden” can be copied by right clicking on the bottom pane where the code is, and there are various copying options: Inner HTML, Outer HTML, CSS Selector, CSS Path, XPath.  These options (particularly the Outer HTML and the XPath), when copied can be used to locate elements via name, ID, xpath, and then called via Selenium.  

Figure 5: Copying of elements from source code (HTML, CSS, XPath) – to locate element id, name or xpath

The use of ‘wait’ ensures that the elements we’re locating have loaded on the web page.  If an element does not become visible within the wait time (10 seconds here), Selenium will raise a TimeoutException.

# our desired journey starting point: in this case, Kars, in the East of Turkey
FROM_input_box = wait.until(EC.visibility_of_element_located((By.NAME, "nereden")))

# our desired destination, Ankara Gar station
TO_input_box = wait.until(EC.visibility_of_element_located((By.NAME, "nereye")))

5.4.2 Typing the origin and destination stations to their boxes on the web page

We have previously defined the  FROM_input_box and TO_input_box elements on the webpage; using the send_keys method sends whatever text you need to send to those boxes.  Here we send the origin and destination railway stations:

# Type 'Kars' into the "From" box
FROM_input_box.send_keys("Kars")

# Type 'Ankara Gar' into the "To" box
TO_input_box.send_keys("Ankara Gar")

5.5 Changing website frontend language

To assist with visualizing the automation of the website, which is primarily a Turkish language website, you can change the website language to English (note that this does not change the names of the elements in the web page source code).

def change_language(wait):
    """
    function to change the language of the frontend from Turkish to English
    :return:
    """
    english_button = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "English")))
    # Click the English button (this language change is to assist non-Turkish speakers
    # with viewing the automated use of the website)
    english_button.click()

5.6 Overriding the Default Passenger Number

The number of passengers on the main search page is prefilled with the default number of ‘1’.  This default number must first be removed, before replacing it with the desired number of passengers in the passenger number box.  First, the clickable passenger number box is identified and named as a Python object.  Then an ‘action chain’ is defined to operate on that ‘passenger_number’ object, which carries out a series of actions.

First, a double click is carried out on the object, as this selects ‘1‘ on the first click, and on the second click it highlights it in full before sending a delete command.  Following the delete command, the send_keys command is sent with the updated number of passengers. 

# number of passengers found by ID as a clickable element
passenger_number = wait.until(EC.element_to_be_clickable((By.ID, "syolcuSayisi")))

# queue up these actions in a chain; nothing runs until perform() is called
actions = ActionChains(driver)
actions.move_to_element(passenger_number)

# clicking spinner button twice selects and then highlights the '1' first
actions.click(passenger_number)
actions.click(passenger_number)

# delete selected default passenger number of '1'
actions.send_keys(Keys.DELETE)

# send "2" to indicate the number of passengers is now "2"
actions.send_keys("2")
actions.perform()

# This is an alternative, more compact way to run the same action chain as above
# (use one form or the other, not both)
ActionChains(driver).move_to_element(passenger_number).click(passenger_number).click(passenger_number).send_keys(Keys.DELETE).send_keys("2").perform()
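Selenium elements also have a clear() method, which empties a text input directly.  I have not verified whether clear() behaves well with this particular passenger box, so treat the following purely as a sketch of a different technique from the click-and-delete chain above; replace_field_text is a hypothetical helper name.

```python
def replace_field_text(element, new_text):
    """
    Empty an input element and type new_text into it.
    element is expected to be a Selenium WebElement (or any object
    offering clear() and send_keys()).
    """
    element.clear()
    element.send_keys(str(new_text))


# usage, assuming passenger_number was located as shown earlier:
# replace_field_text(passenger_number, 2)
```

If the box ignores clear(), which some scripted widgets do, the action chain shown earlier remains the approach used in this tutorial.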

5.7 Setting and Incrementing the Search Date

From Figure 2, it can be seen that the main website page (English version) has a default outbound travel date of whatever today’s date is.  The intention is to carry out a search for specific trains for all dates from today’s date, up to 30 days from now.  Therefore, we will be incrementing the date from today for each day up to today+30 days. 

5.7.1 Incrementing the date

The search for a one-way train ticket from Kars to Ankara is to be iterated from today, every day until 30 days from today.  Therefore Python datetime can be used [ref] https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior [/ref] to iterate over those dates.  We can generate a string for any day with reference to today’s date as follows:

from datetime import datetime, timedelta

# increment the number of days (0 means today)
number_of_days_from_now = timedelta(days=0)
todays_date = datetime.today()
future_date = number_of_days_from_now + todays_date

# convert to string after addition
future_date_string = future_date.strftime("%d.%m.%Y")
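Extending this to the full search window, the same timedelta arithmetic can produce every date string in one go.  The sketch below uses a hypothetical helper, search_date_strings, and the same day.month.year format string as the code above; it returns 31 strings, covering today plus each of the next 30 days.

```python
from datetime import datetime, timedelta


def search_date_strings(start_date, days_ahead=30):
    """Return date strings from start_date to start_date + days_ahead inclusive."""
    return [
        (start_date + timedelta(days=offset)).strftime("%d.%m.%Y")
        for offset in range(days_ahead + 1)
    ]


date_strings = search_date_strings(datetime.today())
print(len(date_strings))  # 31
```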

5.7.2 Finding and Setting the Outward Travel Date

As with the number of passengers, the default outbound travel date is set to today’s date.  Therefore, the date widget must be found and its value fully selected, before being deleted and overridden with the desired outbound date [ref] https://stackoverflow.com/questions/69690674/how-to-override-the-default-input-field-value-using-selenium-and-python [/ref].  This is shown by the code below:

# trCalGid is the ID of the outwards date, in the format: 12.02.24
# CHANGE THE SEARCH DATE for the outward leg (from)
date_widget = wait.until(EC.element_to_be_clickable((By.ID, "trCalGid")))
date_actions = ActionChains(driver)
date_actions.move_to_element(date_widget)

# three clicks moves over the entire date of format 12.02.24
date_actions.click(date_widget)
date_actions.click(date_widget)
date_actions.click(date_widget)

# fourth click selects whole date
date_actions.click(date_widget)

# clears default date
date_actions.send_keys(Keys.DELETE)

# sets new date: new_date_string is a date string built as in section 5.7.1,
# taking a new value for each date in your range
date_actions.send_keys(new_date_string)
date_actions.perform()

# click away from the date picker to close it
# by clicking on an arbitrary section elsewhere
# section to click to is the main 'intro' section on the site
outside_element = driver.find_element(By.ID, "intro")
outside_element.click()

5.8 Finding and Pressing ‘Search’

Once the number of passengers, origin and destination station and desired travel date are set, then the search button must be located and pressed.  This can be done via the following Selenium code:

# find the search button 'btnSeferSorgula' and click it
search_button = wait.until(EC.element_to_be_clickable((By.ID, "btnSeferSorgula")))
search_actions = ActionChains(driver)
search_actions.move_to_element(search_button)
search_actions.click(search_button)
search_actions.perform()

6. Processing Train Search Results

As stated above, searching for a Kars to Ankara train on a given date may give rise to any of the following combinations:

6.1 Search results scenarios

Scenario 1: search results yield a listing for the Tourist Dogu Express train, but not the regular Dogu Express train.

In this scenario 1, there is a single row in the table, which is the Tourist Dogu Express train only:

Figure 6: Search results – listing for the Tourist Dogu Express train but not the regular Dogu Express train

Scenario 2: search results yield a listing for the regular Dogu Express train, but not the Tourist Dogu Express train. 

In this scenario 2, there is a single row in the table (the Dogu Express only):

Figure 7: Search provides listing for the regular Dogu Express train, but not the Tourist Dogu Express train

Scenario 3: search results yield a listing for both types of train:

In this scenario 3, there are two rows in the table:

Figure 8: Search results – listing for both the regular Dogu Express train and the Tourist Dogu Express train

Scenario 4: search results yield No train listing at all (in which case you do not leave the main search page, but get an information bubble instead):

Figure 9: Search results – no train listings found at all on search date

Based on scenarios 1, 2, and 3 (all of which yield search results), it will be noted that the results are a table with either one or two rows.  Based on scenario 4, there will be no table available with search results at all.
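Once the train names have been read from the results table, the four scenarios above can be told apart with a few lines of plain Python.  This is a sketch under two assumptions: that the names appear as in section 5.2 (containing “TURİSTİK” for the Tourist train and “DOĞU” for both trains), and that an empty list stands for “no results page shown”.  The function name classify_search_result is hypothetical.

```python
def classify_search_result(train_names):
    """
    train_names: list of train name strings found in the results table
    (an empty list when no results page was shown).
    Returns the scenario number, 1 to 4, as described in section 6.1.
    """
    has_tourist = any("TURİSTİK" in name for name in train_names)
    has_regular = any(
        "TURİSTİK" not in name and "DOĞU" in name for name in train_names
    )
    if has_tourist and has_regular:
        return 3
    if has_tourist:
        return 1
    if has_regular:
        return 2
    return 4
```

Only scenarios 1 and 3 contain the train of interest, so a checking script could skip any date for which the result is 2 or 4.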

6.2 Checking that Search Results Page has loaded

In order to check that the search results page has loaded, I am choosing an element which should always be present on a loaded page: the column header “Tren Adı”.  Therefore the code includes a wait until this element has loaded, using Selenium’s expected conditions together with wait.until, as shown below.

Figure 10: Search results page showing Tren Adı column name showing that search results have been loaded

Looking at the search page and copying the outer HTML code,  Tren Adı column header has an id referred to below:

<th id="mainTabView:gidisSeferTablosu:j_idt78" class="ui-state-default" role="columnheader" style="text-align:center;"><span>Tren Adı</span></th>

The website code in the console shows the Tren Adı column header in more detail:

Figure 11: Search results page showing Tren Adı column name in web page source code via console

If the search results page has not loaded (because no search results have been returned at all, i.e. scenario 4, or for some other reason), waiting for the expected condition of finding Tren Adı will give rise to a timeout exception.  This exception must be caught: 

from selenium.common.exceptions import TimeoutException

def check_search_results_loaded(wait):
    """
    run a check to see if the results page has loaded (will load if trains found)
    need to do a wait until the column heading called Tren Adı has loaded
    :return:
    """
    # id for the Tren Adı column name
    tren_adi_id = "mainTabView:gidisSeferTablosu:j_idt78"
    wait.until(EC.visibility_of_element_located((By.ID, tren_adi_id)))

### calling the check_search_results_loaded function
try:
    # check whether any results loaded
    check_search_results_loaded(wait)
    print("search results page loaded")
except TimeoutException:
    # no results page appeared within the wait time, so close the browser
    driver.quit()

6.3 Locating Elements from Search Results

On the search result page, each result occupies a row in the table.   The ids and element names to be used by Python Selenium can be cross checked against the browser view. 

6.3.1 Elements in the table first row results

Figure 12 shows the page elements for the results for the first row, to be located by Python Selenium

Figure 12: Search results table first row (index 0) to be located by Python Selenium

Inner HTML view for the webpage source code for the top row of the table of the search results is shown below:

<div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-left ui-state-active"><input id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi:0" name="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi" type="radio" value="1" class="ui-helper-hidden" checked="checked"><span class="ui-button-text ui-c">Standart</span></div><div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-right ui-state-disabled"><input id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi:1" name="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi" type="radio" value="2" class="ui-helper-hidden" disabled="disabled"><span class="ui-button-text ui-c">Esnek</span></div>

X-path for the top row:

//*[@id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:soBiletTipi"]

Note that both the X-path and the id for the first row have mainTabView:gidisSeferTablosu:0: in them.

6.3.2 Elements in the table second row results

Figure 13: Looking at webpage sourcecode and ID for second row to locate with Selenium Python

Inner HTML view for the second row of the search results table:

<div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-left ui-state-active"><input id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi:0" name="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi" type="radio" value="1" class="ui-helper-hidden" checked="checked"><span class="ui-button-text ui-c">Standart</span></div><div class="ui-button ui-widget ui-state-default ui-button-text-only ui-corner-right ui-state-disabled"><input id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi:1" name="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi" type="radio" value="2" class="ui-helper-hidden" disabled="disabled"><span class="ui-button-text ui-c">Esnek</span></div>

X-path for the bottom row:

//*[@id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:soBiletTipi"]

Comparing the X-path and id of the first table row with those of the second table row, the two are identical except for the first number (which must relate to the table row index): the first row has :0: in it, and the second row has :1:.

Therefore, for multiple rows in the table, we iterate through the result rows, and also handle any rows which do not exist.
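Because only the row index changes, the X-path for any row can be built from a single template.  The sketch below uses the soBiletTipi id shown above; ROW_ID_TEMPLATE and row_xpath are hypothetical names chosen for illustration.

```python
ROW_ID_TEMPLATE = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:soBiletTipi"


def row_xpath(row_no):
    """Build the XPath for the ticket-type element in a given table row."""
    return '//*[@id="{}"]'.format(ROW_ID_TEMPLATE.format(row_no))


# print the XPaths for the first two rows (index 0 and 1)
for row_no in range(2):
    print(row_xpath(row_no))
```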

6.4 Iterating through Table row and handling non-existent page elements

From the search result scenarios 1-4 (and also variability in how the elements in a page are expressed), it can be seen that a search result page element may in fact be missing, when we search for the element Xpaths or id in question.  Therefore this will cause a Python error (an exception) which must be handled, should it arise. 

def check_results_table_row(driver, row_no):
    """
    row_no will be iterated through at least 0, 1
    For each row (iterate row_no), locate the row element by its xpath.
    For comparison, the train type label xpaths follow the same pattern:
    top row xpath
    //*[@id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:j_idt81"]
    second row xpath
    //*[@id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:j_idt81"]
    :return: the text of the row element, or None if the row does not exist
    """
    # ensure row_no is a string
    row_no = str(row_no)
    # substitute row number into the table_row at '{}'
    table_row = '//*[@id="mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:cbGidisSeferInfo"]'.format(row_no)
    try:
        table_xpath_row = driver.find_element(By.XPATH, table_row)
    except NoSuchElementException:
        # this row is not in the results table (e.g. only one train is listed)
        return None
    return table_xpath_row.text

6.5 Locate Train Type Text Elements

The id of the train type label for each row can be seen in the HTML below (row index 0, then row index 1):

<label id="mainTabView:gidisSeferTablosu:0:seferBilgileriDataList:0:j_idt81" class="ui-outputlabel" style="font-weight:bold;font-size:12px;"> : DOĞU EKSPRESİ </label>

<label id="mainTabView:gidisSeferTablosu:1:seferBilgileriDataList:0:j_idt81" class="ui-outputlabel" style="font-weight:bold;font-size:12px;"> : TURİSTİK DOĞU EKS.</label>

Using Python Selenium to locate the page element by id:

def get_train_type(driver, row_no):
    """
    :param driver: Selenium driver object set up
    :param row_no: row number in the table (converted to a string below)
    :return:
    """
    row_no = str(row_no)
    train_type_id = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:j_idt81".format(row_no)
    mainline_id = "mainTabView:gidisSeferTablosu:{}:seferBilgileriDataList:0:j_idt80".format(row_no)
    try:
        mainline_id_row = driver.find_element(By.ID, mainline_id)
        print("mainline_id_row label text {}".format(mainline_id_row.text))
        train_type_id_row = driver.find_element(By.ID, train_type_id)
        print("train_type_row label text {}".format(train_type_id_row.text))
    except NoSuchElementException:
        # the row (or one of its labels) is not present on the page
        pass
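The label text includes a leading “ : ”, as the HTML above shows, so it helps to tidy the text before deciding which train a row refers to.  The following is a minimal sketch that assumes the Tourist train label always begins with “TURİSTİK” once tidied; clean_train_label and is_tourist_train are hypothetical helper names.

```python
def clean_train_label(label_text):
    """Strip the leading ' : ' and surrounding spaces from a label's text."""
    return label_text.strip().lstrip(":").strip()


def is_tourist_train(label_text):
    """Return True if the label text refers to the Tourist Dogu Express."""
    return clean_train_label(label_text).startswith("TURİSTİK")


print(is_tourist_train(" : TURİSTİK DOĞU EKS."))  # True
print(is_tourist_train(" : DOĞU EKSPRESİ "))      # False
```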

6.6 Locate Berth Availability Text Elements

def check_berth_availability(driver, row_no):
    """
    function to check the berth availability for a given row of the results table
    :return:
    """
    row_no = str(row_no)
    # row_no string is inserted into berth_availability_id via {}
    berth_availability_id = "mainTabView:gidisSeferTablosu:{}:j_idt109:0:somVagonTipiGidis1_label".format(row_no)
    try:
        berth_avail_id_row = driver.find_element(By.ID, berth_availability_id)
        berth_availability_text = berth_avail_id_row.text
        print("berth availability {}".format(berth_availability_text))
    except NoSuchElementException:
        # no berth availability element exists for this row
        pass

7. Conclusion

This tutorial has covered the automation of user interaction with web pages via Python Selenium.  Further steps following this tutorial could be as follows:

  • Scheduling the running of the Python checking script on a regular basis, via Windows (using a Windows Batch file);
  • Sending an email with the summary information which has been gathered by the script
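As a starting point for the email step, the summary can be assembled with Python's standard email library before any sending is attempted.  This is only a sketch: build_summary_email is a hypothetical helper, the subject line is my own wording, and the sender and recipient addresses are left for you to supply.  Sending is deliberately left out, as it depends on your mail provider.

```python
from email.message import EmailMessage


def build_summary_email(sender, recipient, availability_by_date):
    """
    availability_by_date: dict mapping a date string to the berth
    availability text gathered by the script for that date.
    """
    message = EmailMessage()
    message["Subject"] = "Touristic Dogu Express availability"
    message["From"] = sender
    message["To"] = recipient
    lines = [
        "{}: {}".format(date_string, text)
        for date_string, text in availability_by_date.items()
    ]
    message.set_content("\n".join(lines) or "No availability found.")
    return message


# sending is not shown; one option is smtplib.SMTP(...).send_message(message)
# once you have the server details for your own mail provider
```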

About the Author

Dr Joanne Kitson is an experienced contract Senior Data Scientist and Python programmer with a research background and a PhD as an Electrical & Electronic Engineer. She runs a website called School for Engineering (https://schoolforengineering.com). You can find her on LinkedIn at www.linkedin.com/in/joannekitson.

Featured Image Photo by Josh Nezon on Unsplash
