How to Get a Web Element from HTML Source With Selenium


Augustas Pelakauskas
HTML source code is the blueprint of every website, and web elements are its building blocks. Python is arguably the most convenient programming language for extracting web data (web scraping) embedded within HTML code.
To go beyond basic Python web scraping, frameworks such as Playwright, Puppeteer, and Selenium are often used to handle more complex browsing tasks, including JavaScript rendering.
Let's go over the basics of getting web elements from an HTML source using Selenium with Python.
HTML source is the raw HTML code that makes up a web page. Web browsers interpret HTML code to display websites' content, structure, and formatting.
The HTML source includes:
HTML tags like <div>, <p>, <a>.
Text content that appears on the page.
Attributes within tags that provide additional information.
References to external resources like CSS files, JavaScript files, and images.
Metadata about the document in the <head> section.
You can view the HTML source of any web page by right-clicking on it and selecting View page source in your browser. Web developers work directly with HTML source code to create and modify websites.
[Image: HTML source for oxylabs.io]
A web element is any HTML tag or part of a web page you can interact with – anything you find on a web page is a web element.
Some common web elements include:
Clickable buttons
Fillable text fields
Links that navigate to other pages
Images that display visual content
Dropdown menus
Navigation bars
Headers and footers
Sliders and carousels
Windows and popups
Web elements are defined using HTML code (for structure), styled with CSS (for appearance), and can be manipulated with JavaScript (for behavior). Identifying and working with these HTML elements is essential for creating interactive and functional websites in web development and testing.
[Image: An HTML web element for a button]
To follow along, you'll need:
Python 3.6 or higher
Selenium package
Selenium Wire package for proxies
Selenium WebDriver for your preferred browser
Install Selenium:
pip install selenium selenium-wire
Download the appropriate Selenium WebDriver for your browser:
Chrome: ChromeDriver
Firefox: GeckoDriver
Edge: EdgeDriver
Let’s use ChromeDriver.
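As a quick sanity check of your setup, the minimal sketch below launches Chrome, loads a page, and prints its title. Note that Selenium 4.6+ ships with Selenium Manager, which resolves the matching ChromeDriver automatically; the target URL is just an example.
from selenium import webdriver

# Launch Chrome (Selenium Manager fetches the driver on Selenium 4.6+)
driver = webdriver.Chrome()
driver.get("https://example.com")
print(driver.title)  # e.g., "Example Domain"
driver.quit()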
Let's integrate proxies with Selenium to conceal your actual IP address and bypass CAPTCHA. Proxies enable you to spread HTTP requests across multiple proxy IPs, enhancing web scraping efficiency. They also allow you to access content that is restricted to specific regions.
As a rule, the best proxy providers are always paid services. Luckily, they tend to offer free proxies for testing purposes.
While Selenium enables proxies through the --proxy-server= parameter, it doesn't allow proxy authentication. For this reason, use Selenium Wire, an unofficial third-party package that extends Selenium's Python bindings and enables you to use authenticated proxies.
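For reference, here's a minimal sketch of the unauthenticated route through plain Selenium's --proxy-server argument; the proxy address below is a placeholder.
from selenium import webdriver

# Route traffic through an unauthenticated proxy (placeholder address)
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://127.0.0.1:8080")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
driver.quit()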
Let’s set up a Selenium WebDriver to use a proxy server and add Oxylabs Residential Proxies.
To get an element's HTML using Selenium and Python, first locate the element, then call its get_attribute('outerHTML') or get_attribute('innerHTML') method.
The difference between outerHTML and innerHTML:
outerHTML includes the element itself and all its content
innerHTML includes only the content inside the element
Remember that the choice between outerHTML and innerHTML depends on your specific needs – whether you need the complete element with its tags or just the content within.
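A self-contained sketch makes the difference concrete. It loads a tiny inline page via a data: URL, so no external site is needed:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# A tiny inline page: <div id="box"><p>Hello</p></div>
driver.get("data:text/html,<div id='box'><p>Hello</p></div>")
element = driver.find_element(By.ID, "box")
print(element.get_attribute("outerHTML"))  # <div id="box"><p>Hello</p></div>
print(element.get_attribute("innerHTML"))  # <p>Hello</p>
driver.quit()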
Similarly, various locators offer flexibility in targeting elements on a page:
ID
CSS selector
XPath
tag name
You can use different locators to find an element:
# Find by ID
element = driver.find_element(By.ID, "my-element-id")
# Find by CSS selector
element = driver.find_element(By.CSS_SELECTOR, ".my-class")
# Find by XPath
element = driver.find_element(By.XPATH, "//div[@class='my-class']")
# Find by tag name
element = driver.find_element(By.TAG_NAME, "div")
Here's how to do it:
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
# Proxy details
USER = "PROXY_USERNAME"
PASS = "PROXY_PASSWORD"
SERVER = "pr.oxylabs.io:7777"
# Create a driver with the proxy dict
driver = webdriver.Chrome(
    seleniumwire_options={
        "proxy": {
            "http": f"http://customer-{USER}:{PASS}@{SERVER}",
            "https": f"https://customer-{USER}:{PASS}@{SERVER}",
        }
    }
)
# Send a web request to a specific URL
driver.get("https://example.com")
# Find the element you want to get the HTML from
element = driver.find_element(By.ID, "my-element-id") # or use other locators
# Get the outer HTML (including the element itself)
outer_html = element.get_attribute('outerHTML')
print("Outer HTML:", outer_html)
# Get the inner HTML (only the content inside the element)
inner_html = element.get_attribute('innerHTML')
print("Inner HTML:", inner_html)
# Close the driver
driver.quit()
NOTE: Using hardcoded credentials in the script is generally not recommended for security reasons.
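A safer pattern is to read the credentials from environment variables. The variable names below are placeholders; set them in your shell beforehand:
import os

# Read proxy credentials from the environment instead of hardcoding them
USER = os.environ["PROXY_USERNAME"]
PASS = os.environ["PROXY_PASSWORD"]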
Get free proxies for small tasks or buy proxies for large-scale web data collection.
Here's the basic process of getting an element's HTML using Java:
Initialize ChromeDriver
Navigate to the target page
Find an element with the ID my-element-id
Retrieve the outerHTML attribute
Retrieve the innerHTML attribute
Print both to the console
Close the WebDriver
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
public class WebElementHtmlRetrieval {
    public static void main(String[] args) {
        // Initialize the WebDriver
        WebDriver driver = new ChromeDriver();
        driver.get("https://example.com");

        // Find the element by its ID
        WebElement element = driver.findElement(By.id("my-element-id"));

        // Get the outer HTML
        String outerHtml = element.getAttribute("outerHTML");
        System.out.println("Outer HTML: " + outerHtml);

        // Get the inner HTML
        String innerHtml = element.getAttribute("innerHTML");
        System.out.println("Inner HTML: " + innerHtml);

        // Close the driver
        driver.quit();
    }
}
The same locators are available in Java:
// Find by CSS selector
WebElement elementByCSS = driver.findElement(By.cssSelector(".my-class"));
// Find by XPath
WebElement elementByXPath = driver.findElement(By.xpath("//div[@class='my-class']"));
// Find by tag name
WebElement elementByTag = driver.findElement(By.tagName("div"));
Whether you're building a web scraper, developing automated tests, or analyzing web content programmatically, understanding how to extract web elements from an HTML source is a fundamental skill.
Selenium provides methods like get_attribute('outerHTML') and get_attribute('innerHTML'), which give you precise control over the HTML you retrieve.
Alternatively, check out competing web data extraction frameworks, such as Playwright and Puppeteer, to see how they compare to Selenium.
The Selenium get HTML process typically involves instantiating a Selenium WebDriver object, navigating to the target URL, and extracting the page_source attribute containing the fully rendered HTML source structure.
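As a sketch, that flow looks like this (the URL is just an example):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")
html = driver.page_source  # the fully rendered HTML of the current page
print(html[:200])          # preview the first 200 characters
driver.quit()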
To extract XPath from an HTML source, use a browser's developer tools. Right-click the element of interest, select Inspect, then right-click the highlighted HTML code in the elements panel and choose Copy → Copy XPath. Alternatively, dedicated XPath tools and browser extensions like ChroPath or XPath Helper can generate XPath expressions automatically from HTML source elements.
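Once copied, the expression plugs straight into find_element. The XPath below is a placeholder for whatever your browser generated:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
# Placeholder XPath - replace with the expression copied from devtools
element = driver.find_element(By.XPATH, "/html/body/div/h1")
print(element.get_attribute("outerHTML"))
driver.quit()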
To retrieve the entire page content using Selenium, access the page source or DOM elements. The most direct approach is to use the page_source property of the WebDriver, which returns the complete HTML source code of the current page. Alternatively, you can locate the <body> element and extract its text content for a cleaner text-only page version.
To extract text from an element using Selenium, utilize the text property after locating the element with an appropriate selector method. This approach returns the visible text content contained within the specified HTML tag.
For example, if you need to extract text from a paragraph element, locate the web element using a selector method, then access its text property.
This method works across all major browsers and is compatible with various Selenium language bindings, including Python, Java, and JavaScript.
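Here's a minimal sketch of that paragraph example, using the <p> tag on example.com for illustration:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
paragraph = driver.find_element(By.TAG_NAME, "p")
print(paragraph.text)  # the paragraph's visible text
driver.quit()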
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.