Scraping Tables

fletcher5555 · February 13, 2020, 4:47pm

Hi All

The help on here has been great already and I’m really enjoying learning how to code.

I am trying to scrape a table from a website in order to pull the data into a CSV file. I can collect the table headers, but only one row of data is returned even though I am using find_all.

Can anybody give me some advice on how I can obtain all of the rows of data from all pages, please?

import requests
from bs4 import BeautifulSoup
import lxml

url="https://www.zoopla.co.uk/house-prices/browse/merseyside/st-helens/?q=st%20helens&results_sort=newest_listings&search_source=home&pn=47"

html_content = requests.get(url).text

soup = BeautifulSoup(html_content, "lxml")

gdp_table = soup.find("table", attrs={"class": "browse-table"})
gdp_table_headings = gdp_table.find_all("tr")
gdp_table_data = gdp_table.find_all("tr")

headings = []
for th in gdp_table_headings[0].find_all("th"):
    headings.append(th.a.text)

data = []
for td in gdp_table_data[1].find_all("td"):
    data.append(td.text)

print(headings)
print(data)

angian00 · February 13, 2020, 7:46pm

I’m not sure it is the cause, but it would be cleaner to find the th in the thead and the td in the tbody, something like

gdp_table_headings = gdp_table.find("thead")
headings = []
for th in gdp_table_headings.find_all("th"):
    headings.append(th.a.text)

gdp_table_data = gdp_table.find("tbody")
...

(If I remember correctly, find_all searches all descendants rather than only direct children, so you don’t need the array indexing gdp_table_headings[0] anyway.)
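
For what it's worth, the code in the question only reads the cells of gdp_table_data[1], which is a single tr, so only one row can ever come back. Below is a minimal sketch of looping over every row in the tbody instead. It runs against a small made-up HTML string rather than the live page, so SAMPLE_HTML, its column names and values, and the helper names parse_table and to_csv are all assumptions for illustration; the real markup may differ (for example, it may have no explicit thead or tbody).

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table class="browse-table">
  <thead>
    <tr><th><a href="#">Street</a></th><th><a href="#">Avg. price</a></th></tr>
  </thead>
  <tbody>
    <tr><td>Abbey Road</td><td>120000</td></tr>
    <tr><td>Baker Street</td><td>95000</td></tr>
  </tbody>
</table>
"""


def parse_table(html):
    """Return (headings, rows) for the first table with class browse-table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "browse-table"})

    # Headings: every th inside the thead.
    headings = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

    # Data: one list of cell texts per tr in the tbody, not just a single row.
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return headings, rows


def to_csv(headings, rows):
    """Serialise the headings and rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(headings)
    writer.writerows(rows)
    return buffer.getvalue()


headings, rows = parse_table(SAMPLE_HTML)
print(to_csv(headings, rows))
```

With the real page you would pass requests.get(url).text to parse_table in place of SAMPLE_HTML. For the other pages, the pn value in the URL looks like a page number, but that is only a guess from the URL; if it holds, the same function could be called once per page and the rows appended together.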