I am trying to parse this Microsoft support page: https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019
I need to get the two tables under File Information:
- For all supported x86-based versions
- For all supported x64-based versions
However, when I try to use Beautiful Soup, I am not able to parse the table objects because they are inside a script tag.
Any pointers will be appreciated.
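import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019'
# Name the parser explicitly (lxml is in the package list below).
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# The table markup sits inside a <script> tag; index 10 is specific to this page and may change.
data = soup.find_all('script')[10].string
# Match from the x86 heading to the end of that line ('.' does not cross newlines).
match = re.search(r'For all supported x86-based versions of Office 2016(.*)', data)
d = match.group()
# read_html returns a list of DataFrames, one per <table> found in the string.
tables = pd.read_html(StringIO(d))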
packages you’ll need to install:
pandas
requests
html5lib
bs4
lxml
Note: the quotes in the code may not display correctly in this editor. If they show up as curly quotes, replace them with straight quotes before running.
I only did the top link; I’m sure you can get the other one. Good luck.
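For the second table, here is a rough sketch of one way to avoid both the hard-coded script index and the single regex. It assumes the script text contains plain `<table>` markup after each heading, which I have not verified against the live page, and that the x64 heading is worded like the x86 one. The helper names `pick_script` and `table_after_heading` are made up for illustration, and the sample string is invented test input, not data from the page.

```python
import re

# Non-greedy match of one complete table element.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)

HEADINGS = [
    "For all supported x86-based versions of Office 2016",
    "For all supported x64-based versions of Office 2016",  # assumed wording
]


def pick_script(script_texts, heading):
    """Return the first script string that contains `heading`, or None."""
    return next((s for s in script_texts if s and heading in s), None)


def table_after_heading(text, heading):
    """Return the first <table>...</table> fragment found after `heading`, or None."""
    start = text.find(heading)
    if start == -1:
        return None
    match = TABLE_RE.search(text, start + len(heading))
    return match.group(0) if match else None


# Invented stand-in for [s.string for s in soup.find_all('script')].
sample_scripts = [
    None,
    "var unrelated = 1;",
    "<h4>For all supported x86-based versions of Office 2016</h4>"
    "<table><tr><td>a.dll</td></tr></table>"
    "<h4>For all supported x64-based versions of Office 2016</h4>"
    "<table><tr><td>b.dll</td></tr></table>",
]

data = pick_script(sample_scripts, HEADINGS[0])
fragments = {h: table_after_heading(data, h) for h in HEADINGS}
```

Each fragment can then be handed to `pd.read_html(StringIO(fragment))`. If the markup inside the script turns out to be escaped (for example as a JSON string), it would need to be decoded first.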
Way easier to just use JavaScript in the browser console if this is a one-time thing.
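// Collect every table on the page into a real array.
let tables = [...document.querySelectorAll('table')];
// Drop the first table on the page.
tables.shift();
const columns = ['file_identifier', 'file_name', 'file_version', 'file_size', 'date', 'time'];
// For each table, turn every row into an object keyed by column name (the header row becomes null).
let x = tables.map((table) => [...table.querySelectorAll('tr')].map((tr, i) => {
  if (i === 0) return null;
  return [...tr.children].reduce((a, b, i) => {
    return {...a, [columns[i]]: b.innerText};
  }, {});
}));
console.log(x);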
I would just post the JSON in here, but then the mod police would pull me over.
Be sure to click both links before running this code. The tables must be added to the DOM first.
The first solution worked for me, but I still need to parse the data I get from the regex. Thanks a lot.
I might have to think about creating a more generic solution, as there are lots of pages I need to parse.
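For the remaining parsing step, a minimal sketch using only the standard library might look like the following. It assumes the matched text is ordinary table HTML whose first row is a header, and it reuses the column names from the JavaScript answer. `TableRows` and `rows_to_records` are hypothetical names, and the sample row is invented, not taken from the page.

```python
from html.parser import HTMLParser

COLUMNS = ['file_identifier', 'file_name', 'file_version', 'file_size', 'date', 'time']


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> in the HTML fed to the parser."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def rows_to_records(html, columns):
    """Return one dict per data row, skipping the first (header) row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [dict(zip(columns, row)) for row in parser.rows[1:]]


# Invented sample input, shaped like a six-column file-information table.
sample = (
    "<table>"
    "<tr><th>File identifier</th><th>File name</th><th>File version</th>"
    "<th>File size</th><th>Date</th><th>Time</th></tr>"
    "<tr><td>example.id</td><td>example.dll</td><td>1.0.0.0</td>"
    "<td>1,024</td><td>01-Jan-00</td><td>00:00</td></tr>"
    "</table>"
)

records = rows_to_records(sample, COLUMNS)
```

Because it only depends on `<tr>`, `<td>`, and `<th>` tags, the same function could be pointed at table markup from other pages, as long as the column list is adjusted to match each table.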