Need help parsing a page using Python

I am trying to parse this Microsoft support page: https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019

I need to get the two tables under File Information:

  1. For all supported x86-based versions
  2. For all supported x64-based versions

However, when I try to use Beautiful Soup, I am not able to parse the table object because it sits inside a script tag rather than in the regular page markup.

Any pointers will be appreciated.

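import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019'
# Name the parser explicitly so bs4 does not have to guess (and warn).
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# The table markup sits inside a script tag; index 10 is specific to this page.
data = soup.find_all('script')[10].string
# Grab everything from the x86 heading to the end of that line.
match = re.search(r'For all supported x86-based versions of Office 2016(.*)', data)
d = match.group()
# Wrap in StringIO: newer pandas versions deprecate passing literal HTML strings.
tables = pd.read_html(StringIO(d))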

packages you’ll need to install:
pandas
requests
html5lib
bs4
lxml

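Note: this editor may not display quote characters correctly. If the quotes in the code show up as curly quotes, replace them with plain straight quotes before running it.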

I only did the top link, I’m sure you can get the other one. GL.
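For anyone wondering why a plain search for table tags misses these tables: an HTML parser treats everything inside a script tag as raw text, not as elements. The sketch below shows that with the standard library's html.parser standing in for Beautiful Soup. The page string is a made-up miniature, not the real Microsoft markup, and TagCounter is a hypothetical helper name.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by name, exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


# Made-up page: one table hidden in script text, one in the normal markup.
page = (
    "<html><body>"
    "<script>var html = '<table><tr><td>hidden</td></tr></table>';</script>"
    "<table><tr><td>visible</td></tr></table>"
    "</body></html>"
)

counter = TagCounter()
counter.feed(page)
counter.close()

# Only the table outside the script tag is reported as an element.
print(counter.counts.get("table", 0))
```

That is why the code above pulls the script text out as a string first and only then hands it to a table parser.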


Way easier to just use JavaScript in the browser console if this is a one-time thing.

// Grab every table currently in the DOM (the spread needs its closing bracket).
let tables = [...document.querySelectorAll('table')];
// Skip the first table on the page.
tables.shift();

const columns = ['file_identifier', 'file_name', 'file_version', 'file_size', 'date', 'time'];

// One array of row objects per table; slice(1) skips each table's header row.
let x = tables.map((table) => [...table.querySelectorAll('tr')].slice(1).map((tr) => {
   return [...tr.children].reduce((a, b, i) => {
     return {...a, [columns[i]]: b.innerText};
   }, {});
}));

console.log(x);

I would just post the JSON in here, but then the mod police would pull me over.

Be sure to click both links before running this code. The tables must be added to the DOM first.

The first solution worked for me, but I still need to parse the data I get back from the regex. Thanks a lot.
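For that remaining parsing step, here is a rough sketch that turns one table fragment into a list of dicts keyed by the header row, using only the standard library. It assumes the string captured by the regex is plain HTML; if the script text is escaped (for example as a JSON string), it would need unescaping first. TableRows, table_to_records and the sample fragment are made up for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text chunks of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(fragment):
    """Turn one HTML table fragment into dicts keyed by the header row."""
    parser = TableRows()
    parser.feed(fragment)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample in the rough shape of a File Information table.
sample = (
    "<table><tr><th>File name</th><th>File version</th></tr>"
    "<tr><td>example.dll</td><td>1.0.0.0</td></tr></table>"
)
records = table_to_records(sample)
print(records)
```

If pandas is already installed, pd.read_html on the same fragment gives a DataFrame instead; the version above is only meant to show what the parsing step involves.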

I might have to think about creating a more generic solution, as there are lots of pages that I need to parse.
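One possible starting point for something more generic is to stop relying on a fixed script index and instead search the text for a heading, then collect every table fragment that follows it. This is only a sketch under assumptions: find_tables_after is a hypothetical helper, the sample text is invented, and other pages may embed or escape their tables differently.

```python
import re

# Non-greedy match of one complete <table>...</table> fragment.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def find_tables_after(text, heading):
    """Return every table fragment that appears after `heading` in `text`.

    Returns an empty list when the heading is not present.
    """
    start = text.find(heading)
    if start == -1:
        return []
    return TABLE_RE.findall(text, start)


# Invented script text with two headings, each followed by one table.
page_text = (
    'var data = "<h3>For all supported x86-based versions</h3>'
    "<table><tr><td>a</td></tr></table>"
    "<h3>For all supported x64-based versions</h3>"
    '<table><tr><td>b</td></tr></table>";'
)

x86_tables = find_tables_after(page_text, "For all supported x86-based versions")
x64_tables = find_tables_after(page_text, "For all supported x64-based versions")
print(x86_tables[0])
print(x64_tables[0])
```

Taking the first fragment after each heading gives the table for that heading; the text passed in could be the string of each script tag in turn, or the whole response body.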
