Need help in parsing page using Python

ashishdalvi3 · May 27, 2020, 3:39am

I am trying to parse this Microsoft support page: https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019

I need to get the two tables under File Information:

For all supported x86-based versions
For all supported x64-based versions

However, when I try to use Beautiful Soup, I am not able to parse the table object because it is inside a script tag.

Any pointers will be appreciated.

kerafyrm · May 27, 2020, 6:55am

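import re
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://support.microsoft.com/en-us/help/4462242/description-of-the-security-update-for-office-2016-april-9-2019'
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# On this page the table markup sits in the script tag at index 10; that index may change.
data = soup.find_all('script')[10].string
# Grab everything from the x86 heading to the end of that line.
match = re.search(r'For all supported x86-based versions of Office 2016(.*)', data)
d = match.group()
# read_html returns a list of DataFrames; newer pandas expects a file-like object.
tables = pd.read_html(StringIO(d))

Packages you’ll need to install:
pandas
requests
html5lib
bs4
lxml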

I only did the top link, I’m sure you can get the other one. GL.
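
For the second table, a minimal sketch along the same lines, assuming the script text holds both tables as literal <table> markup right after the two headings from the question. The function name tables_by_heading and the HEADINGS list are made up for illustration, and it uses only the standard library so it can be tried without the page. If the markup in the script turns out to be escaped, it would need decoding first.

```python
import re

# Headings taken from the question; adjust if the page words them differently.
HEADINGS = [
    "For all supported x86-based versions",
    "For all supported x64-based versions",
]

# Non-greedy match of one complete table element, across line breaks.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def tables_by_heading(text, headings=HEADINGS):
    """Map each heading to the HTML of the first <table> that follows it."""
    found = {}
    for heading in headings:
        start = text.find(heading)
        if start == -1:
            continue  # heading not present in this text
        match = TABLE_RE.search(text, start)
        if match:
            found[heading] = match.group(0)
    return found
```

Each value is a plain HTML string, so it could be passed to pd.read_html(StringIO(fragment)) in the same way as the snippet above.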

kerafyrm · May 27, 2020, 6:57am

Way easier to just use JavaScript if this is a one-time thing.

kerafyrm · May 27, 2020, 7:23am

let tables = [...document.querySelectorAll('table')];
tables.shift();

const columns = ['file_identifier', 'file_name', 'file_version', 'file_size', 'date', 'time'];

let x = tables.map((table) => [...table.querySelectorAll('tr')].map((tr, i) => { 
   if (i === 0) return null;
   return [...tr.children].reduce((a, b, i) => {
     return {...a, [columns[i]]: b.innerText};
   }, {})
}))

console.log(x);

I would just post the JSON in here, but then the mod police would pull me over.

Be sure to click both links before running this code. The tables must be added to the DOM first.

ashishdalvi3 · May 27, 2020, 8:30am

The first solution worked for me, but I still need to parse the data I get from the regex. Thanks a lot.

ashishdalvi3 · May 27, 2020, 8:31am

I might have to think about creating a more generic solution, as there are lots of pages that I need to parse.
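
One possible starting point for something more generic, as a rough sketch only: instead of relying on a fixed script index, scan every script tag in the page and collect any <table> markup found inside. It assumes other pages embed their tables the same way as this one, which may not hold. ScriptCollector and find_embedded_tables are hypothetical names, and only the standard library is used.

```python
import re
from html.parser import HTMLParser

# Non-greedy match of one complete table element, across line breaks.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


class ScriptCollector(HTMLParser):
    """Collect the raw text content of every script tag in a page."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content can arrive in several chunks, so append to the last entry.
        if self.in_script:
            self.scripts[-1] += data


def find_embedded_tables(page_html):
    """Return the HTML of every <table> found inside any script tag."""
    collector = ScriptCollector()
    collector.feed(page_html)
    collector.close()
    tables = []
    for script in collector.scripts:
        tables.extend(TABLE_RE.findall(script))
    return tables
```

With requests, this might be called as find_embedded_tables(requests.get(url).text), and each returned fragment handed to pd.read_html through StringIO.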
