Extract html tags


I want to use a regular expression to print out the HTML tags, excluding their attributes.

For example:

import re
html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
tags = re.findall(r'<[^>]+>', html)
for a in tags:
    print(a)

The output is:

<h1>
</h1>
<p>
<span class="time">
</span>
</p>

But I just want the tags, not the attributes:

<h1>
</h1>
<p>
<span>
</span>
</p>

You could probably do this with a single regular expression over html, but a simpler alternative is to post-process each tag a inside the for loop. Assuming every attribute is written as attribute="some_value" (unquoted values, single-quoted values, and valueless attributes such as disabled would not match), you can strip them all with re.sub():

for a in tags:
    # Remove each attribute="value" pair, along with the space before it
    b = re.sub(r'\s?[\w-]+="[^"]*"', '', a)
    print(b)

I don’t recommend using regex to parse HTML: incomplete or malformed tags nested inside valid ones can turn it into a nightmare. That said, you could do something like the following:

import re
html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
tags = re.findall(r'<[^>]+>', html)
for a in tags:
    print(re.sub(r'(<\w+)[^>]*(>)', r'\1\2', a))

Displays the following:

<h1>
</h1>
<p>
<span>
</span>
</p>
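
For comparison, here is a minimal sketch of the non-regex route hinted at above, using the standard library's html.parser module. The class name TagNameCollector and its tags attribute are made up for this illustration; only HTMLParser, handle_starttag, handle_endtag, feed, and close come from the library. Treat it as one possible approach, not the only way to do it.

```python
from html.parser import HTMLParser


class TagNameCollector(HTMLParser):
    """Collect every tag with its attributes dropped (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is deliberately ignored
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
collector = TagNameCollector()
collector.feed(html)
collector.close()
for tag in collector.tags:
    print(tag)
```

For the sample string this prints the same six tags as the regex version. The parser hands the attributes over separately, so nothing has to be stripped by pattern matching, and attribute values that contain a > character should not confuse it.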

Thanks. How do I print out only the tags that have no attributes?

<h1></h1>
<p></p>
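
One possible way to get that output, sketched with html.parser rather than a regex: remember for each opening tag whether it carried attributes, and emit the pair only when its closing tag arrives. The names PlainTagPairs, open_tags, and pairs are invented for this sketch, and it assumes each tag of interest has a matching closing tag.

```python
from html.parser import HTMLParser


class PlainTagPairs(HTMLParser):
    """Record <tag></tag> pairs whose opening tag had no attributes (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, had_attributes)
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, bool(attrs)))

    def handle_endtag(self, tag):
        # Find the most recent matching opening tag; ignore stray closing tags
        for i in range(len(self.open_tags) - 1, -1, -1):
            name, had_attrs = self.open_tags[i]
            if name == tag:
                del self.open_tags[i:]
                if not had_attrs:
                    self.pairs.append(f"<{tag}></{tag}>")
                break


html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
parser = PlainTagPairs()
parser.feed(html)
parser.close()
for pair in parser.pairs:
    print(pair)
```

Pairs are recorded when the closing tag is seen, so with nested attribute-free tags the inner pair is printed before the outer one.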