Python regex - help needed

Python regex - help needed
0

#1

Hi Campers,

I’m currently working with a very complicated .csv file and I have a problem with extracting data. One cell could have the following content:

test_string2 = 'jam = 12, jom = kom, dim_cm = 33,5 x 33,5, mark_o = STRU L EDA P ... hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh '

I would like to extract the data as follows:

k = ['jam', 'jom', 'dim_cm', 'mark_o']
v = ['12', 'kom', '33, 5 x 33,5 ',' STRU L EDA P ... hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh ']

And the final result would be:

0 	1
0 	jam 	12
1 	jom 	kom
2 	dim_cm 	33,5 x 33,5
3 	mark_o 	STRU L EDA P...hjhkj kdk,hjd hdjk; d-hd-kj dhk...

At the moment I can only get a list of ‘k’, so I would like to ask you to help me to get the list of ‘v’.

My code so far:

>>> import pandas as pd
>>> import re
>>> test_string2 = 'jam=12,jom=kom,dim_cm=33,5 x 33,5,mark_o=STRU L EDA P...hjhkj kdk,hjd hdjk; d-hd-kj dhkhdkjdh '
>>> regex2 = '(\w+?)='
>>> k = re.findall(regex2, test_string2)
>>> k
['jam', 'jom', 'dim_cm', 'mark_o']
>>> k = ['jam', 'jom', 'dim_cm', 'mark_o']
>>> v = ['12', 'kom', '33,5 x 33,5', 'STRU L EDA P...hjhkj kdk,hjd hdjk; d-hd-kj dhkhdkjdh ']
>>> pd.DataFrame(list(map(list, zip(k,v))))
        0                                                  1
0     jam                                                 12
1     jom                                                kom
2  dim_cm                                        33,5 x 33,5
3  mark_o  STRU L EDA P...hjhkj kdk,hjd hdjk; d-hd-kj dhk...

Thanks!


#2

Holy moly! This is a curly one - I don’t envy you this task! I recommend using http://regex101.com to work on problems like this - remember to set the ‘flavor’ to Python.

This is complicated by the comma which serves two functions in your test string - one is as a delimiter for your key/value pairs…which is easy to handle. The second is as a European decimal place (33,5), which is harder for me to figure out how to handle…

This is as far as I have got to get your v values:

\=\ ([\w+. ]+) will get you everything after the = and up to a comma (or the ... hiding in there too).

Explanation
\= matches the character = literally (case sensitive)
\ matches the character (space) literally (case sensitive)
1st Capturing Group ([\w+. ]+)
Match a single character present in the list below [\w+. ]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
\w matches any word character (equal to [a-zA-Z0-9_])
+. matches a single character in the list +. (case sensitive)

But like I said, the Euro decimal is throwing me off - you probably have to treat commas followed by a letter differently to a comma followed by a number…is that always going to be true though?

If you figure this out, please post your solution!!!


#3

Second attempt at this. I can get closer, but notice the final match - it has a comma followed by a space, but is not intended to act as a delimiter or euro decimal comma. Again, I can’t wrap my head around how to determine whether a comma is supposed to act as a delimiter given the possibility that it can be followed by a space…

Actually, I think I’m thinking about this all wrong. I’ll come back to this later…I have an idea, but no time to test it :slight_smile:


#4

Hi @JacksonBates
Thanks for trying.
I’m fighting with this for a couple of days now :slight_smile:
I’ll post, if I get a solution.


#5

I tried again last night - it’s really tricky!

One trick I thought of was, since we know all the parts of the string that go into K we can remove them and just work on a string that has V in it.

Then all you need is a suitable delimiter to split them into the array. I didn’t have time to test out my idea though.

What’s the source of values for V? Can it be random, or is there some predictability to the input? Are there wider test cases aside from those you included?

Also, is this something like a homework assignment or a real life piece of data you’re wrangling?


#6

Alright, I’ve got it :slight_smile:

This is a hacky mess, but bear with me…

>>> import re
>>> str = 'jam=12,jom=kom,dim_cm=33,5 x 33,5,mark_o=STRU L EDA P...hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh '
>>> regexForK = '(\w+?)='
>>> k = re.findall(regexForK, str)
>>> # k = ['jam', 'jom', 'dim_cm', 'mark_o']
>>> # Next we remove every item from k from our string
>>> for item in k:
...  str = re.sub(item, '', str)
>>> # Now we split the remaining string into a list using = as the delimiter
>>> arr = str.split('=')
>>> # arr = ['', '12,', 'kom,', '33,5 x 33,5,', 'STRU L EDA P...hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh ']
>>> # We remove the initial empty string in the list 
>>> arr.remove('')
>>> # Now we can start to construct v
>>> v = []
>>> # we need to trim the end comma off each item in the list, 
>>> # and put the final value in the v list
>>> for item in arr:
...  pruned = re.sub(',$', '', item)
...  v.append(pruned)
# Here's our ending values:
# k = ['jam', 'jom', 'dim_cm', 'mark_o']
# v = ['12', 'kom', '33,5 x 33,5', 'STRU L EDA P...hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh ']

There is probably one killer regex string that knocks all this out in one go, but I can’t mentally parse any regex string longer than 4 characters!

This can probably be refactored or simplified, but it should get the job you want done. You may want to wrap it all up in a function.


#7

Great job @JacksonBates

I came to the solution without the use of regex. Only Python.

import pandas as pd

test_string = 'jam=12,jom=kom,dim_cm=33,5 x 33,5,mark_o=STRU L EDA P...hjhkj kdk, hjd hdjk; d-hd-kj dhkhdkjdh '
split_string = test_string.split('=')

# first item in main list
main_list = []
main_list.append(split_string[0]) # should be the first item in main list

# make temp list - between first and last item
temp_list = split_string[1:-1]

for item in temp_list:
    it = item.rsplit(',', 1)
    for i in it:
        main_list.append(i)

main_list.append(split_string[-1]) # shoould be the last item in main list

k = main_list[:-1:2]
v = main_list[1::2]

pd.DataFrame(list(map(list, zip(k,v))))