Generate unique short values from a large number of long strings

nkaretnikov · August 23, 2017, 8:44pm

Hello to everyone!
In my Active Directory I have about 100k users.
Each user has a “Description” field that looks similar to the example below:
"The Paramount Company FullName \ Paramount Company \ The Department One FullName \ Dep1 \ Sales Department of the Department One \ Sales Department \ South Division of the Sales Department \ South Division"
and so forth.
This long (over 250 characters) string has to be parsed and shortened to just 32 characters, then stored next to the user’s unique ID in a CSV file.
The simplest idea would be to take just the last value, like $Description.Split('\')[-1] (PowerShell notation):
UserID, Dep32
ivan_ivanov, SouthDivision

But that will not guarantee uniqueness, obviously.
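
To make the collision risk concrete, here is a minimal sketch in Python (chosen only for illustration, not PowerShell). The sample descriptions and the helper name `last_segment` are made up for this example; only the backslash separator comes from the description above.

```python
from collections import Counter

def last_segment(description: str) -> str:
    """Return the last backslash-separated part, with spaces removed."""
    return description.split("\\")[-1].replace(" ", "")

# Hypothetical sample data: two different departments both end in "South Division".
descriptions = {
    "ivan_ivanov": r"Dep1 \ Sales Department \ South Division",
    "petr_petrov": r"Dep2 \ Support Department \ South Division",
    "john_dow": r"Dep1 \ Sales Department \ North Division",
}

short = {user: last_segment(d) for user, d in descriptions.items()}

# Count each distinct long description once, so that colleagues from the
# same department are not reported as a collision.
counts = Counter(last_segment(d) for d in set(descriptions.values()))
collisions = sorted(name for name, n in counts.items() if n > 1)

print(short)
print(collisions)
```

Any name that ends up in `collisions` is shared by at least two different long descriptions, which is exactly the case the last-value approach cannot tell apart.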

Could someone recommend an approach that would at least raise the probability of the resulting string (Dep32) being unique?

kevinSmith · August 23, 2017, 9:02pm

I guess I’m misunderstanding. Is “ivan_ivanov” already a unique identifier? Or does the system allow more than one “ivan_ivanov”? Does “South Division” have to be unique too, or just in combination with “ivan_ivanov”? And does this unique ID have to be human readable, or is it just for the DB?

ppc · August 23, 2017, 9:04pm

Use Get-FileHash in powershell

it generates a SHA-256 checksum which is 32 bytes

ppc · August 23, 2017, 9:17pm

it’s easy to save the string to a file

kevinSmith · August 23, 2017, 9:20pm

Yeah, hashing is an option, but I always worry about “unique”.
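
The worry about “unique” can be put into rough numbers with the birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)), where n is the number of distinct strings and N the number of possible hash values. The Python sketch below assumes the hash behaves like a uniform random function. The function name `collision_probability` is made up for this example, and n = 100,000 is just the user count from the first post (the number of distinct descriptions is probably much smaller).

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate chance that at least two of n random values of the given bit length coincide."""
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n * (n - 1) / (2.0 * space))

n = 100_000  # roughly the number of users mentioned in the question
for bits in (32, 64, 128, 256):
    print(bits, collision_probability(n, bits))
```

Under these assumptions a 32-bit value collides more often than not at this scale, while at 128 bits and above the probability is vanishingly small.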

You might be able to look for some kind of text compression, etc.

I’m still not clear why the description needs to be part of the unique identifier. What will happen if that person transfers or changes jobs?

Do the different divisions already have codes associated with them?

nkaretnikov · August 23, 2017, 9:42pm

Hello!
Thank you for the reply!
To simplify the issue, let’s say we have 100 input records:
LongDeptDescription;UserName

and we need 100 output records:
ShortestPossibleDeptDescription;UserName

In:
Department1…1\Division1…1\SubDivision1…1 … \SubDivision1…M1;ivan_ivanov
Department2…1\Division2…1\SubDivision2…1 … \SubDivision1…M2;petr_petrov
…
Department100…1\Division2…1\SubDivision2…1 … \SubDivision1…M100;john_dow

Out:
Dep1…1\Div1…1\SubDiv1…1 … \SubDiv1…M1;ivan_ivanov
Dep2…1\Div2…1\SubDiv2…1 … \SubDiv2…M2;petr_petrov
…
Dep100…1\Div2…1\SubDiv2…1 … \SubDiv2…M100;john_dow

The short versions of the department names are meant to be read by humans, not machines.

While shortening LongDeptDescription we still have to keep the department names unique, so that our users don’t end up in the wrong department.

Hope I made it clear this time

mikep · August 24, 2017, 2:06am

Use SHA-256, which generates a 256-bit hash of your string. 256/8 = 32 bytes, which you can map to 32 characters.
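
One caveat: the 32 raw bytes of a SHA-256 digest are not printable as they are, and the usual hex encoding is 64 characters long. Below is a hedged Python sketch of two ways to end up with exactly 32 printable characters, assuming that truncating the digest is acceptable. The function names and the sample string are made up for this example.

```python
import base64
import hashlib

def dep32_hex(description: str) -> str:
    """First 32 hex characters (128 bits) of the SHA-256 digest."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()[:32]

def dep32_base32(description: str) -> str:
    """Base32 encoding of the first 20 digest bytes (160 bits): exactly 32 characters."""
    digest = hashlib.sha256(description.encode("utf-8")).digest()
    return base64.b32encode(digest[:20]).decode("ascii")

# Hypothetical sample description, shaped like the one in the first post.
sample = r"Paramount Company \ Dep1 \ Sales Department \ South Division"
print(dep32_hex(sample))
print(dep32_base32(sample))
```

Both variants are deterministic and very unlikely to collide, but neither is readable by a human.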

kevinSmith · August 24, 2017, 3:27am

“While shortening LongDeptDescription we still have to keep the department names unique, so that our users don’t end up in the wrong department.”

I don’t see how that would be foolproof. Too many chances for them to rename a subdivision and suddenly you have a collision.

Does this need to be unique to be a DB index? Or just because you want to use it?

If it’s needed for an index, you may want to check whether they already have codes for the different departments. Every large corp I’ve been a part of had specific codes for all of these. You could use those in your index and just parse them out for display.

But if you really must have some human readable index (again, I think those functions should be split) then you could have a table of different departments/subdepts/etc and what their abbreviations are. From there you could make sure that the abbreviations are unique. Of course the usernames would have to be unique too.
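
The lookup-table idea could be sketched roughly like this in Python. The abbreviations, the table contents and the function names `check_unique` and `shorten` are all hypothetical; a real table would have to be maintained by whoever owns the department names.

```python
from typing import Dict

# Hypothetical lookup table: full department name -> agreed abbreviation.
ABBREVIATIONS: Dict[str, str] = {
    "Paramount Company": "PC",
    "Dep1": "D1",
    "Sales Department": "Sales",
    "South Division": "South",
}

def check_unique(table: Dict[str, str]) -> None:
    """Raise ValueError if two full names share the same abbreviation."""
    seen: Dict[str, str] = {}
    for full, abbr in table.items():
        if abbr in seen:
            raise ValueError(f"{abbr!r} is used for both {seen[abbr]!r} and {full!r}")
        seen[abbr] = full

def shorten(description: str, table: Dict[str, str]) -> str:
    """Replace each backslash-separated part with its abbreviation."""
    parts = [p.strip() for p in description.split("\\")]
    # A KeyError here flags a department name that is missing from the table.
    return "\\".join(table[p] for p in parts)

check_unique(ABBREVIATIONS)
short = shorten(r"Paramount Company \ Dep1 \ Sales Department \ South Division", ABBREVIATIONS)
print(short, len(short))
```

Because every abbreviation maps back to exactly one full name, two different department paths cannot shorten to the same string. The 32-character limit still has to be checked separately.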

nkaretnikov · August 24, 2017, 5:30am

Thank you for your suggestions!

DanCouper · August 24, 2017, 6:53am

Just use a UUID. Here’s a generator for Node: https://github.com/kelektiv/node-uuid, but there’s a lib for any language you’re using.
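
For reference, a random (version 4) UUID written as hex without dashes happens to be exactly 32 characters. A small Python sketch using the standard `uuid` module (not the Node library linked above); the variable name `user_key` is just for illustration.

```python
import uuid

# uuid4() is random; .hex gives 32 lowercase hex characters without dashes.
user_key = uuid.uuid4().hex
print(user_key, len(user_key))
```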

nkaretnikov · August 24, 2017, 7:34am

Human readable, please

DanCouper · August 24, 2017, 8:48am

It’s not possible to guarantee uniqueness that way though; there has to be some form of unique id, and that id has to be a certain length to reduce chance of collisions. You can define some part of the (string) description as always needing to be unique, but given the amount of users, the likelihood of race conditions leading to duplicated keys is relatively high. It means some extra logic, but having a unique randomised string of 32 chars, or attaching a shorter randomised string to some key piece of the description is your best bet imo
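
A rough Python sketch of the “shorter randomised string attached to a key piece of the description” variant. The function name `make_id`, the 6-character suffix and the sample label are assumptions made for this example. The `taken` set only protects a single process; with several concurrent writers the uniqueness check would have to live in the data store itself.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols

def make_id(label: str, taken: set, total_len: int = 32, suffix_len: int = 6) -> str:
    """Truncate label and append a random suffix, retrying until the result is unused."""
    head = label[: total_len - suffix_len - 1]  # leave room for "-" and the suffix
    while True:
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
        candidate = f"{head}-{suffix}"
        if candidate not in taken:
            taken.add(candidate)
            return candidate

taken_ids = set()
print(make_id("SalesDepartmentSouthDivision", taken_ids))
```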

nkaretnikov · August 24, 2017, 10:30am

This is why I wrote “Could someone recommend an approach that would at least raise the probability of the resulting string (Dep32) being unique?”

hydracus · August 24, 2017, 12:39pm

An approach may be to use some sort of reference system.
This example uses a base 36 numbering system (A-Z plus 0-9 gives 36 symbols per character):
e.g. DEP[A-Z,0-9][A-Z,0-9]
DEPAA = department 1
DEPAB = department 2
…
DEP99 = department 1296

dep99 divAB usAAA … = department 1296/1296, division 2/1296, countrycode city 1/46656.
Based on the provided description, you would probably leave the last 4 characters for unique numbers, allowing a further 1679616 unique identifiers.
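
A minimal Python sketch of such a reference code, assuming the symbol order A-Z followed by 0-9 and 1-based numbering, so that department 1 becomes “AA”. With 36 symbols, two characters cover 36² = 1296 values and four characters cover 36⁴ = 1679616. The function names `encode` and `decode` are made up for this example.

```python
import string

ALPHABET = string.ascii_uppercase + string.digits  # A-Z then 0-9: 36 symbols

def encode(number: int, width: int) -> str:
    """Encode a 1-based sequence number as a fixed-width code (1 -> 'AA' for width 2)."""
    index = number - 1
    if not 0 <= index < len(ALPHABET) ** width:
        raise ValueError("number does not fit in the requested width")
    chars = []
    for _ in range(width):
        index, digit = divmod(index, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

def decode(code: str) -> int:
    """Inverse of encode: turn a code back into its 1-based sequence number."""
    index = 0
    for ch in code:
        index = index * len(ALPHABET) + ALPHABET.index(ch)
    return index + 1

print("DEP" + encode(1, 2), "DEP" + encode(2, 2), "DEP" + encode(1296, 2))
```

The codes stay unique as long as each department, division and so on is assigned its number exactly once, for example from a central table.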

lionel-rowe · August 24, 2017, 1:14pm

You could do better than base 36 if you made it case sensitive (base 62, in fact). Four digits and you’ve got yourself almost 15 million combinations for some decent scalability. And if you’re obsessed with making it human-readable, just append some abbreviated versions of the other data (dept name abbreviation + truncated name, perhaps) to that.

So you’d have something like “aM8k-SalesSouth-IvanIvanovicIvan” (-ov got truncated). As long as the dept abbreviations were kept short and you were smart about how to truncate the name (e.g. don’t include middle names like above), you should usually end up with something pretty readable.
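
As a sketch of how such a combined string could be built, here is a Python version that encodes a counter as four base 62 characters and truncates the name to fit into 32 characters. The symbol order (digits, lowercase, uppercase), the counter value and the function names are assumptions made for this example. Uniqueness comes from the counter alone; the rest is only there for readability.

```python
import string

BASE62 = string.digits + string.ascii_letters  # 0-9, a-z, A-Z: 62 symbols

def to_base62(number: int, width: int = 4) -> str:
    """Encode a non-negative counter as a fixed-width base 62 string."""
    if not 0 <= number < len(BASE62) ** width:
        raise ValueError("counter does not fit in the requested width")
    chars = []
    for _ in range(width):
        number, digit = divmod(number, len(BASE62))
        chars.append(BASE62[digit])
    return "".join(reversed(chars))

def composite_id(counter: int, dept_abbr: str, name: str, max_len: int = 32) -> str:
    """Join code, department abbreviation and name, truncating the name to fit max_len."""
    prefix = f"{to_base62(counter)}-{dept_abbr}-"
    if len(prefix) > max_len:
        raise ValueError("department abbreviation is too long")
    return prefix + name[: max_len - len(prefix)]

print(len(BASE62) ** 4)  # number of distinct four-character codes
print(composite_id(12345, "SalesSouth", "IvanIvanovicIvanov"))
```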

…But that whole approach is just messy, like mixing ice cream with gravy. Better to use two separate fields and concatenate them together if you need to (e.g. in a URL). Besides, what if Ivan transfers to Operations? Then your whole “ID” would be ruined if you used the ice cream gravy approach.