Suggestion for a solution in finding anagrams

Suggestion for a solution in finding anagrams
0
#1

Suppose you have a dictionary with millions of words and you have to find all the anagrams of the given string from the dictionary…
(An anagram is a word or phrase formed by rearranging the letters of a different word or phrase, typically using all the original letters exactly once. Example: “TEAM” and “MATE” are anagrams)…
The solution I have in mind is to find all the combinations of the letters in the given string and search for their occurrence in the dictionary? Do you have a better and efficient way to work out this?

#2

Yes, sort of. That’s slow though. Even optimised versions are slow: you can test this by using one of the online anagram finders/crossword solvers. Normally they restrict both the input (in size) and possible output (in number of results, the size of the dictionary, and whether the returned collection includes multiple words). And even then, the lookup tends to take a while. Overall, you can’t really brute-force this problem beyond inputs of very short words, it takes too long to compute.

Edit: forgot to include this, it is quite critical

Note that to see if something is an anagram of another word, what you generally can do is assign a value to each letter in the alphabet, either increasing primes (a is 2, b is 3, c is 5, d is 7) or a geometric progression where each number is 2* the previous (a is 1, b is 2, c is 4, d is 8), then respectively either the sum the values or apply bitwise OR across them. Comparing the two results will tell you if one word is an anagram of another. (The bitwise version is off the top of my head, I think applying OR across the values works, maybe it’s AND… I can’t remember)

Few things:

You won’t have millions of words, you’ll generally only have maximum 100-200k; you can’t feasibly include proper nouns and names because that list is potentially infinite (though obviously sets of proper nouns could be included).

The dictionary itself can be restricted to the most common words, so can be dropped to a few tens of thousands in size.

You have to restrict the size of the input because the number of permutations you need to look up increases exponentially.

You have to specify a language, which then leads to a second lookup table of impossible combinations - like in English, “aaa” is not possible, or “xq” or “ttt”. This is one of the ways permutations function can be optimised. There will need to be other ways as well, much of what it will generate will be junk, so there needs to be a way to not process the junk (as much as is viable).

You can restrict the output to single words, this makes things massively quicker, because you can simply get the permutations and look them up; then the only thing that needs to be optimised is the permutations function, lookup is trivially quick.

If the output isn’t restricted to single words then it’s a bit harder because you need a way to find combinations of words that add up to the same length and have the same letter. So, gets slow, fast. Again, there are language-specific rules regarding valid words, so this can be (has to be!) optimised.

With multiple words, you probably then have other tables of single and double letter words etc which allow another degree of optimisation (makes it much quicker to narrow down possibilities). eg “dog”: doesn’t matter what you do, “d” “o” “g”, “d” “og”, or “do” “g” can’t work: there is one valid two-letter word there, but that comes with an invalid single-letter word, so is overall invalid, can only have that three-letter word as output. If you have valid single letter words in a table, the first two posisiblities only need to check that first character, once they see it’s a “d”, can move on immediately.

So to deal with the above, parallelizing (either just cores or across a cluster, or using GPUs) is probably the way to go, but that’s a whole 'nother kettle of fish.
The fastest way is to precompute everything, so for every given set of letters (key), have every valid combination (set of values). Then just sort the input and look up in the precomputed table (which will have millions of entries). But you still need to get all the above put in place first of all.

1 Like
#3

Hello Dan,Thank you very much. I will try with that…Between I just thought about making an array with groups of all anagrams in the dictionary and then use it to compare with the input string…what do you think??

I am doing this in a node,express environment. Will let you know what I find :+1: