
Quickly finding a location in a large array that is similar to a smaller array

12 comments, last by alvaro 6 years, 2 months ago

So I've got some very large (a few thousand elements long) arrays of numbers. Each element in the array is a number from 1 to 6. I then have a lot of smaller arrays (maybe 20 or 30 elements long) which also consist of the numbers 1 to 6. For each of these smaller arrays I have to find a location in the larger array that is similar enough to pass my test, if one exists.

The similarity for any one of the smaller arrays is found as follows:

1. Take the absolute difference between each pair of numbers: one from the smaller array and the corresponding one from the larger array, for the whole length of the smaller array.

2. Sum the differences of all the pairs, then turn that into a raw similarity by subtracting the sum from the maximum possible total difference and dividing by that maximum.

3. Raise the raw similarity to a high power because I need the results to follow a curve. (e.g. with an 8th power, a raw similarity of 40/40 stays 1, but 39/40 becomes about 0.82, and 35/40 only about 0.34)

4. Advance the offset into the larger array by one and start over at step 1, but this time the first element of the smaller array is compared to the second element of the larger array, and so on.

Unfortunately this means that for each smaller array I have to do an exponentiation for every offset into the larger array. (e.g. to compare a smaller array against a larger array of 1000 elements, I have to call the 'pow' method nearly 1000 times) I also have to compute a number of pairwise differences equal to the length of the smaller array multiplied by the length of the larger array, for every one of the smaller arrays. (e.g. I have to find out whether a 20-element smaller array matches elements 0-19 of the larger array, then elements 1-20 of the larger array, then elements 2-21 ... until the end of the larger array)

Can anyone suggest some optimizations I can do here?

 

Here's a snippet of code as an example:



void activator::calculateFree()
{
    if(attachSize < 1)
        return;
    //attach size is the size of the binding domain of the activator
    //e.g. the size of the 'smaller array'

    calculateLigand();
    if(!boundLigand)
    {
        //parent->mainDNA->size is the size of the 'larger array'
        //a is the offset as we work our way through the larger array
        for(unsigned int a = 0; a<=parent->mainDNA->size-attachSize; a++)
        {
            float totalDifference = 0;
            float actualDifference = 0;
            bool inhibited = false;
            for(unsigned int b = 0; b<attachSize; b++)
            {
                totalDifference += 6;
                //parent->mainDNA->code is the larger array
                //attachBinding is the smaller array
                actualDifference += abs(parent->mainDNA->code[a+b]-attachBinding[b]);
                if(parent->mainDNA->inhibitedCode[a+b])
                {
                    inhibited = true;
                    break;
                }
            }
            if(inhibited)
                continue;

            //here (totalDifference-actualDifference)/totalDifference is the raw similarity between the two arrays
            geneAffinity = (totalDifference-actualDifference)/totalDifference;

            //try and avoid doing a pow if we don't have to
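            //0.61324 is roughly 0.02^(1/8): any raw similarity below it cannot
            //pass the bindingAffinity < 0.98 check after the pow below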
            if(geneAffinity > 0.61324)
            {
                geneAffinity = pow(geneAffinity,8);

                //For example, if we have a smaller array length of 20
                //And at a particular offset 19 of its elements are equal to the larger arrays elements
                //and the last element is only 1 different from the corresponding larger array element
                //that grants a similarity of 119/120
                //(totalDifference-actualDifference)/totalDifference in this case would be 0.99166
                //raised to the 8th power, that's 0.935

                //here we do a check to see if the similarity is close enough to do something with
                float bindingAffinity = 1-geneAffinity;
                //geneAffinity > 0.02
                if(bindingAffinity < 0.98)
                {
                    unsigned int randMax = 1+(bindingAffinity*1000);
                    if(rand() % randMax == 0)
                    {
                        boundDNA = parent->mainDNA;
                        boundPosition = a;
                        for(unsigned int b = 0; b<parent->unboundActivators.size(); b++)
                        {
                            if(parent->unboundActivators[b] == this)
                            {
                                parent->unboundActivators.erase(parent->unboundActivators.begin() + b);
                                break;
                            }
                        }
                        boundDNA->activators.push_back(this);
                        break;
                    }
                }
            }
        }
    }
}

 


No idea what your method does but I'd do something like this:

First totally ignoring that you can store 1 to 6 in 3 bits, we'll just start with storing each in 1 byte...

  1. Find a number of length classes for your short 'strings', some of them may share length classes (e.g. length 20, 18 etc)
  2. Come up with a hash function for your 'strings'. It doesn't have to be unique. You could start by simply adding the values up for the whole string.
  3. Then for each of your longer test data, maintain a 'running total' of the hash for each length class as you traverse the data.
  4. If the hash matches one of your strings (of that length class), do a more accurate compare.
  5. Profit!

The running total works like this, for example for a length class of 3:

String is 3, 4, 1 = Total 8

Test Data is

4563414663

Process:

4 + 5 + 6 = 15 -> no match

15 MINUS 4 (going out) PLUS 3 (going in) = 14 -> no match

14 -5 +4 = 13 -> no match

13 -6 +1 = 8 -> match!!

Do full compare -> Match!! -> record match

continue...

This idea is commonly used in e.g. audio processing. It has the advantage that because of the hash, the length of the test strings is irrelevant (you can test very long test strings with no extra cost). There may be other cool ways of doing it.
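In code it might look something like this (untested, just to show the shape; code, pattern and patternSum are made-up names, and the data is stored as unsigned char like in your snippet):


void findCandidates(const unsigned char* code, unsigned int codeSize,
                    const unsigned char* pattern, unsigned int patternLen,
                    unsigned int patternSum) //patternSum precomputed once per pattern
{
    if(codeSize < patternLen)
        return;

    //sum of the first window of the larger array
    unsigned int windowSum = 0;
    for(unsigned int b = 0; b < patternLen; b++)
        windowSum += code[b];

    for(unsigned int a = 0; ; a++)
    {
        //cheap filter: only do the expensive element-by-element compare
        //(using pattern) when the window sum matches the pattern sum
        if(windowSum == patternSum)
        {
            //full compare / similarity calculation for offset a goes here
        }

        if(a + patternLen >= codeSize)
            break;

        //slide the window: subtract the element going out, add the element coming in
        windowSum -= code[a];
        windowSum += code[a + patternLen];
    }
}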

And you can go further and utilize the smaller bit depth for packing, but that may not be necessary.

Ah I've just seen it's DNA.. I've done this exact thing myself in the past, and with codons. It's going to be a massively important field! :)

Have just read your question more thoroughly (skim reading haha :) sorry!) .. and you are after similarities as well as exact matches. The summing idea should still work for this too afaik: just do the second, more accurate check whenever a window's sum is within a certain range of the test string's sum. You might be able to precalculate what this range should be with your pow function.
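For example, something like this (untested; maxAllowedDifference is a made-up name) turns the similarity cutoff into a plain integer that can be computed once per length class:


#include <cmath>

//minSimilarity is the cutoff applied after the pow (e.g. the 0.02 in the original code),
//power is the exponent (8 here), maxValue is the per-element difference used for
//normalisation (6 in the original), and len is the length of the smaller array
unsigned int maxAllowedDifference(double minSimilarity, double power,
                                  unsigned int maxValue, unsigned int len)
{
    //raw similarity = (max - diff) / max with max = maxValue * len, and pow is monotonic,
    //so raw^power >= minSimilarity  <=>  diff <= max * (1 - minSimilarity^(1/power))
    double maxTotal = double(maxValue) * double(len);
    return (unsigned int)(maxTotal * (1.0 - std::pow(minSimilarity, 1.0 / power)));
}

Then the inner loop only needs to compare its running difference against that integer, and the pow only ever runs for offsets that already pass. And since the total of the absolute differences is always at least the difference between the two sums, the same number also works as the tolerance band for the running-total filter above.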

I just want to mention that


pow(x, 8)

will probably be computed as three multiplications, so it shouldn't be expensive.
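If the compiler doesn't do it for you, the repeated-squaring version is trivial to write by hand:


float x2 = x * x;   //x^2
float x4 = x2 * x2; //x^4
float x8 = x4 * x4; //x^8 in three multiplications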

17 minutes ago, alvaro said:

I just want to mention that



pow(x, 8)

will probably be computed as three multiplications, so it shouldn't be expensive.

Yes, I am a little surprised this is a performance problem at all.. unless you are doing this whole process multiple times? Are you trying to find possible binding sites for generated ligands (I forget what all these terms mean I never paid attention in biochemistry at uni!)? I had a go at doing it in 3d, and doing the protein folding from the amino acid sequence. Fun! :)

1 minute ago, lawnjelly said:

Yes, I am a little surprised this is a performance problem at all.. unless you are doing this whole process multiple times? Are you trying to find possible binding sites for generated ligands (I forget what all these terms mean I never paid attention in biochemistry at uni!)? I had a go at doing it in 3d, and doing the protein folding from the amino acid sequence. Fun! :)

I've got like thousands of enzymes to compare to thousands of ligands each frame. I guess it just boils down to doing what y'all said and finding ways to preemptively determine a failed match and weed out any unnecessary comparisons.

Ah, so this is for a game? If you are inventing the genetic code you might make it match bit depths exactly so you can take advantage .. i.e. use 4 or 16 possible values, for instance. I seem to remember that the real ACGT bases conveniently fit in 2 bits each, so a triplet codon fits in 6 bits, with room to spare since the 64 possible codons map onto only around 20 amino acids (which is why nature has duplicates).

You can also look at SIMD for doing several sequences at a time, or even doing it on the GPU?
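e.g. with SSE2, _mm_sad_epu8 sums the absolute differences of 16 unsigned bytes in a single instruction, which is exactly the inner-loop operation here. Untested sketch, assuming the sequences stay as unsigned chars; anything past the first 16 elements would still go through the scalar loop:


#include <emmintrin.h> //SSE2 intrinsics

//sum of absolute differences of the first 16 bytes of two sequences
unsigned int sumAbsDiff16(const unsigned char* big, const unsigned char* small16)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(big));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(small16));
    __m128i sad = _mm_sad_epu8(a, b); //two partial sums, one per 8-byte half
    return (unsigned int)_mm_cvtsi128_si32(sad) + (unsigned int)_mm_extract_epi16(sad, 4);
}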

Just now, lawnjelly said:

Ah, so this is for a game? If you are inventing the genetic code you might make it match bit depths exactly so you can take advantage .. i.e. use 4 or 16 possible values, for instance. I seem to remember that the real ACGT bases conveniently fit in 2 bits each, so a triplet codon fits in 6 bits, with room to spare since the 64 possible codons map onto only around 20 amino acids (which is why nature has duplicates).

You can also look at SIMD for doing several sequences at a time, or even doing it on the GPU?

I actually considered doing 4 bit sequences. For now they're just unsigned chars though. That would save RAM, but would it save CPU? RAM isn't the issue at all.

As for doing it on the GPU, that'd be great, though I imagine that's a bit above my head. I understand the basics of parallel computing though.

It is not just about saving RAM. For fast calculations a big bottleneck can be memory bandwidth, so if you use half the bandwidth you can theoretically get as much as twice the speed.

First, it's a good idea to measure how much of the time is going to what operation. For instance, you seem to think that the pow is expensive, but how much faster is your algorithm if you remove it? Once you have a good idea of what the slow parts are, it's easier to think of optimizations.
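Something as simple as std::chrono around the matching pass will tell you:


#include <chrono>
#include <cstdio>

//inside the simulation update, around the whole matching pass:
auto t0 = std::chrono::steady_clock::now();
//... run calculateFree() for every activator ...
auto t1 = std::chrono::steady_clock::now();
std::printf("matching pass: %.3f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count());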

Here's another idea: Since the quality of each attempted match can be computed independently from all the others, you could program a GPU to run this algorithm. That could be a very big speed up.

 
