Thursday, December 24, 2009

Computing the minimum snippet from a search result

I looked for literature for computing the shortest snippet of a matched document for a search result containing multiple terms, but didn't find anything pedagogical. The closest I came was the getBestFrament method in the Highlighter class of the Lucene source distribution.

So I set about trying to come up with a way to compute it. This is what I got (so far):

If we have three sorted inverted indices for three terms, say R, G and B:

R : [1, 43]
G : [2, 45]
B : [10, 46]

we merge the arrays. This is a O(n) operation (There is also log K work to maintain a priority heap of K terms, where K is the number of search terms. Here, K = 3).

UPDATE: It turns out the k-item priority queue idea is discussed in Exercise 1. in section 5.2.4 of TAOCP Vol 3. (Sorting and Searching).

But we also want to differentiate the values coming from different arrays. We do this with different colored balls. So the indices from the array R are colored Red, the ones from G are colored G and the ones from B are colored Blue.

In the merged array, we insert a ball of the appropriate color at the position specified in the original array. We'll also have to use color to maintain the priority queue above so that there is only one ball per color (array) in the queue.

Result : [R, G, ...., B, ........, R, G, B]

Once we have the resultant array, we start scanning it from the left. We stop when we mark one ball of each color. Once we do that, max index - min index of this RGB combination is current least distance. We store this as the current min distance, and move one index to the right. We terminate when we can't get a complete color set anymore

This is an O(n^2) operation, since for every starting point, we might have to scan the entire array to complete the set of colors.

To complete the above example:

Step 1: R = 1, G = 2, B = 10. Dist = 10 - 1 = 9
Step 2: G = 2, B = 10, R = 43. Dist = 43 - 2 = 41
Step 3: B = 10, R = 43, G = 45. Dist = 45 - 10 = 35.
Step 4: R = 43, G = 45, B = 46. Dist = 46 - 43 = 3. ***Min***
Step 5: G = 45, B = 46, R = ? . Terminate

Does this make sense?