Thursday, December 24, 2009

Computing the minimum snippet from a search result

I looked for literature for computing the shortest snippet of a matched document for a search result containing multiple terms, but didn't find anything pedagogical. The closest I came was the getBestFrament method in the Highlighter class of the Lucene source distribution.

So I set about trying to come up with a way to compute it. This is what I got (so far):

If we have three sorted inverted indices for three terms, say R, G and B:

R : [1, 43]
G : [2, 45]
B : [10, 46]

we merge the arrays. This is a O(n) operation (There is also log K work to maintain a priority heap of K terms, where K is the number of search terms. Here, K = 3).

UPDATE: It turns out the k-item priority queue idea is discussed in Exercise 1. in section 5.2.4 of TAOCP Vol 3. (Sorting and Searching).

But we also want to differentiate the values coming from different arrays. We do this with different colored balls. So the indices from the array R are colored Red, the ones from G are colored G and the ones from B are colored Blue.

In the merged array, we insert a ball of the appropriate color at the position specified in the original array. We'll also have to use color to maintain the priority queue above so that there is only one ball per color (array) in the queue.

Result : [R, G, ...., B, ........, R, G, B]

Once we have the resultant array, we start scanning it from the left. We stop when we mark one ball of each color. Once we do that, max index - min index of this RGB combination is current least distance. We store this as the current min distance, and move one index to the right. We terminate when we can't get a complete color set anymore

This is an O(n^2) operation, since for every starting point, we might have to scan the entire array to complete the set of colors.

To complete the above example:

Step 1: R = 1, G = 2, B = 10. Dist = 10 - 1 = 9
Step 2: G = 2, B = 10, R = 43. Dist = 43 - 2 = 41
Step 3: B = 10, R = 43, G = 45. Dist = 45 - 10 = 35.
Step 4: R = 43, G = 45, B = 46. Dist = 46 - 43 = 3. ***Min***
Step 5: G = 45, B = 46, R = ? . Terminate

Does this make sense?

Wednesday, October 07, 2009

Google SearchRank?

I just realized that I've read Google literature on storing and
searching web pages, but not on storing and searching (or otherwise
analyzing) searches (user search phrases)?

What do they do here? Do they use the same technologies as for pages?
Store it in BigTable? Analyze it using MapReduce? Do they compute a
"SearchRank", i.e. the rank of a page based on how often it is clicked
when presented as part of a search result set? Do they take combine
this with the page's PageRank before displaying the TopK results?

What other analytics do they perform on search results? User-visible
"analytics" examples include Google Trends and Zeitgeist but those
are just pure aggregation and ordering.

Tuesday, May 19, 2009

getattr on a script

I wanted to dispatch calls to other functions within my script, but couldn't figure out what object to pass as argument to getattr. Turns out __name__ is the name of my module, and I can look up a reference in sys.modules.

Handy!

REF: getattr on a function - comp.lang.python | Google Groups

Sunday, February 01, 2009

Nvidia Nabs a Top Scientist From Stanford - Bits Blog - NYTimes.com

Nvidia Nabs a Top Scientist From Stanford - Bits Blog - NYTimes.com:

"Since 2005, Mr. Dally has served as the chairman of Stanford’s vaunted computer science department. More crucially for Nvidia’s immediate aims, Mr. Dally has done extensive research into “stream processing,” which is a form of computing gaining momentum with both scientists and businesses.

The stream processing model helps spread complex tasks across processors with many cores — like Nvidia’s graphics chips, which can contain hundreds of small processing engines versus just a handful of more powerful engines found on standard processors from companies like Intel. Certain types of software jobs run much faster on the many-core chips, making them desirable for organizations taking on some of the toughest tasks."

This sounds promising. Hopefully we'll see more of this innovation in the market from Nvidia after Google killed off a previous attempt by gobbling up PeakStream.

Wednesday, January 28, 2009

What Red Ink? Wall St. Paid Fat Bonuses - NYTimes.com

What Red Ink? Wall St. Paid Fat Bonuses - NYTimes.com

"Despite crippling losses, multibillion-dollar bailouts and the passing of some of the most prominent names in the business, employees at financial companies in New York, the now-diminished world capital of capital, collected an estimated $18.4 billion in bonuses for the year.

That was the sixth-largest haul on record, according to a report released Wednesday by the New York State comptroller."

It pays to be a banker, I suppose.

Thursday, January 22, 2009

The Famous Fingers Were Live, but Their Sound Was Recorded at the Inauguration - NYTimes.com

The Famous Fingers Were Live, but Their Sound Was Recorded at the Inauguration - NYTimes.com: "“It’s not something we would announce, but it’s not something we would try to hide,” Ms. Florman said. “Frankly, it would never have occurred to me to announce it. The fact they were forced to perform to tape because of the weather did not seem relevant, nor would we want to draw attention away from what we believed the news is, that we were having a peaceful transition of power from one administration to the next.”"

But of course! So much for transparency.

Bush Approves Bill Reducing Secretary of State’s Pay - The Caucus Blog - NYTimes.com

Bush Approves Bill Reducing Secretary of State’s Pay - The Caucus Blog - NYTimes.com: "Mr. Bush signed a bill reducing by several thousand dollars the salary of the secretary of state. It was a crucial step in Mrs. Clinton’s confirmation process because of a clause in the Constitution that forbids a member of Congress from being appointed to a government position which was either created or given a compensation increase during the lawmaker’s current term."

How are these things checked? Automatically? Who writes this compliance software? The same folks who implement Sarbanes-Oxley?

Saturday, January 17, 2009

An Online Farmers Market - Bits Blog - NYTimes.com

An Online Farmers Market - Bits Blog - NYTimes.com: "The local food movement has been all about buying seasonal food from nearby farmers. Now, thanks to the Web, it is expanding to include far-away farmers too."

Not really. They don't sell produce. Mostly preserved stuff.