I have been reading a lot of algorithms based on the random surfer model. I rechecked the original SimRank paper by Glen Jeh and Jennifer Widom @stanford. It's a very simple and basic idea: use the graph structure to identify similarities between graph nodes. SimRank can be defined as the probability that two nodes would meet if each kept choosing an edge at random to walk from its current node. The "random surfer" model was popularized by Google PageRank; this paper uses it for a two-node similarity problem.
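The recurrence itself is simple enough to sketch in a few lines: sim(a, b) averages the similarity over pairs of a's and b's in-neighbors, scaled by a decay factor C. A minimal, unoptimized sketch (toy graph and all names are mine, not from the paper):

```python
from itertools import product

def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive SimRank. in_neighbors maps each node to a list of its in-neighbors."""
    nodes = list(in_neighbors)
    # Base case: a node is maximally similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ins_a, ins_b = in_neighbors[a], in_neighbors[b]
            if not ins_a or not ins_b:
                new[(a, b)] = 0.0  # no in-neighbors -> no evidence of similarity
                continue
            total = sum(sim[(i, j)] for i in ins_a for j in ins_b)
            new[(a, b)] = C * total / (len(ins_a) * len(ins_b))
        sim = new
    return sim

# Toy graph in the flavor of the paper's examples:
# Univ points to two professors, each professor points to a student.
g = {"Univ": [], "ProfA": ["Univ"], "ProfB": ["Univ"],
     "StudentA": ["ProfA"], "StudentB": ["ProfB"]}
scores = simrank(g)
print(scores[("ProfA", "ProfB")])        # -> 0.8 (both pointed to by Univ)
print(scores[("StudentA", "StudentB")])  # -> 0.64 (similar via similar professors)
```

In practice you would iterate until the scores change by less than a cutoff threshold instead of a fixed iteration count, which is what guarantees convergence in the paper's formulation.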
Surfing is kool by me, I only fear flying you Fooool !!
The problems with this approach, I think, are:
1) Graph structure: The graph needs to be very strongly structured for SimRank to give any meaningful result. Unpopular nodes, or nodes with low fanout, would be categorized as similar to many unworthy nodes because of a few common neighbor values.
2) I think SimRank would suffer in a diverse graph scenario where both dense and sparse sub-graphs are part of one big graph.
The good parts about SimRank are:
1) The approach is context-free and can be applied to different domains.
2) SimRank is iterative, can be parallelized easily, and is guaranteed to converge (with cutoff thresholds).
Similar interesting problems:
1) Given a graph G and two nodes (s, d), find the neighbors of 's' that have the maximum influence on 'd'.
2) Finding communities in a graph.
3) Sub-graph matching?? This problem would be hard to model using the random surfer; it would be interesting to try, though.
Alright !! Need to go and convert the van for assault !!
Thursday, May 8, 2008
Tuesday, May 6, 2008
How To Read A Research Paper
I have been thinking about starting a technical blog for a while, so without further ado, here is my first post.
I have been "trying" to read lots of papers lately. I feel I am not doing justice to all the papers I have collected, and just thinking about their printing cost to the company makes me feel really, really guilty.
I went back and read a few articles about how to actually read research papers. One good article is by Philip Fong Link.
I will use his method of reading papers to read a few of the papers suggested in a really remarkable post at http://glinden.blogspot.com/2008/05/random-walks-of-click-graph.html
I have chosen a paper by Hector Garcia-Molina et al., "Simrank++: Query Rewriting through Link Analysis".
Problem Space
---------------
The paper talks about the query rewriting problem, i.e., given a new query 'q', finding a set 'Q' of similar queries with known clicks from historical query-click data.
The problem space makes sense for search-marketing folks who want to find correlations between different query/keyword terms to maximize their CTR yield.
Can the problem be extended to search queries in general? I mean, finding relations like extending the query "football" to "soccer", etc.
Importance of Problem space
----------------------------
A good solution here can increase CTR for search ads and can mean direct money :)
Paper Type
----------
This paper improves an existing algorithm Simrank.
Background
------------
Simrank :: SimRank is an approach for finding similarity between different objects. It is a domain-independent, graph-structure-based approach that uses the similar/common-friends principle to measure similarities between objects. Excerpt from the original SimRank paper:
"two objects are similar if they are related to similar objects", e.g., on the web, two pages with common hyperlinks, or papers with shared references. [I will be analyzing this paper next]
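In the query-rewriting setting, the same principle runs on a bipartite query-ad click graph: two queries are similar if they click on similar ads, and two ads are similar if they are clicked by similar queries. A minimal sketch of that coupled iteration (the toy click data and all names are my own, not from the paper):

```python
C = 0.8  # decay factor

# query -> list of ads it has clicks on (toy data)
clicks = {"camera": ["ad1", "ad2"],
          "digital camera": ["ad2", "ad3"],
          "flower": ["ad4"]}

# Invert to get ad -> list of queries that clicked it.
ads = {}
for q, ad_list in clicks.items():
    for ad in ad_list:
        ads.setdefault(ad, []).append(q)

q_sim = {(a, b): 1.0 if a == b else 0.0 for a in clicks for b in clicks}
a_sim = {(x, y): 1.0 if x == y else 0.0 for x in ads for y in ads}

for _ in range(10):
    # Query similarity from the previous round's ad similarity, and vice versa.
    new_q = {}
    for a in clicks:
        for b in clicks:
            if a == b:
                new_q[(a, b)] = 1.0
            else:
                total = sum(a_sim[(x, y)] for x in clicks[a] for y in clicks[b])
                new_q[(a, b)] = C * total / (len(clicks[a]) * len(clicks[b]))
    new_a = {}
    for x in ads:
        for y in ads:
            if x == y:
                new_a[(x, y)] = 1.0
            else:
                total = sum(q_sim[(p, r)] for p in ads[x] for r in ads[y])
                new_a[(x, y)] = C * total / (len(ads[x]) * len(ads[y]))
    q_sim, a_sim = new_q, new_a

print(q_sim[("camera", "digital camera")])  # > 0: they share ad2
print(q_sim[("camera", "flower")])          # -> 0.0: no connecting path
```

Here "camera" and "digital camera" become similar through the shared ad2, while "flower" stays at zero because no path in the click graph connects it to the others.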
Contributions
----------------
1) Application of the SimRank algorithm in the sponsored-search space
2) Improvements on the existing algorithm
3) Experimental evaluation on real query-click data from Yahoo
Simrank improvements
--------------------------------
1) Evidence based: The authors argue, citing cases of complete bipartite graphs, that in some cases SimRank doesn't give the optimal pairs, and that a new "evidence factor" can improve the score.
2) Weighted edge traversal: The authors then devise a further SimRank improvement by considering the weight calculated in step 1 for the random surfer in SimRank.
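The evidence factor can be sketched roughly as follows. The formula here is my recollection of the paper's idea, so treat it as an assumption: a geometric series over the number of common neighbors, so one shared neighbor gives weak evidence and many shared neighbors push the evidence toward 1.

```python
def evidence(common_neighbor_count):
    # evidence(a, b) = sum_{i=1..n} (1/2)^i, n = number of common neighbors.
    # Assumed form, reconstructed from memory of the Simrank++ paper.
    return sum(0.5 ** i for i in range(1, common_neighbor_count + 1))

def evidence_weighted_score(simrank_score, neighbors_a, neighbors_b):
    # Scale the plain SimRank score by how much shared-neighbor evidence exists.
    n = len(set(neighbors_a) & set(neighbors_b))
    return evidence(n) * simrank_score

print(evidence(1))  # -> 0.5
print(evidence(4))  # -> 0.9375
```

The effect is that a pair supported by a single common click is discounted, while a pair with many common clicks keeps almost its full SimRank score, which is what addresses the complete-bipartite anomaly the authors describe.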
Results
--------
1) Coverage improvement: same as SimRank
2) Precision improvement: 5% improvement??
3) Rewriting depth improvement: 5% over SimRank
4) Correctness: 30-40% improvement over SimRank. +2
Conclusions
-------------
Query rewrite effectiveness: The query rewrites are measured using two major methodologies:
1) Comparison with manual scores from experts: not based on real click data, and subject to personal bias.
2) Checking rewrite quality by removing some links.
Evaluating query rewrites looks like a difficult problem; I would like to see someone do A/B testing to really build a quality test suite.