Friday, June 27, 2008

Hadoop and Table Joins !!

Continuing from my post yesterday, I want to talk about doing table joins on Hadoop. The current model of Hadoop presents a problem for doing joins across different tables/datasets.

The key problem is that both the mapper and the reducer take in only a single key/value pair type. A simple workflow would look like this.



The mapper input is text files, and for Hadoop to do an intelligent join we need the mapper to distinguish between two different files just by looking at the data. I have been using a hack to distinguish my data types based on the number of columns in my CSV data files. It breaks when I have to join two tables with the same number of columns and types.

Can Hadoop support multiple Mapper classes??

If we can make Hadoop support multiple mappers, I can define one class to handle one kind of data and another class for the other kind. As long as they churn out the same 'key' type for the intermediate output that feeds the sorter, we should be good.
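A minimal sketch of the idea in plain Python (no Hadoop; the table contents, column names, and function names are all invented for illustration): two "mapper" functions, one per dataset, tag each record with its source, and the "reducer" joins the tagged records that share a key.

```python
from collections import defaultdict

# Hypothetical sample data: two "tables" joined on user id.
users = ["1,alice", "2,bob"]           # id,name
orders = ["1,book", "1,pen", "2,cup"]  # user_id,item

def user_mapper(line):
    """Mapper for the users table: tag each record with its source."""
    uid, name = line.split(",")
    return uid, ("USER", name)

def order_mapper(line):
    """Mapper for the orders table: same key type, different tag."""
    uid, item = line.split(",")
    return uid, ("ORDER", item)

def join_reducer(key, values):
    """Reducer: pair every user record with every order record for this key."""
    names = [v for tag, v in values if tag == "USER"]
    items = [v for tag, v in values if tag == "ORDER"]
    return [(key, name, item) for name in names for item in items]

# Simulated shuffle/sort: group intermediate pairs by key.
grouped = defaultdict(list)
for line in users:
    k, v = user_mapper(line)
    grouped[k].append(v)
for line in orders:
    k, v = order_mapper(line)
    grouped[k].append(v)

joined = []
for k in sorted(grouped):
    joined.extend(join_reducer(k, grouped[k]))
print(joined)
```

The tag replaces my column-counting hack: the reducer no longer has to guess which table a record came from, even when both tables have the same shape.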

I will try to raise this in the Hadoop forums !! What do folks think about this ??

Thursday, June 26, 2008

Ad-hoc Hadoop


I am back working on my favorite software platform "Hadoop". I have been working on writing applications on Hadoop for quite some time and I love the immense power of easily available, reliable and scalable distributed computing at my disposal.

I have a few things in mind though. Sometimes I want to do a lot of ad-hoc analysis, and at those times the bottleneck of having to write a script or some Java code is still a hassle. An SQL-based database query rocks for that purpose. I have to agree that it's my relative inexperience with Python and my laziness which are the problem here, not Hadoop, BUT I can still see that an ad-hoc query tool on top of Hadoop would be a great help. Pig Latin from Yahoo was a good step in that direction, and I liked it before I got frustrated with its bugs and non-working features.

Hadoop can become a very powerful backend analysis tool for any data-oriented company. The market opportunity is very big as well, coz I guess the world is full of lazy people like me :)

Monday, June 23, 2008

cog in the wheel ...

Today is one of those days when you look back and think about your life .. (yes !! even superheroes like B.A. Baracus have these moments of introspection).

My alternate persona has been taking center stage for a long time now; he is lazy, too comfortable, and too busy to dream. I am feeling so tied up, like just another cog in a small wheel. I need to get rid of this guy ASAP.

Something needs to change, and the first change needs to be in me. I am not going to lie back and enjoy a regular job life. I am not going to be a regular guy, not anymore. Whatever it takes, BA is waking up!! I need the passion back. Ooh, I miss India :)

Wednesday, June 18, 2008

Cache Cache Cache !!

Hi,

I have been looking at and hearing about Memcached a lot lately. It looks like the universal answer to scaling problems. Every database is being surrounded by a memcached layer. Every reusable object is being pushed to a memcached server.

I have looked a bit deeper into memcached; here is an excellent post from Brian Aker and Alan Kasindorf. I am really excited about the persistence angle (still in the works) for memcached. It might provide a simple solution for a distributed, highly scalable key-value map storage kind of system, which could be used in tons of places. Amazon Dynamo is another solution I am really interested and excited about.

Are there other solutions ?? A distributed, scalable key-value store should not be that hard to build, or am I missing something here ?

Monday, June 2, 2008

PageRank for a social network

Lately, I have been thinking about doing a PageRank over a social network. The problem definition would be: given a social-networking graph, find the top respected/ranked person for a given field/industry.

The solution could have immense value for recruiters (cutting search time for potential candidates), hedge funds (finding the domain experts), or marketers (finding the trend-setters and heavy influencers) to market their new gizmos or ideas in a network.

To define the problem in more concrete terms:

Problem
-----------

Given a social graph (nodes / attributes / connection info), find a ranking/respect value for members in specific domains.

It gives birth to one other question in my mind: is one single global PageRank enough, or do we need a topic/domain-based PageRank in general ??
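To make the problem concrete, here is a toy power-iteration PageRank over a tiny invented "who endorses whom" graph (names and edges are made up). A topic/domain-based variant would bias the teleport term toward nodes known to be in the topic instead of spreading it uniformly.

```python
# Toy endorsement graph: node -> list of nodes it endorses.
graph = {
    "ana": ["bob", "cid"],
    "bob": ["cid"],
    "cid": ["ana"],
    "dan": ["cid"],   # dan endorses cid, but nobody endorses dan
}

def pagerank(graph, damping=0.85, iters=50):
    """Plain power iteration; assumes every node has at least one out-edge."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iters):
        # Teleport term: every node gets a small baseline share.
        new = {node: (1 - damping) / n for node in graph}
        # Each node splits its current rank evenly over its out-edges.
        for node, outs in graph.items():
            share = rank[node] / len(outs)
            for out in outs:
                new[out] += damping * share
        rank = new
    return rank

r = pagerank(graph)
print(max(r, key=r.get))  # 'cid' collects the most endorsement weight
```

Note how 'dan' ends up last despite endorsing the top node: rank flows along incoming edges only, which matches the "respect" intuition for a social graph.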

Thursday, May 8, 2008

Random Surfer Model revisited

I have been reading a lot of algorithms based on the random surfer model. I rechecked the original SimRank paper from Glen Jeh and Jennifer Widom @stanford. It's a very simple and basic idea of using the graph structure to identify similarities between graph nodes. It defines similarity as the probability that two nodes would meet if they each keep choosing an edge at random to walk from their current node. The "random surfer" model was popularized by Google PageRank; this paper uses it for a two-node similarity problem.
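The basic iteration is easy to sketch. Here is a toy version over a tiny undirected graph of my own making (the paper itself works with in-neighbors on directed graphs): a pair's score is the decayed average of the scores of its neighbor pairs.

```python
import itertools

# Toy undirected graph: 'a' and 'b' share both their neighbors,
# so SimRank should rate them highly similar. Edges are invented.
neighbors = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["a", "b"],
    "d": ["a", "b"],
}

def simrank(neighbors, C=0.8, iters=10):
    nodes = list(neighbors)
    # s(u, u) = 1; every other pair starts at 0.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iters):
        new = {}
        for u, v in itertools.product(nodes, nodes):
            if u == v:
                new[(u, v)] = 1.0
                continue
            nu, nv = neighbors[u], neighbors[v]
            # Average similarity over all neighbor pairs, decayed by C.
            total = sum(sim[(i, j)] for i in nu for j in nv)
            new[(u, v)] = C * total / (len(nu) * len(nv))
        sim = new
    return sim

sim = simrank(neighbors)
print(round(sim[("a", "b")], 3))  # approaches 2/3 for this graph
```

Each sweep is independent per pair, which is why the iteration parallelizes so naturally (and, as noted below, why it maps well onto a cutoff-threshold convergence test).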

Surfing is kool by me, I only fear flying you Fooool !!

The problems with this approach, I think, are:
1) Graph structure: the graph needs to be very strongly structured for SimRank to give any meaningful result. Unpopular nodes, or nodes with low fanout, would be categorized as similar to many unworthy nodes because of a few common neighbor values.
2) I think SimRank would suffer in a diverse graph scenario where both dense and sparse sub-graphs are part of one big graph.

The good parts about SimRank are:
1) The approach is context-free and can be applied to different domains.
2) SimRank is iterative, can be parallelized easily, and is guaranteed to converge (with cutoff thresholds).

Similar interesting problems:
1) Given a graph G and two nodes (s, d), find the neighbors of 's' who have maximum influence on 'd'.
2) Finding communities in a graph.
3) Sub-graph matching ?? This problem would be hard to model using a random surfer; it would be interesting to try though.

Alright !! Need to go and convert the van for assault !!

Tuesday, May 6, 2008

How To Read A Research Paper

I have been thinking about starting a technical blog for a while, so without further ado, here is my first post.

I have been "Trying" to read lots of papers lately. I feel I am not doing justice to all the papers I have collected, and just thinking about their printing cost to the company makes me feel really, really guilty.

I went back and read a few articles about how to actually read research papers. One good article is by Philip Fong: Link.

I will use his method of reading papers to read a few of the papers suggested in a really remarkable post at http://glinden.blogspot.com/2008/05/random-walks-of-click-graph.html

I have chosen the paper by Hector Garcia-Molina et al., "Simrank++: Query Rewriting through Link Analysis".

Problem Space
---------------

The paper talks about the query rewriting problem, i.e. given a new query 'q', finding a set of similar queries 'Q' with known clicks from historical query-click data.

The problem space makes sense for search-marketing folks who want to find correlations between different query/keyword terms to maximize their CTR yield.

Can the problem be extended to search queries in general? I mean, finding relations like query:"football" and extending the search to "soccer" etc.

Importance of Problem space
----------------------------

A good solution here can increase CTR for search ads and can mean direct money :)

Paper Type
----------

This paper improves an existing algorithm, SimRank.

Background
------------

Simrank :: SimRank is an approach for finding similarity between different objects. It is a domain-independent, graph-structure-based approach which uses the similar/common-friends principle to measure similarities between objects. Excerpt from the original SimRank paper:
"two objects are similar if they are related to similar objects", e.g. on the web, two pages with common hyperlinks, or papers with shared references. [I will be analyzing this paper next]

Contributions
----------------
1) Application of the SimRank algorithm in the sponsored-search space
2) Improvements on the existing algorithm
3) Experimental evaluation on real query-click data from Yahoo

Simrank improvements
--------------------------------
1) Evidence based: The authors argue, citing cases of complete bipartite graphs, that SimRank doesn't give the optimal pair score in some cases, and that a new "evidence factor" can improve the score.

2) Weighted edge traversal: The authors then devise a SimRank improvement by considering the weights from step 1 for the random surfer in SimRank.
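As I understand the paper, the evidence factor for a pair is a geometric sum over their common neighbors, so it climbs toward 1 as the shared evidence grows; a tiny sketch (function name mine):

```python
def evidence(common_neighbors):
    """Evidence factor from the paper: sum_{i=1..n} (1/2)^i,
    where n is the number of common neighbors of the pair.
    One common neighbor gives 0.5; more neighbors push it toward 1."""
    return sum(0.5 ** i for i in range((1), common_neighbors + 1))

for n in (1, 2, 3, 5):
    print(n, evidence(n))

# The evidence-weighted score would then be, roughly:
#   score(a, b) = evidence(|N(a) & N(b)|) * simrank(a, b)
# so pairs with lots of shared neighbors keep more of their SimRank value.
```

This damps exactly the complete-bipartite pathology the authors cite: pairs judged similar on the strength of a single common neighbor get their score halved, while well-evidenced pairs are barely touched.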


Results
--------
1) Coverage improvement: same as SimRank
2) Precision improvement: 5% improvement ??
3) Rewriting depth improvement: 5% over SimRank
4) Correctness: 30-40% improvement over SimRank. +2

Conclusions
-------------

Query rewrite effectiveness: The query rewrites are measured using two major methodologies:
1) Comparison with manual scores from experts: not based on real click-data, and subject to personal bias.
2) Checking rewrite quality by removing some links.

Checking query rewrites looks like a difficult problem; I would like to see someone do A/B testing to really build a quality test suite.