Friday, June 27, 2008

Hadoop and Table Joins !!

In continuation of my post yesterday, I want to talk about doing table joins on Hadoop. The current Hadoop model makes it hard to join different tables/datasets.

The key problem is that both the mapper and the reducer take in only a single key/value pair at a time. A simple workflow would look like this.



The mapper input is text files. For Hadoop to do an intelligent join, we need the mapper to distinguish between two different files just by looking at the data. I have been using a hack that distinguishes my data types by the number of columns in my CSV data files. It breaks when I have to join two tables with the same number of columns and the same types.

Can Hadoop support multiple Mapper classes??

If we can make Hadoop support multiple mappers, I can define one class to handle one type of data and another class to handle the other kind. As long as they both churn out the same 'key' type for the intermediate output that feeds the sorter, we should be good.
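To make the idea concrete, here is a minimal local sketch in plain Python, not the Hadoop API. It assumes two made-up tables (`USERS` and `ORDERS`) that both have two columns, so the column-count hack would fail. Each file gets its own mapper, each mapper tags its records, and both emit the same join key so a single reducer can pair them up. All names here are hypothetical and only for illustration.

```python
from collections import defaultdict

# Hypothetical sample tables. Both have two columns, so counting
# columns cannot tell them apart.
USERS = ["1,alice", "2,bob"]
ORDERS = ["1,book", "1,pen", "2,lamp"]


def users_mapper(line):
    """Mapper for the users file: emit (join key, tagged record)."""
    user_id, name = line.split(",")
    yield user_id, ("user", name)


def orders_mapper(line):
    """Mapper for the orders file: same key type, different tag."""
    user_id, item = line.split(",")
    yield user_id, ("order", item)


def join_reducer(key, values):
    """Pair every user record with every order record for one key."""
    names = [v for tag, v in values if tag == "user"]
    items = [v for tag, v in values if tag == "order"]
    for name in names:
        for item in items:
            yield key, name, item


def run_join(inputs):
    """Simulate map, shuffle/sort and reduce locally.

    inputs is a list of (lines, mapper) pairs, one pair per file.
    """
    groups = defaultdict(list)
    for lines, mapper in inputs:
        for line in lines:
            for key, value in mapper(line):
                groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(join_reducer(key, groups[key]))
    return result


if __name__ == "__main__":
    for row in run_join([(USERS, users_mapper), (ORDERS, orders_mapper)]):
        print(row)
```

The tag on each value is what replaces the column-count hack: the reducer never has to guess which file a record came from.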

I will try to raise this on the Hadoop forums !! What do folks think about this ??

Thursday, June 26, 2008

Ad-hoc Hadoop


I am back working on my favorite software platform "Hadoop". I have been working on writing applications on Hadoop for quite some time and I love the immense power of easily available, reliable and scalable distributed computing at my disposal.

I have a few things in mind though. Sometimes I want to do a lot of ad-hoc analysis, and at those times having to write a script or some Java code is still a hassle. SQL-based database queries rock for that purpose. I have to agree that it's my relative inexperience with Python and my laziness that are the problem here, not Hadoop, BUT I can still see that an ad-hoc query tool on top of Hadoop would be a great help. Pig Latin from Yahoo was a good step in that direction, and I liked it before I got frustrated with its bugs and non-working features.

Hadoop can become a very powerful backend-analysis tool for any data-oriented company. The market opportunities are very, very big as well, because I guess the world is full of lazy people like me :)

Monday, June 23, 2008

cog in the wheel ...

Today is one of those days when you look back and think about your life .. (yes !! even super heroes like B.A. Baracus have these moments of introspection).

My alternate persona has been taking center stage for a long time now. He is lazy, too comfortable and too busy to dream. I am feeling so tied up, like just another cog in a small wheel. I need to get rid of this guy ASAP.

Something needs to change, and the first change needs to be in me. I am not going to sit back and enjoy a regular job life. I am not going to be a regular guy, not anymore. Whatever it takes, BA is waking up!! I need the passion back. Ooh, I missed India :)

Wednesday, June 18, 2008

Cache Cache Cache !!

Hi,

I have been reading and hearing about Memcached a lot lately. It looks like the universal answer to scaling problems. Every database is being surrounded by a memcached layer. Every reusable object is being pushed to a memcached server.
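A rough sketch of what that "memcached layer around the database" usually looks like, often called the cache-aside pattern. This is plain Python with dicts standing in for both the database and the memcached client, so every name here (`CacheAsideStore`, `db_reads`) is made up for illustration and none of it is the real memcached API.

```python
class CacheAsideStore:
    """Read-through-on-miss cache in front of a slower backing store."""

    def __init__(self, database):
        self.database = database  # stand-in for the real database
        self.cache = {}           # stand-in for a memcached client
        self.db_reads = 0         # counts how often we hit the database

    def get(self, key):
        # Serve from the cache when possible.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read the database and remember the answer.
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def set(self, key, value):
        # Write to the database and drop the stale cached copy.
        self.database[key] = value
        self.cache.pop(key, None)


if __name__ == "__main__":
    store = CacheAsideStore({"user:1": "alice"})
    store.get("user:1")
    store.get("user:1")
    print(store.db_reads)
```

Repeated reads of the same key only touch the database once, which is the whole reason the layer helps with scaling.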

I have looked a bit deeper into memcached; here is an excellent post from Brian Aker and Alan Kasindorf. I am really excited about the persistence angle (still in the works) for memcached. It might provide a simple solution for distributed, highly scalable key-value map storage, which can be used in tons of places. Amazon Dynamo is another solution I am really interested in and excited about.

Are there other solutions ?? A distributed, scalable key-value store should not be that hard to build, or am I missing something here ?

Monday, June 2, 2008

PageRank for a social network

Lately, I have been thinking about doing a PageRank over a social network. The problem would be: given a social-networking graph, find the most respected/top-ranked person for a given field/industry.

The solution can have immense value for recruiters (cutting search time for potential candidates), hedge funds (finding the domain experts) or marketers (finding the trend-setters and heavy influencers) who want to market their new gizmos or ideas in a network.

To define the problem in more concrete terms:

Problem
-----------

Given a social graph (nodes / attributes / connection info), find a ranking/respect value for members in specific domains.

This raises one more question in my mind: is one single global PageRank enough, or do we need a topic/domain-based PageRank in general ??
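To play with the global vs. topic question, here is a small sketch in plain Python. It assumes a tiny made-up graph (`GRAPH`) where an edge from A to B means "A follows/endorses B", and it models the topic version by biasing the random jump (the `teleport` argument) toward members tagged with that field. That is one possible way to do it, not the only one, and all names and values are hypothetical.

```python
def pagerank(graph, teleport=None, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [nodes it points to]}.

    teleport maps nodes to jump weights. None means a uniform jump,
    which gives the ordinary global PageRank.
    """
    nodes = list(graph)
    if teleport is None:
        teleport = {n: 1.0 / len(nodes) for n in nodes}
    else:
        total = sum(teleport.values())
        teleport = {n: teleport.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        # Every node first gets its share of the random jump.
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node with no outgoing edges: spread its rank
                # according to the jump weights.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
        rank = new_rank
    return rank


# Hypothetical social graph: "ann": ["bob"] means ann endorses bob.
GRAPH = {
    "ann": ["bob"],
    "bob": ["ann"],
    "cat": ["ann", "dan"],
    "dan": ["cat"],
    "eve": ["dan"],
}

if __name__ == "__main__":
    global_rank = pagerank(GRAPH)
    # Assume eve and dan are the members tagged with the field we care about.
    topic_rank = pagerank(GRAPH, teleport={"eve": 1.0, "dan": 1.0})
    for name in GRAPH:
        print(name, round(global_rank[name], 3), round(topic_rank[name], 3))
```

On this toy graph the two rankings disagree, which hints that a single global score may hide who is respected inside a particular field.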