jebu.net

thoughts scribbles images from silicon plateau
  • Really Real Time view to Twitter – current state

    Posted on December 10th, 2009 jebu 17 comments

    So I was stuck at home today with a cold, and thought I would take an updated capture of the current state of the real-time Twitter client that I had previewed some time back. Since that last capture lots of things have gone in. Primarily it is integrated with all of the new updates from Twitter, i.e. lists, retweet and geo support. The tweet stream is now not a drip from the firehose but the stream from an explicit follow of the users whom I follow or who are on the lists that I follow. Tweets pop up under the appropriate list's tab and also under the Merged tab, which is a combined list view. Any tweet can be picked for a retweet. My tweets that were retweeted, and replies to me, appear under a separate list.

    Geocoded tweets are called out by a special marker in the footer of the tweet; on hover it displays the map and reverse-geocoded location information, thanks to YQL and the APIs from Flickr and Yahoo! Maps. When used from Firefox it determines your location and adds it to the tweets that you send.

    There are some client-side smarts too, like URL previews with information retrieved from TweetMeme, a preview of the tweet to which a reply was sent, URL shortening when sending, etc.

    Really realtime twitter client – current state from Jebu Ittiachen on Vimeo.

  • TweetALoc – Sync your geocoded tweets to FireEagle

    Posted on November 29th, 2009 jebu 11 comments

    This was always there at the back of my mind ever since Twitter announced they were adding geotagging to status updates. So here it is: TweetALoc.

    Once you authorize it by linking your Twitter account and your FireEagle account, it listens to your tweet stream; if geo information is present in any of your tweets, it sends it along to FireEagle. The update usually happens under 20 seconds from you tweeting. No polling involved here. Once authorized, your Twitter id is added to TweetALoc's follow list within 5 minutes, so the first update will only be after that, and continuous from then on. I guess right now there are only a handful of Twitter clients which add location to your tweets; the Twitter blog post on the geotagging release has a list. I use one of my own, which is why you see my tweets signed from nowwhat.in.
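    The core decision is simple enough to sketch. This hypothetical Python version (TweetALoc itself is Erlang, and `update_fireeagle` here just stands in for the real FireEagle call) shows the idea; the geo payload shape is what the 2009 Twitter API documented:

```python
import json

def handle_tweet(line, update_fireeagle):
    """Forward a geotagged status from the stream to FireEagle."""
    tweet = json.loads(line)
    geo = tweet.get("geo")
    # in the 2009 API, geo looks like {"type": "Point", "coordinates": [lat, lon]}
    if geo and geo.get("type") == "Point":
        lat, lon = geo["coordinates"]
        update_fireeagle(tweet["user"]["screen_name"], lat, lon)
        return True
    return False
```

    Anything without a geo field falls through, which matches the behaviour described above: only geotagged tweets trigger an update.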

    Just in case you feel that you need to stop updating into FireEagle, I have also provided a de-authorizing facility. The FireEagle tokens associated with your Twitter id are removed, and your id is removed from the follow list. You can of course also go into FireEagle and remove the permissions for TweetALoc.

    Behind the scenes this extends the Erlang Twitter stream boilerplate that I had talked about in a previous post. I have used the Erlang OAuth library for talking OAuth to FireEagle and the PHP OAuth libraries for the web interface portions. That logo was GIMPed by me :)

  • Really Real Time view to Twitter

    Posted on October 25th, 2009 jebu 7 comments

    Really Real Time view to Twitter from Jebu Ittiachen on Vimeo.

    This is a screen capture showing a hack put together to make the Twitter display real time. It captures tweets from the stream API and filters out only the ones from people you follow. The capture shows a couple of users who I'm testing with, @scobleizer and @zee, because they follow a lot of people.

    Brought to you by the Twitter stream API, XMPP PubSub (ejabberd), the exmpp Erlang client library for XMPP, and the Strophe JS library for XMPP BOSH.

  • Erlang tap to the Twitter stream

    Posted on September 18th, 2009 jebu 8 comments

    The Erlang http module works really well for plugging into the Twitter stream. The async option of the http module delivers each chunk of a chunk-encoded HTTP response as a callback to the async request handler. And guess what: the Twitter stream API sends each tweet as a chunk, and in the JSON version each chunk is a self-contained JSON document. Combine this with Erlang's lightweight processes and Twitter stream processing just flows along.

    Here is the code which taps into the Twitter data stream. I use the Erlang JSON parser from LShift for processing the tweets.

    -module(twitter_stream).
    -author('jebu@jebu.net').
    %% 
    %% Copyright (c) 2009, Jebu Ittiachen
    %% All rights reserved.
    %% 
    %% Redistribution and use in source and binary forms, with or without modification, are
    %% permitted provided that the following conditions are met:
    %% 
    %%    1. Redistributions of source code must retain the above copyright notice, this list of
    %%       conditions and the following disclaimer.
    %% 
    %%    2. Redistributions in binary form must reproduce the above copyright notice, this list
    %%       of conditions and the following disclaimer in the documentation and/or other materials
    %%       provided with the distribution.
    %% 
    %% THIS SOFTWARE IS PROVIDED BY JEBU ITTIACHEN ``AS IS'' AND ANY EXPRESS OR IMPLIED
    %% WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
    %% FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL JEBU ITTIACHEN OR
    %% CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
    %% CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
    %% SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
    %% ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
    %% NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
    %% ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
    %% 
    %% The views and conclusions contained in the software and documentation are those of the
    %% authors and should not be interpreted as representing official policies, either expressed
    %% or implied, of Jebu Ittiachen.
    %%
    %% API
    -export([fetch/1, fetch/3, process_data/1]).
     
    % single arg version expects url of the form http://user:password@stream.twitter.com/1/statuses/sample.json
    % this will spawn the 3 arg version so the shell is free
    fetch(URL) ->
      spawn(twitter_stream, fetch, [URL, 5, 30]).
     
    % 3 arg version expects url of the form http://user:password@stream.twitter.com/1/statuses/sample.json  
    % retry - number of times the stream is reconnected
    % sleep - secs to sleep between retries.
    fetch(URL, Retry, Sleep) when Retry > 0 ->
      % setup the request to process async
      % and have it stream the data back to this process
      try http:request(get, 
                        {URL, []},
                        [], 
                        [{sync, false}, 
                         {stream, self}]) of
        {ok, RequestId} ->
          case receive_chunk(RequestId) of
            {ok, _} ->
          % stream ended normally, retry
              timer:sleep(Sleep * 1000),
              fetch(URL, Retry - 1, Sleep);
            {error, unauthorized, Result} ->
              {error, Result, unauthorized};
            {error, timeout} ->
              timer:sleep(Sleep * 1000),
              fetch(URL, Retry - 1, Sleep);
            {_, Reason} ->
              error_logger:info_msg("Got some Reason ~p ~n", [Reason]),
              timer:sleep(Sleep * 1000),
              fetch(URL, Retry - 1, Sleep)
          end;
        _ ->
          timer:sleep(Sleep * 1000),
          fetch(URL, Retry - 1, Sleep)
      catch 
        _:_ -> 
          timer:sleep(Sleep * 1000),
          fetch(URL, Retry - 1, Sleep)
      end;
    %
    fetch(_, Retry, _) when Retry =< 0 ->
      error_logger:info_msg("No more retries, done processing fetch~n"),
      {error, no_more_retry}.
    %
    % this is the tweet handler, presumably you could do something useful here
    %
    process_data(Data) ->
      error_logger:info_msg("Received tweet ~p ~n", [Data]),
      ok.
     
    %%====================================================================
    %% Internal functions
    %%====================================================================
    receive_chunk(RequestId) ->
      receive
        {http, {RequestId, {error, Reason}}} when (Reason =:= etimedout) orelse (Reason =:= timeout) -> 
          {error, timeout};
        {http, {RequestId, {{_, 401, _} = Status, Headers, _}}} -> 
          {error, unauthorized, {Status, Headers}};
        {http, {RequestId, Result}} -> 
          {error, Result};
     
        %% start of streaming data
        {http,{RequestId, stream_start, Headers}} ->
          error_logger:info_msg("Streaming data start ~p ~n",[Headers]),
          receive_chunk(RequestId);
     
        %% streaming chunk of data
        %% this is where we will be looping around, 
    %% we spawn this off to a separate process as soon as we get the chunk and go back to receiving the tweets
        {http,{RequestId, stream, Data}} ->
          spawn(twitter_stream, process_data, [Data]),
          receive_chunk(RequestId);
     
        %% end of streaming data
        {http,{RequestId, stream_end, Headers}} ->
          error_logger:info_msg("Streaming data end ~p ~n", [Headers]),
          {ok, RequestId}
     
      %% timeout
      after 60 * 1000 ->
        {error, timeout}
     
      end.
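    The process_data/1 above only logs the tweet; a real handler would decode the chunk's JSON. Sketched here in Python for brevity (the Erlang version would go through the LShift JSON parser mentioned above), the idea is:

```python
import json

def process_data(chunk):
    # each chunk from the streaming API is one self-contained JSON document
    try:
        tweet = json.loads(chunk)
    except ValueError:
        return None  # keep-alive newlines or junk: ignore them
    return tweet.get("user", {}).get("screen_name"), tweet.get("text")
```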
  • Simhashing in Erlang – beauty with binary comprehension

    Posted on August 23rd, 2009 jebu 15 comments

    Simhashing is a popular technique to detect near duplicates in content. Given two files, the similarity of their simhashes gives a mathematical way to compute the similarity of the documents. The algorithm works like this:

    • split the content into a set of features
    • for each feature, compute a hash of fixed width
    • for each bit in the hash, let 1 be a positive increment and 0 a negative increment
    • sum up the bits of the hashes using the above translation
    • for each bit position, if the sum is positive the simhash has 1 in the corresponding bit position
    • for a negative sum the resultant hash has 0 in that bit position

    Working this out in a simplified form: if the text is “Twitter is littered with spam”, I break this into features which are the individual words “twitter” “is” “littered” “with” “spam”. Suppose the hashes for these are 10101, 11001, 11000, 01100, 01000. This translates to 1 -1 1 -1 1, 1 1 -1 -1 1, 1 1 -1 -1 -1, -1 1 1 -1 -1, -1 1 -1 -1 -1. Adding these up per position gives 1 3 -1 -5 -1. The simhash for this is 11000.
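    The steps above are short enough to express directly. This minimal Python sketch (using the example's 5-bit hashes rather than real feature hashes) reproduces the worked example:

```python
def simhash(feature_hashes, width):
    totals = [0] * width
    for h in feature_hashes:
        for i in range(width):
            # bit i (counting from the most significant): 1 => +1, 0 => -1
            totals[i] += 1 if (h >> (width - 1 - i)) & 1 else -1
    result = 0
    for t in totals:
        # positive running total => 1 in that bit position
        result = (result << 1) | (1 if t > 0 else 0)
    return result

hashes = [0b10101, 0b11001, 0b11000, 0b01100, 0b01000]
simhash(hashes, 5)  # → 0b11000, per-position sums 1, 3, -1, -5, -1
```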

    The Hamming distance between the hashes of two documents can then be used to figure out how similar the documents are. This presentation has a detailed explanation of the simhashing technique.
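    Computing that distance is just an XOR and a bit count; in Python, for instance:

```python
def hamming(a, b):
    # number of bit positions where the two simhashes differ
    return bin(a ^ b).count("1")

hamming(0b11000, 0b11010)  # → 1: the documents are near-duplicates
```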

    Now tweets are not an ideal place to experiment with simhashes, given we are limited to 140 characters. But it's a pretty effective technique for deciding similarity, and it can be expressed very easily. The power of Erlang binary comprehensions and the bit syntax allows us to calculate simhashes in a very nifty way. Without further blabber, here is an Erlang implementation.

    -module(simhash).
     
    -define(HASH_RANGE, 1 bsl 128).
    -define(HASH_ACCU, 12).
    -define(HASH_WIDTH, 128).
    -export([hash_file/1]).
     
    hash_file(File) ->
      {ok, Binary} = file:read_file(File),
      Tokens = re:split(Binary, "\\W"),
      calculate_simhash(Tokens).
     
    calculate_hash(A) ->
      %% Hash = erlang:phash2(A, ?HASH_RANGE),
      << Hash:(?HASH_WIDTH) >> = erlang:md5(A),
      Hash.
     
    calculate_simhash(Tokens) ->
      FeatureHashes = [calculate_hash(A) || A <- Tokens, A =/= <<>>],
      {HashAcc, Len} = lists:foldl(fun accumulate_simhash/2, {0,0} , FeatureHashes),
      << <<(is_hash_valid(B, (Len / 2))):1>> || 
         <<B:(?HASH_ACCU)>> <= <<HashAcc:(?HASH_WIDTH * ?HASH_ACCU)>> >>.
     
    accumulate_simhash(Hash, {Accum, L}) ->
      <<A:(?HASH_WIDTH * ?HASH_ACCU)>> = 
              << <<B:(?HASH_ACCU)>> || <<B:1>> <= << Hash:(?HASH_WIDTH) >> >>,
      {Accum + A, L + 1}.
     
    is_hash_valid(E, Len) ->
      case (E > Len) of 
        true -> 1; 
        _ -> 0 
      end.

    The beauty of this is in the accumulate_simhash function. Here I expand each bit of the feature hash into an arbitrary-width representation and just add it to the accumulator. Depending on the number of features, change HASH_ACCU, which is the number of bits allotted to each bit of the accumulator. In the final binary comprehension of calculate_simhash, the resultant hash is computed by reducing each HASH_ACCU-bit slot back to a 1 or 0, based on whether more than half the feature hashes had a 1 there.
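    The same trick reads easily outside Erlang too. In this Python sketch (widths shrunk to the 5-bit worked example above, not the module's 128-bit hashes) each bit gets its own 12-bit slot, so one integer addition per feature updates every per-bit counter at once:

```python
ACCU, WIDTH = 12, 5  # 12 accumulator bits per hash bit, 5-bit hashes

def expand(h):
    # spread each bit of h into its own ACCU-bit slot
    out = 0
    for i in range(WIDTH - 1, -1, -1):
        out = (out << ACCU) | ((h >> i) & 1)
    return out

# one integer addition per feature sums every bit position at once
acc = sum(expand(h) for h in [0b10101, 0b11001, 0b11000, 0b01100, 0b01000])
counts = [(acc >> (ACCU * i)) & ((1 << ACCU) - 1)
          for i in range(WIDTH - 1, -1, -1)]
# counts == [3, 4, 2, 0, 2]: how many 1s were seen at each bit position
```

    Comparing each count against half the number of features (here 2.5) gives the final bits 1 1 0 0 0, the same simhash as the worked example.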

    Is that not beautiful?

  • Tweet and share your location

    Posted on April 25th, 2009 jebu 5 comments

    Last week I had put up MyBooth, where you can go to figure out your polling booth. Essentially this is a service built on top of Twitter messages. You send out a tweet with the GPS coordinates and tags for a location to @tweetaloc, in the following format:

    @tweetaloc <message> (:<location tag>:)* L:<lat>,<lon>:

    The lat, lon pair is reverse-geocoded via geonames.org to figure out the country code, state code and nearest place name; all of these, plus the tags that you provided, are tagged on the location and available for search on MyBooth. Currently the interface does not allow you to search in a country other than India; I will fix that up shortly.
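    For illustration, a parser for that message format might look like the following Python sketch. The real service is Perl; this regex is my reading of the grammar above, not the production code:

```python
import re

# @tweetaloc <message> (:<tag>:)* L:<lat>,<lon>:
PATTERN = re.compile(
    r"@tweetaloc\s+(?P<msg>.*?)"
    r"(?P<tags>(?:\s*\(:[^:]+:\))*)"
    r"\s*L:(?P<lat>-?[\d.]+),(?P<lon>-?[\d.]+):")

def parse(tweet):
    m = PATTERN.search(tweet)
    if not m:
        return None
    tags = re.findall(r"\(:([^:]+):\)", m.group("tags"))
    return m.group("msg").strip(), tags, float(m.group("lat")), float(m.group("lon"))

parse("@tweetaloc my polling booth (:booth:) (:election:) L:12.97,77.59:")
# → ('my polling booth', ['booth', 'election'], 12.97, 77.59)
```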

    Behind the scenes is some old-fashioned Perl and PHP, with Amazon SimpleDB for data storage. It uses Net::Twitter and Geo::GeoNames to poll Twitter for messages and to look up GeoNames for the location data. A simple PHP page pulls the data out and feeds it to the Google Maps API as a GeoRSS feed. That's it, nothing much there.

    The itch behind this was the inability to find some place which gave out the exact location of where I could go and cast my vote. Addresses in India are ad hoc, and locating a place with just the address in hand is not easy. Online map services cannot figure out the address for this reason, and the only thing that can work here is crowdsourcing, like what wikimapia.org allows people to do. But entering things into wikimapia while out on the road is not practical :) So that's where this thing fits in: use any Twitter client from your GPS-enabled device, send a tweet out, and this takes care of the rest.

    Couple of things that I might add in to this

    • search by Twitter id and tags, so you can search for @jebu's office
    • search for tags around a given location, for things like polling booths within 5 km of my current location
    • query via Twitter


  • Erlang talking to apache via AJP mod_jk

    Posted on February 20th, 2009 jebu 6 comments

    Where have I been all this time since Onam? It's been 5 months since I wrote something; things have been busy at work and I have been exploring a new language, Erlang.

    I was first exposed to Erlang via blog posts popping up on Hacker News and Lambda the Ultimate. Lots of good things were said there, and I have had a liking for recursive functions since my Pascal days. So I did a deep dive, got the Erlang bible by Joe Armstrong on my London trip, and the book was a revelation. The syntax was weird at first, but I have begun to like it. The pattern-matching function syntax is just the thing for writing recursive functions. And yeah, bit manipulation with pattern matching is absolute heaven if you want to do a protocol implementation.

    Coupling Erlang to a web interface is essential, and I was hacking around with the SCGI adapter found in Disco and lighttpd. But then I needed a plugin for Apache, and SCGI was not an option. The other options were CGI, FCGI and AJP. Of these, AJP was the best fit, since mod_jk is pretty widely used for coupling servlet containers. I could not find an official spec for AJP, and Apache's documentation was the best available. The protocol is pretty straightforward, and getting it working from an Erlang implementation was quick. I have a working version up and available on github. This is the first time I'm putting something out in the public domain, so forgive any obvious mistakes, and feel free to point them out or jump in and fix things.
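    For a feel of why it was quick: AJP13 framing is tiny. This Python fragment sketches the web-server-side packet wrapper as I read the Apache docs; the 0x12 0x34 magic and the CPing prefix code are from those docs, treat the rest as illustrative:

```python
import struct

def ajp_packet(payload):
    # server -> container packets: 0x12 0x34 magic, 2-byte length, payload
    return struct.pack(">HH", 0x1234, len(payload)) + payload

CPING = b"\x0a"  # prefix code 10: ask the container for a CPong reply
packet = ajp_packet(CPING)
# packet is the 5-byte ping: 12 34 00 01 0a
```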

    There is a lot of scope for improvement in there; this is just a working skeleton to serve requests from Erlang. I have tried to give it a bit of order by having a behaviour defined for any module to plug in. Have your module conform to the gen_ajp_handler behaviour and call back to the gen_ajp_handler methods send_headers, send_data, request_data and end_request for the obvious functionality. The dispatch mechanism is not yet filled out. Watch out here and on github for updates.

  • Happy Onam!

    Posted on September 12th, 2008 jebu 2 comments


    DSC_0042-small, originally uploaded by n i v e a d.

    Happy Onam folks.

  • Fireeagle location updater

    Posted on September 2nd, 2008 jebu 8 comments

    I have been running it for a while now and things look pretty good. The Python script which updates my FireEagle location does a couple of things:

    • Watches the cell tower that the phone is connected to, queries the Google cell mapping web service (using the script detailed in the previous post) and caches the location information of the cell tower in a local db on the phone
    • Optionally uses GPS to get the current location
    • If the current location, determined by the above positioning methods with preference to GPS, is more than a configurable distance away from my last updated location, updates FireEagle
    • Optionally, when doing a cell tower lookup, also looks up geonames.org for the nearest neighborhood, to give a meaningful place name to the cell location
    • Oh yeah, of course, does a mobile auth with FireEagle the first time and stores the authorization key for future access
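    The distance check in the third point is the interesting bit. A sketch of it in modern Python (the script itself is Python 2.2 for S60, and `should_update` is a made-up name for illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two lat/lon points, in kilometres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def should_update(last, current, threshold_km=1.0):
    # only bother FireEagle when we have actually moved
    return haversine_km(*last, *current) > threshold_km
```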

    Why one more updater? Navizon does a good job, but I wanted it to do the distance-based update: if I am within a certain radius, say 1 km, of my last location, I really am not bothered about it getting reflected on FireEagle. Making it configurable tunes it to the user's taste. Also, in an urban area, cell towers keep hopping even though I have not moved; don't bother updating in those cases. I'm not sure if Navizon caches the cell lookup information locally, but I feel it's one piece that really should be cached. Why waste my mobile bandwidth doing the same cell lookup calls? Persisting this in a DB is really efficient. And last, I would really like to have my location in one place.

    The J2ME updater is a perfectly simple GPS updater, but GPS is too power hungry and useless indoors. I really don't want to always keep my phone such that GPS signals are available (you know, the belt pouch). For this one too, a distance-based decision on whether to update FireEagle would be great in addition to the time-based one.

    So yeah, I wanted a mix of the two mentioned above, and with Python on S60 what better way to get it done than writing my own. The pain point in getting this working was porting the existing OAuth and FireEagle libraries to Python 2.2. The FireEagle libraries I just switched to talk JSON, using the python-json module; doing XML DOM parsing was really useless. The OAuth libraries needed some tinkering, but it was manageable. The dependencies were the real problem; most of them were resolved by installing the mobile web server for S60, which has a bunch of Python libraries like the cgi stuff.

    After all this rambling, where is the stuff? Well, I have not bothered to package it into a standalone sis. If anyone is interested, let me know and I will put the bunch of scripts out with, hopefully, some sensible instructions to get them running. Download the source here: GTower FireEagle Updater

  • Google cell tower mapping with Python on S60

    Posted on July 11th, 2008 jebu 67 comments

    I have had my N95 for some days now. I'll just say that the device is all I could ask for in a smartphone, for now. iPhone 3G? Yeah, let's just say my take on it is clear from my choice of the N95. The best part of this is Python for S60 and the location APIs available via Python.

    I came across this piece of beauty, which uses a hidden Google API for location detection based on cell tower information. Now what better way to start learning Python than getting this working on my phone? I'm fed up with writing Hello Worlds to start learning a language. Here is a Python version of the poor man's GPS.

    from httplib import HTTP
    import location
     
    latitude = 0
    longitude = 0
     
    def doLookup(cellId, lac, host = "www.google.com", port = 80):
      from string import replace
      from struct import unpack
      page = "/glm/mmap"
      http = HTTP(host, port)
  result = None
  errorCode = 0
  # default to 0, 0 so we return something sane when the request fails
  latitude = longitude = 0
     
      content_type, body = encode_request(cellId, lac)
      http.putrequest('POST', page)
      http.putheader('Content-Type', content_type)
      http.putheader('Content-Length', str(len(body)))
      http.endheaders()
      http.send(body)
      errcode, errmsg, headers = http.getreply()
      result = http.file.read()
  # the response is a binary blob; on success unpack the fields we need
  # (opcode, a flag byte, an error code, then lat/lon as integer microdegrees)
      if (errcode == 200):
        (a, b,errorCode, latitude, longitude, c, d, e) = unpack(">hBiiiiih",result)
        latitude = latitude / 1000000.0
        longitude = longitude / 1000000.0
      return latitude, longitude
     
    def encode_request(cellId, lac):
      from struct import pack
      content_type = 'application/binary'
      body = pack('>hqh2sh13sh5sh3sBiiihiiiiii', 21, 0, 2, 'in', 13, "Nokia N95 8Gb", 5,"1.3.1", 3, "Web", 27, 0, 0, 3, 0, cellId, lac, 0, 0, 0, 0)
      return content_type, body
     
    (mcc, mnc, lac, cellId) = location.gsm_location()
    (latitude, longitude) = doLookup(cellId, lac, "www.google.com", 80)
    print latitude
    print longitude

    Download
    Beware: this does not work from all IPs. From my wifi connection at home it threw an error, while on the GPRS connection it worked well.