Category Archives: Projects

Opening social networks/graphs up to Researcher Collaboration Tools: a UCSF/Harvard Profiles collaboration project

For the past few months I have been consulting part time with UCSF and the Department of Clinical and Translational Sciences. You might think that has something to do with my browser plugin Babelfin, but Translational Science has nothing to do with language learning; it is about taking exciting patterns from other fields and translating the processes between them. In this case, UCSF is focusing on how social media and social networking can be used in an academic setting for collaboration and messaging rather than games, photo sharing, or virtual resumes.

The UCSF OpenSocial project (http://code.google.com/p/ucsf-opensocial-shindig-apps/) started as a Harvard project called Catalyst PROFILES (http://connects.catalyst.harvard.edu/profiles/about/opensource). Profiles (as we call it) is a simple social networking server that manages the graph relationships between colleagues, co-authors, and research interests. Profiles looks at relationships differently than Facebook, LinkedIn, or even MySpace, but it is pretty bare bones and limited in what it can do.

The innovative part comes in where UCSF thought it would be neat to extend Profiles without altering its code. So Eric Meeks at UCSF bolted an OpenSocial container named Shindig onto Harvard's Profiles project, which allows external apps to run on top of Profiles. This makes for an interesting mix of code, as Profiles is a Microsoft C# ASPX project, and Shindig comes in PHP and Java flavors. Eric rightly chose the Java flavor of Shindig, as it is the most current.

This is where I come in: I am building the applications that run on the Shindig server and access the Profiles social graph. In many ways it is just like building an application that runs on LinkedIn, Bebo, or MySpace; there is no friend graph, but there are three other graphs I can use: the co-author, colleague, and interest graphs.
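
To give a flavor of that, here is a rough sketch (not the project's actual code) of how a gadget might ask the container for one of those graphs through the standard OpenSocial JavaScript API. The group ID 'colleagues' is an assumption, standing in for whatever IDs the Profiles container actually exposes.

// Hypothetical sketch: request the owner's colleague graph from the
// OpenSocial container. The group ID 'colleagues' is a placeholder.
osapi.people.get({ userId: '@owner', groupId: 'colleagues' }).execute(
    function (response) {
        if (response.error) {
            console.log('colleague graph unavailable: ' + response.error.message);
            return;
        }
        // response.list holds one person object per colleague
        response.list.forEach(function (person) {
            console.log(person.displayName);
        });
    }
);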

Initially we are keeping it simple, but we plan to extend OpenSocial in a standard way so that other universities and research institutions can apply OpenSocial to their own graph servers. UCSF and Harvard hope their work will make it easier to use Profiles as the graph server, but both are very excited about creating an open platform that can grow a rich ecosystem of applications, extending their work and able to run on other platforms with small tweaks. In the end, we want researchers to be able to collaborate better using social tools.

HTML5 Storage: localStorage, Web SQL Database, IndexedDB.

For Babelfin, a browser plugin, I have been looking for ways to elegantly store data in the browser for each translation of a given word or short phrase. The goal is a data pattern that ports easily to the two most popular extensible browsers: Chrome and Firefox. I decided to start with Chrome during Startup Weekend because it has a newer extension model and has simplified a lot of how extensions work. Firefox still uses XUL, which is an odd combination of XML and JavaScript, whereas Chrome uses native HTML and JavaScript with a few additional libraries.

So for Babelfin, the goal is to keep a list of phrases that one would want to learn, or that the user might enter, and have the plugin pre-fetch the translations and store them in the language the user wants to learn. If the phrase list is short, a key-value store like localStorage makes sense: almost every browser supports it, including IE8, and even though there are small differences, the same data model and pattern would work everywhere (see the sketch below). However, anything beyond the minimal product will need a larger data set for more advanced users, and as a result it would be nice to have a query-able data set.
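
Here is a minimal sketch of that localStorage pattern; the function names are hypothetical, not Babelfin's actual code. Each translation is stored under a "lang:phrase" key.

// Minimal sketch: cache each translation under a "lang:phrase" key.
// Function names are hypothetical.
function cacheTranslation(lang, phrase, translation) {
    localStorage.setItem(lang + ':' + phrase, translation);
}

function getTranslation(lang, phrase) {
    return localStorage.getItem(lang + ':' + phrase); // null on a miss
}

cacheTranslation('es', 'good morning', 'buenos dias');
console.log(getTranslation('es', 'good morning')); // "buenos dias"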

Google has tried to solve this problem twice: once with Google Gears, and again with Web SQL Database, which works in Chrome, Opera, and Safari but not in Firefox or IE. Microsoft and Mozilla do not want to add SQL as a language to browser-side logic, and I happen to agree.

The result is IndexedDB, a JavaScript data storage model for the front end. IndexedDB is not quite CouchDB or MongoDB, but rather a JavaScript-level abstraction over stored data. Several Mozilla folks seem to think that CouchDB could be built on top of it, and BrowserCouch is trying to do just that. But where is the BrowserMongo? (Maybe we can call it Mango, as a Norwegian told me that Mongo means retard in Norwegian, oops.)

So in any case, there is still a gap for browser-side storage that is query-able and indexed. I suppose something could be built on top of localStorage, but it would be a hack until the browsers fully support it.

I am trying to find out how much work it would be to put a stub wrapper on top of WebKit's Web SQL Database layer and have it look like IndexedDB, since that seems to be what Google will eventually support once they figure out how to thread it. It would be nice to have something now; a rough sketch of the shape of such a wrapper follows. 😉
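
This is only a sketch of the idea, not the real IndexedDB interface: a tiny get/put object store backed by Web SQL. Note that openDatabase only exists in WebKit-based browsers and Opera, and the store name here is made up.

// Rough sketch: an object-store-ish get/put API over Web SQL.
// Not the real IndexedDB interface, just a stand-in shaped like one.
var db = openDatabase('stub-store', '1.0', 'key-value stub', 1024 * 1024);
db.transaction(function (tx) {
    tx.executeSql('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)');
});

function put(key, value) {
    db.transaction(function (tx) {
        tx.executeSql('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)',
            [key, value]);
    });
}

function get(key, callback) {
    db.readTransaction(function (tx) {
        tx.executeSql('SELECT v FROM kv WHERE k = ?', [key],
            function (tx, result) {
                callback(result.rows.length ? result.rows.item(0).v : null);
            });
    });
}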

A few XPath links to make YQL and HTML scraping easier.

quick and basic
http://www.w3schools.com/XPath/xpath_syntax.asp

great resource for detailed string functions
http://oreilly.com/catalog/xmlnut/chapter/ch09.html

quick list of all of the functions in xpath
http://www.w3schools.com/xpath/xpath_functions.asp

great example of more sophisticated function selectors
http://www.eggheadcafe.com/articles/20030627d.asp

multiple attribute
http://www.coderanch.com/t/128329/XML/XPATH-selection-based-multiple-attribute

fun with concat
http://blogs.sun.com/rajeshthekkadath/entry/xpath_searching_for_a_text

http://www.xml.com/pub/a/2002/08/14/xpath_tips.html?page=3
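
To show how a couple of these patterns combine in practice, here is a small snippet you can run in a browser console using the built-in XPath engine (document.evaluate). The selector itself is just an illustration: a multiple-attribute predicate plus concat().

// Illustration only: grab the first link that has both an href and a
// title, and glue the two attributes together with concat().
var result = document.evaluate(
    'concat(//a[@href and @title][1]/@title, " -> ", //a[@href and @title][1]/@href)',
    document, null, XPathResult.STRING_TYPE, null);
console.log(result.stringValue);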

Large Binary Numbers in JavaScript

Have you ever wondered what the largest number is?  At one point it was a googolplex, but unless you write a custom class to deal with really large numbers, your dev environment will never count to a googolplex.

For http://loc.is I am reworking a geohash encoding algorithm today, and I wanted to see what JavaScript could handle. Strongly typed languages like Java have known upper limits, but languages like JavaScript are a bit more mysterious, so the only way to know is to find out. I wrote this loop to test it and ran it in the Firebug console on a random webpage (Firebug won't work unless the DOM is ready).

for (var i = 0; i < 500; i++) {
    var value = Math.pow(2, i) + 1;
    console.log(i + "\n" +
        value + "\n" +
        value.toString(2));
}

The output eventually looks like this, and reveals that 2^52 is the largest power of two you can add 1 to and still get the exact integer back. Any larger and the value exceeds the precision of JavaScript's double-precision floats, and there are no longer enough bits to represent every integer. I wonder if this is the same in all browser/OS combination pairs?

51
2251799813685249
1000000000000000000000000000000000000000000000000001

52
4503599627370497
10000000000000000000000000000000000000000000000000001

53
9007199254740992
100000000000000000000000000000000000000000000000000000

54
18014398509481984
1000000000000000000000000000000000000000000000000000000

So in this circumstance, JavaScript has 53 bits of integer accuracy (the binary form of 2^52 + 1 above is 53 digits long), a few short of a 64-bit long int, let alone 80-bit extended precision.
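
A quick pair of one-liners makes the cutoff visible: at 2^53 the gap between adjacent doubles reaches 2, so adding 1 rounds back to the same value.

console.log(Math.pow(2, 52) + 1 === Math.pow(2, 52)); // false: still exact
console.log(Math.pow(2, 53) + 1 === Math.pow(2, 53)); // true: the 1 is lost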


Test it yourself, and let me know.

New jEdit Syntax Edit Modes for Perl+HTML & XML+CDATA+HTML

I love my syntax highlighting in jEdit, and why shouldn't I? Good syntax highlighting helps you catch code errors early. About two years ago, I created a custom jEdit "Edit Mode" so that HTML buried in your XML CDATA sections would be syntax highlighted. Here is the original post on Google Groups (http://groups.google.com/group/opensocial-api/browse_thread/thread/12e250246ab64054/343c858695e8eb12).

Here is the custom Edit Mode for XML->CDATA->HTML: c-xml.xml

It uses a simple trick to identify CDATA that contains HTML: a special marker comment like the one below.

<![CDATA[ <!--HTML-->
   <html>
       <!-- friendly neighborhood web page -->
   </html>
]]>

The Perl version is here: perl-html.xml

For the Perl version, I use the structure of Perl's heredocs with the following Java-compatible regex:

<<\p{Space}*(['"])([\p{Space}\p{Alnum}]*)\1;?\s*<!--.*HTML.*-->

This will syntax-highlight any Perl heredoc as HTML if it has <!-- HTML --> in it, so the example below should highlight. There is, however, one bug: if your heredoc is not well formed, meaning you have an unclosed tag, CSS block, or JavaScript call, then the syntax highlighting gets stuck. Maybe I will figure out how to make this work for non-well-formed HTML. Any ideas?

<<"EOT"; <!-- HTML -->
    <html>
        <!-- friendly neighborhood web page -->
    </html>
EOT

Improving PHPBB2 and MySQL performance

For the last few days I have been trying to track down why Unity's PHPBB2 forum (http://forum.unity3d.com) was so slow. Page rendering times were running between 2 and 10 seconds, and for me this was just unacceptable.

A proper website should be able to render the majority of a page in less than 250ms, and delivering most of the content in less than 50ms is ideal. Sites like Amazon, Yahoo, and Google have studied the effects of response time vs. features and have found that response time is often more important. Greg Linden pulls together a few sources on the topic in his blog.

So, 10 seconds was just unacceptable, and I was bound to find a solution. I started with the low-hanging fruit and installed PHP's APC cache, an opcode cache that stores compiled versions of the PHP code to reuse on the next request. There is evidence out there to suggest that opcode caches not only reduce CPU requirements but also reduce your memory load. I saw a few sites that claim about a 3-4x performance increase on the CPU and about a 25% reduction in memory usage.

The next step, as the forums were still slow, was to start looking into our MySQL usage. Did the server have enough memory? Were the MySQL caches and buffers large enough, or were searches and queries getting pruned to make room for more queries? After using MySQL's GUI tools through an SSH tunnel, I was able to see that the server had 64MB of query cache and only 90% of it was being used, so the caches were fine.

Next was to look for slow queries. A slow query or two would be hard to fix, considering PHPBB builds its queries as part of the application and they should have been tested in advance. If a slow query was the problem, it probably was not part of the application design, but rather an indication of something else going wrong. We did notice a few slow queries, but nothing ridiculous. What we did notice was that a number of queries were running slowly because other ones were holding a lock for a long period. Hmmm?

Next we started looking at the problem from a system level using apps like 'top', 'dstat', 'mytop', etc. From there we found that disk IO was working far harder than it should. Hmmm… but why?

I later found out that we were serving large files in our forum. It did not seem like our forum was being used as a hot-linking service, and I would suspect that PHPBB has some defense against that, but going forward this is a concern of mine.

I finally found a post on the web about MySQL CPU spikes that turned out to be related to a corrupted table index: http://forums.fedoraforum.org/showthread.php?t=232008

After reading the forum for a bit, I thought there seemed to be a few low-risk commands I could run to look into our table integrity. Here is the full command list: http://dev.mysql.com/doc/refman/5.0/en/mysqlcheck.html

First I ran:

mysqlcheck -A -F -u [username] -p

This reported several "table was not closed properly" errors and took about 2-6 minutes to run without taking down the forums. The -F (--fast) flag only checks tables that have not been closed properly, which keeps the run quick.

Next I ran:

mysqlcheck -A -q -r -u [username] -p

This is basically the fastest repair option (-q for quick, -r for repair) and will only fix minor corruption. It took about 5-10 minutes to run.

The result was that our MySQL CPU usage dropped from 100-300% to 3-6%, with page response times coming in around 200-500ms. I have a few other things I would like to try, from using our CDN with the forums, to DB optimizations, to other server upgrades.

Over the weekend, I think I am going to run:

mysqlcheck --all-databases --optimize -u [username] -p

This should be more thorough, but it might take an hour. It will defragment the tables and resolve some indexing errors.

Learning Linux Shell Scripting and SVN

Since this took me a little longer than I would have liked to find and improve, I thought I would post a little nugget about Linux shell scripting and SVN that I learned today.

You can use the following line to add multiple files to your repository:

# list unversioned files ("?" status), take the second field as the
# filename, and add each one to the repository
svn st | grep "^?" | awk '{ print $2 }' | while read f;
    do svn add "$f";
    done

However, the solution above from Array Studios for adding multiple files to a Subversion repository fails when you have filenames that contain spaces. Normally, filenames with spaces are not ideal, but I had a few files with spaces and thought there must be a solution. With a little digging, I found the AWK Manual very helpful. Below is an updated solution that should work with filenames that have spaces.

# blank out awk's first field (the "?" status flag) and print the rest,
# so filenames containing spaces survive; read then trims the leading space
svn st | grep "^?" | awk '{ $1=""; print; }' | while read f;
    do svn add "$f";
    done

About Loc.is

Loc.is is a short URL service that creates location-aware hashes. The initial prototype of Loc.is was created during the fall 2009 LA Startup Weekend. The idea was pitched by Justin Kruger and built with the help of Alexis Eller and Andrew….

Loc.is uses geolinks to represent geographic areas on a map. Each geolink may be accurate enough to describe something like a region, a city, or a street address. We can even define a specific GPS coordinate with about 12 characters that you can share in your tweets, on your blog, or in your text messages to friends. Geolinks are especially neat because you can vary the precision of the defined area just by removing characters from the end of the link. Because of this special attribute, geohashes are easily compared and hierarchically grouped, so a computer can find all of the geohashes within a city just by comparing strings; no calculation is necessary.

The geohash algorithm has a few other improvements over the one found at http://geohash.org. For one, our hashes dedicate the first character to defining the longitudinal region. We use 1 of 64 possible characters in each character position, breaking the earth into 64 vertical regions, and we even put the first region at the international date line to make time calculations easier. The idea is that if your geohash were only one character long, or you or your computer only looked at the first character, you could roughly approximate which time zone the remaining characters are in. Using the first and second characters, you should be able to define a region about the size of Kansas; together, the first two characters define 4,096 regions on earth (64×64). The remaining characters work a bit differently, more like a traditional geohash: each character describes, with increasing precision, an area inside one of those 4,096 regions. Five or six characters define roughly the size of a city, and 12 characters the area of a laptop computer.
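
To make the prefix property concrete, here is a hypothetical illustration; the hash value is made up, and this is not the actual Loc.is encoder.

// Made-up hash value; the point is the prefix behavior, not the encoding.
var laptopHash = 'h7QxPzKm3fTa';          // ~12 chars: a laptop-sized area
var cityHash = laptopHash.slice(0, 6);    // ~6 chars: roughly a city
var regionHash = laptopHash.slice(0, 2);  // 2 chars: one of the 4,096 regions

// Containment is just a string prefix check; no geometry required.
function contains(area, point) {
    return point.indexOf(area) === 0;
}

console.log(contains(cityHash, laptopHash)); // true
console.log(contains(regionHash, cityHash)); // true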

Because geolinks are both a geohash and a hyperlink, we can collect interesting stats on who is visiting a given region, and we can crawl the web and Twitter to see what people are saying about a given region. Think of it as a sort of PageRank for location on the web. And because it is a hierarchical data format, any shorter URL can include the information from all of the locations that it contains.

We plan to have a lot of fun taking the site further. How would you want to use the site? Please give us feedback through the UserVoice link on the left-hand side; we would love to add interesting features.

Validation: Sometimes as an entrepreneur you think you are crazy.

As an entrepreneur you are constantly coming up with ideas, some good, some bad, and you are always looking for someone who will buy your story. So sometimes, when a first customer, an investor, your parents, or even a competitor just gets what you are up to, you feel set free, validated, and even empowered. Now, no entrepreneur likes to lose, but sometimes just knowing that someone else is thinking what you are thinking validates you and your harebrained ideas, and sometimes that feels good.

So, when I saw that Eventbrite had revenue of $100 million in what they thought was a $35 billion marketplace, it kind of gave me goose bumps.

Over two years ago I started pitching a company called "Social Helix," and then as now, I was excited about the idea of connecting people through events. I always felt that profiles on MySpace and Facebook were a bit lifeless compared to the real thing. I mean, the people we care about, really care about, we meet in person, don't we? And when I realized that, I started to get really excited about changing the way events work.

Today, I am working on a few other ideas, some with potential investors, and others that are just engineering projects with a possible revenue upside, like http://loc.is. SocialHelix got me in the door, but we never got the money we needed to make it happen.

Sometimes as an entrepreneur you think you are crazy, not because other people tell you so, but because you really wonder if your bet is right.

Today I received some validation that I am not crazy when Eventbrite announced revenue. Eventbrite claims revenue of $100 million (some of which goes to event curators) and places their market size at $35 billion. $35 billion is an interesting number, because that is what we predicted with SocialHelix nearly two years ago.

The event space is huge, and Ticketmaster plays a role in only a small part of it. It's the classic problem of the long tail vs. the short tail. Ticketmaster went after some of the world's largest venues and secured deals with those locations. Large-venue ticket sales, which are usually divided into primary and secondary sales, total about $1.5 billion and $6 billion respectively, while movie ticket sales are around $10 billion a year.

Sites like Eventbrite and SocialHelix would rather target long-tail sales, which are usually personally or organizationally curated instead of venue curated. That industry has several interesting segments, and this market is huge, so huge that most people don't think about it. Groups and meetings, which include corporate events, business meetings, and