Workin Hard and Making Progress

Sorry I didn’t post much today. I pulled an all-nighter last night working on Web-mining algorithms, and today we had back-to-back meetings all day.

I just came back from a really good product team meeting, facilitated by Chris Jones, on our product messaging. It’s really getting simple, direct, clear and tangible. Very positive. It all makes sense.

It’s pretty exciting around here these days: a lot of pieces we have been working on for months and even years are falling into place, and there’s a whole-is-greater-than-the-sum-of-its-parts effect kicking in. The vision is starting to become real. We really are making a new dimension of the Web, and it’s not just an idea; it’s something that actually works, and we’re playing with it in the lab. It’s visual, tangible, and useful.

Another cool thing today was a presentation by Peter Royal about the work he and Bob McWhirter have done architecting our distributed grid. For those of you who don’t know, part of our system is a homegrown distributed grid server architecture for massive-scale semantic search. It’s not the end product, but it’s something we need for our product. It’s kind of our equivalent of Google’s backend, only semantically aware. Like Google, our distributed server architecture is designed to scale efficiently to large numbers of nodes and huge query loads. What’s hard, and what’s new about what we have done, is that we’ve accomplished this for much more complex data than the simple flat files that Google indexes. In a way you could say that what this enables is the database equivalent of what Google has done for files. All of us in the presentation were struck by how elegantly designed the architecture is.

I couldn’t help grinning a few times in the meeting because there is just so much technology there. I’m really impressed by what the team has built. This is deep tech at its best. And it’s pretty cool that a small company like ours can actually build the kind of system that can hold its own against the backends of the major players out there. We’re talking hundreds of thousands of lines of Java code.

It’s really impressive to see how much my team has built. It just goes to show that a small team of really brilliant engineers can run circles around much larger teams.

And to think, just a few years ago there were only three of us with nothing but a dream.

0 thoughts on “Workin Hard and Making Progress”

  1. A few weeks ago, I blogged about how little confidence I had in centralized approaches to semantic web database building. Giovanni Tummerello wrote a great paper on the subject, and let me tell you, it’s one challenging undertaking. The main challenge facing any centralized approach is what’s known as the computational burden problem:

    “On the WWW, the interaction is based on HTTP requests/replies that in the great majority of the cases will be of limited impact on the server (e.g. serving a file). This means that, disregarding anomalous cases, both the computational resources and network traffic required by a HTTP request are bounded. On the contrary, “requests” on the semantic web are naturally expressed in query languages and, given the graph nature of RDF structured information, the complexity of execution is not bounded a priori as it is a function of the query type as well as the quantity and the structure of the data. In other words, whoever would decide to offer the ability to answer “arbitrary questions” on a SW, would easily open himself to “denial of service” situations even in the ideal, good faith usage.”

    Creating a centralized database that solves the computational burden problem is one of the holy grails of the semantic web. My hat goes off to you and your team for tackling and solving this problem. I always predicted that P2P networks were the only feasible solution. Giovanni’s approach is to periodically synchronize each peer’s database, but only from within small peer groups, and once the data has been downloaded the query is sent to the local database, thus limiting the “damage” to the user’s local resources. The obvious drawback is that no one peer has 100% visibility across the entire distributed database. So if the answer to a particular SPARQL query happens to exist in triples across separate peers, and I haven’t synced with each of those peers or I’m not in those peers’ groups, then I’m just up the creek. The ideal repository would be centralized and accept SPARQL with the speed and scalability of Google, which (correct me if I’m wrong) sounds to me like what you guys have achieved. Again, my jaw has dropped. This will have serious ramifications for my work with Cypher, as my major Achilles’ heel is the lack of a centralized repository of shared lexical descriptions (in RDF) collected from across the semantic web. If your service/framework could crawl, collect and, most importantly, “cook” RDF lexical descriptions (the last item is what’s lacking in current services like Swoogle), and if it can serve Cypher results for arbitrary SPARQL queries against the metadata of lexical entries, then you’ve just sped up natural language processing for the Semantic Web by about 5 years!
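To make the visibility drawback concrete, here is a toy sketch in Python. It is not the code from Giovanni's paper or from any real system: the `Peer` class, the group names, the sample triples and the `employer_of_author` helper are all made up for illustration, and simple pattern matching stands in for SPARQL. It only shows the shape of the problem: a peer that syncs within its own group answers queries cheaply from local data, but a join whose triples live in another group comes back empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Peer:
    """Hypothetical peer holding a local triple store and a group label."""
    name: str
    group: str
    triples: Set[Triple] = field(default_factory=set)

    def sync_with(self, other: "Peer") -> None:
        # Copy the other peer's triples only when it belongs to the same group.
        if other.group == self.group:
            self.triples |= other.triples

    def query(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        # Match one triple pattern against the LOCAL store only (None = wildcard).
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


def employer_of_author(peer: Peer, paper: str) -> List[str]:
    # Two-pattern join, evaluated locally: paper -> author -> employer.
    employers = []
    for _, _, author in peer.query(subject=paper, predicate="author"):
        for _, _, org in peer.query(subject=author, predicate="worksFor"):
            employers.append(org)
    return employers


# The two triples needed for the join live in different peer groups.
me = Peer("me", "group-a")
neighbor = Peer("neighbor", "group-a", {("paper1", "author", "author-x")})
stranger = Peer("stranger", "group-b", {("author-x", "worksFor", "example-org")})

me.sync_with(neighbor)   # same group: triples are copied
me.sync_with(stranger)   # different group: nothing is copied

local_answer = employer_of_author(me, "paper1")

# A store that has seen every triple (the "ideal" centralized case) can answer.
central = Peer("central", "all", neighbor.triples | stranger.triples)
central_answer = employer_of_author(central, "paper1")

print(local_answer, central_answer)
```

In this sketch the local peer has the `author` triple but never sees the `worksFor` triple, so its join finds nothing, while the store holding both triples answers the same question. The cost of every query stays on the machine that asks it, which is the whole point of the design, and the price is exactly this partial visibility.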