
{ Category Archives } Uncategorized

Build a distributed realtime tweet search system in no time. Part 2/2

Chirper is all about realtime indexing, and we wanted to highlight that on the frontend as much as possible: the search box performs an instant search as you type, and the page shows the number of tweets as they are indexed live on the system. It was important to keep the frontend as [...]

Build a distributed realtime tweet search system in no time. Part 1/2

Last Friday was an InDay (a day at LinkedIn when you can take a break from your day-to-day work and build something cool). As part of the SNA team, we have been building some really cool distributed systems, from storage to messaging to search. So we thought it’d be cool on this InDay to [...]

Azkaban Flow UI

One of the most important aspects of a production offline processing system is a workflow scheduler. A workflow scheduler allows you to string together a group of processes to run in an order that respects the dependencies between the jobs (i.e., one process’s output is another’s input). At LinkedIn we run a number of very [...]
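To make the idea of dependency-respecting order concrete, here is a minimal sketch in Python. It is not Azkaban's code or API: the function name `run_order`, the job names, and the dictionary format are all hypothetical, and the ordering technique is a plain topological sort (Kahn's algorithm).

```python
from collections import deque


def run_order(dependencies):
    """Return job names so that every job comes after the jobs it depends on.

    `dependencies` maps a job name to the list of jobs it depends on.
    This format is an assumption made for the sketch.
    """
    # Collect every job mentioned, either as a key or as a dependency.
    jobs = set(dependencies)
    for deps in dependencies.values():
        jobs.update(deps)

    # Count unmet dependencies per job and record who is waiting on whom.
    remaining = {job: len(set(dependencies.get(job, []))) for job in jobs}
    dependents = {job: [] for job in jobs}
    for job, deps in dependencies.items():
        for dep in set(deps):
            dependents[dep].append(job)

    # Start with jobs that depend on nothing; sorting keeps the result stable.
    ready = deque(sorted(job for job in jobs if remaining[job] == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for waiting in sorted(dependents[job]):
            remaining[waiting] -= 1
            if remaining[waiting] == 0:
                ready.append(waiting)

    # Any job never released is part of a dependency cycle.
    if len(order) != len(jobs):
        raise ValueError("dependency cycle detected")
    return order


# Hypothetical three-step flow: load needs transform, transform needs extract.
flow = {"load": ["transform"], "transform": ["extract"], "extract": []}
print(run_order(flow))
```

A real scheduler would also have to handle retries, failures, and parallel execution, none of which this sketch attempts.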


Optimizing TCP Socket Across Data Centers

Recently, I had an opportunity to work on machines across different data centers (DCs). The task was simple: we wanted to replicate data stored in Kafka, a messaging system developed at LinkedIn, from one DC to another. We measured the transfer throughput, and it was extremely low. Even though there is a 1Gb link between [...]

LinkedIn Signal – a look under the hood

On September 29th, we unveiled LinkedIn Signal, a social search application for LinkedIn shares and tweets from LinkedIn-Twitter bound accounts. Let’s take a look at what’s under the hood. The backend is a RESTful service written in Scala on Scalatra, a Sinatra-style framework. The REST/JSON RPC model was chosen for quick ad hoc data manipulation for fast [...]

Distributed Applications with Norbert

Distributed Applications at LinkedIn: Here at LinkedIn we need to build a variety of services that can scale horizontally. Two of the common reasons we need to scale horizontally are that we need either more memory or more CPUs (and sometimes both) than we can fit into a single machine. The solution to this problem [...]

Zookeeper experience

While working on Kafka, a distributed pub/sub system at LinkedIn (more on that later), I needed to use Zookeeper (ZK) to implement the load-balancing logic. I’d like to share my experience of using Zookeeper. First of all, for those of you who don’t know, Zookeeper is an Apache project that implements a consensus service based [...]

The Kamikaze version 3.0.0 is released

Kamikaze is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays. For example, in search engines, for each term t, the index, also called the inverted index, contains an inverted list, which is typically a sequence of [...]
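As a rough illustration of a set operation on sorted integer arrays, the sketch below intersects two sorted lists with a linear merge. It is written in Python and does not use or imitate Kamikaze's API: the function name `intersect_sorted` and the sample arrays are made up for the example.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of integers with a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Value present in both arrays: keep it and advance both sides.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is too small to appear later in b, so skip it.
            i += 1
        else:
            j += 1
    return result


# Hypothetical sorted integer arrays.
first = [1, 4, 7, 9, 12]
second = [2, 4, 9, 10, 12, 15]
print(intersect_sorted(first, second))  # prints [4, 9, 12]
```

Because both inputs are sorted, the merge visits each element at most once, which is the main reason sorted arrays are attractive for this kind of work.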


LinkedIn Faceted Search

Faceted search was fully rolled out late last year, and we want to give you some insight into how it came to be, some of its challenges, and what is in the future. At scale and with relevance, faceted search makes a lot of sense on the rich structured data we have here at LinkedIn. [...]


When Pigs Fly: Apache Pig, Open Source and Understanding Systems

Pig at LinkedIn: Hadoop drives many of our most powerful features at LinkedIn. About half of our Hadoop jobs are submitted by Apache Pig. This means that along with Azkaban and Voldemort, Pig is a large part of LinkedIn’s data cycle – the process behind features like People You May Know and Who Viewed My Profile. I have [...]
