Any company serious about its network and system health has a Network Operations Center (NOC), and now the Rubicon Project has taken that important step as well. With the launch of the tRP NOC, we can now better supervise, monitor, and maintain our network for our publishers, with increased visibility, 24/7.
The Rubicon Project NOC contains visualizations of the networks, systems, and services used across the REVV platform. It is the focal point for network troubleshooting, software distribution, performance monitoring, and coordination with our real time trading partners. This kind of information gives us the ability to take a proactive approach to troubleshooting our infrastructure.
Ultimately it is an investment in making our technology stronger and more reliable than ever. The vision and execution of the NOC were truly a Rubicon Project team effort. Take a peek inside!
The clouds parted and the sun shone on the University of Southern California last week for its Fall 2011 Viterbi Engineering Career Expo. Along with over 75 …
Team member Ryan Rosario (Data Scientist, Engineer) recently attended the 2011 Data Scientist Summit in Vegas. For those data geeks (or undercover geeks) who couldn’t make it to the event, here’s a recap of what he learned and his overall thoughts on the event. And if you’re interested in a complete download on the event, you can read his full review on his personal blog, Byte Mining.
From Ryan (@DataJunkie):
Data Scientist Summit, Day 1
I arrived at the conference room and quickly took my seat. The keynote by Thornton May provided a lot of humor that kicked off a very energetic event. In the second session, we heard from data scientists and teams from Bloom Studios, 23andMe, Kaggle and Google. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).
There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with it all, so it was great to hear from the companies themselves. 23andMe provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can report which markers are present, such as a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily contract the disease. 23andMe wants to integrate other data into their models to help predict how likely a patient is to contract a certain condition, given their genetics.
Kaggle is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis…as a competition. I love the geekiness of this endeavor, and it continues where the Netflix Prize left off. Kaggle has some awesome prizes for winning competitions, such as $3M for the Heritage Health Prize. There are other freebies as well, such as Revolution R Enterprise, free for competitors.
Of all the presentations on the first day, Data Scientist DNA was my favorite. In this panel, Anthony Goldbloom of Kaggle, Joe Hellerstein from UC Berkeley, David Steier from Deloitte and Roger Magoulas from O’Reilly Media discussed what makes a good Data Scientist or “data ninja” as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a “basic” background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier, “you don’t just hire a Data Scientist by themselves, you hire them onto a team.” I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term “big data” in 2005, but a Twitter follower found evidence that the term has been used since as early as 2000.
Data Scientist Summit, Day 2
It seemed that the highlight of the morning was the talk by Jonathan Harris titled The Art and Science of Storytelling. He introduced his project “We Feel Fine” which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of the touchy-feely kind of presentation which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye concerned human emotion over time. It seemed that people experienced loneliness earlier in the week than later in the week. Joy and sadness were approximately inversely related throughout the week and hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between “feeling fat” and “being fat.” States like California and New York were hot spots for “feeling fat”, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for “I feel sick” showed a hotspot in Nevada which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn’t much difference between both camps after all. As Jonathan would probably say, Data Science is about storytelling.
The last session in the trifecta was titled The Data Scientist’s Toolset – The Recipes that Win. Representatives from various companies were panelists: SAS, Informatica, Cloudscale, Revolution Analytics and Zementis. I felt that this discussion was lacking. I believe the strength of the Data Science community stems from open-source technology, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, Greenplum, etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too “enterprisey” for me. Some Twitterers felt that they were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the “enterprisey” vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, “Perl, C++, Java, R [sic, thanks David], SQL and Python.” I was disappointed that SQL was mentioned without a counter-mention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist’s toolbox if the user is not familiar with Python, and doesn’t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.
All in all, the Data Scientist Summit was an eye-opening and empowering event, and it was planned in only six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.
The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was “here is why Data Science is cool” and “here is why others should be interested.” Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community.
Without a doubt I will be at next year’s Data Scientist Summit!
Machine Learning is the branch of Computer Science, and specifically of Artificial Intelligence (AI), that studies systems that learn, i.e. systems that improve their performance with experience. It is about teaching a computer about the world: you observe the world, develop models that match observations, teach a computer to learn these models and finally the computer applies the learned model to the real world. Applications of Machine Learning (ML) are vast and diverse. Some examples include image recognition, fingerprint identification, weather prediction, medical diagnosis, game playing, text categorization, handwriting recognition, fraud detection, spam filtering, recommended articles/books/movies, solving calculus problems, driving a car … the list is endless. With the advent of the internet and internet advertising, ML is used to solve a whole slew of optimization problems to help improve the user, advertiser and publisher experience. ML draws techniques and algorithms from a diverse range of fields, which include information theory, numerical optimization, control theory, natural language processing, neurobiology, computational complexity theory and linguistics.
HISTORY OF MACHINE LEARNING
The intellectual roots of ML (or more generally AI) go back a long time, with some concept of intelligent machines being found even in Greek mythology. However, it was the advent of computers after World War II that provided the real thrust and theoretical underpinnings to this rich field. I have listed below some pioneering events to help us get a good picture of how the field has evolved.
40’s: This decade marked the advent of computers and the foundation of formal decision-making theory being laid down by von Neumann and Morgenstern.
50’s: John McCarthy coined the term ‘Artificial Intelligence’ in 1956. Also, Arthur Samuel came up with the first game-playing program for checkers.
60’s: Pattern recognition became the prime emphasis in AI. In particular, instance-based methods a.k.a. “This document has the same label as the most similar document”, became popular. The most important event, however, was the perceptron or single-layer Neural Network – Marvin Minsky and Seymour Papert proved some nice properties and limits of the perceptron.
70’s: This decade saw the advent of expert systems (ad hoc rule-based systems) and decision trees (“I can decide a document by incrementally considering its properties”).
80’s: Advanced decision tree and rule learning methods were invented. The biggest hype was around Artificial (multi-layer) Neural Networks (ANN), supposed to be loosely based on neural networks in the brain. An ANN, loosely defined, is a method that extracts linear combinations of the inputs and then outputs non-linear functions of these combinations. Also, the focus in ML shifted to experimental methodology.
90’s: This marked major advances in all areas of ML. New techniques became popular like Reinforcement Learning (RL), Inductive Logic Programming (ILP), and Bayesian Networks (BN). A new set of meta-algorithms or ensembles were developed that combine the results of multiple less-accurate models to output a more accurate prediction – boosting, bagging and stacking becoming the popular methods. With the advent of the World Wide Web, text learning became the primary focus for ML.
2000’s: The pace of innovations and applications accelerated. Kernels and Support Vector Machines (SVM) became state-of-the-art methods to build highly accurate classifiers, regressors and rankers. Graphical methods began to be used more. Applications of ML now extended to Transfer Learning, Sequence Labeling, Robotics and Computer Systems (debuggers, compilers). In a nutshell, this marked the solidifying of ML into an established science with a firm theoretical underpinning, and opening it up to unexplored areas.
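To make the instance-based idea from the 60’s concrete (“this document has the same label as the most similar document”), here is a minimal sketch of a one-nearest-neighbor text classifier. The training documents, the labels and the function names are all made up for illustration, and cosine similarity over word counts is just one reasonable choice of similarity measure, not the one historically used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a document into a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_label(document, labeled_examples):
    """Return the label of the most similar labeled document."""
    query = bag_of_words(document)
    _, best_label = max(
        labeled_examples,
        key=lambda example: cosine_similarity(query, bag_of_words(example[0])),
    )
    return best_label


# Hypothetical, hand-made training documents (not real data).
TRAINING_DOCS = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the team on friday", "ham"),
]

print(nearest_neighbor_label("claim your free prize", TRAINING_DOCS))
print(nearest_neighbor_label("agenda for the team meeting", TRAINING_DOCS))
```

Note that there is no training step at all: the “model” is simply the stored examples, which is what makes the method instance-based.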
DATA FLOW IN MACHINE LEARNING
As a machine learner, your goal is to use past data to make predictions or to summarize the data. But in order to do that effectively, the data has to pass through a number of stages. Here I have listed some high-level steps that most ML-based solutions take in order to achieve the desired outputs.
1) Data Preparation: This marks the beginning of the data flow. Raw data is collected from the domains of interest and ‘cleaned’ according to the problem at hand. A few questions need to be answered at this stage before proceeding:
a. Data-specific: Is the data lawfully at our disposal? Are there issues with user privacy?
b. Learner-specific: How do I de-noise the data? How do I scale the individual values?
c. Task-specific: How do I reconstruct hidden signals from observed ones?
2) Exploratory Data Analysis: This is carried out to learn about the data distribution, validate assumptions and formulate hypotheses. Examples of graphical techniques are box plots, histograms, scatter plots and Multi-Dimensional Scaling.
3) Feature Engineering: This is concerned with decomposing the observed signals into individual variables that are pertinent to the problem at hand. For example, in bio-informatics, you may want to decompose mass spectrometry signals into variables that are predictive for the detection of cancer, and that can be traced back to certain proteins.
4) Training: This is the most complicated step of ML. It involves coming up with the optimization objective for the problem at hand, and solving it using exact or approximate methods to attain the desired objective. Training assumes that there is a loss function that measures the system performance and is minimized. Based on the nature of the output, this is divided into two kinds:
a. Supervised: The goal is to learn patterns to simulate given output. In classification, the output is a categorical label and the loss function is the prediction error, such as the misclassification rate (e.g. spam vs. non-spam email). In regression, the output is a real number and the loss function is the Mean Squared Error (e.g. predict tomorrow’s stock price).
b. Unsupervised: The goal is to look for patterns in the data with no examples of output. For example, find interesting patterns in the data or find outliers.
5) Evaluation: This measures the performance of the system and provides an estimate of how the system will perform when deployed in the real world. Some example metrics are accuracy, sensitivity, specificity and Mean Squared Error.
6) Deployment: This uses the trained model to make predictions in the real world. Performance in the real world can be optionally fed back into the original data flow to tune the model parameters and improve overall performance.
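The steps above can be sketched end to end in a few lines of plain Python. This is only an illustration under stated assumptions: the toy spam data, the two features and every function name are hypothetical, and logistic regression trained by batch gradient descent on the log loss stands in for whatever learner a real system would use.

```python
import math

# Hypothetical toy data: (links_in_email, exclamation_marks) -> 1 = spam, 0 = not spam.
TRAIN = [((1.0, 0.0), 0), ((2.0, 1.0), 0), ((1.5, 0.0), 0),
         ((8.0, 9.0), 1), ((9.0, 7.0), 1), ((7.5, 10.0), 1)]
TEST = [((3.0, 2.0), 0), ((8.5, 8.0), 1)]


def fit_scaler(rows):
    """Data preparation: learn per-feature mean and standard deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def scale(rows, means, stds):
    """Rescale each feature to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, learning_rate=0.5, epochs=200):
    """Training: minimize the average log loss with batch gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Deployment: apply the learned model to one scaled example."""
    score = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


def evaluate(predictions, labels):
    """Evaluation: accuracy, sensitivity and specificity on held-out data."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    tn = sum(1 for p, y in pairs if p == 0 and y == 0)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


train_x, train_y = [x for x, _ in TRAIN], [y for _, y in TRAIN]
test_x, test_y = [x for x, _ in TEST], [y for _, y in TEST]

means, stds = fit_scaler(train_x)
weights, bias = train(scale(train_x, means, stds), train_y)
predictions = [predict(weights, bias, x) for x in scale(test_x, means, stds)]
metrics = evaluate(predictions, test_y)
print(metrics)
```

The scaling parameters are learned from the training examples only and then reused on the test examples, so the evaluation stays an honest estimate of how the model would behave on data it has never seen.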
HOW ML IS USED AT RUBICON PROJECT
The Rubicon Project provides an excellent opportunity to apply ML. One, there is an enormous amount of rich data to analyze. Two, there are some really challenging problems around prediction and forecasting that need to be solved. I will mention just two such high-level areas to illustrate, but in reality this list is long and ever-expanding. One area is the prediction of performance of advertiser campaigns, or click-through rate (CTR) prediction. The target/dependent variable is the performance of the advertisement given the features/independent variables like location of the page, size of the advertisement, content of the advertisement. This is a critical component that is required to set the right expectations around pricing and yield for advertiser campaigns. Another challenging area is to infer the demographics and interests of the user to improve the efficacy of targeting him/her with the right advertisements. For example, merely knowing the age and gender of the person goes a long way to optimize the performance of the ad campaign and improve the user experience. On top of that, if you can infer some interests of the user e.g. “auto enthusiast”, “in the market for Toyota Camry 2007”, it gives an enormous insight into the prospective buy interest of the user and helps drive both user experience and ROI for the advertisers.
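To give a flavor of the CTR problem, here is a deliberately simple baseline, not the method used in production: a smoothed click frequency for each combination of page location and ad size. The impression log, the feature values and the `prior_weight` parameter are all hypothetical; a real system would learn from far more features and data.

```python
from collections import defaultdict

# Hypothetical impression log: (page_location, ad_size, clicked).
IMPRESSIONS = [
    ("above_fold", "300x250", 1),
    ("above_fold", "300x250", 0),
    ("above_fold", "300x250", 0),
    ("above_fold", "300x250", 1),
    ("below_fold", "728x90", 0),
    ("below_fold", "728x90", 0),
    ("below_fold", "728x90", 0),
    ("below_fold", "728x90", 1),
    ("below_fold", "300x250", 0),
]


def estimate_ctr(impressions, prior_ctr, prior_weight=2.0):
    """Estimate CTR per (location, size), shrunk toward a prior CTR.

    prior_weight acts like that many pseudo-impressions at the prior rate,
    so combinations with very little data are not trusted too much.
    """
    clicks = defaultdict(int)
    views = defaultdict(int)
    for location, size, clicked in impressions:
        key = (location, size)
        views[key] += 1
        clicks[key] += clicked
    return {
        key: (clicks[key] + prior_weight * prior_ctr) / (views[key] + prior_weight)
        for key in views
    }


# Use the overall click rate as the prior for every combination.
overall_ctr = sum(clicked for _, _, clicked in IMPRESSIONS) / len(IMPRESSIONS)
ctr_by_slice = estimate_ctr(IMPRESSIONS, prior_ctr=overall_ctr)
for key, ctr in sorted(ctr_by_slice.items()):
    print(key, round(ctr, 3))
```

The smoothing matters most for rare combinations: the single unclicked below-the-fold 300x250 impression gets a small but non-zero estimate instead of a CTR of exactly zero.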
POPULAR EXAMPLES OF REAL WORLD MACHINE LEARNING
ML in popular culture has come a long way from the time that Arthur Samuel came up with the first game-playing program for checkers in the 50’s. Here are three popular examples:
1. IBM made history in the 90’s when Deep Blue (a chess-playing computer) beat the then World Champion (Garry Kasparov).
2. In 2005, Stanford’s autonomous vehicle (Stanley) won the DARPA Grand Challenge race for driver-less cars, passing through three narrow tunnels and navigating more than 100 sharp left and right turns.
3. More recently, in February 2011, IBM’s Jeopardy!-playing computer (Watson) handsomely defeated two of Jeopardy!’s greatest (human) players, and provided yet another example of how smart machines can be trained to perform seemingly impossible tasks.
Please feel free to contact me if you have any questions or comments. Here are two excellent resources to get you started on ML:
1. The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman).
2. Machine Learning (Tom Mitchell)