2011 Data Scientist Summit

Team member Ryan Rosario (Data Scientist, Engineer) recently attended the 2011 Data Scientist Summit in Vegas. For those data geeks (or undercover geeks) who couldn’t make it to the event, here’s a recap of what he learned and his overall thoughts on the event. And if you’re interested in a complete download on the event, you can read his full review on his personal blog, Byte Mining.

From Ryan (@DataJunkie):

Data Scientist Summit, Day 1

I arrived at the conference room and quickly took my seat. The keynote by Thornton May provided a lot of humor and kicked off a very energetic event. In the second session, we heard from data scientists and team members from Bloom Studios, 23andMe, Kaggle and Google. I was happy to see somebody from Google present, as they never seem to attend these types of events (neither does Facebook).

There has been a lot of buzz about 23andMe and Kaggle in the past few months. It is hard to keep up with it all, so it was great to hear from the companies themselves. 23andMe provides users with a kit containing a test tube into which the user spits. The kit is then sent back to 23andMe labs, which analyze something like 500,000 to a million different markers (I am not a biologist) and can provide information about which markers are present, such as a predisposition to diabetes or cancer. In 2011, it costs about $5,000 to do this analysis, whereas 10 to 20 years ago the figure was in the millions. 23andMe goes a step further. They understand that genetics have a strong association with particular conditions, but that they are not necessarily causal. For example, someone with a predisposition to diabetes will not necessarily contract the disease. 23andMe wants to integrate other data into their models to help predict how likely a patient is to contract a certain condition, given their genetics.

Kaggle is a community-based platform for individuals and organizations to submit datasets and open them up to the Data Science community for analysis…as a competition. I love the geekiness of this endeavor, and it continues where the Netflix Prize left off. Kaggle has some awesome prizes for winning its competitions, such as $3M for the Heritage Health Prize. There are other freebies as well, such as a free copy of Revolution R Enterprise for competitors.

Of all the presentations on the first day, Data Scientist DNA was my favorite. In this panel, Anthony Goldbloom of Kaggle, Joe Hellerstein from UC Berkeley, David Steier from Deloitte and Roger Magoulas from O’Reilly Media discussed what makes a good Data Scientist or “data ninja” as stated in the program. All were in agreement that candidates should have an understanding of Probability and Statistics, although someone on the panel suggested that a “basic” background was all that was needed; I disagree with that. A Data Scientist should also be a proficient programmer in some language, either compiled or interpreted, and understand at least one statistical package. More importantly, the panel stressed that above and beyond knowledge, it is imperative that a Data Scientist be willing to learn new tools, technologies and languages on the job. Dr. Hellerstein suggested some general guidelines on classes students should take: Statistics (I argue for a full year of upper division statistics, and graduate study), Operating Systems, Database Systems and Distributed Computing. My favorite quote from the panel came from David Steier: “you don’t just hire a Data Scientist by themselves, you hire them onto a team.” I could not agree more. Finally, the moderator of the panel suggested that Roger Magoulas may have been the one to coin the term “big data” in 2005, but a Twitter follower found evidence that the term has been in use since as early as 2000.

Data Scientist Summit, Day 2

It seemed that the highlight of the morning was the talk by Jonathan Harris titled The Art and Science of Storytelling. He introduced his project “We Feel Fine,” which is a conglomeration of emotions. His project aims to capture the status of the human condition. This was more of a touchy-feely kind of presentation, which is different from most of the Data Science talks. He showed beautiful user interfaces and great examples of fluid user experience. Some statistics that caught my eye concerned human emotion over time. It seemed that people experienced loneliness earlier in the week rather than later in the week. Joy and sadness were approximately inversely related throughout the week and hours of the day, but I cannot remember the direction of the trends. The most interesting graphics involved the difference between “feeling fat” and “being fat.” States like California and New York were hot spots for “feeling fat”, but they are actually some of the skinniest states. Instead, the region between the Gulf of Mexico and the Great Lakes was actually the fattest, but did not feel that way. A graphic for “I feel sick” showed a hotspot in Nevada, which I thought was very interesting (nuclear fallout? alcohol poisoning in Vegas?). The interesting part of this discussion was that it showed the vast geography of the field called Data Science. Some Data Scientists are more of the visualization and human connection variety, and others (where I consider myself) are more of the classic geeks that like to write code and dig into the data to get a noteworthy result. Well, I guess there isn’t much difference between the two camps after all. As Jonathan would probably say, Data Science is about storytelling.

The last session in the trifecta was titled The Data Scientist’s Toolset – The Recipes that Win. Representatives from various companies were panelists: SAS, Informatica, Cloudscale, Revolution Analytics and Zementis. I felt that this discussion was lacking. I believe the strength of the Data Science community stems from open-source technology, and except for Revolution Analytics, none of the companies have a strong reputation in the open-source community yet. Discussion seemed to focus too much on enterprise analytics (SQL, SAS, Greenplum, etc.) and Hadoop, and not enough on analysis and visualization. All in all, this panel was a bit too “enterprisey” for me. Some Twitterers felt that the panelists were pushing their products too much. This was surprising because I felt the exact opposite, unless they were picking up on the “enterprisey” vibe. The panelists were asked what one tool for data science they would choose if they were on a desert island. The panelists responded with the following tools, “Perl, C++, Java, R [sic, thanks David], SQL and Python.” I was disappointed that SQL was mentioned without a counter-mention for NoSQL because not all data fits in a nice rectangle called a table. By itself, SQL is very limited. Python and R I definitely agree with. Perl is dated, but still has a use in the Data Scientist’s toolbox if the user is not familiar with Python, and doesn’t want to be. I was baffled by the C++ response and the lack of overlap in the other responses. But these are my opinions only.

Overall Impression

All in all, the Data Scientist Summit was an eye-opening and empowering event, and it was planned in only six weeks. There was a great sense of community and collaboration among those in attendance. I work as a Data Scientist professionally because I love it. The one fact that I tend to overlook is that Data Scientists are in high demand and short supply. I was reminded of how important our work as Data Scientists is.

The Data Scientist Summit set a very solid foundation for the future. I felt like the modus operandi was “here is why Data Science is cool” and “here is why others should be interested.” Although this is not a groundbreaking discussion, it sets the stage for future conferences and solidification of the community.
Without a doubt I will be at next year’s Data Scientist Summit!