Big Data and Data Science: methods and tools

January 17, 2013

Today we want to share with you a guest post by Dmytro Karamshuk (@karamshuk), one of Stanfy’s founders, who is now doing research on complex networks, human mobility and mobile networks in Cambridge. If you are interested in Big Data, feel free to give us a shout in the comments.

As the world becomes more and more obsessed with mobile gadgets, social networking, e-commerce, GPS devices and billions of other sensors recording every aspect of our lives, software engineers have to deal with rapidly growing amounts of data. According to IBM, “we create 2.5 quintillion bytes of data every day — so much that 90% of the data in the world today has been created in the last two years alone.”

This data is too big and moves too fast; as a result, it exceeds the processing capacity of conventional data processing systems. However, along with the technological challenges, Big Data brings exciting new insights into the patterns and laws of human behavior hidden deep inside massive collections of information.

Hidden Data

Big data analysis stands behind many products and tools we use in everyday life, including search engines, restaurant and product recommendation systems, friend finders in social networks, targeted advertising, etc.

For example:

  • a search engine algorithm finds the most relevant site for your query based on its popularity among your friends, its global “prestige” on the web and its semantic correspondence with the query;
  • a product recommendation system matches your purchasing history with the purchasing history of other users to predict your future preferences;
  • a restaurant suggestion app analyzes your check-in history, that of your friends and the general popularity of venues to recommend a nearby place to eat.

All of these tasks require effective tools to process fast-growing and dynamically changing data, as well as sophisticated machine-learning methods to predict and understand users’ expectations from the traces they leave when interacting with mobile gadgets and with the web.
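To make the idea of matching purchasing histories a bit more concrete, here is a minimal sketch in Python. It is only an illustration and not how any particular product works: the function names (jaccard, recommend), the choice of Jaccard similarity and the sample data are all assumptions made for this example.

```python
def jaccard(a, b):
    """Overlap between two sets of purchased items, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(target, histories, top_n=3):
    """Rank items the target user has not bought yet.

    Each candidate item is scored by summing the similarity of every
    other user who bought it (an assumed, deliberately simple scheme).
    """
    own = histories[target]
    scores = {}
    for user, items in histories.items():
        if user == target:
            continue
        weight = jaccard(own, items)
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical purchase histories, invented for this example.
histories = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"novel", "bookmark"},
    "dave": {"laptop", "keyboard", "headset", "monitor"},
}

suggestions = recommend("alice", histories)
print(suggestions)
```

Real systems work with millions of users and far richer signals, but the core step is the same: find people whose history looks like yours and see what they bought that you have not.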

Big Data Technologies

While only a few years ago processing petabytes of data was the prerogative of big companies and research institutes, the tools and methods of big data analysis are now available to every software developer thanks to the emergence of cheap cloud technologies and a rich assortment of machine-learning and statistical libraries.

MapReduce is, by a large margin, the most popular big data technology. The brilliance of the MapReduce approach comes from its simplicity and fundamental scalability. A software engineer writing a MapReduce application concentrates on the logic of the data processing, while all utility functions, including data input/output, running parallel tasks and dividing input between tasks, are handled by the MapReduce implementation, such as Amazon EMR, Google App Engine or Hadoop. If you plan to play with MapReduce on a local machine or manually set up a MapReduce cluster in your own network, Hadoop would be the right software choice, whereas Amazon EMR and Google App Engine are full-cycle cloud solutions with flexible per-hour payment schemes.
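The division of labor described above can be sketched with the classic word-count example. This is a single-process toy in plain Python, not code for Hadoop or any other real framework: the map_phase and reduce_phase functions stand for the logic a developer would write, while shuffle_phase and word_count are a stand-in for the work a MapReduce implementation does for you. All names here are made up for the illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this across machines.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially in one process."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two tiny "documents" standing in for a large input data set.
documents = ["big data is big", "data science uses big data"]
counts = word_count(documents)
print(counts)
```

Because each call to map_phase looks at one record only and each call to reduce_phase looks at one key only, a framework is free to run them on as many machines as it likes, which is where the scalability comes from.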

Software Libraries

The methods for big data analysis range from statistics and machine learning to artificial intelligence and data mining, and are usually grouped under the common umbrella of Data Science.

Although the design of data analysis tools and machine-learning techniques remains the craft of the scientific community, their practical applications have spread to many engineering contexts. This is largely due to the appearance of various data science software libraries, most of which are easy to use even for developers with minimal machine-learning expertise.

To name a few:

  • Weka – a Java-based library with a range of classification and clustering techniques;
  • RankLib – a collection of ranking algorithms;
  • R – a powerful environment for statistical computing and graphics;
  • Mahout – a MapReduce-based framework for machine learning and statistical analysis.

Online courses

Filling a gap in machine-learning knowledge within a software team is no longer a problem either. The internet now offers numerous ways to improve one’s data science expertise through free online courses from leading professors at Stanford, MIT, etc. The introductory course on machine learning taught by a Stanford professor, for example, can be found on Coursera.

The methods and tools for big data analysis are becoming more and more powerful, while the entry barrier and the infrastructure costs are steadily falling. This opens the door to big data experiments for small businesses and startups. As a consequence, Big Data has every chance of becoming a new foundation for IT innovation in the coming years.

By Anna Iurchenko