Context:
It is difficult to imagine the great volume of data we supply to different agencies through our everyday actions, bit by bit: surfing the Internet, posting on social media, using credit and debit cards, making online purchases, and doing other things in which we share information about our identity. It is believed that social media and networking service companies such as Facebook may already have more data than they are leveraging. There are countless ways to slice and dice data, which is itself daunting, since every step carries the potential for serious mistakes.
The concept:
- The concept of big data has been around for years; most organizations now understand that if they capture all the data that streams into their businesses, they can apply analytics and get significant value from it. Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights. The Hollywood film Moneyball (2011) stands out for putting the spotlight on data science by showing that the art of data science is more about asking the right questions than just having the data.
Why careful data mining is needed
- Careful data mining from Big Data might help us understand behaviour and so facilitate planning. But there are examples of blunders being made even with a wealth of information at one’s fingertips. The problem with so much information is that there is now a much larger haystack in which one has to search for the needle.
The Google project
- In 2008, in a case that critics later described as “Big Data hubris”, Google launched its much-hyped Google Flu Trends (GFT), based on online searches on Google for flu-related information.
- The aim was to provide “almost instant signals” of overall flu prevalence weeks earlier than the data released by the Centers for Disease Control and Prevention (CDC), the leading national public health institute in the U.S.
- But much of it went wrong: GFT missed the 2009 swine flu pandemic, was wrong for 100 out of 108 weeks starting in August 2011, and even missed the peak of the 2013 flu season by 140%. Google had tried to identify flu from search patterns alone.
Data Blunders:
- Data blunders often arise out of bias, low-quality data, unreliable sources, technical glitches, an improper understanding of the larger picture, and lack of proper statistical tools and resources to analyse large volumes of data.
- Moreover, Big Data invariably exhibits statistical relationships among different variables that are not real, technically called “spurious correlations” or “nonsense correlations”. Relying too heavily on a particular model is also a common mistake in Big Data analyses. Therefore, the model should be chosen carefully according to the situation.
- According to Nassim Nicholas Taleb, author of The Black Swan: The Impact of the Highly Improbable, Big Data may mean more information, but it also means more false information. There is a possibility of getting lost in the waves of data.
- Mining and geological engineers design mines to remove minerals safely and efficiently. The same principle should be adopted by statisticians in order to mine data efficiently. Big Data is more complex and involves additional challenges.
- Meeting these challenges may call for analytical, decision-making, logical thinking and problem-solving skills, along with advanced computational and statistical expertise. So, using some routine algorithm is not enough. Too much reliance on available software is also a serious mistake.
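The point about spurious correlations can be illustrated with a small simulation. The following is a minimal sketch, not drawn from the source: it generates columns of pure random noise and reports the strongest correlation found between any two of them. The function names, the seed and the sizes (5 versus 200 variables, 20 observations each) are illustrative assumptions.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_observations, seed=0):
    """Largest |correlation| between any two columns of independent noise.

    Every column is drawn independently, so any correlation found here
    is spurious by construction. (Hypothetical helper for illustration.)
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


# Same number of observations, but many more variables to compare.
few = strongest_chance_correlation(n_variables=5, n_observations=20)
many = strongest_chance_correlation(n_variables=200, n_observations=20)

print(f"strongest correlation among 5 noise variables:   {few:.2f}")
print(f"strongest correlation among 200 noise variables: {many:.2f}")
```

With 200 variables there are 19,900 pairs to compare, so some pair is likely to look strongly related purely by chance. This is the larger haystack at work: the more variables are mined, the more convincing the strongest accidental relationship tends to appear.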
Conclusion
- Data mining is not an easy task: the algorithms used can get very complex, and the data is not always available in one place. It needs to be integrated from various heterogeneous sources. We must find an answer to the question: what is the future of so much reliance on data, when a lot of spurious correlations could dominate our lifestyles and livelihoods?
Source: TH