“Enterprise data will grow 650% in the next five years. Also, through 2015, 85% of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage.” – Gartner (mid-2015)
As data grows in volume and complexity, putting it to use for effective decision making becomes a challenge. The ‘data’ that we are talking about here could be anything from a simple 3-5 word tweet (in KB) to photos and videos uploaded to social sites (in MB) or a full-length movie on YouTube or other sites (in GB).
Now think beyond that, and we reach the terabyte (2 years of nonstop listening to MP3 files forms approx. 1 TB) or the petabyte (100 years of television forms a PB; the photos on the Flickr site totaled 60 PB by 2011).
Challenges in data maintenance –
Ineffective decision making if the data is incorrect
Performance issues due to high data magnitude
Increased cost of handling huge volumes of data
Heterogeneous and unstructured data leads to increased effort in reporting
Challenges in testing big data –
Huge and Heterogeneous data –
Data has grown exponentially over the last few years, and it will continue to grow. Only a few years ago, processing a few million records was considered a herculean task that could take a toll on system performance and required heavy investment in hardware and ongoing maintenance.
Gone are the days of gigabytes (1 GB = 1024 MB). The big data landscape has already seen terabytes (1 TB = 1024 GB) and even petabytes (1 PB = 1024 TB).
Such a huge volume of data must be audited for its fitness for business purposes. Preparing test cases at this scale has always been a challenge.
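To make the unit arithmetic above concrete, here is a minimal Python sketch that formats a raw byte count using the same 1024-based steps. The helper name `human_readable` is our own choice for illustration and does not come from any particular tool.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB"]

def human_readable(num_bytes: int) -> str:
    """Format a byte count using 1024-based units (1 GB = 1024 MB, and so on)."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that keeps the number below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024

print(human_readable(1024 ** 4))       # 1 TB expressed in bytes
print(human_readable(60 * 1024 ** 5))  # 60 PB expressed in bytes
```

A helper like this is handy in test reports, where raw byte counts for terabyte-scale data sets are hard to read at a glance.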
Technical expertise –
Big data is a relatively new term, and the technology is changing faster than ever. Testing big data, unlike other kinds of testing, needs testers who thoroughly understand the big data ecosystem and can think beyond automated testing. With its unexpected and complex structure, big data can cause automated scripts to fail.
Given the shortage of expertise, organizations may need to invest in training and in developing automated solutions for big data. Moreover, it requires a mindset shift for testing units within an organization, where testers will now have to be on par with developers in leveraging big data technologies.
Understanding data and foreseeing effort –
Without proper knowledge of the available data, it is difficult to strategize testing and estimate the effort required. It’s also necessary for a tester to understand the statistical correlation between data and business benefits.
Example (1). Suppose we need to generate a report from Twitter on a topic that captures people’s emotions as percentages. Sounds weird? Yes, understanding the ‘emotion’ factor from the available data is the challenge.
Example (2). We have websites that help us search for similar-sounding songs. Just imagine: the song metadata needs to be compared with millions of songs in the database, and the results need to be displayed in seconds.
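Example (1) can be sketched in a few lines of Python. This is only a toy illustration: the keyword lists, the `classify` and `emotion_percentages` helpers, and the sample tweets are all hypothetical, and a real system would likely rely on a trained sentiment model rather than keyword matching.

```python
from collections import Counter

# Hypothetical keyword lists, used only for illustration.
EMOTION_KEYWORDS = {
    "joy": {"love", "great", "happy"},
    "anger": {"hate", "angry", "awful"},
}

def classify(tweet: str) -> str:
    """Label a tweet with the first emotion whose keywords it contains."""
    words = {w.strip(".,!?#@") for w in tweet.lower().split()}
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if words & keywords:
            return emotion
    return "neutral"

def emotion_percentages(tweets):
    """Return the share of tweets per emotion, as percentages."""
    if not tweets:
        return {}
    counts = Counter(classify(t) for t in tweets)
    return {emotion: 100.0 * n / len(tweets) for emotion, n in counts.items()}

sample_tweets = [
    "I love this phone",
    "I hate the battery",
    "Arrived today",
    "Great camera, happy with it",
]
print(emotion_percentages(sample_tweets))
```

Even this toy version shows where the testing effort goes: the tester has to decide what counts as a correct label before any percentage in the report can be verified.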
Need for special test environments due to large data size (HDFS, the Hadoop Distributed File System) –
Big data is more than just size. Its significance lies in the 4 V’s: Volume (magnitude), Velocity (the rate at which data is generated/transported), Variety (type of data) and Veracity (accuracy and quality).
Test automation –
Too much data, too many scenarios to cover, too little time for regression testing, and many real-time services involved.
Live integration testing –
There has been a sudden demand for capturing live data and analyzing it in real time, for example in weather warning systems and forecast mechanisms. Data may come from multiple feeds (sources), so each feed is expected to deliver reliable, clean data.
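A minimal sketch of such a quality check on incoming weather readings is shown below. The field names, the plausibility range for temperature, and the helper names `validate_reading` and `split_feed` are assumptions made for this example, not part of any real feed specification.

```python
# Hypothetical schema for one weather reading.
REQUIRED_FIELDS = {"station_id", "timestamp", "temperature_c"}

def validate_reading(reading: dict) -> list:
    """Return a list of problems found in one reading (empty means clean)."""
    problems = []
    missing = REQUIRED_FIELDS - set(reading)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    temp = reading.get("temperature_c")
    if temp is not None:
        if not isinstance(temp, (int, float)):
            problems.append("temperature is not numeric")
        elif not -90.0 <= temp <= 60.0:  # assumed plausibility bounds
            problems.append(f"temperature out of range: {temp}")
    return problems

def split_feed(readings):
    """Separate clean readings from rejected ones, keeping the reasons."""
    clean, rejected = [], []
    for reading in readings:
        problems = validate_reading(reading)
        if problems:
            rejected.append((reading, problems))
        else:
            clean.append(reading)
    return clean, rejected

feed = [
    {"station_id": "A1", "timestamp": "2015-06-01T10:00:00Z", "temperature_c": 21.5},
    {"station_id": "B2", "timestamp": "2015-06-01T10:00:05Z"},
    {"station_id": "C3", "timestamp": "2015-06-01T10:00:07Z", "temperature_c": 999},
]
clean, rejected = split_feed(feed)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejection reasons alongside each record makes it easier to trace a data quality problem back to the feed that caused it.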
Scalability testing –
We have been experiencing and discussing the exponential growth of data. Applications working on big data are expected to scale so that they can handle this increasing volume.
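One simple way to observe scaling behaviour is to time the same processing step against growing input sizes. The sketch below assumes a hypothetical `aggregate` function standing in for the application logic; a real scalability test would run against the actual cluster and data volumes rather than in-memory lists.

```python
import time

def measure_scaling(process, sizes):
    """Time `process` on synthetic inputs of increasing size."""
    results = []
    for n in sizes:
        data = list(range(n))  # synthetic records
        start = time.perf_counter()
        process(data)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed))
    return results

def aggregate(records):
    """Hypothetical stand-in for the application's processing step."""
    return sum(r * 2 for r in records)

for n, seconds in measure_scaling(aggregate, [10_000, 100_000, 1_000_000]):
    print(f"{n:>9} records: {seconds:.4f} s")
```

If the elapsed time grows much faster than the input size, that is an early hint the application may not scale to the volumes discussed above.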
Performance testing –
Big data applications work with live data for real-time analytics and reporting, so performance testing goes hand in hand with live integration and scalability testing.
So, what are the key aspects that we need to focus on while dealing with big data testing?
Validation of structured and unstructured data
Dealing with non-relational databases
Optimal test environment
Performing non-functional testing
Tools to test big data:
TestingWhiz –
A popular testing tool, it offers an automated big data testing solution to verify structured and unstructured data sets, schemas, approaches and inherent processes residing at different sources in your application, in languages such as ‘Hive’, ‘MapReduce’, ‘Sqoop’ and ‘Pig’. The major features include post-ETL data validation, data migration validation and a big data health check.
QuerySurge –
A collaborative data testing solution that finds bad data in big data and provides a holistic view of the data’s health. It helps you ensure that the source and target data are compatible.
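The idea of checking that source and target data agree can be illustrated with a small, tool-independent Python sketch. This is not how either tool above works internally; the `row_fingerprint` and `reconcile` helpers and the sample rows are hypothetical and only show the general technique of comparing row counts and row hashes.

```python
import hashlib

def row_fingerprint(row: tuple) -> str:
    """Hash a row's values so rows can be compared cheaply."""
    joined = "|".join(str(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def reconcile(source_rows, target_rows):
    """Compare source and target by row count and by row fingerprints."""
    source = {row_fingerprint(r) for r in source_rows}
    target = {row_fingerprint(r) for r in target_rows}
    return {
        "source_count": len(source_rows),
        "target_count": len(target_rows),
        "missing_in_target": len(source - target),
        "unexpected_in_target": len(target - source),
    }

source_rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]
target_rows = [(1, "alice", 30), (2, "bob", 26)]
print(reconcile(source_rows, target_rows))
```

Note that the set comparison ignores duplicate rows, so the row counts are reported separately to catch duplication or loss during migration.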