Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1

This week we were inspired to do some research, driven by an idea: it must be possible to bring the concepts of tracking users in the online world to retail stores. We are not retail experts, but we know that one of the most important key performance indicators is revenue per square metre, and we thought about adding some new metrics. From a wider perspective, data is produced by various sensors. With a real store in mind, we identified sensors a store could use – customer frequency counters at the doors, the cashier system, free WiFi access points, video capturing, temperature, background music, smells and many more. Many of these sensors require additional hardware and software, but for a few of them solutions already exist, e.g. video capturing with face or even eye recognition. We talked about our ideas with executives and consultants from the retail industry, and they confirmed that the idea is worth pursuing.

WiFi-based in-store analysis with Hadoop and Impala – hackathon

We thought the most interesting sensor data (that doesn’t require additional hardware or software) could come from the WiFi access points, especially since many visitors carry WiFi-enabled mobile phones. With their log files we should be able to answer at least the following questions for a particular store (a minimal sketch of how these metrics could be computed follows the list):

  • How many people visited the store (unique visits)?
  • How many visits did we have in total?
  • What is the average visit duration?
  • How many people are new vs. returning?
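
To make these metrics concrete, here is a minimal Python sketch, assuming the access-point logs have already been reduced to (MAC address, timestamp) association events. The 30-minute inactivity gap used to split visits, and all names in the code, are our own illustrative choices and not part of the original setup.

from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed gap that separates two visits

def sessionize(events):
    """Group (mac, timestamp) events into visits of (mac, start, end)."""
    by_mac = defaultdict(list)
    for mac, ts in events:
        by_mac[mac].append(ts)
    visits = []
    for mac, stamps in by_mac.items():
        stamps.sort()
        start = end = stamps[0]
        for ts in stamps[1:]:
            if ts - end > SESSION_GAP:   # long silence -> the previous visit ended
                visits.append((mac, start, end))
                start = ts
            end = ts
        visits.append((mac, start, end))
    return visits

def metrics(events):
    visits = sessionize(events)
    visits_per_mac = defaultdict(int)
    for mac, _, _ in visits:
        visits_per_mac[mac] += 1
    unique_visitors = len(visits_per_mac)
    total_visits = len(visits)
    avg_duration = sum((end - start).total_seconds()
                       for _, start, end in visits) / total_visits
    returning = sum(1 for n in visits_per_mac.values() if n > 1)
    return {"unique_visitors": unique_visitors,
            "total_visits": total_visits,
            "avg_visit_duration_s": avg_duration,
            "returning_visitors": returning,
            "new_visitors": unique_visitors - returning}

if __name__ == "__main__":
    # tiny made-up sample: visitor 01 comes back the next day
    sample = [("aa:bb:cc:dd:ee:01", datetime(2013, 1, 7, 10, 0)),
              ("aa:bb:cc:dd:ee:01", datetime(2013, 1, 7, 10, 20)),
              ("aa:bb:cc:dd:ee:01", datetime(2013, 1, 8, 15, 0)),
              ("aa:bb:cc:dd:ee:02", datetime(2013, 1, 7, 11, 5))]
    print(metrics(sample))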

How do we answer these questions?

Before we started designing a blueprint solution, we first asked ourselves:

  • Who would be asked to answer questions like this?
  • Who is this person?
  • What tools does this person expect to use?
  • And what is a typical skill set of this person?
  • How do they work?

From an interview with an industry-leading company we knew that these questions would be answered by analysts. They use data warehouses, and they typically have a business intelligence (BI), analysis and reporting tool with access to the data warehouse. They are used to using SQL to answer questions.

With our experience at Sentric, we knew that solving the problem with a Big Data approach would introduce a new role – the “Data Scientist”. At that point we slightly adjusted our mission.

So, how do we answer these questions as a Data Scientist?

At a high level of abstraction the answer is simple. We need a data management system with three pieces: ingest, store and process.

Traditional Data Management System Approach

We take this basic architecture and replace the generic terms, mapping them onto the Hadoop ecosystem.

Blueprint for a Data Management System with Hadoop

With this Hadoop architecture a Data Scientist should be able to answer the questions without a dedicated programming environment, using familiar BI, analysis and reporting tools.
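
For illustration, here is a hedged sketch of the kind of SQL such an analyst might run through Impala. The table and column names (wifi_visits, store_id, mac_address) are hypothetical, and the impyla client used to submit the query is simply one way to issue it outside a BI tool – it was not part of the hackathon setup.

# Illustrative only: in the hackathon the SQL ran through Impala into BI tools
# and Excel, but the same query could be issued from Python, e.g. with the
# impyla client. Host, port and the wifi_visits table are hypothetical.
from impala.dbapi import connect

conn = connect(host="impala-node.example.com", port=21050)
cur = conn.cursor()

# Unique visitors per store, counted as distinct client MAC addresses.
cur.execute("""
    SELECT store_id, COUNT(DISTINCT mac_address) AS unique_visitors
    FROM wifi_visits
    GROUP BY store_id
""")
for store_id, unique_visitors in cur.fetchall():
    print(store_id, unique_visitors)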

Setup

We planned a hackathon together with our partner company YMC to prove this concept. Here are the ingredients:

  • Two WiFi access points running OpenWRT, a Linux-based firmware for routers, to simulate two different stores *
  • A virtual machine acting as central syslog daemon collecting all log messages from the WiFi routers
  • Flume to move all log messages to HDFS, without any manual intervention (no transformation, no filtering)
  • A 4-node CDH4 cluster running on virtual machines (CentOS, 2 GB RAM, 100 GB HDD), installed and monitored with Cloudera Manager
  • Pentaho Data Integration’s graphical designer for data transformation, parsing, filtering and loading into the warehouse (Hive); a rough sketch of what this step does follows the footnotes below
  • Hive as data warehouse system on top of Hadoop to project structure onto data
  • Impala for querying data from Hive in real time
  • Microsoft Excel to visualize results **

* We actually fired up the two WiFi routers before the hackathon to collect some data for a period of around 4 days.
** Since Impala is still in beta, it only supports SELECT statements and therefore cannot CREATE new tables from query results in Hive’s warehouse. Given this restriction, we decided to copy and paste query results into MS Excel for further analysis and visualization. Once Impala can CREATE tables, a Data Scientist will be able to access that data directly from their BI, analysis and reporting tools.
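
As a rough illustration of what the transformation step in front of Hive has to do, the Python sketch below filters station-association messages out of a raw syslog stream and emits tab-separated rows; in the hackathon this job was done graphically with Pentaho Data Integration. The hostapd message format assumed here varies with the OpenWRT version, and the file and script names in the usage comment are hypothetical.

# Sketch of the transformation step (handled by Pentaho Data Integration in the
# hackathon): keep only association messages from the raw syslog stream and
# emit tab-separated rows that a Hive external table could sit on top of.
# The exact hostapd log format and all names here are assumptions.
import re
import sys

ASSOC = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s+(?P<host>\S+)\s+hostapd:\s+\S+:\s+"
    r"STA\s+(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})\s+IEEE 802\.11:\s+associated"
)

def parse(lines):
    for line in lines:
        m = ASSOC.search(line)
        if m:
            # host identifies the access point, i.e. the simulated store;
            # MAC addresses are assumed lowercase, as hostapd prints them
            yield "\t".join((m.group("ts"), m.group("host"), m.group("mac")))

if __name__ == "__main__":
    # hypothetical usage: cat /var/log/wifi/*.log | python parse_assoc.py > assoc.tsv
    for row in parse(sys.stdin):
        print(row)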

Jean-Pierre Koenig – Head of Big Data Analytics at YMC AG

In part 2 of this series you will find details of the data ingestion. If you would like to give us feedback or want more details, do not hesitate to contact me at jean(minus)pierre(dot)koenig(at)sentric.ch.

[Update]

Continue reading:

Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3

  1. We are processing web server access logs and HTTP request logs.
    We have created a three-node cluster and are initially processing a 3 GB log for cluster testing and benchmarking.
    But it takes 2 to 3 minutes to execute the MapReduce job.
    I don’t understand the heap memory of the cluster and how to assign it.
    If you have any knowledge regarding this, please share.

    Thank You

    • So, could you share some of your settings, the amount of data you are processing, and your architecture approach?

      Thanks, jp

  2. Hi Dashrath

    We didn’t tune our cluster for this case, since we processed data from only about 4 days, which is not really “Big Data”. The processing time here was under 2 minutes, with most of that time consumed by scheduling and instantiating the jobs.

    Our cluster setup for the case study:
    Six virtual machines with the Cloudera training image (CentOS), installed and managed by Cloudera Manager. We configured the VMs with 2 GB RAM and 100 GB of hard disk space.

    JP

  3. Thank you for sharing such a nice case study.
    I would like to know more about the data processing time of your MapReduce job.
    Have you done any performance tuning?
    If possible, please share your cluster configuration.

    Thank You.

  4. Thank you for sharing this nice idea.
    Just one question: what would you use to connect Excel to Hive? I’m assuming it’s an ODBC driver. Is there a free version of it?
    Thanks

  5. Jean-Pierre, thank you for sharing your approach to this Retail use case. I’d like to make your readers aware that Informatica released the PowerCenter Big Data Edition in December, 2012. The PowerCenter Big Data Edition http://bit.ly/Wkvvk5 enables organizations to execute ETL, complex file parsing, data quality, and data profiling on Hadoop.

    • Hi John, thank you for the product recommendation. PowerCenter offers a great graphical designer for working with Big Data on Hadoop.

      The main focus of this hackathon was to show a simple solution that can be put together in one day. What we have done here does not necessarily require a tool at all. We chose Pentaho for a simple reason – we had used it before. Palo, SpagoBI, Talend, Karmasphere, PowerCenter and many other tools are suitable as well.

      We will show what we have done with Pentaho in a later post.

      Jean-Pierre
