This week we were inspired to do some research, driven by one idea: it must be possible to bring the concepts of online user tracking to retail stores. We are not retail experts, but we know that one of the most important key performance indicators is revenue per square metre, and we thought about adding some new metrics. From a wider perspective, data is produced by various sensors. With a real store in mind, we listed the sensors a store could use: customer frequency counters at the doors, the cashier system, free WiFi access points, video capturing, temperature, background music, smells and many more. While many of those sensors need additional hardware and software, solutions for a few already exist, e.g. video capturing with face or even eye recognition. We discussed our ideas with executives and consultants from the retail industry, and they confirmed that the idea is interesting to pursue.
We thought the most interesting sensor data (that doesn’t require additional hardware/software) could come from the WiFi access points, especially given that many visitors will have WiFi-enabled mobile phones. With the access points’ log files we should be able to answer at least the following questions for a particular store:
- How many people visited the store (unique visits)?
- How many visits did we have in total?
- What is the average visit duration?
- How many people are new vs. returning?
How do we answer these questions?
Before we started designing a blueprint solution we first of all asked ourselves:
- Who would be asked to answer questions like this?
- Who is this person?
- What tools does this person expect to use?
- And what is a typical skill set of this person?
- How do they work?
From an interview with an industry-leading company we knew that these questions are answered by analysts. They use data warehouses, and they typically have a business intelligence (BI), analysis and reporting tool with access to the data warehouse. They are used to using SQL to answer questions. With our experience at Sentric, we knew that solving the problem with a Big Data approach introduces a new person: the “Data Scientist”. Right, at that point we slightly adjusted our mission.
So, how do we answer these questions as a Data Scientist?
From a high level of abstraction the answer is simple. We need a data management system with three pieces: ingest, store and process.
Traditional Data Management System Approach
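To make the three pieces concrete, here is a toy, in-memory sketch in Python. It is not the Hadoop setup described in this series. The log format, the sample data, the function names and the 30-minute gap that separates two visits are all assumptions made for illustration; real access point log lines look different and need their own parser. "Returning" here simply means a device with more than one visit in the sample.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, simplified log format: "<ISO timestamp> <store> <client MAC>".
# The sample lines below are made up for illustration.
SAMPLE_LOG = """\
2013-01-07T09:00:00 store-a aa:aa:aa:aa:aa:01
2013-01-07T09:10:00 store-a aa:aa:aa:aa:aa:01
2013-01-07T09:05:00 store-a aa:aa:aa:aa:aa:02
2013-01-08T14:00:00 store-a aa:aa:aa:aa:aa:01
2013-01-08T14:20:00 store-a aa:aa:aa:aa:aa:01
"""

# Assumption: sightings of one device more than 30 minutes apart are separate visits.
VISIT_GAP = timedelta(minutes=30)


def ingest(raw):
    """Ingest: parse raw log text into (timestamp, store, mac) tuples."""
    events = []
    for line in raw.splitlines():
        if not line.strip():
            continue
        ts, store, mac = line.split()
        events.append((datetime.fromisoformat(ts), store, mac))
    return events


def store_events(events):
    """Store: group sighting timestamps by (store, mac), sorted by time."""
    table = defaultdict(list)
    for ts, store, mac in events:
        table[(store, mac)].append(ts)
    for sightings in table.values():
        sightings.sort()
    return table


def process(table, store):
    """Process: derive the four metrics for one store."""
    visits_per_device = {}
    for (s, mac), sightings in table.items():
        if s != store:
            continue
        visits = [[sightings[0], sightings[0]]]  # [start, end] per visit
        for ts in sightings[1:]:
            if ts - visits[-1][1] > VISIT_GAP:
                visits.append([ts, ts])  # gap too large: a new visit begins
            else:
                visits[-1][1] = ts  # same visit: extend its end time
        visits_per_device[mac] = visits

    all_visits = [v for vs in visits_per_device.values() for v in vs]
    total_seconds = sum((end - start).total_seconds() for start, end in all_visits)
    returning = sum(1 for vs in visits_per_device.values() if len(vs) > 1)
    return {
        "unique_visitors": len(visits_per_device),
        "total_visits": len(all_visits),
        "avg_visit_minutes": total_seconds / 60 / len(all_visits) if all_visits else 0.0,
        "returning_visitors": returning,
        "new_visitors": len(visits_per_device) - returning,
    }


metrics = process(store_events(ingest(SAMPLE_LOG)), "store-a")
print(metrics)
```

A device that is seen only once produces a visit of zero length, which pulls the average duration down. Whether such sightings should count as visits at all is a definition question the analyst has to settle.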
We take this basic architecture and replace the generic terms by mapping them onto the Hadoop ecosystem.
Blueprint for a Data Management System with Hadoop
With this Hadoop architecture a Data Scientist should be able to answer the questions without any programming environment. He/she can also use familiar BI, analysis and reporting tools.
We planned a hackathon together with our partner company YMC to prove this concept. Here are the ingredients:
- 2 WiFi access points, running OpenWRT (a Linux-based firmware for routers), to simulate two different stores *
- A virtual machine acting as central syslog daemon collecting all log messages from the WiFi routers
- Flume to move all log messages to HDFS, without any manual intervention (no transformation, no filtering)
- A 4 node CDH4 cluster running on virtual machines (CentOS, 2 GB RAM, 100 GB HDD), installed and monitored with Cloudera Manager
- Pentaho Data Integration‘s graphical designer for data transformation, parsing, filtering and loading to the warehouse (Hive)
- Hive as data warehouse system on top of Hadoop to project structure onto data
- Impala for querying data from Hive in real time
- Microsoft Excel to visualize results **
* We actually fired up the two WiFi routers before the hackathon to collect some data for a period of around 4 days.
** Since Impala is still in beta, it only supports SELECT statements. It is therefore not able to CREATE new tables from query results in Hive’s warehouse. Given this restriction, we decided to copy & paste query results into MS Excel for further analysis and visualization. Once Impala can CREATE tables, a Data Scientist can access that data from their BI, analysis and reporting tools.
Head of Big Data Analytics at YMC AG
In part 2 of this series you will find details of data ingestion. If you would like to give us feedback or want more details, do not hesitate to contact me at jean(minus)pierre(dot)koenig(at)sentric.ch.
[Update]
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4