Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 4

In the previous article we explained how to parse, transform and finally load data into Hive’s warehouse. Now it’s time to talk about querying the data. Before we start, here is how a sample of the data looks like:

As you can see, there is still some noise in the last column. We are interested in “authentication OK‘ and “deauthenticated” messages only. The messages from the router are not standardized (as are protocols such as TCP). We found that those two status messages are the closest ones to our understanding of a “login”/”logout” on the router. Let’s reduce the data set to those lines. During this step we do the duration calculation as well.

We used Cloudera’s real time query engine Impala for this task and here is how the query looks like (ts refers as timestamp):

We already talked about Impala’s early state of development and that it lacks the ability to CREATE a table from a query output. At this point we did copy & paste the results into a CSV file, created a new Hive table called ‘visit_duration’ and loaded the CSV file into it. Here is how we did it:

Now we have the data we need to answer the following questions:

  • How many people visited the store (unique visitors)?
    Note: Unlike the traditional customer frequency counter at the doors we have mac addresses at the log files that are unique for mobile phones. Supposed people do not change their mobile phones we can recognize unique visitors and not just visits.
  • How many visits did we have?
  • What is the average visit duration?
  • How many people are new vs. returning?

While we had a setup with 2 WiFi routers to simulate different stores we continue to describe the process for just one of them called “buffalo”, aka store number one.

Counting the visits for store number one is very simple:

The plot (figure 1) indicates that about 85% of the visits were detected in store number one and about 15% in store number two. One might draw the conclusion that store number one is in a much better location with more occasional customers. But let’s gain more insights by analysing the number of unique visitors.

Visits for stores number one & two

Figure 1 – Visits for stores number one & two

Collecting the number of unique visitors is even simpler as we have the mac addresses of visitors that make them unique:

This plot (figure 2) gives us more details about the customers. It turns out that the 135 visits in store number one were caused by just 9 unique visitors while store number two encountered 5 unique visitors.

Unique visitors

Figure 2 – Unique visitors

This shows us how important it is to have a thorough look at your data. We realize now, that the big difference shown in figure 1 is not that big anymore. It also shows us that there must have been some customers who returned to store number one. So let’s go into more details here.

To answer the new vs. returning ratio we had to perform this query:

This result gives us the number of returning users. Since we already know the total number of visitors (which is 9 for store number one), we are able to calculate the proportion of new users and plot a graph (figure 3).

Figure 3 - New vs. returning users

Figure 3 – New vs. returning users

This plot (figure 3) indicates that we have more returning than new users in both stores. In store number two we didn’t see a new user over the past 4 days at all. It’s probably a good idea to start a marketing campaign which aims at new customers, e.g. to give out vouchers for the first purchase.
But maybe there are other reasons behind this figures. Store number one might be located in a shopping mall and store number two might be located somewhere in town where people like to walk around when the sun is shining. Perhaps it was raining during the last 4 days and store number one encountered the visits of some new customers because they chose the mall to go shopping and decided to visit store number one out of comfort, as they were “trapped” in the mall anyway. This assumption gives a perspective of what is possible with our BigData approach: why don’t we include weather data and investigate the effects on our visitors?

To investigate whether a customer just popped into our store out of boredom, let’s have a look how long he stayed in it. Answering the question about visit duration is done by using the aggregate function:

The average visit duration in store number one was around 00:16:16h while the average visit duration in store number two was 00:06:06h.

Figure 4 - Visit duration over the past 4 days

Figure 4 – Visit duration over the past 4 days

The plot (figure 4) for the last 4 days vividly visualizes that the visit duration in store number one was evenly distributed while the distribution in store number two shows some peaks. We can also see that visitors tend to stay in shop number one much longer. Assuming that both shops sell the same product (which seems to need some consultation) one might think that store number two did not sell a single product. But maybe the customers just enjoyed consultation in store number one and then bought the product in store number two. We would need to include sales figures to investigate this.

During our work of writing queries we acquired a better understanding of the data and the information it carries. Unsurprisingly, we realized that we can answer a different question as well:

  • What is the average length of time between two visits?

And here is how it goes:

Since aggregated functions can not be nested, we calculated the average by hand: 7332.484127 seconds which is around 02:02:12h for store number one and 71053.64286 seconds which is around 19:44:14h for store number two.

Now let’s analyse the behavior of one particular user over both stores:

The calculated average duration between visits over both stores for this particular user is 15009.77778  seconds which is around 04:10:09h.

Figure 5 - Average Duration Between Visits of one particular user

Figure 5 – Average Duration Between Visits of one particular user

There is a lot of useful information that can be derived from this plot (figure 5). Firstly, there is a repeating pattern of step-ins and step-outs within a short period of time. Perhaps this user was having a mobile phone conversation somewhere around the door of this store. Secondly, there was a step-out of store number one and a step-in into store number two within just 28 seconds. Imagine for a while both routers were in the same store on different floors. This pattern is a clear indicator that this particular user went from one floor to another, which should be interpreted as one visit.

Conclusion

  • Analysing WiFi router log files could be done with a traditional RDBMS database approach as well. But one of the main benefits of this architecture is the ability to store a variety of semi structured files and to apply a schema afterwards. As the raw data contains a lot of information beyond our questions, it’s easy to answer different questions ad hoc. This effect could be leveraged whenever new log data from other sources can be processed and joined together.
  • Answering such questions based on WiFi router log files can be done without programming software by using graphical designers from existing BI/analysis and reporting tools with a BigData platform integration.
  • Given the fact that one can quickly ramp up a test cluster with a few nodes, similar problems can be solved within one day with a handful of engineers. The Cloudera Manager makes it very easy to install, maintain and monitor a Hadoop cluster and it can be used without profound understanding of the whole ecosystem.
  • Impala as a query engine is still in beta phase but querying massive amounts of data in real time is definitely the future. Hive does not support implicit JOINs that we used here. Furthermore we used JOIN with the “=” condition, where the left and right side comes from the same table which is not supported in Hive.
  • It’s possible to track paths from people based on WiFi router signals using triangulation. There a few projects following this idea. You can find some links below.
  • Assuming that a retail store has several floors, each of which equipped with a WiFi router, each visit interpreted as “login”/”logoff” on a particular router is not correct anymore. Additional data processing is required to identify visitors that just walk through the store and visit different floors as they “login” and “logout” within a short period of time between the levels, e.g. within 30 seconds.

Similar projects/solutions

  • (German) Handysignale verraten Wege der HB-Passanten
    Wo wird es für die Bevölkerung eng am Hauptbahnhof, woher strömen die Menschen in den Bahnhof, wohin gehen sie und wo staut es? Diese Fragen soll eine neue Technologie beantworten, die frei verfügbare Signale von Mobiltelefonen ortet. Zürcher Oberländer – 09.10.12
  • (German) MagicMap  - Ein System zur kooperativen Positionsbestimmung über  WLAN
    MagicMap ist eine reine Softwarelösung, die bei den mobilen Systemen außer einer konventionellen WLAN-Ausstattung keine weitere Hardware erfordert. Die WLAN Access Points können beliebig verteilt sein und es sind weder Eingriffe an der AP-Hardware noch an der Software nötig. magicmap.de 
  • (German) Euclid Zero: US-Startup überträgt Monitoring für Onlineshops auf lokalen Handel
    Das US-Startup Euclid Analytics nutzt die WLAN-Funktion von Smartphones, um das Einkaufsverhalten von Kunden im lokalen Handel zu verfolgen und auszuwerten. So entstehen umfangreiche Statistiken, vergleichbar mit den Besucherstatistiken eines Onlineshops aus Google Analytics. t3n – 22.01.2013
  • PlaceLab Geopositioning system
    The Place Lab software listens for wireless network base-stations, it can then look-up the coordinates of whatever networks it finds and will use triangulation to calculate its position. 4BA2 Technology Survey
  • OsmocomBB Projct
    OsmocomBB is an Free Software / Open Source GSM Baseband software implementation. It intends to completely replace the need for a proprietary GSM baseband software, such as a) drivers for the GSM analog and digital baseband (integrated and external) peripherals b) the GSM phone-side protocol stack, from layer 1 up to layer 3. osmocom.org
  • Indoor positioning system (IPS)
    An indoor positioning system (IPS) is a network of devices used to wirelessly locate objects or people inside a building. Wikipedia

Our Moment

And this is how a hackathon at YMC looks like:

Jean-Pierre König, CTO Sentric

Jean-Pierre König, Head of Big Data Analytics

This is the final post of this series. If you have questions or feedback: jean(minus)pierre(dot)koenig(at)ymc(dot)ch.

Continue reading:

Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3 

 

  1. Hi there, it would be interesting developing the performance aspects since Impala is intendes for real time/interactive querying.
    How many rows were ingested through flume, how many lines in the facts table, how long did all these queries took to complete?
    Looking forward reading your figures guys ! Regards

    • Hi Coulibaly. Unfortunately i have to mention that performance analysis was not part of this case study.

      Since our hackathons are time-boxed we produced just a few thousand log lines on our office wifi routes during one day. The fact table is obviously quite small and query duration around milliseconds. Said that, we are not really talking about ‘big’ data here. It’s more about showing how easy and fast one could adapt those technologies and generate actionable results from ‘useless’ log-file data. Compared to a traditional data warehouse approach its a massive difference and that’s what we wanted to focus on. What would you expect to have in one day? That’s the question here.

      Regards, Jean-Pierre

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">