In the previous article we explained how to parse, transform and finally load data into Hive’s warehouse. Now it’s time to talk about querying the data. Before we start, here is how a sample of the data looks like:
|
1 2 3 4 5 6 7 8 9 |
[root@cdh-master ~]# hadoop fs -cat /user/hive/warehouse/routerlogs/part-00000 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-AUTHENTICATE.indication(98:0c:82:dc:8b:15, OPEN_SYSTEM) 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS.request(98:0c:82:dc:8b:15) 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,authenticated 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,association OK (aid 2) 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,associated (aid 2) 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-ASSOCIATE.indication(98:0c:82:dc:8b:15) 1358756939,2013,1,21,9,28,59,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,MLME,MLME-DELETEKEYS.request(98:0c:82:dc:8b:15) 1358757010,2013,1,21,9,30,10,+01:00,buffalo,hostapd,wlan0,98:0c:82:dc:8b:15,IEEE 802.11,deauthenticated |
As you can see, there is still some noise in the last column. We are interested in “authentication OK‘ and “deauthenticated” messages only. The messages from the router are not standardized (as are protocols such as TCP). We found that those two status messages are the closest ones to our understanding of a “login”/”logout” on the router. Let’s reduce the data set to those lines. During this step we do the duration calculation as well.
We used Cloudera’s real time query engine Impala for this task and here is how the query looks like (ts refers as timestamp):
|
1 |
SELECT A.ts, MIN(B.ts - A.ts), A.host, A.mac_address FROM routerlogs A, routerlogs B WHERE A.host = B.host AND A.mac_address = B.mac_address AND A.ts <= B.ts AND A.message LIKE '%authentication OK%' AND B.message LIKE '%deauthenticated%' GROUP BY A.host, A.mac_address, A.ts; |
We already talked about Impala’s early state of development and that it lacks the ability to CREATE a table from a query output. At this point we did copy & paste the results into a CSV file, created a new Hive table called ‘visit_duration’ and loaded the CSV file into it. Here is how we did it:
|
1 2 3 4 5 6 7 |
create table visit_duration ( ts int, duration_in_seconds int, router string, mac_address string) row format delimited fields terminated by ','; |
Now we have the data we need to answer the following questions:
- How many people visited the store (unique visitors)?
Note: Unlike the traditional customer frequency counter at the doors we have mac addresses at the log files that are unique for mobile phones. Supposed people do not change their mobile phones we can recognize unique visitors and not just visits. - How many visits did we have?
- What is the average visit duration?
- How many people are new vs. returning?
While we had a setup with 2 WiFi routers to simulate different stores we continue to describe the process for just one of them called “buffalo”, aka store number one.
Counting the visits for store number one is very simple:
|
1 2 3 4 5 |
Copyright (c) 2012 Cloudera, Inc. All rights reserved. (Build version: Impala v0.3 (3cb725b) built on Fri Nov 23 13:51:59 PST 2012) [localhost:21000] > SELECT COUNT(*) FROM visit_duration WHERE router = "buffalo"; 135 |
The plot (figure 1) indicates that about 85% of the visits were detected in store number one and about 15% in store number two. One might draw the conclusion that store number one is in a much better location with more occasional customers. But let’s gain more insights by analysing the number of unique visitors.
Collecting the number of unique visitors is even simpler as we have the mac addresses of visitors that make them unique:
|
1 2 |
[localhost:21000] > SELECT COUNT(DISTINCT(mac_address)) FROM visit_duration WHERE router = "buffalo"; 9 |
This plot (figure 2) gives us more details about the customers. It turns out that the 135 visits in store number one were caused by just 9 unique visitors while store number two encountered 5 unique visitors.
This shows us how important it is to have a thorough look at your data. We realize now, that the big difference shown in figure 1 is not that big anymore. It also shows us that there must have been some customers who returned to store number one. So let’s go into more details here.
To answer the new vs. returning ratio we had to perform this query:
|
1 2 |
[localhost:21000] > SELECT count(distinct(A.mac_address)) FROM visit_duration A, visit_duration B WHERE A.mac_address = B.mac_address AND A.ts != B.ts AND A.router = "buffalo"; 7 |
This result gives us the number of returning users. Since we already know the total number of visitors (which is 9 for store number one), we are able to calculate the proportion of new users and plot a graph (figure 3).
This plot (figure 3) indicates that we have more returning than new users in both stores. In store number two we didn’t see a new user over the past 4 days at all. It’s probably a good idea to start a marketing campaign which aims at new customers, e.g. to give out vouchers for the first purchase.
But maybe there are other reasons behind this figures. Store number one might be located in a shopping mall and store number two might be located somewhere in town where people like to walk around when the sun is shining. Perhaps it was raining during the last 4 days and store number one encountered the visits of some new customers because they chose the mall to go shopping and decided to visit store number one out of comfort, as they were “trapped” in the mall anyway. This assumption gives a perspective of what is possible with our BigData approach: why don’t we include weather data and investigate the effects on our visitors?
To investigate whether a customer just popped into our store out of boredom, let’s have a look how long he stayed in it. Answering the question about visit duration is done by using the aggregate function:
|
1 2 |
[localhost:21000] > SELECT AVG(duration_in_seconds) FROM visit_duration WHERE router = "buffalo"; 976.6666666666666 |
The average visit duration in store number one was around 00:16:16h while the average visit duration in store number two was 00:06:06h.
The plot (figure 4) for the last 4 days vividly visualizes that the visit duration in store number one was evenly distributed while the distribution in store number two shows some peaks. We can also see that visitors tend to stay in shop number one much longer. Assuming that both shops sell the same product (which seems to need some consultation) one might think that store number two did not sell a single product. But maybe the customers just enjoyed consultation in store number one and then bought the product in store number two. We would need to include sales figures to investigate this.
During our work of writing queries we acquired a better understanding of the data and the information it carries. Unsurprisingly, we realized that we can answer a different question as well:
- What is the average length of time between two visits?
And here is how it goes:
|
1 2 3 4 5 6 7 8 9 |
[localhost:21000] > SELECT B.ts, MIN(B.ts - A.ts), A.router, A.mac_address FROM visit_duration A, visit_duration B WHERE A.router = B.router AND A.router = "buffalo" AND A.mac_address = B.mac_address AND A.ts + A.duration_in_seconds <= B.ts GROUP BY A.router, A.mac_address, B.ts; 1358758959 5045 buffalo 10:68:3f:40:20:2d 1358771917 754 buffalo d8:d1:cb:e9:ed:6c 1358766344 628 buffalo d8:d1:cb:e9:ed:6c 1358764299 47 buffalo d8:d1:cb:e9:ed:6c 1358771935 18 buffalo d8:d1:cb:e9:ed:6c 1358517188 400 buffalo 24:ab:81:91:c8:62 1358764341 89 buffalo d8:d1:cb:e9:ed:6c … |
Since aggregated functions can not be nested, we calculated the average by hand: 7332.484127 seconds which is around 02:02:12h for store number one and 71053.64286 seconds which is around 19:44:14h for store number two.
Now let’s analyse the behavior of one particular user over both stores:
|
1 2 3 4 5 6 7 8 9 10 11 |
SELECT B.ts, MIN(B.ts - A.ts), A.router, A.mac_address FROM visit_duration A, visit_duration B WHERE A.router = B.router AND A.mac_address = B.mac_address AND A.ts + A.duration_in_seconds <= B.ts AND A.mac_address = "10:68:3f:40:20:2d" GROUP BY A.router, A.mac_address, B.ts ORDER BY B.ts LIMIT 100; ... 1358759467 11 buffalo 10:68:3f:40:20:2d 1358760760 1293 buffalo 10:68:3f:40:20:2d 1358760892 132 buffalo 10:68:3f:40:20:2d 1358761202 326626 fonera 10:68:3f:40:20:2d 1358761459 257 fonera 10:68:3f:40:20:2d 1358761492 33 fonera 10:68:3f:40:20:2d 1358761552 60 fonera 10:68:3f:40:20:2d 1358761596 704 buffalo 10:68:3f:40:20:2d ... |
The calculated average duration between visits over both stores for this particular user is 15009.77778 seconds which is around 04:10:09h.
There is a lot of useful information that can be derived from this plot (figure 5). Firstly, there is a repeating pattern of step-ins and step-outs within a short period of time. Perhaps this user was having a mobile phone conversation somewhere around the door of this store. Secondly, there was a step-out of store number one and a step-in into store number two within just 28 seconds. Imagine for a while both routers were in the same store on different floors. This pattern is a clear indicator that this particular user went from one floor to another, which should be interpreted as one visit.
Conclusion
- Analysing WiFi router log files could be done with a traditional RDBMS database approach as well. But one of the main benefits of this architecture is the ability to store a variety of semi structured files and to apply a schema afterwards. As the raw data contains a lot of information beyond our questions, it’s easy to answer different questions ad hoc. This effect could be leveraged whenever new log data from other sources can be processed and joined together.
- Answering such questions based on WiFi router log files can be done without programming software by using graphical designers from existing BI/analysis and reporting tools with a BigData platform integration.
- Given the fact that one can quickly ramp up a test cluster with a few nodes, similar problems can be solved within one day with a handful of engineers. The Cloudera Manager makes it very easy to install, maintain and monitor a Hadoop cluster and it can be used without profound understanding of the whole ecosystem.
- Impala as a query engine is still in beta phase but querying massive amounts of data in real time is definitely the future. Hive does not support implicit JOINs that we used here. Furthermore we used JOIN with the “=” condition, where the left and right side comes from the same table which is not supported in Hive.
- It’s possible to track paths from people based on WiFi router signals using triangulation. There a few projects following this idea. You can find some links below.
- Assuming that a retail store has several floors, each of which equipped with a WiFi router, each visit interpreted as “login”/”logoff” on a particular router is not correct anymore. Additional data processing is required to identify visitors that just walk through the store and visit different floors as they “login” and “logout” within a short period of time between the levels, e.g. within 30 seconds.
Similar projects/solutions
- (German) Handysignale verraten Wege der HB-Passanten
Wo wird es für die Bevölkerung eng am Hauptbahnhof, woher strömen die Menschen in den Bahnhof, wohin gehen sie und wo staut es? Diese Fragen soll eine neue Technologie beantworten, die frei verfügbare Signale von Mobiltelefonen ortet. Zürcher Oberländer – 09.10.12 - (German) MagicMap - Ein System zur kooperativen Positionsbestimmung über WLAN
MagicMap ist eine reine Softwarelösung, die bei den mobilen Systemen außer einer konventionellen WLAN-Ausstattung keine weitere Hardware erfordert. Die WLAN Access Points können beliebig verteilt sein und es sind weder Eingriffe an der AP-Hardware noch an der Software nötig. magicmap.de - (German) Euclid Zero: US-Startup überträgt Monitoring für Onlineshops auf lokalen Handel
Das US-Startup Euclid Analytics nutzt die WLAN-Funktion von Smartphones, um das Einkaufsverhalten von Kunden im lokalen Handel zu verfolgen und auszuwerten. So entstehen umfangreiche Statistiken, vergleichbar mit den Besucherstatistiken eines Onlineshops aus Google Analytics. t3n – 22.01.2013 - PlaceLab Geopositioning system
The Place Lab software listens for wireless network base-stations, it can then look-up the coordinates of whatever networks it finds and will use triangulation to calculate its position. 4BA2 Technology Survey - OsmocomBB Projct
OsmocomBB is an Free Software / Open Source GSM Baseband software implementation. It intends to completely replace the need for a proprietary GSM baseband software, such as a) drivers for the GSM analog and digital baseband (integrated and external) peripherals b) the GSM phone-side protocol stack, from layer 1 up to layer 3. osmocom.org - Indoor positioning system (IPS)
An indoor positioning system (IPS) is a network of devices used to wirelessly locate objects or people inside a building. Wikipedia
Our Moment
And this is how a hackathon at YMC looks like:
This is the final post of this series. If you have questions or feedback: jean(minus)pierre(dot)koenig(at)ymc(dot)ch.
Continue reading:
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 1
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 2
Case Study: Retail WiFi Log-file Analysis with Hadoop and Impala, Part 3







Hi there, it would be interesting developing the performance aspects since Impala is intendes for real time/interactive querying.
How many rows were ingested through flume, how many lines in the facts table, how long did all these queries took to complete?
Looking forward reading your figures guys ! Regards
Hi Coulibaly. Unfortunately i have to mention that performance analysis was not part of this case study.
Since our hackathons are time-boxed we produced just a few thousand log lines on our office wifi routes during one day. The fact table is obviously quite small and query duration around milliseconds. Said that, we are not really talking about ‘big’ data here. It’s more about showing how easy and fast one could adapt those technologies and generate actionable results from ‘useless’ log-file data. Compared to a traditional data warehouse approach its a massive difference and that’s what we wanted to focus on. What would you expect to have in one day? That’s the question here.
Regards, Jean-Pierre