Monday, August 25, 2008

Analysing GPS Logs with Awk

This post describes the first two "chop" functions that fit into the partitioning framework outlined last post.

def chopToSpeedHistogram(dest, p):
    # create histogram of speeds from nmea written to stdout
    os.system("cat "+sh_escape(dest)+".log"
              + " | awk -F , '{if($1==\"$GPVTG\" && int($8)!=0){count[int($8+0.5)]++}}"
              + " END {for(w in count) printf(\"[%d,%d],\\n\", w, count[w]);}'"
              # sort it
              + " | sort -g -k 1.2"
              # output json of histogram
              + " > "+sh_escape(dest)+".hist")

def chopToHeadingHistogram(dest, p):
    # create histogram of headings from nmea written to stdout (ignore heading when stopped)
    os.system("cat "+sh_escape(dest)+".log"
              + " | awk -F , '{if($1==\"$GPVTG\" && int($8)!=0){count[5.0*int($2/5.0+0.5)]++;}}"
              + " END {for(w in count) printf(\"[%d,%d],\\n\", w, count[w]);}'"
              # sort it
              + " | sort -g -k 1.2"
              # output json of histogram
              + " > "+sh_escape(dest)+".head")

Both functions use awk to create a histogram from the speed (in km/h) and heading (or bearing, in degrees) from the NMEA VTG sentences. The speed is rounded to an integer, and the bearing to the nearest 5 degrees. The data logger records on reading per second, so this gives a measure of how much time was spent at each speed/bearing.

The histogram is output in a "json" array format that can be inserted straight into a webpage where the flot library is used to generate some graphs.

Speed Histogram

The average and standard deviation (shaded at ±0.5σ) are indicated on the graph for two bike rides along the same route, and match pretty closely with that recorded by my bike computer:

GPS logBike computer
Ride 1 (brown) 2hrs 59 min, minus 41 min stopped63.4km27.7km/h 2 hrs 16 min64.01km28.00km/h
Ride 2 (dark green) 2 hrs 25 min, minus 13 min stopped63.4km29.0km/h 2 hrs 10 min63.85km29.30km/h

The two rides went in different directions, the first in the "uphill" direction and the second with a bit of a tail wind. I got a flat tire on the first ride too, hence the extra time spent stopped.

Heading Histogram

Up is north and the radius represents the time spent heading in that direction (normalized during the plotting process and "expanded" by taking the square root to show a little more detail.)

Thursday, August 21, 2008

Automatically Partitioning GPS Logs with gpsbabel

My GPS logger is capturing lots of useful information but it's difficult to efficiently capture data for regular activities. Geotagging photos is easy, and manually working with the logs for a special event is possible, but it's not feasible to put in that much work to analyze commutes for example.

The logger creates a separate log file each time it's switched on and off, and while these logs could be sorted into categories for analysis, it's easy to forget to turn it on and off at the start and end of a section of interest and activities are then merged in the logs. In addition, there is often "junk" data at start and end of logs while leaving or arriving at a destination.

I wanted to be able to automatically capture the information about my daily activities by simply switching on the logger and carrying it around with me. I then simply want to plug the logger into the computer and have the logs automatically chopped into segments of interest that can be compared to each other over time.

The rest of this post roughly outlines the Python script I created to perform this task, minus some of the hopefully irrelevant details.

Firstly, I collect the lat/long coordinates of places that I am interested in collecting data while I'm there and traveling between them. These include my home, work, the climbing gym and so on. Each point has a radius within which any readings will be considered to be in that place.

#         id:  name lat         long        radius
places = { 1: ("A", -37.123456, 145.123456, 0.050),
           2: ("B", -37.234567, 145.234567, 0.050),
           3: ("C", -37.345678, 145.345678, 0.050) }
otherid = 4

For each of these places of interest, I then use gpsbabel's radius filter to find all the times where I was within that zone:

# create a list of all raw log files to be processed
from path import path
month = path("/gpslogs/200808")
logs = " ".join(["-i nmea -f %s"%log 
                 for log in sorted((month/"raw").files("GPS_*.log"))])

for (id,(place,lat,lon,radius)) in places.items():
   os.system("gpsbabel "
             # input files
             + logs
             # convert to waypoints
             + " -x transform,wpt=trk,del"
             # remove anything outside place of interest
             + (" -x radius,distance=%.3fK,lat=%.6f,lon=%.6f,nosort"%(radius,lat,lon))
             # convert back to tracks
             + " -x transform,trk=wpt,del"
             # output nmea to stdout
             + " -o nmea -F -"
             # filter to just GPRMC sentences
             + " | grep GPRMC"
             # output to log file
             + (" > %s/processed/place%d.log"%(month,id)))

And all points outside any of the specific places of interest are sent into an "other" file:

os.system("gpsbabel "
          # input files
          + logs
          # convert to waypoints
          + " -x transform,wpt=trk,del"
          # remove anything in a place of interest
          + "".join([" -x radius,distance=%.3fK,lat=%.6f,lon=%.6f,nosort,exclude"%(radius,lat,lon)
                     for (id,(place,lat,lon,radius)) in places.items()])
          # convert back to tracks
          + " -x transform,trk=wpt,del"
          # output nmea to stdout
          + " -o nmea -F -"
          # filter to just GPRMC sentences
          + " | grep GPRMC"
          # output to log file
          + (" > %s/processed/place%d.log" % (month, otherid)))

These files are filtered with grep to contain only minimal data as we only require the timestamps for this part of the process. Specifically only the NMEA GPRMC sentences are kept.

To provide a brief illustration, the following picture shows two log files of data, a blue and a green, between three points of interest:

The above process would create four files, one for each point A, B and C and one for "Other" points that would contain something like the following information, where the horizontal axis represents time:

I then read all those log files back in to create a "time line" that for each timestamp stores my "location" in the sense that it knows whether I was "home", at "work" or somewhere between the two.

# dict of timestamp (seconds since epoch, UTC) to placeid
where = {}
for placeid in places.keys()+[otherid,]:
   for line in (month/"processed"/("place%d.log"%placeid)).lines():
      fields = line.split(",")
      # convert date/time to seconds since epoch (UTC)
      t, d = fields[1], fields[-3]
      ts = calendar.timegm( (2000+int(d[4:6]), int(d[2:4]), int(d[0:2]),
                             int(t[0:2]), int(t[2:4]), int(t[4:6])) )
      where[ts] = placeid

This is then summarised from one value per second to a list of "segments" with a start and end time and a location. Unlogged time segments are also inserted at this point whenever there are no logged readings for 5 minutes or more.

# array of tuples (placeid, start, end, logged)
# placeid = 0 indicates "unknown location", i.e. unlogged
summary = []
current, start, stop, last_ts = 0, 0, 0, None
for ts in sorted(where.keys()):
   # detect and insert "gaps" if space between logged timestamps is greater than 5 minutes
   if last_ts and ts-last_ts > 5*60:
      if current:
         summary.append( [current, start, stop, True] )
      current, start, stop = where[ts], ts, ts
      summary.append( [0, last_ts, ts, False] )
 
   last_ts = ts

   if where[ts] != current:
      if current:
         summary.append( [current, start, stop, True] )
      current, start, stop = where[ts], ts, ts
   else:
      stop = ts
summary.append( [current, start, stop, True] )

(If there's a more "Pythonic" way of writing that kind of code, I'd be interested in knowing it.)

"Spurious" segments are then removed. These show up because when the logger is inside buildings the location jumps around and often out of the 50m radius meaning that, for example, there will be a sequence of Home-Other-Home-Other-Home logs. The "Other" segments that are between two known points of interest and less than 5 minutes long are deleted, as are "Other" segments that sit between a known place of interest and an unlogged segment.

Based on the above graphic, the summary might look something like the following:

startendlocation
10.00am10.05amA
10.05am10.30amOther
10.30am10.35amB
10.35am11.00amOther
...

The "Other" segments are then labelled if possible to indicate they were "commutes" between known locations:

startendlocation
10.00am10.05amA
10.05am10.30amA-B
10.30am10.35amB
10.35am11.00amB-C
...

Some segments cannot be labeled automatically and are left as "Other". This may be a trip out to a "one-off" location and back again, which can be left as "Other". However, sometimes it is because the logger didn't lock onto the satellites within the 50m radius on the way out of a place of interest and these can be manually fixed up later.

Once a list of "activities" has been obtained, with start and end times, it is easy to use gpsbabel again to split logs based on start and end of time segments:

for (place, start, stop, place_from, place_to, logged) in summary:
    dest = month / "processed" / ("%s-%s"%(time.strftime("%Y%m%d%H%M%S", time.localtime(start)),
                                           time.strftime("%Y%m%d%H%M%S", time.localtime(stop))))

   for (ext, chopFn) in [(".log", chopToLog),
                         (".kml", chopToKml), 
                         (".speed", chopToSpeedVsDistance), 
                         (".alt", chopToAltitudeVsDistance), 
                         (".hist", chopToSpeedHistogram),
                         (".head", chopToHeadingHistogram),
                         (".stops", chopToStopsVsDistance)]:
      if not (dest+ext).exists():
         chopFn(dest, locals())
         # make the file in case it was empty and not created
         (dest+ext).touch()

This generates a bunch of files for each segment, named with the start and end timestamps of the segment and an extension depending on the content. The first "chop" function generates an NMEA format log file that is then processed further by the remaining "chop" functions. The other chop functions will probably be explained in a later post, the first two are:

def chopToLog(dest, p):
    # filter input file entries within times of interest to temp file
    os.system("gpsbabel " + p["logs"]
              + (" -x track,merge,start=%s,stop=%s"
                 % (time.strftime("%Y%m%d%H%M%S", time.gmtime(p["start"])),
                    time.strftime("%Y%m%d%H%M%S", time.gmtime(p["stop"]))))
              + " -o nmea -F "+sh_escape(dest)+".log")

def chopToKml(dest, p):
    # create kml file with reduced resolution
    os.system("gpsbabel -i nmea -f "+sh_escape(dest)+".log"
              + " -x simplify,error=0.01k"
              + " -o kml -F "+sh_escape(dest)+".kml")

def sh_escape(p):
    return p.replace("(","\\(").replace(")","\\)").replace(" ","\\ ")

(Again, if there's a better way to handle escaping special characters in shell commands, I would like to know it.)

Using this, I can simply plug in the logger, which launches an autorun script, and the end result are nicely segmented log files that I can map and graph. More about that in another post.

Monday, August 18, 2008

GMail tip: "To be filtered" label

I use GMail's filters a lot, and in particular have a "Newsletters" label that keeps "bacn" from getting into my Inbox (and instead just lights up my ambient email notifier green.)

I've added a "to be filtered" label that I can apply to anything that slips into the Inbox and then later on, when I have time, I can look at all these messages, select them in groups and use the "Filter messages like these" command to make sure they don't bother me again.

This solves two problems: being taken out of the "flow" to create a filter, and perpetually handling the items manually and never getting around to creating a filter.

AMOD AGL3080 GPS data logger

I bought the AMOD AGL3080 GPS data logger about a month ago now and am very happy with it. Thanks Richard for a very helpful review.

I've left it on the default settings, logging everything in NMEA format at 1 second intervals. It seems quite easy to change the logging interval and detail but at this stage, I'm hard pressed to imagine a situation where I'll need to as I've logged about 62 hours so far and only used about 50% of the 128mb capacity.

It's a little larger than I'd like because it takes three AAA batteries. I'm using eneloops and getting between 10 and 12 hours per charge (changing them as soon as I notice the battery warning light flashing, I don't know how long it would continue running if left to run flat completely.)

The logging seems quite accurate while driving and cycling, but has a lot of "jitter" while walking for some reason. It generally locks onto the signal quickly enough once I get outside, but again while walking it seems to take an unusually long time to lock on which is a little frustrating.

It's best feature is that it doesn't require any special software to access the logs. It simply mounts as a USB drive allowing the NMEA format log files to be copied off. This is great and makes it particularly attractive for Linux and Mac users.