Introduction

This is a learning project to practise XML and SQL data wrangling using actual OpenStreetMap data. Map data such as street names, postal codes and phone numbers is audited automatically and, where possible, corrected. Corrections found this way are contributed back to the gorgeous OpenStreetMap project. After storing the map data in a SQLite database, some SQL queries are demonstrated which cycling enthusiasts might find interesting.

The OpenStreetMap XML data was downloaded via the Overpass API (https://overpass-api.de/api/map?bbox=10.6622,51.5293,12.0465,52.2715) for a portion of Saxony-Anhalt.

The city of Magdeburg and Harz mountains, Saxony-Anhalt

This is the area where I was born. The size of the dataset is 799 MB (838,025,562 bytes).

In [1]:
"""
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix 
    the unexpected street types to the appropriate ones in the expected list.
    You have to add mappings only for the actual problems you find in this OSMFILE,
    not a generalized solution, since that may and will depend on the particular area you are auditing.
- write the update_name function, to actually fix the street name.
    The function takes a string with street name as an argument and should return the fixed name
    We have provided a simple test so that you see what exactly is expected
"""
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import pandas as pd

#OSMFILE = "example4.osm"
#OSMFILE_audited = "example4_audited.osm"
#OSMFILE = "magdeburg.osm" # https://overpass-api.de/api/map?bbox=11.5257,52.0597,11.8436,52.2518
OSMFILE = "magdeburg_harz.osm" # https://overpass-api.de/api/map?bbox=10.6622,51.5293,12.0465,52.2715
OSMFILE_audited = "magdeburg_audited.osm"


street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
has_numbers_re = re.compile(r'[0-9]+')

            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def is_postcode(elem):
    return (elem.attrib['k'] == "addr:postcode")


def audit(osmfile):
    osm_file = open(osmfile, "r", encoding='utf-8')
    street_types = defaultdict(set)
    postcode_types = defaultdict(set)
    '''
    Usage of defaultdict(set)
    >>> s = [('red', 1), ('blue', 2), ('red', 3), ('blue', 4), ('red', 1), ('blue', 4)]
    >>> d = defaultdict(set)
    >>> for k, v in s:
    ...     d[k].add(v)
    ...
    >>> sorted(d.items())
    [('blue', {2, 4}), ('red', {1, 3})]
    '''
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type_func(street_types, tag.attrib['v'])
                    #if tag.attrib['v'] == 'Wiesenstr':
                    #    print(tag.attrib)
                    #    print("id:{0}; lat={1};lon={2}".format(elem.attrib['id'], elem.attrib['lat'], elem.attrib['lon']))
                if is_postcode(tag):
                    audit_postcode_type_func(postcode_types, tag.attrib['v'], elem=elem)
    osm_file.close()
    return {'street_types': street_types, 'postcode_types': postcode_types}

def update_name(name, mapping):
    # Collect the (single) unexpected street type found in this name
    street_types = defaultdict(set)
    audit_street_type(street_types, name)
    print("update_name")
    pprint.pprint(dict(street_types))
    bad_str = list(street_types.keys())[0]
    name = name.replace(bad_str, mapping[bad_str])
    return name
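update_name depends on audit_street_type and the global mapping defined further below, and its first-key lookup assumes exactly one bad token per name. A minimal, self-contained sketch of the same replace-the-ending idea (function name is illustrative):

```python
import re

street_type_re = re.compile(r'\b\S+\.?$')

def update_name_sketch(name, mapping):
    # Replace the trailing street-type token if it appears in the mapping.
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        return name[:m.start()] + mapping[m.group()]
    return name

print(update_name_sketch("Falsche Str.", {"Str.": "Straße"}))  # Falsche Straße
```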

# Supporting function to get elements from an OSM file.
# Takes an OSM file and a tuple of tag names and yields elements whose tag is in the tuple.
def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()
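As a quick illustration (not part of the original notebook), the generator can be exercised on a tiny in-memory fragment; the root.clear() call is what keeps memory flat on the 799 MB file. The sample XML below is made up:

```python
import io
import xml.etree.ElementTree as ET

def get_element(osm_file, tags=('node', 'way', 'relation')):
    # Copy of get_element above: stream elements, clear the root as we go.
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

sample = io.BytesIO(b'<osm><node id="1"/><way id="2"/><relation id="3"/></osm>')
print([(e.tag, e.get('id')) for e in get_element(sample)])
# [('node', '1'), ('way', '2'), ('relation', '3')]
```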

# This function replaces street-name abbreviations in an OSM file.
# Takes two OSM file names as input, replaces all street abbreviations found in the
# first file and saves the result under the new name (new_file).
# Inspired from: https://github.com/neviadomski/OpenStreetMap-Python-SQL/blob/master/OsmData.py
def improve_streetnames(old_file, new_file):
    with open(new_file, 'wb') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n'.encode(encoding='utf-8', errors='strict') )
        output.write('<osm>\n  '.encode(encoding='utf-8', errors='strict') )
        for i, element in enumerate(get_element(old_file)):
            for tag in element.iter("tag"):
                if is_street_name(tag):
                    tag.set('v',update_name(tag.attrib['v'], mapping))
            output.write(ET.tostring(element, encoding='utf-8'))
        output.write('</osm>'.encode(encoding='utf-8', errors='strict'))

Audit street names

Auditing German street names can certainly be a matter of science, because many exceptions and regional specialities are possible. Most commonly, street names end with words (or sub-strings) like "Straße", "Weg", "Chaussee", "Ring", "Platz", "Tor", "Steg", "Stieg", "Garten", "Park", "Brücke". A few less common endings were added while auditing the dataset iteratively and are listed in the code below.

In [2]:
# Udacity template default (English street types), superseded by the German list below:
#expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
#            "Trail", "Parkway", "Commons"]
expected = ["Straße", "Weg", "Chaussee", "Ring", "Platz", "Tor", "Steg", "Stieg", "Garten", "Park", "Brücke",
           "Aue", "Teich", "ldchen", "Breite", "Damm", "Allee", "Hof", "Berg", "Grund", "Weiler", "Weiher", "Gang",
           "Gasse", "Feld", "Felde", "Markt", "Werder", "Hafen", "Wuhne", "Bogen", "Blick", "Dorf", "Felsen",
           "Winkel", "Wall", "Busch", "Denkmal", "Dom", "Einheit", "Graben", "Bruch", "Kanal", "Weide", "Pfad",
           "Leben", "Ehle", "Elbe", "Fahrt", "Fort", "Steig", "Rain", "Ufer", "Wache", "Sprung", "Grube", "Tal",
           "Ecken", "Friedens", "Brunnen", "Anger", "Schlag", "Scheid", "Krug", "Höhen", "Hügel", "Keller", "Schule",
           "Frieden", "Kuhle", "Börse", "Gesang", "Walde", "Ruh", "Schänke", "See", "Mühle", "Werk", "Revier", "Berge",
           "Schleife", "Plan", "Wiese", "Beck", "Fähre", "Wacken", "Gärten", "Gärtnerei", "Bad", "Horn", "Theater",
           "Bahn", "Eck", "Haus", "Horst", "Lake", "Wellen", "Pfuhl", "Hufe", "Kamp", "Förder", "Form", "Wiesen",
           "Kaserne", "Leuchte", "Kopf", "Pappel", "Deich", "Kopf", "lerei", "Siedlung", "Stadion", "Turm", "Spitze",
           "Rondell", "Neuber", "Born", "Thie", "Marsch", "Promenade", "Gelände", "Insel", "Wände", "Schleuse",
           "Domäne", "Sülze", "Wacke", "Scholle", "Zitadelle", "Stein", "Heide", "Busche", "Bode", "Burg", "Schloss",
           "Schloß", "Wabe", "Wege", "Kreuz", "Reihe", "Treppe", "Höhe", "Kruge", "Dorfe", "Eiche", "Dorn", "Passage",
           "Mauer", "Glück", "Jugend", "Lage", "Kirche", "Hain", "Zaun", "Achte", "Klippe", "Loch", "Hang", "Schacht",
           "Holz", "Teichen", "Stall", "Halde", "Fabrik", "Wand", "Linden", "Renne", "Halden"]


# UPDATE THIS VARIABLE
mapping = { "St": "Street",
            "St.": "Street",
            "Rd.": "Road",
            "Ave": "Avenue"
            }


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def audit_street_type_DE(street_types, street_name):
    expected_lower = [el.lower() for el in expected]
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group() # last word of the street name
        if not street_type.lower().endswith(tuple(expected_lower)) or has_numbers_re.search(street_name):
            street_types[street_type].add(street_name)
            #print(street_name)

# Set the audit function for the actual language+area
audit_street_type_func = audit_street_type_DE

test = defaultdict(set)
test_names = ['Teststraße', 'T1estweg', 'Meine Wiese', 'A Str.' , 'Falsche Str.', 'Laaanger Weeg']
for strasse in test_names:
    audit_street_type_DE(test, strasse)

print("Test data is:")
print(test_names)
print("Candidates for auditing:")
print(test)
Test data is:
['Teststraße', 'T1estweg', 'Meine Wiese', 'A Str.', 'Falsche Str.', 'Laaanger Weeg']
Candidates for auditing:
defaultdict(<class 'set'>, {'T1estweg': {'T1estweg'}, 'Str.': {'A Str.', 'Falsche Str.'}, 'Weeg': {'Laaanger Weeg'}})

Obviously, the OSM street name data is of very good quality and does not feature easy mistakes such as abbreviations. However, there are a few findings that have been corrected in the OSM database. An automatic correction of the findings is not applicable, since (almost) every finding was unique.

A very common error in street names might be a string ending with a number, which should be the house number. However, in some cases this can be correct, as illustrated by "Am Hammelberg Weg 1&2", which is not a problem.

There are no further similar findings in the whole dataset.

Examples of corrections of street names contributed to the OSM-Project

However, some errors were identified and corrected in the OSM dataset using the editor iD.

Changeset #59861609: Typo in "Straße":

Typo in "Straße"

Changeset #59861810: Street name is the name of the place.

Typo in "Street name references the name of the village"

In [5]:
def mainfunc():
    pprint.pprint("Need to improve these names:")
    st_types = audit(OSMFILE)
    pprint.pprint(dict(st_types))
    '''
    pprint.pprint("Would update these names")
    for st_type, ways in st_types.items(): #.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)
    #improve_streetnames(OSMFILE, OSMFILE_audited)
    '''

if __name__ == '__main__':
    print("Not shown in final doc")
    #mainfunc()
Not shown in final doc

Audit Postal Codes

This is a map of the postal codes of the area. It is used to create the regular expression for auditing of the postal codes.

Postal codes in the audited area

So, the postal code shall either start with a "3" or a "0". If it starts with a "3", the following character can be "8" or "9". In case of a "0", the next character must be a "4" or a "6". Three more digits ({3} in the regular expression) are required for a valid postal code. A small section of the map has postal codes beginning with "997". The audit_street_type function has been adapted for auditing postal codes here. It is tested with the list postalcodes and returns all postal codes that do not match the criteria (in the same way as the street names).

In [3]:
# Note: inside [...] a comma is a literal character, so e.g. [8,9] would also match ','.
# The classes below avoid that. re.match anchors at the start of the string.
MDPOSTALCODE = re.compile(r'^(3[89][0-9]{3})|(0[46][0-9]{3})|(997[356][0-9])')

def check_postcode(codestr, check_re):
    if check_re.match(codestr):
        return codestr
    else:
        return None #""

def audit_postcode_type_DE(postcode_types, postcode_str, elem=None):
    postcode_type = postcode_str[0:2]
    if not check_postcode(postcode_str, MDPOSTALCODE):
        if elem:
            auditdetails = {
                "addr:postcode":postcode_str,
                'id':elem.attrib['id']#,
                #'lat':elem.attrib['lat'], 'lon':elem.attrib['lon']
                }
            postcode_types[postcode_type].add(str(auditdetails))
        else:
            postcode_types[postcode_type].add(postcode_str)

# Set the audit function for the actual language+area
audit_postcode_type_func = audit_postcode_type_DE

test = defaultdict(set)
postalcodes=['39118','3822','06456','22123','37123','03456','38521','99734']
for postalcode in postalcodes:
    audit_postcode_type_DE(test, postalcode)
print(test)
defaultdict(<class 'set'>, {'38': {'3822'}, '22': {'22123'}, '37': {'37123'}, '03': {'03456'}})
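Note that MDPOSTALCODE only checks a valid prefix, so an over-long string like "391181" would still pass. A stricter variant requiring exactly five digits could look like this (a sketch, not used in the audit above):

```python
import re

# Exactly five digits with the same prefixes as MDPOSTALCODE.
PLZ_STRICT = re.compile(r'3[89][0-9]{3}|0[46][0-9]{3}|997[356][0-9]')

def is_valid_plz(codestr):
    # fullmatch anchors both ends, rejecting too-short and too-long strings.
    return PLZ_STRICT.fullmatch(codestr) is not None

print([is_valid_plz(c) for c in ['39118', '391181', '3822', '99734']])
# [True, False, False, True]
```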
In [9]:
# Works! But TLDR. So not executed in the final document
audit_results = audit(OSMFILE)
In [10]:
audit_results['postcode_types']
Out[10]:
defaultdict(set, {'96': {"{'addr:postcode': '96449', 'id': '5638689921'}"}})

Regarding the postal codes, the data quality seems to be very good as well. There is only one finding in the observed area: "96449" is not from this area; it does not even exist. The node can be investigated in the browser using this link: https://www.openstreetmap.org/node/5638689921 (no longer the same node). The node references a hospital (German: "Klinikum") and the postal code should be "06449" instead.

Node:5638689921 - correct postal code of the hospital

But wait... let's have a closer look at the pictures above. Is this really the same building on the map and the aerial image? Maybe not... The white markers in the image below point to the correct (right) and the wrong (left) position of the hospital. The image is a screenshot showing the OSM editor iD in operation, used to delete the wrong node. The changeset is #59886196.

Delete misplaced Node:5638689921 - see the correct place of the hospital

Obviously the link above now reports that the node was deleted by shuita (me). So, humans might still be useful for auditing more complex things...

Creating tables of nodes, nodes_tags, ways, ways_tags and storing all into a SQLite DB

The XML data shall now be committed into a SQLite database, enabling convenient and quick queries on the map data.

In theory, this would be a good opportunity to perform data cleaning within the transformation from XML to SQL. However, as explained in the previous chapter, this huge dataset was checked and is of excellent quality. Thus, compiling (simple) automatic data correction functions for street names or postal codes is not very interesting and adds no value to the OSM project.

In order to transform the XML to SQL, the instructions and code snippets from Udacity advised performing the transformation in two steps:

  1. From XML to tables in CSV format
  2. Commit the CSV tables to SQLite

But why do something in two steps that can be done in one step, easier and faster? I think it is time to come up with an update of the old process. As a big advantage, this approach also avoids character-encoding problems with other languages.

The class UnicodeDictWriter is replaced by Sqlite3TableWriter. Its implementation and usage simplify the original function process_map as shown below. The full code is in convert_osm2sql.py.

class Sqlite3dbFile(object):
    """Base class just to open an SQLite3-DB file"""
    def __init__(self, fname="magdeburg.sqlite3"):
        self.connlite3 = sqlite3.connect(fname)
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.connlite3.close()

class Sqlite3TableWriter(object):
    """Same thing as UnicodeDictWriter but using SQLite3 db"""
    def __init__(self, sqlite3db, table_name, table_columns):
        self.db = sqlite3db.connlite3
        self.tn = table_name
        self.tc = table_columns
        self.cache_pdf = pd.DataFrame(columns=table_columns)
        self.cache_list = [] # Faster than pd.DataFrame.append!
    def writerow(self, row):
        self.cache_list.append(row) # Pre-allocation would be faster, but this works...
    def writerows(self, rows):
        for row in rows: # TODO: store rows in dataframe
            self.writerow(row)
    def commit_db(self): # Some caching might be reasonable...
        self.cache_pdf = pd.DataFrame(self.cache_list, columns=self.tc)
        self.cache_pdf.to_sql(self.tn, self.db, if_exists="append")



# ================================================== #
#               Main Function                        #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write it to the SQLite database"""
    with Sqlite3dbFile("magdeburg_harz.sqlite3") as dbfile:        
        nodes_writer = Sqlite3TableWriter(dbfile, "nodes", NODE_FIELDS)
        node_tags_writer = Sqlite3TableWriter(dbfile, "nodes_tags", NODE_TAGS_FIELDS)
        ways_writer = Sqlite3TableWriter(dbfile, "ways", WAY_FIELDS)
        way_nodes_writer = Sqlite3TableWriter(dbfile, "ways_nodes", WAY_NODES_FIELDS)
        way_tags_writer = Sqlite3TableWriter(dbfile, "ways_tags", WAY_TAGS_FIELDS)

        validator = cerberus.Validator()
        print("Processing the Map...")

        counter = 0
        counter_disp = 0
        for element in get_element(file_in, tags=('node', 'way')):
            counter = counter + 1
            if counter > 100000: # Alive-ticker...
                counter = 0
                counter_disp = counter_disp + 1
                print("Parsing elements (in 100k):{0}".format(counter_disp))
            el = shape_element(element)

            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

        print("Done!")
        print("Committing to SQLite-DB...")
        nodes_writer.commit_db()
        print("Committing to SQLite-DB...")
        node_tags_writer.commit_db()
        print("Committing to SQLite-DB...")
        ways_writer.commit_db()
        print("Committing to SQLite-DB...")
        way_nodes_writer.commit_db()
        print("Committing to SQLite-DB...")
        way_tags_writer.commit_db()

This creates the SQLite database and produces the following output:

In [1]:
import convert_osm2sql
convert_osm2sql.run_process_map()
Processing the Map...
Parsing elements (in 100k):1
Parsing elements (in 100k):2
Parsing elements (in 100k):3
Parsing elements (in 100k):4
Parsing elements (in 100k):5
Parsing elements (in 100k):6
Parsing elements (in 100k):7
Parsing elements (in 100k):8
Parsing elements (in 100k):9
Parsing elements (in 100k):10
Parsing elements (in 100k):11
Parsing elements (in 100k):12
Parsing elements (in 100k):13
Parsing elements (in 100k):14
Parsing elements (in 100k):15
Parsing elements (in 100k):16
Parsing elements (in 100k):17
Parsing elements (in 100k):18
Parsing elements (in 100k):19
Parsing elements (in 100k):20
Parsing elements (in 100k):21
Parsing elements (in 100k):22
Parsing elements (in 100k):23
Parsing elements (in 100k):24
Parsing elements (in 100k):25
Parsing elements (in 100k):26
Parsing elements (in 100k):27
Parsing elements (in 100k):28
Parsing elements (in 100k):29
Parsing elements (in 100k):30
Parsing elements (in 100k):31
Parsing elements (in 100k):32
Parsing elements (in 100k):33
Parsing elements (in 100k):34
Parsing elements (in 100k):35
Parsing elements (in 100k):36
Parsing elements (in 100k):37
Parsing elements (in 100k):38
Parsing elements (in 100k):39
Parsing elements (in 100k):40
Done!
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...

Database Overview

To provide an illustrative overview of the data and table architecture of OSM data stored in an SQL database, some sample data is shown in the following:

In [1]:
import pandas as pd
import sqlite3
from IPython.display import display, HTML, Image
connlite3 = sqlite3.connect("magdeburg_harz.sqlite3")
c = connlite3.cursor()

Nodes

In [2]:
querystr = "SELECT DISTINCT nodes.* FROM nodes INNER JOIN nodes_tags ON nodes_tags.id = nodes.id ORDER BY id LIMIT 3"
c.execute(querystr)
rows = c.fetchall()
columns = []
for column_pos in range(len(c.description)):
    columns.append(c.description[column_pos][0])
pdf = pd.DataFrame(rows, columns=columns)
In [3]:
HTML(pdf.to_html())
Out[3]:
index id lat lon user uid version changeset timestamp
0 457834 1000109025 51.8766181 10.6933798 robgeb 336460 1 6435761 2010-11-22T22:22:48Z
1 458191 1001338921 51.8036745 11.6682431 FvGordon 161619 2 16950138 2013-07-14T14:49:06Z
2 458192 1001338932 51.8030183 11.6659263 robgeb 336460 1 6442484 2010-11-23T20:53:54Z

Nodes_tags

In [4]:
#querystr = "SELECT * FROM nodes_tags ORDER BY id LIMIT 3"
querystr = "SELECT nodes_tags.* FROM nodes INNER JOIN nodes_tags ON nodes_tags.id = nodes.id ORDER BY id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())
Out[4]:
index id key value type
0 68601 1000109025 barrier gate
1 68602 1001338921 amenity recycling
2 68603 1001338932 amenity bank
3 68604 1001338932 name Sparkasse
4 68605 1001338947 name Cölbigk
5 68606 1001338947 place hamlet
6 68607 1001453122 amenity parking
7 68608 1001453122 name Waldparkplatz
8 68609 1001453138 landuse quarry
9 68610 1001453138 name historischer Tagebau

Ways_nodes

In [5]:
#querystr = "SELECT * FROM ways LIMIT 3"
querystr = "SELECT ways_nodes.* FROM ways_nodes INNER JOIN nodes ON ways_nodes.node_id = nodes.id ORDER BY ways_nodes.node_id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())
Out[5]:
index id node_id position
0 675116 86165181 1000108485 8
1 4707622 541563439 1000108539 0
2 4868839 583362995 1000108539 3
3 4868842 583362995 1000108546 6
4 53880 23768221 1000108549 0
5 4868733 583362975 1000108549 1
6 4868818 583362986 1000108549 3
7 39664 22778241 1000108643 3
8 675143 86165187 1000108684 0
9 675181 86165194 1000108684 6

Ways

In [6]:
querystr = "SELECT ways.* FROM ways INNER JOIN ways_nodes ON ways.id = ways_nodes.node_id ORDER BY ways_nodes.node_id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())
Out[6]:
index id user uid version changeset timestamp
0 98736 114394341 st1974 44793 1 8214864 2011-05-22T11:07:28Z
1 98736 114394341 st1974 44793 1 8214864 2011-05-22T11:07:28Z
2 98736 114394341 st1974 44793 1 8214864 2011-05-22T11:07:28Z
3 124977 146469110 poschon 1719211 5 37007616 2016-02-04T20:41:45Z
4 124977 146469110 poschon 1719211 5 37007616 2016-02-04T20:41:45Z
5 124977 146469110 poschon 1719211 5 37007616 2016-02-04T20:41:45Z
6 124981 146469115 poschon 1719211 6 37007616 2016-02-04T20:38:14Z
7 184536 190646979 richtkreiser 390227 3 49112180 2017-05-30T18:56:33Z
8 184536 190646979 richtkreiser 390227 3 49112180 2017-05-30T18:56:33Z
9 205570 203868355 HWST 697239 1 14882451 2013-02-02T12:10:36Z

Ways_tags

In [7]:
querystr = "SELECT ways_tags.* FROM ways_tags INNER JOIN ways ON ways_tags.id = ways.id ORDER BY ways.id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())
Out[7]:
index id key value type
0 310515 100033647 building yes
1 310516 100033648 building yes
2 310517 100033649 building yes
3 310518 100033650 building yes
4 310519 100033651 building yes
5 310520 100033653 building yes
6 310521 100033654 building yes
7 310522 100033655 building yes
8 310523 100033656 building yes
9 310524 100033657 building yes

Some numbers

These are some basic statistics on the map data of the area:

In [22]:
querystr = "SELECT count() FROM ways"
num_ways = pd.read_sql_query(querystr, connlite3).iloc[0,0]
querystr = "SELECT count() FROM ways_tags"
num_ways_tags = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of ways(*1000) / Average number of tags per way: {0} / {1}".format( num_ways/1000.0, num_ways_tags/num_ways ))
querystr = "SELECT count() FROM nodes"
num_nodes = pd.read_sql_query(querystr, connlite3).iloc[0,0]
querystr = "SELECT count() FROM nodes_tags"
num_nodes_tags = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of nodes(*1000) / Average number of tags per node: {0} / {1}".format( num_nodes/1000.0, num_nodes_tags/num_nodes ))
querystr = "SELECT count() FROM (\
                SELECT DISTINCT usr.user FROM (\
                    SELECT DISTINCT user FROM nodes UNION ALL SELECT DISTINCT user FROM ways\
                ) \
                usr)"
print("Number of (distinct) contributors: {0}".format( pd.read_sql_query(querystr, connlite3).iloc[0,0] ))

querystr = "SELECT count(*) FROM (\
                SELECT unc.user, \
                        unc.nodes_count, \
                        uwc.ways_count, \
                        (unc.nodes_count + uwc.ways_count) AS total_count \
                    FROM (\
                        SELECT user, count(*) AS nodes_count \
                        FROM nodes \
                        GROUP BY user\
                    ) unc \
                    INNER JOIN (\
                        SELECT user, count(*) AS ways_count \
                        FROM ways \
                        GROUP BY user\
                    ) uwc \
                    ON unc.user = uwc.user \
                    WHERE unc.nodes_count > 30 AND uwc.ways_count > 10 \
                ORDER BY total_count DESC \
            )\
            "
num_active_distinct_users = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of active contributors: {0} (more than 30 nodes and 10 ways)".format( pd.read_sql_query(querystr, connlite3).iloc[0,0] ))
Number of ways(*1000) / Average number of tags per way: 554.156 / 2.5943867791740955
Number of nodes(*1000) / Average number of tags per node: 3537.404 / 0.11618378901589979
Number of (distinct) contributors: 2952
Number of active contributors: 757 (more than 30 nodes and 10 ways)
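The contributor count can be sanity-checked on toy tables: the inner UNION ALL of two DISTINCT selects still needs the outer DISTINCT, because a user may appear in both nodes and ways. The data below is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (user TEXT)")
conn.execute("CREATE TABLE ways (user TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?)", [("a",), ("a",), ("b",)])
conn.executemany("INSERT INTO ways VALUES (?)", [("b",), ("c",)])
# Same nested-DISTINCT pattern as the contributor query above.
n = conn.execute(
    "SELECT count() FROM ("
    " SELECT DISTINCT usr.user FROM ("
    "  SELECT DISTINCT user FROM nodes UNION ALL SELECT DISTINCT user FROM ways"
    " ) usr)").fetchone()[0]
print(n)  # 3  ('a', 'b', 'c'; 'b' counted once)
conn.close()
```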

Top contributors by number

not considering quality...

In [9]:
querystr = "SELECT unc.user, \
                    unc.nodes_count, \
                    uwc.ways_count, \
                    (unc.nodes_count + uwc.ways_count) AS total_count \
                FROM (\
                    SELECT user, count(*) AS nodes_count \
                    FROM nodes \
                    GROUP BY user\
                ) unc \
                INNER JOIN (\
                    SELECT user, count(*) AS ways_count \
                    FROM ways \
                    GROUP BY user\
                ) uwc \
                ON unc.user = uwc.user \
            ORDER BY total_count DESC \
            LIMIT 20"
HTML( pd.read_sql_query(querystr, connlite3).to_html() )
Out[9]:
user nodes_count ways_count total_count
0 matchman 405360 33796 439156
1 HWST 291936 60938 352874
2 easyX 235799 32686 268485
3 spaceman02 184791 12239 197030
4 richtkreiser 154929 26324 181253
5 HPHenschel 117051 17881 134932
6 hadhuey 103906 22011 125917
7 Arnaldur 109920 15297 125217
8 MichaOSM 87878 15760 103638
9 Pomfuttge 74229 9613 83842
10 Natura 72279 10525 82804
11 rbuch703 41116 10653 51769
12 tastenmensch 43504 7860 51364
13 nonny 45722 4170 49892
14 rexunil 43026 6363 49389
15 Andre68 43067 5521 48588
16 st1974 37632 10020 47652
17 bonzulu 39662 4652 44314
18 ClickKlack 38651 5541 44192
19 Kurt Krampmeier 37165 5174 42339

Map usage - POIs for cyclists

As a passionate cyclist I am always interested in bicycle related places in my area. Let's start with a query of nodes that have been somehow tagged with "bicycle", excluding cycleways, traffic signs and parking.

In [23]:
# Which POIs (grouped by key) exist?
# First query all data from ids that have something to do with bicycles --> with a self join
querystr = "SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                INNER JOIN (\
                    SELECT DISTINCT id \
                    FROM nodes_tags \
                    WHERE (\
                        key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                        AND NOT key='bicycle'\
                        ) OR (\
                        key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                        )\
                ) id_sel ON nt.id = id_sel.id"

HTML( pd.read_sql_query(querystr, connlite3).head(20).to_html() )
Out[23]:
id type key value
0 246162984 addr housenumber 2a
1 246162984 addr street Steingrube
2 246162984 amenity bicycle_rental
3 246162984 name Radwerk
4 246162984 phone +4939436268444
5 246162984 sells Fahrräder, Reparaturen
6 246162984 shop bicycle
7 246162984 website http://radwerk-wernigerode.de/
8 262694771 name Wust
9 262694771 shop bicycle
10 268145211 name Zweiradmarkt Rudloff
11 268145211 shop bicycle
12 281575115 addr city Schönebeck
13 281575115 addr housenumber 151
14 281575115 addr postcode 39218
15 281575115 addr street Magdeburger Straße
16 281575115 name Firma Fredrich
17 281575115 operator K. H. Fredrich
18 281575115 shop bicycle
19 290174287 addr city Quedlinburg

This form of representation is not well suited for humans. A pivoted representation is more readable and preferred. It is straightforward to do the job with a pandas DataFrame. Of course, this is also possible with pure SQL:
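The pivot relies on conditional aggregation: one MAX(CASE WHEN ...) column per key, grouped by id. A toy demonstration of the same pattern (table name and rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (id INTEGER, key TEXT, value TEXT)")
conn.executemany("INSERT INTO tags VALUES (?,?,?)",
                 [(1, "name", "Radwerk"), (1, "city", "Wernigerode"),
                  (2, "name", "Wust")])
# Each CASE WHEN picks out one key; MAX collapses the group to a single row.
rows = conn.execute(
    "SELECT id,"
    " MAX(CASE WHEN key='name' THEN value END) AS name,"
    " MAX(CASE WHEN key='city' THEN value END) AS city"
    " FROM tags GROUP BY id ORDER BY id").fetchall()
print(rows)  # [(1, 'Radwerk', 'Wernigerode'), (2, 'Wust', None)]
conn.close()
```

Missing keys come out as None, exactly as in the result table below.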

In [24]:
querystr = "SELECT id\
                , MAX(CASE WHEN key='name' THEN value END) name\
                , MAX(CASE WHEN key='street' THEN value END) strasze\
                , MAX(CASE WHEN key='housenumber' THEN value END) nr\
                , MAX(CASE WHEN key='city' THEN value END) city\
            FROM (\
                    SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                        INNER JOIN (\
                            SELECT DISTINCT id \
                            FROM nodes_tags \
                            WHERE (\
                                key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                                AND NOT key='bicycle'\
                                ) OR (\
                                key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                                )\
                        ) id_sel ON nt.id = id_sel.id\
                )\
            GROUP BY id"

HTML( pd.read_sql_query(querystr, connlite3).head(10).to_html() )
Out[24]:
id name strasze nr city
0 1116530908 Vetter None None None
1 1140989444 Radschmiede Sudenburger Fahrradladen None None None
2 1158066486 Fahrrad Peter None None None
3 1194501064 Fahrradverleih und Reparatur am Westfriedhof None None None
4 1279629575 Rad der Stadt Magdeburg Schönebecker Straße 103 Magdeburg
5 1368379633 MattBIKE RAD-CENTER None None None
6 1396124480 Harzbikehaus Wasserstraße 22 Oberharz am Brocken
7 1491890626 Schliesser Bike None None None
8 1561895254 Fahrradcenter Grohmann Karlstraße 29 Bernburg
9 1585913294 None None None None

What a bad surprise! Many tags, particularly the addresses, are missing! In order to find the places anyway, we can join in the coordinates from the nodes table. That works for most people in times of GPS & Co. However, there is certainly room for improvement in the map data...

In [12]:
querystr = "SELECT n_tags.*, n.lat, n.lon FROM (\
                SELECT id\
                    , MAX(CASE WHEN key='name' THEN value END) name\
                    , MAX(CASE WHEN key='street' THEN value END) strasze\
                    , MAX(CASE WHEN key='housenumber' THEN value END) nr\
                    , MAX(CASE WHEN key='city' THEN value END) city\
                FROM (\
                        SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                            INNER JOIN (\
                                SELECT DISTINCT id \
                                FROM nodes_tags \
                                WHERE (\
                                    key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                                    AND NOT key='bicycle'\
                                    ) OR (\
                                    key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                                    )\
                            ) id_sel ON nt.id = id_sel.id\
                    )\
                GROUP BY id\
            ) n_tags\
            INNER JOIN (\
                SELECT id, lat, lon FROM nodes\
            ) n\
            ON n_tags.id = n.id \
            ORDER by n_tags.city, n_tags.name, n.lon"

HTML( pd.read_sql_query(querystr, connlite3).to_html() )
Out[12]:
id name strasze nr city lat lon
0 4273174600 None None None None 51.8335062 10.7833424
1 1585913294 None None None None 52.2279611 11.0113925
2 5145960134 None None None None 51.7458012 11.0252872
3 1943938437 None None None None 51.7247609 11.2488583
4 1878045831 None None None None 52.1204151 11.6280791
5 1880719764 None None None None 52.1069861 11.6440794
6 685773764 None None None None 51.7677979 11.9326290
7 4833537022 1,2,3 Radverleih None None None 51.8381848 10.7868399
8 1714120909 Alles rund ums Rad – Fahrräder und Service Gom... Martin-Schwantes-Straße 34 None 52.0730608 11.8243944
9 3825377658 Bike Schmiede None None None 51.6581950 11.3534160
10 4089924037 Brockenbike Siegl & Siegl GbR None None None 51.9025860 10.6663869
11 631163565 DB Call a Bike Magdeburg None None None 52.1306458 11.6253357
12 4375716262 Der Fahrrad Doktor None None None 52.2482075 10.8206914
13 843230377 Fahrrad Baron None None None 51.8377596 10.7892531
14 5303435131 Fahrrad Karl Beyer None None None 51.6458696 11.5127795
15 3582592158 Fahrrad Magdeburg Büchner Hasselbachstraße 8 None 52.1292532 11.6286108
16 1158066486 Fahrrad Peter None None None 52.1062870 11.6372482
17 290559960 Fahrrad Scherf None None None 52.0060906 11.7255161
18 635228117 Fahrrad Zöffzig None None None 52.1295807 11.6047938
19 436368538 Fahrrad- und Modelleisenbahn Ingo Klettke None None None 52.0187567 11.7204955
20 1796420587 Fahrrad-Fachhandel J. Drese Braugartenstraße 9 None 51.6139684 11.4944537
21 3700699524 Fahrradmanufaktur Albatros None None None 51.7526045 11.9723890
22 4876558016 Fahrradverleih "Neustadt/Harz" None None None 51.5594056 10.8334287
23 1194501064 Fahrradverleih und Reparatur am Westfriedhof None None None 52.1297757 11.5859291
24 1368379633 MattBIKE RAD-CENTER None None None 52.0874701 11.6595682
25 2808179531 RadBär None None None 51.7917986 11.7445941
26 858426345 Radmaxx None None None 52.1472053 11.5831995
27 1140989444 Radschmiede Sudenburger Fahrradladen None None None 52.1152608 11.6007149
28 4375836937 Radservice Der Radmitte Service-Point None None None 52.1294021 11.6357805
29 246162984 Radwerk Steingrube 2a None 51.8344290 10.7893295
30 1491890626 Schliesser Bike None None None 52.1409088 10.9717172
31 2451387337 TOUREN HARZ - BIKES & PARTS Marienhöferstraße 2 None 51.8642890 10.6793092
32 428075092 Täves Radladen & Manufaktur None None None 52.1234531 11.6129286
33 1116530908 Vetter None None None 52.2277682 11.0275548
34 1597532744 Waldgasthaus Plessenburg None None None 51.8320143 10.6679773
35 1914255431 Weltrad None None None 52.0210239 11.7447310
36 262694771 Wust None None None 51.9708135 10.7137120
37 5221821421 Zweirad Weld None None None 52.1784848 11.4822542
38 3607760369 Zweirad-Franke None None None 51.5938103 11.4542209
39 3573239495 Zweirad-Groth None None None 51.7481234 11.9764348
40 268145211 Zweiradmarkt Rudloff None None None 52.0020272 11.7176194
41 988984699 eldoRADo None None None 52.1597659 11.6393392
42 357497780 fahrradohlsen Karlstraße 16 None 51.7933800 11.7459043
43 1808686389 nextbike None None None 52.1330228 11.6104660
44 1810394587 nextbike None None None 52.1309134 11.6377057
45 1812228573 Zweirad Mente GmbH Holzmarkt 4 Aschersleben 51.7553729 11.4574536
46 1561895254 Fahrradcenter Grohmann Karlstraße 29 Bernburg 51.7918947 11.7475311
47 795642288 Rad & Tat Karl-Zerbst-Straße 22 Blankenburg (Harz) 51.8042893 10.9477148
48 1732631501 Heiko Persch Schartauer Straße 24 Burg 52.2714767 11.8512896
49 3664004601 Zweirad Haase Martin-Luther-Straße 25 Burg 52.2714214 11.8399208
50 2384178784 Zweirad Myrrhe Hakeborner Weg 3 Egeln 51.9383853 11.4183275
51 1908447413 Niemann & Blankmeister Marktstraße 4 Königslutter am Elm 52.2506470 10.8182412
52 3698647427 Sattelfest Am Beek None Königslutter am Elm 52.2692675 10.7844148
53 1783650216 Fahrrad Koch Simonstraße 2 Magdeburg 52.1174399 11.6602042
54 629464410 Feine Räder Liebknechtstraße 39 Magdeburg 52.1234650 11.6102744
55 629553229 Little John Bikes Kastanienstraße 92 Magdeburg 52.1613835 11.6355207
56 323474099 Little John Bikes Alter Markt 13-14 Magdeburg 52.1322141 11.6381003
57 1279629575 Rad der Stadt Magdeburg Schönebecker Straße 103 Magdeburg 52.1064613 11.6408054
58 610564832 RadMitte Goldschmiedebrücke 7, 9, 11 Magdeburg 52.1292617 11.6374959
59 320881325 Schaper Halberstädter Straße 89 Magdeburg 52.1118188 11.6051428
60 401234988 Zweirad Schulz Breiter Weg 38 Magdeburg 52.1364881 11.6394686
61 1396124480 Harzbikehaus Wasserstraße 22 Oberharz am Brocken 51.7693895 10.8026747
62 290174287 2Rad Pavillon Adelheidstraße 1b Quedlinburg 51.7875045 11.1472362
63 300447446 Mente Pölkenstraße 8 Quedlinburg 51.7883428 11.1477341
64 281575115 Firma Fredrich Magdeburger Straße 151 Schönebeck 52.0106776 11.7143204
65 4128056889 Fahrzeugmuseum Staßfurt Berlepschstraße 14 Staßfurt 51.8592631 11.6015367
66 322079653 Bremer Fahrradservice Zur Grube 2 Wolmirstedt 52.2517271 11.6329177
67 629322254 Fahrrad Pfalz Fabrikstraße 2 Wolmirstedt 52.2470992 11.6210667

68 places... but 7 are not even tagged (yet) with a name.
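The missing-tags pattern above comes straight out of the `MAX(CASE WHEN ...)` pivot in the query: key/value rows are folded into one row per node id, and any key a node lacks simply becomes `NULL`. A minimal standalone sketch (in-memory database, illustrative tag data, not the project's actual tables):

```python
import sqlite3

# Tiny in-memory copy of the nodes_tags key/value layout used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO nodes_tags VALUES (?, ?, ?)",
    [(1, "name", "Feine Räder"), (1, "street", "Liebknechtstraße"),
     (2, "name", "Radmaxx")],  # node 2 has no street tag
)

# The same pivot as in the query: one row per id, one column per key.
rows = conn.execute(
    "SELECT id,"
    " MAX(CASE WHEN key='name' THEN value END) AS name,"
    " MAX(CASE WHEN key='street' THEN value END) AS street"
    " FROM nodes_tags GROUP BY id ORDER BY id"
).fetchall()
print(rows)  # node 2's missing street tag surfaces as None
```

Untagged keys show up as `None` in the pivoted result, which is exactly what we see for the 7 unnamed places.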

Conclusions and more ideas

Semi-automatic map data auditing and corrections have been performed in this project. However, some more advanced checks would be desirable, since the ones shown here are just simple plausibility checks for common human mistakes; they do not ensure that the data is actually correct. That would require additional reference datasets or more advanced heuristics (e.g. using lat/lon to verify the postal codes).
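One such heuristic could be sketched as follows. The bounding boxes below are rough, hand-picked illustrations, not authoritative postal-code geometries; a real check would look coordinates up against an official reference dataset:

```python
# Hypothetical reference data: approximate bounding boxes
# (lat_min, lat_max, lon_min, lon_max) for a few postal code areas.
POSTCODE_BBOXES = {
    "39104": (52.09, 52.14, 11.61, 11.66),  # central Magdeburg (rough)
    "06484": (51.76, 51.81, 11.12, 11.18),  # Quedlinburg (rough)
}

def postcode_plausible(postcode, lat, lon):
    """Return True/False if the coordinate lies inside the reference box
    for the postal code, or None when no reference box is available."""
    box = POSTCODE_BBOXES.get(postcode)
    if box is None:
        return None
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

print(postcode_plausible("39104", 52.12, 11.63))  # inside the box -> True
print(postcode_plausible("39104", 51.60, 10.70))  # far away -> False
```

Such a check would flag nodes whose `addr:postcode` tag contradicts their position, instead of only validating the postcode's format.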

Some corrections were contributed to the OpenStreetMap project. These are the corresponding changesets:

Changesets by shuita
- Corrected street names (spelling errors) #59862237
- Corrected street names (spelling errors) #59862205
- Corrected street names (spelling errors) #59861978
- Street name was a place name #59861926
- Street name was a place name #59861810
- Corrected street names (spelling errors) #59861609

The audited data has been transformed into SQLite tables. The transformation process was, in my view, simplified and improved by writing the dictionaries directly to SQL tables rather than going through intermediate CSV files. This approach also avoids character-encoding problems with foreign-language text.
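The direct dict-to-SQLite step can be sketched like this (in-memory database and illustrative column set; the project's real schema has more columns):

```python
import sqlite3

# Illustrative nodes table with a minimal column set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")

# Dictionaries as produced by the XML parsing step.
nodes = [
    {"id": 1, "lat": 52.1292617, "lon": 11.6374959},
    {"id": 2, "lat": 51.7883428, "lon": 11.1477341},
]

# executemany with named placeholders consumes the dictionaries directly,
# so no CSV round-trip (and no manual encoding handling) is needed.
conn.executemany(
    "INSERT INTO nodes (id, lat, lon) VALUES (:id, :lat, :lon)", nodes
)
conn.commit()

n_rows = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print(n_rows)  # -> 2
```

Because `sqlite3` stores text as UTF-8 natively, strings like "Straße" go in and come out unchanged, which is where the CSV intermediate step tends to cause trouble.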

Some queries were demonstrated, e.g. statistics on map data contributors and points of interest for cycling enthusiasts. Although the overall quality of the OSM map data is very good, many incompletely tagged nodes were found in this example*. So, in general, there is likely some room for improvement regarding the completeness of tags.

In [26]:
import sys
print(sys.version)
3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
In [27]:
connlite3.close()

*I am probably going to fix this the next time I am back home with my bicycle B~)