Project: OpenStreetMap: Audit & SQL-Playground (Germany)¶

Table of Contents¶

Introduction
Audit street names
Audit postal codes
Creating tables of nodes, nodes_tags, ways, ways_tags and storing all into a SQLite DB
Database Overview
Map usage
Conclusions and more ideas

Introduction¶

This is a learning project to practise XML and SQL data wrangling using actual OpenStreetMap data. Some map data such as street names, postal codes or phone numbers shall be audited automatically and possibly corrected. Furthermore, all corrections might be contributed to the gorgeous OpenStreetMap-Project. After storing the map data to a SQLite database, some SQL requests are demonstrated which cycling enthusiasts might find interesting.

The OpenStreetmap XML data was downloaded using Overpass-API https://overpass-api.de/api/map?bbox=10.6622,51.5293,12.0465,52.2715 for a portion of Saxony-Anhalt.

The city of Magdeburg and Harz mountains, Saxony-Anhalt

This is the area where I was born. The size of the dataset is 799 MB (838,025,562 bytes).

"""
Your task in this exercise has two steps:

- audit the OSMFILE and change the variable 'mapping' to reflect the changes needed to fix 
    the unexpected street types to the appropriate ones in the expected list.
    You have to add mappings only for the actual problems you find in this OSMFILE,
    not a generalized solution, since that may and will depend on the particular area you are auditing.
- write the update_name function, to actually fix the street name.
    The function takes a string with street name as an argument and should return the fixed name
    We have provided a simple test so that you see what exactly is expected
"""
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
import pandas as pd

#OSMFILE = "example4.osm"
#OSMFILE_audited = "example4_audited.osm"
OSMFILE = "magdeburg.osm" # https://overpass-api.de/api/map?bbox=11.5257,52.0597,11.8436,52.2518
OSMFILE = "magdeburg_harz.osm" #https://overpass-api.de/api/map?bbox=10.6622,51.5293,12.0465,52.2715
OSMFILE_audited = "magdeburg_audited.osm"


street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
has_numbers_re = re.compile(r'[0-9]+')

            
def is_street_name(elem):
    return (elem.attrib['k'] == "addr:street")

def is_postcode(elem):
    return (elem.attrib['k'] == "addr:postcode")


def audit(osmfile):
    osm_file = open(osmfile, "r", encoding='utf-8')
    street_types = defaultdict(set)
    postcode_types = defaultdict(set)
    '''
    Usage of defaultdict(set)
    >>> s = [('red', 1), ('blue', 2), ('red', 3), ('blue', 4), ('red', 1), ('blue', 4)]
    >>> d = defaultdict(set)
    >>> for k, v in s:
    ...     d[k].add(v)
    ...
    >>> d.items()
    [('blue', set([2, 4])), ('red', set([1, 3]))]
    '''
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type_func(street_types, tag.attrib['v'])
                    #if tag.attrib['v'] == 'Wiesenstr':
                    #    print(tag.attrib)
                    #    print("id:{0}; lat={1};lon={2}".format(elem.attrib['id'], elem.attrib['lat'], elem.attrib['lon']))
                if is_postcode(tag):
                    audit_postcode_type_func(postcode_types, tag.attrib['v'], elem=elem)
    osm_file.close()
    return {'street_types': street_types, 'postcode_types': postcode_types}

def update_name(name, mapping):
    street_types = defaultdict(set)
    #street_types = {}
    audit_street_type(street_types, name)
    print("update_name")
    pprint.pprint(dict(street_types) )
    bad_str = list(dict(street_types).keys())[0]
    name = name.replace(bad_str, mapping[bad_str])
    # YOUR CODE HERE
    return name

# Supporting function to get element from osm file
# Takes as input osm file and tuple of nodes and yield nodes of types from tuple. 
def get_element(osm_file, tags=('node', 'way', 'relation')):
    context = iter(ET.iterparse(osm_file, events=('start', 'end')))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag in tags:
            yield elem
            root.clear()

# This function replace abbreviations in osm file
# Takes 2 osm file names as input and replace all abbreviations of streets in 
# first file. Saves new file under new name (new_file)
# Inspired from: https://github.com/neviadomski/OpenStreetMap-Python-SQL/blob/master/OsmData.py
def improve_streetnames(old_file, new_file):
    with open(new_file, 'wb') as output:
        output.write('<?xml version="1.0" encoding="UTF-8"?>\n'.encode(encoding='utf-8', errors='strict') )
        output.write('<osm>\n  '.encode(encoding='utf-8', errors='strict') )
        for i, element in enumerate(get_element(old_file)):
            for tag in element.iter("tag"):
                if is_street_name(tag):
                    tag.set('v',update_name(tag.attrib['v'], mapping))
            output.write(ET.tostring(element, encoding='utf-8'))
        output.write('</osm>'.encode(encoding='utf-8', errors='strict'))

Audit street names¶

Auditing german street names can certainly be a matter of science, because many exceptions and regional specialities are possible. Most commonly, street names may end with words (or sub-strings) like "Straße", "Weg", "Chaussee", "Ring", "Platz", "Tor", "Steg", "Stieg", "Garten", "Park", "Brücke". A few less common names where added while auditing the dataset iteratively and are listed in the code below.

expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]
expected = ["Straße", "Weg", "Chaussee", "Ring", "Platz", "Tor", "Steg", "Stieg", "Garten", "Park", "Brücke",
           "Aue", "Teich", "ldchen", "Breite", "Damm", "Allee", "Hof", "Berg", "Grund", "Weiler", "Weiher", "Gang",
           "Gasse", "Feld", "Felde", "Markt", "Werder", "Hafen", "Wuhne", "Bogen", "Blick", "Dorf", "Felsen",
           "Winkel", "Wall", "Busch", "Denkmal", "Dom", "Einheit", "Graben", "Bruch", "Kanal", "Weide", "Pfad",
           "Leben", "Ehle", "Elbe", "Fahrt", "Fort", "Steig", "Rain", "Ufer", "Wache", "Sprung", "Grube", "Tal",
           "Ecken", "Friedens", "Brunnen", "Anger", "Schlag", "Scheid", "Krug", "Höhen", "Hügel", "Keller", "Schule",
           "Frieden", "Kuhle", "Börse", "Gesang", "Walde", "Ruh", "Schänke", "See", "Mühle", "Werk", "Revier", "Berge",
           "Schleife", "Plan", "Wiese", "Beck", "Fähre", "Wacken", "Gärten", "Gärtnerei", "Bad", "Horn", "Theater",
           "Bahn", "Eck", "Haus", "Horst", "Lake", "Wellen", "Pfuhl", "Hufe", "Kamp", "Förder", "Form", "Wiesen",
           "Kaserne", "Leuchte", "Kopf", "Pappel", "Deich", "Kopf", "lerei", "Siedlung", "Stadion", "Turm", "Spitze",
           "Rondell", "Neuber", "Born", "Thie", "Marsch", "Promenade", "Gelände", "Insel", "Wände", "Schleuse",
           "Domäne", "Sülze", "Wacke", "Scholle", "Zitadelle", "Stein", "Heide", "Busche", "Bode", "Burg", "Schloss",
           "Schloß", "Wabe", "Wege", "Kreuz", "Reihe", "Treppe", "Höhe", "Kruge", "Dorfe", "Eiche", "Dorn", "Passage",
           "Mauer", "Glück", "Jugend", "Lage", "Kirche", "Hain", "Zaun", "Achte", "Klippe", "Loch", "Hang", "Schacht",
           "Holz", "Teichen", "Stall", "Halde", "Fabrik", "Wand", "Linden", "Renne", "Halden"]


# UPDATE THIS VARIABLE
mapping = { "St": "Street",
            "St.": "Street",
            "Rd.": "Road",
            "Ave": "Avenue"
            }


def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)

def audit_street_type_DE(street_types, street_name):
    expected_lower = [el.lower() for el in expected]
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group() # Returns the last element
        if not street_type.lower().endswith(tuple(expected_lower)) or has_numbers_re.search(street_name): #  not in expected:
            street_types[street_type].add(street_name)
            #print(street_name)

# Set the audit function for the actual language+area
audit_street_type_func = audit_street_type_DE

test = defaultdict(set)
test_names = ['Teststraße', 'T1estweg', 'Meine Wiese', 'A Str.' , 'Falsche Str.', 'Laaanger Weeg']
for strasse in test_names:
    audit_street_type_DE(test, strasse)

print("Test data is:")
print(test_names)
print("Candidates for auditing:")
print(test)

Test data is:
['Teststraße', 'T1estweg', 'Meine Wiese', 'A Str.', 'Falsche Str.', 'Laaanger Weeg']
Candidates for auditing:
defaultdict(<class 'set'>, {'T1estweg': {'T1estweg'}, 'Str.': {'A Str.', 'Falsche Str.'}, 'Weeg': {'Laaanger Weeg'}})

Obviously, the OSM street name data is of very good quality and is not featuring easy mistakes such abreviations. However, there are a few findings that have been corrected in the OSM database. Apparently an automatic correction of the findings is not applicable since every finding was (almost) unique.

A very common error in street names might be that the string ends with a number, which should be the house number. However, in some cases this might be correct as illustrated for "Am Hammelberg Weg 1&2". Not a problem

There are no further similar findings in the whole dataset.

Examples of corrections of street names contributed to the OSM-Project¶

However, some errors where identified and corrected in the OSM dataset using the editor iD.

Changeset #59861609: Typo in "Straße":¶

Typo in "Straße"

Changeset #59861810: Street name is the name of the place.¶

Typo in "Street name references the name of the village"

def mainfunc():
    pprint.pprint("Need to improve these names:")
    st_types = audit(OSMFILE)
    pprint.pprint(dict(st_types))
    '''
    pprint.pprint("Would update these names")
    for st_type, ways in st_types.items(): #.iteritems():
        for name in ways:
            better_name = update_name(name, mapping)
            print(name, "=>", better_name)
    #improve_streetnames(OSMFILE, OSMFILE_audited)
    '''

if __name__ == '__main__':
    print("Not shown in final doc")
    #mainfunc()

Not shown in final doc

Audit Postal Codes¶

This is a map of the postal codes of the area. It is used to create the regular expression for auditing of the postal codes.

Postal codes in the audited area

So, the postal code shall eiter start with a "3" or a "0". If it starts with a "3", the following character can be "8" or "9". In case of a "0", the next character must be a "4" or a "6". Three {3} more numerical characters are required to have a valid postal code. A small section of the map has postal codes beginning with "997". The audit_street_type function has been modified for auditing postal codes here. It is tested with the list postalcodes and returns all postal codes that do not match the criteria (in the same way as the street names).

MDPOSTALCODE = re.compile(r'^(3[8,9]{1}[0-9]{3})|(0[4,6]{1}[0-9]{3})|(997[3,5,6]{1}[0-9]{1})')
#(VW (Golf|Polo)|Fiat (Punto|Panda))

def check_postcode(codestr, check_re):
    if check_re.match(codestr):
        return codestr
    else:
        return None #""

def audit_postcode_type_DE(postcode_types, postcode_str, elem=None):
    postcode_type = postcode_str[0:2]
    if not check_postcode(postcode_str, MDPOSTALCODE):
        if elem:
            auditdetails = {
                "addr:postcode":postcode_str,
                'id':elem.attrib['id']#,
                #'lat':elem.attrib['lat'], 'lon':elem.attrib['lon']
                }
            postcode_types[postcode_type].add(str(auditdetails))
        else:
            postcode_types[postcode_type].add(postcode_str)

# Set the audit function for the actual language+area
audit_postcode_type_func = audit_postcode_type_DE

test = defaultdict(set)
postalcodes=['39118','3822','06456','22123','37123','03456','38521','99734']
for postalcode in postalcodes:
    audit_postcode_type_DE(test, postalcode)
print(test)

defaultdict(<class 'set'>, {'38': {'3822'}, '22': {'22123'}, '37': {'37123'}, '03': {'03456'}})

# Works! But TLDR. So not executed in the final document
audit_results = audit(OSMFILE)

audit_results['postcode_types']

defaultdict(set, {'96': {"{'addr:postcode': '96449', 'id': '5638689921'}"}})

Regarding the postal codes, the data quality seems to be very good as well. There is only one finding in the observed area: "96449" is not from this area - it does not even exist. The node can be investigated in the browser using this link: https://www.openstreetmap.org/node/5638689921 (not the same anymore) The node references a hospital (german: e.g. "Klinikum") and the postal code should be "06449" instead.

Node:5638689921 - correct postal code of the hospital

But wait... let's have a closer look at the pictures above. Is this really the same building on the map and the aerial image? Maybe not... The white markers in the image below are pointing to the correct (right) and the wrong (left) position of the hospital. The image is a screenshot is showing the OSM-editor iD in operation used to delete the wrong node. The changeset is #59886196

Delete misplaced Node:5638689921 - see the correct place of the hospital

Obviously the link above is now telling the news that the node was deleted by shuita (me). So, humans might still be useful for auditing more complex things...

Creating tables of nodes, nodes_tags, ways, ways_tags and storing all into a SQLite DB¶

The XML-data shall now be committed into a SQLite database enabling convenient an quick queries on the map data.

In theory, it would be a good opportunity, to perform data cleaning within the transformation process from XML to SQL. However, as explained in the previous chapter, this huge dataset was checked and is of excellent quality. Thus, compiling (simple) automatic data correction functions for street names or postal codes is not very interesting and has obviously no added value to the OSM-Project.

In order to transform the XML to the SQL, the instructions and code snippets from Udacity advised to perform the transformation in two steps:

From XML to tables in CSV format
Commit tables in CSV format to SQLite But why doing something in two steps what can be done in one step easier and faster? I think it is time to come up with an update of the old process. As a big advantage, this approach seems to be adequate to prevent character encoding for other languages.

The class UnicodeDictWriter is replaced by Sqlite3TableWriter. Its implementation and usage beautyfies the original function process_map as shown below. The full code is convert_osm2sql.py.

class Sqlite3dbFile(object):
    """Base class just to open an SQLite3-DB file"""
    def __init__(self, fname="magdeburg.sqlite3"):
        self.connlite3 = sqlite3.connect(fname)
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.connlite3.close()

class Sqlite3TableWriter(object):
    """Same thing as UnicodeDictWriter but using SQLite3 db"""
    def __init__(self, sqlite3db, table_name, table_columns):
        self.db = sqlite3db.connlite3
        self.tn = table_name
        self.tc = table_columns
        self.cache_pdf = pd.DataFrame(columns=table_columns)
        self.cache_list = [] # Faster than pd.DataFrame.append!
    def writerow(self, row):
        self.cache_list.append(row) # Pre-allocation would be faster, but this works...
    def writerows(self, rows):
        for row in rows: # TODO: store rows in dataframe
            self.writerow(row)
    def commit_db(self): # Some caching might be reasonable...
        self.cache_pdf = pd.DataFrame(self.cache_list, columns=self.tc)
        self.cache_pdf.to_sql(self.tn, self.db, if_exists="append")



# ================================================== #
#               Main Function                        #
# ================================================== #
def process_map(file_in, validate):
    """Iteratively process each XML element and write to csv(s)"""
    with Sqlite3dbFile("magdeburg_harz.sqlite3") as dbfile:        
        nodes_writer = Sqlite3TableWriter(dbfile, "nodes", NODE_FIELDS)
        node_tags_writer = Sqlite3TableWriter(dbfile, "nodes_tags", NODE_TAGS_FIELDS)
        ways_writer = Sqlite3TableWriter(dbfile, "ways", WAY_FIELDS)
        way_nodes_writer = Sqlite3TableWriter(dbfile, "ways_nodes", WAY_NODES_FIELDS)
        way_tags_writer = Sqlite3TableWriter(dbfile, "ways_tags", WAY_TAGS_FIELDS)

        validator = cerberus.Validator()
        print("Processing the Map...")

        counter = 0
        counter_disp = 0
        for element in get_element(file_in, tags=('node', 'way')):
            counter = counter + 1
            if counter > 100000: # Alive-ticker...
                counter = 0
                counter_disp = counter_disp + 1
                print("Parsing elements (in 100k):{0}".format(counter_disp))
            el = shape_element(element)

            if el:
                if validate is True:
                    validate_element(el, validator)

                if element.tag == 'node':
                    nodes_writer.writerow(el['node'])
                    node_tags_writer.writerows(el['node_tags'])
                elif element.tag == 'way':
                    ways_writer.writerow(el['way'])
                    way_nodes_writer.writerows(el['way_nodes'])
                    way_tags_writer.writerows(el['way_tags'])

        print("Done!")
        print("Committing to SQLite-DB...")
        nodes_writer.commit_db()
        print("Committing to SQLite-DB...")
        node_tags_writer.commit_db()
        print("Committing to SQLite-DB...")
        ways_writer.commit_db()
        print("Committing to SQLite-DB...")
        way_nodes_writer.commit_db()
        print("Committing to SQLite-DB...")
        way_tags_writer.commit_db()

This creates the SQLite database and produces the following output:

import convert_osm2sql
convert_osm2sql.run_process_map()

Processing the Map...
Parsing elements (in 100k):1
Parsing elements (in 100k):2
Parsing elements (in 100k):3
Parsing elements (in 100k):4
Parsing elements (in 100k):5
Parsing elements (in 100k):6
Parsing elements (in 100k):7
Parsing elements (in 100k):8
Parsing elements (in 100k):9
Parsing elements (in 100k):10
Parsing elements (in 100k):11
Parsing elements (in 100k):12
Parsing elements (in 100k):13
Parsing elements (in 100k):14
Parsing elements (in 100k):15
Parsing elements (in 100k):16
Parsing elements (in 100k):17
Parsing elements (in 100k):18
Parsing elements (in 100k):19
Parsing elements (in 100k):20
Parsing elements (in 100k):21
Parsing elements (in 100k):22
Parsing elements (in 100k):23
Parsing elements (in 100k):24
Parsing elements (in 100k):25
Parsing elements (in 100k):26
Parsing elements (in 100k):27
Parsing elements (in 100k):28
Parsing elements (in 100k):29
Parsing elements (in 100k):30
Parsing elements (in 100k):31
Parsing elements (in 100k):32
Parsing elements (in 100k):33
Parsing elements (in 100k):34
Parsing elements (in 100k):35
Parsing elements (in 100k):36
Parsing elements (in 100k):37
Parsing elements (in 100k):38
Parsing elements (in 100k):39
Parsing elements (in 100k):40
Done!
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...
Committing to SQLite-DB...

Database Overview¶

To provide an illustrative overview of the data and table architecture of OSM data stored in an SQL database, some sample data is shown in the following:

import pandas as pd
import sqlite3
from IPython.display import display, HTML, Image
connlite3 = sqlite3.connect("magdeburg_harz.sqlite3")
c = connlite3.cursor()

Nodes¶

querystr = "SELECT DISTINCT nodes.* FROM nodes INNER JOIN nodes_tags ON nodes_tags.id = nodes.id ORDER BY id LIMIT 3"
c.execute(querystr)
rows = c.fetchall()
columns = []
for column_pos in range(len(c.description)):
    columns.append(c.description[column_pos][0])
pdf = pd.DataFrame(rows, columns=columns)

HTML(pdf.to_html())

Nodes_tags¶

#querystr = "SELECT * FROM nodes_tags ORDER BY id LIMIT 3"
querystr = "SELECT nodes_tags.* FROM nodes INNER JOIN nodes_tags ON nodes_tags.id = nodes.id ORDER BY id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())

Ways_nodes¶

#querystr = "SELECT * FROM ways LIMIT 3"
querystr = "SELECT ways_nodes.* FROM ways_nodes INNER JOIN nodes ON ways_nodes.node_id = nodes.id ORDER BY ways_nodes.node_id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())

Ways¶

querystr = "SELECT ways.* FROM ways INNER JOIN ways_nodes ON ways.id = ways_nodes.node_id ORDER BY ways_nodes.node_id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())

Ways_tags¶

querystr = "SELECT ways_tags.* FROM ways_tags INNER JOIN ways ON ways_tags.id = ways.id ORDER BY ways.id LIMIT 10"
HTML(pd.read_sql_query(querystr, connlite3).to_html())

Some numbers¶

These are some basic statistics on the map data of the area:

querystr = "SELECT count() FROM ways"
num_ways = pd.read_sql_query(querystr, connlite3).iloc[0,0]
querystr = "SELECT count() FROM ways_tags"
num_ways_tags = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of ways(*1000) / Average number of tags per way: {0} / {1}".format( num_ways/1000.0, num_ways_tags/num_ways ))
querystr = "SELECT count() FROM nodes"
num_nodes = pd.read_sql_query(querystr, connlite3).iloc[0,0]
querystr = "SELECT count() FROM nodes_tags"
num_nodes_tags = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of nodes(*1000) / Average number of tags per node: {0} / {1}".format( num_nodes/1000.0, num_nodes_tags/num_nodes ))
querystr = "SELECT count() FROM (\
                SELECT DISTINCT usr.user FROM (\
                    SELECT DISTINCT user FROM nodes UNION ALL SELECT DISTINCT user FROM ways\
                ) \
                usr)"
print("Number of (distinct) contributors: {0}".format( pd.read_sql_query(querystr, connlite3).iloc[0,0] ))

querystr = "SELECT count(*) FROM (\
                SELECT unc.user, \
                        unc.nodes_count, \
                        uwc.ways_count, \
                        (unc.nodes_count + uwc.ways_count) AS total_count \
                    FROM (\
                        SELECT user, count(*) AS nodes_count \
                        FROM nodes \
                        GROUP BY user\
                    ) unc \
                    INNER JOIN (\
                        SELECT user, count(*) AS ways_count \
                        FROM ways \
                        GROUP BY user\
                    ) uwc \
                    ON unc.user = uwc.user \
                    WHERE unc.nodes_count > 30 AND uwc.ways_count > 10 \
                ORDER BY total_count DESC \
            )\
            "
num_active_distinct_users = pd.read_sql_query(querystr, connlite3).iloc[0,0]
print("Number of active contributors: {0} (more than 30 nodes and 10 ways)".format( pd.read_sql_query(querystr, connlite3).iloc[0,0] ))

Number of ways(*1000) / Average number of tags per way: 554.156 / 2.5943867791740955
Number of nodes(*1000) / Average number of tags per node: 3537.404 / 0.11618378901589979
Number of (distinct) contributors: 2952
Number of active contributors: 757 (more than 30 nodes and 10 ways)

Top contributers by number¶

not considering quality...

querystr = "SELECT unc.user, \
                    unc.nodes_count, \
                    uwc.ways_count, \
                    (unc.nodes_count + uwc.ways_count) AS total_count \
                FROM (\
                    SELECT user, count(*) AS nodes_count \
                    FROM nodes \
                    GROUP BY user\
                ) unc \
                INNER JOIN (\
                    SELECT user, count(*) AS ways_count \
                    FROM ways \
                    GROUP BY user\
                ) uwc \
                ON unc.user = uwc.user \
            ORDER BY total_count DESC \
            LIMIT 20"
HTML( pd.read_sql_query(querystr, connlite3).to_html() )

Map usage - POI's for cyclists¶

As a passionate cyclist I am always interested in bicycle related places in my area. Let's start with a query of nodes that have been somehow tagged with "bicycle", excluding cycleways, traffic signs and parkings.

# Which POI's (grouped by key) exist?
# First query all data from id's that have something to do with bicycle --> mit self join
querystr = "SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                INNER JOIN (\
                    SELECT DISTINCT id \
                    FROM nodes_tags \
                    WHERE (\
                        key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                        AND NOT key='bicycle'\
                        ) OR (\
                        key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                        )\
                ) id_sel ON nt.id = id_sel.id"

HTML( pd.read_sql_query(querystr, connlite3).head(20).to_html() )

This form of representation is not very well suited for humans. A pivoted representation would be better human readable and preferred. It is straigthforward to do the job with a Pandas-DataFrame. Of course, this is also possible with pure SQL:

querystr = "SELECT id\
                , MAX(CASE WHEN key='name' THEN value END) name\
                , MAX(CASE WHEN key='street' THEN value END) strasze\
                , MAX(CASE WHEN key='housenumber' THEN value END) nr\
                , MAX(CASE WHEN key='city' THEN value END) city\
            FROM (\
                    SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                        INNER JOIN (\
                            SELECT DISTINCT id \
                            FROM nodes_tags \
                            WHERE (\
                                key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                                AND NOT key='bicycle'\
                                ) OR (\
                                key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                                )\
                        ) id_sel ON nt.id = id_sel.id\
                )\
            GROUP BY id"

HTML( pd.read_sql_query(querystr, connlite3).head(10).to_html() )

What a bad surprise! Many tags, particularly the adresses, are missing! In order to find the place anyway, we can join and information from the nodes table. Works for most people in times of GPS & Co. However, there is certainly room for improvement on the map data...

querystr = "SELECT n_tags.*, n.lat, n.lon FROM (\
                SELECT id\
                    , MAX(CASE WHEN key='name' THEN value END) name\
                    , MAX(CASE WHEN key='street' THEN value END) strasze\
                    , MAX(CASE WHEN key='housenumber' THEN value END) nr\
                    , MAX(CASE WHEN key='city' THEN value END) city\
                FROM (\
                        SELECT nt.id, nt.type, nt.key, nt.value FROM nodes_tags nt \
                            INNER JOIN (\
                                SELECT DISTINCT id \
                                FROM nodes_tags \
                                WHERE (\
                                    key LIKE '%bicycle%' AND key NOT LIKE '%left%' AND key NOT LIKE '%right%' \
                                    AND NOT key='bicycle'\
                                    ) OR (\
                                    key IN ('shop', 'amenity') AND value LIKE '%bicycle%' AND NOT value='bicycle_parking'\
                                    )\
                            ) id_sel ON nt.id = id_sel.id\
                    )\
                GROUP BY id\
            ) n_tags\
            INNER JOIN (\
                SELECT id, lat, lon FROM nodes\
            ) n\
            ON n_tags.id = n.id \
            ORDER by n_tags.city, n_tags.name, n.lon"

HTML( pd.read_sql_query(querystr, connlite3).to_html() )

68 places... but 7 are not even tagged (yet) with a name.

Conclusions and more ideas¶

Semi-automatic map data auditing and corrections have been performed in this project. However, some more advanced checks would be desireable - since these are just simple plausibility checks for common human mistakes. They do not ensure, that the data is correct at all. This would require the usage of more reference datasets or more advanced heuristics (e.g. the usage of lat/lon to verify the postal codes).

Some corrections were contributed to the OpenStreetMap project. These are the corresponding changesets:

Änderungssätze von shuita
- Straßennahmen korrigiert (Schreibfehler) #59862237
- Straßennahmen korrigiert (Schreibfehler) #59862205
- Straßennahmen korrigiert (Schreibfehler) #59861978
- Straßennahme war Ortsnahme #59861926
- Straßennahme war Ortsnahme #59861810
- Straßennahmen korrigiert (Schreibfehler) #59861609

The audited data has been transformed to SQLite tables. The transformation process was (in my point of view) simplified and improved by directly writing the dictionaries to SQL tables rather than writing them to CSV tables in an intermediate step. Furthermore, this approach prevents character encoding problems with foreign languages.

Some queries were demonstrated, e.g. statistics on map data contributors and points of interest for cycling enthusiasts. Although the overall quality of the OSM map data is very good, many incompletely tagged nodes were found for this example*. So there is likely in gerneral certain room for improvement regarding the completeness of tags.

import sys
print(sys.version)

3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]

connlite3.close()

*I am probably going to fix this when I am at home the next time with my bicycle B~)

	index	id	lat	lon	user	uid	version	changeset	timestamp
0	457834	1000109025	51.8766181	10.6933798	robgeb	336460	1	6435761	2010-11-22T22:22:48Z
1	458191	1001338921	51.8036745	11.6682431	FvGordon	161619	2	16950138	2013-07-14T14:49:06Z
2	458192	1001338932	51.8030183	11.6659263	robgeb	336460	1	6442484	2010-11-23T20:53:54Z

	index	id	key	value
0	68601	1000109025	barrier	gate
1	68602	1001338921	amenity	recycling
2	68603	1001338932	amenity	bank
3	68604	1001338932	name	Sparkasse
4	68605	1001338947	name	Cölbigk
5	68606	1001338947	place	hamlet
6	68607	1001453122	amenity	parking
7	68608	1001453122	name	Waldparkplatz
8	68609	1001453138	landuse	quarry
9	68610	1001453138	name	historischer Tagebau

	index	id	node_id	position
0	675116	86165181	1000108485	8
1	4707622	541563439	1000108539	0
2	4868839	583362995	1000108539	3
3	4868842	583362995	1000108546	6
4	53880	23768221	1000108549	0
5	4868733	583362975	1000108549	1
6	4868818	583362986	1000108549	3
7	39664	22778241	1000108643	3
8	675143	86165187	1000108684	0
9	675181	86165194	1000108684	6

	index	id	user	uid	version	changeset	timestamp
0	98736	114394341	st1974	44793	1	8214864	2011-05-22T11:07:28Z
1	98736	114394341	st1974	44793	1	8214864	2011-05-22T11:07:28Z
2	98736	114394341	st1974	44793	1	8214864	2011-05-22T11:07:28Z
3	124977	146469110	poschon	1719211	5	37007616	2016-02-04T20:41:45Z
4	124977	146469110	poschon	1719211	5	37007616	2016-02-04T20:41:45Z
5	124977	146469110	poschon	1719211	5	37007616	2016-02-04T20:41:45Z
6	124981	146469115	poschon	1719211	6	37007616	2016-02-04T20:38:14Z
7	184536	190646979	richtkreiser	390227	3	49112180	2017-05-30T18:56:33Z
8	184536	190646979	richtkreiser	390227	3	49112180	2017-05-30T18:56:33Z
9	205570	203868355	HWST	697239	1	14882451	2013-02-02T12:10:36Z

	index	id	key	value
0	310515	100033647	building	yes
1	310516	100033648	building	yes
2	310517	100033649	building	yes
3	310518	100033650	building	yes
4	310519	100033651	building	yes
5	310520	100033653	building	yes
6	310521	100033654	building	yes
7	310522	100033655	building	yes
8	310523	100033656	building	yes
9	310524	100033657	building	yes

	user	nodes_count	ways_count	total_count
0	matchman	405360	33796	439156
1	HWST	291936	60938	352874
2	easyX	235799	32686	268485
3	spaceman02	184791	12239	197030
4	richtkreiser	154929	26324	181253
5	HPHenschel	117051	17881	134932
6	hadhuey	103906	22011	125917
7	Arnaldur	109920	15297	125217
8	MichaOSM	87878	15760	103638
9	Pomfuttge	74229	9613	83842
10	Natura	72279	10525	82804
11	rbuch703	41116	10653	51769
12	tastenmensch	43504	7860	51364
13	nonny	45722	4170	49892
14	rexunil	43026	6363	49389
15	Andre68	43067	5521	48588
16	st1974	37632	10020	47652
17	bonzulu	39662	4652	44314
18	ClickKlack	38651	5541	44192
19	Kurt Krampmeier	37165	5174	42339

	id	type	key	value
0	246162984	addr	housenumber	2a
1	246162984	addr	street	Steingrube
2	246162984		amenity	bicycle_rental
3	246162984		name	Radwerk
4	246162984		phone	+4939436268444
5	246162984		sells	Fahrräder, Reparaturen
6	246162984		shop	bicycle
7	246162984		website	http://radwerk-wernigerode.de/
8	262694771		name	Wust
9	262694771		shop	bicycle
10	268145211		name	Zweiradmarkt Rudloff
11	268145211		shop	bicycle
12	281575115	addr	city	Schönebeck
13	281575115	addr	housenumber	151
14	281575115	addr	postcode	39218
15	281575115	addr	street	Magdeburger Straße
16	281575115		name	Firma Fredrich
17	281575115		operator	K. H. Fredrich
18	281575115		shop	bicycle
19	290174287	addr	city	Quedlinburg

	id	name	strasze	nr	city
0	1116530908	Vetter	None	None	None
1	1140989444	Radschmiede Sudenburger Fahrradladen	None	None	None
2	1158066486	Fahrrad Peter	None	None	None
3	1194501064	Fahrradverleih und Reparatur am Westfriedhof	None	None	None
4	1279629575	Rad der Stadt Magdeburg	Schönebecker Straße	103	Magdeburg
5	1368379633	MattBIKE RAD-CENTER	None	None	None
6	1396124480	Harzbikehaus	Wasserstraße	22	Oberharz am Brocken
7	1491890626	Schliesser Bike	None	None	None
8	1561895254	Fahrradcenter Grohmann	Karlstraße	29	Bernburg
9	1585913294	None	None	None	None

	id	name	strasze	nr	city	lat	lon
0	4273174600	None	None	None	None	51.8335062	10.7833424
1	1585913294	None	None	None	None	52.2279611	11.0113925
2	5145960134	None	None	None	None	51.7458012	11.0252872
3	1943938437	None	None	None	None	51.7247609	11.2488583
4	1878045831	None	None	None	None	52.1204151	11.6280791
5	1880719764	None	None	None	None	52.1069861	11.6440794
6	685773764	None	None	None	None	51.7677979	11.9326290
7	4833537022	1,2,3 Radverleih	None	None	None	51.8381848	10.7868399
8	1714120909	Alles rund ums Rad – Fahrräder und Service Gom...	Martin-Schwantes-Straße	34	None	52.0730608	11.8243944
9	3825377658	Bike Schmiede	None	None	None	51.6581950	11.3534160
10	4089924037	Brockenbike Siegl & Siegl GbR	None	None	None	51.9025860	10.6663869
11	631163565	DB Call a Bike Magdeburg	None	None	None	52.1306458	11.6253357
12	4375716262	Der Fahrrad Doktor	None	None	None	52.2482075	10.8206914
13	843230377	Fahrrad Baron	None	None	None	51.8377596	10.7892531
14	5303435131	Fahrrad Karl Beyer	None	None	None	51.6458696	11.5127795
15	3582592158	Fahrrad Magdeburg Büchner	Hasselbachstraße	8	None	52.1292532	11.6286108
16	1158066486	Fahrrad Peter	None	None	None	52.1062870	11.6372482
17	290559960	Fahrrad Scherf	None	None	None	52.0060906	11.7255161
18	635228117	Fahrrad Zöffzig	None	None	None	52.1295807	11.6047938
19	436368538	Fahrrad- und Modelleisenbahn Ingo Klettke	None	None	None	52.0187567	11.7204955
20	1796420587	Fahrrad-Fachhandel J. Drese	Braugartenstraße	9	None	51.6139684	11.4944537
21	3700699524	Fahrradmanufaktur Albatros	None	None	None	51.7526045	11.9723890
22	4876558016	Fahrradverleih "Neustadt/Harz"	None	None	None	51.5594056	10.8334287
23	1194501064	Fahrradverleih und Reparatur am Westfriedhof	None	None	None	52.1297757	11.5859291
24	1368379633	MattBIKE RAD-CENTER	None	None	None	52.0874701	11.6595682
25	2808179531	RadBär	None	None	None	51.7917986	11.7445941
26	858426345	Radmaxx	None	None	None	52.1472053	11.5831995
27	1140989444	Radschmiede Sudenburger Fahrradladen	None	None	None	52.1152608	11.6007149
28	4375836937	Radservice Der Radmitte Service-Point	None	None	None	52.1294021	11.6357805
29	246162984	Radwerk	Steingrube	2a	None	51.8344290	10.7893295
30	1491890626	Schliesser Bike	None	None	None	52.1409088	10.9717172
31	2451387337	TOUREN HARZ - BIKES & PARTS	Marienhöferstraße	2	None	51.8642890	10.6793092
32	428075092	Täves Radladen & Manufaktur	None	None	None	52.1234531	11.6129286
33	1116530908	Vetter	None	None	None	52.2277682	11.0275548
34	1597532744	Waldgasthaus Plessenburg	None	None	None	51.8320143	10.6679773
35	1914255431	Weltrad	None	None	None	52.0210239	11.7447310
36	262694771	Wust	None	None	None	51.9708135	10.7137120
37	5221821421	Zweirad Weld	None	None	None	52.1784848	11.4822542
38	3607760369	Zweirad-Franke	None	None	None	51.5938103	11.4542209
39	3573239495	Zweirad-Groth	None	None	None	51.7481234	11.9764348
40	268145211	Zweiradmarkt Rudloff	None	None	None	52.0020272	11.7176194
41	988984699	eldoRADo	None	None	None	52.1597659	11.6393392
42	357497780	fahrradohlsen	Karlstraße	16	None	51.7933800	11.7459043
43	1808686389	nextbike	None	None	None	52.1330228	11.6104660
44	1810394587	nextbike	None	None	None	52.1309134	11.6377057
45	1812228573	Zweirad Mente GmbH	Holzmarkt	4	Aschersleben	51.7553729	11.4574536
46	1561895254	Fahrradcenter Grohmann	Karlstraße	29	Bernburg	51.7918947	11.7475311
47	795642288	Rad & Tat	Karl-Zerbst-Straße	22	Blankenburg (Harz)	51.8042893	10.9477148
48	1732631501	Heiko Persch	Schartauer Straße	24	Burg	52.2714767	11.8512896
49	3664004601	Zweirad Haase	Martin-Luther-Straße	25	Burg	52.2714214	11.8399208
50	2384178784	Zweirad Myrrhe	Hakeborner Weg	3	Egeln	51.9383853	11.4183275
51	1908447413	Niemann & Blankmeister	Marktstraße	4	Königslutter am Elm	52.2506470	10.8182412
52	3698647427	Sattelfest	Am Beek	None	Königslutter am Elm	52.2692675	10.7844148
53	1783650216	Fahrrad Koch	Simonstraße	2	Magdeburg	52.1174399	11.6602042
54	629464410	Feine Räder	Liebknechtstraße	39	Magdeburg	52.1234650	11.6102744
55	629553229	Little John Bikes	Kastanienstraße	92	Magdeburg	52.1613835	11.6355207
56	323474099	Little John Bikes	Alter Markt	13-14	Magdeburg	52.1322141	11.6381003
57	1279629575	Rad der Stadt Magdeburg	Schönebecker Straße	103	Magdeburg	52.1064613	11.6408054
58	610564832	RadMitte	Goldschmiedebrücke	7, 9, 11	Magdeburg	52.1292617	11.6374959
59	320881325	Schaper	Halberstädter Straße	89	Magdeburg	52.1118188	11.6051428
60	401234988	Zweirad Schulz	Breiter Weg	38	Magdeburg	52.1364881	11.6394686
61	1396124480	Harzbikehaus	Wasserstraße	22	Oberharz am Brocken	51.7693895	10.8026747
62	290174287	2Rad Pavillon	Adelheidstraße	1b	Quedlinburg	51.7875045	11.1472362
63	300447446	Mente	Pölkenstraße	8	Quedlinburg	51.7883428	11.1477341
64	281575115	Firma Fredrich	Magdeburger Straße	151	Schönebeck	52.0106776	11.7143204
65	4128056889	Fahrzeugmuseum Staßfurt	Berlepschstraße	14	Staßfurt	51.8592631	11.6015367
66	322079653	Bremer Fahrradservice	Zur Grube	2	Wolmirstedt	52.2517271	11.6329177
67	629322254	Fahrrad Pfalz	Fabrikstraße	2	Wolmirstedt	52.2470992	11.6210667