HTML Parsing Guide

Parse VuSitu or HydroVu HTML files using the groups, properties and other XML attributes listed below. Use the example parser written in Python at the bottom of this page as a model for your own script, or customize the code to suit your needs. We have also provided several VuSitu HTML files you may use for testing.

Groups

LocationProperties
ReportProperties 
InstrumentProperties 
LogProperties 
TestProperties 
WellProperties 
PumpProperties 
TubingProperties

LocationProperties

Name
GUID 
Latitude 
Longitude

ReportProperties

StartTime
Created 
Duration 
Readings 
TimeOffset

InstrumentProperties

Model 
SerialNumber 
FirmwareVersion

LogProperties

LogType
Name
GUID 
FileNumber 
LogWrapping

LogLinearProperties

Interval

LogLogarithmicProperties

Interval

LogLinearAverageProperties

Interval 
AveragingInterval 
SampleSize

LogStepProperties

Interval
Count 
Duration

LogEventProperties

SamplingInterval 
DefaultInterval 
HighThreshold 
LowThreshold 
ChangeThreshold 
ChangeSinceLastLoggedThreshold

LowFlowTestProperties

TestType 
StartTime 
TimeOffset 
ProjectName 
OperatorName 
FlowCellVolume 
InitialDepthToWater 
FinalDrawDown 
TotalSystemVolume 
TotalPumpedVolume

WellProperties

CasingType 
Diameter 
Length 
TotalDepth 
DepthToScreen 
ScreenLength

PumpProperties

Model 
FlowRate 
Volume 
IntakeFromTopOfCasing 
FinalPumpingRate

TubingProperties

TubingType 
Diameter 
Length

XML Attributes:

isi-group:

Defines the identifier that isi-group-member items use to associate themselves with this group.

Example: isi-group="LocationProperties"

isi-group-member:

Identifies the item as a member of the named group.

Example: isi-group-member="LocationProperties"
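
The relationship between isi-group and isi-group-member can be sketched in a few lines of Python 3. This is only an illustration: the SAMPLE fragment and the GroupCollector class are assumptions made for the example, and real VuSitu or HydroVu files may place these attributes on different elements.

```python
from html.parser import HTMLParser

# Hypothetical fragment written for this sketch; not copied from a real export.
SAMPLE = """
<table id="isi-report">
<tr isi-group="LocationProperties"><td>Location Properties</td></tr>
<tr isi-group-member="LocationProperties" isi-property="Name"><td>Name</td><td>My Location</td></tr>
<tr isi-group-member="LocationProperties" isi-property="Latitude"><td>Latitude</td><td>40.5</td></tr>
</table>
"""

class GroupCollector(HTMLParser):
    """Maps each isi-group name to the isi-property names of its members."""

    def __init__(self):
        super().__init__()
        self.groups = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "isi-group" in attrs:
            # A group declaration: make sure the group exists.
            self.groups.setdefault(attrs["isi-group"], [])
        elif "isi-group-member" in attrs:
            # A member: file it under the group it names.
            members = self.groups.setdefault(attrs["isi-group-member"], [])
            members.append(attrs.get("isi-property"))

collector = GroupCollector()
collector.feed(SAMPLE)
groups = collector.groups
print(groups)
```

With the sample fragment above, groups maps "LocationProperties" to the list ["Name", "Latitude"].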

isi-property:

Identifies the item as a property, in the form <Key> <Value>. Example:

<tr isi-property="Name"><td>Name = My Location</td></tr>

isi-label:

Identifies the element as a localized label, to isolate it from the value side. Example:

<tr><td>Name = My Location</td></tr>

isi-value:

Tells the parser that the current element contains a value. Example:

<tr><td>Name = My Location</td></tr>

isi-text-node:

Tells the parser that the value is not stored as an attribute of the current element and must be read from the element's text. Example:

<tr><td isi-text-node="">Name = My Location</td></tr>

isi-datetime:

Provides the date and time as an ISO 8601 formatted string. Example:

<tr><td isi-datetime="2017-10-08T13:05:30-06:00">Start Time = 10/08/2017 1:05 PM</td></tr>
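
Because the attribute value is ISO 8601, it can be read without parsing the localized display text. A minimal sketch, assuming Python 3.7 or later (for datetime.fromisoformat); the variable names are illustrative only:

```python
from datetime import datetime, timezone

# Value copied from the isi-datetime example above.
raw = "2017-10-08T13:05:30-06:00"

# fromisoformat keeps the UTC offset, so the result is timezone-aware.
start = datetime.fromisoformat(raw)

# Normalizing to UTC makes timestamps from different offsets comparable.
start_utc = start.astimezone(timezone.utc)
print(start_utc.isoformat())  # 2017-10-08T19:05:30+00:00
```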

isi-timespan-milliseconds:

Provides the duration as an integer number of milliseconds. Example:

<tr><td isi-timespan-milliseconds="3600000">Duration = 01:00:00</td></tr>
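
The millisecond count converts directly to a Python duration, which avoids parsing the displayed "01:00:00" text. A short sketch using only the standard library:

```python
from datetime import timedelta

# Value copied from the isi-timespan-milliseconds example above.
raw = "3600000"

duration = timedelta(milliseconds=int(raw))
print(duration.total_seconds())  # 3600.0
```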

isi-enabled:

<tr><td isi-enabled="True">Log Wrapping Enabled = True</td></tr>

isi-device-type:

<tr><td isi-device-type="7">Model = Aqua TROLL 600</td></tr>

isi-log-type:

isi-data-column-header:

<tr><td isi-data-column-header="">Temperature (F)</td></tr>

isi-device-serial-number:

<tr><td isi-device-serial-number="1234">Temperature (F)</td></tr>

isi-sensor-serial-number:

<tr><td isi-sensor-serial-number="4567">Temperature (F)</td></tr>

isi-device-sensor-type:

<tr><td isi-sensor-type="1">Temperature (F)</td></tr>

isi-external-parameter:

<tr><td isi-external-parameter="">Temperature (F)</td></tr>

isi-parameter-type:

<tr><td isi-parameter-type="1">Temperature (F)</td></tr>

isi-unit-type:

<tr><td isi-unit-type="2">Temperature (F)</td></tr>

isi-data-table:

<tr isi-data-table=""><td>Temperature (F)</td></tr>

isi-data-row:

<tr isi-data-row=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>
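
One way to pull the cell values out of a row marked with isi-data-row is sketched below. RowCollector is a hypothetical helper written for this illustration, and converting every cell to float assumes the row holds only numeric readings, which may not be true for rows that also carry a date and time column.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collects the text of each <td> inside rows marked isi-data-row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_row = False
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and "isi-data-row" in dict(attrs):
            self.in_row = True
            self.rows.append([])
        elif tag == "td" and self.in_row:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            self.in_row = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()

collector = RowCollector()
collector.feed('<tr isi-data-row=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>')
rows = collector.rows
values = [[float(cell) for cell in row] for row in rows]
print(values)  # [[98.6, 32.0, 100.0]]
```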

isi-timestamp:

<tr isi-timestamp="238900"><td>07/16/1969 20:18:00</td><td>100.0</td></tr>

isi-data-quality:

<tr><td>98.6</td><td>32.0</td><td>100.0</td><td isi-data-quality="3">0</td></tr>

isi-marked:

<tr isi-marked=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>

isi-log-note:

<tr><td isi-log-note="">10/08/2012 15:30:00 Sensor Changed</td></tr>

isi-log-note-type:

<tr><td isi-log-note-type="12">10/08/2012 15:30:00 Sensor Changed</td></tr>

isi-lowflow-sample:

<tr><td isi-lowflow-sample="">Sample #931: Pre test sample</td></tr>

isi-lowflow-note:

Indicates that the current element contains a Low-Flow note. Example:

<tr><td isi-lowflow-note="">Weather Conditions: 38.5 F, 78% humidity</td></tr>

Notes on Parsing the file

The data to be parsed begins at the tag <table id="isi-report">.
The data is organized by table row (<tr>), and then by the <td> elements in that row.
Do NOT parse based on the class attribute. The class attribute is used only for formatting inside Excel and should be treated as optional.
Parse based on the XML attributes listed above (for example, isi-group).
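
These rules can be sketched as a small Python 3 scanner that starts collecting at <table id="isi-report">, keeps only the isi-* attributes, and never looks at class. The ReportScanner class, the SAMPLE fragment, and the class names inside it are assumptions made for this illustration; they are not taken from a real export.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the class values are invented for the example.
SAMPLE = """
<table><tr isi-group="Ignored"><td>outside the report</td></tr></table>
<table id="isi-report">
<tr class="sectionHeader" isi-group="LocationProperties"><td>Location Properties</td></tr>
<tr isi-data-row=""><td class="data">98.6</td></tr>
</table>
"""

class ReportScanner(HTMLParser):
    """Records the isi-* attributes of elements inside the isi-report table."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # table nesting depth inside the report; 0 means outside
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and (self.depth > 0 or attrs.get("id") == "isi-report"):
            self.depth += 1
        if self.depth == 0:
            return
        # Keep only the isi-* attributes; class is deliberately ignored.
        isi = {name: value for name, value in attrs.items() if name.startswith("isi-")}
        if isi:
            self.found.append((tag, isi))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1

scanner = ReportScanner()
scanner.feed(SAMPLE)
for tag, isi in scanner.found:
    print(tag, isi)
```

The row in the first table is skipped because it sits outside the report table, and the class attributes have no effect on what is collected.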

Example Parser (Python 2.x)


from HTMLParser import HTMLParser

# making a custom version of the HTML Parser with overrides to get the data elements handled
class MyHTMLParser(HTMLParser):
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        if data.strip() in ("", "="):
            return
        if len(self.elements) < 1:
            return

        self.elements[-1]["data"] = data

# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]

parser = MyHTMLParser()

# data structures to hold the file data
metadataGroups = {}
dataTables = []

# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')

# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break

# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True

    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue

    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue

    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue

    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']

    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue

        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]

            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}

            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]:attr[1]})

        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            hold = {'Name': element['data']}
            # copy every attribute (name, value) pair of the header cell
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for element in parser.elements:
                if 'attrs' in element and ContainsAttr(element['attrs'], 'isi-data-row'): # no data in the row setup
                    continue
                elif 'data' in element:
                    hold.append(element['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break

# printing out the metadata for the file.
for key in metadataGroups.iterkeys():
    print (key)
    print ("\t", metadataGroups[key])
print ("\n")

# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        print (header['Name'] + "\t",)
    print ("")
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        
        for datum in row:
            print ("|" + datum + "|",)
        print ("")

Example Parser (Python 3.x)


from html.parser import HTMLParser

# making a custom version of the HTML Parser with overrides to get the data elements handled
class MyHTMLParser(HTMLParser):
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        if data.strip() in ("", "="):
            return
        if len(self.elements) < 1:
            return

        self.elements[-1]["data"] = data

# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]

parser = MyHTMLParser()

# data structures to hold the file data
metadataGroups = {}
dataTables = []

# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')

# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break

# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True

    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue

    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue

    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue

    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']

    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue

        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]

            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}

            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]:attr[1]})

        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            hold = {'Name': element['data']}
            # copy every attribute (name, value) pair of the header cell
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for element in parser.elements:
                if 'attrs' in element and ContainsAttr(element['attrs'], 'isi-data-row'): # no data in the row setup
                    continue
                elif 'data' in element:
                    hold.append(element['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break

# printing out the metadata for the file.
for key in metadataGroups.keys():
    print (key)
    print ("\t", metadataGroups[key])

print ("\n")

# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        print (header['Name'] + "\t", end="")
    print ("")
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        for datum in row:
            print ("|" + datum + "|", end=" ")
        print ("")
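
Once the parser has filled dataTables, each table can be written out in another format. The sketch below turns one table into CSV text with the standard csv module. The table_to_csv function and the example_table contents are hypothetical; only the 'Headers' and 'Values' layout is taken from the parser above.

```python
import csv
import io

def table_to_csv(table):
    """Return one parsed data table as CSV text (header row first)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header["Name"] for header in table["Headers"]])
    writer.writerows(table["Values"])
    return buffer.getvalue()

# Stand-in for one entry of dataTables, using the same keys as the parser.
example_table = {
    "Headers": [{"Name": "Date Time"}, {"Name": "Temperature (F)"}],
    "Values": [["07/16/1969 20:18:00", "100.0"]],
}

csv_text = table_to_csv(example_table)
print(csv_text, end="")
```

In a real script, calling table_to_csv(table) for each table in dataTables and writing the result to a file would replace the print loop above.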

Sample VuSitu HTML files