HTML Parsing Guide

Parse VuSitu or HydroVu HTML files using the groups, properties and other XML attributes listed below. Use the example parser written in Python at the bottom of this page as a model for your own script, or customize the code to suit your needs. We have also provided several VuSitu HTML files you may use for testing.

Groups

LocationProperties
ReportProperties 
InstrumentProperties 
LogProperties 
TestProperties 
WellProperties 
PumpProperties 
TubingProperties 

LocationProperties

Name
GUID 
Latitude 
Longitude 

ReportProperties

StartTime
Created 
Duration 
Readings 
TimeOffset 

InstrumentProperties

Model 
SerialNumber 
FirmwareVersion 

LogProperties

LogType
Name
GUID 
FileNumber 
LogWrapping 

LogLinearProperties

Interval 

LogLogarithmicProperties

Interval 

LogLinearAverageProperties

Interval 
AveragingInterval 
SampleSize 

LogStepProperties

Interval
Count 
Duration 

LogEventProperties

SamplingInterval 
DefaultInterval 
HighThreshold 
LowThreshold 
ChangeThreshold 
ChangeSinceLastLoggedThreshold 

LowFlowTestProperties

TestType 
StartTime 
TimeOffset 
ProjectName 
OperatorName 
FlowCellVolume 
InitialDepthToWater 
FinalDrawDown 
TotalSystemVolume 
TotalPumpedVolume 

WellProperties

CasingType 
Diameter 
Length 
TotalDepth 
DepthToScreen 
ScreenLength 

PumpProperties

Model 
FlowRate 
Volume 
IntakeFromTopOfCasing 
FinalPumpingRate 

TubingProperties

TubingType 
Diameter 
Length 

XML Attributes:

  • isi-group: 

Defines the identifier that isi-group-member elements use to group themselves together logically.

Example: isi-group="LocationProperties"

  • isi-group-member:

Identifies the item as a member of the named group.

Example: isi-group-member="LocationProperties"

  • isi-property:

Identifies the item as a property, in the form <Key> <Value>.

Example: <tr isi-property="Name"><td>Name = My Location</td></tr>

  • isi-label:

Identifies the element as a localized label, so that it can be separated from the value.

Example: <tr><td><span isi-label="">Name</span> = My Location</td></tr>

  • isi-value:

Tells the parser that the current element contains a value.

Example: <tr><td>Name = <span isi-value="">My Location</span></td></tr>

  • isi-text-node:

Tells the parser that the value is not stored in an attribute of the current element.

Example: <tr><td isi-text-node="">Name = My Location</td></tr>

  • isi-datetime:

Provides the date and time as an ISO 8601 formatted string.

Example: <tr><td isi-datetime="2017-10-08T13:05:30-06:00">Start Time = 10/08/2017 1:05 PM</td></tr>

  • isi-timespan-milliseconds:

Provides a duration as an integer number of milliseconds.

Example: <tr><td isi-timespan-milliseconds="3600000">Duration = 01:00:00</td></tr>

  • isi-enabled:

Provides a boolean value as a string.

Example: <tr><td isi-enabled="True">Log Wrapping Enabled = True</td></tr>

  • isi-device-type:

Provides an integer value representing the device model (System Spec Device Type).

Example: <tr><td isi-device-type="7">Model = Aqua TROLL 600</td></tr>

  • isi-log-type:

Provides an integer value representing the log type.

  • isi-data-column-header:

Indicates that the current element is a data column header.

Example: <tr><td isi-data-column-header="">Temperature (F)</td></tr>

  • isi-device-serial-number:

Provides a device serial number.

Example: <tr><td isi-device-serial-number="1234">Temperature (F)</td></tr>

  • isi-sensor-serial-number:

Provides a sensor serial number.

Example: <tr><td isi-sensor-serial-number="4567">Temperature (F)</td></tr>

  • isi-device-sensor-type:

Provides a sensor type (System Spec Sensor Type).

Example: <tr><td isi-sensor-type="1">Temperature (F)</td></tr>

  • isi-external-parameter:

Indicates that a sensor is from an external source (not part of the instrument).

Example: <tr><td isi-external-parameter="">Temperature (F)</td></tr>

  • isi-parameter-type:

Provides a sensor parameter type (System Spec Parameter Type).

Example: <tr><td isi-parameter-type="1">Temperature (F)</td></tr>

  • isi-unit-type:

Provides a sensor unit type (System Spec Unit Type).

Example: <tr><td isi-unit-type="2">Temperature (F)</td></tr>

  • isi-data-table:

Indicates the start of a time series data table.

Example: <tr isi-data-table=""><td>Temperature (F)</td></tr>

  • isi-data-row:

Indicates a data row of a time series data table.

Example: <tr isi-data-row=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>

  • isi-timestamp:

Provides an integer form of system spec time (internal definition).

Example: <tr isi-timestamp="238900"><td>07/16/1969 20:18:00</td><td>100.0</td></tr>

  • isi-data-quality:

Indicates that a value in a time series data table has a non-normal quality value (System Spec Data Quality Type).

Example: <tr><td>98.6</td><td>32.0</td><td>100.0</td><td isi-data-quality="3">0</td></tr>

  • isi-marked:

Indicates that an entire row of a time series data table was marked.

Example: <tr isi-marked=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>

  • isi-log-note:

Indicates that the current element is part of a log note.

Example: <tr><td isi-log-note="">10/08/2012 15:30:00 Sensor Changed</td></tr>

  • isi-log-note-type:

Provides the log note type as an integer (System Spec Log Note Type).

Example: <tr><td isi-log-note-type="12">10/08/2012 15:30:00 Sensor Changed</td></tr>

  • isi-lowflow-sample:

Indicates that the current element is part of a Low-Flow sample.

Example: <tr><td isi-lowflow-sample=""><span isi-label="">Sample #931</span>: <span isi-value="">Pre test sample</span></td></tr>

  • isi-lowflow-note:

Indicates that the current element contains a Low-Flow note.

Example: <tr><td isi-lowflow-note=""><span>Weather Conditions</span>: <span>38.5 F, 78% humidity</span></td></tr>
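Several of the attributes above carry a machine-readable value next to the localized display text, so a script can read the attribute instead of parsing the text. The following is a minimal sketch of that idea, assuming Python 3.7 or later and attributes supplied as (name, value) pairs, which is the shape html.parser passes to handle_starttag. The names decode_isi_attrs and INTEGER_ATTRS are illustrative only and are not part of VuSitu or HydroVu.

```python
from datetime import datetime, timedelta

# Attributes the guide describes as integers. Illustrative, not exhaustive.
INTEGER_ATTRS = {"isi-device-type", "isi-log-type", "isi-log-note-type", "isi-timestamp"}

def decode_isi_attrs(attrs):
    """Convert typed isi-* attributes of one element into Python values.

    attrs is a list of (name, value) pairs. Attributes that are not
    handled here are returned unchanged.
    """
    decoded = {}
    for name, value in attrs:
        if name == "isi-datetime":
            # ISO 8601 string with a UTC offset, e.g. 2017-10-08T13:05:30-06:00
            decoded[name] = datetime.fromisoformat(value)
        elif name == "isi-timespan-milliseconds":
            decoded[name] = timedelta(milliseconds=int(value))
        elif name == "isi-enabled":
            # the boolean arrives as the string "True" or "False"
            decoded[name] = value.strip().lower() == "true"
        elif name in INTEGER_ATTRS:
            decoded[name] = int(value)
        else:
            decoded[name] = value
    return decoded

# Usage with attribute values taken from the examples in this guide
start = decode_isi_attrs([("isi-datetime", "2017-10-08T13:05:30-06:00")])
duration = decode_isi_attrs([("isi-timespan-milliseconds", "3600000")])
wrapping = decode_isi_attrs([("isi-enabled", "True")])
print(start["isi-datetime"], duration["isi-timespan-milliseconds"], wrapping["isi-enabled"])
```

Reading the attribute rather than the cell text avoids depending on the locale-specific date and number formats used for display.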

Notes on Parsing the file

  • The data to be parsed begins at the tag: <table id="isi-report">
  • The data is organized by table row <tr> and then by the <td> elements in that row.
  • Do NOT parse based on the class attribute. The class attribute is used only for formatting inside of Excel and should be treated as optional.
  • Parse based on the XML Attributes listed above (for example: isi-group).
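The notes above can be sketched as a small parser that ignores everything outside <table id="isi-report">, never looks at the class attribute, and selects rows by the isi-data-row attribute alone. This is a hedged illustration, assuming Python 3 and a report table that does not contain nested tables. The class name ReportRowParser and the sample markup are hypothetical and only mirror the examples in this guide.

```python
from html.parser import HTMLParser

class ReportRowParser(HTMLParser):
    """Collect cell text from rows marked isi-data-row inside the isi-report table."""

    def __init__(self):
        super().__init__()
        self.in_report = False  # True between <table id="isi-report"> and </table>
        self.row = None         # cells of the data row being read, or None
        self.in_cell = False    # True while inside a <td> of a data row
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            if attrs.get("id") == "isi-report":
                self.in_report = True
        elif not self.in_report:
            return  # display HTML before the report is skipped
        elif tag == "tr" and "isi-data-row" in attrs:
            self.row = []
        elif tag == "td" and self.row is not None:
            self.row.append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_report = False
        elif tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.in_cell and self.row is not None:
            self.row[-1] += data.strip()

# Hypothetical sample: the first table is outside the report and is ignored
sample = (
    '<table><tr isi-data-row=""><td>ignored</td></tr></table>'
    '<table id="isi-report">'
    '<tr isi-data-table=""><td isi-data-column-header="">Temperature (F)</td></tr>'
    '<tr isi-data-row="" class="data"><td>98.6</td><td>32.0</td><td>100.0</td></tr>'
    '</table>'
)
parser = ReportRowParser()
parser.feed(sample)
print(parser.rows)
```

Because the row is selected by its isi-data-row attribute, the class="data" value in the sample has no effect on the result.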

 

Example Parser (Python 2.x)


from HTMLParser import HTMLParser

# making a custom version of the HTML Parser with overrides to get the data elements handled
class MyHTMLParser(HTMLParser):
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        # ignore whitespace-only text and the "=" separator between label and value
        if data.strip() == "" or data.strip() == "=":
            return
        if len(self.elements) < 1:
            return

        self.elements[-1]["data"] = data

# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]

parser = MyHTMLParser()

# data structures to hold the file data
metadataGroups = {}
dataTables = []

# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')

# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break

# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True

    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue

    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue

    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue

    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']

    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue

        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]

            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}

            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]:attr[1]})

        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            # keep the header text plus every attribute (sensor type, unit type, etc.)
            hold = {'Name': element.get('data', '')}
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for cell in parser.elements:
                if 'attrs' in cell and ContainsAttr(cell['attrs'], 'isi-data-row'): # no data in the row setup
                    continue
                elif 'data' in cell:
                    hold.append(cell['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break

# printing out the metadata for the file.
for key in metadataGroups.iterkeys():
    print key
    print "\t", metadataGroups[key]
print "\n"

# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        # the trailing comma keeps the headers on one line
        print header['Name'] + "\t",
    print ""
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        for datum in row:
            print "|" + datum + "|",
        print ""

Example Parser (Python 3.x)


from html.parser import HTMLParser

# making a custom version of the HTML Parser with overrides to get the data elements handled
class MyHTMLParser(HTMLParser):
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        # ignore whitespace-only text and the "=" separator between label and value
        if data.strip() == "" or data.strip() == "=":
            return
        if len(self.elements) < 1:
            return

        self.elements[-1]["data"] = data

# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]

parser = MyHTMLParser()

# data structures to hold the file data
metadataGroups = {}
dataTables = []

# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')

# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break

# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True

    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue

    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue

    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue

    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']

    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue

        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]

            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}

            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]:attr[1]})

        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            # keep the header text plus every attribute (sensor type, unit type, etc.)
            hold = {'Name': element.get('data', '')}
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for cell in parser.elements:
                if 'attrs' in cell and ContainsAttr(cell['attrs'], 'isi-data-row'): # no data in the row setup
                    continue
                elif 'data' in cell:
                    hold.append(cell['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break

# printing out the metadata for the file.
for key in metadataGroups.keys():
    print (key)
    print ("\t", metadataGroups[key])

print ("\n")

# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        # end="" keeps the headers on one line
        print(header['Name'] + "\t", end="")
    print("")
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        for datum in row:
            print("|" + datum + "|", end="")
        print("")

Sample VuSitu HTML files

Sample Log 1

Sample Log 2

Sample Log 3

Sample Log 4

Sample Log 5