HTML Parsing Guide
Parse VuSitu or HydroVu HTML files using the groups, properties and other XML attributes listed below. Use the example parser written in Python at the bottom of this page as a model for your own script, or customize the code to suit your needs. We have also provided several VuSitu HTML files you may use for testing.
Groups
LocationProperties
ReportProperties
InstrumentProperties
LogProperties
TestProperties
WellProperties
PumpProperties
TubingProperties
LocationProperties
Name
GUID
Latitude
Longitude
ReportProperties
StartTime
Created
Duration
Readings
TimeOffset
InstrumentProperties
Model
SerialNumber
FirmwareVersion
LogProperties
LogType
Name
GUID
FileNumber
LogWrapping
LogLinearProperties
Interval
LogLogarithmicProperties
Interval
LogLinearAverageProperties
Interval
AveragingInterval
SampleSize
LogStepProperties
Interval
Count
Duration
LogEventProperties
SamplingInterval
DefaultInterval
HighThreshold
LowThreshold
ChangeThreshold
ChangeSinceLastLoggedThreshold
LowFlowTestProperties
TestType
StartTime
TimeOffset
ProjectName
OperatorName
FlowCellVolume
InitialDepthToWater
FinalDrawDown
TotalSystemVolume
TotalPumpedVolume
WellProperties
CasingType
Diameter
Length
TotalDepth
DepthToScreen
ScreenLength
PumpProperties
Model
FlowRate
Volume
IntakeFromTopOfCasing
FinalPumpingRate
TubingProperties
TubingType
Diameter
Length
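The group and property names listed above can be kept in a lookup table and used to check what a parsed file actually contains. The following is only a minimal sketch: `EXPECTED_PROPERTIES` and `missing_properties` are hypothetical names, only three groups are shown, and the parsed metadata is assumed to be shaped like the `metadataGroups` dictionary built by the example parser at the bottom of this page. A given file may not contain every group or property, so a missing entry is probably best treated as information rather than as an error.

```python
# Hypothetical lookup table built from the lists above (only a subset is shown).
EXPECTED_PROPERTIES = {
    "LocationProperties": ["Name", "GUID", "Latitude", "Longitude"],
    "ReportProperties": ["StartTime", "Created", "Duration", "Readings", "TimeOffset"],
    "InstrumentProperties": ["Model", "SerialNumber", "FirmwareVersion"],
}


def missing_properties(metadata_groups):
    """Return {group: [property, ...]} for listed properties absent from the parsed metadata."""
    missing = {}
    for group, props in EXPECTED_PROPERTIES.items():
        found = metadata_groups.get(group, {})
        absent = [p for p in props if p not in found]
        if absent:
            missing[group] = absent
    return missing


# Made-up parsed metadata, for illustration only.
sample = {
    "LocationProperties": {
        "Name": {"Label": "Name", "Value": "My Location"},
        "GUID": {"Label": "GUID", "Value": "example-guid"},
    },
    "InstrumentProperties": {
        "Model": {"Label": "Model", "Value": "Aqua TROLL 600"},
        "SerialNumber": {"Label": "Serial Number", "Value": "1234"},
        "FirmwareVersion": {"Label": "Firmware Version", "Value": "1.0"},
    },
}

for group, props in missing_properties(sample).items():
    print(group, "is missing", props)
```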
XML Attributes:
- isi-group:
Defines the identifier that isi-group-member elements use to group themselves logically.
Example: isi-group="LocationProperties"
- isi-group-member:
Identifies item as a group member.
Example: isi-group-member="LocationProperties"
- isi-property:
Identifies the item as a property, in the form <Key> <Value>. Example:
<tr isi-property="Name"><td>Name = My Location</td></tr>
- isi-label:
Identifies the element as a localized label, isolating it from the value.
Example: <tr><td><span isi-label="">Name</span> = My Location</td></tr>
- isi-value:
Tells the parser that the current element contains a value. Example:
<tr><td>Name = <span isi-value="">My Location</span></td></tr>
- isi-text-node:
Tells the parser that the value is in the element's text node rather than in an attribute of the current element. Example:
<tr><td isi-text-node="">Name = My Location</td></tr>
- isi-datetime:
Provides a date-time string in ISO 8601 format. Example:
<tr><td isi-datetime="2017-10-08T13:05:30-06:00">Start Time = 10/08/2017 1:05 PM</td></tr>
- isi-timespan-milliseconds:
Provides an integer millisecond duration Example:
<tr><td isi-timespan-milliseconds="3600000">Duration = 01:00:00</td></tr>
- isi-enabled: Provides a boolean value as a string Example:
<tr><td isi-enabled="True">Log Wrapping Enabled = True</td></tr>
- isi-device-type: Provides an integer value representing the device model (System Spec Device Type) Example:
<tr><td isi-device-type="7">Model = Aqua TROLL 600</td></tr>
- isi-log-type: Provides an integer value representing the log type
- isi-data-column-header: Indicates the current element is a data column header Example:
<tr><td isi-data-column-header="">Temperature (F)</td></tr>
- isi-device-serial-number: Provides a device serial number Example:
<tr><td isi-device-serial-number="1234">Temperature (F)</td></tr>
- isi-sensor-serial-number: Provides a sensor serial number Example:
<tr><td isi-sensor-serial-number="4567">Temperature (F)</td></tr>
- isi-device-sensor-type: Provides a sensor type (system spec sensor type) Example:
<tr><td isi-sensor-type="1">Temperature (F)</td></tr>
- isi-external-parameter: Indicates a sensor is from an external source (not part of the instrument) Example:
<tr><td isi-external-parameter="">Temperature (F)</td></tr>
- isi-parameter-type: Provides a sensor parameter type (system spec parameter type) Example:
<tr><td isi-parameter-type="1">Temperature (F)</td></tr>
- isi-unit-type: Provides a sensor unit type (system spec unit type) Example:
<tr><td isi-unit-type="2">Temperature (F)</td></tr>
- isi-data-table: Indicates the start of a time series data table Example:
<tr isi-data-table=""><td>Temperature (F)</td></tr>
- isi-data-row: Indicates a data row of a time series data table Example:
<tr isi-data-row=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>
- isi-timestamp: Provides an integer form of system spec time (internal definition) Example:
<tr isi-timestamp="238900"><td>07/16/1969 20:18:00</td><td>100.0</td></tr>
- isi-data-quality: Indicates data of a time series data table has a non-normal quality value (system spec Data Quality Type) Example:
<tr><td>98.6</td><td>32.0</td><td>100.0</td><td isi-data-quality="3">0</td></tr>
- isi-marked: Indicates data of a time series data table was marked (entire row) Example:
<tr isi-marked=""><td>98.6</td><td>32.0</td><td>100.0</td></tr>
- isi-log-note: Indicates that the current element is part of a log note Example:
<tr><td isi-log-note="">10/08/2012 15:30:00 Sensor Changed</td></tr>
- isi-log-note-type: Provides the log note type as an integer (System Spec Log Note Type) Example:
<tr><td isi-log-note-type="12">10/08/2012 15:30:00 Sensor Changed</td></tr>
- isi-lowflow-sample: Indicates that the current element is part of a Low-Flow sample Example:
<tr><td isi-lowflow-sample=""><span isi-label="">Sample #931</span>: <span isi-value="">Pre test sample</span></td></tr>
- isi-lowflow-note: Indicates that the current element is part of a Low-Flow note Example:
<tr><td isi-lowflow-note=""><span>Weather Conditions</span>: <span>38.5 F, 78% humidity</span></td></tr>
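Several of the attributes above carry typed values as strings. The sketch below shows one possible way to convert them to Python values. It is a hedged illustration, not part of the file format: `convert_isi_attribute` is a hypothetical helper, only attributes that this page describes as date-time, millisecond, boolean or integer values are converted, and `datetime.fromisoformat` is assumed to be available (Python 3.7 or later).

```python
from datetime import datetime, timedelta

# Attributes this page describes as integers (assumption: the values are always plain decimal digits).
INTEGER_ATTRIBUTES = ("isi-device-type", "isi-log-type", "isi-timestamp", "isi-log-note-type")


def convert_isi_attribute(name, value):
    """Convert a few isi-* attribute strings to Python values; anything else is returned unchanged."""
    if name == "isi-datetime":
        # e.g. "2017-10-08T13:05:30-06:00" becomes a timezone-aware datetime
        return datetime.fromisoformat(value)
    if name == "isi-timespan-milliseconds":
        return timedelta(milliseconds=int(value))
    if name == "isi-enabled":
        # the boolean is provided as a string such as "True"
        return value.strip().lower() == "true"
    if name in INTEGER_ATTRIBUTES:
        return int(value)
    return value


start = convert_isi_attribute("isi-datetime", "2017-10-08T13:05:30-06:00")
duration = convert_isi_attribute("isi-timespan-milliseconds", "3600000")
print(start.isoformat(), duration)
```

Marker attributes with empty values, such as isi-label="" or isi-data-row="", pass through unchanged, so their presence can still be tested separately.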
Notes on Parsing the file
- The data to be parsed begins at the tag: <table id="isi-report">
- The data is organized by table row (<tr>) and then by the <td> elements in that row.
- Do NOT parse based on the class attribute. The class attribute is only used for formatting inside of Excel and should be treated as optional.
- Parse based on the XML Attributes listed above (for example: isi-group).
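The notes above can be sketched as a small collector that starts at the isi-report table, walks each row, and keeps only the isi-* attributes while ignoring class. This is a simplified illustration rather than the example parser: `IsiRowCollector` and its `rows` list are hypothetical names, the sample HTML is made up, and the sketch assumes that no other table is nested inside the isi-report table.

```python
from html.parser import HTMLParser


class IsiRowCollector(HTMLParser):
    """Collect, for each <tr> inside <table id="isi-report">, its isi-* attributes and cell text."""

    def __init__(self):
        super().__init__()
        self.in_report = False  # True once <table id="isi-report"> has been seen
        self.in_cell = False    # True while inside a <td>
        self.current = None     # the row being built
        self.rows = []          # finished rows

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "isi-report":
            self.in_report = True
            return
        if not self.in_report:
            return
        # keep only the isi-* attributes; class and other attributes are ignored
        isi_attrs = {k: v for k, v in attrs.items() if k.startswith("isi-")}
        if tag == "tr":
            self.current = {"attrs": isi_attrs, "cells": []}
        elif self.current is not None:
            self.current["attrs"].update(isi_attrs)
            if tag == "td":
                self.current["cells"].append("")
                self.in_cell = True

    def handle_data(self, data):
        if self.in_cell and self.current is not None:
            self.current["cells"][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current is not None:
            self.rows.append(self.current)
            self.current = None
        elif tag == "table":
            self.in_report = False


# Made-up input, for illustration only.
SAMPLE_HTML = """
<table class="layout"><tr><td>ignored</td></tr></table>
<table id="isi-report">
<tr class="sectionHeader" isi-group="LocationProperties"><td>Location Properties</td></tr>
<tr isi-group-member="LocationProperties" isi-property="Name"><td class="c"><span isi-label="">Name</span> = <span isi-value="">My Location</span></td></tr>
</table>
"""

collector = IsiRowCollector()
collector.feed(SAMPLE_HTML)
for row in collector.rows:
    print(row["attrs"], row["cells"])
```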
Example Parser (Python 2.x)
from HTMLParser import HTMLParser
# custom version of the HTML parser; the overrides record the tags, attributes and text that are fed to it
class MyHTMLParser(HTMLParser):
    # feed one line to the parser; when reset is True, discard what was recorded for earlier lines
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        # ignore whitespace and the "=" separator between a label and its value
        if data.strip() in ("", "="):
            return
        if len(self.elements) < 1:
            return
        self.elements[-1]["data"] = data
# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

# get the value of the first class attribute found in a list of elements
def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]
parser = MyHTMLParser()
# data structures to hold the file data
metadataGroups = {}
dataTables = []
# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')
# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break
# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True
    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue
    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue
    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue
    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']
    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue
        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]
            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}
            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]: attr[1]})
        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            hold = {'Name': element['data']}
            # copy the header's attributes (sensor type, unit type, ...) next to its name
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for cell in parser.elements:
                if 'attrs' in cell and ContainsAttr(cell['attrs'], 'isi-data-row'):  # no data in the row setup
                    continue
                elif 'data' in cell:
                    hold.append(cell['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break
fptr.close()
# printing out the metadata for the file.
for key in metadataGroups.iterkeys():
    print key
    print "\t", metadataGroups[key]
    print "\n"
# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        print header['Name'] + "\t",
    print ""
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        for datum in row:
            print "|" + datum + "|",
        print ""
Example Parser (Python 3.x)
from html.parser import HTMLParser
# custom version of the HTML parser; the overrides record the tags, attributes and text that are fed to it
class MyHTMLParser(HTMLParser):
    # feed one line to the parser; when reset is True, discard what was recorded for earlier lines
    def FeedLine(self, line, reset):
        if reset:
            self.startStack = []
            self.endStack = []
            self.elements = []
        self.feed(line)

    def handle_starttag(self, tag, attrs):
        self.startStack.append(tag)
        self.elements.append({})
        if len(attrs) > 0:
            self.elements[-1]["attrs"] = attrs

    def handle_endtag(self, tag):
        self.endStack.append(tag)

    def handle_data(self, data):
        # ignore whitespace and the "=" separator between a label and its value
        if data.strip() in ("", "="):
            return
        if len(self.elements) < 1:
            return
        self.elements[-1]["data"] = data
# get the attribute of the type provided from a list of attributes
def GetAttr(attrs, ofType):
    for attr in attrs:
        if attr[0] == ofType:
            return attr
    return None

# determine if the list of attributes contains the type provided
def ContainsAttr(attrs, ofType):
    return GetAttr(attrs, ofType) is not None

# scan through all the elements in a list of elements looking for the attribute of the provided type
def GetAttrFromElements(elements, ofType):
    for element in elements:
        if 'attrs' not in element:
            continue
        for attr in element['attrs']:
            if attr[0] == ofType:
                return attr
    return None

# get the value of the first class attribute found in a list of elements
def GetClass(elements):
    attr = GetAttrFromElements(elements, 'class')
    if attr is None:
        return None
    else:
        return attr[1]
parser = MyHTMLParser()
# data structures to hold the file data
metadataGroups = {}
dataTables = []
# open the in-situ data file
fptr = open('YOUR FILENAME HERE', 'r')
# skip past the display html to the data we are interested in
for line in fptr:
    parser.FeedLine(line, True)
    if 'body' in parser.startStack:
        break
# read the file in line by line to save memory
resetLine = True
for line in fptr:
    parser.FeedLine(line, resetLine)
    resetLine = True
    # only interested in table rows so if it's not a tr go to next line
    if "tr" not in parser.startStack:
        continue
    # make sure we have an entire table row loaded up not just a line
    if "tr" not in parser.endStack:
        resetLine = False
        continue
    # skip blank lines
    if len(parser.elements) < 1 or 'attrs' not in parser.elements[0]:
        continue
    startIndex = parser.startStack.index('tr')
    rootAttr = parser.elements[startIndex]['attrs']
    for element in parser.elements[startIndex:]:
        # if the element doesn't have attributes we can ignore it
        if 'attrs' not in element:
            continue
        if ContainsAttr(element['attrs'], 'isi-group'):
            attr = GetAttr(element['attrs'], 'isi-group')
            metadataGroups[attr[1]] = {}
            metadataGroups[attr[1]]["Name"] = attr[1]
        elif ContainsAttr(element['attrs'], 'isi-group-member'):
            attr = GetAttr(element['attrs'], 'isi-group-member')
            metaData = parser.elements[1]['attrs']
            groupName = GetAttr(metaData, 'isi-group-member')[1]
            isiProperty = GetAttr(element['attrs'], 'isi-property')
            if isiProperty is not None:
                label = parser.elements[2]['data']
                value = parser.elements[3]['data']
                metadataGroups[groupName][isiProperty[1]] = {'Label': label, 'Value': value}
            logNotes = GetAttr(element['attrs'], 'isi-log-note')
            if logNotes is not None:
                # create the notes list only once per group so earlier notes are kept
                if "Notes" not in metadataGroups[groupName]:
                    metadataGroups[groupName]["Notes"] = []
                for attr in parser.elements[1]['attrs']:
                    metadataGroups[groupName]["Notes"].append({attr[0]: attr[1]})
        elif ContainsAttr(element['attrs'], 'isi-data-table'):
            dataTables.append({'Headers': [], 'Values': []})
        elif ContainsAttr(element['attrs'], 'isi-data-column-header'):
            hold = {'Name': element['data']}
            # copy the header's attributes (sensor type, unit type, ...) next to its name
            for attr in element['attrs']:
                hold[attr[0]] = attr[1]
            dataTables[-1]['Headers'].append(hold)
        elif ContainsAttr(element['attrs'], 'isi-data-row'):
            hold = []
            # all the elements in this row are data so process them all then break
            for cell in parser.elements:
                if 'attrs' in cell and ContainsAttr(cell['attrs'], 'isi-data-row'):  # no data in the row setup
                    continue
                elif 'data' in cell:
                    hold.append(cell['data'])
                else:
                    hold.append(' ')
            dataTables[-1]['Values'].append(hold)
            break
fptr.close()
# printing out the metadata for the file.
for key in metadataGroups.keys():
    print(key)
    print("\t", metadataGroups[key])
    print("\n")
# printing out the data
for table in dataTables:
    # printing header row for all the data - NOTE: there is metadata like sensor type not being printed here
    for header in table['Headers']:
        print(header['Name'] + "\t", end="")
    print("")
    # printing the data row - same order as the headers so you can associate them properly
    for row in table['Values']:
        for datum in row:
            print("|" + datum + "|", end=" ")
        print("")
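Instead of printing, the parsed tables could be written to CSV. The sketch below assumes each table has the shape the example parser builds in `dataTables` (a 'Headers' list of dictionaries with a 'Name' key and a 'Values' list of rows). `table_to_csv` is a hypothetical helper, the sample table is made up, and the code targets Python 3.

```python
import csv
import io


def table_to_csv(table, stream):
    """Write one parsed data table to an open text stream as CSV."""
    writer = csv.writer(stream)
    # header names first, then the rows in the same column order
    writer.writerow([header["Name"] for header in table["Headers"]])
    writer.writerows(table["Values"])


# Made-up table in the same shape as an entry of dataTables.
sample_table = {
    "Headers": [{"Name": "Date Time"}, {"Name": "Temperature (F)"}],
    "Values": [["07/16/1969 20:18:00", "100.0"]],
}

buffer = io.StringIO()
table_to_csv(sample_table, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with open(path, "w", newline=""), as the csv module documentation recommends.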