Thursday, 22 August 2013

Can HTMLAgilityPack or something similar help me obtain data from these tables?

We download thousands of XML files, and embedded in those files are
HTML(ish) tables containing information that we need to extract and store
in a database.

We made a first pass with LINQ to XML. It worked for a large portion of
the tables, but it became apparent that the tables vary in how they are
structured. They are largely the same in that they all render as two
columns, with the left side defining criteria and the right side defining
sub criteria.

They differ in how they split up the sub criteria and how these relate to
the criteria. Most follow the example of the second table below, which is
relatively straightforward: each <p> in the second column is one sub
criterion. Some, such as example table 1, instead use rowspan/colspan and
separate rows in the second column to define the individual sub criteria,
which is much harder to parse with something like LINQ to XML.

I've not used HTMLAgilityPack before. Is it possible to pass it the HTML
of a table and then extract the rows of the first column, followed by the
rows of the second column? I'd then have to work out a way of matching sub
criteria to criteria. Alternatively, is there some tool other than
HTMLAgilityPack that would suit this scenario?

<!-- Example table 1 -->
<html>
<table border="1">
<tr header="true" rowheight="0">
<td colspan="2"><p><cs id="24">CRITERIA</cs></p></td>
<td colspan="2"><p>SUB CRITERIA</p></td>
</tr>
<tr>
<td rowspan="2"><p>1</p></td>
<td rowspan="2"><p>CRITERIA 1.</p></td>
<td><p>1.1</p></td>
<td><p>SUB CRITERIA 1.1</p></td>
</tr>
<tr>
<td><p>1.2</p></td>
<td><p>SUB CRITERIA 1.2</p></td>
</tr>
</table>
<br /><br />
<!-- Example table 2 -->
<table border="1">
<tr>
<td><p>CRITERIA</p></td>
<td><p>SUB CRITERIA</p></td>
</tr>
<tr>
<td><p>1 CRITERIA 1</p></td>
<td>
<p>1.1 SUB CRITERIA 1.1</p>
<p>1.2 SUB CRITERIA 1.2</p>
</td>
</tr>
</table>
</html>
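
One possible way to handle both layouts with the same code is to expand every rowspan/colspan into a plain rectangular grid first, and only then pair criteria with sub criteria. The sketch below is not HTMLAgilityPack; it uses Python's standard `html.parser` purely to illustrate the idea, and the same steps should port to any parser that tolerates loose markup. The names `TableParser`, `expand`, `extract_pairs` and `SAMPLE` are made up for this example. It assumes the first row is always a header whose first cell's colspan gives the number of criteria columns, that all text sits inside `<p>` elements, and that tables are not nested.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as rows of (rowspan, colspan, paragraphs) cells."""

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # row currently being filled, if any
        self._cell = None  # cell currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
            self.tables[-1].append(self._row)
        elif tag == "td" and self._row is not None:
            self._cell = (int(attrs.get("rowspan", 1)),
                          int(attrs.get("colspan", 1)), [])
            self._row.append(self._cell)
        elif tag == "p" and self._cell is not None:
            self._cell[2].append("")  # start a new paragraph in this cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None
        elif tag == "tr":
            self._row = None

    def handle_data(self, data):
        # Append text to the cell's most recent <p>; stray whitespace
        # picked up here is stripped later in expand().
        if self._cell is not None and self._cell[2]:
            self._cell[2][-1] += data


def expand(rows):
    """Repeat spanned cells so each grid row has a value in every column."""
    filled = {}
    for r, row in enumerate(rows):
        c = 0
        for rowspan, colspan, paras in row:
            while (r, c) in filled:  # skip slots taken by a rowspan from above
                c += 1
            text = [p.strip() for p in paras if p.strip()]
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    width = 1 + max((c for _, c in filled), default=-1)
    return [[filled.get((r, c), []) for c in range(width)]
            for r in range(len(rows))]


def extract_pairs(html):
    """Return, per table, a list of (criteria, sub_criterion) tuples."""
    parser = TableParser()
    parser.feed(html)
    result = []
    for rows in parser.tables:
        if not rows or not rows[0]:
            result.append([])
            continue
        split = rows[0][0][1]  # assumed: header's first colspan = criteria width
        pairs = []
        for line in expand(rows)[1:]:  # skip the header row
            criteria = " ".join(p for cell in line[:split] for p in cell)
            # zip pairs up the n-th paragraph of every sub-criteria column
            for parts in zip(*line[split:]):
                pairs.append((criteria, " ".join(parts)))
        result.append(pairs)
    return result


# Condensed copy of the two example tables above.
SAMPLE = """
<table>
<tr><td colspan="2"><p><cs id="24">CRITERIA</cs></p></td>
    <td colspan="2"><p>SUB CRITERIA</p></td></tr>
<tr><td rowspan="2"><p>1</p></td><td rowspan="2"><p>CRITERIA 1.</p></td>
    <td><p>1.1</p></td><td><p>SUB CRITERIA 1.1</p></td></tr>
<tr><td><p>1.2</p></td><td><p>SUB CRITERIA 1.2</p></td></tr>
</table>
<table>
<tr><td><p>CRITERIA</p></td><td><p>SUB CRITERIA</p></td></tr>
<tr><td><p>1 CRITERIA 1</p></td>
    <td><p>1.1 SUB CRITERIA 1.1</p><p>1.2 SUB CRITERIA 1.2</p></td></tr>
</table>
"""

for table in extract_pairs(SAMPLE):
    for criteria, sub in table:
        print(criteria, "->", sub)
```

Because the rowspan cells are copied into every row they cover, each row of table 1 already carries its own criteria text, so matching sub criteria to criteria reduces to reading across a row. Table 2 goes through the same path; there the `zip` simply walks the paragraphs of the single sub-criteria column.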
