Did I say large? Sorry guys, I meant to say HUGE!
A few days ago I ran into an interesting problem. An XML file had to be parsed and the data inserted into a database. I know it could probably be done with SQL plus some Visual Basic code, but because I had no clue how to actually do that, and after spending some time googling without finding what I REALLY needed, I went ahead and did it with ColdFusion. But I'm getting ahead of myself, so let me start over...

The XML was of the following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<catalog version="1.0">
   <Header>
      <Company>some company name</Company>
      <SenderName>the name</SenderName>
      <SenderPhone>some phone</SenderPhone>
      <SenderPhoneExt>phone extension</SenderPhoneExt>
      <TransferDate>date of the transfer</TransferDate>
      .... some other nodes here ...
   </Header>
   <App action="A" id="1" validate="yes">
      <tag1 id="270" />
      <tag2 id="75" />
      <Note>
         <![CDATA[ some possible text here]]>
      </Note>
      <Qty>0</Qty>
      <tag3 id="6864" />
      <tag4>1232434</tag4>
      <Product>
         <tag7>ergreg</tag7>
         <tag5>rgreg r</tag5>
         <tag6>
            <![CDATA[ rgrgrg ]]>
         </tag6>
         ... some more nodes ...
      </Product>
      <Product>
         <tag7>iuoyuo</tag7>
         <tag5>uiyi</tag5>
         <tag6>
            <![CDATA[ uyj r wtyy et rtret ]]>
         </tag6>
         ... some more nodes ...
      </Product>
   </App>
   <App action="A" id="2" validate="yes">
      <tag1 id="23" />
      <tag2 id="435" />
      <tag4>12324</tag4>
      <Product>
         <tag7>ereg</tag7>
         <tag5>rg reg r</tag5>
         <tag6>
            <![CDATA[ rgrg wefe rg ]]>
         </tag6>
         ... some more nodes ...
      </Product>
   </App>
   <Footer>
      <someOtherTag>srfaefewf</someOtherTag>
   </Footer>
</catalog>

Note that the number of APP nodes, as well as the number of PRODUCT nodes inside each APP, can vary. The set of nodes inside APP, and likewise inside PRODUCT, may differ as well, but all the possible variants that may exist are known.
So far, there is nothing really complicated, and ColdFusion can "eat" all that without even noticing it. But what would you say if the XML file were 40-50GB? The sample I got to play with was only 3GB, and even that is not something that can be parsed easily, huh? How many resources would the server need to handle that? I don't think CFFILE would even read/load it, never mind running XmlParse on it and then doing the usual parsing work. This is where I started thinking about something SQL Server native. But because, as explained earlier, the structure of each APP and PRODUCT may be different, plus the fact that I had never done it with SQL and didn't find anything easy enough to set up that would do the work for me, I had to start thinking of ways to do it with ColdFusion itself.

First of all, I needed the file to be read. Because that obviously can't be done for the entire file at once, I went with the idea of reading it line by line. Something like:
<!--- wrap a FileReader in a BufferedReader for efficient line-by-line reads --->
<cfset fileReader = CreateObject("java", "java.io.FileReader")>
<cfset fileReader.init("thePathToMYXMLFile\theFile.xml")>
<cfset br = CreateObject("java", "java.io.BufferedReader")>
<cfset br.init(fileReader)>
<!--- readLine() returns one line of text, or null once the end of the file is reached --->
<cfset line = br.readLine()>
Used correctly, this will do it for you.
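For completeness, here is a minimal sketch of what "used correctly" means. On Adobe ColdFusion, when readLine() returns Java null at the end of the file, the variable gets unset, hence the StructKeyExists() check:
<cfloop condition="true">
    <cfset variables.line = br.readLine()>
    <!--- at EOF, readLine() returns null and CF unsets the variable --->
    <cfif NOT StructKeyExists(variables, "line")>
        <cfbreak>
    </cfif>
    <!--- work with variables.line here --->
</cfloop>
<cfset br.close()>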
Reading it line by line is good, but how many lines need to be read, and how do you work with them after that? This is where the second "solution" comes into play. What I came up with was the following. Because I knew there can be three different "blocks" of tags, HEADER, APP and FOOTER, I created a script that "looks" for the start of one of those blocks. When a match happens, I start saving the lines to a variable until the end of the block is met. As soon as a block is found and read into that variable, I run XmlParse over it (each "block" is a valid XML document on its own, no?) and then read the data from the XML as I normally would.
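Roughly, the idea looks like this for the APP blocks. This is only a sketch (the variable names are mine), and it assumes each opening and closing tag sits on its own line, as in the sample above:
<cfset inBlock = false>
<cfset blockText = "">
<cfloop condition="true">
    <cfset variables.line = br.readLine()>
    <cfif NOT StructKeyExists(variables, "line")>
        <cfbreak>
    </cfif>
    <!--- start collecting when an opening App tag shows up --->
    <cfif NOT inBlock AND Find("<App ", line)>
        <cfset inBlock = true>
    </cfif>
    <cfif inBlock>
        <cfset blockText = blockText & line & chr(10)>
        <!--- the closing tag completes a small, self-contained XML document --->
        <cfif Find("</App>", line)>
            <cfset appXml = XmlParse(blockText)>
            <!--- read the data out of appXml and run the inserts as usual --->
            <cfset inBlock = false>
            <cfset blockText = "">
        </cfif>
    </cfif>
</cfloop>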

After doing that, but still without actually saving the data into the database, I ran the code just to see how well it would work, watching the Server Monitor at the same time to see what impact it had on memory. I did not write down the exact results, but it worked surprisingly well. I mean FAST, and without any noticeable impact on memory.

That being so, I quickly created the database tables, wrote the insert queries and..... here is where the problems started again...
After 35,000-39,000 APP blocks were inserted into the database (note that it wasn't only that number of insert queries, but that number plus the number of PRODUCT blocks that had to be inserted into a separate table), Java was going Out Of Memory. I could have tried to play a bit with the Java settings, but I realized it wouldn't really help, because those 35K blocks were only a small part of the 3GB file, and would be an even smaller part of the 40-50GB one, so I had to think of something else. The next idea was to force Garbage Collection. I thought it would be OK to do it after each 20,000 blocks. Did it... no help. Did it twice in a row... nothing. Same with a 1 second sleep after each of them... same result. I googled for other solutions and found a few other ways to trigger the GC, but got just the same result.... The memory simply didn't want to go down. Maybe it's because the script was still running, I dunno..
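For reference, the straightforward way to request a GC from CFML looks like this. It's only a request, and the JVM is free to ignore it, which may be part of why it didn't help here:
<!--- ask the JVM to run the garbage collector; this is only a hint --->
<cfset CreateObject("java", "java.lang.System").gc()>
<!--- give the collector a moment before continuing --->
<cfset CreateObject("java", "java.lang.Thread").sleep(1000)>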

I was almost lost when I thought... HEY! Why not split the job into pieces? Say, parse 20,000 blocks, do the GC with a short sleep, then create a scheduled task that starts the process from where the previous run stopped, and let the current one finish. The only problem here was the fact that at each step the script still had to read the XML from the start. So I added a piece of code that reads the first N lines in a loop without doing anything with them. It helped. After the current instance of the script did the GC, slept a little, created the "call" to itself with a 1 minute delay and finished, the memory went back down to where it had been. And watching the process run for a few hours, I got confirmation that even at its peaks, the memory wasn't getting over 50% of the limit it had. It was about 4am when I went to sleep, leaving the process to continue, so I am not sure how long it took from start to end, but I think it was something between 4 and 6 hours. Yes... long indeed... but it worked, and a total of 6,811,681 records were inserted into the database.
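In sketch form, the hand-off between runs looked roughly like this. The task name, the importer.cfm script name and the startLine parameter are made up for illustration:
<!--- how far the earlier runs got; passed along by each scheduled call --->
<cfparam name="url.startLine" default="0">

<!--- fast-forward: read and throw away the lines already processed --->
<cfloop from="1" to="#url.startLine#" index="i">
    <cfset variables.line = br.readLine()>
</cfloop>

<cfset linesRead = 0>
<!--- ... parse the next 20,000 blocks here, incrementing linesRead for every line consumed ... --->

<!--- then request GC, pause, and schedule the next run one minute out --->
<cfset CreateObject("java", "java.lang.System").gc()>
<cfset CreateObject("java", "java.lang.Thread").sleep(1000)>
<cfschedule action="update"
            task="resumeXmlImport"
            operation="HTTPRequest"
            url="http://localhost/importer.cfm?startLine=#url.startLine + linesRead#"
            startDate="#dateFormat(now(), 'mm/dd/yyyy')#"
            startTime="#timeFormat(dateAdd('n', 1, now()), 'HH:mm')#"
            interval="once">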

I am sure there must be a much better and faster way to do this, but I don't know it and wasn't able to find any info that would help. The biggest files I've read about being parsed with ColdFusion were 30-50MB, and that's nothing compared to even the sample 3GB one I had.

If anyone knows better ways of doing this, please share. It's something that is rarely needed, but when you have to do it, it's better to have the solution from the start. If not... at least I hope someone will find my approach useful for their needs when/if the moment comes.

Have a nice day everyone!