Did i say large? Sorry guys, i wanted to say HUGE!
Few days ago i've got an interesting problem. An XML had to be parsed and the data inserted into the database. I know it may be done with sql + some visual basic code, but because i had no clue on how really to do it and after spending some time googling and not finding what i REALLY needed i went on doing it with ColdFusion. But i run ahead of the train, so let me start over...

The XML was of the following structure:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<catalog version="1.0">
      <Company>some company name</Company>
      <SenderName>the name</SenderName>
      <SenderPhone>some phone</SenderPhone>
      <SenderPhoneExt>phone extension</SenderPhoneExt>
      <TransferDate>date of the transfer</TransferDate>
      .... some other nodes here ...
   <App action="A" id="1" validate="yes">
      <tag1 id="270" />
      <tag2 id="75" />
         <![CDATA[ some possible text here]]>
      <tag3 id="6864" />
         <tag5>rgreg r</tag5>
            <![CDATA[ rgrgrg ]]>
         ... some more nodes ...
            <![CDATA[ uyj r wtyy et rtret ]]>
         ... some more nodes ...
   <App action="A" id="2" validate="yes">
      <tag1 id="23" />
      <tag2 id="435" />
         <tag5>rg reg r</tag5>
            <![CDATA[ rgrg wefe rg ]]>
         ... some more nodes ...

Note that the number of APP nodes as well as the number of PRODUCT nodes inside the APP can be different. The number of nodes as inside APP, so inside PRODUCT may differ as well, but are known all possible variants that may exist.
So far, there is nothing really complicated and ColdFusion can "eat" all that without even to notice it. But what would you tell if the XML file will have 40GB-50GB? The sample i've got to play with was only 3GB and this is still not something that can be parsed easily, huh? How many possible resources would the server need to handle that? I don't think CFFILE will even read/load it, not talking about using XMLParse on it and then doing the usual parsing work. Here is where i started to think about something sql server native. But because as explained earlier the structure of each APP and PRODUCT may be different, plus the fact that i never did it with sql and i didn't find anything easy enough to setup that would do the work for me, i had to start thinking on ways to do it with ColdFusion itself.

First of all i needed the file to be read. because obviously it can't be done for the entire file, i went from the idea to do it line by line. Something like:
<cfset fileReader = CreateObject("java", "java.io.FileReader")>
<cfset fileReader.init("thePathToMYXMLFile\theFile.xml")>
<cfset br = CreateObject("java", "java.io.BufferedReader")>
<cfset br.init(fileReader)>
<cfset line = br.readLine()>
used correctly would do it for you.
Reading it line by line is good, but how many are needed to be read and how to work them out after that. Here is where the second "solution" come into play. What i came with was the following. Because i knew there may exist 3 different "blocks" of tags HEADER, APP and FOOTER, i created the script that would "look" for the start of one of that blocks. When the match happen, i start saving the lines to a variable till the end of the block is met. As soon as a block is found and read into that variable, i use the XMLParse over it (because each "block" ia a valid XML, no?) and then do the usual read of the data from XML as normally would do.

After doing it, but still without to realy save the data into the database, i have run the code just to see how well it will work, same time looking at Server Monitor to see what impact it will have over the memory. I did not write down the exact results, but it worked weirdly well. I mean FAST and without any noticeable impact on the memory.

Being so, i created quickly the database tables, wrote the insert queries and..... here is where problems started again...
After 35,000-39,000 APP blocks inserted into the database (note that it wasn't only that amount of insert queryes, but that number plus the number of PRODUCT blocks that had to be inserted to a separate table), Java was going Out Of Memory. I could try to play a bit with java settings, but i realised it wouldn't really help because that 35K blocks was only a little part of the 3GB file and it would be even less of the 40GB-50GB one, so i had to think on something else. The next idea was to do Garbage Collection. I thought it would be ok to do it after wach 20,000 blocks. Did it.. no help. Did it twice in a row.. nothing. Same with 1 second sleep after each of them.. same result. Google for other solutions, found somewhere other ways to do the GC, but just same result.... The memory just didn't want to go down. Maybe it's because the script was still running, i dunno..

I was almost lost when i thought.. HEY! why not split the job into pieces... Say do the 20,000 blocks parse, do the GC with some short sleep, then create a scheduled task that will start the process from where the previous stopped and finish the current one. The only problem here was the fact that on each step the script had still to read the XML from the start. So i added a piece of code that would read first N lines in a loop without any action on them. It helped. After the current instance of the script was doing the GC, sleeping a little bit, creating the "call" of itself with a 1 minute delay and finishing, the memory was going down to where it was. And watching the process running for few hours i got confirmation that even on picks, the memory wasn't getting even over 50% of the limit it had. It was about 4am when i went to sleep leaving the process to continue, so i am not sure how long it took from the start till the end, but i think it was something between 4 and 6 hours. Yes.. long indeed.. but it worked and a total number of 6,811,681 records has been inserted into the database.

I am sure there should exist a much better and faster way to do it, but i don't know it and wasn't able to find any info that would help on this. The biggest files i've read about as being parsed with coldfusion was of 30MB-50MB and that's nothing compared to even the sample 3GB one i had.

If anyone know better ways of doing it, please share. It's something that is rarely used, but when you have to do it is better to have the solution from the start. If not... at least i hope someone will find my way useful for their needs when/if the moment will appear.

Have a nice day everyone!