public/XMLdigest/Discussion

[GJ 2014-04-28 10:40:13] Carl, I'm following this page with much interest. I especially liked the Fortran example. Here are a few comments for your consideration:

  1. Both the Fortran and Python examples read the animal data (e.g. the pedigree file) and the associated data (e.g. the performance file) completely into memory. This approach might not be practical when the files contain many millions of records, so a warning about memory requirements for large datasets might be in order.
  2. The Fortran program performs a linear search in the animal list for each record in the associated data. Again, this can get very slow when both files contain many records. In the old-fashioned approach, both files would first be sorted by animal identifier; the merge-and-write step then requires trivial memory and is very fast (see the first sketch after this list).
  3. The Python program builds the complete element tree in memory before writing it to disk. I'm curious how memory intensive this is for very large XML files (the second sketch after this list shows one way to write the output incrementally).
  4. It would be of considerable interest to do some practical testing with very large data files. You have suggested in the past that compression should work well for XML files because repeated tags get stored efficiently, but it would be nice to see some real examples with very large files.
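
To make point 2 concrete, here is a minimal sketch of the sorted-merge idea in Python. The file names are only placeholders, and I assume plain-text files that are already sorted (as text) by animal identifier, with the identifier as the first whitespace-separated field:

    def records(path):
        # Yield (animal_id, line) pairs one at a time, so the file is never held in memory.
        with open(path) as handle:
            for line in handle:
                fields = line.split()
                if fields:
                    yield fields[0], line.rstrip("\n")

    def merge_join(pedigree_path, performance_path, out_path):
        # Both inputs must already be sorted the same way by animal identifier.
        ped = records(pedigree_path)
        perf = records(performance_path)
        ped_rec = next(ped, None)
        perf_rec = next(perf, None)
        with open(out_path, "w") as out:
            while ped_rec is not None and perf_rec is not None:
                if ped_rec[0] == perf_rec[0]:
                    # Matching animal: write the combined record, then fetch the
                    # next performance record (an animal may have several).
                    out.write(ped_rec[1] + "\t" + perf_rec[1] + "\n")
                    perf_rec = next(perf, None)
                elif ped_rec[0] < perf_rec[0]:
                    ped_rec = next(ped, None)
                else:
                    perf_rec = next(perf, None)

    merge_join("pedigree.sorted.txt", "performance.sorted.txt", "merged.txt")

This also covers point 1, since neither file is loaded completely into memory.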

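And for point 3, a minimal sketch of one way to avoid building the whole tree: serialize and write each animal element as soon as it is complete. The element and field names here are placeholders, not the actual format used in the examples on this page:

    import xml.etree.ElementTree as ET

    def write_animals(animal_stream, out_path):
        # Serialize and write one <animal> element at a time instead of building
        # the complete tree; only one small element lives in memory at any moment.
        with open(out_path, "w", encoding="utf-8") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n<animals>\n')
            for animal_id, weight in animal_stream:
                elem = ET.Element("animal", id=str(animal_id))
                ET.SubElement(elem, "weight").text = str(weight)
                out.write(ET.tostring(elem, encoding="unicode") + "\n")
            out.write("</animals>\n")

    # In practice a generator reading the input files would feed the records in one by one.
    write_animals([("101", "350.0"), ("102", "410.5")], "animals.xml")
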
[CW 2014-05-02] Thank you, Gerald, for the good points (and also for hunting down the nasty bug in the Fortran program).

  1. Since the programs here are just examples of the logic of dealing with XML files and are not meant for production use, I have not optimized them in any way. Performance might very well drop drastically with bigger files; a real program would of course have to take measures for efficient memory handling and so on. I have added a warning about this to the page.
    Also, I would argue for using XML to exchange data, not to store it (use a database for that!), so the number of animals per file can be limited in most cases, i.e. send several smaller files rather than one big one. For IDEA, for example, we limit the number of animals in one file to 1 million.

  2. Yes, that would be an improvement, but outside the scope of these examples.
  3. We (GJ and I) have experimented with big files, and XML performance turned out to be acceptable. In many ways the JSON format performs better, but I would argue that the tools and support for XML are far more widespread, and the XML-related standards (XPath, etc.) are more coherent. So it is a trade-off.
    After compression there is a negligible difference between XML and JSON in terms of file size (a sketch of such a size comparison follows this list). GJ, correct me if I am wrong!

  4. See 3.
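
For anyone who wants to repeat the size comparison, here is a minimal sketch; the file names are placeholders and no particular numbers are implied:

    import gzip
    import os
    import shutil

    def gzipped_size(path):
        # Write a gzip-compressed copy of the file and return its size in bytes.
        gz_path = path + ".gz"
        with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return os.path.getsize(gz_path)

    # Compare raw and compressed sizes of an XML and a JSON version of the same data.
    for path in ("animals.xml", "animals.json"):
        print(path, os.path.getsize(path), "->", gzipped_size(path))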
