public/XMLdigest/Discussion
- [GJ 2014-04-28 10:40:13] Carl, I'm following this page with much interest. I especially liked the Fortran example. Here are a few comments for your consideration:
- Both the Fortran and Python examples read the animal data (e.g. the pedigree file) and the associated data (e.g. the performance file) entirely into memory. This approach might not be practical when the files contain many millions of records. A warning about memory requirements for large datasets might be in order.
- The Fortran program performs a linear search in the adata list for each record in the associated data, so the work grows with the product of the two record counts. Again, this might get very slow if both files contain many, many records. In the old-fashioned approach, both files would first be sorted by animal identifier; the merge-and-write step would then need only trivial memory and be very fast.
- The Python program builds the complete element tree in memory before writing it to disk. I'm curious how memory-intensive this is for very large XML files.
- It would be of considerable interest to do some practical testing with very large data files. You have suggested in the past that compression should work well for XML files because repeated tags get stored efficiently, but it would be nice to see some real examples with very large files.
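
For illustration, the sort-then-merge approach mentioned above could be sketched in Python roughly as follows. This is only a minimal sketch, not the code from the page: the function name `merge_sorted`, the `(animal_id, value)` tuple layout, and the `<animals>`/`<animal>`/`<record>` element names are all assumptions made up for the example. It assumes both inputs are already sorted by animal identifier using the same (string) ordering, and it writes the XML as it goes instead of building an element tree first.

```python
import io
from xml.sax.saxutils import escape, quoteattr


def merge_sorted(animals, records, out):
    """Stream-merge two iterables of (animal_id, value) pairs.

    Both iterables must be sorted by animal_id with the same ordering.
    Only the current animal and the current record are held in memory.
    Element and attribute names are hypothetical.
    """
    out.write("<animals>\n")
    records = iter(records)
    rec = next(records, None)
    for animal_id, name in animals:
        # Skip records whose id sorts before this animal (no matching animal).
        while rec is not None and rec[0] < animal_id:
            rec = next(records, None)
        out.write(f"  <animal id={quoteattr(animal_id)} name={quoteattr(name)}>\n")
        # Emit every record that belongs to this animal.
        while rec is not None and rec[0] == animal_id:
            out.write(f"    <record>{escape(rec[1])}</record>\n")
            rec = next(records, None)
        out.write("  </animal>\n")
    out.write("</animals>\n")


if __name__ == "__main__":
    # Tiny in-memory demo; real input would be two sorted files read lazily.
    demo_animals = [("A1", "Bess"), ("A2", "Clover")]
    demo_records = [("A1", "milk=30"), ("A1", "milk=32")]
    buf = io.StringIO()
    merge_sorted(demo_animals, demo_records, buf)
    print(buf.getvalue())
```

With real files, `animals` and `records` could be generators that yield one parsed line at a time (for example from `csv.reader`), and `out` an open output file, so memory use stays roughly constant regardless of file size. The sorting itself would still have to be done beforehand, for instance with an external sort utility.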