Very large files (GByte, millions of molecules) with mview

User 677b9c22ff

10-11-2008 21:32:00

Peter,


I have some general comments for (very) large files, just for discussion.


Most mol viewers and text editors try to read in the whole file and then suffer


from memory errors.





Assume you have a 50 GByte file with molecules (as in PubChem or generated with the Markush generator) and at any given time you look at a specific set of 10 molecules in the spreadsheet view or 10x10 in the matrix view.

That means at any given time only 100 molecules plus the SDF fields need to be cached. That would be a maximum of, let's say, 100 MByte of memory (one megabyte for each molecule with text).





If you assume that in most cases the files are homogeneous, as in the case of SMILES or with constant SDF fields, there is no reason to read in or count the whole file in the first place, and there is also no reason to start caching all the molecule views. This also assumes a fast hard disk RAID array, SSDs or a RAM disk for real-time scrolling.





1) The program reads the file length of the SDF file.
2) The program determines the size of the right scrollbar.
3) If I have a 100 GByte molecule file and I move the scrollbar to 50%, it will move the file pointer to 50% of the file size (50 GByte).
4) If I move to the bottom, it will move to the end of the file.
5) If I use page-up and page-down, it will read in exactly the number of molecules in the matrix (forward or backward) by determining an overlap molecule and starting from there.
6) Possible import errors (the file pointer is in the middle of a molecule) are caught by an exception handler. For SMILES this would be EOF and for SDF $$$$ or M  END.
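
Steps 3 and 6 can be sketched in plain Java, under a few assumptions: the file is an uncompressed SDF file, records end with a line starting with "$$$$", and the class and method names (SdfSeek, recordStartAt) are made up for this illustration. This is not how MarvinView works internally; it only shows how a scrollbar position could be turned into a byte offset that sits on a record boundary.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of the Marvin API.
final class SdfSeek {
    /**
     * Maps a scrollbar position (0.0 to 1.0) to the byte offset of the first
     * SDF record that starts after that position.
     * Returns the file length if no record starts after the position.
     */
    static long recordStartAt(RandomAccessFile file, double fraction) throws IOException {
        long length = file.length();
        long target = (long) (fraction * length);
        if (target <= 0) {
            return 0L; // the top of the file is always a record start
        }
        // Step back one byte and read one line: this discards the partial
        // line we landed in (or an empty string if we hit a line start).
        file.seek(target - 1);
        file.readLine();
        // Now scan whole lines until the record delimiter has been consumed.
        String line;
        while ((line = file.readLine()) != null) {
            if (line.startsWith("$$$$")) {
                return file.getFilePointer();
            }
        }
        return length; // no further record in the file
    }
}
```

RandomAccessFile.readLine() is unbuffered, which should be acceptable here because only the remainder of one record is scanned. The returned offset could then be handed to whatever reads the visible molecules; the molecule that was cut in half is simply skipped instead of raising an import error.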





This way no real ID numbers or molecule numbers are available, but the viewer could be very fast. Currently it reads in all molecules and counts them as it goes, which makes it very slow for large molecule sets. I am currently looking into the API examples, but they give no direct way to see how the molecules are read into the viewer (with MolImporter, I guess). The example SimpleViewer.java took the molecules from SMILES. Furthermore, the implementation here would be very static and not as flexible as mview, but faster.





The reason not to load the file into a DB is to save the time of fingerprint generation; this would really just serve as a molecule viewer.





Cheers


Tobias

User 677b9c22ff

11-11-2008 01:13:41

Hi,


I just performed a new comparison (with the old Marvin View 5.1.2 version) and it seems that as long as the input file is pure SMILES, mview has no problem at all: I can scroll in real time through several million compounds, and the substances can be as diverse as those from PubChem (minimum tested here: 10 million compounds in a single file).





However, if there are any additional fields in the SMILES file or if the file is an SDF file, one cannot scroll in real time through the file (minimum tested here: 20 million compounds in a single file). Either the heap space is not sufficient (even if the file is 10x smaller), or there are some issues with chemaxon.marvin.view.MDocStorage that either slow down the whole scrolling process or prevent it from starting at all. For SMILES the memory use is also smaller than for SD (as expected).





Even if the additional view of the SDF fields is turned off, somehow the fields are still processed, or the sheer file size is a problem. A SMILES file with 10 million compounds can be about 1.5 GByte in size; an SD file with 10 million compounds can be about 15 GByte.





So besides the possible MDocStorage issue, reading and processing all of 15 GByte can also take some time (in this case it would take around 70 seconds just to read through the whole file; my RAID 6 array has a saturated read speed of 200 MByte/second). So maybe it's a combined issue of storing all the additional information and reading through the whole file itself. As discussed in the post above, ignoring the number of molecules and just reading through the file according to the file pointer (set via the scrollbar) would be much faster, and probably also faster for processing the SD fields. Maybe I can figure out something using the API.





Cheers


Tobias

User ef5e605ae6

11-11-2008 07:32:16

Hi Tobias,
TobiasKind wrote:



3) If I have a 100 Gbyte molecule file and I move the scrollbar to 50% it will move the filepointer to 50% of the filesize (50 Gbyte).


Unfortunately, if you move the file pointer to 50% of the file size, there is about a 99.9% chance that what you find there is NOT the beginning of a molecule record. Molecule importer modules fail to read anything if you try to start reading in the middle of a record. For single-line formats (like SMILES) or SDF, some tricky solutions may be found, such as looking for the nearest end of line or for "$$$$", but there is a huge number of other formats.
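
For the single-line case, the "nearest end of line" idea could look roughly like the sketch below. It assumes an uncompressed SMILES file with one record per line; the class and method names (SmilesPage, read) are invented for this example and are not part of any Marvin API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Marvin API.
final class SmilesPage {
    /**
     * Reads up to `count` complete lines, starting with the first line that
     * begins at or after the given relative position (0.0 to 1.0).
     */
    static List<String> read(RandomAccessFile file, double fraction, int count) throws IOException {
        long target = (long) (fraction * file.length());
        if (target > 0) {
            // Step back one byte and read one line: if we landed on a line
            // start this consumes only the preceding line break, otherwise
            // it discards the partial line.
            file.seek(target - 1);
            file.readLine();
        } else {
            file.seek(0L);
        }
        List<String> lines = new ArrayList<>();
        String line;
        while (lines.size() < count && (line = file.readLine()) != null) {
            if (!line.trim().isEmpty()) {
                lines.add(line); // one SMILES record (plus optional fields) per line
            }
        }
        return lines;
    }
}
```

The returned lines are always whole records, so they could be passed to an importer without hitting the mid-record failure described above. The same trick does not carry over to formats without a recognizable record delimiter.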





cheers,


Peter

User ef5e605ae6

11-11-2008 08:29:26

Quote:
I just performed a new comparison (with the old Marvin View 5.1.2 version)


...


So besides the possible MDocStorage issue, reading and processing


all of 15 GBytes can also take some time (in this case it would take around


70 seconds to just read through the whole file
"Pre-reading" is more than 3 times faster in 5.1.3 than in 5.1.2. An additional improvement was also implemented, but currently only in the development branch (not sure yet whether it will appear in 5.1.4): "pre-reading" starts automatically and immediately when you open the file.


Peter

User 677b9c22ff

15-11-2008 03:16:37

Hello Peter,


Thank you for looking into that. I will try to see if I can figure something out with my (limited) Java programming knowledge, using the buffer concept mentioned above.





It must work with such a concept, because with a hex viewer I can scroll in real time through files of any size (in the GByte range). The scroll bar only sets the file pointer, so it is real time. But as said before, no molecule counting is performed; it's a bare viewer without ID numbers. The only requirement is that a constant small buffer of some MByte is filled after the mouse button is released, and those structures are rendered from SMILES or SDF and their gzipped formats (no other format needed here).
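
The constant small buffer could be filled with one bulk read, roughly as in the following sketch. The assumptions: an uncompressed SDF file with Unix line endings, records terminated by "$$$$" followed by a newline, and a window size chosen by the caller. SdfWindow and completeRecords are names made up for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Marvin API.
final class SdfWindow {
    private static final String DELIMITER = "$$$$\n"; // assumes Unix line endings

    /**
     * Reads at most windowSize bytes starting at offset and returns the
     * complete SDF records inside that window. A partial record at the
     * start or at the end of the window is dropped.
     */
    static List<String> completeRecords(RandomAccessFile file, long offset, int windowSize)
            throws IOException {
        long remaining = Math.max(0L, file.length() - offset);
        byte[] window = new byte[(int) Math.min(windowSize, remaining)];
        file.seek(offset);
        file.readFully(window); // one bulk read, memory use bounded by windowSize
        String text = new String(window, StandardCharsets.ISO_8859_1);

        List<String> records = new ArrayList<>();
        int start = 0;
        if (offset > 0) {
            // We do not know whether offset is a record start, so skip
            // everything up to and including the first delimiter.
            int first = text.indexOf(DELIMITER);
            if (first < 0) {
                return records;
            }
            start = first + DELIMITER.length();
        }
        int end;
        while ((end = text.indexOf(DELIMITER, start)) >= 0) {
            end += DELIMITER.length();
            records.add(text.substring(start, end));
            start = end;
        }
        return records;
    }
}
```

With a window of a few MByte the memory use stays constant regardless of file size. Each returned string is one complete record that could be handed to an importer for rendering.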





Cheers


Tobias

User ef5e605ae6

18-11-2008 11:34:24

Hi Tobias,





You are right, every hex viewer should work this way. Unfortunately, the data structures MarvinView must handle are slightly more complex than the bytes displayed by your hex viewer.


I will consider your idea but I cannot promise anything yet.


At the moment, only one thing is sure: seeking in a gzipped file is absolutely impossible, you cannot avoid reading it through. When you view a non-seekable file (gzipped or standard input), MarvinView creates a temporary uncompressed file containing the binary serialized molecules. Creating the temporary file is a slow process, but when it is completed, you can scroll in it quickly. Temporary file creation could only be avoided if you do not ever want to scroll backwards.
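
The temporary-file approach can be illustrated with standard Java classes. This is a simplified sketch, not MarvinView's actual code: it stores the decompressed text as is rather than binary serialized molecules, and the names GzipToTemp and inflate are invented for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the Marvin API.
final class GzipToTemp {
    /**
     * Decompresses a gzipped file into a temporary file and returns its path.
     * The whole input has to be read once; afterwards the copy is seekable.
     */
    static Path inflate(Path gzipped) throws IOException {
        Path temp = Files.createTempFile("molecules", ".tmp");
        temp.toFile().deleteOnExit(); // clean up when the JVM exits
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzipped), 1 << 16)) {
            Files.copy(in, temp, StandardCopyOption.REPLACE_EXISTING);
        }
        return temp;
    }
}
```

The returned path can be opened with RandomAccessFile and scrolled in both directions. The price is the full sequential read up front plus the disk space for the uncompressed copy.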


Just let us restrict your idea to uncompressed data. If you want to store and use gigabyte-size text files efficiently, then you should use a compressed filesystem anyway.





cheers,


Peter