User 677b9c22ff
11-02-2005 23:35:11
Hi,
How can I extract *all* fields from an SDF file without telling molconvert which fields the file contains (like an automated parser)?
I know that I can do it for fields I name explicitly, for example extracting the MW field:
molconvert smiles billion.sdf -T MW -o test.smi
Thank you
Tobias
ChemAxon 7c2d26e5cf
14-02-2005 14:59:38
Dear Tobias,
It is not clear exactly what you would like to do.
Do you want to collect all field names in an SD file and add all of those fields to each structure?
Molconvert cannot do this. It can export only those fields that have a value.
ChemAxon 7c2d26e5cf
14-02-2005 15:27:24
OK, I think I understand what you mean. The problem is that some fields may occur only at the end of the SD file, so molconvert would have to read the whole file before generating the header. Since the file can be very big, MolConverter could quickly run out of memory. (Reading the file twice is not an option, because the file can come from the standard input.)
User 677b9c22ff
14-02-2005 21:09:26
Hi Tamas,
thanks for your help. Ok I should write my own parser using the SDF Toolkit from Bruno Bienfait,
http://cactus.nci.nih.gov/SDF_toolkit/
Quote: |
Since the file can be very big, MolConverter could quickly run out of memory. |
But if the file does *not* come from the standard input, it could be read sequentially, filling only small portions of memory at a time (a kind of buffered read), so it would never run out of memory. But I know it's impossible to write a software Swiss Army knife :-)
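A minimal Python sketch of such a sequential read, assuming the usual SD layout: data headers of the form `> <NAME>`, values terminated by a blank line, and records terminated by `$$$$`. The function name `iter_sd_records` is hypothetical and is not part of molconvert or the SDF Toolkit.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional

# Assumption: a data header line starts with ">" and carries the field
# name in angle brackets, e.g. ">  <MW>" or "> 25 <MELTING.POINT>".
DATA_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def iter_sd_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the data fields of one SD record at a time.

    Only the current record is held in memory, so the memory needed
    depends on the largest record, not on the size of the file.
    """
    fields: Dict[str, str] = {}
    name: Optional[str] = None   # field whose value is being read
    value: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            # End of record: flush a field that had no closing blank line.
            if name is not None:
                fields[name] = "\n".join(value)
            yield fields
            fields, name, value = {}, None, []
        elif name is not None:
            if line.strip():
                value.append(line)
            else:
                # A blank line terminates the current field value.
                fields[name] = "\n".join(value)
                name, value = None, []
        else:
            match = DATA_HEADER.match(line)
            if match:
                name, value = match.group(1), []
    # Tolerate a final record that is not closed by "$$$$".
    if fields or name is not None:
        if name is not None:
            fields[name] = "\n".join(value)
        yield fields
```

Passing an open file object (for example `open("input.sdf")`) as `lines` reads the file line by line. For a file on disk, one pass can collect the union of field names and a second pass can write every record with the complete header.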
With kind regards
Tobias Kind
ChemAxon 43e6884a7a
15-02-2005 11:43:07
How about a UNIX shell script solution?
Code: |
$ molconvert smiles -T `grep "<.*>" input.sdf | sed "s/^.*<//" | sed "s/>.*$//" | sort | uniq | tr "\n" ":"` input.sdf |
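For comparison, a rough Python equivalent of the pipeline above. This is a sketch under assumptions: it treats only lines that start with `>` as data headers, so angle brackets inside data values are not picked up, and it joins the names with `:` as the shell version does. The helper names `field_names` and `molconvert_args` are hypothetical; the script only builds the command line and does not run molconvert.

```python
import re
from typing import Iterable, List

# Assumption: data header lines start with ">" and hold the field name
# in angle brackets, e.g. ">  <MW>".
DATA_HEADER = re.compile(r"^>.*?<([^<>]+)>")


def field_names(lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated field names (like sort | uniq)."""
    names = set()
    for line in lines:
        match = DATA_HEADER.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)


def molconvert_args(names: List[str], sdf_path: str) -> List[str]:
    """Build the argument list for the molconvert call shown above."""
    # ":".join() leaves no trailing separator, unlike tr "\n" ":".
    return ["molconvert", "smiles", "-T", ":".join(names), sdf_path]
```

With an open file, `molconvert_args(field_names(fh), "input.sdf")` gives a list that could be handed to `subprocess.run`. The scan keeps only the set of names in memory, so it also works on very large files.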
User 677b9c22ff
03-03-2006 06:15:07
Hi Bruno,
Cool, but which version? v1.11 does not know this command.
Tobias
User 677b9c22ff
04-03-2006 00:52:03
Hi Peter,
Thanks a lot for the update. I really like the responsiveness of ChemAxon. There are always several solutions to a given problem, but with this kind of support the user does not feel lost.
For playing around, the PubChem database now has 8 million compounds and the flat file is around 30 GB.
ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
And this is only the small part. With a molecular isomer generator like MOLGEN (http://www.mathe2.uni-bayreuth.de/molgen4/) you can start generating all small molecules. Here we are talking about billions of isomer structures (which is OK as long as they are stored as SMILES), but trillions of molecules cause problems (and not only storage problems). This is coming soon, and it will separate the wheat from the chaff among software. Can I handle 100,000 molecules quickly, or can I handle 100,000,000,000 quickly? (Maybe not with Windows XP, but with the Datacenter Edition :-)
BTW, MarvinView has a buffer size of only 2048 bytes, so it reads larger files very slowly. Raising the buffer to 64 KB would make it much quicker.
Do you know if the "memory problem" is solved, so that I can scroll through a large SD file without reading it all into memory?
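One way scrolling without loading the whole file could work, sketched in Python: scan the file once with a 64 KB buffer, remember the byte offset where each record starts, and then seek to a record on demand. This is an illustration only and says nothing about how MarvinView is implemented; `index_records` and `read_record` are hypothetical names.

```python
from typing import List

BUFFER_SIZE = 64 * 1024  # 64 KB read buffer, the size suggested above


def index_records(path: str) -> List[int]:
    """Return the byte offset at which each SD record starts.

    Only one integer per record is kept in memory.
    """
    offsets = [0]
    pos = 0
    with open(path, "rb", buffering=BUFFER_SIZE) as fh:
        for line in fh:
            pos += len(line)
            if line.startswith(b"$$$$"):
                offsets.append(pos)  # the next record starts here
    # Drop the last offset if it points at the end of the file.
    if offsets[-1] >= pos:
        offsets.pop()
    return offsets


def read_record(path: str, offset: int) -> str:
    """Read the single record that starts at the given byte offset."""
    lines = []
    with open(path, "rb", buffering=BUFFER_SIZE) as fh:
        fh.seek(offset)
        for line in fh:
            lines.append(line)
            if line.startswith(b"$$$$"):
                break
    return b"".join(lines).decode("utf-8", errors="replace")
```

A viewer built this way would call `index_records` once and then `read_record(path, offsets[i])` for whichever record is scrolled into view, so only the index and the visible records are held in memory.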
Kind regards
Tobias