How to extract all fields from an SDF file?

User 677b9c22ff

11-02-2005 23:35:11

Hi,





how can I extract *all* fields from an SDF file, without telling molconvert which fields are inside an SD file (like an automated parser)?





I know that I can do that with multiple fields like extracting the MW field:


molconvert smiles billion.sdf -T MW -o test.smi





Thank you


Tobias

ChemAxon 7c2d26e5cf

14-02-2005 14:59:38

Dear Tobias,


It is not clear what you would like to do exactly.


Do you want to collect all field names in an SD file to add all fields to each structure?


Molconvert cannot do this. It can export only those fields that have a value.

ChemAxon 7c2d26e5cf

14-02-2005 15:27:24

OK, I think I understand what you mean. The problem is that some fields may occur only at the end of the SD file, so molconvert would have to read the whole file before generating the header. Since the file can be very big, MolConverter could quickly run out of memory. (Reading the file twice is not an option, because the file can come from standard input.)

User 677b9c22ff

14-02-2005 21:09:26

Hi Tamas,





thanks for your help. OK, I should write my own parser using the SDF Toolkit from Bruno Bienfait, http://cactus.nci.nih.gov/SDF_toolkit/
Quote:
Since the file can be very big, MolConverter could quickly run out of memory.
But if the file does *not* come from standard input, it could be read in a sequential way, filling only small portions of memory (a kind of buffered read), so it would never run out of memory. But I know it's impossible to write a software Swiss army knife :-)





With kind regards


Tobias Kind

ChemAxon 43e6884a7a

15-02-2005 11:43:07

How about a UNIX shell script solution?





Code:
$ molconvert smiles -T `grep "<.*>" input.sdf | sed "s/^.*<//" | sed "s/>.*$//" | sort | uniq | tr "\n" ":"` input.sdf

User e81aa85d78

02-03-2006 12:56:37

TobiasKind wrote:
Hi Tamas,





thanks for your help. OK, I should write my own parser using the SDF Toolkit from Bruno Bienfait, http://cactus.nci.nih.gov/SDF_toolkit/
Quote:
Since the file can be very big, MolConverter could quickly run out of memory.
But if the file does *not* come from standard input, it could be read in a sequential way, filling only small portions of memory (a kind of buffered read), so it would never run out of memory. But I know it's impossible to write a software Swiss army knife :-)





With kind regards


Tobias Kind

Hello all,





the new version of the SDF Toolkit includes a new tool, "extract_all_prop_sdf".





Here is the help from the tool:





---------------------------------------------------------------------------


extract_all_prop_sdf -h


Example usage : extract_all_prop_sdf [-help] [-regexp '/Salt/i'] < input.sdf > props.csv


extract_all_prop_sdf reads an MDL SDF file and extract all properties. If the -regexp option is used, only the properties matching the given regular expression will be written





The output is a CSV file.


-------------------------------------------------------------------------





Note that this script does not solve the memory problem that can occur with large amounts of data in the SDF.





Bruno Bienfait

User 677b9c22ff

03-03-2006 06:15:07

Hi Bruno,





Cool! However, which version? v1.11 does not have this command.





Tobias

User e81aa85d78

03-03-2006 08:49:22

TobiasKind wrote:
Hi Bruno,





Cool! However, which version? v1.11 does not have this command.





Tobias
Version 2 is not on the NIH web page. I have sent you an email with the new package.








Bruno

User ef5e605ae6

03-03-2006 11:51:04

TobiasKind wrote:
how can i extract *all* fields from an SDF file, without telling molconvert which fields are inside a SD file (like an automated parser)?





I know that I can do that with multiple fields like extracting the MW field:


molconvert smiles billion.sdf -T MW -o test.smi
I have implemented the * wildcard; this will work in the Marvin 4.1 release:


molconvert smiles billion.sdf -T "*" -o test.smi





(In the meantime, you can do the same using the long command line proposed by Ferenc.)

User 677b9c22ff

04-03-2006 00:52:03

Hi Peter,


thanks a lot for the update. I really like ChemAxon's responsiveness. There are always several solutions to a given problem, but this way the user never has the feeling of being lost.





For playing around: the PubChem database now has 8 million compounds, and the flat file is around 30 GB.


ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/





Now this is only the small part. Running a molecular isomer generator like MOLGEN (http://www.mathe2.uni-bayreuth.de/molgen4/), you can start generating all small molecules. But here we are talking about billions of isomer structures (which is OK as long as they are in SMILES), and trillions of molecules create problems (not only space problems). This is soon to come, and it will separate the wheat from the chaff among software. Can I handle 100,000 molecules quickly, or can I handle 100,000,000,000 quickly? (Maybe not with WinXP, but with the Datacenter Edition :-)





BTW, MarvinView has a buffer size of only 2048 bytes, so it reads larger files very slowly; increasing it to 64 KB would make it much quicker.





Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?





Kind regards


Tobias

User ef5e605ae6

06-03-2006 11:41:50

Hi Tobias,
TobiasKind wrote:
For playing around: the PubChem database now has 8 million compounds, and the flat file is around 30 GB.


ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
Thanks. These files are relatively small. I tried Compound_00000001_00010000.sdf.gz, but it contains only 9962 structures and no data fields.
Quote:
Can I handle 100,000 molecules in a quick way, or can I handle 100,000,000,000 in a quick way?
Do you mean converting them to a table containing SMILES and all data fields? That is already possible with files of any size, but I have not performed speed tests. If SMILES conversion is not important, then the previously proposed UNIX command line (or its Perl equivalent) is evidently faster than molconvert; after all, SDF field collection is not a molecule conversion task anyway. If you and/or other users need it, then we can consider implementing


grep "<.*>" input.sdf | sed "s/^.*<//;s/>.*$//" | sort | uniq


in Java as an additional command line tool, "sdf_fields".
Quote:
BTW MarvinView has a buffer size of only 2048 bytes so it reads larger files very slow, an adjustment to 64 KBytes would be much quicker.
I do not think so; system time seems to be negligible in both cases. (I tested molconvert null Compound_00000001_00010000.sdf.gz and a similarly large SMILES file.) However, the default buffer size of BufferedInputStream was increased to 8192 bytes in Java 1.5, so I have also increased Marvin's input buffer size to this value.
Quote:
Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?
Still unsolved, but it is on our task list, among a lot of other requests. (The complication is that basic changes are needed in the viewer for this improvement; a lot of code has to be rewritten while maintaining backward compatibility.)





Peter

User 677b9c22ff

07-03-2006 06:06:13

Hi Peter,
Quote:
Thanks. These files are relatively small. I tried Compound_00000001_00010000.sdf.gz, but it contains only 9962 structures and no data fields.



well, they have ~30 properties or fields each, and if you combine all the files, you get a file containing 5 million structures with properties, about 30 GB in size.
Quote:
Do you mean converting them to a table containing smiles and all data fields?



No, I think the SDF extraction problem is solved; this was more free thinking about software. Sure, I can put everything into an Oracle database and work there, but still: is all the software around fast enough? (See http://www.chemaxon.com/jchem/FAQ.html#benchmark3)


What happens if you expand your database from 3 million to 3 billion structures? Does the search time for "O=Cc1ccccc1" then go from 85 sec to 85,000 sec, i.e. about one day?
Quote:
Still unsolved but it is on our task list,



That's good, because it's a real pain to work with larger files.





Tobias

ChemAxon a3d59b832c

07-03-2006 16:20:05

TobiasKind wrote:
No, I think the SDF extraction problem is solved; this was more free thinking about software. Sure, I can put everything into an Oracle database and work there, but still: is all the software around fast enough? (See http://www.chemaxon.com/jchem/FAQ.html#benchmark3)


What happens if you expand your database from 3 million to 3 billion structures? Does the search time for "O=Cc1ccccc1" then go from 85 sec to 85,000 sec, i.e. about one day?
I think above a certain number of molecules the only option is to store them in a database. Otherwise, they get unmanageable. (And it may be advisable to use a compact format, e.g. smiles.)





About search speed: searching in a database is always faster than in flat files, because the database search uses a fast pre-screening phase. In general, the search speed depends more on the number of structures screened (roughly the number of hits) than on the database size, so the exact time increase may vary depending on your query and database.
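The pre-screening idea can be illustrated with a toy Python snippet (this shows the general principle only, not JChem's actual implementation): a molecule can contain the query substructure only if every fingerprint bit set for the query is also set for the molecule.

```python
def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Cheap bitwise test: all bits of the query fingerprint must also be
    present in the molecule fingerprint, otherwise a match is impossible."""
    return query_fp & mol_fp == query_fp

# Toy demo with 8-bit fingerprints; real systems use hundreds of bits.
query = 0b00010110
database = [0b00010111, 0b11111110, 0b00000110]

# Only the survivors go on to the expensive atom-by-atom substructure
# search, so search time scales mainly with the number of screen hits.
survivors = [fp for fp in database if passes_screen(query, fp)]
```

This is why search time tracks the number of screening survivors rather than the raw database size: the bitwise filter is nearly free per molecule.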





There are also tricks when the data becomes too large. Some ideas:
  • In JChem you can limit the number of hits and the maximum time allowed for the search. There are also tricks to make the fingerprints more selective if the screening becomes less efficient due to the database size.


  • Please also note that searching in JChem Base instead of the Cartridge is about two times faster: http://www.chemaxon.com/jchem/FAQ.html#benchmark


  • Currently we are working on a project which allows so-called Markush structures to be imported and handled in the database. This feature will allow the handling of large combinatorial libraries without enumerating them. Searching such a database will be much faster than searching an equivalent enumerated one.
Finally, I do not know what you are using MolGen for, but you may find Reactor (our reaction enumeration program) interesting. You can find more information about it here:


http://www.chemaxon.com/products.html#Reactor





Best regards,


Szabolcs

User 677b9c22ff

25-03-2006 02:41:25

Szabolcs wrote:



I think above a certain number of molecules the only option is to store them in a database. Otherwise, they get unmanageable. (And it may be advisable to use a compact format, e.g. smiles.)





About search speed: searching in a database is always faster than in flat files, because the database search uses a fast pre-screening phase. In general, the search speed depends more on the number of structures screened (roughly the number of hits) than on the database size, so the exact time increase may vary depending on your query and database.





Yes, you are right.
Quote:
Finally, I do not know what you are using MolGen for, but you may find Reactor (our reaction enumeration program) interesting. You can find more information about it here:


http://www.chemaxon.com/products.html#Reactor
I will try that. The JChem tools are so diverse and there are so many interesting approaches; sometimes it's really fun.


Thanks for your reply.





Best regards,


Tobias

User ef5e605ae6

09-06-2006 08:23:44

TobiasKind wrote:
Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?
Solved. MarvinView 4.1 will be able to display large files.





Peter

User 677b9c22ff

15-06-2006 04:12:39

Thanks a lot :-)


Tobias