How to extract all fields from an SDF file?

User 677b9c22ff

11-02-2005 23:35:11

Hi,





how can I extract *all* fields from an SDF file, without telling molconvert which fields are inside an SD file (like an automated parser)?





I know that I can do that with multiple fields like extracting the MW field:


molconvert smiles billion.sdf -T MW -o test.smi





Thank you


Tobias

ChemAxon 7c2d26e5cf

14-02-2005 14:59:38

Dear Tobias,


It is not clear what you would like to do exactly.


Do you want to collect all field names in an SD file to add all fields to each structure?


Molconvert cannot do this. It can export only those fields that have a value.

ChemAxon 7c2d26e5cf

14-02-2005 15:27:24

OK, I think I understand what you mean. The problem is that some fields may occur only at the end of the SD file, so molconvert would have to read the whole file before generating the header. Since the file can be very big, MolConverter could quickly run out of memory. (Reading the file twice is not an option, because the file can come from standard input.)

User 677b9c22ff

14-02-2005 21:09:26

Hi Tamas,





thanks for your help. OK, I should write my own parser using the SDF Toolkit from Bruno Bienfait, http://cactus.nci.nih.gov/SDF_toolkit/
Quote:
Since the file can be very big, MolConverter could quickly run out of memory.
But if the file does *not* come from standard input, it could be read in a sequential way, filling only small portions of memory (a kind of buffered read), so it would never run out of memory. But I know it's impossible to write a software Swiss army knife :-)





With kind regards


Tobias Kind

ChemAxon 43e6884a7a

15-02-2005 11:43:07

How about a UNIX shell script solution?





Code:
$ molconvert smiles -T `grep "<.*>" input.sdf | sed "s/^.*<//" | sed "s/>.*$//" | sort | uniq | tr "\n" ":"` input.sdf

User e81aa85d78

02-03-2006 12:56:37

TobiasKind wrote:
Hi Tamas,





thanks for your help. OK, I should write my own parser using the SDF Toolkit from Bruno Bienfait, http://cactus.nci.nih.gov/SDF_toolkit/
Quote:
Since the file can be very big, MolConverter could quickly run out of memory.
But if the file does *not* come from standard input, it could be read in a sequential way, filling only small portions of memory (a kind of buffered read), so it would never run out of memory. But I know it's impossible to write a software Swiss army knife :-)





With kind regards


Tobias Kind

Hello all,





the new version of the SDF Toolkit includes a new tool, "extract_all_prop_sdf".





Here is the help from the tool:





---------------------------------------------------------------------------


extract_all_prop_sdf -h


Example usage : extract_all_prop_sdf [-help] [-regexp '/Salt/i'] < input.sdf > props.csv


extract_all_prop_sdf reads an MDL SDF file and extract all properties. If the -regexp option is used, only the properties matching the given regular expression will be written





The output is a CSV file.


-------------------------------------------------------------------------





Note that this script does not solve the memory problem that can occur with large amounts of data in the SDF.





Bruno Bienfait

User 677b9c22ff

03-03-2006 06:15:07

Hi Bruno,





Cool! However, which version? v1.11 does not have this command.





Tobias

User e81aa85d78

03-03-2006 08:49:22

TobiasKind wrote:
Hi Bruno,





Cool! However, which version? v1.11 does not have this command.





Tobias
Version 2 is not on the NIH web page. I have sent you an email with the new package.








Bruno

User ef5e605ae6

03-03-2006 11:51:04

TobiasKind wrote:
how can i extract *all* fields from an SDF file, without telling molconvert which fields are inside a SD file (like an automated parser)?





I know that I can do that with multiple fields like extracting the MW field:


molconvert smiles billion.sdf -T MW -o test.smi
I have implemented the * wildcard; this will work in the Marvin 4.1 release:


molconvert smiles billion.sdf -T "*" -o test.smi





(In the meantime, you can do the same using the long command line proposed by Ferenc.)

User 677b9c22ff

04-03-2006 00:52:03

Hi Peter,


thanks a lot for the update. I really like ChemAxon's responsiveness. There are always several solutions to a given problem, but this way the user never has the feeling of being lost.





For playing around: the PubChem database now has 8 million compounds, and the flat file is around 30 GB.


ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/





Now this is only the small part. Running a molecular isomer generator like MOLGEN (http://www.mathe2.uni-bayreuth.de/molgen4/), you can start generating all small molecules. But here we are talking about billions of isomer structures (which is OK as long as they are in SMILES), and trillions of molecules create problems (not only space problems). This is soon to come, and it will separate the wheat from the chaff among software. Can I handle 100,000 molecules quickly, or can I handle 100,000,000,000 quickly? (Maybe not with WinXP, but with the Datacenter Edition :-)





BTW, MarvinView has a buffer size of only 2048 bytes, so it reads larger files very slowly; increasing it to 64 KB would make it much quicker.





Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?





Kind regards


Tobias

User ef5e605ae6

06-03-2006 11:41:50

Hi Tobias,
TobiasKind wrote:
For playing around: the PubChem database now has 8 million compounds, and the flat file is around 30 GB.


ftp://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/SDF/
Thanks. These files are relatively small. I tried Compound_00000001_00010000.sdf.gz, but it contains only 9962 structures and no data fields.
Quote:
Can I handle 100,000 molecules in a quick way, or can I handle 100,000,000,000 in a quick way?
Do you mean converting them to a table containing SMILES and all data fields? That is already possible with files of any size, but I have not performed speed tests. If SMILES conversion is not important, then the previously proposed UNIX command line (or its Perl equivalent) is evidently faster than molconvert; after all, SDF field collection is not a molecule conversion task anyway. If you and/or other users need it, then we can consider implementing


grep "<.*>" input.sdf | sed "s/^.*<//;s/>.*$//" | sort | uniq


in Java as an additional command line tool, "sdf_fields".
Quote:
BTW MarvinView has a buffer size of only 2048 bytes so it reads larger files very slow, an adjustment to 64 KBytes would be much quicker.
I do not think so; system time seems to be negligible in both cases. (I tested molconvert null Compound_00000001_00010000.sdf.gz and a similarly large SMILES file.) However, the default buffer size of BufferedInputStream was increased to 8192 bytes in Java 1.5, so I have also increased Marvin's input buffer size to this value.
Quote:
Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?
Still unsolved, but it is on our task list, among a lot of other requests. (The complication is that basic changes are needed in the viewer for this improvement; a lot of code has to be rewritten while maintaining backward compatibility.)





Peter

User 677b9c22ff

07-03-2006 06:06:13

Hi Peter,
Quote:
Thanks. These files are relatively small. I tried Compound_00000001_00010000.sdf.gz, but it contains only 9962 structures and no data fields.



well, they have ~30 properties or fields each, and if you combine all the files, you get a file containing 5 million structures with properties, about 30 GB in size.
Quote:
Do you mean converting them to a table containing smiles and all data fields?



No, I think the SDF extraction problem is solved; this was more free thinking about software. Sure, I can put everything into an Oracle database and work there, but still: is all the software around fast enough? (See http://www.chemaxon.com/jchem/FAQ.html#benchmark3)


What happens if you expand your database from 3 million to 3 billion structures? Does the search time for "O=Cc1ccccc1" then go from 85 sec to 85,000 sec, i.e. about one day?
Quote:
Still unsolved but it is on our task list,



That's good, because it's a real pain to work with larger files.





Tobias

ChemAxon a3d59b832c

07-03-2006 16:20:05

TobiasKind wrote:
No, I think the SDF extraction problem is solved; this was more free thinking about software. Sure, I can put everything into an Oracle database and work there, but still: is all the software around fast enough? (See http://www.chemaxon.com/jchem/FAQ.html#benchmark3)


What happens if you expand your database from 3 million to 3 billion structures? Does the search time for "O=Cc1ccccc1" then go from 85 sec to 85,000 sec, i.e. about one day?
I think above a certain number of molecules the only option is to store them in a database. Otherwise, they get unmanageable. (And it may be advisable to use a compact format, e.g. smiles.)





About search speed: searching in a database is always faster than in flat files, because the database search uses a fast pre-screening phase. In general, the search speed depends more on the number of structures screened (roughly the number of hits) than on the database size, so the exact time increase may vary depending on your query and database.
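The pre-screening idea can be illustrated with a toy Python snippet (this shows the general principle only, not JChem's actual implementation): a molecule can contain the query substructure only if every fingerprint bit set for the query is also set for the molecule.

```python
def passes_screen(query_fp: int, mol_fp: int) -> bool:
    """Cheap bitwise test: all bits of the query fingerprint must also be
    present in the molecule fingerprint, otherwise a match is impossible."""
    return query_fp & mol_fp == query_fp

# Toy demo with 8-bit fingerprints; real systems use hundreds of bits.
query = 0b00010110
database = [0b00010111, 0b11111110, 0b00000110]

# Only the survivors go on to the expensive atom-by-atom substructure
# search, so search time scales mainly with the number of screen hits.
survivors = [fp for fp in database if passes_screen(query, fp)]
```

This is why search time tracks the number of screening survivors rather than the raw database size: the bitwise filter is nearly free per molecule.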





There are also tricks when the data becomes too large. Some ideas:
  • In JChem you can limit the number of hits and the maximum time allowed for the search. There are also tricks to make the fingerprints more selective if the screening becomes less efficient due to the database size.


  • Please also note that searching in JChem Base instead of the Cartridge is about two times faster: http://www.chemaxon.com/jchem/FAQ.html#benchmark


  • Currently we are working on a project which allows so-called Markush structures to be imported and handled in the database. This feature will allow the handling of large combinatorial libraries without enumerating them. Searching such a database will be much faster than searching an equivalent enumerated one.
Finally, I do not know what you are using MolGen for, but you may find Reactor (our reaction enumeration program) interesting. You can find more information about it here:


http://www.chemaxon.com/products.html#Reactor





Best regards,


Szabolcs

User 677b9c22ff

25-03-2006 02:41:25

Szabolcs wrote:



I think above a certain number of molecules the only option is to store them in a database. Otherwise, they get unmanageable. (And it may be advisable to use a compact format, e.g. smiles.)





About search speed: searching in a database is always faster than in flat files, because the database search uses a fast pre-screening phase. In general, the search speed depends more on the number of structures screened (roughly the number of hits) than on the database size, so the exact time increase may vary depending on your query and database.





Yes, you are right.
Quote:
Finally, I do not know what you are using MolGen for, but you may find Reactor (our reaction enumeration program) interesting. You can find more information about it here:


http://www.chemaxon.com/products.html#Reactor
I will try that. The JChem tools are so diverse and there are so many interesting approaches; sometimes it's really fun.


Thanks for your reply.





Best regards,


Tobias

User ef5e605ae6

09-06-2006 08:23:44

TobiasKind wrote:
Do you know if the "memory problem" is solved, so that I can scroll through a large SD file, without reading it all into memory?
Solved. MarvinView 4.1 will be able to display large files.





Peter

User 677b9c22ff

15-06-2006 04:12:39

Thanks a lot :-)


Tobias