fingerprinting a library using SMARTS - ChemAxon Forum Archive

User 3af65074b3

19-01-2011 16:44:33

I was wondering if there is a quick way of fingerprinting a large library of compounds using a predefined query language like SMARTS. Is there a tool for this? Thanks.

ChemAxon 9c0afc9aaf

20-01-2011 21:56:11

Hi,

The answer may depend on your needs.

Could you tell us what you want to use these fingerprints for and how you plan to use them?

Also, approximately how many SMARTS queries do you expect to have?

Would you also consider some Java programming, or only ready-made solutions?

Best,

Szilard

User 3af65074b3

21-01-2011 14:48:43

1/ These fingerprints will be used for machine learning, basically to understand the database. We used pipeline protocols with ChemAxon components, but that takes a lot of time for each compound.

2/ SMARTS queries to be used: 420+

3/ A ready-made solution would be highly appreciated, but if Java programming is needed we will have to work with other colleagues.

Thanks.

ChemAxon 9c0afc9aaf

21-01-2011 17:48:54

Hi,

One possible "ready" solution is to

- create a JChem table with the SMARTS as "structural keys"

- import the structures into this JChem table

- the extra columns (above the FP length at table creation) will contain the fingerprints as 32 bit integers

- the groups of 32 keys (one group per column) should follow the order of the input (left to right)

- within each integer the lowest bits come first (right to left)

- I suggest verifying the above two points with a simple test first

- these columns can be accessed directly from SQL, or exported together with the structures using JChem Manager
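To make the bit layout above concrete, here is a minimal sketch in Python (not part of JChem) that expands the integer columns of one row back into per-key presence flags. It assumes the ordering described above (columns left to right, lowest bit first), which should be checked with a small test as suggested. The function name `decode_keys`, the example row, and the key count of 40 are made up for illustration. The mask is there because many databases return 32-bit integers as signed values.

```python
from typing import Iterable, List

BITS_PER_COLUMN = 32  # each extra column is described as a 32-bit integer


def decode_keys(columns: Iterable[int], n_keys: int) -> List[bool]:
    """Expand 32-bit integer columns into one presence flag per structural key.

    Assumes key 0 is the lowest bit of the first column, key 32 is the
    lowest bit of the second column, and so on.
    """
    flags: List[bool] = []
    for value in columns:
        value &= 0xFFFFFFFF  # turn a signed INT into its unsigned bit pattern
        for bit in range(BITS_PER_COLUMN):
            flags.append(bool((value >> bit) & 1))
    return flags[:n_keys]  # drop padding bits beyond the last key


# Hypothetical row from a table defined with 40 structural keys (two columns).
row = [-2147483643, 1]  # first column has bits 0, 2 and 31 set; second has bit 0
flags = decode_keys(row, n_keys=40)
present = [i for i, hit in enumerate(flags) if hit]
print(present)  # prints [0, 2, 31, 32]
```

Index `i` in `present` then refers to the i-th SMARTS in the order the keys were supplied, provided the ordering assumption holds.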

Otherwise, you can use our Java API to perform the desired calculations in different ways:

A. You may import your data into a JChem table, then:

Run all 420+ queries and store the results in a related table, in extra user-created fields in the JChem table, or write them straight to a file in the format you require.

B. You may read the 420 queries, then read the targets one by one. Standardize each target and run the 420 graph search operations against it.

Though (A) involves more work with the database, it should be the fastest (and is the recommended one), as the searches use hashed fingerprints to accelerate the search.
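The control flow of approach (B) is a nested loop: every target is tested against every query, and each hit sets one bit. A minimal sketch of that loop in Python follows. It is not the JChem API: `pack_matches` and `toy_matches` are hypothetical names, the matcher is a plain substring test standing in for a real graph search, and the standardization step is left out. In practice the `matches` callable would wrap an actual substructure search from a cheminformatics toolkit.

```python
from typing import Callable, List, Sequence

BITS_PER_WORD = 32


def pack_matches(
    target: str,
    queries: Sequence[str],
    matches: Callable[[str, str], bool],
) -> List[int]:
    """Run every query against one target and pack the hits into 32-bit words.

    Query i sets bit (i % 32) of word (i // 32), i.e. lowest bit first.
    """
    words = [0] * ((len(queries) + BITS_PER_WORD - 1) // BITS_PER_WORD)
    for index, query in enumerate(queries):
        if matches(query, target):
            words[index // BITS_PER_WORD] |= 1 << (index % BITS_PER_WORD)
    return words


def toy_matches(query: str, target: str) -> bool:
    # Placeholder only: a substring test, NOT a chemical substructure search.
    return query in target


# Made-up text "structures" and "queries", used only to exercise the loop.
queries = ["OH", "NH2", "Cl"]
targets = ["CH3-CH2-OH", "CH3-NH2", "CH2Cl-CH2-OH"]
fingerprints = [pack_matches(t, queries, toy_matches) for t in targets]
print(fingerprints)  # prints [[1], [2], [5]]
```

With 420 queries this produces 14 words per target, and the cost grows as (number of targets) x (number of queries), which is why a fingerprint-accelerated database search can be faster on large libraries.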

Let us know if you need more information (e.g. API classes) on any of the particular suggestions.

Best regards,

Szilard

User 3af65074b3

26-01-2011 00:47:35

Thanks for the information. I followed the suggestion and now I have fingerprints, but:

1/ Are these based on the compounds that I have uploaded? One thing I did not understand: do these fingerprints encode which of the substructures represented by my queries are present in the database compounds? If not, how can I get fingerprints based on that? Basically, how do I determine from this output the presence or absence of each substructure defined by the queries?

2/ Also, when I import the queries in JChem Manager, should my table type be Query structures or just Molecules?

3/ "the extra columns (above the FP length at table creation) will contain the fingerprints as 32 bit integers": what do you mean by extra columns? Are these the fingerprints you are referring to?

4/ Can the fingerprint be in any format (integer, string, etc.)? Please explain. Thanks.

ChemAxon a3d59b832c

26-01-2011 08:09:34

Hi Barun,

You will find the answers to most of your questions starting from this section of the manual:

http://www.chemaxon.com/jchem/doc/dev/dbconcepts/index.html#fingerprints

Let us know if you have any further questions.

Best regards,

Szabolcs