Search with R-groups

User 4e4b708dbd

28-01-2010 14:10:24

I wrote a query that would find all protected pyrrolidines. Yet, the query returns no hits. I am a bit curious why. I know we have most of these structures in our database. Individual queries have hits (and some combinations):


Query 1: N-Acetylpyrrolidine


Query 2: N-(Trifluoroacetyl)pyrrolidine


Query 3: N-t-Boc-pyrrolidone


Query 4: phenyl pyrrolidine-1-carboxylate


Query 5: (no hits)


Query 6: 1-benzylpyrrolidine


Query 7: 1-(4-Methylphenyl)sulfonyl-pyrrolidine


Query 8: (no hits)



Done with JChem Base 5.2.2


[Query links removed to avoid remote server load.]

ChemAxon 42004978e8

28-01-2010 20:55:41

Hello,


Thanks for reporting the bug.


We could reproduce the problem. We are now searching for the underlying bug.


Regards,


Robert

User 4e4b708dbd

29-01-2010 13:02:55

This does not work: query 1


This works: query 2


[Query links removed to avoid remote server load.]

ChemAxon 42004978e8

02-02-2010 14:40:45

Hello,


Ok. We are checking it as well.


Thanks,


Robert

User 4e4b708dbd

05-03-2010 13:19:58

Dear Robert,


Do you have an update for the original issue?


Imants

ChemAxon 42004978e8

09-03-2010 15:33:09

Hi,


Sorry for the late response. My colleague who is dealing with this issue will answer you soon.


Thanks,


Robert

ChemAxon 42004978e8

24-03-2010 09:28:59

Hello,


 


Sorry for responding so late; my colleagues were busy correcting some critical bugs.


There is a bug in screening when searching with Rg groups. This will be fixed with the next release, version 5.3.9.


Bye,


Robert

ChemAxon 42004978e8

24-03-2010 14:55:26

Hi,


I mistyped the version number: it will be 5.3.3, in about a month.


Robert

User 4e4b708dbd

16-06-2010 09:36:01

The query with multiple R groups appears to be working properly now.


It is extremely slow, however. It takes over 10 minutes to get results and server load is very high (please, don't try it on our website). In the time it takes to get a result for the example I provided, I can draw and modify the query and find the molecules individually. If R groups are used, even with one group, the query takes minutes, while a single exact query executes in milliseconds.


And isotopes still can't be used, as it appears. Correct?

ChemAxon a3d59b832c

16-06-2010 13:53:32

Hi Imants,


Yes, searching with R-group queries may take a long time. The reason is that only the scaffold is used for the fast pre-screening process, and if a database has many records that contain the scaffold, then all those records will remain for the slower atom-by-atom searching. You can confirm this by checking the JChem search logs, which record the number of screened records and hits for each query.
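The effect described above can be illustrated with a toy model. This is only a sketch: the feature names and records are invented, fingerprints are modeled as plain Python sets, and none of it is JChem code. It shows why a query that contributes only its scaffold to screening leaves more candidates for the atom-by-atom phase.

```python
# Toy model of two-phase searching: fingerprint screening, then atom-by-atom matching.
# Feature names and records are made up for illustration; this is not JChem code.

def screen(records, query_features):
    """Keep records whose fingerprint contains every feature the query requires."""
    return [name for name, features in records.items() if query_features <= features]

records = {
    "N-acetylpyrrolidine": {"pyrrolidine", "amide", "methyl"},
    "1-benzylpyrrolidine": {"pyrrolidine", "benzene"},
    "proline": {"pyrrolidine", "carboxylic_acid"},
    "toluene": {"benzene", "methyl"},
}

# A fully specified query contributes all of its features to screening.
exact_query = {"pyrrolidine", "amide", "methyl"}
# An R-group query contributes only its scaffold; the R-group definitions are ignored here.
rgroup_scaffold = {"pyrrolidine"}

exact_candidates = screen(records, exact_query)
rgroup_candidates = screen(records, rgroup_scaffold)

print(len(exact_candidates), "candidate(s) left for atom-by-atom search (exact query)")
print(len(rgroup_candidates), "candidate(s) left for atom-by-atom search (R-group query)")
```

In this toy data set the exact query leaves one candidate, while the scaffold-only screen leaves every pyrrolidine-containing record for the slower phase.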


However, 10 minutes seems high to me. In our NCI 236K database the above queries run within a few seconds, and the non-R-group searches are much faster here as well. I can see these numbers in my log using your last R-query:


R-group search:


Wed Jun 16 15:41:53 CEST 2010
Search mode: FULL
Structure table: demo_nci
Query: [$(O=CCC1C2=CC=CC=C2C2=CC=CC=C12)]N1CCCC1
Screened: 6828
Hits: 0
Total time: 1282 ms  Screening: 15 ms
Processing threads: 2
Current / peak / maximum searches per minute: 1 / 1 / Unlimited


 


Non-Rgroup search:


Wed Jun 16 15:47:37 CEST 2010
Search mode: FULL
Structure table: demo_nci
Query: O=C(CC1C2=CC=CC=C2C2=CC=CC=C12)N1CCCC1
Screened: 0
Hits: 0
Total time: 171 ms  Screening: 110 ms
Processing threads: 2
Current / peak / maximum searches per minute: 1 / 1 / Unlimited


 


Regarding the other problem (isotope and link node):


 


I could not reproduce the problem. I could find records with your query on our example site:


http://www.chemaxon.com/jchem/examples/db_search


(Check table editexample.)


Can you attach molecules in your database that are being missed?

User 4e4b708dbd

16-06-2010 14:38:02

250,000 structures is nothing. I am somewhat surprised you keep testing bugs with such a small database. Well, you probably have your reasons. Have you tried it with 20 million structures from PubChem? And it appears you tried it for only one R-group. How about all 8 of them? How much longer does it take?


And is your query really identical to the one I listed? Can you run the attached query?


What are your guidelines for R-group search? When can it be used in a 20-million-structure database? Does the search time increase linearly with R-group count? With molecule count in the database?

User 4e4b708dbd

16-06-2010 14:49:05

Just checked in your NCI example database - indeed it takes only a few seconds. Does that run on the cartridge or JChem Base?

ChemAxon a3d59b832c

16-06-2010 20:54:37

 




Answering last question first:


Just checked in your NCI example database - indeed it takes only a few seconds. Does that run on the cartridge or JChem Base?


It runs on JChem Base, but there should not be much difference, because there are not many results. In this case most of the time is spent on the computation, which is exactly the same code.


And is your query really identical to the one I listed? Can you run the attached query?



Yes, I used copy-paste. This latest attached query runs in ~2s in substructure, and ~3s in full structure search mode on the NCI db.


250,000 structures is nothing. I am somewhat surprised you keep testing bugs with such a small database. Well, you probably have your reasons. Have you tried it with 20 million structures from PubChem?


I will check if we have it on one of our test machines imported. We do use it for benchmarks, but very often a smaller database is enough to reproduce problems.



Does the search time increase linearly with R-group count? With molecule count in the database?


Since NCI is quite diverse itself, I would expect so. Previously we used multiplied NCI for our benchmarks, and we now use PubChem for that. The searching times were proportional as we moved to PubChem.



What are your guidelines for R-group search? When can it be used in a 20-million-structure database?


I will check PubChem, but I think searching time on a larger DB should scale about linearly with its size.


Scaling the NCI timings up to 20 million structures gives ~4 min, on the dual-core notebook that I used for those runs. If PubChem in total happens to contain more records with the core structure, and if your server is heavily loaded, then your figure (~10 minutes) might not be very far off.
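The ~4 min figure appears to follow from a simple proportional estimate. A minimal sketch of that arithmetic, assuming search time scales linearly with record count and using the rounded sizes quoted in this thread (250,000 for NCI, 20 million for PubChem) together with the ~3 s full structure search time measured on NCI:

```python
# Linear extrapolation of search time with database size (proportionality is an assumption).
nci_records = 250_000          # rounded size of the NCI table, as quoted in the question
pubchem_records = 20_000_000   # database size quoted in the question
nci_full_search_s = 3.0        # ~3 s full structure search measured on NCI

scale = pubchem_records / nci_records
estimated_s = nci_full_search_s * scale
print(f"scale factor: {scale:.0f}x, estimate: {estimated_s:.0f} s (~{estimated_s / 60:.0f} min)")
# prints "scale factor: 80x, estimate: 240 s (~4 min)"
```

The estimate ignores hardware differences, server load, and how many records in the larger database contain the scaffold, so it is only a rough lower bound.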


We will check if we can improve searching times for R-group queries.


Maybe full searching times can be improved easily for the kind of R-queries that you used.


 


User 4e4b708dbd

17-06-2010 14:27:17

Have you tried R-group search in a database with empty structures? What if you add about 5% empty structures to the database - should that increase the search time?


We are running these queries on JChem Base on Linux.

ChemAxon 9c0afc9aaf

17-06-2010 17:41:31

 


Have you tried R-group search in a database with empty structures? What if you add about 5% empty structures to the database - should that increase the search time?

Empty structures have empty (all zero) fingerprints as well, so they would drop out quickly during the fingerprint screening phase.


In short: they cannot increase search time.
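A minimal sketch of why an all-zero fingerprint drops out, assuming the usual bit-subset screening test (a target passes only if it has every bit that the query has). The fingerprints here are small made-up integers, not real JChem fingerprints:

```python
def passes_screen(query_fp: int, target_fp: int) -> bool:
    """A target survives screening only if it has every bit set in the query fingerprint."""
    return (query_fp & target_fp) == query_fp

query_fp = 0b1011        # hypothetical query fingerprint with three bits set
empty_fp = 0b0000        # empty structure: all-zero fingerprint
matching_fp = 0b1111     # hypothetical target containing all query bits

print(passes_screen(query_fp, empty_fp))     # False: rejected by a single bitwise test
print(passes_screen(query_fp, matching_fp))  # True: kept for atom-by-atom search
```

The only case where an empty fingerprint would pass this test is a query whose own fingerprint is all zero.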


Szilard

User 4e4b708dbd

18-06-2010 09:51:04

We are considering setting up a table for direct comparison with your online examples. The NCI database looks the most suitable for this. Or is there a larger database? Which exact version of the NCI data are you using? Can you share the standardization that you are using?


This way we will be able to directly compare results on our servers with your benchmarks. If a query works on your online example and not on our servers, the problem is ours.

ChemAxon a3d59b832c

18-06-2010 21:06:12

Hi,


 


I think it is the "August 2000 SMILES Strings" collection from this download page:


http://cactus.nci.nih.gov/download/nci/


Or you can also export it directly from our JSP example.


 


There is no special standardization on that table, it just uses the default one.


 


No, currently there is no larger publicly accessible example database. I have found a version of PubChem from last year on one of our test machines; it is being regenerated now for some testing.


 


Best regards,


Szabolcs

ChemAxon a3d59b832c

22-06-2010 08:27:05

Hi Imants,


I tried PubChem. In JChem Base, full searching with your latest R-query took 660 s on our test server, that is 11 min. Substructure search, on the other hand, took 479 s (~8 min), not counting the time needed for returning the results (346,290 hits).


Search mode: SUBSTRUCTURE
Structure table: PKOVACSUSER_520.PUBCHEM
Query: [$([#6]C=O),$(FC(F)(F)C=O),$([#6]C([#6])([#6])OC=O),$(O=COC1=CC=CC=C1),$(C1=CC=C(C=C1)C(C1=CC=CC=C1)C1=CC=CC=C1),$([#6]C1=CC=CC=C1),$([#6]C1=CC=C(C=C1)S(=O)=O),$(O=CCC1C2=CC=CC=C2C2=CC=CC=C12)]N1CCCC1
Screened: 1625241
Hits: 346290
Cache loading: 757741 ms
Cache size (this table / total): 1966.84 / 1966.84 MBytes
Total time: 479212 ms  Screening: 1299 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited

Found 346290 hits in table PKOVACSUSER_520.PUBCHEM.



Search mode: FULL
Structure table: PKOVACSUSER_520.PUBCHEM
Query: [$([#6]C=O),$(FC(F)(F)C=O),$([#6]C([#6])([#6])OC=O),$(O=COC1=CC=CC=C1),$(C1=CC=C(C=C1)C(C1=CC=CC=C1)C1=CC=CC=C1),$([#6]C1=CC=CC=C1),$([#6]C1=CC=C(C=C1)S(=O)=O),$(O=CCC1C2=CC=CC=C2C2=CC=CC=C12)]N1CCCC1
Screened: 1625241
Hits: 9
Cache loading: 756777 ms
Cache size (this table / total): 1966.84 / 1966.84 MBytes
Total time: 659617 ms  Screening: 1205 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited

Found 9 hits in table PKOVACSUSER_520.PUBCHEM.

In both cases, the initial screening left 1.6M structures to search, which is a lot...
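To put that number in perspective, the counters in a log entry can be pulled out and turned into a per-record cost. The sketch below parses the four fields shown in the full search log above. It assumes that "Total time" includes the screening time and that "Screened" is the number of records left after screening; the helper name `parse_search_log` is made up for this example.

```python
import re

def parse_search_log(text):
    """Extract the counters from one JChem search log entry into a dict."""
    patterns = {
        "screened": r"Screened:\s*(\d+)",
        "hits": r"Hits:\s*(\d+)",
        "total_ms": r"Total time:\s*(\d+)\s*ms",
        "screening_ms": r"Screening:\s*(\d+)\s*ms",
    }
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[key] = int(match.group(1))
    return fields

# Values copied from the FULL search log entry in this post.
full_log = """Search mode: FULL
Screened: 1625241
Hits: 9
Total time: 659617 ms  Screening: 1205 ms
"""

stats = parse_search_log(full_log)
# Assumption: everything that is not screening is spent on the records that passed it.
after_screening_ms = stats["total_ms"] - stats["screening_ms"]
print(f"time after screening: {after_screening_ms} ms")
print(f"per screened record: {after_screening_ms / stats['screened']:.3f} ms")
print(f"hit rate among screened records: {stats['hits'] / stats['screened']:.6%}")
```

A very low hit rate among the screened records, as here, is a sign that the screen is letting through far more candidates than the query can match.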


As I said, we will improve screening for this type of query. That will bring searching time down significantly for full structure search, and hopefully somewhat for substructure search as well.


 


However, I am not sure if we will be able to squeeze in this development for the next major release (5.4). So I would say that it is only realistic to expect it for 5.5 in the first half of next year.


 


Best regards,


Szabolcs


 


[scsepregi@prefect bin]$ ./jcsearch -Xmx4000M -vv -t:f -q ../../../workspace/db
Bug0003/R-query.mrv DB:PKOVACSUSER_520.PUBCHEM -f :TCD_ID >jcs_log_full.txt

Mon Jun 21 13:34:31 CEST 2010
Search mode: FULL
Structure table: PKOVACSUSER_520.PUBCHEM
Query: [$([#6]C=O),$(FC(F)(F)C=O),$([#6]C([#6])([#6])OC=O),$(O=COC1=CC=CC=C1),$(C1=CC=C(C=C1)C(C1=CC=CC=C1)C1=CC=CC=C1),$([#6]C1=CC=CC=C1),$([#6]C1=CC=C(C=C1)S(=O)=O),$(O=CCC1C2=CC=CC=C2C2=CC=CC=C12)]N1CCCC1
Screened: 1625241
Hits: 9
Cache loading: 756777 ms
Cache size (this table / total): 1966.84 / 1966.84 MBytes
Total time: 659617 ms  Screening: 1205 ms
Processing threads: 4
Current / peak / maximum searches per minute: 1 / 1 / Unlimited

Found 9 hits in table PKOVACSUSER_520.PUBCHEM.
[scsepregi@prefect bin]$

ChemAxon a3d59b832c

11-05-2011 16:16:19

JChem 5.5 is out now, with improvements in screening of simple R-group queries. (Those not containing R-logic.)


 


Furthermore, we sped up the internals of R-group searching as well.


 


So your above queries are significantly faster now.


I did not check the big database, but on NCI the large R-group query attached above now takes under 0.7 s for substructure search (SSS) and 0.13 s for full structure search (FSS), compared to the previous 2 s and 3 s.


 


Depending on the query and the database, even 100- to 1000-fold speedups may be possible.


 


Best regards,


Szabolcs