jcsearch command line question

User c1ce6b3d19

03-12-2010 16:45:17

A User wrote:


We are using jcsearch to cross-reference our database
against PubChem. We are at an early stage and are running
a subset of the data to test performance.



jcsearch -q chemdb_sub500.sdf -f sdf -t:e --or -o hits.sdf Pubchem1.sdf


where 

chemdb_sub500 is a subset of 500 structures from our
database

hits.sdf  is an output file showing the matches

Pubchem1.sdf is a subset of pubchem data


- This seems to work OK, but we also need to output the ID
field from the chemdb_sub500.sdf file. Is there a way to do
so?

- Performance is quite slow. Is there a way to convert
Pubchem1.sdf into a fast-search file format using a tool
such as generatemd or similar?

Thanks

User c1ce6b3d19

03-12-2010 16:47:09

If I understand correctly, this part of the documentation might help:



-f format output format (default: smiles). Run jcsearch -H for details.


    -f :T<SDF field> write the value of the SDF field in matching targets


    -f :Tname write the molecule names of matching targets


    -f :M<SDF field1:...:SDF fieldn> write the specified field values to the molecule as SDF fields


User c1ce6b3d19

03-12-2010 16:49:06

A User responded:

I've already checked the documentation; I'm contacting you because
it's not clear to me how to do this. I've tried different permutations
of the parameters, but apparently not the right one, and I couldn't
find a solution in the forums. The field I'm trying to retrieve is
called ETC_ID and is present in chemdb_sub500.sdf.


jcsearch -q chemdb_sub500.sdf -f :T<ETC_ID> sdf -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.


jcsearch -q chemdb_sub500.sdf -f :M< SDF ETC_ID> -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.


jcsearch -q chemdb_sub500.sdf -f :<SDF ETC_ID> -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.


jcsearch -q chemdb_sub500.sdf -f :T<SDF ETC_ID> -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.


jcsearch -q chemdb_sub500.sdf -f :M<ETC_ID> -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.


jcsearch -q chemdb_sub500.sdf -f :M<ETC_ID>  sdf -t:e --or -o hits.sdf Pubchem1.sdf

The system cannot find the file specified.
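
Editorial note: a likely cause of this error (an assumption, not confirmed anywhere in the thread) is the shell rather than jcsearch. The angle brackets in the help text mark a placeholder, and the Windows command prompt treats an unquoted < or > as input/output redirection, so "<ETC_ID" makes it look for an input file named ETC_ID. The small Python sketch below only illustrates that check; the helper name is hypothetical and it ignores cmd's ^ escaping.

```python
def unquoted_redirections(command_line: str) -> list:
    """Return the < and > characters that appear outside double quotes.

    Simplified model of how a command shell spots redirection operators;
    it does not handle the ^ escape character or other quoting rules.
    """
    found = []
    in_quotes = False
    for ch in command_line:
        if ch == '"':
            in_quotes = not in_quotes      # toggle on every double quote
        elif ch in "<>" and not in_quotes:
            found.append(ch)               # the shell would treat this as redirection
    return found


attempt = ("jcsearch -q chemdb_sub500.sdf -f :T<ETC_ID> sdf "
           "-t:e --or -o hits.sdf Pubchem1.sdf")
print(unquoted_redirections(attempt))
```

Dropping the brackets or quoting the argument should at least get the command past the shell. Whether jcsearch then accepts the option in that form is a separate question to check against jcsearch -H.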

 

I would also appreciate it if you could explain a bit more about
something Peter commented on. Basically, I'm trying to cross-reference
two SDF files: one contains 1.7 MM records and the other 100 MM records
(we will probably split the latter into smaller subsets).

Regarding this comment from Peter:


I am aware of one fundamental characteristic of jcsearch, though:
it loads the structure table being searched in each time it is
executed. This makes it unpractical for most real-life purposes.
You'll most probably have to use (or implement) a tool which keeps
the structure tables cached across individual searches. (Such a
tool is JChem Webservices, which Jon also happens to be in charge of.)

User c1ce6b3d19

03-12-2010 17:41:40

It sounds like the SDF field ETC_ID is in the query molecules, but not the target. 


Do you only want the fields that are in the chemdb_sub500.sdf or in the Pubchem1.sdf, or some of both?


 


 

ChemAxon 9c0afc9aaf

03-12-2010 17:59:13

 


If I understand correctly, this part of the documentation might help:




Unfortunately, it deals with the data properties of the targets (PubChem), not the queries (in-house).


jcsearch does not seem to offer this kind of functionality.


Regarding Peter's comment, two observations are implied:


1. Searching in files is much slower than searching in a database. No fingerprints are used for pre-filtering before the graph search, and all structures are standardized on the fly, again and again across multiple runs.


2. jcsearch can work with a database as well. This can be faster, but there is still a big overhead for multiple runs:


First, the structure cache is filled before the first search (loading a lot of data into memory from the DB). Normally this happens only once in a continuously running application, but jcsearch exits after the search and the cache is lost, so it has to be reloaded for each run.


For the reasons mentioned above, jcsearch is mostly used for small datasets and testing.
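
Editorial note: the two overheads described above can be illustrated with a toy Python sketch. It is not ChemAxon code and does not use any JChem API. Fingerprints are represented as plain integers (an assumption for illustration), the screen is the usual bitwise subset test used before a substructure graph search, and the class and function names are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def passes_screen(query_fp: int, target_fp: int) -> bool:
    """True if every bit set in the query fingerprint is also set in the target."""
    return (query_fp & target_fp) == query_fp


class CachedTargets:
    """Keeps target fingerprints in memory so they are loaded only once."""

    def __init__(self, loader: Callable[[], Dict[str, int]]):
        self._loader = loader
        self._fingerprints: Optional[Dict[str, int]] = None
        self.load_count = 0

    def candidates(self, query_fp: int) -> List[str]:
        """Return IDs of targets that survive the screen for this query."""
        if self._fingerprints is None:
            # Only the first search pays the loading cost; a tool that
            # exits after each search would pay it on every run.
            self._fingerprints = self._loader()
            self.load_count += 1
        return [target_id
                for target_id, fp in self._fingerprints.items()
                if passes_screen(query_fp, fp)]


def load_targets() -> Dict[str, int]:
    # Stand-in for an expensive load from a file or database.
    return {"T1": 0b1011, "T2": 0b0100, "T3": 0b1111}


cache = CachedTargets(load_targets)
first = cache.candidates(0b0011)
second = cache.candidates(0b0100)
print(first, second, cache.load_count)
```

Targets that pass the screen are only candidates; a real search would still run the full graph comparison on them. The point is that the screen discards most targets cheaply, and that a long-running process loads the target data once instead of once per search.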


Apart from the option of writing a few lines of code, either in Java or using one of our language-agnostic interfaces (we can give pointers to the relevant API), there is also another end-user solution.


The Instant JChem GUI has an "overlap analysis" tool (amongst other handy features) which may be useful for you:


http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html


Product page:


http://www.chemaxon.com/products/instant-jchem/


Instant JChem section on the support forum:


https://www.chemaxon.com/forum/forum62.html
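
Editorial note: as a rough idea of what the "few lines of code" route mentioned above could look like, here is a minimal Python sketch that keeps the query ID next to each match. It deliberately swaps in a different technique: it does not use jcsearch or any ChemAxon API, and it matches records by comparing a text key stored in both files rather than by graph-based exact search. It assumes both files carry a comparable structure key such as an InChIKey. The in-house field name INCHIKEY is hypothetical, and the PubChem tag names used in the example (PUBCHEM_IUPAC_INCHIKEY, PUBCHEM_COMPOUND_CID) should be checked against your own download.

```python
from typing import Dict, Iterator, List, Tuple


def read_sdf_fields(text: str) -> Iterator[Dict[str, str]]:
    """Yield the data fields of each SDF record (records end with $$$$)."""
    for record in text.split("$$$$"):
        if not record.strip():
            continue
        fields: Dict[str, str] = {}
        name = None
        for line in record.splitlines():
            if line.startswith(">") and "<" in line:
                # Data header such as "> <ETC_ID>": take the text between < and >.
                name = line[line.index("<") + 1:line.rindex(">")]
                fields[name] = ""
            elif name is not None:
                if line.strip() == "":
                    name = None            # a blank line ends the field value
                else:
                    fields[name] = fields[name] + "\n" + line if fields[name] else line
        yield fields


def cross_reference(query_sdf: str, target_sdf: str,
                    query_key: str, query_id: str,
                    target_key: str, target_id: str) -> List[Tuple[str, str]]:
    """Return (query ID, target ID) pairs for records sharing the same key."""
    # Index the smaller (query) set once, then pass over the targets.
    index: Dict[str, List[str]] = {}
    for fields in read_sdf_fields(query_sdf):
        if fields.get(query_key):
            index.setdefault(fields[query_key], []).append(fields.get(query_id, ""))
    pairs = []
    for fields in read_sdf_fields(target_sdf):
        for qid in index.get(fields.get(target_key, ""), []):
            pairs.append((qid, fields.get(target_id, "")))
    return pairs


# Tiny made-up records; the structure blocks are omitted for brevity.
queries = ("q1\n\n\nM  END\n> <ETC_ID>\nE-001\n\n> <INCHIKEY>\nKEY-A\n\n$$$$\n"
           "q2\n\n\nM  END\n> <ETC_ID>\nE-002\n\n> <INCHIKEY>\nKEY-B\n\n$$$$\n")
targets = ("t1\n\n\nM  END\n> <PUBCHEM_COMPOUND_CID>\n111\n\n"
           "> <PUBCHEM_IUPAC_INCHIKEY>\nKEY-B\n\n$$$$\n")

hits = cross_reference(queries, targets, "INCHIKEY", "ETC_ID",
                       "PUBCHEM_IUPAC_INCHIKEY", "PUBCHEM_COMPOUND_CID")
print(hits)
```

For files of the sizes mentioned in this thread, the same idea would need to read the large file record by record instead of holding it in memory as one string; only the index of the smaller file has to stay in memory.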


 


Best regards,


Szilard