odd matching behavior

User 870ab5b546

22-09-2008 17:03:30

With the parameters below, query CN=C matches to target [H]N([H])(C)=C, but query CN=C(C)C doesn't match to target [H]N([H])(C)=C(C)C.

It seems to me that both matches should be true, or both should be false.





Code:
        ourSearchOpts.setSearchType(SearchConstants.EXACT);
        ourSearchOpts.setOrderSensitiveSearch(false);
        ourSearchOpts.setStereoModel(SearchConstants.STEREO_MODEL_GLOBAL);
        ourSearchOpts.setStereoSearch(true);
        ourSearchOpts.setExactStereoMatching(false);
        ourSearchOpts.setDoubleBondStereoMatchingMode(true);
        ourSearchOpts.setChargeMatching(SearchConstants.CHARGE_MATCHING_EXACT);
        ourSearchOpts.setIsotopeMatching(SearchConstants.ISOTOPE_MATCHING_EXACT);
        ourSearchOpts.setRadicalMatching(SearchConstants.RADICAL_MATCHING_EXACT);
        ourSearchOpts.setVagueBondLevel(SearchConstants.VAGUE_BOND_OFF);
        ourSearch.setSearchOptions(ourSearchOpts);

ChemAxon 9c0afc9aaf

24-09-2008 02:21:55

Hi Bob,





We are examining this case and will answer soon.





A quick question: you are talking about MolSearch and MolSearchOptions here (not JChemSearch and JChemSearchOptions), right?





Best regards,





Szilard

User 870ab5b546

24-09-2008 02:30:45

Szilard wrote:
You are talking about MolSearch and MolSearchOptions here, right ?


(not JChemSearch and JChemSearchOptions)
Correct.





Shouldn't you be asleep right now?

ChemAxon 42004978e8

24-09-2008 10:32:11

Hi Bob,





Query "CN=C(C)C" does not match [H]N([H])(C)=C(C)C because of double bond stereo information. The query is double bond stereo specific, while the target is not, because its nitrogen has 4 ligands.

In the other case, molecule CN=C doesn't have enough atoms for the double bond stereo to be specified. Hence it matches [H]N([H])(C)=C, which isn't stereo specific either.





Hope this explains the behaviour,


Bye,


Robert

User 870ab5b546

24-09-2008 14:13:52

Unfortunately, it doesn't explain the behavior. CN=C(C)C has no double-bond stereochemistry, so, according to your explanation, it *should* match to [H]N([H])(C)=C(C)C.





If you are going to allow pentavalent N atoms, you ought to treat them as different from the trivalent N atoms in an exact search, even if the only difference is two H atoms.

ChemAxon 42004978e8

26-09-2008 09:19:39

Hi Bob,





Yes, you are right, the query molecule is symmetric. In our code, such a query matches CIS, TRANS, and double-bond-symmetric targets alike.

The problem with this search is that double bond stereo isn't handled in any way for pentavalent nitrogens (and atoms with more than 3 ligands).

We are now considering when we should allow matching of double bond queries to such targets.





Bye,


Robert

User 870ab5b546

26-09-2008 12:07:45

Again: There is a stereochemistry problem here, but the real problem is that in the exact search, N with three ligands is matching to N with five ligands. No chemist would ever draw pentavalent N with implicit H atoms. Your matching code needs to treat H atoms attached to hypervalent atoms as if they were heavier atoms. You can convert them to pseudoatoms before doing the exact match.

ChemAxon a3d59b832c

29-09-2008 08:18:18

Hi Bob,





Please note that exact search does not check atom valencies. (It basically works the same way as substructure search in all aspects except that it requires that the heavy atom network is the same.)





On the other hand, perfect search checks whether the number of hydrogens is the same. In finer detail, this functionality can be controlled by the option exactQueryAtomMatching:


http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/search/SearchOptions.html#isExactQueryAtomMatching()





Or, alternatively by a custom MolComparator:


http://www.chemaxon.com/jchem/doc/api/chemaxon/sss/search/SearchOptions.html#addUserComparator(chemaxon.sss.search.MolComparator)





Atom valencies are usually checked only when explicitly set in the query by the "v" property:


http://www.chemaxon.com/jchem/doc/user/query_features.html#atprop


http://www.chemaxon.com/jchem/doc/api/chemaxon/struc/MolAtom.html#getValenceProp()





This property can also be used to draw atoms of unusual valence with implicit Hydrogens.
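To make the distinction concrete, here is a minimal Java sketch of the two comparison rules described in this post. It is a toy model and not ChemAxon code: the class `SearchModeSketch`, the `Atom` type, and all method names are hypothetical, bonds are ignored, and the heavy atoms of query and target are assumed to be already paired up in the same order, so no graph matching is performed.

```java
import java.util.Arrays;
import java.util.List;

class SearchModeSketch {
    /** Toy heavy atom: element symbol plus total hydrogen count (explicit + implicit). */
    static final class Atom {
        final String element;
        final int hydrogens;

        Atom(String element, int hydrogens) {
            this.element = element;
            this.hydrogens = hydrogens;
        }
    }

    /** "Exact"-style rule: only the heavy atoms are compared, hydrogens are ignored. */
    static boolean exactStyleMatch(List<Atom> query, List<Atom> target) {
        if (query.size() != target.size()) {
            return false;
        }
        for (int i = 0; i < query.size(); i++) {
            if (!query.get(i).element.equals(target.get(i).element)) {
                return false;
            }
        }
        return true;
    }

    /** "Perfect"-style rule: heavy atoms and their hydrogen counts must both agree. */
    static boolean perfectStyleMatch(List<Atom> query, List<Atom> target) {
        if (!exactStyleMatch(query, target)) {
            return false;
        }
        for (int i = 0; i < query.size(); i++) {
            if (query.get(i).hydrogens != target.get(i).hydrogens) {
                return false;
            }
        }
        return true;
    }

    /** Heavy atoms of CN=C: CH3, N with no hydrogens, CH2. */
    static List<Atom> query() {
        return Arrays.asList(new Atom("C", 3), new Atom("N", 0), new Atom("C", 2));
    }

    /** Heavy atoms of [H]N([H])(C)=C: CH3, N with two hydrogens, CH2. */
    static List<Atom> target() {
        return Arrays.asList(new Atom("C", 3), new Atom("N", 2), new Atom("C", 2));
    }

    public static void main(String[] args) {
        System.out.println("exact-style match:   " + exactStyleMatch(query(), target()));
        System.out.println("perfect-style match: " + perfectStyleMatch(query(), target()));
    }
}
```

In this toy model the exact-style rule accepts the pair from the first post, because the heavy atoms agree, while the perfect-style rule rejects it, because the nitrogen carries a different number of hydrogens.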





Best regards,


Szabolcs

User 870ab5b546

29-09-2008 15:17:00

I suggest that when an atom is hypervalent and bears H atoms, the H atoms be treated as heavy atoms. You can convert them to pseudoatoms before the search.





Generally, ignoring explicit H atoms when doing exact structure searching is good policy, because it makes no chemical difference whether H atoms are explicit or implicit. However, a pentavalent N atom with two explicit H atoms is in no way, shape, or form equivalent to a trivalent N atom. The two should not be allowed to match.





Yes, there are workarounds, and we may implement a custom MolComparator for the purpose, but you have to be aware of the problem to know to implement the workaround, and I think most chemists would not think to make their programmers aware of the problem. It would be better if you incorporated my suggestion into your standard treatment, and let users develop custom MolComparators if they would rather have them match (which they won't).
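The preprocessing suggested above could look roughly like the following Java sketch. It is only an illustration on a toy data structure, not an implementation against the JChem API: the class `HypervalentHydrogenSketch`, the `NORMAL_VALENCE` table, the `H*` pseudoatom label, and the bond encoding `{atomA, atomB, order}` are all assumptions made for this example, and only neutral C, N, and O with fully explicit bonds are considered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class HypervalentHydrogenSketch {
    /** Assumed normal valences for neutral atoms; deliberately incomplete. */
    static final Map<String, Integer> NORMAL_VALENCE = Map.of("C", 4, "N", 3, "O", 2);

    /**
     * Returns a copy of the atom labels in which every explicit H bonded to a
     * hypervalent atom is relabelled "H*", so that it no longer counts as a hydrogen.
     * Each bond is encoded as {atomIndexA, atomIndexB, bondOrder}.
     */
    static List<String> relabel(List<String> symbols, int[][] bonds) {
        // Sum the explicit bond orders around each atom.
        int[] valence = new int[symbols.size()];
        for (int[] bond : bonds) {
            valence[bond[0]] += bond[2];
            valence[bond[1]] += bond[2];
        }
        List<String> result = new ArrayList<>(symbols);
        for (int[] bond : bonds) {
            relabelIfNeeded(result, symbols, valence, bond[0], bond[1]);
            relabelIfNeeded(result, symbols, valence, bond[1], bond[0]);
        }
        return result;
    }

    /** Relabels atom "neighbor" if it is an H and atom "center" exceeds its normal valence. */
    private static void relabelIfNeeded(List<String> result, List<String> symbols,
                                        int[] valence, int center, int neighbor) {
        Integer normal = NORMAL_VALENCE.get(symbols.get(center));
        boolean hypervalent = normal != null && valence[center] > normal;
        if (hypervalent && symbols.get(neighbor).equals("H")) {
            result.set(neighbor, "H*");
        }
    }

    /** Counts the atoms that a hydrogen-ignoring comparison would still see. */
    static long heavyAtomCount(List<String> symbols) {
        return symbols.stream().filter(s -> !s.equals("H")).count();
    }

    public static void main(String[] args) {
        // Target [H]N([H])(C)=C with atoms H0, N1, H2, C3, C4.
        List<String> target = Arrays.asList("H", "N", "H", "C", "C");
        int[][] targetBonds = {{0, 1, 1}, {1, 2, 1}, {1, 3, 1}, {1, 4, 2}};
        List<String> relabelled = relabel(target, targetBonds);
        System.out.println(relabelled + " heavy atoms: " + heavyAtomCount(relabelled));
    }
}
```

After relabelling, the target has five atoms that a hydrogen-ignoring comparison would count, against three in the query CN=C, so the two heavy atom networks can no longer be equal.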

ChemAxon d76e6e95eb

02-10-2008 07:19:22

In my opinion, in the case of molecules, it does not matter how a hydrogen is drawn, explicitly or implicitly. There is no difference between them. It is a hydrogen atom, an atom to be considered like any other atom. In the case of exact search, the query is a molecule (with all hydrogens), and the target is a molecule as well (with all hydrogens). So I think that exact matching should always consider all atoms, including hydrogens.





A trivalent nitrogen should not match a pentavalent nitrogen in exact matching.

ChemAxon a3d59b832c

02-10-2008 08:53:45

I think there is a misunderstanding here. As I stated earlier, our exact search is a special type of substructure search where the heavy atom network of the query and target molecules must be equal for a match. All other features are treated the same way as for substructure search. (E.g. query atoms, query properties, stereochemistry, formal charges, radicals, etc.) As substructure search ignores implicit H-s and implicit valence of an atom, exact structure search will also.





On the other hand, exact search in other cheminformatics systems usually means duplication (equality) search of molecules. The latter is called perfect search at ChemAxon. As we agreed, perfect search (duplication search) behaves exactly the way you propose.





However, I am sure all chemists would be highly surprised if substructure search required hydrogen number or valency check. This latter would mean that one-atom substructure queries without valence specification would only match to the default valency occurrences. (E.g. S would only retrieve C1CCSCC1 , but not CS(N)(=O)=O )
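The sulfur example can be illustrated with a small Java sketch of a single-atom comparison with an optional valence constraint, similar in spirit to the "v" query property mentioned earlier. This is a toy model only: the class `ValenceQuerySketch` and the method `atomMatches` are hypothetical names, and the valences are written in by hand (2 for the sulfur in C1CCSCC1, 6 for the sulfur in CS(N)(=O)=O).

```java
import java.util.OptionalInt;

class ValenceQuerySketch {
    /**
     * Compares one query atom with one target atom. The valence is checked
     * only when the query specifies one; otherwise any valence is accepted.
     */
    static boolean atomMatches(String queryElement, OptionalInt queryValence,
                               String targetElement, int targetValence) {
        if (!queryElement.equals(targetElement)) {
            return false;
        }
        return !queryValence.isPresent() || queryValence.getAsInt() == targetValence;
    }

    public static void main(String[] args) {
        int thioetherSulfur = 2;   // S in C1CCSCC1
        int sulfonamideSulfur = 6; // S in CS(N)(=O)=O

        // Query "S" without a valence constraint accepts both sulfur atoms.
        System.out.println(atomMatches("S", OptionalInt.empty(), "S", thioetherSulfur));
        System.out.println(atomMatches("S", OptionalInt.empty(), "S", sulfonamideSulfur));

        // Query "S" restricted to valence 2 accepts only the thioether sulfur.
        System.out.println(atomMatches("S", OptionalInt.of(2), "S", thioetherSulfur));
        System.out.println(atomMatches("S", OptionalInt.of(2), "S", sulfonamideSulfur));
    }
}
```

A search that always enforced the default valence would behave like the constrained query in every case, which is the surprise described above.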





To reduce such confusion in the future, we plan to rename exact search to something more descriptive, and eventually perfect search will be renamed to exact search.





Actually, I am creating a separate forum topic to discuss the name of the now-exact-search. I am sure you all will contribute good ideas.





Regards,


Szabolcs

ChemAxon a3d59b832c

02-10-2008 09:47:11

Please see this forum topic about exact search renaming:


http://www.chemaxon.com/forum/ftopic4164.html

User 870ab5b546

02-10-2008 12:46:31

Gyuri wrote:
In my opinion, in case of molecules, it does not matter how a hydrogen is drawn, explicitly or implicitly. No difference between them. It is a hydrogen atom, an atom to be considered as any other atoms. In case of exact search, the query is a molecule (with all hydrogens), the target is a molecule as well (with all hydrogens). So I think, that exact matching should always consider all atoms including hydrogens.





A trivalent nitrogen should not match a pentavalent nitrogen in exact matching.
Szabolcs wrote:
I think there is a misunderstanding here. As I stated earlier, our exact search is a special type of substructure search where the heavy atom network of the query and target molecules must be equal for a match. All other features are treated the same way as for substructure search. (E.g. query atoms, query properties, stereochemistry, formal charges, radicals, etc.) As substructure search ignores implicit H-s and implicit valence of an atom, exact structure search will also.





On the other hand, exact search in other cheminformatics systems usually mean duplication (equality) search of molecules. This latter term is called perfect search at ChemAxon. As we agreed, perfect search (duplication search) behaves exactly that way as you propose.





However, I am sure all chemists would be highly surprised if substructure search required hydrogen number or valency check. This latter would mean that one-atom substructure queries without valence specification would only match to the default valency occurrences. (E.g. S would only retrieve C1CCSCC1 , but not CS(N)(=O)=O )


Regards,


Szabolcs
I think both of you are misunderstanding me.





First, I agree with Szabolcs that in substructure searches, valence should not matter. But I was not talking about substructure searching, I was only talking about what JChem now calls exact matching.





Second, I agree with Gyuri that in what JChem now calls perfect matching, all H atoms should correspond exactly, whether they are implicit or explicit. But I was not talking about perfect matching, I was talking about exact matching.





So we all agree that query C=NC should match to target C=N([H])([H])C in substructure searching, and should not in perfect searching. The question is, should they match in exact matching? I would say they should not, but the current behavior of JChem is that they do. (The reason I say not is that C=N([H])([H])C is much better described as [C-][N+]([H])([H])C.) So my suggestion is that in exact matching (not substructure searching), H atoms attached to hypervalent atoms should be replaced with pseudoatoms.

ChemAxon d76e6e95eb

02-10-2008 13:04:53

I am talking about exact matching too, and I agree with you, Bob. In case of exact matching, we should consider all hydrogens.





(The implicit hydrogens can be ignored on query atoms only (like hetero or atom list), because they cannot be determined.)

ChemAxon a3d59b832c

03-10-2008 09:55:01

We will discuss it internally and I will return later.

ChemAxon a3d59b832c

07-10-2008 14:45:10

There are arguments both for and against checking valence or hydrogen number.





Unfortunately, the extra checks have some technical prerequisites that we cannot undertake in version 5.2 (the next major version) for capacity reasons. The technical issues are mostly related to achieving consistent behavior in all file formats, as well as to handling full molecules and/or query features.

I suggest returning to this question when we discuss the planned 5.3 features.





Best regards,


Szabolcs