molconvert smiles to peptide bug and custom amino acid file

User 2a3dba2aa7

17-09-2014 11:20:08

Hi-

Two things: I have a SMILES of a molecule that contains a cyclopropylalanine. When I try to convert it to a single-letter sequence via molconvert (version 14.9.15.0) thus:

molconvert -g peptide:1 myfile.smi

then in theory it should output the correct sequence with an X(some-string) at the position of the unnatural amino acid (if defined in your standard dictionary or any custom dictionary), or it should refuse to convert and output a blank line. In fact, it outputs a single-letter sequence which is correct for all the other amino acids, but puts an L at the position of the unnatural amino acid (cPr-Ala). I guess it fails to recognize that there is an extra bond in the skeleton compared to Leu, but I can't find the definitions you use for standard natural and unnatural amino acids (looks as though they're hardcoded rather than in a text file). Is this information available anywhere?

Secondly, I have tried to add custom amino acids by creating a file in $HOME/.chemaxon/custom_aminoacids.dict as per your web documentation. However, lines such as

X(Tfa) X(Tfa) [SX2]1~[CH1]~[CH1]~[CH1]~[C]~1[CH2][CH1]([NX3])C=O 8 9

and numerous variants on that theme in which I've tried every SMARTS representation I can think of for 2-thiophenylalanine, just fail to get picked up. The documentation doesn't go into any detail and it's not easy to see why it's not working. Could you please provide some guidance?

Best,

Howard.

ChemAxon 2c555f5717

18-09-2014 12:40:12

Dear Howard,

I attach here two custom_aminoacid.dict files that have worked for me (with altered names). First I copied your definition and created a file exactly as you wrote it down, and it did not work for me. It contained spaces instead of tab characters, so I replaced them, and after that it worked. (But tabs might be changed by this forum engine, so for a better understanding could you send me your custom_aminoacid.dict file?)
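Because the forum software may turn tabs into spaces, it can help to check the separators in the file itself. The following is a minimal Python sketch, assuming each entry has five tab-separated fields (short name, long name, structure, and two trailing numbers); that layout is inferred from the example line in this thread, not taken from the documentation, and the function name `check_dict_lines` is hypothetical.

```python
def check_dict_lines(text, expected_fields=5):
    """Return (line_number, message) pairs for lines that look wrong.

    Assumption: fields are separated by single tab characters and each
    entry has `expected_fields` columns, as inferred from the thread.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != expected_fields:
            hint = "spaces instead of tabs?" if " " in line else "wrong column count"
            problems.append((number, f"{len(fields)} field(s) found; {hint}"))
    return problems


# A tab-separated entry, and the same entry with tabs replaced by spaces.
good = "X(Tfa)\tX(Tfa)\t[SX2]1[CH1]=[CH1][CH1]=[C]1[CH2][CH1]([NX3])C=O\t8\t9\n"
bad = good.replace("\t", " ")

print(check_dict_lines(good))
print(check_dict_lines(bad))
```

To check a real file, pass its contents to the function, for example `check_dict_lines(open(path).read())`.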

If you write the definition as you have, then your custom amino acid won't have any coordinates when you extract it (until you clean the structure). If you want to convert any structure to a cxsmarts representation with coordinates, you can use this molconvert format:

"c:\Program Files\ChemAxon\MarvinBeans\bin\molconvert" "cxsmarts:c" -2 inputFile.mrv -o output.cxsmi

Here the c parameter stands for coordinate export and -2 stands for 2D cleaning.

We have the following definition for Leu: [CH3][CH1]([CH3])[CH2][C@HH1]([NX3])C=O |wD:4.4,(-.36,1.33,;.98,2.1,;.98,3.64,;2.31,1.33,;3.64,2.1,;3.64,3.64,;4.98,1.33,;6.31,2.1,)|

We do not support changing the defaults in our amino acid dictionary, but I can send it to you via email if you want to examine it further.

I hope I could help.

Regards:
Balázs

User 2a3dba2aa7

18-09-2014 15:35:42

Dear Balazs,

Many thanks for getting back to me. I downloaded both your custom_aminoacids.dict files and tried them one after another, installed in the standard location (just to check that I'm putting this in the right place, I used this:

$HOME/.chemaxon/custom_aminoacids.dict

)

in each case (under Linux; Red Hat EL6). molconvert still refuses to output a peptide sequence for a peptide that contains the amino acid in question, using either the "with coordinates" or the "without coordinates" versions. I even tried it with just the amino acid itself (attached as qry.smi) in the input to the conversion, but that doesn't work either. I get this

$ molconvert peptide:1 qry.smi

qry.smi: error: Unmatched atoms in peptide chain. Likely amino acid template mismatch.

I'm sending you my original custom_aminoacids.dict file as an attachment, but I think it's identical to yours except that there's a linefeed at the end of the line. I did use tabs, so they must have been translated to multiple spaces by the forum engine. Either way, it doesn't work with my file or with the ones I downloaded from your reply.

Maybe this is a Linux-only problem?

On the other issue, the SMARTS processing of the Leu definition is also not working as it should - you clearly have the two terminal methyl groups defined as CH3 in the definition you included in your message, but it's picking up cyclopropylalanine anyway. It's as though it just ignored the hydrogen counts. Maybe there is some problem with the library that handles this under Linux, but not under Windows? So I get this :

$ molconvert peptide:1 cPrAla.smi
L

I hope these example files help!

Best,

Howard.

ChemAxon 2c555f5717

18-09-2014 16:14:04

Dear Howard,

Sorry for the misunderstanding, I was only testing whether your definition can be recognized by our Sketcher. I need to discuss your problem further with my colleagues. Please be patient, we will try to answer as soon as possible.

Regards:
Balázs

ChemAxon 2c555f5717

19-09-2014 14:35:14

Dear Howard,

The experts at ChemAxon told me that your structure contains query features, which cannot be used in amino acid definitions: the method requires an exact match in atoms and bonds, and "any" bonds do not satisfy this criterion.
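As a rough illustration of the point about "any" bonds, here is a minimal Python sketch that only looks for the `~` symbol in a structure string. It is plain text scanning, not a SMARTS parser, and it does not detect other kinds of query features; the function name `any_bond_positions` is hypothetical.

```python
def any_bond_positions(smarts):
    """Return the 0-based string positions of '~' (any-bond) symbols."""
    return [i for i, ch in enumerate(smarts) if ch == "~"]


# The definition posted at the start of this thread.
original = "[SX2]1~[CH1]~[CH1]~[CH1]~[C]~1[CH2][CH1]([NX3])C=O"
# The same skeleton written with explicit single and double bonds.
explicit = "[SX2]1[CH1]=[CH1][CH1]=[C]1[CH2][CH1]([NX3])C=O"

print(len(any_bond_positions(original)), "any-bond symbol(s) in the original")
print(len(any_bond_positions(explicit)), "any-bond symbol(s) in the explicit form")
```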

You can do a similar thing through JChem Search, but for this you need a JChem license. If you are interested in a solution using the JChem API, I can put you in touch with the experts.

Regards:
Balázs

User 2a3dba2aa7

22-09-2014 09:11:33

Hi Balazs,

Thanks for getting back to me again. But I think there may be a misunderstanding here - and I'm not quite sure which of my two issues you're referring to.

Issue number one is that if I have a structure (in a SMILES file) that contains a 2-thiophenylalanine, and nothing that is not an amino acid, then I think it ought to be detected by molconvert if there's a suitable entry defining 2-thiophenylalanine in the custom_aminoacids.dict file, and the relevant amino acid name (short or long) output. Or does the custom_aminoacids.dict file only work if you have a jchem license or explicitly hook something up via the API?

Issue number two is that if I have a SMILES string that contains an amino acid sequence (of one or more amino acids) that contains cyclopropylalanine, then that cyclopropylalanine is incorrectly recognized by molconvert as a leucine. That is, the hydrogen count in the built-in amino acid dictionary is ignored, and/or additional bonds in the structure supplied for conversion are ignored. That's a bug, I think, since molconvert should not report "Leu" for anything other than an exact match to the skeleton of Leu, with no additional bonds and with all hydrogens exactly as in leucine (or its C-terminal amide; that makes sense).

Can you help me clarify this?

Many thanks,

Best,

Howard.

ChemAxon d26931946c

23-09-2014 09:54:08

Hi Howard,

Thank you for clarifying your issues.

1) Custom amino acid support should work without any license. You might be able to define custom amino acids that contain special query features that require a license, but this is not the case here.

2) You're right, we incorrectly recognize cyclopropylalanine as leucine. We'll investigate and fix this issue soon.

Thank you for your report, we'll get back to you when it's fixed.

BRs,

Peter

User 2a3dba2aa7

23-09-2014 13:48:47

Thanks for getting back to me, Peter.

You're right, I'm not (I think) using any special features in my custom amino acid definition - the file's attached, as is the SMILES for an amino acid I think it should recognize.

Great to hear you're on to the cyclopropylalanine issue - not a huge deal for me, but it will be good to have that aspect tidied up (and maybe there's something more general there that will be fixed at the same time).

Best,

Howard.

ChemAxon d26931946c

24-09-2014 12:51:37

Hi Howard,

I think the problem is with the definition. The dictionary should contain exact structures, while your structure contains several "any" bonds. I think this is due to the fact that you want to recognize this amino acid in aromatic and in Kekule form too. To achieve this, you have to define the amino acid twice, once in Kekule and once in aromatic form:

X(Tfa) X(Tfa) [SX2]1[CH1]=[CH1][CH1]=[C]1[CH2][CH1]([NX3])C=O 8 9

X(Tfa) X(Tfa) [sX2]1[cH1]=[cH1][cH1]=[c]1[CH2][CH1]([NX3])C=O 8 9
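Since each representation needs its own line and the separators must be tabs, one way to avoid editor or copy-and-paste problems is to generate the lines. The following is a minimal Python sketch, assuming the column layout seen in the examples above (short name, long name, structure, two trailing numbers). The meaning of the trailing numbers is not explained in this thread, so they are passed through unchanged, and the helper name `make_entries` is hypothetical.

```python
def make_entries(short_name, long_name, structures, num_a, num_b):
    """Build one tab-separated dictionary line per structure string.

    Assumption: the column layout (short name, long name, structure,
    two trailing numbers) follows the example lines in this thread.
    """
    return [
        "\t".join([short_name, long_name, s, str(num_a), str(num_b)])
        for s in structures
    ]


tfa_forms = [
    "[SX2]1[CH1]=[CH1][CH1]=[C]1[CH2][CH1]([NX3])C=O",  # Kekule form, as posted
    "[sX2]1[cH1]=[cH1][cH1]=[c]1[CH2][CH1]([NX3])C=O",  # aromatic form, as posted
]
lines = make_entries("X(Tfa)", "X(Tfa)", tfa_forms, 8, 9)
print("\n".join(lines))
```

The printed lines can be appended to the custom dictionary file; further representations of the same amino acid can be added to the list under the same names.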

And just for clarification, the name of the file should be custom_aminoacid.dict, without the ".txt" ending.

BRs,

Peter

User 2a3dba2aa7

24-09-2014 14:05:31

Peter,

Thanks for getting back to me again. I hadn't realized that the apparent "SMARTS pattern" in the custom_aminoacids.dict file needed to be 100% specific to each representation of the molecule that could be used in the input - almost like a molecule match rather than a substructure match - nor that you could have multiple lines with the same aminoacid names. That combination now works just fine (in fact, I've added a couple more definitions for good measure including an all-aromatic one with : bonds specified) and outputs the expected strings.

On the filename, I just stuck a .txt on the end of the filename to have the file recognized by the forum engine - some of them won't eat arbitrary file extensions, and I don't know about this one!

Best,

Howard.

ChemAxon d26931946c

24-09-2014 14:09:14

Thanks for the feedback, I'm glad that it's working now. I'll check our documentation to make this behaviour more clear.

Best,

Peter

User 2a3dba2aa7

24-09-2014 15:40:33

One final point. I put a definition for the straight-chain aminoacid norleucine (Nle) into my custom aminoacids dictionary file. If I use a simple definition like

[CH3][CH2][CH2][CH2][CH1]([NX3])C=O

this seems to find a match to cyclopropylalanine (that is, the current behaviour of the default dictionary that matches cyclopropylalanine to Leucine no longer happens, and an error is generated; it doesn't recognize cPrAla as Nle). I get the same behaviour - an error instead of the correct string - if I put in a correct definition of cyclopropylalanine into the custom aminoacid dictionary, I guess because the program finds more than one hit to the structure, one in the default dictionary as Leu and one in custom_aminoacids.dict as cPrAla. However, if I use a definition for Nle like

[CH3R0][CH2R0][CH2R0][CH2R0][CH1]([NX3])C=O

then that works fine. It picks up Nle and doesn't seem to modify the behaviour of the default dictionaries or match to cPrAla. I thought this info might be useful to you - looks as though the SMARTS processing is handling ring counts better than hydrogen counts.
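For anyone wanting to apply the same workaround to other open-chain side chains, here is a minimal Python sketch that appends R0 to the first few bracketed atoms of a definition. It is purely a text substitution and does not validate the SMARTS; the function name `add_ring_zero` is hypothetical.

```python
import re


def add_ring_zero(smarts, count):
    """Append R0 to the first `count` bracketed atoms of a SMARTS string."""
    return re.sub(r"\[([^\]]+)\]", r"[\1R0]", smarts, count=count)


# The simple Nle definition from above; constrain the four side-chain carbons.
nle = "[CH3][CH2][CH2][CH2][CH1]([NX3])C=O"
print(add_ring_zero(nle, 4))
```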

Best,

Howard.