document to structure

User 421219f4d4

29-12-2011 12:41:33

The attachment contains IUPAC names, some of them fairly complex, and I want to convert them to structures.


I tried using document to structure through MarvinView, but none of these IUPAC names were converted.


Please let me know what the issue is.

ChemAxon b124dd5f17

29-12-2011 17:58:24

Hi, the attached PDF is a scanned image, and there is no support yet for extracting text from scanned images. Did you try, or can you get, the text version of the PDF?

User 421219f4d4

30-12-2011 06:21:19

Hi Alexa,


Thanks for your reply.


I used the Marvin 5.8 test version, which says that "Automatic text OCR (optical character recognition) has been added to support document to structure conversion of scanned (non searchable) PDF documents." as mentioned here. I could extract a lot of structures from the full PDF file, but these particular pages were not converted.


Let's take the following four names as a case study:


 


(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-1-{3-[(phenylcarbonyl)sulfanyl]propanoyl}-L-proline ditrifluoroacetate salt


N-[(1S)-1-carboxy-3-phenylpropyl]-L-alanyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-proline ditrifluoroacetate salt


Methyl N-[(2S)-1-ethoxy-1-oxo-4-phenylbutan-2-yl]-L-alanyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-prolinate ditrifluoroacetate salt


N-[(1S)-1-carboxy-3-phenylpropyl]-L-valyl-(4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-L-proline dilithium salt.


If I type the first name in MarvinSketch->Edit->Import name without the "ditrifluoroacetate salt" part, the structure is retrieved.


But all the rest give an error, with or without the salt component. I have attached the error log.

ChemAxon e7b9408ca1

30-12-2011 22:43:12

Hi Surojit,


Thanks for the testing and the detailed report. I'm glad you found extraction working in many cases.


For the attached pages, it seems the main problem is that the patent has line numbers at the beginning of each line. For names that span two lines, the line number ends up in the middle of the name, which prevents the conversion. We will work on a solution, probably for 5.9.
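

Until then, one possible stopgap is to clean the extracted text before feeding it to the name converter. The sketch below is only an illustration in Python and is not part of Marvin: the function name `strip_line_numbers` is hypothetical, and it assumes every line number is one to three digits followed by whitespace, and that a name broken across lines ends its first line with a hyphen.

```python
import re

# Assumption: a line number is 1-3 digits at the start of the line,
# followed by whitespace. "4-amino..." is NOT matched, because the
# digit is followed by a hyphen rather than a space.
LINE_NUMBER = re.compile(r"^\s*\d{1,3}\s+")


def strip_line_numbers(page_text):
    """Remove leading line numbers and rejoin names split across lines."""
    parts = []
    for raw in page_text.splitlines():
        line = LINE_NUMBER.sub("", raw).strip()
        if not line:
            continue
        if parts and parts[-1].endswith("-"):
            # Previous line ended mid-name: glue without a space.
            parts[-1] += line
        else:
            parts.append(line)
    return " ".join(parts)


if __name__ == "__main__":
    page = (
        "5 (4S)-4-{[(3R)-3-amino-4-(2,4,5-trifluorophenyl)butanoyl]amino}-1-\n"
        "6 {3-[(phenylcarbonyl)sulfanyl]propanoyl}-L-proline"
    )
    print(strip_line_numbers(page))
```

This heuristic will also strip a genuine leading number from ordinary prose (for example a line starting with "10 mg"), so it is only suitable for pages where every line is known to carry a line number.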


Of the four names you extracted, I found that 5.7 indeed converts only the first one (and it does so even with the ditrifluoroacetate salt part included). However, 5.8 converts all four of them. Can you confirm that?


Best regards,


Daniel

User 421219f4d4

31-12-2011 08:28:19

Hi Daniel,


 


Thanks for the information.


I have successfully converted the IUPAC names to structure using Marvin 5.8. :)


One more query:


If I have an IUPAC name as an image, and I convert that image into a PDF, why doesn't it extract the structure?

ChemAxon e7b9408ca1

31-12-2011 08:38:47

surojit.sadhu wrote:

If I have an IUPAC name as an image, and I convert that image into a PDF, why doesn't it extract the structure?



Could you attach the pdf?

User 421219f4d4

31-12-2011 08:55:14

Please find attached the pdf.


If I try to extract the structure from this file, I only get the acetate salt; the rest is missing.

ChemAxon e7b9408ca1

31-12-2011 11:24:31

There are OCR errors on this image. I have improved the situation in the 5.9 branch.