Whole document parsing

User 5adfeb8d26

09-06-2009 10:54:54

Hi,

I'd like to parse a whole document (text) and extract the molecules as either an array or collection.

Looking at the documentation, it appears that MolImporter.importDoc might work. Am I on the right track?

Cheers

Luke

ChemAxon e7b9408ca1

10-06-2009 07:40:15

Hi Luke,

No, name-to-structure extraction from documents has not been released yet. We plan to include it in the next version, 5.2.3, in about a month. If that schedule is a problem, we can discuss an early evaluation version.

Cheers,

Daniel

User 5adfeb8d26

10-06-2009 08:22:05

Hi Daniel,

OK, I'd really like to try the development version on evaluation. Alternatively, could you provide some guidance on how to approach parsing a body of text? For example, is it reasonable to attempt to parse every token?

Thanks

Luke

ChemAxon e7b9408ca1

10-06-2009 14:11:23

It is of course possible to attempt every token, but you would miss names that span multiple tokens. You can try concatenating several adjacent tokens, but that can lead to poor performance, and it will not handle OCR errors, names broken by extra spaces or dashes, and so on. That's why a document parser is non-trivial.
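To make the token-concatenation idea concrete, here is a rough sketch in Python. It is only an illustration, not the ChemAxon implementation: extract_names and is_name are hypothetical names, and is_name stands in for whatever name-to-structure check you have available. The sketch tries the longest run of adjacent tokens first and falls back to shorter runs, so a multi-token name is preferred over its single-token parts.

```python
from typing import Callable, List


def extract_names(text: str, is_name: Callable[[str], bool], max_len: int = 3) -> List[str]:
    """Greedy scan: at each position, try the longest token run first.

    is_name is a hypothetical recognizer (an assumption of this sketch);
    it should return True when the candidate string is a chemical name.
    """
    tokens = text.split()
    found = []
    pos = 0
    while pos < len(tokens):
        longest = min(max_len, len(tokens) - pos)
        for length in range(longest, 0, -1):
            # Drop trailing sentence punctuation only; commas and
            # parentheses inside a name are left untouched.
            candidate = " ".join(tokens[pos:pos + length]).rstrip(".,;:")
            if is_name(candidate):
                found.append(candidate)
                pos += length  # skip past the tokens just consumed
                break
        else:
            pos += 1  # no run starting here matched
    return found


if __name__ == "__main__":
    # Toy recognizer backed by a fixed set, purely for demonstration.
    known = {"sodium chloride", "ethanol", "acetic acid"}
    sample = "Dissolve sodium chloride in ethanol, then add acetic acid."
    print(extract_names(sample, lambda s: s.lower() in known))
```

In the worst case the recognizer is called up to max_len times per token, which is where the performance cost comes from. The sketch also does nothing about OCR errors or names broken by stray spaces or dashes.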

I'll contact you in a couple of days about an evaluation.

Cheers,

Daniel