Finding offsets with DocumentToStructure

User 5bdc9884eb

13-07-2015 09:59:24

Hello,

I am trying to locate chemical names in text: given a string (e.g. a sentence), I want the start and end character offset of each chemical it mentions. Is this possible? I tried DocumentToStructure (JChem version 14.11.10.0), but it is not obvious to me how to do this. My understanding is that the CHARACTER and END_CHARACTER fields are supposed to contain the offsets, but to process a string with this class I need the static method process(String text), which does not seem to update these fields. Could you please explain how to do this with DocumentToStructure (or any other ChemAxon tool)? An example would be appreciated.

Thanks in advance!

ChemAxon e7b9408ca1

13-07-2015 14:37:35

Here's a code example that does that:

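import chemaxon.naming.DocumentToStructure;
import chemaxon.struc.Molecule;

String text = "This is a text mentioning aspirin and benzene";
// process(String) returns one Molecule per chemical name recognized in the text
for (Molecule m : DocumentToStructure.process(text)) {
    String name = m.getName();
    // The offsets are stored as properties on each Molecule, keyed by these constants
    int start = (Integer) m.properties().getObject(DocumentToStructure.CHARACTER);
    int end   = (Integer) m.properties().getObject(DocumentToStructure.END_CHARACTER);
    System.out.println(name + ": " + start + "-" + end);
}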

Output:

aspirin: 27-34
benzene: 39-46
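
If you want to use these offsets to slice the original string, it may be worth checking first how they line up with Java's zero-based indices. In the sample above, text.indexOf("aspirin") is 26, while the reported start is 27, so the values look shifted by one here. Whether that holds in general is an assumption drawn only from this output. The sketch below uses plain Java with no ChemAxon calls; OffsetCheck, slice and detectShift are hypothetical helper names, and the spans are copied by hand from the output above.

```java
class OffsetCheck {
    /**
     * Returns the substring covered by a reported span.
     * shift is a hypothetical correction: the amount to subtract from the
     * reported offsets to get Java's zero-based [start, end) indices.
     */
    static String slice(String text, int start, int end, int shift) {
        return text.substring(start - shift, end - shift);
    }

    /**
     * Tries shift 0 and shift 1 and returns the first one under which the
     * span reproduces the expected name, or -1 if neither does.
     */
    static int detectShift(String text, String name, int start, int end) {
        for (int shift = 0; shift <= 1; shift++) {
            int s = start - shift;
            int e = end - shift;
            boolean inRange = s >= 0 && s <= e && e <= text.length();
            if (inRange && text.substring(s, e).equals(name)) {
                return shift;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "This is a text mentioning aspirin and benzene";
        // Spans copied by hand from the output above, not produced by a library call
        System.out.println(detectShift(text, "aspirin", 27, 34)); // prints 1
        System.out.println(detectShift(text, "benzene", 39, 46)); // prints 1
        System.out.println(slice(text, 27, 34, 1));               // prints aspirin
    }
}
```

Running a check like this on a few known names before relying on the offsets avoids silently cutting out the wrong characters.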

Does this cover your needs?

Theodosia

User 5bdc9884eb

13-07-2015 15:58:46

Thanks! That's exactly what I wanted!

ChemAxon e7b9408ca1

14-07-2015 07:54:13

Great, you're welcome!