DocumentExtractor doesn't work properly for some text files

User a8852677c2

15-11-2010 14:04:33

Hi,


I have a text file containing this text.


A dicationic bis-hydrazone compound according to claim 1  wherein the compound is chosen from the following compounds: 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[3-(4-{(E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)propyl]pyridinium dibromide  4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[4-(4-((E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)butyl]pyridinium dibromide 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[5-(4-{(E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)pentyl]pyridinium dibromide 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[6-(4-{(E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)hexyl]pyridinium dibromide 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-(6-{[4-({[6-(4-{(E)-[methyl(phenyl)hydrazono]methyl}pyridin ium-1-yl)hexyl]amino}carbonyl)benzoyl]amino}hexyl)pyridinium dibromide 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[6-(4-{(E)-[methyl(phenyl)hydrazono]methyl}quinolinium-1-yl)hexyl]quinolinium dibromide 4-{(E)-[methyl(phenyl)hydrazono]methyl}-1-[6-(4-{(E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)hexyl]quinolinium dibromide 1-methyl-3-[5-(1-methyl-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-imidazol-3-ium-3-yl)pentyl]-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-imidazol-3-ium dibromide 1-methyl-3-[4-(1-methyl-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-benzimidazol-3-ium-3-yl)butyl]-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-benzimidazol-3-ium dibromide 1-[6-(1-methyl-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-imidazol-3-ium-3-yl)hexyl]-4-{(E)-[methyl(phenyl)hydrazono]methyl}quinolinium dibromide 1-[6-(1-methyl-2-{(E)-[methyl(phenyl)hydrazono]methyl}-1H-benzimidazol-3-ium-3-yl)hexyl]-4-{(E)-[methyl(phenyl)hydrazono]methyl}quinolinium dibromide  1 1-pentane-1 5-diylbis(4-{(E)-[methyl(phenyl)hydrazono]methyl}quinolinium) dibromide  1 1-butane-1 4-diylbis(4-{(E)[methyl(phenyl)hydrazono]methyl}quinolinium) dibromide  1 1-propane-1 3-diylbis(4-{(E)-[methyl(phenyl)hydrazono]methyl)quinolinium) dibromide  2-{(E)-[[4-[(4-methoxyphenyl) 
(methyl)amino]phenyl}(methyl)hydrazono]methyl}-3 3-dimethyl-1-[6-(4{(E)-[methyl(phenyl)hydrazono]methyl}pyridinium-1-yl)hexyl]-3H-indolium dichloride  1 1-hexane-1 6-diylbis(2-{(E)-[{4-[(4-methoxyphenyl)(methyl)amino]phenyl}(methyl)hydrazono]methyl]-3 3-dimethyl-3H-indolium) dichloride.


 


 


DocumentExtractor x = new DocumentExtractor(srcFile);
x.processHTML();
System.out.println("ok");


The DocumentExtractor call hangs and "ok" is never printed.


Can you tell me why this happens? For other text files it works properly.


 JChem version 5.3.4.


Java : jdk1.6.0_16


OS : Windows XP.


 


Thanks & Regards


Yogesh

ChemAxon a3d59b832c

15-11-2010 14:19:57

Hi Yogesh,


 


I have moved your question to the naming section of our forum. My colleagues will check it and answer soon.


Best regards,


 


Szabolcs

ChemAxon e7b9408ca1

16-11-2010 09:44:21

Dear Yogesh,



Thank you very much for reporting this issue. After an initial assessment, it looks like the upcoming version 5.4 will improve the situation: it does finish on this text, unlike 5.3. I will let you know more once I have finished investigating.
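Until a fixed version is available, one possible stopgap is to run the extraction under a timeout, so that a hang on one file does not block the rest of the application. The following is only a minimal sketch in plain Java: the class TimeoutRunner and the method runWithTimeout are hypothetical names, not part of JChem, and the sleeping task stands in for the real DocumentExtractor call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRunner {

    // Hypothetical helper (not part of JChem): runs the task on a worker
    // thread and returns false if it has not finished within timeoutMillis.
    public static boolean runWithTimeout(Callable<Void> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Requests interruption; code that never checks the interrupt
            // flag may keep running in the background.
            future.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real work, e.g. the DocumentExtractor call
        // from the original post.
        Callable<Void> slowTask = new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                Thread.sleep(5000);
                return null;
            }
        };
        boolean finished = runWithTimeout(slowTask, 200);
        System.out.println(finished ? "ok" : "timed out");
    }
}
```

Note that cancelling only interrupts the worker thread. If the hanging code ignores interruption, the thread keeps consuming CPU, so running the extraction in a separate process may be the safer choice for batch jobs.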


Best regards,


Daniel

ChemAxon e7b9408ca1

19-11-2010 17:06:50

So, 5.4 does finish on this text, though it still has problems with the structures, partly because of OCR/formatting errors in the input. I have made further improvements, which will probably be released later, in 5.4.1. With them, all of the names are recognized. This includes automatically fixing OCR errors in the text, such as the missing commas in "1 1-hexane-1 6-diyl", which should read "1,1-hexane-1,6-diyl".
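For users who want to clean such input themselves before extraction, the missing-comma case can be approximated with a simple text substitution. This is a rough sketch of one heuristic, not the logic used inside JChem: the class LocantCommaFix and the method restoreCommas are hypothetical names, and the rule assumes that a single space directly between two digits is a lost comma.

```java
public class LocantCommaFix {

    // Hypothetical heuristic (not ChemAxon's implementation): a single space
    // that sits directly between two digits is replaced by a comma.
    public static String restoreCommas(String text) {
        return text.replaceAll("(?<=\\d) (?=\\d)", ",");
    }

    public static void main(String[] args) {
        System.out.println(restoreCommas("1 1-hexane-1 6-diyl"));
    }
}
```

The rule is deliberately naive: it would also join ordinary numbers that are separated by a single space in running prose, so it is best applied only to text that is known to consist of chemical names.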


The only remaining issue is that without any formatting between the names, it can be hard to be sure where one compound ends and the next one starts, so several names are understood together as a larger compound.
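If the input can be pre-processed, one way to reduce this ambiguity is to insert boundaries before extraction. In the text above every compound name ends with a counter-ion word ("dibromide" or "dichloride"), so a split after those words separates the names. The sketch below assumes exactly that; the class NameSplitter and the method splitAfterCounterIon are hypothetical, and the word list would need to be adapted for other documents.

```java
import java.util.ArrayList;
import java.util.List;

public class NameSplitter {

    // Hypothetical pre-processing step (not part of JChem): splits the text
    // at whitespace that directly follows "dibromide" or "dichloride".
    public static List<String> splitAfterCounterIon(String text) {
        List<String> names = new ArrayList<String>();
        for (String part : text.split("(?<=dibromide|dichloride)\\s+")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Shortened placeholder names, used only to show the splitting.
        String text = "first-name dibromide  second-name dibromide third-name dichloride";
        for (String name : splitAfterCounterIon(text)) {
            System.out.println(name);
        }
    }
}
```

Each resulting piece could then be written on its own line or passed to the extractor separately, so that neighbouring names are no longer read as one larger compound.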


Thanks again for your report. I hope the improvements in 5.4 and 5.4.1 will be useful to you. Let me know if you find other issues or need specific support.