User 677b9c22ff
17-11-2008 23:16:54
Hello Daniel,
chemicalize.org is a useful service and the non-invasive output looks good.
As it stands, the system works well as a chemical reading enhancer.
Similar projects (with different scopes) are:
* Project Prospect
* OSCAR3/OPSIN, which also power Project Prospect
* IBM Patent Search
* SureChem patent search, powered by ACDName
* ChemMantis (not online yet)
* LEXICHEM, which was used for PubChem
* The systems in use at CAS and Beilstein
Common problems are:
1) how to filter out non-chemical noise using stop lists,
2) how to detect complex names such as
2,15-dimethyl-14-(6-methylheptan-2-yl)tetracyclo[8.7.0.0^{2,7}.0^{11,15}]heptadec-7-en-5-ol
(which is cholesterol, as named by Marvin),
3) how to deal with the different formats of chemistry publications, patents and websites,
4) how to apply semantic filters and use curated vocabularies or ontology sets (IUPAC/ChEBI).
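As a rough illustration of problems 1) and 2), here is a minimal Python sketch. The stop list, the suffix list and the function names are all made up for this example; a real system such as OSCAR3 or OPSIN works very differently and far more thoroughly.

```python
import re

# Hypothetical stop list: ordinary English words that collide with
# element symbols or element names and therefore produce noise.
STOP_LIST = {"as", "at", "in", "be", "he", "no", "lead"}

# A locant is one or more comma-separated numbers followed by a hyphen,
# e.g. "2,15-" or "7-". Systematic names usually contain at least one.
LOCANT = re.compile(r"\d+(?:,\d+)*-")

# A few common endings of systematic names (assumption, far from complete).
SUFFIXES = ("ol", "ane", "ene", "yne", "one", "ate", "ide", "ine", "yl", "acid")


def filter_candidates(tokens):
    """Drop candidate tokens that appear in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_LIST]


def looks_systematic(token):
    """Crude heuristic: the token has a locant and a typical chemical ending."""
    return bool(LOCANT.search(token)) and token.lower().endswith(SUFFIXES)


name = ("2,15-dimethyl-14-(6-methylheptan-2-yl)"
        "tetracyclo[8.7.0.0^{2,7}.0^{11,15}]heptadec-7-en-5-ol")
kept = filter_candidates(["In", "ethanol", "lead", "As"])
is_systematic = looks_systematic(name)
```

The heuristic only flags candidates; deciding whether a flagged string is a valid name still needs a real name-to-structure converter.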
Applications are limitless:
a) Build a web crawler which crawls the web and allows substructure search.
This should actually be built into Google Scholar, the only website which has
access to most of the digitized chemical literature (except CAS and Beilstein).
b) Build chemistry-enhanced websites (as in Project Prospect)
c) Prepare documents prior to submission to journals
d) Analyze journals after submission to find chemicals
e) Chemical text mining on full texts (not only Medline abstracts)
I am quite sure that, using the ChemAxon API and the JChem cartridge, one could
build such a stand-alone service or program or, as you showcased,
one can use it as a web service. Still, having it as a stand-alone program
would be nice.
The question is where to go with chemicalize.org?
(I)
I would use chemicalize.org for existing documents
to read through them. The problem here is that the proxy cannot
access the subscription literature (which is the majority).
The solution to that problem would be to download the whole
document locally and run the service on it, but that does not work.
Second problem: most publications are in PDF, so PDF-to-HTML conversion is
needed, which is even harder to perform (even with full Acrobat
it is a mess and usually fails). The system should also perform
OCR to convert chemical pictures to structures, as done with
systems like Kekule, Clide, OSRA and ChemOCR.
(II)
I would use chemicalize.org for existing web documents
to obtain a list of names, canonical SMILES, InChIs and InChIKeys.
That could be exported as tab-separated TXT or XLS using
a small button on top of the document. That would be a really helpful extension.
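To make the export idea concrete, here is a small Python sketch of what such a tab-separated file could look like. The column names and the `to_tsv` helper are assumptions for illustration only and say nothing about how chemicalize.org is built; the two records are hand-written examples.

```python
import csv
import io

# Assumed column layout for the export (not taken from chemicalize.org).
FIELDS = ["name", "smiles", "inchi", "inchikey"]


def to_tsv(records):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hand-written example records standing in for names found in a document.
records = [
    {
        "name": "ethanol",
        "smiles": "CCO",
        "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
        "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    },
    {
        "name": "benzene",
        "smiles": "c1ccccc1",
        "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
        "inchikey": "UHOVQNZJYSORNB-UHFFFAOYSA-N",
    },
]

tsv_text = to_tsv(records)
```

A tab delimiter is convenient here because InChI strings contain commas, which would otherwise force quoting in a comma-separated file.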
(III)
Attach the PubChem names, which are free to download,
as a lexicon for this service, because many common names
are not covered in this implementation.
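A lexicon lookup of this kind could be sketched in Python roughly as follows. The three (CID, name) pairs are a tiny hand-written stand-in for a downloaded PubChem name list, and `build_lexicon` and `find_names` are hypothetical helpers, not part of any existing service.

```python
def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_lexicon(pairs):
    """Map each normalized name to its identifier, from (identifier, name) pairs."""
    return {normalize(name): identifier for identifier, name in pairs}


def find_names(text, lexicon, max_words=4):
    """Greedy longest-match scan of the text against the lexicon.

    At each position the longest phrase (up to max_words words) that is
    in the lexicon wins; surrounding punctuation is stripped first.
    """
    words = text.split()
    hits = []
    i = 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n]).strip(".,;:()")
            candidate = normalize(phrase)
            if candidate in lexicon:
                hits.append((candidate, lexicon[candidate]))
                i += n
                break
        else:
            # No phrase starting at this word matched; move on by one word.
            i += 1
    return hits


# Tiny illustrative stand-in for a PubChem name list: (CID, name).
pairs = [
    (2244, "aspirin"),
    (2244, "acetylsalicylic acid"),
    (2519, "caffeine"),
]
lexicon = build_lexicon(pairs)
hits = find_names("Patients received acetylsalicylic acid and caffeine.", lexicon)
```

A plain dictionary keeps the sketch short; with millions of names, the same idea would need a more compact structure and the stop list discussed above to suppress false hits.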
BTW, there is also a nice PPT by David Wild from an ACS meeting:
Integrating text and literature sources
with traditional chemoinformatics tools
Cheers
Tobias