Document to structure batch conversion with KNIME

User 380f434589

14-03-2012 10:06:33

Dear ladies and gentlemen,


I want to try Document to Structure as a batch converter together with KNIME (JChemExtensions). I was able to do the conversion for a single PDF document and save the data to a file. My problem is that the conversion has to be done for about 100 documents, and I would like to automate this process. Unfortunately, the Document Extractor node does not seem to have an input.


Does anyone know a solution for this problem?


Kind regards,


Daniel

User 5458277630

14-03-2012 10:52:56

Hi,

If you use loop nodes with the Document Extractor node, you can process multiple documents in one KNIME workflow.

I have attached an example workflow.
Could you please check it?
(Please use the Import Workflow feature.)

NOTE:
You have to specify an input file path for the Document Extractor node as a dummy file.
However, when you execute the workflow, that path is overwritten by each entry in the list of files from the List Files node.

Also, please check the following steps:

1. Does the output of the List Files node show the list of files you want to import?
2. Select the "Location" variable for "file_name" in the Document Extractor.
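
In case it helps to see the idea as code, the loop behaves roughly like the Python sketch below. This is only an analogy, not KNIME or JChem code: `list_files`, `batch_extract`, and `extract_structures` are hypothetical names, and `extract_structures` is a placeholder for whatever the Document Extractor node does with one file.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def list_files(directory: str, pattern: str = "*.pdf") -> List[str]:
    """Rough analogue of the List Files node: one location per matching file."""
    return sorted(str(p) for p in Path(directory).glob(pattern))


def batch_extract(
    locations: Iterable[str],
    extract: Callable[[str], List[str]],
) -> List[str]:
    """Rough analogue of the loop: run the extractor once per location."""
    results: List[str] = []
    for location in locations:
        # The dummy path configured in the node plays no role here: each
        # iteration uses the current location, like the "Location" variable.
        results.extend(extract(location))
    return results


def extract_structures(location: str) -> List[str]:
    """Hypothetical placeholder for the Document Extractor node."""
    return [f"structure found in {Path(location).name}"]
```

Calling `batch_extract(list_files("my_documents"), extract_structures)` would then collect the results of all PDF files in the (assumed) folder `my_documents` into one list.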

Best,
Taka

User 380f434589

14-03-2012 12:44:46

Hello Tohshima,


Thanks for your quick answer. Unfortunately, the loop does not work properly, or maybe I am using it in the wrong way. I have attached my workflow to this message. What I did was:


 


1. I set the input of the List Files node to my test directory (I used 3 example files).


2. I put one of these 3 files into the Document Extractor as a dummy.


3. I changed the end of the workflow so that the results (without duplicates) are written to an Excel file.


What I noticed is that I only seem to get the results from the dummy file. When I change it, the resulting XLS file changes. When I use a completely different dummy file without any chemical information, the node recognizes just two words in it and writes them to the Excel file.


Do you have any idea where the problem lies?


Best regards,


Daniel

User 5458277630

15-03-2012 01:25:38

Hi Daniel,

You have to select the "Location" variable for "file_name" in the Document Extractor.
Could you please check sample.zip (sample.doc)?

Best,
Taka

User 380f434589

15-03-2012 07:57:56


Hello Tohshima,



I had forgotten to specify this variable. Now it is working! Thanks a lot for your help.


Maybe you can also help me with another issue I am facing? I have an Excel list with about 10000 substance names and CAS numbers. Not all of the substances have a CAS number. I would like to get a clean file including structures for this list, standardized and checked. Is there a predefined workflow for this task?


Best regards,


Daniel

User 5458277630

15-03-2012 09:42:04

Hi Daniel,

I think you might be able to use the XLS Reader and Text Extractor nodes.

I have attached an example workflow and a sample file.
Could you please check them?

The sample file has two columns, "name" and "cas".
Therefore, you can use the Column List Loop nodes.
You have to select the "currentColumnName" variable for "text_column" in the Text Extractor.
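
As a rough picture of what the column loop does, here is a small Python sketch. It is an analogy only, not KNIME or JChem code: `column_list_loop` and the `extract` callback are hypothetical, with `extract` standing in for the Text Extractor node. Skipping empty cells is my own assumption, added because not every row has a CAS number.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def column_list_loop(
    rows: List[Row],
    columns: List[str],
    extract: Callable[[str], List[str]],
) -> List[Dict[str, object]]:
    """Run the extractor once per column, like the Column List Loop nodes."""
    hits: List[Dict[str, object]] = []
    for current_column_name in columns:  # plays the role of "currentColumnName"
        for row_index, row in enumerate(rows):
            text = row.get(current_column_name, "").strip()
            if not text:
                # Assumption: cells without a value (e.g. no CAS number) are skipped.
                continue
            for hit in extract(text):
                hits.append(
                    {"row": row_index, "column": current_column_name, "hit": hit}
                )
    return hits
```

With the sample file, `columns` would be `["name", "cas"]`, so every row is tried first by name and then by CAS number.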

Best,
Taka

User 380f434589

15-03-2012 13:12:31

Hi Tohshima,


Thanks again for your quick answer! I will check this as soon as I have finished my other task. Meanwhile, another problem occurred with the document to structure process. It seems that I have to use MrvCell to get all the necessary results out of my documents. They contain normal structures, text and Markush structures, but I don't know if that makes any difference in the process.


My problem is that I cannot filter duplicates as I did before, when I used SmilesCell with the GroupBy node. I simply select the MrvCell column in the group options as before, but whatever I try, all the duplicate structures stay in the final MRV file. Why does this problem occur, and do you know a way to solve it?


Best regards,


Daniel

User 5458277630

16-03-2012 01:45:00

Hi Daniel,

MrvCell cannot be used as a group column in the current version.
I found some problems with that and will investigate them.

In the meantime, you can use the MolConverter node to change the type.
In this case, you can convert the structures to Unique SMILES.

Alternatively, could you please check the attached workflow?
It uses the MolSearch node to filter out duplicate molecules.
However, it might take a long time.
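
To illustrate the difference between the two approaches, here is a minimal Python sketch. It is not KNIME or JChem code, and all names are hypothetical. It assumes each record already carries a canonical string such as its Unique SMILES; grouping on that string needs one pass, while a search-style filter compares each structure with all structures kept so far. That pairwise cost is my assumption about why the second workflow can be slow.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def dedupe_by_key(records: Iterable[T], key: Callable[[T], Hashable]) -> List[T]:
    """Keep the first record per key (e.g. per Unique SMILES string)."""
    seen = set()
    unique: List[T] = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique


def dedupe_pairwise(records: Iterable[T], same: Callable[[T, T], bool]) -> List[T]:
    """Keep a record only if it matches none kept so far (quadratic work)."""
    unique: List[T] = []
    for record in records:
        if not any(same(record, kept) for kept in unique):
            unique.append(record)
    return unique
```

For example, `dedupe_by_key(rows, key=lambda r: r["unique_smiles"])` keeps one row per structure, provided the hypothetical `unique_smiles` field was filled by a conversion step beforehand.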

Best,
Taka

User 380f434589

19-03-2012 09:59:22

Hi Tohshima,


Both workflows are working just fine. Thanks a lot for your help. But you are right, removing duplicates from an MRV file takes a very long time.


Best regards,


Daniel

User 421219f4d4

21-05-2012 05:54:09

Hi Tohshima,


Thanks for the workflow for document to structure-batch conversion.


The output file gives us the position of the text/structure in the file, but I was unable to tell which file it came from.


Is there a way to add a column for the file_name, or for the complete file location including the file name?


Thanks


Surojit

User 5458277630

21-05-2012 12:24:22

Hi Surojit,


You can use the Variable To TableColumn node to add the file location as a column.
I have attached an example workflow.
Could you please check it?
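
The effect on the output table can be pictured with this small Python sketch. It is an analogy only, not KNIME code: `add_location_column` is a hypothetical helper, and the column name "Location" is assumed to match the flow variable name.

```python
from typing import Dict, List


def add_location_column(
    rows: List[Dict[str, str]],
    location: str,
    column_name: str = "Location",
) -> List[Dict[str, str]]:
    """Return copies of the rows with the current file location appended."""
    # Each row from one loop iteration gets the same location value,
    # so every hit can later be traced back to its source file.
    return [{**row, column_name: location} for row in rows]
```

Applied inside the loop, once per file, this gives every extracted row the path of the document it came from.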


Best,
Taka

User 421219f4d4

21-05-2012 13:40:25

Thanks for the prompt response, Taka.


It works fine for me.