cd_ids from importer

User acc0517f25

15-08-2008 17:26:10

I am attempting to load an SDF into my JChem database using chemaxon.jchem.db.Importer. I don't want duplicates stored in the database. I want back the cd_ids for all imported chemicals as well as for all duplicates, so I am using the following calls.

// ask the importer to keep the cd_ids of duplicates and of imported rows
importer.setStoreDuplicates(true);
importer.setStoreImportedIDs(true);
// do not store structures that are already in the table
importer.setDuplicateImportAllowed(false);
importer.init();
importer.importMols();
ArrayList<Integer> duplicateCdIds = importer.getDuplicateIDs();
IntArray CdIds = importer.getImportedIDs();

Now I have a list of the duplicate cd_ids and a list of the imported cd_ids, but I don't know which structures from the original load were the duplicates unless I parse the infostream. Is there a simpler way to do this? I could cut my multi-structure SDFs into single-structure SDFs to eliminate the problem. Is this the recommended solution? Also, is there a reason why getDuplicateIDs() and getImportedIDs() return different types? It seems they should be of the same type, since they contain the same sort of information.

ChemAxon e274e1bada

18-08-2008 13:14:52

Hi Christopher,

Importer has an option, setDuplicateImportAllowed(), with which the import procedure can skip duplicates.

See the API doc: http://www.chemaxon.com/jchem/doc/api/chemaxon/jchem/db/Importer.html#setDuplicateImportAllowed(boolean)

Regards, Edvard

User acc0517f25

18-08-2008 14:24:30

I don't think I adequately described my question. I am calling setDuplicateImportAllowed(false) so that duplicates are not entered in the database. However, when someone tries to enter a duplicate compound, I need back the cd_id of the matching compound. I also need back the cd_id for any compound that is not a duplicate and gets loaded into the database.

Take for example an SDF with 15 compounds. If compounds 1 and 5 are duplicates, using the code above I will get an ArrayList<Integer> of length 2 containing the cd_ids for the duplicates and an IntArray of length 13 containing the cd_ids for the newly entered compounds. However, I am unable to determine from the code which 13 compounds were stored into the database and which 2 were the duplicates unless I parse the infostream. It is vital to know this, since I am linking structures to activities stored in a different table.

I have already found a way around this by using MolImporter to get an array of Molecules and then looping through them and storing each one individually using UpdateHandler. At first it appeared that Importer would be simple and straightforward for my task, but it is actually quite difficult to get back the cd_ids for all 15 compounds in the same order as they appear in the SDF if you don't want to store duplicate structures.

ChemAxon 42004978e8

18-08-2008 15:16:15

Hello,

If you have the IDs, you can use them to retrieve the structures from the database.
http://www.chemaxon.com/jchem/doc/guide/search/index.html#sss_retrieve
You probably need the cd_structure or the cd_smiles field.

The duplicate IDs are the IDs of the structures in the database that were hit by the new import.
So if you have a file with 15 structures to import where the 12th and the 13th are the same, then
the duplicate ID is: 12
the imported IDs are: 1, 2, 3, 4, ..., 14.

Hope this helps,
Robert

User acc0517f25

18-08-2008 15:46:34

Nope, I still haven't made my problem clear. I have 10 structures in the database with cd_ids 1-10. I am loading 15 new structures from an SD file, with positions in the file ranging from 1 to 15. Structure 1 in the SDF matches structure 6 in the database. Structure 8 in the SDF matches structure 4 in the database. When I load, I will get an ArrayList<Integer> of size 2 containing the cd_ids { 6, 4 } for the duplicates, and an IntArray of size 13 containing the cd_ids { 11, 12, 13, ..., 22, 23 } for the loaded compounds.

How do I know that it was structure 1 and structure 8 in the SDF that were the duplicates of structure 6 and structure 4 in the database? It would be nice if a single array of cd_ids {6, 11, 12, 13, 14, 15, 16, 4, 17, 18, 19, 20, 21, 22, 23} were available in some way. It turns out I have to use MolImporter and UpdateHandler anyway to access the chemical names stored in the file, so it's a moot point.
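
To make the requested result concrete, here is a minimal sketch of the per-record loop that would produce that single ordered array. It does not use JChem at all: the structure table is replaced by a plain in-memory Map, duplicates are detected by simple string equality (real duplicate detection is chemical, not textual), and the names OrderedImportSketch, importInOrder, exampleTable and exampleRecords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a structure table; hypothetical, not part of JChem. */
class OrderedImportSketch {

    /**
     * Processes the records in file order and returns one cd_id per record:
     * the existing cd_id if the structure is already in the table,
     * otherwise the cd_id newly assigned to it.
     */
    static int[] importInOrder(Map<String, Integer> table, List<String> records) {
        int nextId = table.values().stream().mapToInt(Integer::intValue).max().orElse(0) + 1;
        int[] ids = new int[records.size()];
        for (int i = 0; i < records.size(); i++) {
            String structure = records.get(i);
            Integer existing = table.get(structure);
            if (existing != null) {
                ids[i] = existing;            // duplicate: report the row it matched
            } else {
                table.put(structure, nextId); // new structure: store it
                ids[i] = nextId++;
            }
        }
        return ids;
    }

    /** Ten stored structures "db1".."db10" with cd_ids 1-10. */
    static Map<String, Integer> exampleTable() {
        Map<String, Integer> table = new HashMap<>();
        for (int id = 1; id <= 10; id++) {
            table.put("db" + id, id);
        }
        return table;
    }

    /** Fifteen records; record 1 matches cd_id 6 and record 8 matches cd_id 4. */
    static List<String> exampleRecords() {
        List<String> records = new ArrayList<>();
        for (int pos = 1; pos <= 15; pos++) {
            if (pos == 1) {
                records.add("db6");
            } else if (pos == 8) {
                records.add("db4");
            } else {
                records.add("new" + pos);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        int[] ids = importInOrder(exampleTable(), exampleRecords());
        // prints [6, 11, 12, 13, 14, 15, 16, 4, 17, 18, 19, 20, 21, 22, 23]
        System.out.println(Arrays.toString(ids));
    }
}
```

The point of the sketch is only that the result stays aligned with the file: element i of the array always belongs to record i + 1 of the SDF, whether it was stored or matched an existing row.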

ChemAxon e274e1bada

18-08-2008 17:20:15

What kind of output do you want? If a string representation of the molecule is sufficient, you can use the setOutputOptions method.

Regards, Edvard

ChemAxon 9c0afc9aaf

19-08-2008 10:39:15

Hi,

Indeed, the best solution currently is probably to use MolImporter and UpdateHandler, both for accessing the duplicates and for importing the names.

The Importer class has not been upgraded for quite some time due to a lack of user requests.
We agree that it should be more straightforward to access the duplicate structures.
This is not trivial, however: the structures may not fit into memory (there can potentially be millions of them), so they have to be stored somewhere.
Storing the file ID, the cd_id and the structure source in a temporary table could probably be a proper solution.
You would still have to fetch the structures from that temporary table, though.
Would you like a solution like that in the future?

By the way, we are also planning to allow non-SDF field (regular) structure names to be inserted into one of the user-defined data fields of the structure table.

Best regards,
Szilard

User acc0517f25

19-08-2008 13:46:19

All I really want is a way to get the cd_ids for all the chemicals into a single array, whether they are duplicates or newly imported, similar to the way UpdateHandler.execute(true) returns -cd_id for duplicate structures and cd_id for imported structures.
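
As a rough illustration of how such a single array could be consumed, the sketch below decodes a list of signed values under the convention described above (a negative value means "duplicate of the row with that cd_id", a positive value means "newly stored with that cd_id"). The class and method names are invented for this example, and the input array is hard-coded rather than obtained from JChem.

```java
import java.util.ArrayList;
import java.util.List;

/** Helpers for a signed cd_id array; hypothetical names, not a JChem API. */
class SignedIdSketch {

    /** True if the signed value marks a duplicate of an existing row. */
    static boolean isDuplicate(int signedId) {
        return signedId < 0;
    }

    /** The cd_id the record is linked to, whether it was stored or matched. */
    static int cdId(int signedId) {
        return Math.abs(signedId);
    }

    /** 1-based positions in the input file of the records that were duplicates. */
    static List<Integer> duplicatePositions(int[] signedIds) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < signedIds.length; i++) {
            if (isDuplicate(signedIds[i])) {
                positions.add(i + 1);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Records 1 and 8 matched existing rows 6 and 4; the rest were stored.
        int[] signedIds = {-6, 11, 12, 13, 14, 15, 16, -4, 17, 18, 19, 20, 21, 22, 23};
        System.out.println(duplicatePositions(signedIds)); // prints [1, 8]
        for (int i = 0; i < signedIds.length; i++) {
            // Every record gets a usable cd_id for linking to the activity table.
            System.out.println("record " + (i + 1) + " -> cd_id " + cdId(signedIds[i]));
        }
    }
}
```

With this encoding one array carries both pieces of information: the absolute value links each record to its row, and the sign says whether that row already existed.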

ChemAxon 9c0afc9aaf

22-08-2008 17:40:42

Hi,

Thank you for the clarification, and sorry for the misunderstanding.

This is certainly easier to achieve.

We will consider implementing this for version 5.2.

Best regards,
Szilard