I'd like to create a huge database with lots of information. For that I already ceated a table with many substances. Now I want to add more and more substances from other databeses(files). So I want to upload other files into that one table.
( My table has columns for the smiles/structure code, for the cas number, and for a lot of other properties that can vary with the different files.)
It's important to just upload substances that aren't so far existing in the database. For already existing substances I just want to upload properties wich are not added to a substance yet.
I tried mapping and merging and also the overlap anlysis to identify the already existing substances in a new file but couldn't reach what I want to achieve.
Do you know how I can manage to get my stuff done? Is there a way to do something like the overlap check with the cas numbers? For me it would be much better to compare the cas- numbers with each other rather than to check the structures to identify already existing substances.
Thanks a lot,
this isn't supported as a single step process, but there would be some ways of doing so. I can think of two approaches that should be possible.
1. write a script with the specific behaviour you wanted. Scripts have full access to the IJC functionality, so can be tailored to do exactly what you want. Obviously to do this you need to write the script and would need some programming skills to do this and need to have some understanding of the IJC and Marvin/Jchem APIs. The CDXImporter script that is in the demo data and can be found in the user docs is a good starting point. The best starting point in the user guide is this:
2. Import each new file into its own table and perform an Overlap analysis (http://www.chemaxon.com/instantjchem/ijc_latest/docs/user/help/htmlfiles/chemistry_functions/performing_overlap_analysis.html) on it to identify the duplicates. Then export the duplicate and non-duplicate structures separately and the import the non-duplicates as new structures and merge the duplicates as existing structures.
A final note, I'm not sure using CAS numbers is the best approach. Strucutres can have multiple CAS numbers so this will not guarantee uniqueness. Doing a duplicate structure search would be more reliable, and would allow use of a standardizer to accommodate differences in representation (nitro groups, aromatisation etc.), and duplicate search is very fast.
thanks for the answer.
I've decided to try the second possibility. However there is one step I don't know how to do it.
I'm supposed to merge the duplicates with my existing structures. As I'm not able to merge structures, I have to merge the CAS-numbers, haven't I?( In the user guide they merge the id-field but I think that just makes sense for two lists with the same order of substances) But doing the duplicate check with structures and then merging the CAS-numbers wouldn't be a brilliant performance; as your are right that there might be several CAS-numbers for one structure.
Hi, yes that is the key bit.You can't merge by structure, and CAS is not reliable.
But Overlap analysis generates the IDs of the matching rows for you (the Overlap hits column if you use the default names) and you can use these to merge the data.
I tried this approach, and it works:
1. run the overlap analysis using duplicate search
2. run a search on the results to find rows with more than zero hits (we assume here that your target tabel does not have duplicates so that there are 0 or 1 hits only)
3. export those hits, including the structure, the Overlap hits column and any other columns you want. Choose SDF or MRV format for the file.
4. import that exported file into your target table, and merge by the overlap hits field form the file and the CdID field from the target table. Map the additional fields (don't import the structure).