Fastest and most reliable method for data migration?

User c91b2283c1

19-02-2010 16:18:32

Hello,


I'm trying to figure out the best method to transfer data from one JChem database to another. I am currently using mysql and mysql command-line tools and JChem 5.3.0.2


1. MySQL


For my first attempt, I used mysqldump to convert my JChem database to an SQL dump. I then used this dump to create a copy of my JChem database on a remote server. Everything seemed to go fine. However, to be sure, I ran "jcman r" to make sure the data was consistent. A few records caused jcman to throw Java Exceptions (see below) indicating that data was corrupted. I did try to make sure that all the character sets used during exporting and importing of mysql data and creation of mysql schema were consistent. Maybe I missed something and this is the source of the error?


2. jcman export.


Then, I tried to use "jcman x" to export SDFs. The SDFs looked fine. However, when I attempted to import them using ""jcman a", the import captured only molecule data and none of the custom columns that had been added to the data. Is there some way to use jcman to import both molecule data and custom columns? Instant JChem can "auto-discover" these custom columns and modify the schema during import.


3. Instant JChem import.


I'm currently using JChem to import the data. However, importing SDF files to a remote database using a via Instant JChem is very very slow and would be unsuitable for, say, an emergency data backup and restore operation.


Can I use any of these methods for safe and consistent data migration? If so, how would I improve them to get them to work reliably? If not, what method do you recommend I use?


Thanks,


-John David


## Java Exceptions for Attempt 2 # #


Regenerating TABLE_NAME ... chemaxon.formats.MolFormatException: Line too short in bond block.
        at chemaxon.marvin.io.formats.mdl.MolImport.readBondBlockV2(Unknown Source)
        at chemaxon.marvin.io.formats.mdl.MolImport.readCtab(Unknown Source)
        at chemaxon.marvin.io.formats.mdl.MolImport.readMol0(Unknown Source)
        at chemaxon.marvin.io.formats.mdl.MolImport.readMol(Unknown Source)
        at chemaxon.marvin.io.MRecordImporter.readStructure(Unknown Source)
        at chemaxon.marvin.io.MRecordImporter.readMol(Unknown Source)
        at chemaxon.marvin.io.MRecordImporter.readMol(Unknown Source)
        at chemaxon.marvin.io.MRecordImporter.readMol0(Unknown Source)
        at chemaxon.marvin.io.MRecordImporter.readMol(Unknown Source)
        at chemaxon.formats.MolImporter.readMol(Unknown Source)
        at chemaxon.formats.MolImporter.read(Unknown Source)
        at chemaxon.formats.MolImporter.read(Unknown Source)
        at chemaxon.util.MolHandler.importMol(MolHandler.java:657)
        at chemaxon.util.MolHandler.setMolecule(MolHandler.java:178)
        at chemaxon.util.MolHandler.<init>(MolHandler.java:127)
        at chemaxon.jchem.db.UpdateHandler.readMolecule(UpdateHandler.java:1280)
        at chemaxon.jchem.db.UpdateHandler.init(UpdateHandler.java:1069)
        at chemaxon.jchem.db.UpdateHandler.execute(UpdateHandler.java:2155)
        at chemaxon.jchem.db.UpdateHandler.execute(UpdateHandler.java:2124)
        at chemaxon.jchem.db.Regenerator$RegenThread.run(Regenerator.java:103)
Error at molecule with cd_id: 16526

User c1ce6b3d19

19-02-2010 16:39:58

John,


Pertaining to point #2, you can use the --connect parameter to assign custom table fields to SDFile tags.  This should import the custom columns in addition to the molecule data.


For example,


jcman a Table1 t1.sdf --connect "customDBField=customSDField"



User c91b2283c1

19-02-2010 20:13:15

Hi jlee,


I considered this as a possible solution, but I've noticed that not every record in my data has identical columns. Sometimes some columns are unique to a single record out of ten of thousands of records. It is therefore difficult to know at import time what columns have to be created, and "jcman a" does not emit any warnings data from a column is not imported.


I guess I could write a script to parse the SDF file and then generate the appropriate --connect parameters, but I was hoping there was a one-step solution that I could use to quickly, efficiently, and safely backup and restore data.


If you can think of such a method, I'd be very happy :]


-John David

User c91b2283c1

19-02-2010 22:29:54

Hi jlee,


I wrote a script to parse an SDF file, extract non-molecule column names, and build a call to "jcman a" along with the necessary --connect parameter.


It seems that --connect does not seem to be able to create new columns.


Here is an example of the type of "jcman a" command my script can generate:


jcman a table table.sdf --connect "field_1=field 1;field_2=field2"


This command will still import the molecule data without creating new columns named "field_1" and "field 2".


What do you suggest I do? Should I proceed to issuing ALTER TABLE X ADD COLUMN Y SQL commands or use the --coldefs parameter to "jcman c"? Is there some other one-step solution to importing SDFs with custom columns?


Thanks,


-John David

ChemAxon ebbce65bcf

22-02-2010 14:13:36

Hi,


Yes, you should use the --coldefs parameter of "jcman c". There is no one step solution, first you have to create the table and then connect SDF fields to DB fields. Anyway, I still think the easiest solution would be to dump all your jchem tables (with _UL tables, _PR tables and molecular descriptor tables, if they exist) and the JChem property table. Import the dump to your new database and modify your JChem connection parameters. This should work, I have done it several times. If you get an error during the regeneration, I suppose you should get the same in your original database, as well. Could you check it, please? In case you got a similar error related to the molecule format, I would ask some molecule samples of yours to investigate the problem, if possible. Thank you.


Regards,


Roland

User c91b2283c1

22-02-2010 14:39:13

Hi Roland,


I also thought that a dump would work right away. This would be the best solution since it would not require that I support special backup routines involving SDF files.


Regeneration generates no errors in the original data set, but does generate errors in the exported data set. I'll get you the broken molecule files to see if you can figure out why data seems to be getting corrupted.


Thanks for your help,


-John David

ChemAxon e274e1bada

22-02-2010 16:49:58

hi John,


please check the cd_structure fields in the original and the exported table in the case of the problematic records. Please compare the molecule sources, unfortunately, an extra line ending charater or wrong indentation can corrupt the SD source of the molecule. I experienced similar in the similar way.


Edvard

User c91b2283c1

22-02-2010 17:22:34

Hi Edvard,


I found the problem. As you said, it was indeed corruption introduced during the exporting/importing process.


When exporting data using mysqldump, failing to pass the --hex-blob flag to that command will cause the data to exported in plain text rather than raw hex code which doesn't seem to be reliably importable.


The complete export command would look like this:


mysqldump --databases DATABASE -u USERNAME --password=PASSWORD --hex-blob | gzip > FILENAME.sql.gz


The complete import command:


mysql -u USERNAME --password=PASSWORD < FILENAME.sql.gz


Thank you all for your help.


-John David