Jchem API Importer became slower and slower.

User 6a59cb749d

12-10-2013 09:19:32

I want to establish a structure database and import pubchem sdf into it.  There are more than 2500 sdf files . It is not efficient  enough to  use JChem Manager to import one by one. So I use JChem API to build a java program.


 


At the beggining , the program worked quite well. It imported one sdf within less than 1 minute. But after several hours, the performance became worse and worse. Eventually, it costs 10 minutes to import one sdf.


 


I think my program should be optimized to improve the performance.  Maybe the java program produced too much redundant objects and  the java virsual machine has to suspend the program and do some clean up.


 


below is my java code,  I hope some one can help me optimize my code. Thanks.


 


 


 


 


 


import java.io.IOException;


 


import java.sql.SQLException;


 


import java.util.Iterator;


 


import java.util.List;


 


 


 


import org.apache.log4j.Logger;


 


 


 


import chemaxon.jchem.db.Importer;


 


import chemaxon.jchem.db.TransferException;


 


import chemaxon.util.ConnectionHandler;


 


 


 


/**


 


 * Provide methods that will be used to import sdf files into database.


 


 * @author Administrator


 


 * 


 


 */


 


public class InImporter {


 



 


private Log log=Log.getLog();


 



 


public void importSDF() {


 


Logger logger=log.logger;


 


ConnectionHandler connHandler = new ConnectionHandler(true);


 


//Setting database connection


 


connHandler.setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver");


 


connHandler.setUrl("jdbc:sqlserver://localhost:1433;databaseName=pubchem_db");


 


connHandler.setLoginName("sa");


 


connHandler.setPassword("111111");


 



 


try {


 


connHandler.connectToDatabase();


 


} catch (SQLException e2) {


 


e2.printStackTrace();


 


logger.fatal("", e2);


 



 


} catch (ClassNotFoundException e2) {


 


e2.printStackTrace();


 


logger.fatal("", e2);


 



 


}


 


 


 


Importer importer = new Importer();


 


importer.setConnectionHandler(connHandler);


 


//dizhi.txt contains all the file paths of pubchem sdfs. every line is a file path, such as E:/pubchemSDF/Compound_040050001_040075000.sdf


 


String filePath="E:/pubchemSDF/dizhi.txt";


 


ReadTextLine readTextLine=new ReadTextLine(filePath);


 


List <String> sdfNameList = null;


 


try {


 


sdfNameList=readTextLine.read();


 


} catch (IOException e2) {


 


e2.printStackTrace();


 


logger.error("", e2);


 


}


 


 


 


String fileName;


 



 


importer.setTableName("pubchem_structure");


 


importer.setLinesToCheck(100);


 


importer.setHaltOnError(true);


 


importer.setFieldConnections("PUBCHEM_COMPOUND_CID=PUBCHEM_COMPOUND_CID");


 



 


Iterator<String> iterator=sdfNameList.iterator();


 



 


//StringBuffer currentInfo=new StringBuffer("Importing structures from ");


 


//StringBuffer finishInfo=new StringBuffer("Imported ");


 



 


int imported = 0;


 


while(iterator.hasNext()){


 


fileName = (String)iterator.next();//String object 


 


importer.setInput(fileName);


 



 


//System.out.println("Collecting file information ...");


 


try {


 


importer.init();


 


} catch (TransferException e1) {


 


e1.printStackTrace();


 


logger.error("", e1);


 


}


 


 


 


logger.info("Importing structures from " + fileName + " ...");


 


//logger.info(currentInfo.append(fileName));


 


imported=0;


 


try {


 


imported = importer.importMols();


 


} catch (TransferException e) {


 


e.printStackTrace();


 


logger.fatal("", e);


 


}


 


//logger.info(finishInfo.append(imported).append(" structures successfully"));


 


logger.info("Imported "+imported+" structures successfully");


 


//currentInfo;


 


//finishInfo;


 



 


}


 


 


 


}


 


 


 


/**


 


* @param args


 


*/


 


public static void main(String[] args) {


 


InImporter integleImporter=new InImporter();


 


integleImporter.importSDF();


 


 


 


}


 


 


 


}

User 6a59cb749d

12-10-2013 09:26:01

2013-10-11 18:57:47,909 - INFO   Importing structures from E:\pubchemSDF\Compound_000000001_000025000.sdf ...


2013-10-11 18:58:39,982 - INFO   Imported 23143 structures successfully


2013-10-11 18:58:40,013 - INFO   Importing structures from E:\pubchemSDF\Compound_000025001_000050000.sdf ...


2013-10-11 18:59:06,829 - INFO   Imported 22054 structures successfully


2013-10-11 18:59:06,860 - INFO   Importing structures from E:\pubchemSDF\Compound_000050001_000075000.sdf ...


2013-10-11 18:59:39,121 - INFO   Imported 22329 structures successfully


2013-10-11 18:59:39,152 - INFO   Importing structures from E:\pubchemSDF\Compound_000075001_000100000.sdf ...


2013-10-11 19:00:01,429 - INFO   Imported 22832 structures successfully


2013-10-11 19:00:01,461 - INFO   Importing structures from E:\pubchemSDF\Compound_000100001_000125000.sdf ...


2013-10-11 19:00:27,856 - INFO   Imported 21330 structures successfully


2013-10-11 19:00:27,903 - INFO   Importing structures from E:\pubchemSDF\Compound_000125001_000150000.sdf ...


 


 


__________________You see,it became much slower___________________________________________


 


 


2013-10-12 15:51:31,303 - INFO   Importing structures from E:\pubchemSDF\Compound_011600001_011625000.sdf ...


2013-10-12 16:01:21,015 - INFO   Imported 24738 structures successfully


2013-10-12 16:01:21,062 - INFO   Importing structures from E:\pubchemSDF\Compound_011625001_011650000.sdf ...


2013-10-12 16:10:02,696 - INFO   Imported 24749 structures successfully


2013-10-12 16:10:02,743 - INFO   Importing structures from E:\pubchemSDF\Compound_011650001_011675000.sdf ...


2013-10-12 16:20:00,629 - INFO   Imported 24747 structures successfully


2013-10-12 16:20:00,707 - INFO   Importing structures from E:\pubchemSDF\Compound_011675001_011700000.sdf ...


2013-10-12 16:29:29,126 - INFO   Imported 24718 structures successfully


2013-10-12 16:29:29,219 - INFO   Importing structures from E:\pubchemSDF\Compound_011700001_011725000.sdf ...


2013-10-12 16:38:06,641 - INFO   Imported 24764 structures successfully


2013-10-12 16:38:06,719 - INFO   Importing structures from E:\pubchemSDF\Compound_011725001_011750000.sdf ...


2013-10-12 16:46:34,016 - INFO   Imported 24840 structures successfully


2013-10-12 16:46:34,063 - INFO   Importing structures from E:\pubchemSDF\Compound_011750001_011775000.sdf ...


2013-10-12 16:54:13,593 - INFO   Imported 24829 structures successfully


2013-10-12 16:54:13,640 - INFO   Importing structures from E:\pubchemSDF\Compound_011775001_011800000.sdf ...


2013-10-12 17:02:51,639 - INFO   Imported 24856 structures successfully


2013-10-12 17:02:51,670 - INFO   Importing structures from E:\pubchemSDF\Compound_011800001_011825000.sdf ...


2013-10-12 17:11:12,384 - INFO   Imported 24843 structures successfully


2013-10-12 17:11:12,446 - INFO   Importing structures from E:\pubchemSDF\Compound_011825001_011850000.sdf ...


2013-10-12 17:18:29,871 - INFO   Imported 24771 structures successfully


2013-10-12 17:18:29,902 - INFO   Importing structures from E:\pubchemSDF\Compound_011850001_011875000.sdf ...

User 6a59cb749d

13-10-2013 08:10:48

Now, I realize the real problem. Maybe there is nothing wrong with my program and chem api. I put too many record, more than 70,00 0,000 records, into one stucture table in Sql server. The design of database should be optimized, I think.

ChemAxon 4a2fc68cd1

14-10-2013 07:17:23

Hi,


Yes, 70 million compounds require a quite large table and different RDBMSs may handle it with rather different performance. Could you give a try to another RDBMS, e.g. mysql? Or could you separte your data to multiple tables?


Another important aspect is whether you use duplicate filtering on the table or not. If you use it, then it definitely results in an overhead, which gets larger and larger as the number of structures increases.


Anyway, we also experienced performance problems with Importer in case of such large tables, but it is not clear at the moment what conditions do these problems depend on.


Best regards,
Peter

User 6a59cb749d

13-11-2013 08:34:53










pkovacs84 wrote:

Hi,


Yes, 70 million compounds require a quite large table and different RDBMSs may handle it with rather different performance. Could you give a try to another RDBMS, e.g. mysql? Or could you separte your data to multiple tables?


Another important aspect is whether you use duplicate filtering on the table or not. If you use it, then it definitely results in an overhead, which gets larger and larger as the number of structures increases.


Anyway, we also experienced performance problems with Importer in case of such large tables, but it is not clear at the moment what conditions do these problems depend on.


Best regards,
Peter



Thank you for your reply. I have imported all of the puchem sdf files into Sql server 2005 and it cost me more than ten days. I will try mysql and take your advice in the next time. Thank you very much!


Best regards,


Panfeng Liang