Jchem API Importer became slower and slower.

User 6a59cb749d

12-10-2013 09:19:32

I want to establish a structure database and import pubchem sdf into it. There are more than 2500 sdf files . It is not efficient enough to use JChem Manager to import one by one. So I use JChem API to build a java program.

At the beggining , the program worked quite well. It imported one sdf within less than 1 minute. But after several hours, the performance became worse and worse. Eventually, it costs 10 minutes to import one sdf.

I think my program should be optimized to improve the performance. Maybe the java program produced too much redundant objects and the java virsual machine has to suspend the program and do some clean up.

below is my java code, I hope some one can help me optimize my code. Thanks.

import java.io.IOException;

import java.sql.SQLException;

import java.util.Iterator;

import java.util.List;

import org.apache.log4j.Logger;

import chemaxon.jchem.db.Importer;

import chemaxon.jchem.db.TransferException;

import chemaxon.util.ConnectionHandler;

/**

* Provide methods that will be used to import sdf files into database.

* @author Administrator

*

*/

public class InImporter {

private Log log=Log.getLog();

public void importSDF() {

Logger logger=log.logger;

ConnectionHandler connHandler = new ConnectionHandler(true);

//Setting database connection

connHandler.setDriver("com.microsoft.sqlserver.jdbc.SQLServerDriver");

connHandler.setUrl("jdbc:sqlserver://localhost:1433;databaseName=pubchem_db");

connHandler.setLoginName("sa");

connHandler.setPassword("111111");

try {

connHandler.connectToDatabase();

} catch (SQLException e2) {

e2.printStackTrace();

logger.fatal("", e2);

} catch (ClassNotFoundException e2) {

e2.printStackTrace();

logger.fatal("", e2);

}

Importer importer = new Importer();

importer.setConnectionHandler(connHandler);

//dizhi.txt contains all the file paths of pubchem sdfs. every line is a file path, such as E:/pubchemSDF/Compound_040050001_040075000.sdf

String filePath="E:/pubchemSDF/dizhi.txt";

ReadTextLine readTextLine=new ReadTextLine(filePath);

List <String> sdfNameList = null;

try {

sdfNameList=readTextLine.read();

} catch (IOException e2) {

e2.printStackTrace();

logger.error("", e2);

}

String fileName;

importer.setTableName("pubchem_structure");

importer.setLinesToCheck(100);

importer.setHaltOnError(true);

importer.setFieldConnections("PUBCHEM_COMPOUND_CID=PUBCHEM_COMPOUND_CID");

Iterator<String> iterator=sdfNameList.iterator();

//StringBuffer currentInfo=new StringBuffer("Importing structures from ");

//StringBuffer finishInfo=new StringBuffer("Imported ");

int imported = 0;

while(iterator.hasNext()){

fileName = (String)iterator.next();//String object

importer.setInput(fileName);

//System.out.println("Collecting file information ...");

try {

importer.init();

} catch (TransferException e1) {

e1.printStackTrace();

logger.error("", e1);

}

logger.info("Importing structures from " + fileName + " ...");

//logger.info(currentInfo.append(fileName));

imported=0;

try {

imported = importer.importMols();

} catch (TransferException e) {

e.printStackTrace();

logger.fatal("", e);

}

//logger.info(finishInfo.append(imported).append(" structures successfully"));

logger.info("Imported "+imported+" structures successfully");

//currentInfo;

//finishInfo;

}

/**

* @param args

*/

public static void main(String[] args) {

InImporter integleImporter=new InImporter();

integleImporter.importSDF();

}

User 6a59cb749d

12-10-2013 09:26:01

2013-10-11 18:57:47,909 - INFO Importing structures from E:\pubchemSDF\Compound_000000001_000025000.sdf ...

2013-10-11 18:58:39,982 - INFO Imported 23143 structures successfully

2013-10-11 18:58:40,013 - INFO Importing structures from E:\pubchemSDF\Compound_000025001_000050000.sdf ...

2013-10-11 18:59:06,829 - INFO Imported 22054 structures successfully

2013-10-11 18:59:06,860 - INFO Importing structures from E:\pubchemSDF\Compound_000050001_000075000.sdf ...

2013-10-11 18:59:39,121 - INFO Imported 22329 structures successfully

2013-10-11 18:59:39,152 - INFO Importing structures from E:\pubchemSDF\Compound_000075001_000100000.sdf ...

2013-10-11 19:00:01,429 - INFO Imported 22832 structures successfully

2013-10-11 19:00:01,461 - INFO Importing structures from E:\pubchemSDF\Compound_000100001_000125000.sdf ...

2013-10-11 19:00:27,856 - INFO Imported 21330 structures successfully

2013-10-11 19:00:27,903 - INFO Importing structures from E:\pubchemSDF\Compound_000125001_000150000.sdf ...

__________________You see,it became much slower___________________________________________

2013-10-12 15:51:31,303 - INFO Importing structures from E:\pubchemSDF\Compound_011600001_011625000.sdf ...

2013-10-12 16:01:21,015 - INFO Imported 24738 structures successfully

2013-10-12 16:01:21,062 - INFO Importing structures from E:\pubchemSDF\Compound_011625001_011650000.sdf ...

2013-10-12 16:10:02,696 - INFO Imported 24749 structures successfully

2013-10-12 16:10:02,743 - INFO Importing structures from E:\pubchemSDF\Compound_011650001_011675000.sdf ...

2013-10-12 16:20:00,629 - INFO Imported 24747 structures successfully

2013-10-12 16:20:00,707 - INFO Importing structures from E:\pubchemSDF\Compound_011675001_011700000.sdf ...

2013-10-12 16:29:29,126 - INFO Imported 24718 structures successfully

2013-10-12 16:29:29,219 - INFO Importing structures from E:\pubchemSDF\Compound_011700001_011725000.sdf ...

2013-10-12 16:38:06,641 - INFO Imported 24764 structures successfully

2013-10-12 16:38:06,719 - INFO Importing structures from E:\pubchemSDF\Compound_011725001_011750000.sdf ...

2013-10-12 16:46:34,016 - INFO Imported 24840 structures successfully

2013-10-12 16:46:34,063 - INFO Importing structures from E:\pubchemSDF\Compound_011750001_011775000.sdf ...

2013-10-12 16:54:13,593 - INFO Imported 24829 structures successfully

2013-10-12 16:54:13,640 - INFO Importing structures from E:\pubchemSDF\Compound_011775001_011800000.sdf ...

2013-10-12 17:02:51,639 - INFO Imported 24856 structures successfully

2013-10-12 17:02:51,670 - INFO Importing structures from E:\pubchemSDF\Compound_011800001_011825000.sdf ...

2013-10-12 17:11:12,384 - INFO Imported 24843 structures successfully

2013-10-12 17:11:12,446 - INFO Importing structures from E:\pubchemSDF\Compound_011825001_011850000.sdf ...

2013-10-12 17:18:29,871 - INFO Imported 24771 structures successfully

2013-10-12 17:18:29,902 - INFO Importing structures from E:\pubchemSDF\Compound_011850001_011875000.sdf ...

User 6a59cb749d

13-10-2013 08:10:48

Now， I realize the real problem. Maybe there is nothing wrong with my program and chem api. I put too many record, more than 70,00 0,000 records, into one stucture table in Sql server. The design of database should be optimized, I think.

ChemAxon 4a2fc68cd1

14-10-2013 07:17:23

Hi,

Yes, 70 million compounds require a quite large table and different RDBMSs may handle it with rather different performance. Could you give a try to another RDBMS, e.g. mysql? Or could you separte your data to multiple tables?

Another important aspect is whether you use duplicate filtering on the table or not. If you use it, then it definitely results in an overhead, which gets larger and larger as the number of structures increases.

Anyway, we also experienced performance problems with Importer in case of such large tables, but it is not clear at the moment what conditions do these problems depend on.

Best regards,
Peter

User 6a59cb749d

13-11-2013 08:34:53

pkovacs84 wrote:

Hi,

Yes, 70 million compounds require a quite large table and different RDBMSs may handle it with rather different performance. Could you give a try to another RDBMS, e.g. mysql? Or could you separte your data to multiple tables?

Another important aspect is whether you use duplicate filtering on the table or not. If you use it, then it definitely results in an overhead, which gets larger and larger as the number of structures increases.

Anyway, we also experienced performance problems with Importer in case of such large tables, but it is not clear at the moment what conditions do these problems depend on.

Best regards,
Peter

Thank you for your reply. I have imported all of the puchem sdf files into Sql server 2005 and it cost me more than ten days. I will try mysql and take your advice in the next time. Thank you very much!

Best regards,

Panfeng Liang