User 6baabcf48d
13-09-2016 08:45:10
Hello,
I got the following error while indexing a large table (about 100 million records):
ERROR: ChemIndex.cpp:61 OperationAborted:
*** class com.chemaxon.zetor.api.exceptions.OperationAbortedException
Wall time limit reached
I noticed that at first the index calculation was spread across several CPUs (each at 20% - 30%), but after some time only one CPU was busy (near 100%) while the others were all idle; then the above error appeared.
Does anyone have any clue where this error comes from and which configuration changes could avoid it?
Also, since indexing such a big table is very time-consuming, I would like to know if there are any tips to speed up the indexing.
Thanks,
William
ChemAxon abe887c64e
13-09-2016 13:56:43
Hi William,
Unfortunately, 100 million records is beyond the scope of our current development and testing. We are aware of the increasing time requirement of the indexing process for tables of this size and are working on a good solution for the performance issues.
At the moment, we can recommend the following workaround:
- Drop chemical indexes.
- sudo service jchem-psql stop
- Modify the file jchem-psql.conf in the folder /etc/chemaxon.
Change the setting 'mapdb' to 'rocksdb' in the line:
com.chemaxon.jchem.psql.env.scheme=mapdb
- sudo service jchem-psql init
- sudo service jchem-psql start
- Create chemical indexes.
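To illustrate the configuration edit in the third step, here is a minimal Python sketch that rewrites the scheme line in the text of jchem-psql.conf. It assumes the file consists of plain key=value lines with '#' comments; the function name set_env_scheme is hypothetical and not part of JChem. Dropping the indexes and running the service stop, init and start commands still have to be done as listed above.

```python
# Key name taken from the workaround above; the helper itself is illustrative.
SCHEME_KEY = "com.chemaxon.jchem.psql.env.scheme"


def set_env_scheme(conf_text: str, scheme: str) -> str:
    """Return conf_text with the active (uncommented) env.scheme line set to `scheme`."""
    out = []
    changed = False
    for line in conf_text.splitlines(keepends=True):
        stripped = line.strip()
        is_comment = stripped.startswith("#")
        key = stripped.split("=", 1)[0].strip()
        if not is_comment and "=" in stripped and key == SCHEME_KEY:
            # Preserve the original line ending, replace only the value.
            ending = "\n" if line.endswith("\n") else ""
            line = f"{SCHEME_KEY}={scheme}{ending}"
            changed = True
        out.append(line)
    if not changed:
        raise ValueError(f"no active {SCHEME_KEY} line found")
    return "".join(out)
```

One possible use is to read /etc/chemaxon/jchem-psql.conf, pass its contents through set_env_scheme(text, "rocksdb"), and write the result back while the service is stopped. Commented-out lines are left untouched.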
Please let us know your findings.
Best regards,
Krisztina
User 6baabcf48d
13-09-2016 23:29:15
Hello,
Thanks for your reply, but unfortunately the same error still appeared after I set "com.chemaxon.jchem.psql.env.scheme=rocksdb" and re-indexed the table.
Regards,
William
ChemAxon 9e528cc8c2
14-09-2016 08:45:48
Hello William,
The JVM may run out of memory while indexing this large table.
You can increase the JVM maximum memory size in the file /etc/default/jchem-psql.
The example in the file (-Xmx14g) sets the maximum memory to 14 GB.
It is advisable to use rocksdb as backend in case of such a large table.
After setting the backend, you should initialize the jchem-psql service (sudo service jchem-psql init).
After setting the maximum memory size, you should restart the jchem-psql service.
Please let us know your findings.
Best regards,
Roland
User 6baabcf48d
15-09-2016 00:06:06
Hello,
Yes, I did set Xmx to 20g and restarted the jchem-psql service.
In /etc/chemaxon/jchem-psql.conf, I set "com.chemaxon.jchem.psql.env.scheme=rocksdb". Should I also change "com.chemaxon.jchem.psql.main.scheme" and "com.chemaxon.jchem.psql.idx.scheme"?
Are there any specific settings for rocksdb, as there are for the other backends (mapdb, mvstore, hashed, cassandra)? I didn't find any in the configuration file.
Also, I noticed that the indexing is usually multithreaded, but at some point jchem gets stuck in a single-threaded task for a long time (only 1 processor at nearly 100% and the others all idle). Is that normal?
Thanks,
William
ChemAxon 9e528cc8c2
15-09-2016 11:21:33
Hi William,
Setting 'com.chemaxon.jchem.psql.env.scheme' to 'rocksdb' in /etc/chemaxon/jchem-psql.conf is sufficient as long as the other two options you mentioned are commented out (with a '#' sign).
Unfortunately there are no specific settings for rocksdb in the configuration file at the moment.
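The condition above (env.scheme set to rocksdb, the other two scheme options commented out) can be checked with a small Python sketch. It assumes jchem-psql.conf is a plain key=value file with '#' comments; the function names are hypothetical helpers, not part of JChem.

```python
# Option names are the ones quoted in this thread.
ENV_KEY = "com.chemaxon.jchem.psql.env.scheme"
OVERRIDE_KEYS = (
    "com.chemaxon.jchem.psql.main.scheme",
    "com.chemaxon.jchem.psql.idx.scheme",
)


def active_settings(conf_text: str) -> dict:
    """Collect key=value pairs from lines that are not blank or commented out."""
    settings = {}
    for line in conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def rocksdb_config_problems(conf_text: str) -> list:
    """Return a list of human-readable problems; an empty list means the check passed."""
    settings = active_settings(conf_text)
    problems = []
    if settings.get(ENV_KEY) != "rocksdb":
        problems.append(f"{ENV_KEY} is not set to rocksdb")
    for key in OVERRIDE_KEYS:
        if key in settings:
            problems.append(f"{key} is active; comment it out with '#'")
    return problems
```

Calling rocksdb_config_problems on the contents of /etc/chemaxon/jchem-psql.conf should return an empty list if the file matches the description above.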
Regarding the indexing running in a single thread for a long time at some points: this is known behavior that we have also observed. It does not indicate a bug or failure.
Please let us know your findings.
Best regards,
Roland
User 6baabcf48d
15-09-2016 12:43:07
Hello, Roland
Thanks for your reply. Here's another question: is there any significant performance difference between JChem PostgreSQL Cartridge and JChem Oracle Cartridge?
Thanks,
William
ChemAxon 25dcd765a3
15-09-2016 14:38:19
xwang_01 wrote: |
Thanks for your reply. Here's another question: is there any significant performance difference between JChem PostgreSQL Cartridge and JChem Oracle Cartridge?
|
Yes, definitely. JChem PostgreSQL Cartridge (JPC) is faster for queries returning only a small number of hits (say, under a few thousand). But there is another major difference: JPC has a higher memory footprint than JChem Oracle Cartridge (JOC); however, if the memory needed to cache all the structures is not available, JOC simply cannot work, while JPC still can. So, as you can see, there are multiple factors to consider.
best
User 6baabcf48d
15-09-2016 15:01:32
Thanks, so here's what I need basically:
- Approximately 100 million SMILES (in a single table or in multiple tables with sharding, depending on the performance)
- Exact search, substructure search and Tanimoto similarity search (the number of hits can vary a lot depending on the query SMILES)
Since my previous tests all focused on JPC, I would like to know whether JOC might have a big advantage in my use case.
Thanks a lot,
William
User 6baabcf48d
18-09-2016 03:41:19
Hello,
I reproduced the previous "wall time limit reached" error, this time while copying 10 million SMILES into an indexed table. I tried many times, and the error always appears at the same point (for me, when copying the batch that starts at 5535000). Could it be caused by some invalid SMILES? I didn't get any further error information.
ERROR: ChemIndex.cpp:61 OperationAborted:
*** class com.chemaxon.zetor.api.exceptions.OperationAbortedException
Wall time limit reached
CONTEXT: COPY jchem_10m_mol, line 5535000
Thanks,
William
ChemAxon abe887c64e
19-09-2016 09:35:24
Hi William,
Yes, unfortunately a single erroneous / invalid SMILES can produce this error. Could you identify this molecule and send it to us as SMILES? If the molecule is confidential, you can send it to jpc-support _at_ chemaxon.com.
Best regards,
Krisztina
User 6baabcf48d
20-09-2016 00:25:37
Hello, Krisztina
My whole dataset has 90,878,834 SMILES, and the COPY executes in batches, so I'm not sure exactly which SMILES are invalid.
I did the following operations, as suggested in the manual:
- CREATE TABLE jchem_mol(inchi_key text, smiles text);
- COPY jchem_mol FROM 'xxx.csv' (FORMAT CSV);
- CREATE TABLE invalid_mol AS SELECT * FROM jchem_mol WHERE NOT is_valid_molecule(smiles);
However, 0 SMILES were found to be invalid, so could you please share any other methods to locate the invalid SMILES?
Thanks,
William
ChemAxon abe887c64e
20-09-2016 13:54:55
Hi William,
We think that the erroneous molecule is between lines 5530000 and 5535000, because indexing runs in batches of 5000 molecules by default.
Could you copy these 5000 lines (5000 SMILES) into a new text file and try to import and index them separately? Before starting the create index process, please run
set chemaxon.index_creation_batch_size to 1;
This way, the batch size will be changed to 1 at session level.
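Copying the suspect lines into a new file can be done with any text tool; a minimal Python sketch is shown here. It assumes one record per physical line (no quoted multi-line CSV fields). The function name extract_lines and the output file name are hypothetical; 'xxx.csv' is the placeholder file name used earlier in this thread.

```python
from itertools import islice


def extract_lines(src_path: str, dst_path: str, first_line: int, last_line: int) -> int:
    """Copy lines first_line..last_line (1-based, inclusive) from src to dst.

    Returns the number of lines actually copied.
    """
    if first_line < 1 or last_line < first_line:
        raise ValueError("expected 1 <= first_line <= last_line")
    copied = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice uses 0-based indices with an exclusive stop, hence the offset.
        for line in islice(src, first_line - 1, last_line):
            dst.write(line)
            copied += 1
    return copied


if __name__ == "__main__":
    # Both boundary lines are included here to be safe (5001 lines in total).
    extract_lines("xxx.csv", "suspect_batch.csv", 5530000, 5535000)
```

The resulting file can then be imported into a fresh table and indexed with the batch size set to 1, so that the failure points at a single molecule.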
Another, independent idea is to increase the wall time limit with
set chemaxon.search_wall_time_limit to 1200000;
The default is 600000 ms (= 10 min), so the value above doubles the limit to 20 min.
See the documentation.
All of these settings can also be modified in the file /etc/postgresql/9.5/main/postgresql.conf,
but in that case the postgresql service must be restarted after the modification.
Best regards,
Krisztina
User 6baabcf48d
21-09-2016 08:16:24
Hello, Krisztina
Thanks for your help. I successfully located the SMILES that caused the problem:
InchiKey : "TYAGLVAIEVGVDE-UHFFFAOYSA-N"
Smiles : "CC(=O)CC1C2C13C24C35C46C57C68C79C81C92C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C11C22C1C2"
I hope this helps. Is there any pattern to these "invalid" SMILES?
Regards,
William
ChemAxon abe887c64e
22-09-2016 13:30:17
Hi William,
Thank you for the molecule. Unfortunately, it really does freeze the indexing in PostgreSQL Cartridge. Additionally, this molecule freezes the indexing in JChem Oracle Cartridge as well. We are now starting to investigate what causes this behavior and will let you know when the issue is fixed.
Until then, as a workaround, we can only recommend deleting this molecule from the dataset.
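One way to apply this workaround before loading is to filter the input CSV. The following Python sketch assumes the two-column layout used earlier in this thread (inchi_key, smiles) with no header row; the function name drop_molecules and the output file name are hypothetical.

```python
import csv


def drop_molecules(src_path: str, dst_path: str, bad_inchi_keys) -> int:
    """Copy the CSV from src to dst, skipping rows whose first column
    (the InChIKey) is in bad_inchi_keys. Returns the number of rows dropped."""
    bad = set(bad_inchi_keys)
    dropped = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            if row and row[0] in bad:
                dropped += 1
                continue
            writer.writerow(row)
    return dropped


if __name__ == "__main__":
    # InChIKey of the problematic molecule reported in this thread.
    drop_molecules("xxx.csv", "xxx_filtered.csv", ["TYAGLVAIEVGVDE-UHFFFAOYSA-N"])
```

If the data is already in the table, a DELETE on the inchi_key column would achieve the same result.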
Best regards,
Krisztina