duplicate filtering

User 870ab5b546

11-09-2014 19:46:36

The documentation for setDuplicateImportAllowed() says,


If set to DUPLICATE_FILTERING_ON does not import molecules that already exist in the table with the same topology.

We have a JChem table with two additional fields, reactionId and products. If I set duplicate filtering to on, will UpdateHandler consider these additional fields when it considers whether an existing record is a duplicate of what it is about to insert?

ChemAxon abe887c64e

12-09-2014 08:02:12

Hi Bob,


The additional fields are not considered during the duplicate filtering, only the chemical structure.


Krisztina

User 870ab5b546

12-09-2014 13:41:05

Hmm. Maybe you can help me with this problem, then. When we do a Reactor calculation, we store the results of the calculation in the database, one line per calculated product:


starter1 reactionId1 product1


starter1 reactionId1 product2


starter1 reactionId2 product1


starter2 reactionId1 product1


etc.


where reactionId and product are fields that we have added. Once we have added all the records for a particular combination of starter and reactionId, we don't want to add those records again. 


When a student submits a response, we first see whether we have already done the Reactor calculation for each particular combination of starter and reactionId in that student's response by seeing if there are any records in the database. If not, we do the calculations, then store the results in the database.


The problem I am wrestling with is if two students submit identical responses at nearly the same time.



  1. Student 1 submits a response; program looks in database to see if starter/reactionId combination is already there, finds it isn't, starts calculation

  2. Student 2 submits a response; program looks in database to see if starter/reactionId combination is already there, finds it isn't, starts calculation

  3. Calculation of student 1 is finished; program stores results in database.

  4. Calculation of student 2 is finished; program stores results in database because it doesn't realize that student 1's identical results have already been stored.


I thought I might be able to solve this problem by duplicate filtering based on starter, reactionId, and product, but you say that won't work. 


Suppose I open a connection to the JChem table and first query whether the same starter/reactionId combination is already there, and, if not, use the same connection to store the records. Will using the same connection for both the query and writing prevent another thread from querying and writing before the first writing is complete? If not, can I add some SQL to the UpdateHandler query to keep it from writing if there is another record already there? If not, do you have any other suggestions?


Here is our code:


    /** Stores the products, calculated by Reactor, of a particular set of 
* reaction conditions operating on a particular molecule or group of
* molecules. Called from SynthSolver.
* <p>According to
* <a href="https://www.chemaxon.com/forum/ftopic7311.html">this</a>
* and
* <a href="https://forums.oracle.com/forums/thread.jspa?threadID=279238">this</a>
* discussion, we need to get a ConnectionHandler that uses the inner
* OracleConnection of the pooled connection, but later we need to close
* the outer pooled connection without closing the inner one.
* @param molStr starting material(s) of the reaction
* @param rxnId the ID number of the reaction conditions operating on
* the molecule or group of molecules
* @param calcdProdStrs the products calculated by Reactor, as strings
*/
public static void addCalcdProducts(String molStr, int rxnId,
String[] calcdProdStrs) throws DBException {
final String SELF = "ReactorResultsRW.addCalcdProducts: ";
Connection con = null;
UpdateHandler uh = null;
try {
con = getPoolConnection();
final ConnectionHandler conHandler = getConHandler(con);
if (conHandler == null) {
alwaysPrint(SELF + "can't get innermost Oracle connection; "
+ "need to add data to table in two steps.");
addCalcdProductsTwoSteps(molStr, rxnId, calcdProdStrs);
} else {
final String rxnName = getName(con, rxnId);
final String addlCols =
joinAll(REACT_RXN_ID, REACT_CALCD_PRODS).toString();
debugPrint(SELF + "storing ", calcdProdStrs.length,
" calculated product(s) for starting materials ",
molStr, " and reaction ", rxnName, " (ID ", rxnId,
", addlCols = ", addlCols, "):\n", calcdProdStrs);
uh = new UpdateHandler(conHandler, UpdateHandler.INSERT,
REACTOR_RESULTS, addlCols);
uh.setDuplicateFiltering(UpdateHandler.DUPLICATE_FILTERING_OFF);
for (final String calcdProd : calcdProdStrs) {
uh.setStructure(molStr);
uh.setValueForAdditionalColumn(1, rxnId);
uh.setValueForAdditionalColumn(2, calcdProd);
uh.execute();
} // for each calculated product
} // if we got a good connection handler
} catch (SQLException e) {
alwaysPrint(SELF + "caught SQLException trying to store "
+ "reaction of ", molStr, " under rxn conditions ",
rxnId, " giving products ", calcdProdStrs);
e.printStackTrace();
throw new DBException(e.getMessage());
} catch (Exception e) {
alwaysPrint(SELF + "caught Exception trying to store "
+ "reaction of ", molStr, " under rxn conditions ",
rxnId, " giving products ", calcdProdStrs);
e.printStackTrace();
throw new DBException(e.getMessage());
} finally {
try {
if (uh != null) uh.close();
} catch (SQLException e) {
// do nothing
}
closeConnection(con);
} // try
} // addCalcdProducts(String, int, String[])

ChemAxon a9ded07333

15-09-2014 10:01:47

Hi Bob,


I think the only thing you need to do is make your method synchronized:




 public synchronized static void addCalcdProducts

Synchronization guarantees that no parallel calls can be made to this method, so the second student's request (steps 2. and 4. in your list) will automatically wait until the first student's process (1. and 3.) finishes. Since in step 3. the molecule is inserted, in step 2. the second student's structure will be indentified as a duplicate and won't be stroed again.


Regards,
Tamás


 

User 870ab5b546

15-09-2014 13:43:30

That's easy enough. Thanks!