Tuesday, November 25, 2008

INSERT if new and UPDATE otherwise

Recently I faced the following problem. I had to insert a big number of records into a PostgreSQL database. I have created a string which contains the various insert statements like below. The primary key is defined on the columns a,b.

INSERT INTO tablename (a,b,c) VALUES (1,2,'abc');
INSERT INTO tablename (a,b,c) VALUES (2,2,'def');
INSERT INTO tablename (a,b,c) VALUES (2,2,'ghi'); --generates an error!!
INSERT INTO tablename (a,b,c) VALUES (3,2,'ghi');

This worked fine until I encountered a duplicate primary key violation. The main problem with this was that after the error, the transaction was aborted and rolled back. So no record was inserted. I wanted instead that if such an error occurs, the existing record gets simply updated with the new values and if no record exists, it should be inserted. I searched the internet and there where different proposals but none of them really convinced me. I then came up with the following function:

CREATE OR REPLACE FUNCTION insert_update_record(a_in int, b_in int, c_in character varying)
RETURNS void AS
$BODY$
BEGIN
-- try to insert the record
BEGIN
INSERT INTO tablename (a,b,c) VALUES (a_in, b_in, c_in);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- if there is a duplicate key exception
-- update the record
UPDATE tablename SET c = c_in WHERE a = a_in AND b = b_in;
END;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;

This function simply tries to insert the new record into the table and if there is a duplicate key exception it will update the record. The previous insert statements would then be converted to:

SELECT insert_update_record(1,2,'abc');
SELECT insert_update_record(2,2,'def');
SELECT insert_update_record(2,2,'ghi'); --generates an error but does not abort the transaction
SELECT insert_update_record(3,2,'ghi');

I did not make any performance tests, but I believe that this function is not much slower than the INSERT statements if there are not many duplicates; because it simply makes an insert and returns and only if there is an exception it does some additional work. Since this function is precompiled it may be that it is even faster than the INSERT statement. Anyway the function can help you solve the insert/update problem.
Of course you have to adjust the function according to your needs, but if you have some experience with programming it shouldn't be too difficult ;-)

UPDATE:
Based on the suggestion of Peter I made up a new function which is pratically an implementation for PostgreSQL of his pseudo SQL statements:
UPDATE sample
SET testno = 1;
WHERE test = 'PL/SQL';
IF SQL%ROWCOUNT ==0
THEN
/* Insert Statement */

END IF;

I have created then a table 'test' with the columns 'id' (pk), 'val1' and 'val2'. Then I converted the pseudo SQL into:

CREATE OR REPLACE FUNCTION update_insert(id_in integer, val1_in integer, val2_in integer)
RETURNS void AS
$BODY$
DECLARE
count int;
BEGIN
update test set val1 = val1_in, val2 = val2_in where id = id_in;
GET DIAGNOSTICS count = ROW_COUNT;

IF count = 0 THEN
INSERT INTO test (id, val1, val2) VALUES (id_in, val1_in, val2_in);
END IF;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;

and provided also the function for this table with the exception handling:

CREATE OR REPLACE FUNCTION insert_update(id_in integer, val1_in integer, val2_in integer)
RETURNS void AS
$BODY$
BEGIN
-- try to insert the record
BEGIN
INSERT INTO test (id,val1,val2) VALUES (id_in, val1_in, val2_in);
RETURN;
EXCEPTION WHEN unique_violation THEN
-- if there is a duplicate key exception
-- update the record
UPDATE test SET val1 = val1_in, val2 = val2_in WHERE id = id_in;
END;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;

Then I created with a small application the SQL statements that call these functions and I created them with differenct percentage of duplicates:

SELECT insert_update(1,101,-99);
and
SELECT update_insert(1,101,-99);

Here are the results of my performance measures:
1 000 statements
tried to insert every id twice -> 500 records in DB
update_insert: 250ms
insert_update: 250ms

100 000 statements
tried to insert every id twice -> 50 000 records in DB
update_insert: 37281ms
insert_update: 45125ms

100 000
every 10th record was already present -> 90 000 records in DB
update_insert: 63875ms
insert_update: 61500ms

10 000
all ids were the same -> 1 record in DB
update_insert 10016ms
insert_update 18703ms

I did the tests on my laptop; I know that they may not be precise because of running services in the background, but at least they give an intuition.
The result of my test:
- High number of duplicates -> the update/insert function is much faster
- Low number of duplicates -> the two functions are almost equally fast

I was surprised that there is not more difference between these two functions. I expected that the function with the exception would be much slower than the other one. I would suggest to use the update/insert function as proposed by Peter because exceptions should only be used in exceptional situations and avoided when possible.
Thanks Peter for the proposal...

5 comments:

Peter Gfader said...

Hi Manfred,

I would suggest you, do an update.
If that Update return 0 affected rows, do an insert.

Pseudo SQL code

UPDATE sample
SET testno = 1;
WHERE test = 'PL/SQL';
IF SQL%ROWCOUNT ==0
THEN
/* Insert Statement */

END IF;

Manfred said...

Hello Peter,
thanks for your suggestion. I will try it out as soon as I have some time. I will also do some perfomance measures and post them here...

Do you think that your version is also faster if the most cases are inserts rather than updates?
Do you have some knowledge if the exception creates much overhead?

Juri Strumpflohner said...

Hi,
nice to see you back blogging, after some break :)

@Peter
See you found your way here to Manfred's blog :)

Peter Gfader said...

I have done this on Oracle, ~80% of the rows where "Updates", ~20% were new "Inserts".

With the Exception Handling it was painful slow...

Manfred said...

Hello,
I did some performance tests and updated my post...