If you want to dummy up a bunch of data in PostgreSQL you can use this neat little trick
create table test (a int); insert into test (select generate_series(0,999,1)); INSERT 0 1000
Knowledge Breeds Fear
If you want to dummy up a bunch of data in PostgreSQL you can use this neat little trick
create table test (a int); insert into test (select generate_series(0,999,1)); INSERT 0 1000
I frequently find the need to have single quoted output when i generate dynamic SQL.
It's always a pain to remember the exact number of ticks needed to get the quoted output.
Here is a reminder on how to do it:
select ''''||schemaname||'.'||tablename||'''' from pg_tables
4 quotes when there's no text, or 3 quotes on the outside and one on the inside when there is text.
select '''public.'||tablename||''';' from pg_tables
PostgreSQL, like many RDBMSs uses a cost based optimizer. CBOs rely on database "statistics" (number of rows, data distribution, etc.) to generate a good execution plan for any query that is sent to the engine. If you have bad statistics, then you potentially have bad query plans. So a query that should take 1 second ends up taking 3 hours.
I recently ran into a problem like that in PostgreSQL, the query ran fine *most* of the time, but every once in a while it would go off into the weeds.
After a little digging I realized that the table's statistics hadn't been updated in quite a while. This struck me as odd since I knew that PostgreSQL automatically updates statistics on a regular basis. So I did some more digging to figure out exactly what triggers statistics to be updated in PostgreSQL.
First off, it's important to note that in postgresql the vacuum process is what handles statistics. "vacuum analyze" is the command although you can just use "analyze" too. Vacuuming and analyzing are different operations, and have different, but very similar parameters that control them.
The key postgresql.conf parameters that affect whether or not a table gets auto-analyzed are:
#autovacuum_analyze_threshold = 50 #autovacuum_analyze_scale_factor = 0.1
(You can see the defaults in the postgresql.conf above)
The formula that Postgres uses to determine if a table needs to be auto-analyzed is
[ # of rows in table ] * [ scale factor ]) + [ threshold ]
So, for example, if you have a table with 10,000 rows in it, then it would look like this:
( 10000 * 0.1 ) + 50 = 1050
So if you were to add 1049 rows, the table would not get auto-analyzed.
So how do you change it? Well, there are 2 ways. you could change the 2 parameters in the postgresql.conf, and this is a good idea if you think you need to globally lower or raise them.
However, I think in most cases you'll want to make the change just for one table, which means you need to utilize the pg_autovacuum table.
postgres=# \d pg_autovacuum
Table "pg_catalog.pg_autovacuum"
Column | Type | Modifiers
------------------+---------+-----------
vacrelid | oid | not null
enabled | boolean | not null
vac_base_thresh | integer | not null
vac_scale_factor | real | not null
anl_base_thresh | integer | not null
anl_scale_factor | real | not null
vac_cost_delay | integer | not null
vac_cost_limit | integer | not null
freeze_min_age | integer | not null
freeze_max_age | integer | not null
Indexes:
"pg_autovacuum_vacrelid_index" UNIQUE, btree (vacrelid)
Remember when I mentioned above that vacuuming and analyzing were handled via similar mechanisms? you can see that here, the table has fields related to vacuuming and analyzing.
The fields we care about for this particular problem are: vacrelid, enabled, anl_base_thresh and anl_scale_factor.
select oid from pg_class where relname = <tablename>
To make the parameter change for just one table we then need to insert into the pg_autovacuum table.
I'm changing the scale factor from 0.1 to 0.9, the rest are the default values for those parameters
What that means is in our example:
( 10000 * 0.1 ) + 50 = 1050
Changes to:
( 10000 * 0.09 ) + 50 = 950
In other words, the autovacuum would kick off 100 records sooner, not a huge difference, I know however to the planner it could make all of the difference in the world.
A little while ago I posted about PostgreSQL clustering and I said that I was going to evaluate HA-JDBC as an option.
The reason I wanted to use HA-JDBC is that I was looking for a no-coding required solution for seamless fail over. (Similar to Oracle RAC) for PostgreSQL.
I'll be using a Shared Disk / Heartbeat cluster on the server side, however when the node fails the application will register an error, which is undesirable.
After doing more research it's been determined that HA-JDBC won't work.
It seems that HA-JDBC is, at best, a SQL replicator, where you have 2 active nodes and HA-JDBC will perform inserts and updates across both databases to keep them in sync. This is fine if you're not using the "serial" data type in PostgreSQL, Triggers, Functions, time based default values, etc. Using any sort of trigger, the code fires independently on insert and you end up getting out of sync databases.
Another reason why HA-JDBC won't work is that if a node is un-reachable HA-JDBC removes it from consideration. So your 2nd "standby" when it becomes active won't be considered by HA-JDBC without some manual intervention.
And finally, I'd advise steering away from HA-JDBC at this point even if the above works for you. I posted a few clarifying questions regarding the above to both the HA-JDBC forums and their mailing list and received no response. If your business is looking into true high availability for their servers you need to choose all of your components with care. A non-responsive community either means the project is dead, or un-caring, both of which are unacceptable when you're looking into HA solutions.
Finally got a response from Paul Ferraro, Nice guy, very helpful posted below:
Sorry for the slow response... HA-JDBC is not the right tool for this job. HA-JDBC is an *alternative* to shared disk failover - it was not designed to be used in concert with it. Instead, you want something like JBoss HA DataSource or Weblogic multi-pools. These are DataSource proxies whose getConnection() method returns a raw connection from the first available data source. DataSource-level. Connections returned by HA-JDBC's DataSource, on the other hand, are proxies to connections to each active database in your cluster. I can go into more detail if you'd like, such as the advantages/ disadvantages of HA-JDBC over shared-disk failover, if you're interested. Paul
pgfouine is a nice logfile analyzer for PostgreSQL written in php.
I'm doing a trace on a very long running ETL process and the logfile generated is ~11GB.
I'm running up against a 2GB barrier in php for fopen(). If you've got a 64bit machine and can recompile php with -D_FILE_OFFSET_BITS=64 then you're good to go. But in my case, I can't do either.
The error i'd get is:
So for Plan B I had to remember back to the days when 64 bit wasn't even an option (back in my day, we had 8 bits and we liked 'em!)
I used a named pipe since pgfouine expects a file and doesn't seem to be able to read from stdin.
Once that kicked off I stopped getting that error and pgfouine was able to process the file.
Random Database, OS or otherwise interesting tips and tricks.