penk: (notagain)
[personal profile] penk
Okay LJ folks. Riddle me this one:

select * from queue_encoder
where
status='ready' and
machine_name IS NULL
order by
urgent_flag desc ,
timestamp_enqueued DESC
limit 10

yacht-2:~ dbs$ time mysql -uroot cms < urgentfirst.sql
[...]
real 0m0.037s

explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE queue_encoder range index_1,index_3,compound_1,compound_2 compound_2 154 NULL 3377962 Using where


---------------

Now if I add a constraint:
select * from queue_encoder
where
status='ready' and
machine_name IS NULL AND
codec_id IN (4,9,10,11)
order by
urgent_flag desc ,
timestamp_enqueued DESC
limit 10

yacht-2:~ dbs$ time mysql -uroot cms < urgentfirst.sql
[...]
real 0m24.982s

explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE queue_encoder ref index_1,index_3,compound_1,compound_2 compound_2 154 const,const 1576779 Using where


CREATE TABLE `queue_encoder` (
`queue_encoder_id` int(11) NOT NULL AUTO_INCREMENT,
`media_type_id` int(11) NOT NULL DEFAULT '0',
`item_id` int(11) NOT NULL DEFAULT '0',
`label_id` int(11) NOT NULL DEFAULT '0',
`codec_id` int(11) DEFAULT NULL,
`urgent_flag` int(1) NOT NULL DEFAULT '0',
`machine_name` varchar(50) DEFAULT NULL,
`process_id` int(2) DEFAULT NULL,
`timestamp_enqueued` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`timestamp_started` datetime DEFAULT NULL,
`timestamp_finished` datetime DEFAULT NULL,
`status` enum('ready','success-complete','success-incomplete','error','in-process','delay') NOT NULL DEFAULT 'ready',
`status_message` varchar(255) DEFAULT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`queue_encoder_id`),
UNIQUE KEY `unique_1` (`item_id`,`media_type_id`,`codec_id`),
KEY `index_1` (`machine_name`,`process_id`,`status`),
KEY `index_2` (`timestamp_enqueued`,`label_id`,`codec_id`,`media_type_id`),
KEY `index_3` (`status`,`machine_name`,`codec_id`,`media_type_id`,`label_id`),
KEY `compound_1` (`status`,`machine_name`,`codec_id`,`media_type_id`,`label_id`,`urgent_flag`,`timestamp_enqueued`),
KEY `compound_2` (`status`,`machine_name`,`urgent_flag`,`timestamp_enqueued`)
) ENGINE=MyISAM AUTO_INCREMENT=74076324 DEFAULT CHARSET=utf8 |

(Background - this table has 3.4 million rows-ish in it. We have a process that asks the following select question about every 30 seconds:

SELECT * from queue_encoder
WHERE
status = 'ready' AND
machine_name IS NULL AND
codec_id IN (4,9,10,11) AND
media_type_id IN (1,2) AND
label_id IN (2,3,4,11,13,15,17,18,19,20,21,22,29,30,46,47)
ORDER BY
urgent_flag DESC,
timestamp_enqueued DESC
limit 20

That query, due to ISAM table locking, can take 2-4 minutes to complete (it's actually an UPDATE...WHERE clause). If I can make that be fully indexed, we will significantly increase our processing throughput.

Help?

Date: 2010-07-03 08:00 pm (UTC)
From: [identity profile] http://users.livejournal.com/cgull_/
that's a lot of songs you have there.

i'm no mysql expert, and a little rusty, but...

compound_2 is a subset of compound_1, which has everything you need. (ISTR that keys that are a subset of another key aren't a good idea in general.) yet the engine is searching compound_2 (either because the ORDER BY clause is forcing it or because of poor optimization decisions), and needing to refer to other things to do the query. try removing compound_2, or hinting the query to use compound_1 (I forget how that works).

also remember that ORDER BY...LIMIT effectively requires a sort in there somewhere, in order to be able to get the highest-valued candidates to the top for LIMIT to be applied to. if you can change this to twenty UPDATE...LIMIT 1 queries, then the engine may just say, "oh, I just need to look at the rows that have the highest known value in urgent_flag, and of those, the rows that have the highest known value in timestamp_enqueued" (and if your inserts were slow enough, there's only one of the latter).

how big are your MySQL caches/buffers?

--jh

Date: 2010-07-03 09:50 pm (UTC)
From: [identity profile] penk.livejournal.com
that's a lot of songs you have there.

Well, this IS for work - and my company does distribute music for a living :)

how big are your MySQL caches/buffers?

We actually pursued this a bunch. Using the key_buffer_read variables in the server, we could see that indeed the entire query was using in-memory indexes. so THAT wasn't it.

A good check though :)

Date: 2010-07-06 01:30 pm (UTC)
From: [identity profile] awfief.livejournal.com
Both queries are using the compound_2 index:

KEY `compound_2` (`status`,`machine_name`,`urgent_flag`,`timestamp_enqueued`)

The first query has filters (where clause) on
status, machine name

and sorts by:
urgent_flag
timestamp_enqueued


The second query has filters on
status, machine name, codec_id

and sorts by:
urgent_flag
timestamp_enqueued

So if you just want an index for that, you'd want an index on:

(status,machine_name,codec_id,urgent_flag,timestamp_enqueued)

-------------
HOWEVER:

MySQL uses B-trees for its underlying structure for MyISAM tables and indexes (by default, unless you specify otherwise). Therefore, range and equality matches are pretty good. Also, the prefix of indexes can be used. (I liken it to scanning through a dictionary, it's easy to find the words starting with s, sh, or she, but not really possible without a full scan of each word to find words ending in e.)

Also, because you're doing a SELECT *, the MySQL optimizer has to figure out if it's worth it to actually go to the index to find the rows you need, and THEN go to the table data to find the rest of the information. If the
*optimizer thinks* you will need more than about 20-30% of the rows, it will just do a full table scan instead. This is NOT your problem, however I am bringing this point up because both this and the previous lead to a corollary:

MySQL isn't particularly good at finding "flags", even when indexed, especially when the "flag" you are searching for is shared in common with about 20% or more of the data. For example, say you have a flag on gender with male, female, other/don't know, where 45% are male, 45% are female and 10% are other/don't know. Searching for "other" will be fast and use the index, searching for male or female probably won't bother.

Likewise, NULL adds extra time in the search. In this particular case I'd advise a machine name of '', with the field marked as NOT NULL DEFAULT '', instead of NULL. You will never have a question of "do they mean 'blank' or do they mean NULL" in this case, because all machines have names.

-----------------
In addition, if you're updating on a MyISAM table, you are locking the WHOLE table until the update completes. I would advise using InnoDB for this table. If you need a certain feature of MyISAM, it's likely a read feature (ie, exact optimizer statistics, FULLTEXT search) and you could have the table InnoDB on the master and MyISAM on the slave (which would mean your slave lag could be 2-4 minutes when such a query goes through, but at least the master isn't locking).

October 2022

S M T W T F S
      1
2345678
9101112131415
16171819202122
2324 2526272829
3031     

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Apr. 7th, 2026 09:05 pm
Powered by Dreamwidth Studios