尚硅谷大数据技术之Hive(新)第10章 Hive实战之谷粒影音
10.4.4 统计视频观看数Top50所关联视频的所属类别Rank
思路:
- 查询出观看数最多的前50个视频的所有信息(当然包含了每个视频对应的关联视频),记为临时表t1
t1:观看数前50的视频
select * from gulivideo_orc order by views desc limit 50; |
- 将找到的50条视频信息的相关视频relatedId列转行,记为临时表t2
t2:将相关视频的id进行列转行操作
select explode(relatedId) as videoId from t1; |
- 将相关视频的id和gulivideo_orc表进行inner join操作
t5:得到两列数据,一列是category,一列是之前查询出来的相关视频id
(select distinct(t2.videoId), t3.category from t2 inner join gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name; |
4) 按照视频类别进行分组,统计每组视频个数,然后排行
最终代码:
select category_name as category, count(t5.videoId) as hot from ( select videoId, category_name from ( select distinct(t2.videoId), t3.category from ( select explode(relatedId) as videoId from ( select * from gulivideo_orc order by views desc limit 50) t1) t2 inner join gulivideo_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5 group by category_name order by hot desc; |
10.4.5 统计每个类别中的视频热度Top10,以Music为例
思路:
1) 要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放categoryId展开的数据。
2) 向category展开的表中插入数据。
3) 统计对应类别(Music)中的视频热度。
最终代码:
创建表类别表:
create table gulivideo_category( videoId string, uploader string, age int, categoryId string, length int, views int, rate float, ratings int, comments int, relatedId array<string>) row format delimited fields terminated by "\t" collection items terminated by "&" stored as orc; |
向类别表中插入数据:
insert into table gulivideo_category select videoId, uploader, age, categoryId, length, views, rate, ratings, comments, relatedId from gulivideo_orc lateral view explode(category) catetory as categoryId; |
统计Music类别的Top10(也可以统计其他)
select videoId, views from gulivideo_category where categoryId = "Music" order by views desc limit 10; |