[Spike 2h] Which browsers are downloading PDFs (including OS percentages)?
Closed, ResolvedPublic

Description

This task captures the work to re-run the queries in T176467#3658144 including OS percentages.

Event Timeline

Using the query I got a days worth of data. Not clear if we need more data than that.

select user_agent_map["os_family"] as os_family, count(*) as c from webrequest where year=2017 AND month=9 AND day=26 AND uri_query LIKE "%action=redirect-to-electron%" group by user_agent_map["os_family"] sort by c desc;
os_familytotal% of total
Other262633.56%
Windows 7169121.61%
Windows 10160220.47%
Mac OS X5476.99%
Windows XP4385.60%
Windows 8.14055.18%
Windows Vista1602.04%
Windows700.89%
Android690.88%
Linux550.70%
Windows 98540.69%
Windows 8300.38%
Chrome OS270.35%
iOS190.24%
Ubuntu110.14%
Windows CE90.12%
Windows 200040.05%
Linux Mint30.04%
Windows 9520.03%
Bada10.01%
BlackBerry OS10.01%
Symbian OS10.01%

I'm not 100% sure what kind of traffic classes as "other" - but it looks like a large degree of it is bots from a quick browse.

I'm not 100% sure what kind of traffic classes as "other" - but it looks like a large degree of it is bots from a quick browse.

The time frame I chose isn't exhaustive but here's a couple of examples of what "other" OS and/or browser families are. Note well that I'm excluding known bots and spiders from the result set.

+---------------------------------+-------+
|           user_agent            |   n   |
+---------------------------------+-------+
| MyApp/0.1                       | 5412  |
| Mozilla/4.0 (compatible;)       | 1     |
| 12345                           | 8     |
| abc                             | 1     |
| HTTrack/3.0x                    | 2     |
| Mozilla/5.0                     | 1     |
| Yandex DocViewer                | 2     |
| Dorado WAP-Browser/1.0.0        | 4     |
| UNTRUSTED/1.0/MAUI WAP Browser  | 4     |
+---------------------------------+-------+

select
  user_agent,
  count(*) as n
from
  wmf.webrequest
where
  year = 2017
  and month = 10
  and day = 11
  and hour >= 0
  and hour < 12
  and webrequest_source = 'text'
  and uri_path like "%rest_v1/page/pdf%"

  # Exclude known bots and spiders.
  and agent_type = "user"

  and user_agent_map["os_family"] = "Other"
  and user_agent_map["browser_family"] = "Other"
group by
  user_agent;

@ovasileva: MyApp. The User-Agent header sent by the client can be anything you want it to be if you're so inclined.

Edit: I've certainly done the same in my older hobby projects. My guess would be that that's what this is, a hobby project.

Thanks for running that query, @Jdlrobson.

How did you calculate the percentages on a per-row basis? Per your comment in the parent task, T176467#3640968, the uri_path we're searching for is like '%rest_v1/page/pdf%'.

How did you calculate the percentages on a per-row basis?

I ask because I hadn't done this in one query in HiveQL before and I wanted to see if I could. Fortunately, @Milimetric helped me out and I arrived at the following:

1!set outputformat table
2
3# Disable Hive strict mode.
4#
5# Below we're going to compute the Cartesian product (the cross join) of two
6# small result sets. In strict mode, Hive will not allow a Cartesian product to
7# be computed at all.
8set hive.mapred.mode=nonstrict;
9
10# Create two Common Table Expressions:
11#
12# 1. The requests to /api/rest_v1/page/pdf/<title> in the time range that we're
13# interested in, grouped by OS and browser vendor, and counted.
14# 2. The total number of requests in #1, represented as a 1x1 result set.
15#
16# The cross join of these two CTEs will be a result set that looks like:
17#
18# | OS | browser_family | browser_major | n | total |
19# | --- | --- | ------------- | ---- | ----- |
20# | Windows 7 | Chrome | 61 | N | T |
21# | Windows 10 | Chrome | 61 | M | T |
22# | ... | ... | ... | ... | ... |
23#
24# Which will allow us to compute the percentage of the total number of requests
25# that the OS-browser pair made (simply, n / total * 100).
26#
27# Using CTEs allows us to keep:
28#
29# * The time range defined in one place (CTE #1)
30# * The percentage calculation and result set size decoupled
31#
32# * See https://cwiki.apache.org/confluence/display/Hive/Common+Table+Expression.
33with pdf_requests as (
34 select
35 user_agent_map["os_family"] as os_family,
36 user_agent_map["browser_family"] as browser_family,
37 user_agent_map["browser_major"] as browser_major,
38 count(*) as n
39 from
40
41 # Analytics Engineering provide a detailed description of wmf.webrequest and
42 # wmf_raw.webrequest here: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
43 wmf.webrequest
44 where
45
46 # Year, month, day, and hour are all Hive partitions that map to hourly
47 # imports of Varnish logs in HDFS. By constraining the time range for your
48 # query, you minimise the number of partitions that the cluster will load
49 # into memory to compute results.
50 #
51 # NOTE: You must ask in #wikimedia-analytics before analysing a month's
52 # worth of data as the size of the initial dataset is ~35 TB!
53 year = M
54 and month = N
55 and day = O
56 and hour >= P
57 and hour < Q
58
59 # The Varnish cluster that handled the request
60 # (see https://wikitech.wikimedia.org/wiki/Varnish#Cache_Clusters).
61 #
62 # Select a source whenever possible as it's a Hive partition. As with year,
63 # month, day, and hour above, constrain the source to minimise the number of
64 # partitions that the cluster will load into memory to compute results.
65 and webrequest_source = 'text'
66
67 and uri_path like "%rest_v1/page/pdf%"
68
69 # Exclude known bots and spiders.
70 and agent_type = "user"
71 group by
72 user_agent_map["os_family"],
73 user_agent_map["browser_family"],
74 user_agent_map["browser_major"]
75),
76total_pdf_requests as (
77 select
78 sum(n) as total
79 from
80 pdf_requests
81)
82
83select
84 os_family,
85 browser_family,
86 browser_major,
87 round(n / total * 100, 2) as pct
88from
89 total_pdf_requests
90cross join (
91 select
92 *
93 from
94 pdf_requests
95
96 # We select the size of the result set here so as to limit the size of the
97 # cross join. We know the maximum size of this result set whereas the maximum
98 # size of the pdf_requests CTE (CTE #1) is unknown.
99 order by
100 n desc
101 limit 50
102) as pdf_requests_top_50
103order by
104 pct desc
105
106;

https://pythonhosted.org/feedparser/http-useragent.html may be responsible for the user agent. People love to copy and paste examples :)

As for calculating percentages I did those manually in a spreadsheet.

@Tbayer - here's the results looking by OS

@phuedx Thanks for documenting the query used (T177969#3687130 )! Can you also specify the timespan for which it was ran? (I.e. the concrete values of M, N, O, P and Q.) I re-ran it for a different timespan - October 31 - and got quite different results. E.g. 38.15% for Chrome 61 on Windows 10 instead of 8.83%, but only 6.55% for "Other" instead of 14.38%, etc. This may be because the total number of downloads in the timespan used was too low, and hence the statistical error (random variation) too large.

--  adapted from https://phabricator.wikimedia.org/P6147 :

!set outputformat table

set hive.mapred.mode=nonstrict;

with pdf_requests as (
  select
    user_agent_map["os_family"] as os_family,
    user_agent_map["browser_family"] as browser_family,
    user_agent_map["browser_major"] as browser_major,
    count(*) as n
  from

    wmf.webrequest
  where

    year = 2017
    and month = 10
    and day = 31

    and webrequest_source = 'text'

    and uri_path like "%rest_v1/page/pdf%"

    and agent_type = "user"
  group by
    user_agent_map["os_family"],
    user_agent_map["browser_family"],
    user_agent_map["browser_major"]
),
total_pdf_requests as (
  select
    sum(n) as total
  from
    pdf_requests
)

select
  os_family,
  browser_family,
  browser_major,
  round(n / total * 100, 2) as pct
from
  total_pdf_requests
cross join (
  select
    *
  from
    pdf_requests
    
  order by
    n desc
  limit 50
) as pdf_requests_top_50
order by
  pct desc

;


+----------------+-----------------+----------------+--------+--+
|   os_family    | browser_family  | browser_major  |  pct   |
+----------------+-----------------+----------------+--------+--+
| Windows 10     | Chrome          | 61             | 38.15  |
| Windows 7      | Chrome          | 61             | 8.05   |
| Other          | Other           | -              | 6.55   |
| Windows 7      | Firefox         | 56             | 4.73   |
| Windows 10     | Firefox         | 56             | 4.28   |
| Windows 7      | IE              | 11             | 3.66   |
| Windows XP     | Chrome          | 49             | 1.62   |
| Mac OS X       | Safari          | 11             | 1.58   |
| Windows 8.1    | Chrome          | 61             | 1.45   |
| Windows 10     | Edge            | 15             | 1.45   |
| Windows XP     | Firefox         | 52             | 1.35   |
| Windows 10     | Chrome          | 62             | 1.26   |
| Windows 7      | Chrome          | 62             | 1.26   |
| Windows 10     | IE              | 11             | 1.18   |
| Windows 10     | Chrome          | 59             | 1.03   |
| Windows 8.1    | Firefox         | 56             | 0.92   |
| Windows 8.1    | IE              | 11             | 0.8    |
| Windows 10     | Firefox         | 53             | 0.77   |
| Windows 10     | Edge            | 14             | 0.63   |
| Windows 7      | Yandex Browser  | 17             | 0.58   |
| Mac OS X       | Chrome          | 61             | 0.48   |
| Windows 7      | Opera           | 48             | 0.48   |
| Ubuntu         | Firefox         | 56             | 0.47   |
| Android        | UC Browser      | 11             | 0.42   |
| Android        | Chrome Mobile   | 61             | 0.35   |
| Windows 8.1    | Chrome          | 62             | 0.34   |
| Windows 10     | Opera           | 48             | 0.33   |
| Windows Vista  | Firefox         | 52             | 0.33   |
| Mac OS X       | Firefox         | 56             | 0.3    |
| Windows 8      | Chrome          | 61             | 0.29   |
| Windows 7      | Chrome          | 43             | 0.28   |
| Mac OS X       | Safari          | 10             | 0.27   |
| Windows 7      | Firefox         | 52             | 0.27   |
| Windows 7      | Chrome          | 60             | 0.26   |
| Windows 10     | Edge            | 16             | 0.25   |
| Windows XP     | IE              | 6              | 0.24   |
| Android        | Android         | 4              | 0.21   |
| Ubuntu         | Chromium        | 57             | 0.2    |
| Windows 7      | Chrome          | 55             | 0.2    |
| Windows 8.1    | Chrome          | 55             | 0.19   |
| Windows Vista  | Firefox         | 43             | 0.18   |
| Windows 8      | Firefox         | 56             | 0.18   |
| Linux          | Firefox         | 52             | 0.17   |
| Linux          | Chrome          | 61             | 0.17   |
| Windows XP     | Chrome          | 46             | 0.16   |
| Windows 10     | Edge            | 13             | 0.16   |
| Windows 8.1    | Chrome          | 58             | 0.16   |
| Windows 7      | Chrome          | 59             | 0.16   |
| Mac OS X       | Firefox         | 48             | 0.16   |
| Android        | Opera Mini      | 30             | 0.16   |
+----------------+-----------------+----------------+--------+--+
50 rows selected (341.405 seconds)

Thanks, @Tbayer. The timespan I used was 0:00 - 12:00 on 2017/10/11, i.e.

year = 2017
and month = 10
and day = 11
and hour >= 0
and hour < 12

I've since removed the result set from that paste so that it's generic, so I'll reproduce it below for posterity.

[0]
+----------------+-----------------+----------------+--------+
|   os_family    | browser_family  | browser_major  |  pct   |
+----------------+-----------------+----------------+--------+
| Other          | Other           | -              | 14.38  |
| Windows 7      | Chrome          | 61             | 12.42  |
| Windows 10     | Chrome          | 61             | 8.83   |
| Windows 7      | IE              | 11             | 7.33   |
| Windows 7      | Firefox         | 56             | 6.59   |
| Windows 10     | Firefox         | 56             | 3.82   |
| Windows 10     | Edge            | 15             | 3.24   |
| Windows 8.1    | Chrome          | 61             | 3.07   |
| Windows XP     | Chrome          | 49             | 2.2    |
| Windows 10     | Chrome          | 59             | 1.53   |
| Windows 10     | IE              | 11             | 1.51   |
| Windows 8.1    | Firefox         | 56             | 1.31   |
| Windows XP     | Firefox         | 52             | 1.22   |
| Windows 8      | Chrome          | 61             | 1.15   |
| Windows 8.1    | IE              | 11             | 1.15   |
| Mac OS X       | Safari          | 11             | 0.9    |
| Windows 7      | Firefox         | 53             | 0.89   |
| Windows 7      | Firefox         | 52             | 0.78   |
| Ubuntu         | Firefox         | 56             | 0.78   |
| Windows XP     | IE              | 6              | 0.7    |
| Windows 7      | Chrome          | 55             | 0.68   |
| Windows 7      | Firefox         | 55             | 0.62   |
| Mac OS X       | Chrome          | 61             | 0.62   |
| Android        | UC Browser      | 11             | 0.6    |
| Windows 10     | Edge            | 14             | 0.59   |
| Windows 7      | Opera           | 48             | 0.53   |
| Android        | Chrome Mobile   | 61             | 0.49   |
| Windows 10     | Opera           | 48             | 0.44   |
| Windows 7      | Chrome          | 60             | 0.4    |
| Windows Vista  | Chrome          | 49             | 0.39   |
| Windows 7      | Yandex Browser  | 17             | 0.37   |
| Windows 10     | Firefox         | 55             | 0.37   |
| Mac OS X       | Safari          | 10             | 0.36   |
| Windows 10     | Chrome          | 50             | 0.34   |
| Android        | Android         | 4              | 0.33   |
| Mac OS X       | Firefox         | 56             | 0.33   |
| Windows 10     | Chrome          | 60             | 0.32   |
| Windows 8.1    | Chrome          | 43             | 0.3    |
| Android        | Amazon Silk     | 60             | 0.29   |
| Windows 7      | Sogou Explorer  | 1              | 0.27   |
| Windows 8      | IE              | 10             | 0.26   |
| Windows 7      | IE              | 8              | 0.26   |
| Windows 8      | Opera           | 12             | 0.25   |
| Windows 7      | IE              | 9              | 0.25   |
| Linux          | Firefox         | 52             | 0.25   |
| Mac OS X       | Firefox         | 53             | 0.24   |
| Windows 7      | Firefox         | 45             | 0.24   |
| Windows 10     | Firefox         | 57             | 0.24   |
| Windows 7      | Firefox         | 38             | 0.22   |
| Windows 10     | Firefox         | 47             | 0.21   |
+----------------+-----------------+----------------+--------+

Thanks @phuedx! So I think this discrepancy shows it's better to rely on a longer timespan. Below, I have extended it to four weeks (Oct 1-28), which is actually OK to do provided one is content to relegate one's query to new "nice" queue on Hive (and prepared to wait a bit longer in case there are more timely queries running in the normal, non-nice queue - but even so, this query only took less than two hours for these four weeks' worth of data).

T176467#3658144, which gave rise to this task, also included the suggestion to compare the PDF download frequencies to the analogous frequencies from our general traffic, which seems a better test of the hypothesis that older browser are more likely to download PDFs (see also the original task description at T176467 ). I have done this below: pdf_pct is the same percentage as above, pv_pct is the percentage among pageviews in general, and pdf_propensity is the quotient of these two, measuring how much more likely users with that browser/OS combination are to download PDFs, compared with a general reader.

E.g. readers using the then-current version (61) of Chrome on Android seem to be remarkably unlikely to download PDFs, even when compared to other Android browsers like UC Browser.

os_familybrowser_familybrowser_majorpdf_downloadspdf_pctpv_pctpdf_propensity
Windows 10Chrome6141277914.928.291.8
OtherOther-32923011.91.0811.05
Windows 7Chrome6132045011.586.471.79
Windows 7Firefox561472625.321.613.31
Windows 7IE111361504.922.811.75
Windows 10Firefox561185434.281.842.33
Windows 8.1Chrome61682862.471.511.63
Windows 10Edge15621902.251.371.64
Windows XPChrome49419951.520.433.51
Windows XPFirefox52361401.310.235.65
Windows 10IE11360841.31.091.19
Windows 10Chrome59348591.260.0429.63
Windows 8.1IE11305821.110.61.85
Windows 7Firefox55306021.110.343.27
Windows 8.1Firefox56283051.020.352.93
Mac OS XSafari11280021.011.210.84
Windows 10Edge14256560.930.392.38
Windows 10Firefox55213010.770.362.15
Mac OS XChrome61209580.761.90.4
Windows 7Chrome58194890.70.125.89
UbuntuFirefox56174390.630.173.78
Windows 7Opera48171970.620.311.99
Windows 8Chrome61161660.580.222.62
Windows 7Chrome62149950.540.22.75
AndroidChrome Mobile61145250.5211.820.04
Windows 10Opera48139020.50.31.67
Windows 7Firefox52132900.480.222.23
Windows 7Chrome50131910.480.0315.05
AndroidUC Browser11132170.480.510.93
Windows 10Chrome50133950.480.60.8
Windows 7Chrome60127650.460.281.62
Mac OS XSafari10119690.430.760.57
Windows 7Chrome52115940.420.14.08
Mac OS XFirefox56106880.390.281.36
Windows 10Firefox53103920.380.0222.13
Windows 10Chrome62103870.380.271.38
Windows 7Chrome43104140.380.0217.03
Windows 7Firefox5799980.360.0311.52
AndroidAndroid494290.340.920.37
Windows VistaFirefox5291410.330.074.8
Windows 7Firefox4591150.330.074.97
Windows XPIE687620.320.530.6
Windows 10Chrome6077120.280.281.0
Windows 10Edge1371110.260.073.52
Windows 8.1Firefox5570710.260.083.28
Windows 7Yandex Browser1765920.240.211.15
Windows CEIE466730.240.151.56
AndroidAndroid265570.240.181.33
Windows 7Chrome5562500.230.082.83
LinuxChrome6159480.210.141.55

Query used:

SET mapred.job.queue.name=nice;
set hive.mapred.mode=nonstrict;

with pdf_requests as (
  select
    user_agent_map['os_family'] as os_family,
    user_agent_map['browser_family'] as browser_family,
    user_agent_map['browser_major'] as browser_major,
    count(*) as n
  from

    wmf.webrequest
  where

    year = 2017
    and month = 10
    and day <= 28
    and webrequest_source = 'text'

    and uri_path like '%rest_v1/page/pdf%'

    and agent_type = 'user'
  group by
    user_agent_map['os_family'],
    user_agent_map['browser_family'],
    user_agent_map['browser_major']
),
total_pdf_requests as (
  select
    sum(n) as total
  from
    pdf_requests
),
pageviews AS (
  select
    user_agent_map['os_family'] as os_family,
    user_agent_map['browser_family'] as browser_family,
    user_agent_map['browser_major'] as browser_major,
    SUM(view_count) AS PVs
  from

    wmf.pageview_hourly
  where

    year = 2017
    and month = 10
    and day <= 28

    and agent_type = 'user'
  group by
    user_agent_map['os_family'],
    user_agent_map['browser_family'],
    user_agent_map['browser_major']
),
total_pageviews AS (
  select
    sum(PVs) AS total_PVs
  FROM
    pageviews
)

SELECT
  pdf_requests_top_50.os_family AS os_family,
  pdf_requests_top_50.browser_family AS browser_family,
  pdf_requests_top_50.browser_major AS browser_major,
  n AS PDF_downloads,
  round(n / total * 100, 2) as PDF_pct,
  ROUND(PVs / total_PVs * 100, 2) AS PV_pct,
  ROUND( (n / total) / (PVs / total_PVs) , 2) AS PDF_propensity
from
  total_pdf_requests
cross join (
  select
    *
  from
    pdf_requests

  order by
    n desc
  limit 50
) as pdf_requests_top_50
JOIN
total_pageviews
CROSS JOIN
pageviews
ON pdf_requests_top_50.os_family = pageviews.os_family AND
  pdf_requests_top_50.browser_family = pageviews.browser_family AND
  pdf_requests_top_50.browser_major = pageviews.browser_major

order by
  PDF_pct desc
;