Details

    • Sub-task
    • Status: Resolved
    • Major
    • Resolution: Not A Problem

    Description

      On lucene.apache.org we use Google Analytics tracking

      GOOGLE_ANALYTICS_TRACKING_ID = 'UA-94576-12'

      I think the reason was so that we could estimate downloads from mirrors by counting the number of clicks on the links from the download pages. But is anyone ever looking at or publishing those numbers?

      The ASF wants projects to stop using 3rd party tracking of users and instead ask INFRA for aggregated stats for the page. WDYT? Should we

      1. Remove trackers from both sites and rely on stats from infra
      2. Continue using Google analytics, but have someone actually publish numbers from it every month?
      3. Use some other way of counting downloads?

      What do we get without a tracker?

      INFRA provides anonymous page view stats here https://uls.apache.org/exports/lucene.apache.org.yaml which gives some insight, but not into downloads specifically. We see 12k visits to the Solr downloads page last month, but we don't know how many of those visitors clicked a download link...

      Sheet3:
        Name: Most visited pages, past month
        Values:
          /solr/index.html: 33604
          /index.html: 27588
          /solr/downloads.html: 12118
          /core/2_9_4/queryparsersyntax.html: 11135
          /core/index.html: 10353
          /solr/guide/solr-tutorial.html: 9734
          /solr/resources.html: 8014
          /solr/features.html: 7046
          /solr/guide/8_8/solr-tutorial.html: 6099
          /solr/news.html: 5843
          /solr/guide/6_6/the-standard-query-parser.html: 5216
          /solr/guide/index.html: 4430
          /solr/guide/6_6/common-query-parameters.html: 4379
          /core/downloads.html: 3644
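To make the limitation concrete, here is a minimal sketch that reads the page-view section of such an export and picks out the downloads pages. It assumes the export keeps the flat `path: count` layout shown above; the helper names (`parse_page_counts`, `download_page_visits`) are hypothetical, and it deliberately avoids a YAML library so that it runs with the standard library only. The sample is a subset of the values listed above.

```python
def parse_page_counts(lines):
    """Collect 'path: count' pairs, skipping any line that is not a page entry."""
    counts = {}
    for line in lines:
        # Split on the last colon so the path itself stays intact.
        path, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and path.startswith("/") and value.isdigit():
            counts[path] = int(value)
    return counts


def download_page_visits(counts):
    """Return only the entries for downloads pages (assumed to end in /downloads.html)."""
    return {p: n for p, n in counts.items() if p.endswith("/downloads.html")}


# Subset of the export shown above, used as sample input.
SAMPLE = """\
Sheet3:
  Name: Most visited pages, past month
  Values:
    /solr/index.html: 33604
    /index.html: 27588
    /solr/downloads.html: 12118
    /core/downloads.html: 3644
"""

counts = parse_page_counts(SAMPLE.splitlines())
downloads = download_page_visits(counts)
```

These are page views of the downloads pages, not downloads, which is exactly the gap described above.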
      

      There's an interesting section at the bottom of that YAML page, wonder if it could be enabled in some way

      Sheet6:
        Name: Downloads, past month
        Values: {}
      

          Activity

            arafalov Alexandre Rafalovitch added a comment -

            Google Analytics offers a lot more than what INFRA provides, though the INFRA stats are useful too. It can do breakdown reports, such as referrers to specific pages. That would let us find out whether somebody links to old pages and try to fix that. I wish we had Google Analytics on the RefGuide as well, for the same reasons.

            But that's only useful if somebody actually has access to Google Analytics and does something with it. I work a tiny bit with GA and would be happy to have a look at the account and produce a couple of reports (maybe shared with committers or the PMC only). Then we can decide whether it is worth keeping.

            There are also non-GA ways to track analytics, e.g. https://matomo.org/, which is free on-premises. But that needs INFRA to own it.
            janhoy Jan Høydahl added a comment -

            Alexandre, would you like to own this JIRA? I.e. analyze some traffic. Try to find a way to get solid numbers for outbound links to https://www.apache.org/dyn/closer.lua/*, which is the closest we get to download numbers.
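As a rough illustration of what "numbers for outbound links" could mean, the sketch below counts clicks per artifact from a list of clicked outbound URLs. Where that list comes from (a GA export, a log, anything else) is left open, and the function names and the sample artifact path are hypothetical; only the `https://www.apache.org/dyn/closer.lua/` prefix comes from the comment above.

```python
from collections import Counter
from urllib.parse import urlsplit

MIRROR_HOST = "www.apache.org"
MIRROR_PATH_PREFIX = "/dyn/closer.lua/"


def is_mirror_link(url):
    """True if the URL points at the ASF mirror redirector."""
    parts = urlsplit(url)
    return parts.netloc == MIRROR_HOST and parts.path.startswith(MIRROR_PATH_PREFIX)


def count_mirror_clicks(clicked_urls):
    """Count clicks per artifact path (the part after the closer.lua prefix)."""
    counts = Counter()
    for url in clicked_urls:
        if is_mirror_link(url):
            artifact = urlsplit(url).path[len(MIRROR_PATH_PREFIX):]
            counts[artifact] += 1
    return counts


# Hypothetical click data, for illustration only.
clicks = [
    "https://www.apache.org/dyn/closer.lua/lucene/solr/8.8.1/solr-8.8.1.tgz",
    "https://www.apache.org/dyn/closer.lua/lucene/solr/8.8.1/solr-8.8.1.tgz",
    "https://solr.apache.org/news.html",
]
mirror_counts = count_mirror_clicks(clicks)
```

Such a count still only measures clicks toward the mirrors, not completed downloads.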


            arafalov Alexandre Rafalovitch added a comment -

            I'll try to dig into this over the weekend.

            arafalov Alexandre Rafalovitch added a comment -

            We don't have enough data yet for super deep insights, but it is already interesting. We have a dual-linked property configured (classic and GA4), and I jumped around a bit trying to understand both.

            A couple of things that popped out:

            1. 93% of our users are on desktop. That's higher than normal, but perhaps reflects that we are delivering software, not a service as such.
            2. We get about 3x as many users on work days as on weekends, so this seems to correspond to work-time interest.
            3. The home page is the most popular, which is usual; the second most popular (half of that) is downloads (good); everything else drops off by half again or worse.
            4. The guide page is only 5% of visits; I find that under-promoted, though this does not count actual reference guide page usage (no GA there).
            5. The landing (first) page is the home page for 67% and downloads for 15%; the rest are much lower and don't get much (first page) attention.
            6. Half of the traffic arrives from Organic Search (and that's basically all Google); 30% is direct, which may be copy/pastes, PDF links, etc.; interestingly, we get a tiny amount of traffic (but more than from Bing) from links on localhost:8983 (clearly from our Admin UI links); we don't have a partner promoting Solr so much that they are driving noticeable traffic to us.
            7. Most of the social traffic (70%) is from Twitter; our presence on Hacker News, Reddit and even Stack Overflow is not effective.
            8. Most visitors are from the USA, then China, India, and Germany, and then there is a heavy drop-off; I would be curious who
            9. 83% of users are marked as new, though this may be an issue of time and of cookie-clearing technologies.

            A couple of notes, looking for feedback/+-1:

            • I really, really wish we would wire the Reference Guide for analytics, even for 30 days or so. I think (hope) that may change a lot of the above numbers and, even more importantly, would give us some visibility into user flow through the information.
            • Search keywords that lead to the site are mostly in Search Console (as are errors), but it failed to validate through the GA4 tag. We can validate by uploading a file to the site instead. I recommend doing that since, again, this is about getting access to information Google already collects (regardless of GA, actually).
            • In GA4, I saw a file_download event, and it even had a 'filename' attribute, but it would only show that attribute for the last 30 minutes. I could not figure out how it was set up and/or how to see the values of the filename attribute over a longer period of time; if there is any information on whether/how that was set up, it would be good to know.
            janhoy Jan Høydahl added a comment -

            Thanks for doing this Alexandre.

            Guess this gives us some insight into traffic patterns. Would be interesting to see the source of the refguide 6_6 traffic, but I'm quite convinced it is Google, since 6.6 guide still shows up on top for some queries at my end.

            We should prepare for sunsetting GA on both Lucene and Solr sites, note board member Justin's comment on lucene dev list https://lists.apache.org/thread.html/re44bf57334ca786b3b5c7c66f27c45604e15f9850920dd565ad64889%40%3Cdev.lucene.apache.org%3E

            Also, I just noticed that INFRA must have started tracking downloads again, since numbers have started appearing in https://uls.apache.org/exports/solr.apache.org.yaml. Those numbers are so low that I wonder if they are counting the right thing. My guess is that Solr numbers are hidden in Lucene's stats since the current artifacts are actually in lucene/solr folder of download site.

            Of course, any download numbers from this stat will only be the number of clicks from the homepage to the mirrors. We don't know how many of those actually end up as a download, and we cannot know how many people visit the mirrors directly the next time. Also, with Docker on the rise, the number of Docker image pulls is equally and increasingly interesting. We can get a count of total pulls with this command:

            curl -s https://hub.docker.com/v2/repositories/library/solr/ | jq -r ".pull_count"
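Since `pull_count` is a running total, tracking it over time needs periodic snapshots. A minimal Python sketch of the same request follows, assuming only what the curl command above shows (a JSON document with a `pull_count` field); the function names and the timeout value are illustrative choices.

```python
import json
import urllib.request

DOCKER_HUB_URL = "https://hub.docker.com/v2/repositories/library/solr/"


def extract_pull_count(payload):
    """Return the pull_count field from a repository JSON document."""
    return int(json.loads(payload)["pull_count"])


def fetch_pull_count(url=DOCKER_HUB_URL, timeout=10.0):
    """Fetch the repository document and return its total pull count."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return extract_pull_count(response.read().decode("utf-8"))


def pulls_between(earlier_total, later_total):
    """Pulls in a period, given the running totals at its start and end."""
    return later_total - earlier_total
```

Calling `fetch_pull_count()` once a day and storing the result would let `pulls_between` give pulls per period rather than just the all-time total.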

            I have created SOLR-15275 for getting rid of GA again. Can we perhaps set April 1st as a date for that? Then we have about 1 month stats in GA to continue analyzing.


            arafalov Alexandre Rafalovitch added a comment -

            If we are not going to wire up the Reference Guide, then the marginal value of GA over the INFRA numbers is fairly small from what I can see. In fact, INFRA does show the Ref Guide overall numbers. GA may have given us the traffic flow between the pages, but that's ok.

            So, I am +1 on removing GA on all properties on April 1 or even ASAP. That's what we promised to do on the dev list and there was nothing revolutionary in the GA stats to warrant course changes.

            Do we need another Jira for Lucene as well? I guess that's running a dead tracker, so definitely no need to wait.
            janhoy Jan Høydahl added a comment -

            I'm linking in the related lucene issue LUCENE-9858


            arafalov Alexandre Rafalovitch added a comment -

            Looking at INFRA's analytics, it seems to be daily-generated stats for the last 31 days. Are historical analytics available as well? Is there any way to compare with last year's numbers? The link does not seem to connect to any larger system; it is just a dead-end file drop.
            janhoy Jan Høydahl added a comment -

            Closing this as Infra now has real download stats from CDN


            People

              arafalov Alexandre Rafalovitch
              janhoy Jan Høydahl