Droids
  1. Droids
  2. DROIDS-48

Support prioritizing in the TaskQueue

    Details

    • Type: New Feature New Feature
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: 0.1.0
    • Fix Version/s: None
    • Component/s: core
    • Labels:
      None

      Description

      Use case:

      • when looping a directory, (imagine someone is too stupid and dunno the dmoz database can be downloaded and try to crawl it with Droids) we got collect a lot of links that will be handled later. assume the requirement is to fetch dmoz directory +1 link outside dmoz.org, In the original mechanism, it will keep adding new links to the TaskQueue. Ideally, there should be a mechanism to give a higher priority to the non-dmoz.org links, so when non-dmoz links are added, they are processed first, and be removed from the TaskQueue asap.

      with the patch in DROIDS-47, a constructor is added to the SimpleTaskQueue to support a custom Queue. This issue suggests to change the SimpleTaskQueue to use a PriorityBlockingQueue by default, and add a getWeight to the Task interface

      I'm also thinking about a more complex TaskQueue. to be discussed in the mail list later.

      1. DROIDS-48d2.patch
        3 kB
        Mingfai Ma
      2. DROIDS-48d.patch
        9 kB
        Mingfai Ma

        Activity

        Hide
        Mingfai Ma added a comment - - edited

        just come up with a even better design for weight.

        Weighted interface

        public interface Weighted {
            public int getWeight();
        }
        

        The Link/LinkTask, assume extends HashMap

        public class WeightedLink extends Link implements Weighted { //or LinkTask
            public int getWeight() {
                return Integer.parseInt(String.valueOf(this.get("weight")));
            }
        }
        

        WeightComparator :

        public class WeightComparator implements Comparator {
            public int compare(Object link1, Object link2) {
                int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0;
                int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0;
                return weight2 - weight1;
            }
        }
        

        Task Queue

         Queue queue = new PriorityBlockingQueue(10, new WeightComparator())
        

        so, weighted becomes optional. if user want to support weight, then, they implement Weighted and let the user decide how to weight.

        p.s. I'm designing a filter framework that work at a broader sense than URL filter. The Weighted interface is actually designed to cater the ordering of Filter as well.

        Show
        Mingfai Ma added a comment - - edited just come up with a even better design for weight. Weighted interface public interface Weighted { public int getWeight(); } The Link/LinkTask, assume extends HashMap public class WeightedLink extends Link implements Weighted { //or LinkTask public int getWeight() { return Integer .parseInt( String .valueOf( this .get( "weight" ))); } } WeightComparator : public class WeightComparator implements Comparator { public int compare( Object link1, Object link2) { int weight1 = link1 instanceof Weighted ? ((Weighted) link1).getWeight() : 0; int weight2 = link2 instanceof Weighted ? ((Weighted) link2).getWeight() : 0; return weight2 - weight1; } } Task Queue Queue queue = new PriorityBlockingQueue(10, new WeightComparator()) so, weighted becomes optional. if user want to support weight, then, they implement Weighted and let the user decide how to weight. p.s. I'm designing a filter framework that work at a broader sense than URL filter. The Weighted interface is actually designed to cater the ordering of Filter as well.
        Hide
        Thorsten Scherler added a comment -

        Hmm, yeah sounds more flexible for the future and I see the point to store related infos. I like it.

        Show
        Thorsten Scherler added a comment - Hmm, yeah sounds more flexible for the future and I see the point to store related infos. I like it.
        Hide
        Mingfai Ma added a comment -

        let me submit another patch. i have a habit to use the formatter of my IDE but I haven't set it to use the coding style of this project, so. ... :-P

        p.s. for this issue, it could be handled just by adding a weight integer field. but i feel it is most flexible if the LinkTask could whole any arbitrary data. And the simplest way is to make it extends Map.

        public class LinkTask extends HashMap<String, Serializable> { //other interface are skipped;
            protected final String id; //whatever data type for ID
            protected final URI uri; //refer to DROIDS-52, this may cause problem for URI)
        
           // all the other data are optional
        

        use cases:

        • say, in submitting a link, we want to associate information about cookie/http header, so the fetcher could use the cookie info when fetching
        • any optional fields like weight could be used
        • any component, such as filter or parser or whatever, could mark arbitrary tag for a link. say, a parser/factory, may read a "parser"/"contentType" value to decide how the data could be parsed. (so the parser doesn't depends on HttpEntity in interface) or the outlink could be attached directly to a LinkTask.

        i throw the initial idea here to see if anyone has comment. more details on the implementation could be provided.

        Show
        Mingfai Ma added a comment - let me submit another patch. i have a habit to use the formatter of my IDE but I haven't set it to use the coding style of this project, so. ... :-P p.s. for this issue, it could be handled just by adding a weight integer field. but i feel it is most flexible if the LinkTask could whole any arbitrary data. And the simplest way is to make it extends Map. public class LinkTask extends HashMap< String , Serializable> { //other interface are skipped; protected final String id; //whatever data type for ID protected final URI uri; //refer to DROIDS-52, this may cause problem for URI) // all the other data are optional use cases: say, in submitting a link, we want to associate information about cookie/http header, so the fetcher could use the cookie info when fetching any optional fields like weight could be used any component, such as filter or parser or whatever, could mark arbitrary tag for a link. say, a parser/factory, may read a "parser"/"contentType" value to decide how the data could be parsed. (so the parser doesn't depends on HttpEntity in interface) or the outlink could be attached directly to a LinkTask. i throw the initial idea here to see if anyone has comment. more details on the implementation could be provided.
        Hide
        Thorsten Scherler added a comment -

        I am fine with the feature but I have problems with the diff. It adds formating changes to the code which makes it hard to identify the real changes.

        Show
        Thorsten Scherler added a comment - I am fine with the feature but I have problems with the diff. It adds formating changes to the code which makes it hard to identify the real changes.
        Hide
        Mingfai Ma added a comment -

        the prev patches breaks a testcase and it is fixed. This patch patches only one file and should be used on top of DROIDS-48d.patch

        Show
        Mingfai Ma added a comment - the prev patches breaks a testcase and it is fixed. This patch patches only one file and should be used on top of DROIDS-48 d.patch
        Hide
        Mingfai Ma added a comment -

        merged code to allow applying to the current snapshot. no functional change.

        Show
        Mingfai Ma added a comment - merged code to allow applying to the current snapshot. no functional change.
        Hide
        Mingfai Ma added a comment -

        any comment to this feature?

        could a weight field be added to Task? or could Task be enhanced to support a map of custom data? without adding weight to the Task interface, this feature cannot be implemented.

        for the Queue, there could be diff options:
        1. include in SimpleTaskQueue as provided in this patch, or
        2. make a separated TaskQueue implementation, e.g. PrioritizedTaskQueue, or
        3. do not include in the distribution (maybe provide in any example)

        re. between 1 and 2, the so-called prioritization is not too complex, so I think it is ok to include SimpleTaskQueue rather than separate to another queue, if it is to be included in the dist at all.

        Show
        Mingfai Ma added a comment - any comment to this feature? could a weight field be added to Task? or could Task be enhanced to support a map of custom data? without adding weight to the Task interface, this feature cannot be implemented. for the Queue, there could be diff options: 1. include in SimpleTaskQueue as provided in this patch, or 2. make a separated TaskQueue implementation, e.g. PrioritizedTaskQueue, or 3. do not include in the distribution (maybe provide in any example) re. between 1 and 2, the so-called prioritization is not too complex, so I think it is ok to include SimpleTaskQueue rather than separate to another queue, if it is to be included in the dist at all.
        Hide
        Mingfai Ma added a comment -

        the previous patch reversed the order. the higher the weight, the sooner a task should be processed

        Show
        Mingfai Ma added a comment - the previous patch reversed the order. the higher the weight, the sooner a task should be processed
        Hide
        Mingfai Ma added a comment -

        the previous attachment missed some files

        Show
        Mingfai Ma added a comment - the previous attachment missed some files
        Hide
        Mingfai Ma added a comment -

        the patches changes quite a number of files, but it's all about

        • added int getWeight() to Task
          remarks: LinkTask consumes 72 bytes per instance in a sample test. If the servers do not handle links fast enough, LinkTask will be kept adding to the memory. Just a quick calculation (maybe wrong), 1.5G memory could hold 20M LinkTask. It is preferable to minimize the field in a LinkTask, and use the shortest field. (int instead of long)
        • changed the SimpleTaskQueue from using ConcurrentLinkedQueue to PriorityBlockingQueue by default. Notice that there is a constructor for the user to provide a Queue, so it's not necessary to provide more configuration options such as providing a comparator. (there is no harm to do so, however)
        • notice that the method for FileTask is not implemented. not sure if a FileTask need a weight.

        How it works:

        • when a task is added to the queue, it checks the weight to decide if a task should be position at the top.
        • if two tasks has the same weight, the older one go first.
        Show
        Mingfai Ma added a comment - the patches changes quite a number of files, but it's all about added int getWeight() to Task remarks: LinkTask consumes 72 bytes per instance in a sample test. If the servers do not handle links fast enough, LinkTask will be kept adding to the memory. Just a quick calculation (maybe wrong), 1.5G memory could hold 20M LinkTask. It is preferable to minimize the field in a LinkTask, and use the shortest field. (int instead of long) changed the SimpleTaskQueue from using ConcurrentLinkedQueue to PriorityBlockingQueue by default. Notice that there is a constructor for the user to provide a Queue, so it's not necessary to provide more configuration options such as providing a comparator. (there is no harm to do so, however) notice that the method for FileTask is not implemented. not sure if a FileTask need a weight. How it works: when a task is added to the queue, it checks the weight to decide if a task should be position at the top. if two tasks has the same weight, the older one go first.

          People

          • Assignee:
            Unassigned
            Reporter:
            Mingfai Ma
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:

              Development