Details

    • Type: Task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      http://twissandra.com/ is a demo Cassandra application built on django + pycassa. It's a great Cassandra showcase and very useful for people learning Cassandra. We could use more of those.

      Jake Luciani suggested one that presents full-text search of Wikipedia using Lucandra (see http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/ and http://github.com/tjake/Lucandra). Feel free to propose other application ideas here.

      Rackspace is willing to provide a VM to deploy on for a live demo, but remember that to be really useful this needs full DIY instructions: the final product is not the demo but the code + instructions.

        Activity

        Janith Bandara added a comment - - edited

        I'm an eligible student for GSoC 2010. I'm interested in Cassandra and this issue, and I'd like to implement this idea.

        Ben Standefer added a comment -

        I'm interested in mentoring for this position. I think Ruby or Python would be best for an example app, and I think that having a well-documented Lucandra implementation (Wikipedia search) could really help a broad range of people out. I've developed a Cassandra-based recommendation engine in Python for Eventbrite (launches soon), and I'm about to start with the SimpleGeo crew, which is all Cassandra-based.

        Ben Standefer added a comment - - edited

        A good idea brought up by Edward Capriolo (not a student, so students feel free to run with this one) is a Splunk knock-off. Splunk is software that indexes logs (syslog, Apache logs, app logs, whatever) in lots of different ways and makes your logs highly searchable and filter-able via a front-end web interface. http://www.splunk.com/product.

        While the Splunk product is powerful and awesome, the licensing is not (they license by usage instead of per-seat).

        I think a Splunk knock-off would be a good demo app for people just getting into Cassandra because parsing logs is an easy concept to understand and it could start off very simple. There is a lot of opportunity to use many features of the Cassandra API (range queries, search indexes, property-specific indexes). This could be made highly scalable by using Scribe ( http://github.com/facebook/scribe ), a scalable logging solution that many Cassandra users are already using to store their logs. It's like rsyslogd on crack.

        Think Facebook's personalized search indexes for each of their 400M users, but applied to log data and properties.
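
A minimal sketch of the property-specific index plus time-range query idea, in plain Python. It does not use Cassandra or any client library: the `LogIndex` class, its method names, and the sample log lines are all hypothetical, and the sorted in-memory lists only stand in for the kind of ordered, per-property rows a real data model would use.

```python
import bisect
from collections import defaultdict


class LogIndex:
    """Toy in-memory log store with one time-ordered index per property value."""

    def __init__(self):
        self._lines = {}                        # entry id -> raw log line
        self._by_property = defaultdict(list)   # (field, value) -> sorted [(timestamp, entry id)]
        self._next_id = 0

    def add(self, timestamp, line, **properties):
        """Store a log line and index it under every given property."""
        entry_id = self._next_id
        self._next_id += 1
        self._lines[entry_id] = line
        for field, value in properties.items():
            bisect.insort(self._by_property[(field, value)], (timestamp, entry_id))
        return entry_id

    def search(self, field, value, start, end):
        """Return lines where field == value and start <= timestamp <= end, oldest first."""
        postings = self._by_property.get((field, value), [])
        lo = bisect.bisect_left(postings, (start, -1))
        hi = bisect.bisect_right(postings, (end, float("inf")))
        return [self._lines[entry_id] for _, entry_id in postings[lo:hi]]


index = LogIndex()
index.add(100, "GET /index.html 200", host="web1", status="200")
index.add(105, "GET /missing 404", host="web1", status="404")
index.add(120, "GET /gone 404", host="web2", status="404")
print(index.search("status", "404", 100, 110))  # prints ['GET /missing 404']
```

Each (field, value) pair gets its own time-sorted list, so a query touches only the slice for one property and one time window instead of scanning every log line.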

        priyanka sharma added a comment -

        Ben

        Splunk seems to be a good idea. If it is too big, we could think of something like a network logging application.
        For example, in a network security system, servers usually maintain a large log that has to be searched and sorted by IP, port, or maybe application and URL. The application would show a chart or table based on different parameters and search all of this in a large data store.
        Snort is one example.
        So big servers usually have to record all of their traffic and store it, and then show a table or chart in response to a query.
        What do you say?

        Thanks
        Priyanka Sharma
        (pix1)

        Olga Zdanchuk added a comment - - edited

        I am an eligible student.
        I have an idea to develop a time planning application. I want to develop a tool which helps people distribute tasks and plan their own time.

        Janith Bandara added a comment -

        I agree with Priyanka. If we are going to develop an application like Splunk, it will definitely become a huge project, but Ben's idea is a pretty good one.
        But this application should demonstrate Cassandra's features and make them easy to understand. I think a Lucandra (Lucene) based system would be a better one.

        Ben Standefer added a comment -

        I think these are all good ideas. I would focus on building something simple that showcases Cassandra's strengths. I have found that Cassandra is powerful when storing and querying large, dense, interconnected datasets. Think about what data is out there that could be analyzed at a deeper, more domain-specific level (domain being per user, per host, per term, per property of an object).

        Facebook's inbox search is a search index per user. Twitter timelines are extremely customized and personalized based on the relationships between users. Digg first used Cassandra for the personalized "green badges" feature by creating a per-user index that stored each story you dugg and which of your friends also dugg each of those stories. Cassandra is good at storing large amounts of data that is a few layers deep.

        I think the demo app should not necessarily be feature-rich, but it should show us a slice of data we've never seen before. Maybe it could perform a locale-sensitive search or indexing task. One of the main strengths of Cassandra is that it allows you to go from an extremely broad, large set of data to a very narrow, important set of data very quickly. Try to think of large sources of data. Chances are you might have to build some sort of crawler/indexer to get this data into Cassandra.

        A dumb example could be something like this: Let's crawl the pages of 10,000 websites and build an index of page properties (number of images on the page, title of the page, outbound links on the page). Then we could build an interface that lets users quickly drill down through all that data to find the pages that have no images, "dogs" in the title, and at least 5 outbound links. Now, this is not really a useful application in itself, but if you understand that Cassandra lets you drill down from a LOT of data to a very specific, local set of data very quickly, you should be able to think of something truly simple yet useful.
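
The drill-down in that example can be sketched as an inverted index over page properties, again in plain Python rather than against Cassandra. The `PageIndex` class, the facet names (`has_images`, `title_word`, `many_links`), the hard-coded threshold of 5 links, and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def facets(title, image_count, outbound_links):
    """Derive the (property, value) pairs a page is indexed under."""
    pairs = {("has_images", image_count > 0), ("many_links", outbound_links >= 5)}
    pairs.update(("title_word", word) for word in title.lower().split())
    return pairs


class PageIndex:
    """Toy inverted index: each facet maps to the set of URLs that have it."""

    def __init__(self):
        self._urls_by_facet = defaultdict(set)

    def add(self, url, title, image_count, outbound_links):
        for facet in facets(title, image_count, outbound_links):
            self._urls_by_facet[facet].add(url)

    def drill_down(self, *wanted):
        """Return the URLs that match every requested facet."""
        sets = [self._urls_by_facet.get(facet, set()) for facet in wanted]
        return set.intersection(*sets) if sets else set()


pages = PageIndex()
pages.add("http://a.example/", "Dogs and cats", image_count=0, outbound_links=7)
pages.add("http://b.example/", "Dogs in pictures", image_count=12, outbound_links=9)
pages.add("http://c.example/", "Training dogs", image_count=0, outbound_links=2)

matches = pages.drill_down(
    ("has_images", False), ("title_word", "dogs"), ("many_links", True)
)
print(matches)  # prints {'http://a.example/'}
```

The work of classifying each page happens once at crawl time, so the query itself is only a set intersection over precomputed facets.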

        Another example could be to build a service that lets websites pass a user id and page id every time a user views a page on their site. You could then serve back to that site an Amazon-style "users who viewed this page also viewed this page" module. You would maintain a per-page index in Cassandra that, given a page, shows what the other most popular pages are. For a decent-sized website, this is a lot of write traffic that people are having a hard time doing anything useful with in MySQL due to the sheer size of the data (i.e. pageviews).
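
A rough sketch of that per-page index, using in-memory counters as a stand-in for whatever storage layout a real service would choose. The `AlsoViewed` class, its method names, and the user and page ids are hypothetical.

```python
from collections import Counter, defaultdict


class AlsoViewed:
    """Toy "users who viewed this page also viewed" index."""

    def __init__(self):
        self._pages_by_user = defaultdict(set)   # user id -> pages that user has viewed
        self._co_views = defaultdict(Counter)    # page id -> Counter of co-viewed page ids

    def record_view(self, user_id, page_id):
        """Register one page view and update the per-page co-view counts."""
        seen = self._pages_by_user[user_id]
        if page_id in seen:
            return  # count each (user, page) pair only once
        for other in seen:
            self._co_views[page_id][other] += 1
            self._co_views[other][page_id] += 1
        seen.add(page_id)

    def also_viewed(self, page_id, limit=3):
        """Return up to `limit` pages most often viewed by the same users."""
        counts = self._co_views.get(page_id, Counter())
        return [page for page, _ in counts.most_common(limit)]


tracker = AlsoViewed()
for user, page in [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u3", "A"), ("u3", "C")]:
    tracker.record_view(user, page)
print(tracker.also_viewed("A"))  # prints ['B', 'C']
```

Every view turns into a handful of counter increments and a lookup is a read of one page's counter, which is the write-heavy, read-one-row shape the comment describes.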

        Replacing the backend of a blog engine might be something that Cassandra can do, but it doesn't really showcase why people would ever use Cassandra and how Cassandra is good at querying specific data out of large, broad datasets. Think of relationships between objects and properties; that's where the real value can come from.

        Like I said, the demo app doesn't need to have a ton of features, but it needs to showcase the capacity for handling large volumes of data.

        Ben Standefer added a comment - - edited

        Also, the MoinMoin wiki (what the ASF and Cassandra use as their wiki engine) is another GSoC project that is a good candidate for Cassandra work. One of their top projects is to have better item metadata indexing and search. You could probably propose working on this type of project to Apache/Cassandra or MoinMoin. We know the ASF would like to have a better wiki (Cassandra's is kind of slow, as you may notice) with improved searching and filtering options and performance. The ASF runs the same MoinMoin instance for a lot of the ASF projects, so this would have a big impact, make the ASF happy, and get you some good Cassandra/Python experience.

        http://moinmo.in/GoogleSoc2010
        http://moinmo.in/GoogleSoc2010/InitialProjectIdeas

        Olga Zdanchuk added a comment -

        I am still thinking about my time planning idea. How difficult would it be to track what a person is doing while using a computer? Are there tools that save, process, and analyze information about how much time a person spends working in each application? Could Cassandra (and maybe some other programming tool) be used to analyze that information? Is this too small a task for Cassandra, or a good one?

        Ben Standefer added a comment -

        Olga,

        With added detail, I think a behavior tracking/reporting app could be a good example. There is a lot of data to be gathered from how a person (or hopefully many people!) use computers. Cassandra is very good at taking in and organizing lots of writes. In your application, you'd want to be specific in how you will collect the data (a desktop app? JavaScript tracking on a webpage?) and how you will make it query-able and useful. Good luck!

        -Ben

        Ben Standefer added a comment -

        ***IMPORTANT: Applications are due by April 9th 19:00 UTC! You apply through Google at this URL: http://socghop.appspot.com/gsoc/student/apply/google/gsoc2010 Even if you have sent your application to a member of the Cassandra community, you must submit it to Google for it to be eligible! Please read the FAQs: http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs ***

        If you would like feedback, feel free to email your application to benstandefer-at-gmail-dot-com and I'll try to find time to send you some comments and pointers.

        Olga Zdanchuk added a comment -

        After spending some time on a detailed investigation, I understood that this task will be hard for me to complete in time due to other commitments. Sorry for wasting your time.

        Emmanuel Pastor added a comment -

        Hi, I just sent my GSoC proposal for this project through the GSoC web application and by email to benstandefer-at-gmail-dot-com. Any feedback will indeed be very much appreciated. Thanks in advance.

        Jonathan Ellis added a comment -

        if there were a "screwed over by asf red tape" jira resolution, this issue would get it. (none of our gsoc mentor applications were processed by the deadline.)


          People

          • Assignee:
            Unassigned
            Reporter:
            Jonathan Ellis
          • Votes:
            0
            Watchers:
            6