Wednesday, May 31, 2017

About my Project for GSoC 2017: List-Extractor

    Okay, so today I'll be writing a brief summary of what my project is all about. As the name itself suggests...
It extracts data from Wikipedia lists. 

Now hold on.... 

    Isn't that a simple task? That's something a *noob* can do by writing a simple script that scrapes data off the Wikipedia pages. What's so special about your project, huh?

    It's slightly more subtle than that. It's not just about scraping the data and dumping it. The whole point of this project is to extract data and make it meaningful and connected, which is the very essence of the Semantic Web. The project also aims to be user-friendly, so that a person with limited computer knowledge can add more domains to the extractor.

    From the data present in the wiki lists, we form triples that follow the W3C RDF standard. Instead of using static constants or strings, we store the URIs of the resources, which connects the triples to one another. The result is a large knowledge graph that can be used to answer complex queries. The following snippet shows sample extracted triples for musical albums and their related artists. Pretty sweet, eh?

@prefix dbo: <> .
@prefix dbr: <> .
@prefix rdf: <> .
@prefix rdfs: <> .
@prefix xml: <> .
@prefix xsd: <> .

dbr:In_The_Light_of_Fires_Burning dbo:musicalArtist <> ;
    dbo:releaseYear "2016-01-01"^^xsd:gYear .

dbr:In_The_Mood dbo:musicalArtist dbr:Nicole_Moudaber ;
    dbo:releaseYear "2013-01-01"^^xsd:gYear .
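    To make the idea concrete, here is a minimal sketch in Python of how a list entry could be turned into triples like the ones above. It is not the actual list-extractor code: the helper names (uri, literal, album_triples, to_ntriples) are made up for illustration, it uses only the standard library, it skips proper URI escaping, and it writes the year as a plain "2013" rather than the full date shown in the sample. The namespaces are the usual DBpedia resource and ontology namespaces.

```python
from typing import NamedTuple

# Standard DBpedia and XML Schema namespaces.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"
XSD = "http://www.w3.org/2001/XMLSchema#"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # already serialized as an N-Triples term


def uri(namespace, local_name):
    # Hypothetical helper: spaces become underscores, as in Wikipedia titles.
    # A real extractor would also need to escape special characters.
    return f"<{namespace}{local_name.replace(' ', '_')}>"


def literal(value, datatype):
    # Hypothetical helper: a typed literal using an XSD datatype.
    return f'"{value}"^^<{XSD}{datatype}>'


def album_triples(title, artist, year):
    # The artist is stored as a resource URI, not a plain string,
    # so it links to whatever else is known about that artist.
    album = uri(DBR, title)
    return [
        Triple(album, uri(DBO, "musicalArtist"), uri(DBR, artist)),
        Triple(album, uri(DBO, "releaseYear"), literal(str(year), "gYear")),
    ]


def to_ntriples(triples):
    # One "subject predicate object ." statement per line.
    return "\n".join(f"{t.subject} {t.predicate} {t.obj} ." for t in triples)


triples = album_triples("In The Mood", "Nicole Moudaber", 2013)
print(to_ntriples(triples))
```

    Because the object of dbo:musicalArtist is a URI, any other triple about dbr:Nicole_Moudaber automatically connects to this album, and that is what turns a pile of facts into a graph.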

   We use the existing DBpedia ontology to gather the related resources. In this project, we use JSONpedia Live, another project, which was started in GSoC 2014 and is currently maintained by Michele Mostarda. This live service provides us with a valid JSON response for a given resource, which can be parsed to extract relevant information. Of course, being a Web-based service, it might go down if it receives a high volume of requests, so we need to use the JSONpedia library in our project. A small catch, though: it's written in Java. Integrating the library will be an important task in the later stages of my project.
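    As a rough illustration of the "parse the JSON response" step, here is a small Python sketch. The response shape below (sections, items, links) is entirely hypothetical and invented for this example; the real JSONpedia output may look quite different. The point is only to show the general pattern of walking a nested JSON document and collecting the linked resources found in list items.

```python
import json

# Hypothetical response shape, invented for illustration only;
# the real JSONpedia output may be structured differently.
SAMPLE_RESPONSE = """
{
  "sections": [
    {"title": "Albums",
     "items": [
       {"text": "In The Mood (2013)", "links": ["In_The_Mood"]},
       {"text": "Unreleased demo", "links": []}
     ]}
  ]
}
"""


def linked_resources(raw_json):
    # Collect (section title, linked resource) pairs from every list item.
    data = json.loads(raw_json)
    found = []
    for section in data.get("sections", []):
        for item in section.get("items", []):
            for link in item.get("links", []):
                found.append((section["title"], link))
    return found


resources = linked_resources(SAMPLE_RESPONSE)
print(resources)
```

    List items without a link are skipped here, since there is no resource URI to build a triple from.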

    So, to summarize, the main objective of my project is to add more data to the existing knowledge base: extend the existing list-extractor tool, add different resources, and thereby generate new datasets that can be added to the DBpedia datasets. Along the way, I'll integrate the JSONpedia library into the project to make the extractor independent of the live service!

Let the code begin!! 

Sunday, May 21, 2017

About DBpedia

    With Community Bonding in full swing, let me tell you something about the organization I'm contributing to, i.e. DBpedia. Since they already have a fantastic summary of the organisation on their page, I'm going to summarize (more like quote) the summary they have already provided. (Yes, I'm very lazy :P)

    DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data.

    Knowledge bases are playing an increasingly important role in enhancing the intelligence of Web and enterprise search and in supporting information integration. Today, most knowledge bases cover only specific domains, are created by relatively small groups of knowledge engineers, and are very cost intensive to keep up-to-date as domains change. At the same time, Wikipedia has grown into one of the central knowledge sources of mankind, maintained by thousands of contributors. The DBpedia project leverages this gigantic source of knowledge by extracting structured information from Wikipedia and by making this information accessible on the Web under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.

     The DBpedia knowledge base has several advantages over existing knowledge bases: it covers many domains; it represents real community agreement; it automatically evolves as Wikipedia changes, and it is truly multilingual. The DBpedia knowledge base allows you to ask quite surprising queries against Wikipedia, for instance “Give me all cities in New Jersey with more than 10,000 inhabitants” or “Give me all Italian musicians from the 18th century”. Altogether, the use cases of the DBpedia knowledge base are widespread and range from enterprise knowledge management, over Web search to revolutionizing Wikipedia search.

So, to summarize the summary of the summary,

What is DBpedia?

  • DBpedia is an open, free and comprehensive knowledge base constantly improved and extended by a large global community
  • DBpedia can be used to directly answer fact questions about a wide range of topics
  • users exploit DBpedia as background knowledge for document ranking, natural language understanding, as well as data integration methods
  • our data grows with Wikipedia and Wikidata
  • the extractors are updated frequently to build our 8.8 billion fact, large-scale-cross-domain knowledge graph
  • DBpedia has thousands of users, for example: 
    • large companies such as Wolters Kluwer
    • libraries
    • researchers
    • web developers

Why is DBpedia important?

    DBpedia provides a complementary service to Wikipedia by exposing knowledge (from 130 Wikimedia projects, in particular the English Wikipedia, Commons, Wikidata and over 100 Wikipedia language editions) in a quality-controlled form compatible with tools covering ad-hoc structured data querying, business intelligence & analytics, entity extraction, natural language processing, reasoning & inference, machine learning services, and artificial intelligence in general. Data is published strictly in line with “Linked Data” principles using open standards (e.g., URIs, HTTP, HTML, RDF, and SPARQL) and open data licensing. 

You can visit the official DBpedia website for more information about the DBpedia Organisation, community, projects and more! 

Tuesday, May 16, 2017

Getting selected for GSoC

    Okay, so as I posted last week, I've been selected for the Google Summer of Code program for the year 2017, and I'll be working with DBpedia. As an integral part of my project, I'll have to record my progress every week, so I'll be posting all my GSoC-related posts under the GSoC tag.

    Okay, so how did this journey start?

    I guess it started back in December last year, when I was at home spending time with my family during the winter vacation. I was trying to do something "actually" useful with whatever I had learned so far. So, I decided to check out a few Open Source organizations. After browsing through the many great organizations, DBpedia caught my eye. I visited the ideas page and looked at the kind of ideas they were working on and wanted to expand into. I started to get more involved with the community and interacted with them. I went to their GitHub page, checked out their projects and how they worked, and got some idea of how the entire organization works.

    Then I browsed the ideas page for the ideas I would be interested in, and started interacting with the respective project mentors. I was interested in quite a few projects, but in the end decided to focus solely on the list-extractor. When the organizations list was announced, all the ideas were updated with warm-up tasks, and I finished the warm-up tasks that were assigned by my mentor, Marco. We discussed some approaches and ideas about how the current list-extractor was working and what could be done to improve it. After some productive discussions, I decided to write a proposal for the project. I'd been following the idea for 3 months by then, and knew the existing code inside out. So, as discussed with my mentor, I came up with my proposal, explaining all the new features and domains I planned to add and how they would benefit the community. Since I was writing a proposal for the first time, I didn't come up with the best of proposals, and my deliverables and timeline were a bit vague. But with proper feedback from my mentor, I was able to improve it, and I finally submitted my proposal.

    Then came the agonizing one-month wait for the result. I was pretty confident of my chances, because I had a solid proposal and had interacted well with my mentor, but you can never be sure. So, I tried not to think about it and got back to my university work, which I had conveniently ignored during this time. As a result, my university workload for the rest of the semester was immense. So much for taking it easy, eh?

    So finally, the day was here. 4th May. The results were going to be out at 21:30 IST. It was afternoon and I was a tad nervous. Also, I had slept for something like 9 hours in the past 4 days due to exams and project presentations. I was sleepy, so I decided to take a small nap. And as with all naps, I woke up many hours later, groggy and not knowing what time it was or where I was. It took me a few minutes to come back to reality, at which point I realized it was 22:06 and the result was out. And now, along with the grogginess, I was feeling nauseated too. Somehow, I gathered all my strength and opened the GSoC website. And then suddenly, all of my sickness turned into pure joy, as I realized I was selected. I smiled and rechecked the result just to make sure I wasn't still sleeping. Once I realized it was indeed real, I called my parents and told them about it. They were very happy and the conversation lasted for a while.

    After that, I sat outside with a slight breeze blowing across my face, thinking, what happens now? Sipping on the milkshake I had just bought, I thought, "Community bonding starts tomorrow... Tonight might be the final free night I have. I must celebrate and do something that makes me happy..."

So, I finished my milkshake, and went back to my room, and slept for 11 hours straight.

That day was a good day!

Thursday, May 11, 2017

Google Summer of Code!

    Wow, it has been more than a year (see: eternity) since I last wrote a blog post. I guess this was one of those things that just fade away with time. To be honest, it was more about the lack of time than anything else. University life has been busier than I expected. I completed the 5th semester of my undergraduate studies by the end of 2016, and I was yet to do anything that would "validate" my existence as a "Computer Engineer". I had to do something meaningful with whatever I've learned in the past .... umm.. many years? So I started looking for Open Source organisations whose work and ideas were aligned with (one of) my interests, and that's when I discovered DBpedia. I contacted the contributors, understood how DBpedia functions and got my hands dirty with some of the source code. I actually wanted to contribute to the organisation, and since DBpedia was (is) a regular GSoC organisation, I decided to apply for it.

    Thankfully, I was accepted into Google Summer of Code (GSoC) 2017 with DBpedia (Yup, one of my proudest moments :P). So, I'll be spending my summer working on the "Wikipedia List Extractor" project. I have to document the progress of my project regularly, so this gives me the opportunity to rekindle my blog.

    I'll keep updating my experiences from my GSoC crusade, and maybe some other random stuff. Hopefully, I'll stay put for a longer period this time :P