Comments (3)

tonofshell commented on August 30, 2024

Progress update 1:

  • Created a storage bucket
  • I believe I can allow access to it with the emails associated with your Google Cloud accounts
  • Some of the smaller XML files have already finished converting; the larger ones might have to be done on the cloud
    • I believe Nikki is adapting my code to work on MapReduce for this reason
  • Tomorrow morning before the workshop I will upload any finished CSVs and all the XML files; this should (hopefully) take no more than a few hours

from big-data-project.

sanittawan commented on August 30, 2024

@tonofshell Not sure if you've seen my comment in another issue (I posted it last Saturday), so I'm posting it here again.

  • Can you please tell me which line in your code causes the script to scan the whole file?

  • I have a question about these lines (in startElement()).

if self.row == 1:
  self.out.write(str(attributes.keys())[1:-1] + "\n")
if len(attributes) > 0:
  self.out.write(str(attributes.values())[1:-1] + "\n")

It seems to me that attributes is a dictionary, and order in a Python dictionary is not guaranteed. How do you know that the keys and values of each row of data will come out in exactly the same order? (If it happens to yield the right thing, it's luck.) For example, if the attributes of row 0 are {id: 0, name: "a", link: "url0"}, how can you be sure that row 1's attributes dict would not be something like {name: "b", id: 1, link: "url1"}? In that case, the resulting CSV would be:

id, name, link
0, "a", "url0"
"b", 1, "url1"

If you agree with me that this could potentially be a problem, we should control the order by keeping a list of attributes. (That is why I hard-coded the column names, although I do plan to make the code more generic.)
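One way the "keep a list of attributes" idea could look is sketched below. This is only an illustration, not the code from the repository: the class name RowToCsvHandler, the helper convert, and the assumption that each record is an element named "row" are all hypothetical. The column list is either passed in or taken from the first row seen, and every later row is written by key, so a different attribute order in the XML cannot shift values into the wrong column.

```python
import csv
import io
import xml.sax


class RowToCsvHandler(xml.sax.ContentHandler):
    """Write one CSV line per record element, with columns in a fixed order.

    Hypothetical sketch: assumes each record is an element named "row"
    whose data lives entirely in its attributes.
    """

    def __init__(self, out, columns=None):
        super().__init__()
        self.out = out
        self.columns = columns  # None means: use the first row's keys
        self.writer = None

    def startElement(self, name, attributes):
        if name != "row" or len(attributes) == 0:
            return
        if self.writer is None:
            if self.columns is None:
                self.columns = list(attributes.keys())
            # DictWriter looks values up by column name, not by position.
            self.writer = csv.DictWriter(
                self.out,
                fieldnames=self.columns,
                restval="",             # missing attributes become empty cells
                extrasaction="ignore",  # attributes not in the list are skipped
                lineterminator="\n",
            )
            self.writer.writeheader()
        self.writer.writerow(dict(attributes.items()))


def convert(xml_text, columns=None):
    """Parse an XML string and return the CSV text (helper for illustration)."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), RowToCsvHandler(out, columns))
    return out.getvalue()
```

Using csv.DictWriter also takes care of quoting values that contain commas or quotes, which slicing the string form of a list does not. Whether to silently ignore unexpected attributes or raise an error is a design choice; the sketch ignores them.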

MapReduce might not be a good idea for this task. I am going to write an MPI script that does the conversion tomorrow and will make changes to the code according to the potential problem that I pointed out above.


liu431 commented on August 30, 2024

I've uploaded a notebook on applying the VADER package to extract sentiment from the data. It is easy to use and parallelizable.
