
Week 7 plan - Division of Labor about big-data-project HOT 3 OPEN

sanittawan commented on August 30, 2024

Week 7 plan - Division of Labor

from big-data-project.

Comments (3)

tonofshell commented on August 30, 2024

Progress update 1:

Created a storage bucket
I believe I can allow access to it with the emails associated with your Google Cloud accounts
Some of the smaller XML files have already finished converting; the larger ones might have to be done on the cloud
- I believe Nikki is adapting my code to work on MapReduce for this reason
Tomorrow morning before the workshop I will upload any finished CSVs and all the XML files. This should (hopefully) take no more than a few hours.


sanittawan commented on August 30, 2024

@tonofshell Not sure if you've seen my comment in another issue (I posted it last Saturday), so I'm going to post it here again.

Can you please tell me which line in your code causes the script to scan the whole file?
I have a question about these lines (in startElement()).

if self.row == 1:
  self.out.write(str(attributes.keys())[1:-1] + "\n")
if len(attributes) > 0:
  self.out.write(str(attributes.values())[1:-1] + "\n")

It seems to me that attributes is a dictionary, and the order of a Python dictionary is not guaranteed. How do you know that the keys and values of each row of data will come out in exactly the same order? (If it happens to yield the right thing, it's luck.) For example, if the attributes of row 0 are {id: 0, name: "a", link: "url0"}, how can you be sure that row 1's attributes dict would not be something like {name: "b", id: 1, link: "url1"}? In that case, the resulting CSV would be:

id, name, link
0, "a", "url0"
"b", 1, "url1"

If you agree with me that this could potentially be a problem, we should control the order by keeping a list of attributes. (That was the reason why I hard-coded the column names, though I did plan to make the code more generic.)
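
To illustrate the suggestion above, here is a minimal sketch of a SAX handler that fixes the column order once and writes every row against that list. It is not the project's actual code: the class name `RowHandler`, the `columns` parameter, and the sample XML are assumptions made for illustration. Note that the mismatch can also arise simply because two elements list their attributes in a different order in the XML itself.

```python
import csv
import io
import xml.sax


class RowHandler(xml.sax.ContentHandler):
    """Write one CSV row per element that has attributes, in a fixed column order.

    Hypothetical helper for illustration; names are not from the project.
    """

    def __init__(self, out, columns=None):
        super().__init__()
        self.out = out
        self.columns = columns  # None: take the columns from the first row seen
        self.writer = None

    def startElement(self, name, attributes):
        if len(attributes) == 0:
            return  # skip elements without attributes (e.g. the root element)
        if self.writer is None:
            if self.columns is None:
                self.columns = list(attributes.keys())
            # DictWriter looks up each value by column name, so the order in
            # which a given element lists its attributes no longer matters.
            self.writer = csv.DictWriter(
                self.out,
                fieldnames=self.columns,
                restval="",             # missing attribute -> empty cell
                extrasaction="ignore",  # unexpected attribute -> dropped
                lineterminator="\n",
            )
            self.writer.writeheader()
        self.writer.writerow(dict(attributes.items()))


# Sample input (made up): the second row lists its attributes in another order.
sample = b'<rows><row id="0" name="a" link="url0"/><row name="b" id="1" link="url1"/></rows>'
out = io.StringIO()
xml.sax.parseString(sample, RowHandler(out, columns=["id", "name", "link"]))
print(out.getvalue(), end="")
# id,name,link
# 0,a,url0
# 1,b,url1
```

Using the `csv` module instead of slicing `str(...)` also takes care of quoting values that contain commas or quotes.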

MapReduce might not be a good idea for this task. I am going to write an MPI script that does the conversion tomorrow and will make changes to the code according to the potential problem that I pointed out above.


liu431 commented on August 30, 2024

I've uploaded a notebook on applying the VADER package to extract sentiment from the data. It is easy to use and parallelizable.

